Deep learning hardware architecture and software deploy with docker

Deep Learning Hardware Setup and Docker Deployment
Cloud development environment
Allen.YL Lee
https://docs.google.com/presentation/d/1BOKvaSVk-l7E246ciTXJDNdtc8RYofM3MK_77kvA2-Q/edit?usp=sharing

About me
● Experience:
Jan 2018 - Apr 2018: Student @ Taiwan AI Academy
Feb 2017 - Oct 2017: SSD firmware Engineer @ Liteon
Feb 2014 - Feb 2017: BIOS firmware Engineer @ Pegatron
Mar 2013 - Jan 2014: Computer Game Software Engineer @ IGS
● Education:
M.S. in Physics, Nano Magnetic Lab, NTNU
B.S. in Physics with minor in Computer Science, NTNU
Linkedin: https://www.linkedin.com/in/allenyllee/
Github: https://github.com/allenyllee

Outline
● A Brief Introduction to Computer Architecture
○ CPU vs. GPU
● GPGPU for Deep Learning
○ OpenCL and CUDA
○ A simple program: vector addition
○ The Bottleneck: Storage
● Other ICs for Deep Learning
○ SoC FPGA
○ Google TPU
○ ARM & Nvidia
● Go cloud or build a self-hosted GPU server
○ Lists of some GPU cloud providers
○ Building your first Deep Learning Box
● What is Docker
○ Why do I need Docker for Deep Learning
○ Demo: Setup your environment with Docker
● What is Mesos
○ Demo: TensorFlow on Mesos
○ Comparing Docker, Mesos, and Kubernetes
○ About HyperPilot
● What is Continuous Integration (CI)
○ CI for your Machine Learning project
○ Demo: Setup your Jenkins CI Tool
● What is Blockchain
○ Blockchain and AI
● What is Quantum Computing
○ Quantum Computing for Machine Learning
● Recap

A Brief Introduction to Computer Architecture
● The bandwidth of PCIe 3.x with 16-lane
(x16) is about 15.75 GB/s
● Bandwidth needed for 1080p video at 30 fps:
○ CPU decode (raw frames cross the bus): 1920 x 1080 x 32 bits x 30 fps, roughly 240 MB/s
○ GPU decode (only the compressed stream crosses the bus): 8~12 Mbps = 1~1.5 MB/s
● Depending on your system's PCIe topology and the structure of your neural network, scaling begins to fall off as you add more GPUs.
● With "weak scaling", where the batch size grows with the number of GPUs, expect a speedup of about 3.5x on 4 GPUs.
PCI Express - Wikiwand https://www.wikiwand.com/en/PCI_Express
caffe-yolo/multigpu.md at master · yeahkun/caffe-yolo
https://github.com/yeahkun/caffe-yolo/blob/master/docs/multigpu.md
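The figures on this slide can be checked with a few lines of arithmetic. This is a minimal sketch: the 8 GT/s lane rate and 128b/130b encoding are the published PCIe 3.0 parameters rather than numbers from the slide, and the function names are made up for illustration.

```python
# PCIe 3.0 parameters (assumption: taken from the PCIe 3.0 spec, not the slide)
GT_PER_S = 8e9          # transfers per second per lane
ENCODING = 128 / 130    # 128b/130b line encoding overhead


def pcie3_bandwidth_bytes(lanes=16):
    """Usable one-direction bandwidth in bytes per second."""
    return GT_PER_S * ENCODING * lanes / 8


def raw_video_bytes_per_s(width, height, bits_per_pixel, fps):
    """Bandwidth needed to move uncompressed frames across the bus."""
    return width * height * bits_per_pixel / 8 * fps


def scaling_efficiency(speedup, num_gpus):
    """Fraction of ideal linear scaling actually achieved."""
    return speedup / num_gpus


pcie = pcie3_bandwidth_bytes()
raw = raw_video_bytes_per_s(1920, 1080, 32, 30)
print(f"PCIe 3.0 x16: {pcie / 1e9:.2f} GB/s")
print(f"Raw 1080p30:  {raw / 1e6:.0f} MB/s")
print(f"4-GPU efficiency at 3.5x: {scaling_efficiency(3.5, 4):.1%}")
```

Raw 1080p30 frames come out near 249 MB/s in decimal units (about 237 MiB/s), which brackets the slide's rounded 240 MB/s and is a small fraction of what an x16 link can carry.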

CPU vs. GPU (1)
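● CPU
○ Sequential computation
○ CISC (complex instruction set computing)
○ More control units and cache
○ Suited to complex logic
● GPU
○ Parallel computation
○ RISC (reduced instruction set computing)
○ More ALUs (arithmetic logic units)
○ Suited to large volumes of simple operations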
How a GPU Works
https://www.cs.cmu.edu/afs/cs/academic/class/15462-f11/www/lec_slides/lec19.pdf

CPU vs. GPU (2)
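● CPU
○ Fast logic, integer, and double-precision (64-bit) operations
○ Fast scalar operations
○ One thread per core
○ Better random access
○ MIMD (Multiple Instruction / Multiple Data)
● GPU
○ High-speed floating-point operations (32-bit)
○ High-speed vector operations (several components computed at once)
○ Manages tens of thousands of threads at once
○ Higher memory bandwidth
○ SIMD (Single Instruction / Multiple Data)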
How a GPU Works
https://www.cs.cmu.edu/afs/cs/academic/class/15462-f11/www/lec_slides/lec19.pdf

CPU vs. GPU (3)
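● CPUs are optimized for latency; GPUs are optimized for bandwidth.
● Picture the CPU as a Ferrari and the GPU as a big truck, and the job as hauling cargo from some point A to some point B. The CPU (Ferrari) is fast but carries only a little at a time, so it has to make many round trips; the GPU (truck) is slower but carries a large load and needs only one trip.
● The GPU (truck) is slow to unload (high latency), however, so it uses its many threads (unloading in batches) to hide the waiting time between unloads. Conversely, this is also why multithreading (unloading in batches) gains the CPU (Ferrari) comparatively little.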
Tim Dettmers's answer to Why are GPUs well-suited to deep learning? - Quora
https://www.quora.com/Why-are-GPUs-well-suited-to-deep-learning/answer/Tim-Dettmers-1

CPU vs. GPU (4)
● Another GPU advantage: every stream processor unit has its own L1 cache, located very close by (the closer, the faster).
● In addition, NVIDIA has optimized register allocation in its own compiler, so compiled GPU code makes more effective use of registers and cache.
● Conclusion: why are GPUs so well suited to deep learning?
○ (1) High-bandwidth main memory,
○ (2) hiding memory access latency under thread parallelism, and
○ (3) large and fast register and L1 memory which is easily programmable
Tim Dettmers's answer to Why are GPUs well-suited to deep learning? - Quora
https://www.quora.com/Why-are-GPUs-well-suited-to-deep-learning/answer/Tim-Dettmers-1

GPGPU for Deep Learning
● General-Purpose computing on GPU
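● Uses the graphics processor, built for graphics tasks, to run general-purpose computations that would otherwise be handled by the CPU.
● GPGPU concept
○ Arrays = textures
○ Kernels = shaders
○ Computing = drawing (rendering)
○ Feedback = new textures (the textures produced by the computation)
● The traditional way to develop GPGPU programs was to go through an existing graphics library such as OpenGL or Direct3D, writing in a shading language and coaxing the shaders into performing the computation you wanted.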
GPGPU Tutorials https://www.seas.upenn.edu/~cis565/fbo.htm

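OpenCL and CUDA
● OpenCL stands for Open Computing Language, an open standard API for general-purpose parallel computing initiated by Apple; it targets multi-core CPUs, GPUs, DSPs, and other parallel processors.
● CUDA is a parallel computing framework dedicated to Nvidia GPUs; it includes the hardware ISA and a C compiler.
● OpenCL is a low-level hardware API and is harder to develop with; CUDA uses high-level C and is strongly backed by its vendor, so it is easier to develop with.
OpenCL/OpenGL/DirectX/CUDA https://www.zybuluo.com/johntian/note/673607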

A simple program: vector addition
C:
void vectorAdd(const float *a, const float *b, float *c) {
    /* MAX is the vector length, assumed to be defined elsewhere */
    for (int i = 0; i < MAX; i++) c[i] = a[i] + b[i];
}
C with OpenGL (GLSL fragment shader):
uniform sampler2D textureA;
uniform sampler2D textureB;
void main() {
    vec4 A = texture2D(textureA, gl_TexCoord[0].st);
    vec4 B = texture2D(textureB, gl_TexCoord[0].st);
    gl_FragColor = A + B;
}
C for CUDA:
__global__ void vectorAdd(const float *a, const float *b, float *c) {
    // Vector element index; assumes the launch grid exactly covers the vector
    int nIndex = blockIdx.x * blockDim.x + threadIdx.x;
    c[nIndex] = a[nIndex] + b[nIndex];
}
OpenCL:
__kernel void vectorAdd(__global const float *a, __global const float *b, __global float *c) {
    // Vector element index
    int nIndex = get_global_id(0);
    c[nIndex] = a[nIndex] + b[nIndex];
}
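To see what the CUDA index expression does, the grid can be emulated on the CPU. The sketch below is plain Python, not GPU code: the two loops stand in for blocks and threads that the hardware would run in parallel, and the bounds guard is an addition that a real kernel would also need when the vector length is not a multiple of the block size.

```python
def vector_add_grid(a, b, block_dim):
    """Emulate a 1-D CUDA launch: one (block, thread) pair per element."""
    n = len(a)
    grid_dim = (n + block_dim - 1) // block_dim  # enough blocks to cover n
    c = [0.0] * n
    for block_idx in range(grid_dim):            # blockIdx.x
        for thread_idx in range(block_dim):      # threadIdx.x
            n_index = block_idx * block_dim + thread_idx
            if n_index < n:                      # guard for the last, partly filled block
                c[n_index] = a[n_index] + b[n_index]
    return c


print(vector_add_grid([1.0, 2.0, 3.0, 4.0, 5.0], [10.0] * 5, block_dim=2))
```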

The Bottleneck: Storage (1)
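● NVIDIA offers a technology called GPUDirect that lets a GPU access other devices directly over PCIe, without going through the CPU.
● Can a GPU move data straight from an SSD into GPU memory? Yes!
● Someone on GitHub has implemented an NVMe driver for Linux; if you are interested in the low-level details, take a look.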
enfiskutensykkel/ssd-gpu-dma: Build userspace
NVMe drivers and storage applications with CUDA
support
https://github.com/enfiskutensykkel/ssd-gpu-dma
NVIDIA GPUDirect | NVIDIA Developer
https://developer.nvidia.com/gpudirect

The Bottleneck: Storage (2)
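● Suppose you need to train a model on ImageNet (about 1 TB).
● All you have is a 2 TB HDD and a 256 GB SSD.
● With the data on the HDD and random access, reading it all takes 667 hours.
● If you can afford a big SSD that holds everything, it takes only 6.7 hours.
● Sequential access is fast, but we cannot simply read sequentially, because every batch has to be randomly shuffled.
● Solution:
○ Split the image list, convert the images to JPEG, and pack them into several binary files, each of which can be read sequentially.
○ For each batch, shuffle the list at random, then read sequentially from the binary files.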
Training Deep Net on 14 Million Images by Using A Single Machine
http://dmlc.ml/mxnet/2015/10/27/training-deep-net-on-14-million-images.html
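The packing idea can be sketched in a few lines. This is an illustrative simplification, not the MXNet record format: images are shuffled once when packed into length-prefixed shards, and each epoch visits the shards in a fresh random order while reading every shard sequentially. In-memory byte strings stand in for the binary files, and all function names are hypothetical.

```python
import io
import random
import struct


def pack_records(records):
    """Concatenate byte records into one blob, each prefixed by a 4-byte length."""
    buf = io.BytesIO()
    for rec in records:
        buf.write(struct.pack("<I", len(rec)))
        buf.write(rec)
    return buf.getvalue()


def read_records(blob):
    """Yield the records of one packed blob in strictly sequential order."""
    view = io.BytesIO(blob)
    while True:
        header = view.read(4)
        if not header:
            return
        (size,) = struct.unpack("<I", header)
        yield view.read(size)


def make_shards(images, num_shards, seed=0):
    """Shuffle once, then deal the images round-robin into packed shards."""
    order = list(images)
    random.Random(seed).shuffle(order)
    return [pack_records(order[i::num_shards]) for i in range(num_shards)]


def epoch(shards, seed):
    """Visit shards in random order; inside a shard the read is sequential."""
    shard_order = list(range(len(shards)))
    random.Random(seed).shuffle(shard_order)
    for idx in shard_order:
        yield from read_records(shards[idx])


images = [b"img%d" % i for i in range(10)]   # stand-ins for JPEG bytes
shards = make_shards(images, num_shards=3)
print(list(epoch(shards, seed=1)))
```

The disk only ever sees large sequential reads, while the training loop still receives the samples in a different order each epoch.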

Other ICs for Deep Learning
● Other ICs are also used for deep learning, each suited to its own scenarios
○ FPGA
○ TPU
○ ARM & Nvidia
○ ….

SoC FPGA (1)
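● Simply put, a "hard" processor system, an SoC (ARM processor, memory controller, and I/O peripherals), is embedded in the FPGA programmable logic chip. Intel (Altera) calls this SoC the HPS (Hard Processor System).
● The HPS and the FPGA each have their own bus: the ARM side of the HPS uses the AXI bus and the FPGA uses the Avalon bus, so a bridge has to be designed to let the two sides communicate.
● Following the earlier analysis of GPUs, the biggest factor affecting performance is the bandwidth of the bridge between the SoC and the FPGA.
● Software support is lacking...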
IT Robotics Lab: SoC FPGA 嵌入式系統晶片?
http://blog.ittraining.com.tw/2017/09/soc-fpga.html

SoC FPGA (2)
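● A savior for software support?
● Intel released nGraph, an open-source deep learning compiler suite that aims to let different frameworks run on a variety of hardware (mainly Intel's own Nervana).
● But there are a great many chips, and support from each IC vendor is still needed...
nGraph: A New Open Source Compiler for Deep Learning Systems - Intel AI
https://ai.intel.com/ngraph-a-new-open-source-compiler-for-deep-learning-systems/?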

Google TPU(1)
● The TPU (Tensor Processing Unit) is an ASIC (Application-Specific Integrated Circuit) that Google designed specifically for deep learning.
● TPUv1 can only run inference, not training...
○ The TPU is connected to its host via a PCIe Gen3 x16 bus that provides 12.5GB/s of effective bandwidth.
○ TPU delivered 15–30X higher performance and 30–80X higher performance-per-watt than contemporary CPUs and GPUs.
● The core technique is the systolic array, proposed by Kung in 1982.
○ 256 x 256 8bit multiply-add computational units. That’s a grand total of 65,536 processors capable of cranking out 92 trillion (9.2 x 10^13) operations per second!
[1]An in-depth look at Google’s first Tensor Processing Unit (TPU) | Google Cloud Big Data and Machine Learning Blog | Google Cloud
https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
[2] H. T. Kung, Why Systolic Architectures? (1982) - 1982-kung-why-systolic-architecture.pdf
http://www.eecs.harvard.edu/~htk/publication/1982-kung-why-systolic-architecture.pdf
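The 92 trillion operations per second figure follows from the size of the array. A quick check, assuming the 700 MHz clock that Google has reported for TPUv1 (the clock is not stated on the slide) and counting each multiply-add as two operations:

```python
ARRAY_ROWS = 256
ARRAY_COLS = 256
CLOCK_HZ = 700e6   # assumption: TPUv1 clock as reported by Google, not from the slide
OPS_PER_MAC = 2    # one multiply plus one add per unit per cycle

mac_units = ARRAY_ROWS * ARRAY_COLS
peak_ops = mac_units * OPS_PER_MAC * CLOCK_HZ

print(f"{mac_units} multiply-add units")
print(f"peak: {peak_ops / 1e12:.1f} trillion ops/s")
```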

Google TPU(2)
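● TPUv2 can now do training, but its architecture also looks more like a GPU.
○ It does not appear to carry over the first generation's systolic array, switching instead to more general scalar/vector units.
○ On the software side, for now it works only with Google's own TensorFlow.
● Available only for rent on Google Cloud
https://cloud.google.com/tpu/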
[1]dean-nips17.pdf
http://learningsys.org/nips17/assets/slides/dean-nips17.pdf
[2]【Hot Chips 29】淺談 Google 的 TPU | TechNews 科技新報
https://technews.tw/2018/01/02/about-google-tpu/

ARM & Nvidia
● Nvidia and ARM are partnering to bring Nvidia's open-source hardware to inference workloads as part of ARM's machine learning product plans.
GTC2018：Nvidia全面升級AI軟硬體 - EE Times Taiwan 電子工程專輯網
https://www.eettaiwan.com/news/article/20180328NT01-Nvidia-Taps-Memory-Switch-for-AI?utm_source=EETT%20Article%20Alert&utm_medium=Email&utm_campaign=2018-03-29

Go cloud or build a self-hosted GPU server(1)
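● Paperspace offers the best value for money and suits users who run DL experiments only occasionally.
● Heavy users can bring the average price down with monthly-rental dedicated servers such as LeaderGPU or Hetzner.
● If you want to integrate with other cloud services, AWS and GCE are good choices.
● Unless your jobs run for several days, prefer a low-end single-GPU plan.
● High-end GPUs give too little return on investment; avoid them unless racing against time matters more to your R&D than hardware cost (at that point you may as well build your own box!).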
Machine learning mega-benchmark: GPU providers (part 2)
https://rare-technologies.com/machine-learning-benchmarks-hardware-providers-gpu-part-2/

● Another benchmark, which adds FloydHub, reaches much the same result.
● Conclusion: Paperspace offers the best value for money.
AWS vs Paperspace vs FloydHub : Choosing your cloud GPU partner
https://medium.com/@rupak.thakur/aws-vs-paperspace-vs-floydhub-choosing-your-cloud-gpu-partner-350150606b39

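● About Google Colab's free GPU...
○ This is good news for beginners, who get to learn on a server-class GPU at no cost. For more advanced AI research or moderate use, though, the platform is of limited value.
○ First, the system provides a K80 GPU, launched at the end of 2014 on the Kepler architecture. The architectures that followed Kepler are Maxwell, Pascal, and Volta, so the K80 is already three architecture generations behind.
○ Second, the 'device_lib.list_local_devices()' command shows that each user is assigned only one GPU, which is far from enough for deep learning applications.
○ Third, the platform is less flexible than a standalone server or cluster: checking things like GPU utilization is laborious, so do not count on using it for program performance optimization work.
○ Also, the system presumably virtualizes the GPUs: users do not get a GPU to themselves, and each GPU is likely shared among several users.
○ What Google pays is a limited number of GPUs from older generations, a modest system to maintain, and electricity; what it gains is a crowd of new users forming the habit of using Google's platform, which may well be worth more than the cost.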
如何评价 Google Colab 提供的免费 GPU？ - 知乎
https://www.zhihu.com/question/266242493
Google Colab Free GPU Tutorial – Deep Learning Turkey – Medium
https://medium.com/deep-learning-turkey/google-colab-free-gpu-tutorial-e113627b9f5d

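● Vectordash is a GPU-sharing platform that lets you rent out idle GPUs at home to other users; it claims to be cheaper than cloud services such as AWS.
● Uses container technology to isolate users' data.
● Payments are made online in Bitcoin.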
共享GPU來了！投身去中心化機器學習，比挖礦多賺3倍 - 幫趣
http://bangqu.com/N96d1m.html

Lists of some GPU cloud providers
● Amazon’s AWS
● Google Compute Engine
● Google Colab (Free GPU)
● Paperspace
● Hetzner (dedicated server)
● IBM's Softlayer
● LeaderGPU
● FloydHub
● Microsoft Azure
● ……
● Note: Nvidia GPU Cloud (NGC) is an all-in-one software solution that partners with the cloud services above; it does not compete with them.
GPU-Accelerated Cloud (NGC) for Deep Learning & HPC | NVIDIA
https://www.nvidia.com/en-us/gpu-cloud/

Building your first Deep Learning Box (1)
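● GPU: (see the next slide)
● CPU: at least 2 threads per GPU; a full 40 PCIe lanes and the correct PCIe spec; > 2 GHz; cache does not matter
● RAM: at least as much as the GPU RAM
● Hard drive/SSD: compress your data and read it asynchronously; an HDD is fine unless you use 32-bit floating-point datasets with very high input dimensionality
● PSU: enough wattage, that is GPU + CPU wattage plus 100-300 W, and enough PCIe power connectors (6+8 pin) kept in reserve for future upgrades
● Cooling: with a single GPU, remember to enable the cooling fan in the BIOS; with multiple GPUs, water cooling is recommended
● Motherboard: several PCIe 3.0 slots, leaving room for future upgrades (one GPU takes two slots; a system holds at most 4 GPUs)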
A Full Hardware Guide to Deep Learning - Tim Dettmers
http://timdettmers.com/2015/03/09/deep-learning-hardware-guide/

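● What matters most for GPU performance? The number of CUDA cores? Clock speed? Memory size?
● The most important factor in GPU performance is actually memory bandwidth.
● So, within the same architecture, the higher the memory bandwidth, the better the performance.
● GPUs of different architectures differ slightly because their fabrication processes use memory bandwidth with different efficiency, but by and large memory bandwidth is still a decent indicator.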
Which GPU(s) to Get for Deep Learning
http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/

● Best GPU overall (by a small margin): Titan Xp
● Cost efficient but expensive: GTX 1080 Ti, GTX 1070, GTX 1080
● Cost efficient and cheap: GTX 1060 (6GB)
● I work with data sets > 250GB: GTX Titan X (Maxwell), NVIDIA Titan X Pascal, or NVIDIA Titan Xp
● I have little money: GTX 1060 (6GB)
● I have almost no money: GTX 1050 Ti (4GB)
● I do Kaggle: GTX 1060 (6GB) for any “normal” competition, or GTX 1080 Ti for “deep learning competitions”
● I am a competitive computer vision researcher: NVIDIA Titan Xp; do not upgrade from existing Titan X (Pascal or
Maxwell)
● I am a researcher: GTX 1080 Ti. In some cases, like natural language processing, a GTX 1070 or GTX 1080 might
also be a solid choice — check the memory requirements of your current models
● I want to build a GPU cluster: This is really complicated, you can get some ideas here
● I started deep learning and I am serious about it: Start with a GTX 1060 (6GB). Depending on what area you choose next (startup, Kaggle, research, applied deep learning), sell your GTX 1060 and buy something more appropriate
● I want to try deep learning, but I am not serious about it: GTX 1050 Ti (4 or 2GB)
Which GPU(s) to Get for Deep Learning
http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/

● Sometimes you also need to watch how floating-point precision affects performance
○ half precision: FP16
○ single precision: FP32
○ double precision: FP64
● Higher precision costs more but does not necessarily perform better: deep learning has no use for double precision and needs single precision at most. Some people even use half precision, because halving the bit width means memory can hold a batch twice as large.
○ With 16bit precision (FP16), you can load more data for training equivalent to 16GB of FP32 (either bigger batch sizes or more layers).
● But on the 1080's architecture FP16 runs at only 1/64 the speed of FP32, which cancels out the benefit of shrinking the data...
○ The 1080 has indeed the 16-bit support but these instructions are 1/64th slower than the 32-bits’s
What are the downsides of the NVIDIA's GTX 1080 vs Titan X when used for deep learning? - Quora
https://www.quora.com/What-are-the-downsides-of-the-NVIDIAs-GTX-1080-vs-Titan-X-when-used-for-deep-learning
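The memory argument is easy to quantify. The sketch below counts only the bytes of the input batch and ignores weights, activations, and framework overhead, so treat the numbers as an upper bound; the 8 GiB budget and the 224 x 224 x 3 sample shape are illustrative assumptions.

```python
import struct

# Bytes per value, taken from the struct module's format codes:
# 'e' = half precision, 'f' = single precision, 'd' = double precision
BYTES = {
    "FP16": struct.calcsize("e"),
    "FP32": struct.calcsize("f"),
    "FP64": struct.calcsize("d"),
}


def max_batch_size(memory_bytes, values_per_sample, precision):
    """Largest batch whose raw input values fit in the given memory."""
    return memory_bytes // (values_per_sample * BYTES[precision])


budget = 8 * 2**30        # assumed 8 GiB of GPU memory
sample = 224 * 224 * 3    # assumed image shape
for precision in ("FP64", "FP32", "FP16"):
    print(precision, max_batch_size(budget, sample, precision))
```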

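● The Pascal GP100 core introduced a new kind of FP32 CUDA core that can execute two FP16 instructions in one cycle (they must be the same operation), giving twice the throughput of the Maxwell or Kepler architectures.
● But Nvidia's consumer cards (including the 1080) use the Pascal GP104 core, which provides only one FP16x2 CUDA core per 128 FP32 CUDA cores. That limits the FP16 instruction rate to 1/128 of FP32; multiplied by 2 for the paired operations, FP16 throughput is only 1/64 of FP32.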
FP16 Throughput on GP104: Good for Compatibility (and Not Much Else) - The NVIDIA GeForce GTX 1080 & GTX 1070
Founders Editions Review: Kicking Off the FinFET Generation
https://www.anandtech.com/show/10325/the-nvidia-geforce-gtx-1080-and-1070-founders-edition-review/5
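The 1/64 figure is just the core ratio times the two operations packed into each FP16x2 instruction. A small sketch with exact fractions (the function name is hypothetical):

```python
from fractions import Fraction


def fp16_relative_throughput(fp32_cores_per_fp16x2_core, ops_per_instruction=2):
    """FP16 throughput relative to FP32, given how many FP32 cores
    share one FP16x2-capable core."""
    instruction_rate = Fraction(1, fp32_cores_per_fp16x2_core)
    return instruction_rate * ops_per_instruction


print("GP104:", fp16_relative_throughput(128))  # one FP16x2 core per 128 FP32 cores
print("GP100:", fp16_relative_throughput(1))    # every FP32 core handles FP16x2
```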

● If you want to run on a laptop...
● Some recommend the Acer Predator G9-593.
● After installing Ubuntu and a few packages, you can start running pix2pix-tensorflow!
Early Experiences with Deep Learning on a Laptop with Nvidia GTX 1070 GPU – part 1
https://amundtveit.com/2017/02/28/deep-learning-on-a-laptop-with-nvidia-gtx-1070-gpu-part-1/

● My Deep Learning Box
● System: ASRock DeskMini250 GTX (~$46400), Oct 07, 2017
○ Barebone (motherboard + network card) (~$13900)
○ GPU: NVIDIA GTX1070 (MXM interface) (~$22500)
○ CPU: Intel Pentium G4600 (3.6GHz) (performance close to the i3-7100; the difference is no AVX instruction support) (~$2150)
○ RAM: Crucial DDR4-2400 16G (~$4400)
○ SSD: Plextor M8PeG 256G NVMe (~$3450)
○ HDD: Seagate 2T SATA (2.5inch) (~$3450)
● Monitor: LG 22MP58VQ-P 22-inch AH-IPS widescreen (~$2588)

What is Docker(1)
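● Docker is a lightweight operating-system-level virtualization solution; the virtual environments it creates are called containers.
● Traditional virtualization, such as VirtualBox, emulates a whole set of hardware in software and then runs a complete OS on top of it. Unsurprisingly, this is quite inefficient.
● Docker's insight: since an OS is already there, why not make good use of it? Different Linux distributions can share the same kernel, so adding a virtualization API layer on top is enough to make applications look as if they are running on different machines.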

What is Docker(2)
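● Top figure: a process in user space calls system calls directly to request resources from kernel space.
● Bottom figure: the container is itself a user-space process; it makes the system calls that request resources from kernel space on behalf of the processes running inside the container.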
Architecting Containers Part 1: Why Understanding User Space vs. Kernel Space Matters – Red Hat Enterprise Linux Blog
https://rhelblog.redhat.com/2015/07/29/architecting-containers-part-1-user-space-vs-kernel-space/

What is Docker(3)
● Disk I/O
● CPU overhead
What is the runtime performance cost of a Docker container - Stack Overflow
https://stackoverflow.com/questions/21889053/what-is-the-runtime-performance-cost-of-a-docker-container

Why do I need Docker for Deep Learning
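● Sounds nice, but what does it have to do with running deep learning?
● Because before you can run deep learning, you have to install a whole pile of things...
○ Nvidia Graphics Driver, CUDA library, python, jupyter notebook, tensorflow, and many other libraries and settings...
○ I followed the instructions exactly, so why do I still get errors?
○ Finally got everything installed, then accidentally hit upgrade and the whole environment broke...
○ Setting up a dev environment for a new colleague, and something is always missing...
○ Want to try a new version to see if it performs better, but afraid of wrecking the environment with no way back...
○ ...
● You need Docker to sort all of this out!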

Demo: Setup your environment with Docker (1)
● Easier done than understood... just walk through it once and you will see!
● First, install Ubuntu 16.04 LTS (the long-term support release from April 2016) on your new machine
https://www.ubuntu.com/download/desktop
○ Install Docker https://github.com/allenyllee/server_setup/blob/master/ubuntu_setup/ubuntu_setup.sh#L23-L59
○ Install the Nvidia driver https://github.com/allenyllee/server_setup/blob/master/ubuntu_setup/ubuntu_setup.sh#L159-L187
○ Install nvidia-docker https://github.com/allenyllee/server_setup/blob/master/ubuntu_setup/ubuntu_setup.sh#L188-L218
○ Set up docker compose https://github.com/allenyllee/server_setup/blob/master/ubuntu_setup/ubuntu_setup.sh#L233-L262
● Use an image someone else has already built: deepo
○ docker pull ufoym/deepo:all-py36-jupyter
○ nvidia-docker run -it -p 8888:8888 ufoym/deepo:all-py36-jupyter jupyter notebook --no-browser --ip=0.0.0.0 --allow-root --NotebookApp.token= --notebook-dir='/root'
○ http://localhost:8888

● Every wish granted in one go. Pretty great, right?
● You can also run several instances; just change the external port number:
○ nvidia-docker run -it -p [port]:8888 ufoym/deepo:all-py36-jupyter jupyter notebook --no-browser --ip=0.0.0.0 --allow-root --NotebookApp.token= --notebook-dir='/root'
● If you have a dataset, you can share a folder with the container:
○ nvidia-docker run -it -p [port]:8888 --volume [host_dir]:[container_dir] ufoym/deepo:all-py36-jupyter jupyter notebook --no-browser --ip=0.0.0.0 --allow-root --NotebookApp.token= --notebook-dir='/root'
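When launching several notebook containers, it can help to generate the command line rather than retype it. The helper below is a hypothetical convenience wrapper, not part of Docker or deepo; it only assembles the same flags used in the commands on this slide and prints the result.

```python
import shlex

IMAGE = "ufoym/deepo:all-py36-jupyter"


def jupyter_run_command(host_port, volumes=None):
    """Build the nvidia-docker argument list for one Jupyter container.

    volumes maps host directories to container directories.
    """
    cmd = ["nvidia-docker", "run", "-it", "-p", f"{host_port}:8888"]
    for host_dir, container_dir in (volumes or {}).items():
        cmd += ["--volume", f"{host_dir}:{container_dir}"]
    cmd += [IMAGE, "jupyter", "notebook", "--no-browser", "--ip=0.0.0.0",
            "--allow-root", "--NotebookApp.token=", "--notebook-dir=/root"]
    return cmd


# Second instance on port 8889 with a shared dataset folder (paths are examples)
cmd = jupyter_run_command(8889, {"/data": "/root/data"})
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Returning a list instead of a string means the same value can be handed to `subprocess.run` without worrying about shell quoting.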

● Use Portainer to manage containers
○ docker volume create portainer_data
○ docker run -d -p 9000:9000 -v /var/run/docker.sock:/var/run/docker.sock -v portainer_data:/data portainer/portainer
● Set up GitLab in a second
○ sudo docker run --detach --hostname gitlab.example.com --publish 443:443 --publish 80:80 --publish 22:22 --name gitlab --restart always --volume /srv/gitlab/config:/etc/gitlab --volume /srv/gitlab/logs:/var/log/gitlab --volume /srv/gitlab/data:/var/opt/gitlab gitlab/gitlab-ce:latest
● And much, much more... most software has a Docker-packaged version, and you can also package your own and upload it to DockerHub

● Working with containers
○ docker run ubuntu:14.04 /bin/echo 'Hello world'
○ docker run -t -i ubuntu:14.04 /bin/bash # interactive mode
○ docker run -d ubuntu:17.10 /bin/sh -c "while true; do echo hello world; sleep 1; done" # background mode
○ docker container ls # list containers
○ docker exec -it [container_id] bash # enter a container
○ docker container stop [container_id] # stop a container
○ docker container rm [container_name] # remove a container
● Working with images
○ In the directory containing the Dockerfile, run: docker build -t [image_name]:[tag] . # build an image
○ docker pull [image]:[tag] # pull an image from DockerHub
前言 · Docker —— 从入门到实践
https://yeasy.gitbooks.io/docker_practice/content/

● Dockerfile template
https://github.com/allenyllee/server_setup/blob/master/Dockerfile/template/Dockerfile
● anaconda with dlib and nvidia gpu support
○ dockerfile https://github.com/allenyllee/condaD/blob/master/condad-gpu.Dockerfile
○ dockerhub https://hub.docker.com/r/allenyllee/condad-gpu/

What is Mesos(1)
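● Mesos is a Datacenter Operating System (DCOS) that abstracts and schedules the resources of an entire data center (CPU, GPU, memory, storage, network, and so on).
● Mesos puts all hardware resources into a single resource pool so that all the hosts behave like one big computer; when an application on top needs hardware resources, Mesos takes care of scheduling them.
● Combined with Docker, this gives software-defined, dynamic resource management.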
DCOS到底是啥？看完这篇你就懂了~ - 51CTO.COM
http://cloud.51cto.com/art/201603/506805.htm

What is Mesos(2)
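● Suppose services are deployed on four machines. The system administrator has to know what resources each server has, and when a new requirement arrives, pick the server with the lowest resource utilization of the four to deploy on.
● Later, host1 may fail so that its service cannot start; everyone can then connect only to host2, which is overwhelmed by traffic, runs out of memory, and crashes.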
Day1: 使用Apache Mesos的目的為何？ - iT 邦幫忙::一起幫忙解決難題，拯救IT 人的一天
https://ithelp.ithome.com.tw/articles/10184643

What is Mesos(3)
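● With an Apache Mesos architecture, you describe the resources each service needs in JSON and send it to Marathon through its RESTful API to start the service; the Mesos master then schedules it automatically according to those requirements.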
Day1: 使用Apache Mesos的目的為何？ - iT 邦幫忙::一起幫忙解決難題，拯救IT 人的一天
https://ithelp.ithome.com.tw/articles/10184643
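A service description of the kind sent to Marathon might look like the following. This is a sketch under assumptions: the field names (`id`, `cpus`, `mem`, `instances`, `container`) follow Marathon's app definition as commonly documented, the service name and image are made up, and the snippet only builds and prints the JSON rather than contacting a server. Check the Marathon documentation for your version before relying on it.

```python
import json


def marathon_app(app_id, image, cpus, mem_mb, instances=1):
    """Describe one service and the resources each instance needs."""
    return {
        "id": app_id,
        "cpus": cpus,
        "mem": mem_mb,
        "instances": instances,
        "container": {"type": "DOCKER", "docker": {"image": image}},
    }


# Hypothetical service: two instances, half a CPU and 128 MB each
payload = json.dumps(
    marathon_app("/demo-service", "nginx:latest", 0.5, 128, instances=2),
    indent=2,
)
print(payload)  # this body would be POSTed to Marathon's /v2/apps endpoint
```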

What is Mesos(4)
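● The master coordinates all the slaves, tallies the resources of each node, and sends resource offers to the frameworks registered with it.
● A framework decides whether to accept an offer based on its application's requirements.
● Once an offer is accepted, the master coordinates the framework and the slaves, schedules the tasks onto the participating nodes, and runs them in containers.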
DCOS到底是啥？看完这篇你就懂了~ - 51CTO.COM
http://cloud.51cto.com/art/201603/506805.htm
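The offer cycle described above can be illustrated with a toy model. This is not the Mesos API: the dictionaries, function names, and node names are invented for the example, and real offers carry far more detail. It only shows the division of labor, with the master advertising free resources and the framework deciding which offer to take.

```python
def make_offers(agents):
    """Master side: one offer per node, listing its free resources."""
    return [{"agent": name, **free} for name, free in agents.items()]


def framework_schedule(offers, task):
    """Framework side: accept the first offer that fits, decline the rest."""
    for offer in offers:
        if offer["cpus"] >= task["cpus"] and offer["mem"] >= task["mem"]:
            return offer["agent"]
    return None


def launch(agents, agent_name, task):
    """Master side: run the task on the chosen node and deduct its resources."""
    agents[agent_name]["cpus"] -= task["cpus"]
    agents[agent_name]["mem"] -= task["mem"]


agents = {"slave1": {"cpus": 1, "mem": 512}, "slave2": {"cpus": 4, "mem": 4096}}
task = {"cpus": 2, "mem": 1024}

chosen = framework_schedule(make_offers(agents), task)
if chosen is not None:
    launch(agents, chosen, task)
print(chosen, agents)
```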

What is Mesos(5)
● Deploying Mesos with Docker
○ Mesos Master # assigns tasks to nodes
○ Mesos Slave # worker node
○ Marathon # launches, monitors, and scales containers
○ ZooKeeper # helps locate the master's address
Deploy a Mesos Cluster with 7 Commands Using Docker
https://medium.com/@gargar454/deploy-a-mesos-cluster-with-7-commands-using-docker-57951e020586
通过Docker来部署Mesos集群 - DockOne.io
http://dockone.io/article/136
bobrik/mesos-compose: Mesos cluster in one command
https://github.com/bobrik/mesos-compose
Sebastien Goasguen: 1 Command to Mesos with Docker Compose
https://sebgoa.blogspot.tw/2015/03/1-command-to-mesos-with-docker-compose.html

Demo: TensorFlow on Mesos(1)
● Benefits of running Distributed TensorFlow on DC/OS
○ Simplify the deployment of distributed TensorFlow
○ Share infrastructure across teams
○ Deploy different TensorFlow versions on the same cluster
○ Allocate GPUs dynamically
○ Focus on model development, not deployment
○ Automate failure recovery
○ Deploy job configuration parameters securely at runtime
Distributed TensorFlow with GPU Support on Mesosphere DC/OS
https://mesosphere.com/blog/tensorflow-gpu-support-deep-learning/
基于 Mesos、Docker 和 Nvidia GPU 的深度学习平台实践-DockerInfo
http://www.dockerinfo.net/3697.html

Demo: TensorFlow on Mesos(2)
Training Your Custom Model with the DC/OS TensorFlow package - Mesosphere
https://mesosphere.com/blog/tensorflow-custom-model/
通过Docker来部署Mesos集群 - DockOne.io

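Comparing Docker, Mesos, and Kubernetes
● Docker:
○ The mainstream container management tool
○ A container can hold all the libraries and packages a piece of software needs at runtime
● Mesos:
○ A lightweight resource-sharing layer that is extensible and works with a variety of frameworks
○ Provides resource scheduling and load balancing
○ Closer to the infrastructure level
● Kubernetes:
○ A container orchestration system that uses labels and pods to group containers into logical units
○ Can deploy large numbers of containers at once to form a service
○ Closer to the application level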
巅峰对决之Swarm、Kubernetes、Mesos - DockOne.io

About HyperPilot(1)
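● HyperPilot uses Bayesian optimization, a machine learning technique, to quickly sample its way to a better VM configuration.
● When you need to deploy a cluster of Docker containers in the cloud, the first question is how to rent VMs; you have to weigh cost against many intertwined variables at once, including VM specifications, container configuration, and the application's configuration requirements.
● The toolset also provides a resource bottleneck analysis tool, HyperPath, which evaluates bottlenecks in terms of CPU, memory, network, I/O, and so on.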
Hyperpilot open sourced 100% of its products – Timothy Chen – Medium
https://medium.com/@tnachen/hyperpilot-open-sourced-100-of-its-products-18d0e018fe45

About HyperPilot(2)

About HyperPilot(3)

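What is Continuous Integration (CI)(1)
● The goal of Continuous Integration is to verify every change to a software system continuously and automatically.
○ Build
○ Test
○ Source code analysis
○ Other related tasks
● Once verification passes, it can be extended into automated release or deployment (Continuous Delivery / Continuous Deployment). This process safeguards software quality, so that a single bad change does not produce wrong results or a crash.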
持續性整合與發佈 (Continuous Integration / Continuous Delivery) 之相關應用 - 91APP
https://blog.91app.com/continuous-integration-delivery/

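● Hardware/firmware development actually needs CI even more
● From my past BIOS work...
○ The department used no version control and had no automated testing; every BIOS release cost the QA staff at least 3 days of testing (around the clock!)
○ Without automated testing, a wrong change once left every notebook on the production line unable to boot
○ While I was on site supporting a customer, a colleague back at the office released a test BIOS with a new feature. It would not boot after flashing, and I did not have his feature code on hand. With no version control, all I could do was ask him over chat what he had changed and debug on the spot while the customer paced anxiously beside me...
● Simply using Git + CI would have avoided all of the above!
● From my past SSD work...
○ An SSD has to be attached to a test fixture for firmware flashing, which requires low-level control of the hardware
○ Modify Docker so that it can access the hardware directly!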

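● CI in one picture
● The idea:
○ Push each day's small changes to git; do not wait until the deadline to push one big bundle
○ Tests run automatically after each push, and the test report is ready the next day
○ Use the test report to track development progress and catch bugs promptly
○ There is always an up-to-date, runnable version of the code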
Three amigios | CI - CD - Test Automation
https://go-gaga-over-testing.blogspot.tw/2017/10/three-amigios-ci-cd-test-automation.html

CI for your Machine Learning project(1)
● How can CI be applied to AI product development?
● Traditional CI workflow:
○ Build a Docker image
○ Run our unit/integration tests (within an instance of that Docker image)
○ Run acceptance tests (end-to-end, may require some orchestration)
○ Deploy to staging and production
Continuous Integration for ML Projects – Onfido Tech – Medium
https://medium.com/onfido-tech/continuous-integration-for-ml-projects-e11bc1a4d34f

CI for your Machine Learning project(2)
● How can CI be applied to AI product development?
● CI for your Machine Learning product
○ Allow the Docker image building step to resolve the model dependencies
○ Run unit/integration tests (fast to fail)
○ Run acceptance tests (usually slower than the previous set)
○ Download the test dataset (currently using S3 to store this information)
○ Trigger the accuracy tests (speed can vary greatly depending on hardware, sample size, etc.)
Continuous Integration for ML Projects – Onfido Tech – Medium
https://medium.com/onfido-tech/continuous-integration-for-ml-projects-e11bc1a4d34f
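The accuracy test in the last step can be as small as a threshold check that fails the build. The sketch below is an assumption about how such a gate might look, not the setup described in the cited article; the function names, the sample predictions, and the 0.7 threshold are all illustrative.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal in length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def check_accuracy_gate(predictions, labels, threshold):
    """Raise AssertionError (failing the CI job) if accuracy drops below threshold."""
    score = accuracy(predictions, labels)
    if score < threshold:
        raise AssertionError(
            f"accuracy {score:.3f} is below the required {threshold:.3f}"
        )
    return score


# Illustrative data: 3 of 4 predictions are correct
score = check_accuracy_gate([1, 0, 1, 1], [1, 0, 0, 1], threshold=0.7)
print(f"accuracy gate passed with {score:.2f}")
```

Because the gate raises an exception, any CI runner that treats a non-zero exit code as failure will block the deploy step when a model regression slips in.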

Demo: Setup your Jenkins CI Tool
● https://github.com/allenyllee/server_setup/blob/master/jenkins/jenkins_tutorial.md
● http://localhost:8081
● MOUNTPOINT=/mnt/docker-srv/jenkins_home
docker run -d \
-p 8081:8080 \
-p 50000:50000 \
-v $MOUNTPOINT:/var/jenkins_home \
-v /var/run/docker.sock:/var/run/docker.sock \
-v $(which docker):/usr/bin/docker \
-v /usr/lib/x86_64-linux-gnu/libltdl.so.7:/usr/lib/x86_64-linux-gnu/libltdl.so.7 \
--name jenkins \
--restart=always \
jenkins/jenkins:lts

What is Blockchain(1)
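● Blockchain is the foundational algorithm of Bitcoin's underlying network.
● What sets Bitcoin apart from earlier virtual currencies is that blockchain removes the need for a trusted third party, making fully decentralized transactions possible.
● Traditionally, online transactions suffer from problems such as double spending and fraudulent transactions, so a trusted third party (a bank or mobile payment provider, for example) is needed to vouch for them.
● Getting any two nodes to reach consensus in a network that may be full of malicious misinformation, without any third-party arbiter, is equivalent to the Byzantine Generals Problem in computer science.
● The Byzantine Generals Problem: Byzantium was the capital of the Eastern Roman Empire. To defend against enemies, its armies were stationed far apart, and the generals could communicate only by messenger. Only when all the generals reached a unanimous consensus could they decide whether to attack. But there were traitors among the generals who spread false messages, and messengers could also die on the way, leaving the armies unable to agree.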

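● Satoshi Nakamoto (the creator of Bitcoin) proposed a solution: the "proof-of-work chain", later known as the blockchain.
● Proof-of-work:
○ Have every general solve a math problem that takes about ten minutes to compute, which limits how many attack times each general can propose on the network at any moment
○ Whenever a general finds an answer, that proof of work is appended to the previous general's proof of work, forming a long chain that records the attack time each general proposed and the overall roster
○ From this long chain, a safe attack time is derived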
區塊鏈如何運作? – Ben Z.W. Jian – Medium
https://medium.com/@benzwjian/%E5%8D%80%E5%A1%8A%E9%8F%88%E5%A6%82%E4%BD%95%E9%81%8B%E4%BD%9C-b7c8d4131a0e
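A toy version of the proof-of-work chain makes the mechanism concrete. This sketch is heavily simplified and is not Bitcoin's actual block format: the "math problem" is finding a nonce whose SHA-256 hash starts with a few zero hex digits, the difficulty is kept tiny so it runs instantly, and the messages are invented for the example.

```python
import hashlib


def block_hash(prev_hash, payload, nonce):
    """Hash of one block: previous hash, message, and nonce together."""
    data = f"{prev_hash}|{payload}|{nonce}".encode()
    return hashlib.sha256(data).hexdigest()


def mine(prev_hash, payload, difficulty):
    """Search for a nonce whose hash starts with `difficulty` zero hex digits."""
    prefix = "0" * difficulty
    nonce = 0
    while True:
        digest = block_hash(prev_hash, payload, nonce)
        if digest.startswith(prefix):
            return nonce, digest
        nonce += 1


def build_chain(payloads, difficulty):
    """Append each proof of work to the previous one, forming a chain."""
    chain, prev = [], "0" * 64
    for payload in payloads:
        nonce, digest = mine(prev, payload, difficulty)
        chain.append({"prev": prev, "payload": payload, "nonce": nonce, "hash": digest})
        prev = digest
    return chain


def verify(chain, difficulty):
    """Check every link and every proof of work; cheap compared with mining."""
    prev = "0" * 64
    for block in chain:
        expected = block_hash(block["prev"], block["payload"], block["nonce"])
        if (block["prev"] != prev or block["hash"] != expected
                or not expected.startswith("0" * difficulty)):
            return False
        prev = block["hash"]
    return True


chain = build_chain(["attack at dawn", "attack at noon"], difficulty=2)
print(verify(chain, difficulty=2))
```

Changing any earlier message invalidates its hash and every link after it, so a traitor would have to redo all the work that followed.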

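● Smart contract: a programmable object that Ethereum implements on the blockchain
● Simply put, a smart contract is a piece of code that lives on the blockchain and is executed by every node on the network
● So anyone can write a program and have the machines across the whole network run it for them! (At the cost of paying Ether, of course)
● Examples: automatic transfers, futures trading, contract execution, lending, insurance, intellectual property licensing...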
learning-blockchain/smart-contracts.md at master · OSE-Lab/learning-blockchain
https://github.com/OSE-Lab/learning-blockchain/blob/master/ethereum/smart-contracts.md
什麼是智能合約(Smart Contract)? | 蓋索林 Gasolin
https://blog.gasolin.idv.tw/2017/09/02/what-is-smart-contract/

Blockchain and AI
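● Blockchain solves the consensus problem; what AI-related applications can it bring?
● Deep learning with fog computing
○ Conceptually, anyone can rent out the idle GPU at home to others through blockchain smart contracts
○ Keep a GPU at home to earn some extra cash, or rent one from others when you need it
○ SONM: https://sonm.com/
● The blockchain itself is one enormous dataset that can be analyzed with AI
● AIoT devices in a factory communicate with one another over the blockchain and make decisions autonomously, achieving industrial automation
● Neural networks communicating over the blockchain and evolving without end... (The Matrix?)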
The convergence of AI and Blockchain: what’s the deal?
https://medium.com/@Francesco_AI/the-convergence-of-ai-and-blockchain-whats-the-deal-60c618e3accc
How Blockchains could transform Artificial Intelligence - Dataconomy
http://dataconomy.com/2016/12/blockchains-for-artificial-intelligence/

What is Quantum Computing(1)
● At CES 2018, Intel unveiled a 49-qubit quantum chip
● On 2018/03/05, Google reported at the American Physical Society annual meeting that it is testing a 72-qubit quantum computer

What is Quantum Computing(2)
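● Properties of qubits
○ Superposition
○ Entanglement
○ No-cloning
● Quantum supremacy
○ Today's most powerful supercomputers can simulate only 46 qubits
○ Simulating an N-qubit system takes memory that grows as 2^N (2^46 bytes = 64 TB)
○ Beyond that number of qubits, the computing power exceeds that of every supercomputer on Earth, achieving quantum supremacy
○ Applications include code breaking, drug discovery, molecular dynamics simulation, organic battery materials research, and more
● Quantum instability is a major weakness, but Microsoft has proposed topological qubits to try to solve it
○ 【重磅】微软量子计算重大突破：量子系统或存在天使粒子，一个稳定的量子比特强过 1 万个
https://zhuanlan.zhihu.com/p/35065555
● Quantum programming languages
○ Microsoft's Q#
https://docs.microsoft.com/en-us/quantum/quantum-qr-intro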
量子机器学习入门科普：解读量子力学和机器学习的共生关系
https://zhuanlan.zhihu.com/p/33173860
Quantum parallel computing
Quantum encryption
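The memory wall behind the 46-qubit limit is plain exponential growth: a full simulation stores 2^N amplitudes. The sketch below counts bytes under two assumptions that are not from the slide: one byte per amplitude as a lower bound, and 16 bytes per amplitude for a double-precision complex number.

```python
TIB = 2 ** 40


def state_vector_bytes(num_qubits, bytes_per_amplitude=1):
    """Memory needed to store all 2^N amplitudes of an N-qubit state."""
    return (2 ** num_qubits) * bytes_per_amplitude


for n in (46, 49, 72):
    low = state_vector_bytes(n) // TIB
    full = state_vector_bytes(n, bytes_per_amplitude=16) // TIB
    print(f"{n} qubits: {low} TiB at 1 byte/amplitude, {full} TiB at 16 bytes/amplitude")
```

Each added qubit doubles the requirement, which is why a few extra qubits move a simulation from demanding to impossible.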

Quantum Computing for Machine Learning(1)
How quantum effects could improve artificial intelligence
https://phys.org/news/2016-10-quantum-effects-artificial-intelligence.html
Quantum computers could greatly accelerate machine learning
https://phys.org/news/2015-03-quantum-greatly-machine.html

Quantum Computing for Machine Learning(2)
Quantum algorithm could help AI think faster
https://phys.org/news/2018-02-quantum-algorithm-ai-faster.html

Recap
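● Many companies now claim they will build AI chips; beyond raw compute speed, the choice should rest on the usage scenario and software support.
● Startups building products should make good use of open-source tools such as Git, Docker, and Jenkins for management, and pick the right option by weighing GPU utilization against the cost of renting cloud capacity or of building and running their own GPU server.
● Large enterprises adopting AI are best served by having senior executives set up a new department to drive it, with professional deep learning architects providing solutions.
● Individual developers can take advantage of free resources such as Google Colab, or rent pay-by-the-hour cloud services such as Paperspace, to lower the initial investment.
● AI is just an umbrella term; in essence it is a combination of technologies old and new. The key is to make good use of open-source tools, improve processes, and create new value!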
