13. • Without global data communication
• Better precision of summation (same as B2)
• Systolic output path (or use the next row in 2D)
• Nodes are activated half of the time.
27. Priority Queue Operations
• insert()
• delete()
• extract_min()
For n operations: O(n log n) vs. O(n)
Key: one operation can be issued after another in O(1) time.
28. Priority Queue Operations
• insert(k): sink down the element with key k.
• delete(k):
  A) Sink down a fake element with key k to find the target.
  B) Remove the target.
  C) Bubble up the elements below it.
• extract_min():
  A) Take the first element.
  B) Bubble up the elements below it.
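These operations map naturally onto a linear array of cells that keeps itself sorted. Below is a minimal sequential sketch of that behavior in plain Python (class and variable names are placeholders, not from the slides); the actual systolic queue does the sinking and bubbling in parallel across neighboring cells, which is what lets each operation be issued in O(1) time.

# Sequential sketch of the sink-down / bubble-up behavior described above.
# A sorted Python list stands in for the linear array of cells, so each call
# here costs O(n) work, whereas the systolic version overlaps that work
# across cells.
import bisect

class LinearPriorityQueue:
    def __init__(self):
        self.cells = []                      # kept sorted, smallest key first

    def insert(self, k):
        # "Sink down the element with key k": the new key moves along the
        # array until it meets a larger neighbor.
        bisect.insort(self.cells, k)

    def delete(self, k):
        # A) a fake element with key k sinks to the target's position,
        # B) the target is removed,
        # C) the elements below bubble up to close the gap (list shift here).
        i = bisect.bisect_left(self.cells, k)
        if i < len(self.cells) and self.cells[i] == k:
            self.cells.pop(i)

    def extract_min(self):
        # A) take the first element; B) the rest bubble up by one cell.
        return self.cells.pop(0) if self.cells else None

q = LinearPriorityQueue()
for key in (5, 1, 7, 3):
    q.insert(key)
q.delete(7)
print(q.extract_min(), q.extract_min())      # -> 1 3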
35. Cloud TPU
Google Cloud Platform Blog
https://cloud.google.com/tpu/
TPU V2, TPU V3, TPU V2 Pod
36. TPU Programming
• A Cloud TPU has 4 chips x 2 cores x 1 or 2 MXUs
• MXU
  • 128x128 systolic array
  • 16K MACs / cycle
  • bfloat16
• TPU memory prefers 8-byte alignment.
• 8 or 16 GB HBM2 / core
https://cloud.google.com/tpu/docs/tpus
https://www.nvidia.com/en-us/geforce/products/10series/titan-x-pascal/
For comparison, a Titan X has 3.5K CUDA cores.
So each TPU V3 card has 4 chips x 2 cores x 2 MXUs x 16K MACs / cycle
= 256K MACs / cycle at most.
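The 256K figure is just the product of the numbers quoted above; a quick sanity check of the arithmetic:

# Peak-throughput arithmetic for a TPU V3 card, from the figures above.
chips, cores_per_chip, mxus_per_core = 4, 2, 2
macs_per_mxu = 128 * 128                     # one 128x128 systolic array -> 16,384, i.e. ~16K MACs/cycle
peak_macs_per_cycle = chips * cores_per_chip * mxus_per_core * macs_per_mxu
print(peak_macs_per_cycle)                   # 262,144, i.e. ~256K MACs per cycle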
38. TPU Programming
• XLA compiler for TensorFlow programs.
• Tiling => needs reshapes
• Shape => no dynamic batch size (see the sketch below)
• Padding => under-utilizes the TPU, uses more memory
• op_profile tool
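For the shape constraint, the usual fix is to make the batch dimension static, for example by dropping the final partial batch in the input pipeline. A minimal sketch (the batch size is an arbitrary example, not from the slides):

# XLA compiles one program per shape, so the batch dimension should be fixed.
# Dropping the last partial batch keeps every batch at exactly batch_size.
import tensorflow as tf

batch_size = 1024                                           # example value
dataset = tf.data.Dataset.range(10_000)
dataset = dataset.batch(batch_size, drop_remainder=True)    # static batch dimension
# With drop_remainder=False the final batch could be smaller, producing a
# dynamic shape that forces recompilation (or fails on TPU).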
39. TPU Programming
• Dense vector and matrix computations are fast
  • M x M, M x v, convolution
• Data movement over PCIe is slow.
• Only the dense parts of the model, plus the loss and gradient subgraphs, run on the TPU.
• I/O, reading data, writing checkpoints, and preprocessing data stay on the CPU.
  • decoding compressed images, randomly sampling/cropping, assembling training minibatches
• Non-matrix operations will likely not achieve high MXU utilization.
  • add, reshape, or concatenate
• Feature dimension => multiple of 128
• Batch dimension => multiple of 8 (padding sketch below)
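A small illustration of the dimension-rounding guideline above, with hypothetical sizes: pad the batch dimension up to a multiple of 8 and the feature dimension up to a multiple of 128 so the MXU tiles are fully used instead of being implicitly zero-padded by the compiler.

# Pad a (batch, features) array up to TPU-friendly multiples (example sizes).
import numpy as np

def round_up(n, multiple):
    return ((n + multiple - 1) // multiple) * multiple

x = np.random.rand(30, 200).astype(np.float32)       # batch=30, features=200
padded_batch = round_up(x.shape[0], 8)                # 32
padded_feat = round_up(x.shape[1], 128)               # 256
x_padded = np.zeros((padded_batch, padded_feat), dtype=x.dtype)
x_padded[:x.shape[0], :x.shape[1]] = x
print(x_padded.shape)                                 # (32, 256)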
40. TPUEstimator
• TPUEstimator provides a graph operator to build and run a replicated computation.
https://www.tensorflow.org/api_docs/python/tf/contrib/tpu/TPUEstimator
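The linked TF 1.x API works roughly as follows: you write a model_fn that returns a TPUEstimatorSpec, and TPUEstimator replicates it across the cores. A hedged sketch; all names, sizes, and paths below are placeholders, not from the slides.

# Minimal TPUEstimator flow (TF 1.x, tf.contrib.tpu), placeholder values throughout.
import tensorflow as tf

def model_fn(features, labels, mode, params):
    logits = tf.layers.dense(features, 10)                   # toy model
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    optimizer = tf.train.GradientDescentOptimizer(0.01)
    # CrossShardOptimizer averages gradients across the replicated cores.
    optimizer = tf.contrib.tpu.CrossShardOptimizer(optimizer)
    train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
    return tf.contrib.tpu.TPUEstimatorSpec(mode=mode, loss=loss, train_op=train_op)

resolver = tf.contrib.cluster_resolver.TPUClusterResolver(tpu='my-tpu')   # hypothetical TPU name
run_config = tf.contrib.tpu.RunConfig(
    cluster=resolver,
    model_dir='gs://my-bucket/model',                                     # hypothetical path
    tpu_config=tf.contrib.tpu.TPUConfig(iterations_per_loop=100))

estimator = tf.contrib.tpu.TPUEstimator(
    model_fn=model_fn,
    config=run_config,
    use_tpu=True,
    train_batch_size=1024)        # global batch size, split across the cores
# estimator.train(input_fn=my_input_fn, max_steps=1000)      # my_input_fn is user-supplied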