18. Little’s Law
L = λW
The long-term average number of customers in a stable system L
is equal to the long-term average effective arrival rate, λ, multiplied
by the average time a customer spends in the system, W; or
expressed algebraically: L = λW.
http://en.wikipedia.org/wiki/Little's_law
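The law can be checked with a quick calculation (the arrival rate and residence time below are made-up numbers, not from the deck):

```java
// Little's Law: L = lambda * W
public class LittlesLaw {
    public static void main(String[] args) {
        double lambda = 50.0; // assumed average arrival rate (messages/second)
        double w = 0.2;       // assumed average time a message spends in the system (seconds)
        double l = lambda * w; // average number of messages in the system
        System.out.println(l); // 10.0
    }
}
```

For a Storm topology this means: if tuples arrive at 50/sec and each takes 0.2 sec end-to-end, on average 10 tuples are in flight at any moment.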
29. Externalize Configuration
Hard-coded values require
recompilation/repackaging.
conf.setNumWorkers(3);
builder.setSpout("spout", new RandomSentenceSpout(), 5);
builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));
Values from external config.
No repackaging!
conf.setNumWorkers(Integer.parseInt(props.getProperty("num.workers")));
builder.setSpout("spout", new RandomSentenceSpout(), Integer.parseInt(props.getProperty("spout.parallelism")));
builder.setBolt("split", new SplitSentence(), Integer.parseInt(props.getProperty("split.parallelism"))).shuffleGrouping("spout");
builder.setBolt("count", new WordCount(), Integer.parseInt(props.getProperty("count.parallelism"))).fieldsGrouping("split", new Fields("word"));
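One way to supply those values is a plain java.util.Properties file; a minimal sketch (the file name "topology.properties" and the default of 3 are assumptions):

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

public class TopologyConfigLoader {
    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        // Load settings from an external file so changes need no repackaging
        try (FileInputStream in = new FileInputStream("topology.properties")) {
            props.load(in);
        }
        // Properties values are Strings; parse before handing them to Storm's
        // setters, which expect numbers (e.g. conf.setNumWorkers(numWorkers))
        int numWorkers = Integer.parseInt(props.getProperty("num.workers", "3"));
        System.out.println(numWorkers);
    }
}
```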
36. Parallelism == Manifold
Take input from one big pipe and
distribute it to many smaller pipes
The bigger the size difference, the
more parallelism you will need
41. Sizeup — Fire
What are my water
sources? What GPM
can they support?
How many lines (hoses)
do I need?
How much water will I
need to flow to put this
fire out?
42. Sizeup — Storm
What are my input
sources?
At what rate do they
deliver messages?
What size are the
messages?
What's my slowest data
sink?
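Those sizeup questions reduce to back-of-envelope arithmetic; a sketch with hypothetical numbers (none of these rates come from the deck):

```java
// Storm sizeup: compare input rate/volume against the slowest sink
public class StormSizeup {
    public static void main(String[] args) {
        double inputRate = 20_000;    // messages/second from sources (assumed)
        double msgSizeBytes = 512;    // average message size (assumed)
        double sinkRate = 5_000;      // messages/second the slowest sink absorbs (assumed)

        // Total input volume in MB/s
        double inputMBps = inputRate * msgSizeBytes / (1024 * 1024);
        System.out.printf("input volume: %.1f MB/s%n", inputMBps);

        // If the slowest sink can't keep up with the input rate, it is the bottleneck
        System.out.println(sinkRate < inputRate
                ? "bottleneck: slowest sink"
                : "bottleneck: elsewhere");
    }
}
```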
53. Example
10 Worker Nodes
16 Cores / Machine
(10 * 16) - 10 = 150 “Parallelism Units” available
54. Example
10 Worker Nodes
16 Cores / Machine
(10 * 16) - 10 = 150 “Parallelism Units” available (multiply by 10-100 if I/O bound)
Distribute this among the tasks in your topology: higher for slow tasks, lower for fast tasks.
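The arithmetic above, reserving one core per node (which is what the (10 * 16) - 10 figure implies):

```java
// "Parallelism units" budget for a cluster
public class ParallelismUnits {
    public static void main(String[] args) {
        int workerNodes = 10;
        int coresPerMachine = 16;
        // Reserve one core per node; the rest are available for tasks
        int units = workerNodes * coresPerMachine - workerNodes;
        System.out.println(units); // 150
        // For I/O-bound topologies the deck suggests a 10-100x multiplier,
        // since tasks spend most of their time waiting rather than computing
        System.out.println(units * 10); // 1500
    }
}
```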
60. Key Settings
topology.max.spout.pending
Spout/Bolt API: Controls how many tuples are in-flight (not ack’ed)
Trident API: Controls how many batches are in flight (not committed)
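Programmatically this is a config fragment; Config.setMaxSpoutPending is the setter for the topology.max.spout.pending key (the value 1000 is an illustrative starting point, not a recommendation from the deck):

```java
// Config fragment (requires the Storm library, not runnable standalone)
Config conf = new Config();
conf.setMaxSpoutPending(1000); // sets topology.max.spout.pending
```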
63. Key Settings
topology.message.timeout.secs
Controls how long a tuple tree (Spout/Bolt API) or batch (Trident API) has to
complete processing before Storm considers it timed out and fails it.
Default value is 30 seconds.
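As a config fragment, the timeout can be raised via Config.setMessageTimeoutSecs, the setter for this key (60 is an illustrative value, not a recommendation from the deck):

```java
// Config fragment (requires the Storm library, not runnable standalone)
Config conf = new Config();
conf.setMessageTimeoutSecs(60); // sets topology.message.timeout.secs (default: 30)
```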
64. Key Settings
topology.message.timeout.secs
Q: “Why am I getting tuple/batch failures for no apparent reason?”
A: Timeouts due to a bottleneck.
Solution: Look at the “Complete Latency” metric. Increase timeout and/or
increase component parallelism to address the bottleneck.
69. Nimbus
Generally light load
Can collocate Storm UI service
m1.xlarge (or equivalent) should suffice
Save the big metal for Supervisor/Worker machines…
78. ZooKeeper Considerations
Use dedicated machines, preferably
bare-metal if an option
Start with 3 node ensemble
(can tolerate 1 node loss)
I/O is ZooKeeper’s main bottleneck
Dedicated disk for ZK storage
SSDs greatly improve performance
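A minimal zoo.cfg sketch for a 3-node ensemble reflecting these points; the hostnames and the dedicated dataDir path are assumptions, not from the deck:

```
tickTime=2000
initLimit=10
syncLimit=5
clientPort=2181
# Dedicated disk (ideally SSD) for ZooKeeper storage
dataDir=/var/lib/zookeeper
# 3-node ensemble: tolerates the loss of 1 node
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
```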
79. Recap
Know/track your latencies and code appropriately
Externalize configuration
Scaling is a function of balancing the I/O and CPU requirements of your use
case
Dev + DevOps + Ops coordination and collaboration is essential