9. Data is more than Information
Not all information is equal.
Some information is derived from
other pieces of information.
#DV13-#rtbigdata
@nathan_gs
10. Data is more than Information
Eventually you will reach the most
‘raw’ form of information.
This is the information you hold true, simple
because it exists.
Let’s call this ‘data’, very similar to ‘event’.
#DV13-#rtbigdata
@nathan_gs
20. Number of people living in each city.
Person
Location
Time
Location
Count
Nathan
Antwerp
2005-01-01
Ghent
2
Geert
Dendermond
e
2011-10-08
Dendermonde
1
John
Ghent
2010-05-02
Nathan
Ghent
2013-02-03
#DV13-#rtbigdata
@nathan_gs
35. MapReduce
MAP
1. Take a large data set and divide it into subsets
…
2. Perform the same function on all subsets
REDUC
E
DoWork()
DoWork()
DoWork()
…
3. Combine the output from all subsets
#DV13-#rtbigdata
…
Output
@nathan_gs
37. Serialization & Schema
Catch errors as quickly as they happen.
Validation on write vs on read.
#DV13-#rtbigdata
@nathan_gs
38. Serialization & Schema
CSV is actually a serialization language
that is just poorly defined.
#DV13-#rtbigdata
@nathan_gs
39. Serialization & Schema
• Use a format with a schema.
•
•
•
Thrift
Avro
Protobuffers
• Added bonus: it’s faster & uses less space.
#DV13-#rtbigdata
@nathan_gs
49. Speed Layer
Storing a limited window of
data.
Compensating for the last few hours of data.
#DV13-#rtbigdata
@nathan_gs
50. Speed Layer
All the complexity is isolated in the
Speed layer.
If anything goes wrong, it’s auto-corrected.
#DV13-#rtbigdata
@nathan_gs
51. CAP
You have a choice between:
Availability
•
•
Queries are eventual consistent.
• Consistency
•
Consistency
Queries are consistent.
#DV13-#rtbigdata
Partition
Tolerance
Availability
@nathan_gs
52. Eventual accuracy
Some algorithms are hard to
implement in real time. For those
cases we could estimate the results.
#DV13-#rtbigdata
@nathan_gs
60. Speed Layer Views
• The views are stored in Read & Write database.
•
•
•
•
•
•
Cassandra
Hbase
Redis
MySQL
ElasticSearch
…
• Much more complex than a read only view.
#DV13-#rtbigdata
@nathan_gs
71. Lambda Architecture
Can discard any view, batch and real
time, and just recreate everything from
the master data.
#DV13-#rtbigdata
@nathan_gs
72. Lambda Architecture
Mistakes are corrected via recomputation.
Write bad data? Remove the data & recompute.
Bug in view generation? Just recompute the view.
#DV13-#rtbigdata
@nathan_gs
76. DataCrunchers
We enable companies in envisioning, defining and
implementing a data strategy.
A one-stop-shop for all your Big Data needs.
The first Big Data Consultancy agency in Belgium.
#DV13-#rtbigdata
@nathan_gs
How much data do you have? 44 times as much data in the next decade, 15 Zb in 2015Data silos (erp, crm, …)Traditionele systemen kunnen dit volume niet aan.Turn 12 terabytes of Tweets created each day into improved product sentiment analysisConvert 350 billion annual meter readings to better predict power consumption
Real timeTime sensitivedecisiontakingFrauddetectionEnergy allocationMarketing campaignsMarket transactionsSolution:Real-time solutions in combination with batch (hadoop)Nosql systems
StructuredUnstructured80% is unstructured data, A key drawback of using traditional relational database systems is that they're not good at handling variable data. A flexible data modelWord, email, foto, text, video, APIs, …?What are your needs regarding variety?The end result: bringingstructureintounstructured dataMonitor 100’s of live video feeds from surveillance cameras to target points of interestExploit the 80% data growth in images, video and documents to improve customer satisfaction
We can afford to keep Immutable Copies of lots of data.We NEED immutability to Coordinate with fewer challenges.Semaphores & Locks are the things to avoid: Instruction opportunities lost waiting for a semaphore increase with more cores…
The # of followers on Twitter = all follows & unfollows combined.Account balance
Data = event = atomicIn an ever changing world we found some stabilityEverything we do generates events:Pay with Credit CardCommit to GitClick on a webpageTweet
It is easier to store all data in a cost effective way.Compare to DWH world.
Immutability greatly restricts the range of errors that can cause data loss or data corruption.Ex. Only CR, no more CRUD.Information might of course change.Fault ToleranceData lossHuman error, Hardware failureData CorruptionParallel met functioneelprogrammeren.
Allows state regeneration. Eg. What was my bank balance on 1 may 2005?
Queries as pure functions that take all data as input is the most general formulation.Different functions may look at different portions and aggregate information in different ways.
Too slow; might be petabyte scaleImpala/Drill: why not
The batch layer can calculate anything (given enough time).
The batch layer stores the data normalized, but in the views it generates, data is often, if not always de normalized.
Not vertically
It’s OK to croak and restart
Is something really immutable when it’s name can change.
Doesn’t have to be Hadoop. The importance here is a Distributed FS combined with a processing framework.Spark,
http://www.quora.com/Apache-Hadoop/What-is-the-advantage-of-writing-custom-input-format-and-writable-versus-the-TextInputFormat-and-Text-writable/answer/Eric-Sammer?srid=PU&st=nsValue of schemas• Structural integrity• Guarantees on what can and can’t be stored• Prevents corruptionOtherwise you’ll detect corruption issues at read-time
Maarkanopgelostworden, door bvb ES je views op voorhandtegenereren.
In some circumstances.
All the complexity of *dealing* with the CAP theorem (like read repair) is isolated in the realtime layer.
Consistency (all nodes see the same data at the same time)Availability (a guarantee that every request receives a response about whether it was successful or failed)Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)http://codahale.com/you-cant-sacrifice-partition-tolerance/Hbasavs Cassandra
Eg. Unique countsML
Nimbus:Manages the clusterWorker Node:Supervisor:Manages workers; restarts them if neededExecuterPhysical JVM process.Execute tasks (those are spread evenly across the workers)TasksEach in his own Thread. Is the actual Bolt or Spout.Processes the stream.
Tuple:Named list of valuesDynamicly typedStreamSequence of Tuples
SpoutSource of StreamsSometimes replayableBoltStream transformationsAt least 1 input stream0 - * output streams
The serving layer needs to be able to answer any query in a short amount of time.
AVG = sum + count; preaggregate, but not everything is possible.
Command Query Responsibility Segregation (CQRS) applies the CQS principle by using separate Query and Command objects to retrieve and modify data respectivelymultiple representations of information. The change that CQRS introduces is to split that conceptual model into separate models for update and display, which it refers to as Command and Query respectivelyA method should either change state of an object, or return a result, but not both.Presentation Tom Michiels, Monday Evening at one of the BOFs
Lambda first named by Alonzo Church, he needed a letter for functional abstraction in theory of computation in the 1930s.