Brief summary of modern software available today for building the core infrastructure to collect and analyze big data from sensors (the Internet of Everything). Presented at the Dec 2015 Trillion Sensors Summit in Orlando, FL.
Overview of modern software ecosystem for big data analysis
1. Overview of Modern Software Ecosystem for Big Data Analysis
Michael Bryzek
mbryzek@alum.mit.edu / @mbryzek
Co-Founder and Chairman Flow Commerce
Co-Founder and ex-CTO Gilt Groupe
Trillion Sensors Summit - Dec 9 2015
2. Goals
● Overview of modern practices related to software architecture for high-volume big data applications
● Encourage reuse of infrastructure that has already been built so you can focus on analysis and information
3. Representational State Transfer (REST)
a uniform connector interface
● Resources - “nouns”
● Clear set of limited methods
● Standard (e.g. authorization)
Cost of integration of the nth service approaches 0
Roy Thomas Fielding’s Dissertation - https://www.ics.uci.edu/~fielding/pubs/dissertation/fielding_dissertation.pdf
Examples: Stripe, Twilio, GitHub
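The uniform-interface idea above can be sketched in a few lines. This is a minimal in-memory illustration, not taken from the talk: the `ResourceStore` class, the `/sensors/42` path, and the status codes chosen are all assumptions made for the example. The point is that every resource (a "noun") is handled by the same small set of methods, so adding another kind of resource needs no new interface.

```python
# Hypothetical in-memory sketch of a uniform interface: every resource
# (a "noun") is addressed by a path and handled by the same few methods.

class ResourceStore:
    """Maps resource paths to JSON-like dicts."""

    def __init__(self):
        self._items = {}

    def handle(self, method, path, body=None):
        """Dispatch one request and return (status_code, body)."""
        if method == "GET":
            if path in self._items:
                return 200, self._items[path]
            return 404, None
        if method == "PUT":
            created = path not in self._items
            self._items[path] = body
            return (201 if created else 200), body
        if method == "DELETE":
            if path not in self._items:
                return 404, None
            del self._items[path]
            return 204, None
        return 405, None  # method is not part of the limited, shared set


store = ResourceStore()
print(store.handle("PUT", "/sensors/42", {"kind": "temperature"}))
print(store.handle("GET", "/sensors/42"))
print(store.handle("DELETE", "/sensors/42"))
print(store.handle("GET", "/sensors/42"))
```

A client that knows these three methods can work with any path the store holds, which is one way to read the claim that the integration cost of each additional service shrinks.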
4. Frameworks for REST
API first - the most critical design element
● http://apidoc.me *
● http://swagger.io
● http://apiary.io
Some companies (incl. Amazon) focus on the API and care very little about the implementation.
* my personal open source project
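One way to picture "API first" is to write the API description down as data before any handler exists, then check implementations against it. The sketch below is an assumption-laden illustration: the `API_SPEC` layout, the paths, and the `missing_operations` helper are hypothetical and do not follow the apidoc, Swagger, or Apiary formats.

```python
# Hypothetical API description, written before any implementation exists.
API_SPEC = {
    ("GET", "/sensors/:id"): {"response": "sensor"},
    ("PUT", "/sensors/:id"): {"body": "sensor_form", "response": "sensor"},
    ("GET", "/sensors/:id/readings"): {"response": "[reading]"},
}


def missing_operations(spec, implemented):
    """Return the operations the spec promises that no handler provides."""
    return sorted(op for op in spec if op not in implemented)


# A partial implementation: handlers keyed by the same (method, path) pairs.
handlers = {
    ("GET", "/sensors/:id"): lambda sensor_id: {"id": sensor_id},
}

for method, path in missing_operations(API_SPEC, handlers):
    print(f"not yet implemented: {method} {path}")
```

Because the description is the source of truth, the implementation behind it can change freely as long as it keeps covering every operation listed.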
6. JavaScript Object Notation (JSON)
It’s just JavaScript - the most widely adopted programming language in the world
Pros
● Simple
● Readable
● Dense
Cons
● Still verbose
● No strong typing
● CPU overhead
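The trade-offs above show up even in a tiny example. The sketch below uses Python's standard `json` module; the sensor reading and its field names are hypothetical. It shows that the encoded form is readable text, that field names are repeated in every message, and that nothing in JSON itself enforces field types.

```python
import json

# Hypothetical sensor reading; field names are illustrative only.
reading = {"sensor_id": 42, "timestamp": 1449619200, "celsius": 21.5}

encoded = json.dumps(reading)
print(encoded)  # human-readable text, field names repeated in every message
print(len(encoded.encode("utf-8")), "bytes")

# No strong typing: JSON carries no schema, so nothing stops a producer
# from sending a string where the consumer expects a number.
bad = json.loads('{"sensor_id": "42", "timestamp": 1449619200, "celsius": "warm"}')
print(type(bad["sensor_id"]).__name__, type(bad["celsius"]).__name__)
```

The second payload parses without error, so any type checking has to happen in application code.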
8. Binary Protocols - Ideal for sensor data
Key Features
● Language to describe schema
● Space efficient
● Fast serialization / deserialization
Leading Protocols
● Protocol Buffers https://developers.google.com/protocol-buffers
● Avro https://avro.apache.org/ - tight integration with Hadoop
● Thrift https://thrift.apache.org/
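The key features listed above can be illustrated with Python's standard `struct` module. This is only a sketch of the idea, not the wire format of Protocol Buffers, Avro, or Thrift: the `SCHEMA` list, field names, and the `encode`/`decode` helpers are assumptions made for the example. A small schema describes the fields once, so each message carries only the values.

```python
import json
import struct

# Hypothetical schema: field name plus a struct format code.
# "<" = little-endian with no padding; I = uint32, Q = uint64, f = float32.
SCHEMA = [("sensor_id", "I"), ("timestamp", "Q"), ("celsius", "f")]
FORMAT = "<" + "".join(code for _, code in SCHEMA)


def encode(reading):
    """Pack the fields in schema order into a fixed-size byte string."""
    return struct.pack(FORMAT, *(reading[name] for name, _ in SCHEMA))


def decode(payload):
    """Unpack a byte string back into a dict keyed by schema field names."""
    values = struct.unpack(FORMAT, payload)
    return {name: value for (name, _), value in zip(SCHEMA, values)}


reading = {"sensor_id": 42, "timestamp": 1449619200, "celsius": 21.5}
binary = encode(reading)
text = json.dumps(reading).encode("utf-8")
print(len(binary), "bytes as binary;", len(text), "bytes as JSON")
print(decode(binary))
```

The real protocols add what this sketch lacks, such as schema evolution, optional fields, and generated code for many languages, which is a reason to adopt one of them instead of hand-rolling a format.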
11. Data Platforms
● https://aws.amazon.com/iot/ - Amazon Kinesis, S3, Redshift, IoT
● http://influxdata.com - open source time series database + analytics platform *
● http://confluent.io - data pipeline / real time processing built by Jay Kreps
● http://spark.apache.org/ - UC Berkeley / Cloudera led effort
Currently seeing high activity and investment in both open source and commercial ventures.
* I am an investor in influx
12. Summary and Recommendation
Learning from the history of the evolution of software on the internet…
● Define standards for interconnectivity (à la REST)
○ Avoid standards for data types (e.g. ECG)
● Choose simplicity as number one requirement
○ Avoid XML
● Adopt existing binary protocols, w/ code generation at boundaries
○ Avoid creating new protocols focused on last 5-10% improvement
● Adopt existing messaging / storage platforms for large data sets
Keeping up to date: https://www.thoughtworks.com/radar