Have someone introduce me.
Thank audience (tie to morning activities), sponsors, HP, etc.
We’re here because this is the biggest thing that has happened to Hadoop…
Here at the conference we’re talking about data science. But before we can appreciate the changes happening in data science, we must first talk about data. Data is doubling every two years. The fast-growing volume, variety, and velocity of data is overwhelming traditional systems and approaches. A revolutionary approach is required to leverage this data, and with this new technology, data science as we know it is undergoing tremendous change.
To give you a sense of the data volumes we’re talking about, I’ve included this chart showing why a revolutionary approach is needed. You can see data growing from 1.8 zettabytes to 44 zettabytes in under a decade. To put this into perspective: a large data warehouse contains terabytes of data, and a zettabyte is 1 billion terabytes.
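If it helps to sanity-check the scale before presenting, here is a quick back-of-the-envelope in Python (the 100 TB warehouse size is an assumed illustration, not a figure from the slide):

```python
# Back-of-the-envelope check on the slide's IDC numbers.
start_zb, end_zb = 1.8, 44.0   # zettabytes, per the chart
tb_per_zb = 1e9                # 1 zettabyte = 1 billion terabytes

growth_factor = end_zb / start_zb
print(f"growth: ~{growth_factor:.0f}x")

# An assumed "large" 100 TB data warehouse as a fraction of 44 ZB:
warehouse_tb = 100
fraction = warehouse_tb / (end_zb * tb_per_zb)
print(f"one 100 TB warehouse is {fraction:.1e} of the digital universe")
```

That is roughly a 24x increase, and even a very large warehouse is a vanishingly small slice of the total.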
Numbers in the chart are from two IDC reports (sponsored by EMC):
http://www.emc.com/collateral/about/news/idc-emc-digital-universe-2011-infographic.pdf
http://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm
What is the source of this data growth? While structured data growth has been relatively modest, the growth in unstructured data has been exponential.
Source of statistic: http://link.springer.com/chapter/10.1007/978-3-642-39146-0_2
sensor data, social media, clickstream, genomic data, location information, video files, etc.
Many organizations now want to unlock the data in Hadoop and make it accessible to a broader audience within their organizations. That’s easier said than done. While we’ve largely solved the infrastructure scalability challenge, the massive volume, variety, and velocity of this data introduces serious challenges on the human side: how to prepare all that data and make it available to users, how to make operational data available for real-time analytics, and so on. We need better technology to empower users to take advantage of these massive volumes of data.
Past: Enable organizations to capture the data.
Future: Enable organizations to more easily extract value from all this captured data.
What does the future of Hadoop look like?
The problem
I’m sure many of you have experienced this (just like the quotes)
Why we want to solve it
Here’s what we’re doing about it
One of the challenges with Hadoop as well as traditional data management tools is the business user’s “distance from the data”.
The dependency on IT (or additional development) increases time to value and reduces agility. It also creates a burden on IT at a time when IT is already overworked. The red arrows in this illustration can represent significant backlogs and delays (often many months).
Many of you likely spend a lot of time on plumbing development and data preparation. How many of you have had to do this? (show of hands)
“Data modeling and transformations” may seem easy, but when you look at a real-world environment, you could have thousands of data sets.
Opportunity
This is the opportunity.
The audience should feel like this is their chance to become heroes by bringing this to their companies.
They have to feel (be emotional) about the problem at this point.
IT-driven = months of delay, unnecessary work (data is no longer relevant, etc.)
The so-what needs to be conveyed: why it matters that this work is no longer needed.
6 months -> 3 months -> 3 months -> day zero
So imagine now what you can get…
Data Agility is needed for Business Agility
>>> Stand still during slide, move in at the punchline (why does this matter to YOU)
Need an example or analogy to explain self-describing data.
All SQL engines (traditional or SQL-on-Hadoop) view tables as spreadsheet-like structures with rows and columns: every record has the same structure, and there is no support for nested data or repeating fields. Drill instead views a table conceptually as a collection of JSON documents (with additional data types). Each record can have a different structure (hence, schema-less). This is revolutionary and has never been done before.
If you consider the four data models shown in the 2x2, all of them can be represented by the complex, no-schema model (JSON) because it is the most flexible. No other data model, however, can be represented by the flat, fixed-schema model. Therefore, with any SQL engine except Drill, the data has to be transformed before it is available to queries.
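To make the 2x2 concrete, here is a minimal sketch (Python, with made-up example records) of why the document model subsumes the flat model but not the other way around:

```python
# Flat, fixed-schema rows: every record has the same columns.
flat_rows = [
    {"id": 1, "name": "alice", "age": 34},
    {"id": 2, "name": "bob", "age": 41},
]

# Complex, schema-less documents (hypothetical data): each record
# may have different fields, nested objects, or repeating fields.
documents = [
    {"id": 1, "name": "alice", "age": 34},                       # flat record
    {"id": 2, "name": "bob", "emails": ["b@x.com", "b@y.com"]},  # repeating field
    {"id": 3, "address": {"city": "Haifa", "zip": "31000"}},     # nested record
]

# Every flat row is already a valid document: the flexible model
# subsumes the rigid one with no transformation.
assert all(isinstance(r, dict) for r in flat_rows)

# The reverse fails: these documents do not share one column set,
# so they cannot land in a single fixed-schema table without an
# up-front transformation step.
column_sets = {frozenset(d) for d in documents}
print(len(column_sets))  # prints 3 -- three distinct structures
```

The point for the audience: a flat-table engine forces that transformation step on every data set before the first query; a document-model engine does not.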
TODO: Add Impala and Splunk logos
What I want you to see now is how easy it is to ….
Is there something from Israel?
With other technologies you have to do this, then this, then this, …
Key takeaways
Core message – We are revolutionizing Hadoop
Call to action – get involved, and enjoy the conference as we have great speakers
If doing Q&A, set boundaries (time - how much time we have, topic – what questions can I answer about this revolution), back pocket question (someone asked me this morning)