This presentation was given by MapR CMO Jack Norris at the Gartner BI and Analytics Summit in Las Vegas on April 2, 2014.
Hadoop revolutionizes how data is stored, processed, and analyzed. Hadoop represents a new data and compute stack that provides huge operational advantages and is being used to change how organizations compete. This session will provide an overview of how customers are using Hadoop today, with details on initial uses and a glimpse of how this new platform is providing organizations 10X performance at 1/10th the cost.
Hadoop: Revolutionizing Analytics AND Operations.

Session outline: Overview of big data and data-driven companies, with two or three use-case examples showing the importance of leveraging data. Existing systems are getting overrun; examples of what this means include the size of data, Oracle hitting the wall, and analytic speed. Hadoop is at the center: what Hadoop is, plus additional proof points. Three realities: Hadoop relieves the pressure (a processing example in terms of how it scales, a cost example, and the fact that you don't need to know the questions you're going to ask ahead of time). Small, rapid decisions. Examples of operational Hadoop: Rubicon and three or four others, followed by the use case. Architecture matters: why this is the case, and the results. Where do you start: offloading examples, such as Cisco (data warehouse offload) and IRI (mainframe offload).
The first trend is that the industry leaders have shown how to use big data to compete and win in their markets. It's no longer a nice-to-have: you need big data to compete. Google pioneered MapReduce processing on commodity hardware and used it to catapult themselves into the leading search engine, even though they were 19th in the market. Yahoo! leveraged these ideas to create Hadoop to keep up with Google, and many mainstream companies have followed with new data-driven applications such as "people you may know" (started by LinkedIn and now used by Facebook, Twitter, and every social application), product recommendation engines, contextual and personalized music services (Beats), measuring digital media effectiveness (comScore), serving more relevant, targeted ads (Comcast, Rubicon Project), fraud and risk detection, healthcare efficacy, and more. What makes the difference? A lot of attention is given to data science and developing sophisticated new algorithms, but in many cases simply having more data beats better algorithms. (Make the point about collecting more consumer interaction data as well as transaction data, as an example.) In addition, competitive advantage is decided by very small percentages. Just a 1% improvement in fraud detection can mean hundreds of millions of dollars in savings. A 1/2% lift in advertising effectiveness means millions in new product sales and profitability. The same applies to customer churn, disease diagnosis, and more.
Doctors, particularly oncologists, are faced with an enormous amount of data regarding patient treatments, outcomes, and disease states. Hadoop is having an impact across the health care industry, but for this minute we will focus on its use for developing better treatments. In one minute, Hadoop can analyze more than 20,000 genes across hundreds of thousands of patients. The goal of this analysis is to better understand genomic factors and to integrate imaging and clinical analytics in order to better understand, predict, and impact survival. In any given minute, our cluster is sequencing 422,000 genes.
Beats headphones by Dr. Dre have swept the audio market. Beats has launched a new Beats Music service that is able to personalize music selections and pick the perfect song in a minute from a library of over 20 million songs. It joins a crowded space for online music, but by using MapR, Beats is able to provide a completely new personalized service. It's not about delivering 20 million songs; it's about providing a continuously updating, personalized, tailored experience to users.
A second trend in enterprise architecture is big data overwhelming the existing workload-specific systems in production. (List the requirements for each of these on the side in text.) People started with mainframes or operational systems which run ERP, finance, CRM, and other mission-critical applications. They require… (pick out the attributes you want to stress on the left). You also have data warehouses, marts, data mining, and other analytical systems which pull data from these operational and other systems to provide insights to the business for decision making. The amount and variety of data has been overloading these systems. As you try to ingest new types of data, you reach a point where these systems are not cost-effective to scale to terabytes or petabytes of data.
Hadoop has become the de facto big data platform, allowing organizations to keep up with big data and feed data-driven applications and processes. This chart shows the percentage growth of jobs from Indeed.com. Compared to other popular technologies such as MongoDB and Cassandra, Hadoop is not only the fastest growing big data technology, it's one of the fastest growing technologies, period. Hadoop has the most robust ecosystem and momentum and is the big data platform of choice for industry-leading, data-driven companies. (Also of interest: Indeed.com, a subsidiary of a Japanese-owned company, is a customer of MapR; they harness and analyze all of their job-trends data using MapR.)
As implemented, MapReduce is actually a collection of complementary techniques and strategies that include employing commoditized hardware and software, specialized underlying file systems, and parallel processing methodologies. Many of the benefits arise from the fact that computation can be done on the same machines where data resides and from the fact that individual pieces of the overall computation can be recomputed if necessary due to hardware failure or other delays. This is a revolutionary architectural philosophy that shelters the average developer from the overwhelming complexity that had formerly been required to properly carry out parallel processing. But as we’ll see later, the implementation of MapReduce laid the foundation for significant problems now being experienced by many enterprises that are seeking to put it to work.
MapReduce is a paradigm shift: it moves the processing to the data. Apache Hadoop is a software framework that supports data-intensive distributed applications. Hadoop was inspired by Google's published MapReduce whitepaper. Apache Hadoop provides a new platform to analyze and process big data. With data growth exploding and new unstructured sources of data expanding, a new approach is required to handle the volume, variety, and velocity of this growing data. Hadoop clustering exploits commodity servers and increasingly inexpensive compute, network, and storage. Google is the poster child for the power of MapReduce. They were the 19th search engine to enter the market; there were 18 more successful companies, and within two years Google was the dominant player. That's the power of the MapReduce framework.

Long version: A poster child for this is Google. We now take Google's dominance for granted, but when Google launched their beta in 1998 they were late: at least the 19th search engine on the market. Yahoo was dominant, and there were Infoseek, Excite, Lycos, Ask Jeeves, and AltaVista (which had the technical cred). It wasn't until Google published a paper in 2003 that we got a glimpse of their back-end architecture. Google was able to reach dominance because they recognized the paradigm shift early on, and they were able to index more data, get better results, and do it much more efficiently and cost-effectively than their competitors. They went from 19th to first in a few short years because of MapReduce. A Yahoo engineer by the name of Doug Cutting read that same paper in 2003 and developed a Java implementation of MapReduce, named after his son's stuffed elephant, that became the basis for the open source Hadoop project. Now when we say Hadoop, we're talking about a robust ecosystem; there are multiple commercial versions of Hadoop.
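To make the map/shuffle/reduce pattern concrete, here is a minimal sketch of the idea in plain Python, using the classic word-count example. This is illustrative only: real Hadoop jobs are typically written in Java against the Hadoop MapReduce API, and the shuffle step shown here is something the framework performs for you across the cluster.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit (key, value) pairs -- here, (word, 1) for each word."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all values by key (done by the framework in Hadoop)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key -- here, sum the counts."""
    return key, sum(values)

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# In a cluster, each mapper runs on the node that stores its input split
# ("moving the processing to the data"); here we just run the phases
# sequentially on one machine.
mapped = chain.from_iterable(map_phase(d) for d in documents)
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts["the"])  # -> 3
```

The point of the pattern is that the map and reduce functions are independent per record and per key, so the framework can run thousands of them in parallel and re-run any piece that fails.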
There's a complete stack that includes job management, development tools, schedulers, machine learning libraries, and more. MapR's co-founder and CTO was at Google, where he was in charge of the BigTable group, and he understands MapReduce at scale. Our charter was to fix the underlying flaws of the Hadoop implementation to make it appropriate for a broader set of applications and make it work for most organizations.
You need a platform that serves the broadest set of use cases.
The first reality is that as people put Hadoop into production to relieve the pressure on other systems in their enterprise architecture, it needs to be reliable. Hadoop needs to be held to the same enterprise standards as your Oracle, SAP, Teradata, NetApp storage, or any other enterprise system. Many organizations are putting Hadoop into their data center to provide (list of use cases underneath)… It can do all of this and more, but for Hadoop to act as a system of record, it must provide the same guarantees for SLAs, performance, data protection, and more. Most importantly, Hadoop has the potential for both analytics AND operations. It can be used to optimize the data warehouse, provide batch data refining, or provide storage. But, done right, Hadoop can also handle many operational analytics and database operations/jobs.
Choosing the right big data architecture is critical for success with your Hadoop projects and business applications. One analogy is building a skyscraper: before you can start building up, you have to lay a rock-solid foundation. This building is the new Wilshire Grand project in Los Angeles. In February of this year, they set a Guinness World Record by pouring a 21,000 cubic yard (16,000 cubic meter) foundation over 26 hours (http://www.theguardian.com/cities/2014/feb/14/world-largest-concrete-pour-la-trucks-los-angeles). When completed in 2017, the building will be the tallest in the US outside of New York and Chicago.
This analogy applies to building a data platform as well: you have to architect for the future. This allows you to build higher, stronger, and faster, without retrofitting later down the road. (Anyone who has added a second story to their house can attest to the additional cost and construction delays when the foundation wasn't designed to hold the stress.) For business-critical applications you must have data protection and security (availability, data protection, and recovery), high performance (with a random read-write system), multi-tenancy (to support multiple business units and isolate applications or user data), good resource and workload management to support multiple applications, and open standards to integrate with the rest of the enterprise data architecture. This data foundation allows you to support new data-driven applications (both operational and analytical), maintain service level agreements with the business, provide information you can trust and count on being there when you need it, and ultimately deliver the best TCO for the long run, supporting enterprise systems without retrofits or multiple clusters to work around platform deficiencies. (For example, to support operational/online applications in Hadoop today, you need a separate HBase cluster, separate from the rest of your Hadoop cluster/investment.)
In a recent article (http://www.cmswire.com/cms/big-data/5-things-to-lessen-your-anxiety-about-big-data-024382.php), Tom Davenport says that big data's biggest wins come from making many small decisions rather than one huge one. The majority of big-data-driven decisions will be recurring, made at speed (in milliseconds), and at scale; actions will be taken automatically rather than reviewed and approved by an individual. Examples include ad platforms making constant adjustments, fraud detection across millions of transactions based on individual patterns, and fleet management and routing that takes current conditions into account. This requires a Hadoop platform that can go beyond batch and support streaming writes, so data can be constantly written to the system while analysis is being conducted. It requires high performance to meet the business needs, and real-time operations: the ability to perform online database operations so you can react to the business situation and impact the business as it happens, not report on it a week, month, or quarter later. To do this requires THE RIGHT ARCHITECTURE.
One great example is the Rubicon Project, who recently filed their S-1 to go public. They bet their business on data, with Hadoop as the cornerstone, and developed pioneering technology that created a new model for the advertising industry, similar to what NASDAQ did for stock trading. Rubicon Project uses MapR for their automated advertising platform, which processes over 100 billion ad auctions a day and provides the most extensive ad reach in the industry, touching 96% of internet users in the US. They use MapR because of its superior system reliability and performance and its ability to run in their "lights out" datacenters. They switched from one of our competitors after experiencing a NameNode failure and constant ups and downs. That was fine in development, but Hadoop needed to be a production system in 2011, which is when they switched to MapR.
In India, there is no social security card. It's difficult for the average citizen to set up a bank account, access benefit programs, and enjoy economic mobility. It's difficult for the government as well, with over $1B of government aid classified as leakage, the result of fraud and corruption. The Aadhaar program is poised to change all that by leveraging the unique biometric identifiers that all people are born with to create the largest biometric database in the world. The program aims to collect fingerprints and retina scans for all 1.2 billion citizens. The scale of this project required MapR's in-Hadoop database, which is capable of 200-millisecond response times while supporting millions of concurrent look-ups.
They ran the MinuteSort benchmark, a test that shows how much data you can sort in one minute. The MinuteSort world record had been set by Yahoo, which sorted 1.6 terabytes with 2,200 nodes. This MapR customer broke the record by sorting 1.65 TB with just 298 nodes. That's roughly 1/7th the hardware, which translates into tremendous cost, space, and management savings.
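The hardware claim is easy to sanity-check from the figures above; a quick back-of-the-envelope calculation also shows the implied per-node throughput difference:

```python
# Back-of-the-envelope check of the MinuteSort comparison above.
yahoo_tb, yahoo_nodes = 1.6, 2200   # Yahoo's record run
mapr_tb, mapr_nodes = 1.65, 298     # the MapR customer's run

# Node-count ratio: how much less hardware the MapR run used.
hardware_ratio = yahoo_nodes / mapr_nodes
print(round(hardware_ratio, 1))     # -> 7.4, i.e. roughly 1/7th the nodes

# Per-node throughput ratio (TB sorted per node per minute).
per_node_speedup = (mapr_tb / mapr_nodes) / (yahoo_tb / yahoo_nodes)
print(round(per_node_speedup, 1))   # -> 7.6x more data sorted per node
```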
Because only MapR can reliably run both operational and analytical applications on one platform/cluster, MapR enables a faster closed-loop process between operational applications and analytics. This means interactive marketers and algorithms can update rules engines more quickly and provide more real-time targeting of offers and relevant content to consumers, and fraud models are kept up to date with the latest patterns to better detect anomalies and act more quickly on bad actors.
MapR creates a new opportunity for enterprises: the opportunity to revolutionize the enterprise data architecture. From "redundant processing silos" and "data science experiments," where you need separate Hadoop clusters for streaming, HDFS/Hive, HBase, and more… To a "converged data and processing hub" that provides a TRUE PRODUCTION enterprise data hub. This allows you to consolidate operational and analytical workloads, not only across Hadoop use cases and applications, but also to optimize your enterprise data architecture.