The promises of Big Data are enticing. But you still need to master the Hadoop ecosystem, its architecture, and the configuration of a cluster suited to business needs. This breakfast session skips the theory and offers only lessons learned from real projects, in France with OCTO and in the USA with Cloudera.
Topics covered:
Which Hadoop pilot projects were launched in 2013? YARN, Impala, MapReduce, HCatalog, ...
Which software components complete the Hadoop puzzle to deliver a Big Data solution that business users can actually use?
How do you size and configure a Hadoop cluster suited to your needs?
How do you benchmark a cluster's performance?
What are the development best practices and the pitfalls to avoid?
Lessons learned from projects in France and the USA
By the end of this breakfast session:
You will have a clear picture of what Hadoop and its ecosystem look like in 2013
You will know the best practices for cluster sizing
You will be able to select the ecosystem tools that match your needs
You will know, through a report from the field, how to make your Big Data project with Hadoop succeed
Link to opportunity record in SFDC (valid for SFDC employees only): https://na6.salesforce.com/0068000000eoHgT
A multinational bank saves millions by optimizing their EDW for analytics and reducing data storage costs by 99%.
Background: A multinational bank has traditionally relied on a Teradata enterprise data warehouse for most of its data storage, processing and analytics. With the movement from in-person to online banking, the number of transactions and the data each transaction generates have ballooned. The Teradata system was supporting 6,000 databases and over 330,000 applications that run monthly.
Challenge: The bank wanted to make effective use of all the data being generated, but their Teradata system quickly maxed out. It could no longer handle current workloads, and the bank’s business-critical applications were hitting performance issues. ETL processing took 7 days to complete, so the Teradata environment could only be used for analysis during brief periods each month. And they were spending millions every year just to back up all of their data. Regulatory compliance requires them to store 7 years of data, and it would take 5 weeks just to make historical data available for analysis.
The bank was forced to choose: expand the Teradata system, which would be very expensive; restrict user access to the system in order to lessen the workload; or offload raw data to tape backup and rely on small data samples and aggregations for analytics.
IBM and EMC had attempted to alleviate this pain but failed. The strategic data warehouse group within the bank initiated a research project with Georgia Tech students to look into data warehousing projects, which led a student to reach out to Cloudera. This ultimately initiated an in-depth POC.
During the POC, the bank looked at several different operational systems and the transformations that had to be applied to their data to prepare it for use in the data warehouse.
They found they’d scaled past what their traditional ETL tools could deliver, so they were using those ETL tools only to move data into the data warehouse and then doing the transformations within the warehouse (ELT). The system was spending 44% of its resources on everyday operations such as running canned BI reports and 42% on ETL processing (or ELT in this case), leaving only 11% for the advanced analytics and data discovery that drive ROI from new opportunities. This is a very costly use of the data warehouse platform and not what it was meant for. They were able to quantify how much space and compute power each ELT process was using in a data warehouse supporting hundreds of applications. This information helped to quantify how much effort (man-hours) it would take to implement these processes in Hadoop, and which applications would benefit most from migrating to Hadoop in terms of financial and time-related ROI. They decided to start with SQL-based transformations, and implemented 2 applications from start to finish as part of the POC.
Solution: After a very in-depth POC involving 30+ representatives from the bank, they deployed Cloudera to offload data processing, storage and some analytics from the Teradata system, freeing up space on the EDW so it could focus on its real purpose: performing high-value operational and data discovery analytics. They didn’t migrate the entire system at once -- they started with the applications that would deliver the most value and save the most Teradata resources. The bank initially deployed a small cluster, demonstrating that they could match Teradata’s performance at a fraction of the cost.
Results: Cloudera delivers value to this bank through our low cost per terabyte, low cost of implementation, compute savings, and the flexibility offered by Hadoop. The bank was able to justify the ROI of Cloudera very easily from a cost perspective, with Teradata as the incumbent.
They were spending over $180,000 per terabyte on Teradata (which is unusually high -- most Teradata customers probably pay closer to $40,000 per TB). Cloudera offers $1,000 per terabyte.
By offloading data processing and storage onto Cloudera, the bank avoided spending millions to expand their Teradata infrastructure, while reclaiming the 7 days every month that Teradata was spending on data transformations. Expensive CPU is no longer consumed by data processing, and storage costs are a mere 1% of what they were before. Meanwhile, data processing is 42% faster and data center power consumption has been reduced by 25%. The bank can now process 10TB of data every day.
In addition, Cloudera delivered technical value through its flexible scalability. The bank could deploy and test on a small cluster of 15 nodes to see how performance scales linearly with growth, versus having to buy capacity in large chunks as they do with Teradata.
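The per-terabyte prices quoted above can be turned into a quick sanity check on the "1% of storage cost" claim. The following is a minimal sketch, not the bank's actual ROI model: the function name `storage_cost_reduction` and the constant names are hypothetical, and the only inputs are the three list prices stated in the text.

```python
def storage_cost_reduction(old_cost_per_tb: float, new_cost_per_tb: float) -> float:
    """Return the fraction of per-terabyte cost saved when moving platforms."""
    if old_cost_per_tb <= 0:
        raise ValueError("old_cost_per_tb must be positive")
    return 1.0 - new_cost_per_tb / old_cost_per_tb


# Prices in USD per terabyte, as quoted in the case study.
BANK_TERADATA_COST_PER_TB = 180_000     # what this bank was paying
TYPICAL_TERADATA_COST_PER_TB = 40_000   # the "most customers" estimate
CLOUDERA_COST_PER_TB = 1_000

bank_reduction = storage_cost_reduction(BANK_TERADATA_COST_PER_TB, CLOUDERA_COST_PER_TB)
typical_reduction = storage_cost_reduction(TYPICAL_TERADATA_COST_PER_TB, CLOUDERA_COST_PER_TB)

print(f"This bank:        {bank_reduction:.1%} lower cost per TB")
print(f"Typical customer: {typical_reduction:.1%} lower cost per TB")
```

Under these figures the bank's per-terabyte cost drops by a little over 99%, consistent with the headline number, while a customer paying the more typical price would see a somewhat smaller reduction. The sketch only compares prices per terabyte; it ignores implementation, staffing and migration costs.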
The quant risk LOB within a multinational bank saves millions through better risk exposure and fraud prevention analysis, while avoiding expanding their data warehouse footprint.
Background: With the movement from in-person to online banking, a multinational bank processes more and more transactions -- 2 billion per month. Increased transactions translate into growing data volumes, and greater potential to use that data for better, more data-driven fraud prevention.
Challenge: While opening the door to better fraud prevention, today’s frequent banking transactions also necessitate constant revisions to risk profiles, which is data-processing intensive. And detecting fraud is a complex, difficult process that requires a continuous cycle of sampling a subset of data, building a data model, finding an outlier that breaks the model, going back and rebuilding the model, and so forth. The bank’s existing Teradata warehouse was optimized for logical analysis and reporting and had reached its capacity. It would be very costly to expand the current environment, but to continue operating within that environment would necessitate more sampling, aggregations, or moving data to offline tape backup. Doing this would mean the bank had to ignore the opportunity, presented by its growing digital data volumes, to create better risk and fraud detection models.
Solution: The bank deployed Cloudera Enterprise as its data factory for fraud detection and prevention and risk analysis across home loans, insurance and online banking.
Results: With the new environment, this bank has avoided expanding their expensive Teradata footprint while eliminating data sampling and improving fraud detection and risk analysis models. Now, they can look at every incidence of fraud for each person over a 5-year history. And they’ve been able to offload data processing to Hadoop in order to conserve the expensive Teradata CPU for analytical tasks.
A large semiconductor manufacturer has improved the accuracy of their yield predictions by running models on a larger data set: 10 years of data instead of 9 months.
Background: A large semiconductor manufacturer uses yield models to predict which chips are likely to fail. Those predictions allow the company to take action -- they can adjust designs and thus minimize failures. Those predictive yield models were run on Oracle, based on 9 months of historical data.
Challenge: The company wanted to improve the accuracy of their models by using a larger data set containing longer history and more granular information. But they couldn’t afford to store more than 9 months of data on Oracle.
Solution: The semiconductor manufacturer deployed the Dell | Cloudera solution for Apache Hadoop with HBase, which gives them unlimited scale and more flexible data capture and analysis at 10x lower TCO than traditional data warehouse environments. The company runs a 53-node cluster today, and expects to store up to 10 years of data on CDH -- this will amount to about 10PB of data. The manufacturer can now collect and process data from every phase of the manufacturing process.
Results: Since deploying the Dell | Cloudera solution, the manufacturer has met its goal of improving the accuracy of their predictive yield models so they could optimize operations. When problems occur with chips, they can answer questions like:
Where and why did the problem occur?
Which manufacturing plant did this chip come from?
Which components were used?
Ultimately, this manufacturer is improving its operational efficiency with the Dell | Cloudera solution for Apache Hadoop.
Link to account record in SFDC (valid for Cloudera employees only): https://na6.salesforce.com/0018000000l7Xji
BlackBerry realized ROI on their Cloudera investment through storage savings alone, while reducing ETL code by 90%.
Background: BlackBerry transformed the mobile devices market in 1999 with their introduction of the BlackBerry smartphone. Since then, other industry innovators have introduced devices that compete against BlackBerry, and the company must leverage all of the data it can collect in order to understand its customers, what they need and want in mobile devices, and how to remain an industry leader.
Challenge: BlackBerry Services generate ½ PB of data every single day -- or 50-60TB compressed. They couldn’t afford to store all of this data on their relational database, so their analytics were limited to a 1% data sample, which reduced the accuracy of those analytic insights. And it took a long time to access data in the archive. Their incumbent system couldn’t cope with the rapid growth of data volumes or constant access requests -- BlackBerry had to pipeline their data flows to prevent the data from hitting disk.
Solution: BlackBerry deployed Cloudera Enterprise to provide a queryable data storage environment that would allow them to put all of their data to use. Today, BlackBerry has a global dataset of ~100 PB stored on Cloudera. The platform collects device content, machine-generated log data, audit details and more. BlackBerry has also converted ETL processes to run in Cloudera, and Cloudera feeds data into the data warehouse. Hadoop components in use include Flume, Hive, Hue, MapReduce, Pig and ZooKeeper.
Results: BlackBerry’s investment in Cloudera was justified through data storage cost savings alone. And by moving data processing over to Hadoop, their ETL code base has been reduced by 90%.
They no longer have to rely on a 1% data sample for analytics; they can query all of their data -- faster, on a much larger data set, and with greater flexibility than before. One ad hoc query that used to take 4 days to run now finishes in 53 minutes on Cloudera. BlackBerry’s new environment allowed them to do things like predict the impact that the London Olympics would have on their network so they could take proactive measures and prevent a negative customer experience.
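The BlackBerry figures above imply two ratios that are worth making explicit: the speedup of the ad hoc query and the compression ratio on the daily data volume. The sketch below derives both from the quoted numbers only. The constant and function names are hypothetical, and it assumes ½ PB means 500 TB (decimal units), which the source does not specify.

```python
from datetime import timedelta

# Figures quoted in the BlackBerry case study.
QUERY_TIME_BEFORE = timedelta(days=4)
QUERY_TIME_AFTER = timedelta(minutes=53)
RAW_TB_PER_DAY = 500                # assumption: 1/2 PB read as 500 TB
COMPRESSED_TB_PER_DAY = (50, 60)    # quoted range after compression


def speedup(before: timedelta, after: timedelta) -> float:
    """How many times faster the new run is (dividing timedeltas yields a float)."""
    return before / after


def compression_ratios(raw_tb: float, compressed_range: tuple) -> tuple:
    """Return (lowest, highest) compression ratio implied by a compressed-size range."""
    smallest, largest = compressed_range
    return raw_tb / largest, raw_tb / smallest


query_speedup = speedup(QUERY_TIME_BEFORE, QUERY_TIME_AFTER)
min_ratio, max_ratio = compression_ratios(RAW_TB_PER_DAY, COMPRESSED_TB_PER_DAY)

print(f"Ad hoc query speedup: about {query_speedup:.0f}x")
print(f"Implied compression ratio: {min_ratio:.1f}x to {max_ratio:.1f}x")
```

With these inputs the query runs roughly two orders of magnitude faster, and the daily volume compresses by somewhere between 8x and 10x. Both results are back-of-the-envelope ratios for a single quoted query and a single quoted volume, not a general benchmark.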
Link to account record in SFDC (valid for Cloudera employees only): https://na6.salesforce.com/0018000000y2z1Y?srPos=0&srKp=001
A leading manufacturer of mobile devices and technology identified a hidden software bug that was causing a spike in mobile phone returns.
Background: A leading manufacturer of mobile devices and technology develops products that connect seamlessly so consumers have the best content at their fingertips 24x7. The company’s engineering department is responsible for manufacturing mobile phones and for developing a popular mobile platform. In recent years, consumers’ use of mobile phones has evolved from making calls to checking emails, taking photos and videos, buying things online and more. Mobile devices today actually make up more than 20% of all web traffic in the US.
Challenge: The volumes of data that need to be collected, stored, explored and analyzed are exploding. Every device generates a massive stream of unstructured data from texts, photos, videos, web browsing, and so on. And today’s competitive market requires the company not only to capture larger data volumes than ever before, but also to process that data and act on it rapidly in order to stay innovative. The company’s Oracle RAC enterprise data warehouse couldn’t keep up.
Solution: This company today leverages Cloudera Enterprise Core with RTD in conjunction with Oracle RAC; the two platforms work together for a closed-loop analytical process. The company offloads data processing and historical storage from Oracle to CDH, and moves data back into Oracle as needed for reporting and analysis. They process 1TB of data every day. Oracle houses a few months of recent data, which is available to business analysts for immediate reporting (both ad hoc and canned reports), whereas CDH is used for historical trend analysis (via Hive) of up to 25 years’ history.
Oracle contains aggregated data; CDH captures all of the detailed data.
Results: Hadoop’s ability to run large-scale, complex analysis is helping this company gain insights that would otherwise be hidden. In one case, a carrier that had been selling a popular phone noticed a sudden spike in returns. The carrier brought this issue to the manufacturer’s attention, and the manufacturer’s R&D team started investigating. After collecting a lot of data spread across numerous systems and conducting intensive research in CDH, they found a correlation between when they’d started using a new hardware supplier for one component in the device and when returns of that device started to spike. The new hardware component had the same specs and was actually a better quality product, with a narrower standard deviation for error. It turns out that the larger deviation in the original component actually allowed the software to work properly; when the component’s tolerances became tighter, a software bug manifested itself. By using Hadoop to combine carrier data with manufacturing data, this company was able to identify the problem and fix the software bug.
YP (YellowPages.com, previously AT&T Interactive) offloads data processing to Cloudera, which in turn enables new services that are valuable to publishers.
Background: With the movement from print (publishing the YellowPages books) to predominant usage of the web (YellowPages.com), YP’s business relies on display ads that are purchased by publishers and vendors. In order to keep publishers buying ads, YP needs to be able to offer near-real-time analytics so the publishers can monitor how their campaigns are doing and make adjustments on the fly.
Challenge: YP’s incumbent SQL Server data warehouse was not a scalable solution, and with increasing data volumes, performance was poor. YP generates 260 million billable web traffic events and 600 million non-billable events every day, and the business was demanding they keep 13 months of billable history and 90 days of non-billable history in the data warehouse so that data would be available for analysis.
Solution: YP replaced their SQL Server data warehouse with HP Vertica and Cloudera Enterprise. Cloudera serves as the core production traffic processing system that helps the company understand its network quality and traffic, and YP uses Vertica for reporting and analysis. YP currently has 315 CDH nodes and about 30 TB on Vertica.
Results: With their new system, YP’s data processing is completed in hours vs. days in the previous environment. This has ultimately enabled YP to launch several new business functions that increase the value they offer publishers, including:
Real-time publisher portals
Faster behavioral targeting
Real-time traffic analysis
Network quality analytics
With the faster data processing enabled by Cloudera, YP is better equipped to identify the areas of the business in which to invest that are likely to drive revenue.