Lambda Architecture The Hive

•Download as PPTX, PDF•

0 likes•878 views

Altan Khendup

2
Background of Lambda Architecture
Background
– Reference architecture for Big Data systems
– Designed by Nathan Marz (Twitter)
– Defined as a system that runs arbitrary functions on arbitrary
data
– “query = function(all data)”
Design Principles
– Human fault-tolerant, Immutability, Computable
Lambda Layers
– Batch - Contains the immutable, constantly growing master
dataset.
– Speed - Deals only with new data and compensates for the
high latency updates of the serving layer.
– Serving - Loads and exposes the combined view of data so
that they can be queried.

4 © 2014 Teradata
USE CASE – MAYO CLINIC

Mayo Clinic History
Every year, more than a million people from all 50 states
and nearly 150 countries come for care
Dozens of locations in several states with major
campuses in Rochester, Minn.; Scottsdale and Phoenix,
Ariz.; and Jacksonville, Fla.
Mayo Clinic Rochester, Minn. recognized as the top
hospital in the nation for 2014-2015 by U.S. News &
World Report

Why Big Data?
Challenges in Medical Data
Health data tends to be “wide”, not “deep”
New data types are becoming more important
Unstructured
Real-time streaming
A challenge to generally move from retrospective “BI”
viewing to event-based and predictive analytics usage
Multiple layers
Lots of events, data
Complex
Lots of different languages and data structures
Difficult to maintain
Lots of moving pieces/components/technologies
Lots of changes in the business

Data Discovery
Many “Big Data” stories start with data discovery
The Data Lake, etc.
But, data discovery is not predictable!
Mayo Clinic needed to define a real operational need
that a “Big Data” technology stack could fulfill

Project
Optimize an existing Natural Language Processing
pipeline in support of critical Colorectal Surgery
(Move to tens of thousands of documents processed)
Replace an existing free-text search facility used by
Clinical Web Service for colorectal cancer
(Move search to milliseconds)

10
• Current Storm throughput up to 1.5 million documents per hour
• Average of 140,000 HL7 messages actually processed per day with
average latency of 60 milliseconds from ingest to persistence
• Average of 50,000 documents passed through annotators per day
versus 5,000 historically
• Actual annotations of documents up to 6 times faster than previously
accomplished
• Free-text search use cases that took over 30 minutes on old
infrastructure completing in milliseconds in ElasticSearch
Operational Statistics

11
• Benefits
– An architecturally-driven, internally-owned technology stack that blends:
- An event-based/”real-time” processing fabric
- A multi-destination distillation hub
- A foundation for “Classic” BI delivery techniques
- A foundation for “Services-based” delivery techniques
- A “serendipitous” discovery environment
– Mutually supportive components that combine in delivering novel clinical
solutions
– Data continuity
- Historical data can be assessed as algorithms change over time
Summary

12
Thank you! We’re Hiring!
thinkbigcareers.teradata.com
Altan Khendup (@madmongol)
Altan.khendup@teradata.com
Ron Bodkin (@ronbodkin)
Ron.bodkin@thinkbiganalytics.com

What's hot

Mendeley Open Repositories 2011 PaperWilliam Gunn

Cloud DataverseMerce Crosas

From Data to Visualization: Emerging Tools for Research / Jan JohanssonPVC.ASIST

Top 10 data science technologiesBrainware University

ieee cloud 2015 keynote talkMicrosoft Azure for Research

Machine Learning and Cultural Heritage: What Is It Good Enough For?John Stack

Data GovRexNige

Wehc - Linked Data for Economic-Social historiansBram van den Hout

QB'er demonstrationCLARIAH

Closing Keynote - AWS Executive Summit 2014 IndiaAmazon Web Services

RSpace - Rory Macneil at Repository Fringe 2015Repository Fringe

December 9, 2015 NISO Webinar: Two-Part Webinar: Emerging Resource Types - Pa...DeVonne Parks, CEM

Chris Armit at IDW2018: Democratising Data Publishing: A Global PerspectiveGigaScience, BGI Hong Kong

Adding value to researchers' dataAndrew Treloar

Building data networks: exploring trust and interoperability between authoris...Repository Fringe

Big dataMuhammad Noman Fazil

Husar Report to UIC PlenaryRudolf Husar

Open Data from the European LibraryTU Delft, Netherlands

The pulse of cloud computing with bioinformatics as an exampleEnis Afgan

Hadoop TutorialUjjwal Gupta

What's hot (20)

Mendeley Open Repositories 2011 Paper

Cloud Dataverse

From Data to Visualization: Emerging Tools for Research / Jan Johansson

Top 10 data science technologies

ieee cloud 2015 keynote talk

Machine Learning and Cultural Heritage: What Is It Good Enough For?

Data Gov

Wehc - Linked Data for Economic-Social historians

QB'er demonstration

Closing Keynote - AWS Executive Summit 2014 India

RSpace - Rory Macneil at Repository Fringe 2015

December 9, 2015 NISO Webinar: Two-Part Webinar: Emerging Resource Types - Pa...

Chris Armit at IDW2018: Democratising Data Publishing: A Global Perspective

Adding value to researchers' data

Building data networks: exploring trust and interoperability between authoris...

Big data

Husar Report to UIC Plenary

Open Data from the European Library

The pulse of cloud computing with bioinformatics as an example

Hadoop Tutorial

Viewers also liked

Hive & HBase For Transaction ProcessingDataWorks Summit

Spotify's Music Recommendations Lambda ArchitectureEsh Vckay

End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...DataWorks Summit/Hadoop Summit

Implementing the Lambda Architecture efficiently with Apache SparkDataWorks Summit

Spark Summit EU talk by Bas GeerdinkSpark Summit

Apache Sqoop: A Data Transfer Tool for HadoopCloudera, Inc.

AWS Black Belt Tech シリーズ 2016 - Amazon SQS / Amazon SNSAmazon Web Services Japan

AWS Black Belt Techシリーズ Amazon SNS / Amazon SQSAmazon Web Services Japan

Real time Analytics with Apache Kafka and Apache SparkRahul Jain

Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaHelena Edelson

Solution Architecture – Approach to Rapidly Scoping The Initial Solution OptionsAlan McSweeney

Structured Approach to Solution ArchitectureAlan McSweeney

Viewers also liked (12)

Hive & HBase For Transaction Processing

Spotify's Music Recommendations Lambda Architecture

End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...

Implementing the Lambda Architecture efficiently with Apache Spark

Spark Summit EU talk by Bas Geerdink

Apache Sqoop: A Data Transfer Tool for Hadoop

AWS Black Belt Tech シリーズ 2016 - Amazon SQS / Amazon SNS

AWS Black Belt Techシリーズ Amazon SNS / Amazon SQS

Real time Analytics with Apache Kafka and Apache Spark

Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala

Solution Architecture – Approach to Rapidly Scoping The Initial Solution Options

Structured Approach to Solution Architecture

Similar to Lambda Architecture The Hive

Data Harmonization for a Molecularly Driven Health SystemWarren Kibbe

HPE and Hortonworks join forces to Deliver Healthcare TransformationHortonworks

Big Data, The Community and The Commons (May 12, 2014)Robert Grossman

Data Harmonization for a Molecularly Driven Health SystemWarren Kibbe

Met soc15 roccaserra-biocrates-datasharingPhilippe Rocca-Serra

Is one enough? Data warehousing for biomedical researchGreg Landrum

Big Data Putchong Uthayopas

Big Data in Clinical ResearchMike Hogarth, MD, FACMI, FACP

Using The Hadoop Ecosystem to Drive Healthcare InnovationDan Wellisch

Starting the Hadoop Journey at a Global Leader in Cancer ResearchDataWorks Summit/Hadoop Summit

High Performance Data Analytics and a Java Grande Run TimeGeoffrey Fox

Instrument of Change: Creating the next generation of Laboratory MiddlewareTodd Winey

From Vaccine Management to ICU Planning: How CRISP Unlocked the Power of Data...Databricks

High Performance Computing and the Opportunity with Cognitive TechnologyIBM Watson

Analyzing Big Data in Medicine with Virtual Research Environments and Microse...Ola Spjuth

iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...Bonnie Hurwitz

Yellowbrick Webcast with DBTA for Real-Time AnalyticsYellowbrick Data

Big Data Vendor Panel - Data StaxMikan Associates

Jisc's new shared data centreJisc

Similar to Lambda Architecture The Hive (20)

Data Harmonization for a Molecularly Driven Health System

HPE and Hortonworks join forces to Deliver Healthcare Transformation

Big Data, The Community and The Commons (May 12, 2014)

Data Harmonization for a Molecularly Driven Health System

Met soc15 roccaserra-biocrates-datasharing

Is one enough? Data warehousing for biomedical research

Big Data

Big Data in Clinical Research

Using The Hadoop Ecosystem to Drive Healthcare Innovation

Starting the Hadoop Journey at a Global Leader in Cancer Research

High Performance Data Analytics and a Java Grande Run Time

Instrument of Change: Creating the next generation of Laboratory Middleware

From Vaccine Management to ICU Planning: How CRISP Unlocked the Power of Data...

High Performance Computing and the Opportunity with Cognitive Technology

Analyzing Big Data in Medicine with Virtual Research Environments and Microse...

iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...

Yellowbrick Webcast with DBTA for Real-Time Analytics

Big Data Vendor Panel - Data Stax

Jisc's new shared data centre

Lambda Architecture The Hive

1. Lambda Architecture Use Case: Mayo Clinic FEBRUARY 2015 Altan Khendup – Leader, UDA Architecture COE

2. 2 Background of Lambda Architecture Background – Reference architecture for Big Data systems – Designed by Nathan Marz (Twitter) – Defined as a system that runs arbitrary functions on arbitrary data – “query = function(all data)” Design Principles – Human fault-tolerant, Immutability, Computable Lambda Layers – Batch - Contains the immutable, constantly growing master dataset. – Speed - Deals only with new data and compensates for the high latency updates of the serving layer. – Serving - Loads and exposes the combined view of data so that they can be queried.

3. 3 Overview of Lambda Architecture

5. Mayo Clinic History Every year, more than a million people from all 50 states and nearly 150 countries come for care Dozens of locations in several states with major campuses in Rochester, Minn.; Scottsdale and Phoenix, Ariz.; and Jacksonville, Fla. Mayo Clinic Rochester, Minn. recognized as the top hospital in the nation for 2014-2015 by U.S. News & World Report

6. Why Big Data? Challenges in Medical Data Health data tends to be “wide”, not “deep” New data types are becoming more important Unstructured Real-time streaming A challenge to generally move from retrospective “BI” viewing to event-based and predictive analytics usage Multiple layers Lots of events, data Complex Lots of different languages and data structures Difficult to maintain Lots of moving pieces/components/technologies Lots of changes in the business

7. Data Discovery Many “Big Data” stories start with data discovery The Data Lake, etc. But, data discovery is not predictable! Mayo Clinic needed to define a real operational need that a “Big Data” technology stack could fulfill

8. Project Optimize an existing Natural Language Processing pipeline in support of critical Colorectal Surgery (Move to tens of thousands of documents processed) Replace an existing free-text search facility used by Clinical Web Service for colorectal cancer (Move search to milliseconds)

9. 9 Overall Architecture

10. 10 • Current Storm throughput up to 1.5 million documents per hour • Average of 140,000 HL7 messages actually processed per day with average latency of 60 milliseconds from ingest to persistence • Average of 50,000 documents passed through annotators per day versus 5,000 historically • Actual annotations of documents up to 6 times faster than previously accomplished • Free-text search use cases that took over 30 minutes on old infrastructure completing in milliseconds in ElasticSearch Operational Statistics

11. 11 • Benefits – An architecturally-driven, internally-owned technology stack that blends: - An event-based/”real-time” processing fabric - A multi-destination distillation hub - A foundation for “Classic” BI delivery techniques - A foundation for “Services-based” delivery techniques - A “serendipitous” discovery environment – Mutually supportive components that combine in delivering novel clinical solutions – Data continuity - Historical data can be assessed as algorithms change over time Summary

12. 12 Thank you! We’re Hiring! thinkbigcareers.teradata.com Altan Khendup (@madmongol) Altan.khendup@teradata.com Ron Bodkin (@ronbodkin) Ron.bodkin@thinkbiganalytics.com

Editor's Notes

Lambda = architectural pattern to talk about the complexity of dealing with real-time and historical datasets Overall use Prescriptive/Predictive uses rely on some dimension of real-time Use cases CPG – consumer goods looking at what customers are doing in real-time and making adjustments Medical – real-time medical sensors and treatment and labs for critical patient care Financial – credit risk and transaction fraud Manufacturers – IoT/Telematics getting information from their plants and logistics, cross referencing to inventory, and making adjustments to supply chain
General architecture that covers how Lambda works overall Able to address real-time and historical data Layers Speed – real-time/current data streams; spark, storm, etc. Batch – historical data layer Serving – ability to take the current data and historical and merge the results and provide that to the organization Real-world experience/strategy Do not tackle all of the data but rather necessary segments of business functionality called queries Data can be tackled per query hence the idea of “query focused datasets” or qfds Allows for more focused results/faster speed gains
Data Discovery & Analytics
HL7 actual processing based on “pull” requests from users not actual processing power HL7 are large xml-based documents Much larger than say JSON or others (roughly 800k-900k in size) Contains significant data related to medical information End goal An architecturally-driven, internally-owned technology stack that blends: An event-based processing fabric A real-time processing framework A multi-destination distillation hub “Classic” BI delivery techniques “Services-based” delivery techniques A “serendipitous” discovery environment Mutually supportive components that combine in delivering novel clinical solutions.
Challenges Cross industry similarities
We’re hiring! Ron Bodkin + my contact

Lambda Architecture The Hive

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (12)

Similar to Lambda Architecture The Hive

Similar to Lambda Architecture The Hive (20)

Lambda Architecture The Hive

Editor's Notes