hadoopsphere.com

Components that constitute the open source Apache Hadoop ecosystem

Summary and categorization of components available as Apache (ASF) projects/sub-projects and serving the Hadoop ecosystem. The document does not include other open source or commercial projects/products.

Contributed by: Sachin Ghai | @sachinghai

The diagram stacks the layers below, top to bottom, and groups them into 'Atmospheric' layers (upper) and 'Core' layers (lower):

Layer                        Components
Application Domains          Distribution, Financial, Government, Heavy Industry, Internet, Oil & Energy, Research, Telecom
Discovery & Visualization    Lucene, Blur, Giraph
Analytics & Intelligence     Mahout, Drill
Data Interactions            Pig, Hive, HCatalog, Tez, Gora
Hardware (& Appliances)      Commodity hardware
Distribution                 Apache
Secure                       Knox
Manage                       Oozie, ZooKeeper, Crunch, MRUnit, HDT, Ambari, Vaidya, BigTop, Whirr
Run                          MapReduce, YARN, Hama
Persist                      HDFS, HBase, Cassandra, Accumulo, Avro, Trevni, Thrift
Transfer                     Flume, Sqoop, Chukwa, Kafka



CORE LAYERS which constitute the Apache Hadoop ecosystem




PERSIST: File System & Data Store
• HDFS - Distributed file system that provides high-throughput access to application data. Comprises NameNode, Secondary NameNode and DataNodes
• HBase - Distributed, scalable, big data store
• Cassandra - Highly scalable, eventually consistent, distributed, structured key-value store
• Accumulo - Sorted, distributed key/value data storage and retrieval system




PERSIST: Serialization
• Avro - Data serialization system
• Trevni - Column file format designed to permit compatible, independent implementations that read and/or write files in this format
• Thrift - Framework for scalable cross-language services development
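To illustrate what a column-oriented layout means, the sketch below pivots a list of row records into per-column lists, so that reading one field touches only that field's values. This is only an in-memory illustration, not Trevni's file format or API; the function name `to_columns` and the sample records are hypothetical, and every row is assumed to carry the same fields.

```python
def to_columns(rows: list[dict[str, object]]) -> dict[str, list[object]]:
    """Pivot row records into one list of values per column name.

    Assumption: every row has the same set of fields.
    """
    columns: dict[str, list[object]] = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


# Hypothetical sample records, used only for illustration.
rows = [
    {"id": 1, "name": "alpha"},
    {"id": 2, "name": "beta"},
]
columns = to_columns(rows)

# A query that needs only "id" reads a single contiguous list.
ids = columns["id"]
```

Storing values column by column tends to help analytical queries that scan a few fields across many records, which is the usual motivation for column file formats.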




RUN: Job Execution
• MapReduce - Framework for performing distributed data processing. Comprises JobTracker, TaskTracker and JobHistoryServer
• YARN - Framework that facilitates writing arbitrary distributed processing frameworks and applications
• Hama - Pure BSP (Bulk Synchronous Parallel) computing framework for massive scientific computations such as matrix, graph and network algorithms
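The MapReduce programming model can be sketched in a few lines of plain Python. The example below counts words with explicit map, shuffle and reduce phases running in a single process. It does not use any Hadoop API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical; a real job would have the framework distribute these phases across the cluster.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(line: str) -> Iterator[tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Group all emitted values by key, between the map and reduce phases."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> dict[str, int]:
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data", "big cluster"])
```

Because each call to `map_phase` and `reduce_phase` depends only on its own input, the calls can run in parallel on different machines, which is the property the framework exploits.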




MANAGE: Work
• Oozie - Workflow/coordination system to manage Hadoop jobs
• ZooKeeper - Centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services




MANAGE: Dev
• Crunch - Framework for writing, testing, and running MapReduce pipelines
• MRUnit - Java library that helps developers unit test Apache Hadoop MapReduce jobs
• HDT - Hadoop Development Tools (HDT) comprise Eclipse-based tools for developing applications on the Hadoop platform




MANAGE: Ops
• Ambari - Web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters
• Vaidya - Performance diagnostic tool for MapReduce jobs
• BigTop - Project for the development of packaging and tests, and for ensuring interoperability among Apache Hadoop-related projects
• Whirr - Set of libraries for running cloud services, such as Hadoop clusters on EC2



SECURE:
• Knox - System that provides a single point of secure access for Apache Hadoop clusters




TRANSFER:
• Flume - Distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data
• Sqoop - Tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases
• Chukwa - Open source data collection system for monitoring large distributed systems
• Kafka - Distributed publish-subscribe messaging system
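The publish-subscribe pattern itself can be shown with a small in-process broker: publishers send messages to a named topic, and every handler subscribed to that topic receives them. This is a sketch of the pattern only, not Kafka's API or architecture; the `Broker` class and the topic names are hypothetical.

```python
from collections import defaultdict
from typing import Callable


class Broker:
    """Minimal in-memory publish-subscribe broker (illustration only)."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[str], None]) -> None:
        """Register a handler that is called for every message on the topic."""
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: str) -> int:
        """Deliver the message to all subscribers; return how many received it."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(message)
        return len(handlers)


broker = Broker()
received: list[str] = []
broker.subscribe("logs", received.append)  # hypothetical topic name
delivered = broker.publish("logs", "disk usage at 80%")
```

The publisher never refers to a subscriber directly, only to the topic, which is what lets producers and consumers be added or removed independently.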



ATMOSPHERIC LAYERS which build up the capabilities beyond the core of the Apache Hadoop ecosystem

HARDWARE:
• Commodity Hardware - Low-cost, easily available hardware working in parallel

Note: No appliances are known to run on the pure Apache Hadoop distribution. SSDs and other cheap hardware options are not mentioned separately here.

DATA INTERACTIONS:
• Pig - Platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs
• Hive - Data warehouse system that facilitates easy data summarization, ad-hoc queries and analysis of large datasets stored in Hadoop-compatible file systems



DATA INTERACTIONS:
• HCatalog - Table and storage management service for data created using Apache Hadoop
• Tez - Generic application framework which can be used to process complex data-processing task DAGs and runs natively on Apache Hadoop YARN
• Gora - Framework for in-memory data model and persistence with MapReduce support




ANALYTICS & INTELLIGENCE:
• Mahout - Scalable machine learning and data mining algorithm library. Supports recommendation mining, clustering, classification and frequent itemset mining
• Drill - Distributed system for interactive analysis of large-scale datasets. Comprises user interface (CLI, REST), pluggable query language and pluggable data source
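As an illustration of frequent itemset mining, one of the task types listed for Mahout, the sketch below counts how often each pair of items appears together and keeps the pairs that reach a minimum support. It is a brute-force, single-machine illustration, not Mahout's API or its algorithm; the function name `frequent_pairs` and the shopping-basket data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(
    transactions: list[list[str]], min_support: int
) -> dict[tuple[str, str], int]:
    """Return item pairs that co-occur in at least `min_support` transactions."""
    counts: Counter[tuple[str, str]] = Counter()
    for items in transactions:
        # Sort and de-duplicate so each pair has one canonical form per transaction.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical baskets, used only for illustration.
baskets = [
    ["milk", "bread"],
    ["milk", "bread", "eggs"],
    ["bread", "eggs"],
]
pairs = frequent_pairs(baskets, min_support=2)
```

Counting every pair grows quadratically with basket size, which is one reason scalable libraries use more selective algorithms and distribute the counting.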


DISCOVERY & VISUALIZATION:
• Lucene - Open-source search software including the Java-based indexing and search component Lucene Core and the high-performance search server component Solr
• Blur - Search engine capable of querying massive amounts of structured data at incredible speeds in a cloud computing environment




DISCOVERY & VISUALIZATION:
• Giraph - Graph-processing framework leveraging existing Hadoop infrastructure. Follows the bulk synchronous parallel model to run large-scale algorithms. Supports directed, undirected, weighted and unweighted graphs, and multigraphs

Note: No pure visualization projects are currently part of ASF.
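The bulk synchronous parallel model that Giraph follows can be sketched as a loop of supersteps: in each one, active vertices send messages, a barrier separates sending from receiving, and vertices then update their state. The example below propagates the maximum value through a small graph. It is a single-process illustration, not Giraph's API; the function name `bsp_max` and the sample graph are hypothetical, and every neighbour is assumed to appear as a key in the adjacency map.

```python
def bsp_max(graph: dict[str, list[str]], values: dict[str, int]) -> dict[str, int]:
    """Spread the largest vertex value to every connected vertex, superstep by superstep."""
    state = dict(values)
    active = set(graph)  # every vertex is active in the first superstep
    while active:
        # Communication: each active vertex sends its value to its neighbours.
        inbox: dict[str, list[int]] = {vertex: [] for vertex in graph}
        for vertex in active:
            for neighbour in graph[vertex]:
                inbox[neighbour].append(state[vertex])
        # Barrier: messages are processed only after all sends have finished.
        active = set()
        for vertex, messages in inbox.items():
            if messages and max(messages) > state[vertex]:
                state[vertex] = max(messages)
                active.add(vertex)  # a changed vertex speaks again next superstep
    return state


# Hypothetical undirected path a - b - c, used only for illustration.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
result = bsp_max(graph, {"a": 3, "b": 1, "c": 2})
```

The loop ends when a superstep changes no vertex, so no messages remain to be sent.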

APPLICATION DOMAINS:
• Distribution - Includes applications in travel, transport, FMCG, supply chain, e.g. Expedia
• Financial - Includes applications in financial, banking, insurance, e.g. Visa
• Government - Includes applications in government and public sector, e.g. Aadhaar (India ID card)
• Heavy Industry - Includes applications in heavy industrial business including electronics, auto, aircraft, e.g. Hitachi

APPLICATION DOMAINS:
• Internet - Includes new-age internet applications including social media, content distribution, e.g. Facebook
• Oil & Energy - Includes applications in upstream/downstream oil, gas business along with those in the energy sector, e.g. Chevron
• Research - Includes applications in new research, e.g. network analysis & security
• Telecom - Includes applications in telecom business, e.g. Korea Telecom



Reference :
• www.apache.org
• http://blogs.gartner.com/merv-adrian/2013/02/21/hadoop

Image courtesy:
• Slide 1 : Getty Images #84480368 Dorling Kindersley
  (free thumbnail copy)
• Other images: Original source could not be established







About the document :
• Voluntarily contributed by: Sachin Ghai (@sachinghai)
• Publisher : hadoopsphere.com
• Version : 1.0
• Date : 11 March 2013
• Copyright: 2013, All Rights Reserved
• Note: The document does not use official lingo in part
• Contact : Use ‘Contact’ menu option on
  www.hadoopsphere.com
• Disclaimer: The project names mentioned in this document
  are either registered trademarks or trademarks of the Apache
  Software Foundation in the United States. The Apache
  Software Foundation has no affiliation with and does not
  endorse or review the materials provided in this document.


Apache Hadoop ecosystem - March 2013

  • 1. hadoopsphere.com: Components that constitute the open source Apache Hadoop ecosystem. Summary and categorization of components available as Apache (ASF) projects/sub-projects and serving the Hadoop ecosystem. The document does not include other open source or commercial projects/products. Contributed by: Sachin Ghai (@sachinghai)
  • 2. ‘Atmospheric’ Layers:
      • Application Domains - Distribution, Financial, Government, Heavy Industry, Internet, Oil & Energy, Research, Telecom
      • Discovery & Visualization - Lucene, Blur, Giraph
      • Analytics & Intelligence - Mahout, Drill
      • Data Interactions - Pig, Hive, HCatalog, Tez, Gora
      • Hardware (& Appliances) - Commodity H/w
    Distribution: Apache
    ‘Core’ Layers:
      • Secure - Knox
      • Manage - Oozie, Zookeeper, Crunch, MRUnit, HDT, Ambari, Vaidya, BigTop, Whirr
      • Run - MapReduce, YARN, Hama
      • Persist - HDFS, HBase, Cassandra, Accumulo, Avro, Trevni, Thrift
      • Transfer - Flume, Sqoop, Chukwa, Kafka
  • 3. CORE LAYERS which constitute the Apache Hadoop ecosystem
  • 4. PERSIST: File System & Data Store
      • HDFS - Distributed file system that provides high-throughput access. Comprises NameNode, Secondary NameNode and DataNodes
      • HBase - Distributed, scalable, big data store
      • Cassandra - Highly scalable, eventually consistent, distributed, structured key-value store
      • Accumulo - Sorted, distributed key/value data storage and retrieval system
  • 5. PERSIST: Serialization
      • Avro - Data serialization system
      • Trevni - A column file format to permit compatible, independent implementations that read and/or write files in this format
      • Thrift - Framework for scalable cross-language services development
  • 6. RUN: Job Execution
      • MapReduce - Framework for performing distributed data processing. Comprises JobTracker, TaskTracker and JobHistoryServer
      • YARN - Framework that facilitates writing arbitrary distributed processing frameworks and applications
      • Hama - Pure BSP (Bulk Synchronous Parallel) computing framework for massive scientific computations such as matrix, graph and network algorithms
  • 7. MANAGE: Work
      • Oozie - Workflow/coordination system to manage Hadoop jobs
      • Zookeeper - Centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services
  • 8. MANAGE: Dev
      • Crunch - Framework for writing, testing, and running MapReduce pipelines
      • MRUnit - Java library that helps developers unit test Apache Hadoop MapReduce jobs
      • HDT - Hadoop Development Tools (HDT) comprise Eclipse-based tools for developing applications on the Hadoop platform
  • 9. MANAGE: Ops
      • Ambari - Web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters
      • Vaidya - Performance diagnostic tool for MapReduce jobs
      • BigTop - Project for the development of packaging and tests, and for ensuring interoperability among Apache Hadoop related projects
      • Whirr - Set of libraries for running cloud services, such as Hadoop clusters on EC2
  • 10. SECURE:
      • Knox - System that provides a single point of secure access for Apache Hadoop clusters
  • 11. TRANSFER:
      • Flume - Distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data
      • Sqoop - Tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases
      • Chukwa - Open source data collection system for monitoring large distributed systems
      • Kafka - Distributed publish-subscribe messaging system
  • 12. ATMOSPHERIC LAYERS which build up the capabilities beyond the core of the Apache Hadoop ecosystem
  • 13. HARDWARE:
      • Commodity Hardware - Low-cost, easily available hardware working in parallel
      Note: no appliances are known to run on a pure Apache Hadoop distribution; SSD and other cheap hardware options are not mentioned separately here
  • 14. DATA INTERACTIONS:
      • Pig - Platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs
      • Hive - Data warehouse system that facilitates easy data summarization, ad-hoc queries and analysis of large datasets stored in Hadoop compatible file systems
  • 15. DATA INTERACTIONS:
      • HCatalog - Table and storage management service for data created using Apache Hadoop
      • Tez - Generic application framework which can be used to process complex data-processing task DAGs and runs natively on Apache Hadoop YARN
      • Gora - Framework for in-memory data model and persistence with MapReduce support
  • 16. ANALYTICS & INTELLIGENCE:
      • Mahout - Scalable machine learning and data mining algorithm library. Supports recommendation mining, clustering, classification and frequent itemset mining
      • Drill - Distributed system for interactive analysis of large-scale datasets. Comprises a user interface (CLI, REST), a pluggable query language and pluggable data sources
  • 17. DISCOVERY & VISUALIZATION:
      • Lucene - Open-source search software, including the Java-based indexing and search component Lucene Core and the high-performance search server component Solr
      • Blur - Search engine capable of querying massive amounts of structured data at incredible speeds in a cloud computing environment
  • 18. DISCOVERY & VISUALIZATION:
      • Giraph - Graph-processing framework leveraging existing Hadoop infrastructure. Follows the bulk synchronous parallel model to run large-scale algorithms. Supports directed, undirected, weighted, unweighted and multigraphs
      Note: no pure visualization projects are currently part of ASF
  • 19. APPLICATION DOMAINS:
      • Distribution - Includes applications in travel, transport, FMCG, supply chain, e.g. Expedia
      • Financial - Includes applications in financial, banking, insurance, e.g. Visa
      • Government - Includes applications in government and public sector, e.g. Aadhaar (India ID card)
      • Heavy Industry - Includes applications in heavy industrial business including electronics, auto, aircraft, e.g. Hitachi
  • 20. APPLICATION DOMAINS:
      • Internet - Includes new age internet applications including social media, content distribution, e.g. Facebook
      • Oil & Energy - Includes applications in upstream/downstream oil, gas business along with those in the Energy sector, e.g. Chevron
      • Research - Includes applications in new research, e.g. network analysis & security
      • Telecom - Includes applications in Telecom business, e.g. Korea Telecom
  • 21. Reference:
      • www.apache.org
      • http://blogs.gartner.com/merv-adrian/2013/02/21/hadoop
    Image courtesy:
      • Slide 1: Getty Images #84480368 Dorling Kindersley (free thumbnail copy)
      • Other images: Original source could not be established
  • 22. About the document:
      • Voluntarily contributed by: Sachin Ghai (@sachinghai)
      • Publisher: hadoopsphere.com
      • Version: 1.0
      • Date: 11 March 2013
      • Copyright: 2013, All Rights Reserved
      • Note: The document does not use official terminology in places
      • Contact: Use the ‘Contact’ menu option on www.hadoopsphere.com
      • Disclaimer: The project names mentioned in this document are either registered trademarks or trademarks of the Apache Software Foundation in the United States. The Apache Software Foundation has no affiliation with and does not endorse or review the materials provided in this document.
  • 23. Subscribe to hadoopsphere.com:
      • Newsletter on e-mail subscription
      • RSS Feed for posts
      • Follow on Twitter
      • Like on Facebook
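
The RUN layer on slide 6 lists MapReduce as a framework for distributed data processing. As a rough illustration of the programming model only, the following is a minimal single-process Python sketch of the map, shuffle and reduce steps applied to a word count. It does not use the Hadoop API and runs nothing in a distributed fashion; every function name in it (map_phase, shuffle, reduce_phase, run_job and the word count helpers) is a hypothetical name chosen for this example.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every input record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all values by key, as the framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups, reducer):
    """Apply the reducer once per key, visiting keys in sorted order."""
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


def run_job(records, mapper, reducer):
    """Hypothetical driver: chains map, shuffle and reduce in one process."""
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


def word_count_mapper(line):
    """Emit (word, 1) for every whitespace-separated word in a line."""
    for word in line.lower().split():
        yield word, 1


def word_count_reducer(word, counts):
    """Sum the counts collected for one word."""
    return sum(counts)


if __name__ == "__main__":
    lines = ["hadoop stores data", "hadoop processes data"]
    print(run_job(lines, word_count_mapper, word_count_reducer))
    # {'data': 2, 'hadoop': 2, 'processes': 1, 'stores': 1}
```

In a real cluster the map and reduce functions would run as separate tasks on different nodes and the shuffle would move data across the network; the sketch keeps everything in memory so the data flow is easy to follow.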