This document provides an overview of real-time processing capabilities on Hortonworks Data Platform (HDP). It discusses how a trucking company uses HDP to analyze sensor data from trucks in real time to monitor for violations and integrate predictive analytics. The company collects data using Kafka and analyzes it using Storm, HBase, and Hive on Tez. This provides real-time dashboards as well as querying of historical data to identify issues with routes, trucks, or drivers. The document explains components like Kafka, Storm, and HBase, and how they enable a unified YARN-based architecture for multiple workloads on a single HDP cluster.
Why do we now have electric cars at production scale, quadcopters and drones cheap enough for the home hobbyist, and VR displays being bought by companies like Facebook?
Because the technology is there now, thanks to advances made in other industries, solving problems at scale in a big marketplace.
At scale (in this case):
$270 billion smartphone market in 2014
$120 billion internet advertising (projected for 2015)
Before we dive into Hadoop and its role within the modern data architecture, let’s set the context for why Hadoop has become important.
Existing approaches for data management have become both technically and commercially impractical.
Technically - these systems were never designed to store or process vast quantities of data
Commercially – the licensing structures of the traditional approach are no longer feasible.
These two challenges, combined with the rate at which data is being produced, precipitated a need for a new approach to data systems. If we fast-forward another 3 to 5 years, more than half of the data under management within the enterprise will be from these new data sources.
Our goal since our inception has been very simple: to enable a Modern Data Architecture with Enterprise Hadoop. Everything we do is with this architectural goal in mind.
Single focus - enabling Apache Hadoop as an enterprise data platform for any app and any data type
In the open, partner for success.
Everything in the open.
Deep joint engineering with Microsoft (HDInsight), HP, SAP, and Teradata
In 2011, Hortonworks was founded with the 24 original Hadoop architects and engineers from Yahoo!
This original team had been working on a technology called YARN (Yet Another Resource Negotiator) that enables multiple applications to access all your enterprise data through an efficient centralized platform. It is the data operating system for Hadoop, providing the versatility to handle any application and dataset no matter the size or type.
Moreover, YARN provided the centralized architecture around which the critical enterprise services of Security, Operations, and Governance could be centrally addressed and integrated with existing enterprise policies.
This work allowed a new approach to data to emerge: the modern data architecture. At the heart of this approach is the capability for Hadoop to unify data and processing in an efficient data platform.
Our product, the Hortonworks Data Platform (or HDP for short), is a completely open source, enterprise-grade data platform comprising dozens of Apache open source projects, with Apache Hadoop and YARN at its center.
We have a comprehensive engineering, testing, and certification process that integrates and packages all of these components into a cohesive platform that the enterprise can consume and deploy at scale. And our model enables us to proactively manage new innovations and new open source projects into HDP as they emerge.
To ensure the highest quality, we have a test suite, unique to Hortonworks, comprising tens of thousands of system and integration tests that we run at scale on a regular basis, including on the world’s largest Hadoop clusters at Yahoo! as part of our co-development relationship.
While our pure-play competitors focus on proprietary components for security, operations, and governance, we invest in new open source projects that address these areas.
For example, earlier in 2014, we acquired a small company called XA Secure that provided a comprehensive security and administration product. We contributed the technology wholesale to open source as Apache Ranger.
Since our security, operations and governance technologies are open source projects, our partners are able to work with us on those projects to ensure deep integration within our joint solution architectures.
An Elasticsearch Flume sink does exist.
The key abstraction in Kafka is the topic. Producers publish their records to a topic, and consumers subscribe to one or more topics.
A Kafka topic is just a sharded write-ahead log.
Messages are not deleted when they are read; instead, they are retained under a configurable retention window (say, a few days or a week). This allows usage in situations where a consumer may need to reload data.
It also makes space-efficient publish-subscribe possible: there is a single shared log no matter how many consumers, whereas traditional messaging systems usually keep a queue per consumer, so adding a consumer doubles your data size. This makes Kafka a good fit for uses outside the bounds of normal messaging systems, such as acting as a pipeline for offline data systems like Hadoop. These offline systems may load only at intervals as part of a periodic ETL cycle, or may go down for several hours for maintenance, during which Kafka can buffer even terabytes of unconsumed data if needed.
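The ideas above (a topic as a set of sharded append-only logs, non-destructive reads, per-consumer offsets, and time-based retention) can be sketched as a toy model. This is purely illustrative, not the Kafka API: the `Topic` class and its method names are invented for this sketch, and real Kafka keeps offsets stable after retention expiry, which this toy does not.

```python
import time
import zlib

class Topic:
    """Toy model of a Kafka-style topic: a set of append-only partition logs.

    Records are not deleted on read; a retention window controls expiry, and
    each consumer tracks its own offset, so many consumers share one log.
    """

    def __init__(self, num_partitions, retention_seconds):
        self.partitions = [[] for _ in range(num_partitions)]  # (timestamp, key, value)
        self.retention_seconds = retention_seconds

    def produce(self, key, value, now=None):
        # Shard by key so all records for a key land in one partition, in order.
        # (A stable checksum stands in for Kafka's default partitioner.)
        now = time.time() if now is None else now
        p = zlib.crc32(key.encode()) % len(self.partitions)
        self.partitions[p].append((now, key, value))
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def consume(self, partition, offset):
        # Reading does not delete anything: any number of consumers can
        # re-read the same partition from any offset they have saved.
        return self.partitions[partition][offset:]

    def expire(self, now=None):
        # Retention: drop only records older than the configured window.
        # (Unlike real Kafka, this toy renumbers offsets after expiry.)
        now = time.time() if now is None else now
        cutoff = now - self.retention_seconds
        self.partitions = [[r for r in log if r[0] >= cutoff]
                           for log in self.partitions]
```

An offline system that was down for maintenance simply calls `consume` again from its last saved offset; nothing was lost as long as the retention window covers the outage.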
Replication for HA/fault tolerance is built in
Pull-based system for consumers instead of push-based
Crude benchmark:
Basically, single-threaded synchronous production reaches about 400,000 messages per second using 6 "datanode-ish" servers. This rises to more than 2 million messages per second with partitions and asynchronous messages. Server specs in the benchmark:
Intel Xeon 2.5 GHz processor with six cores
Six 7200 RPM SATA drives
32GB of RAM
1Gb Ethernet
A traditional queue retains messages in-order on the server, and if multiple consumers consume from the queue then the server hands out messages in the order they are stored. However, although the server hands out messages in order, the messages are delivered asynchronously to consumers, so they may arrive out of order on different consumers. This effectively means the ordering of the messages is lost in the presence of parallel consumption. Messaging systems often work around this by having a notion of "exclusive consumer" that allows only one process to consume from a queue, but of course this means that there is no parallelism in processing.
Kafka does it better. By having a notion of parallelism—the partition—within the topics, Kafka is able to provide both ordering guarantees and load balancing over a pool of consumer processes. This is achieved by assigning the partitions in the topic to the consumers in the consumer group so that each partition is consumed by exactly one consumer in the group. By doing this we ensure that the consumer is the only reader of that partition and consumes the data in order.
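The assignment step described above can be sketched as a small function: spread the topic's partitions over the consumer group so that each partition has exactly one reader, which is what preserves per-partition order while still load-balancing. The round-robin strategy and the function name are illustrative assumptions, not Kafka's actual rebalancing code.

```python
def assign_partitions(partitions, consumers):
    """Assign each partition to exactly one consumer in the group.

    Because a partition has a single reader, its messages are consumed
    in order; because partitions are spread across consumers, the group
    as a whole processes the topic in parallel.
    """
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        # Simple round-robin over the consumers in the group.
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment
```

For a 6-partition topic and a 3-consumer group, `assign_partitions(list(range(6)), ["c1", "c2", "c3"])` gives each consumer two partitions, and no partition appears in two places, so there is never parallel consumption within a partition.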
http://www.quora.com/Kafka-writes-every-message-to-broker-disk-Still-performance-wise-it-is-better-than-some-of-the-in-memory-message-storing-message-queues-Why-is-that
This all changed with the introduction of Hadoop 2 and YARN in October 2013.
Introduced in MR-279 by Arun Murthy in 2009, Arun and the team at Hortonworks architected and led its development as the core change in Hadoop 2. Our view was that to truly enable Hadoop as a component of a broad data architecture, YARN was the fundamental requirement, as it turns Hadoop from a single-application data system into a multi-application data system. This is foundational to our approach of innovating from the core outwards to build Enterprise Hadoop.
With YARN it is now possible to land all data in one cluster and then access it in multiple ways: from batch to interactive to real-time.
Today, YARN, at the core of Hadoop, is the center of our focus on innovation in and around Hadoop. It is clearly the enabling technology behind the transition to a data lake within organizations.
Simply stated… Hortonworks architected and led the development of YARN in order to enable the Modern Data Architecture.
Data is ingested, it’s on the dashboard, and it’s in HDFS.