Ankara Cloud Meetup – July 2016
Serkan ÖZAL
BIG DATA ON AWS
AGENDA
• What is Big Data?
• Big Data Concepts
• Big Data on AWS
• Data Storage on AWS
• Data Analytics & Querying on AWS
• Data Processing on AWS
• Data Flow on AWS
• Demo
WHAT IS BIG DATA???
4V OF BIG DATA
BIG DATA CONCEPTS
• Scalability
• Cloud Computing
• Data Storage
• Data Analytics & Querying
• Data Processing
• Data Flow
BIG DATA ON AWS
AWS BIG DATA PORTFOLIO
SCALABILITY ON AWS
• Types of scalability:
• Vertical Scalability: More/Less CPU, RAM, ...
• Horizontal Scalability: More/Fewer machines
• Auto scalable by
• CPU
• Memory
• Network
• Disk
• …
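The threshold-driven scale-out/scale-in behavior above can be sketched in a few lines. This is a toy policy, not an AWS Auto Scaling implementation; the function name, thresholds, and limits are illustrative.

```python
# Hypothetical sketch of a CPU-driven horizontal scaling policy:
# add a machine above the high-water mark, remove one below the
# low-water mark, and clamp to the group's min/max size.

def desired_instance_count(current, cpu_percent,
                           scale_out_at=70, scale_in_at=30,
                           minimum=1, maximum=10):
    """Return the new instance count for a simple threshold policy."""
    if cpu_percent > scale_out_at:
        current += 1          # scale out: add a machine
    elif cpu_percent < scale_in_at:
        current -= 1          # scale in: remove a machine
    return max(minimum, min(maximum, current))
```

Real AWS scaling policies work on the same principle, but react to CloudWatch alarms and support step and target-tracking variants.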
CLOUD COMPUTING ON AWS
• Rich service catalog
• «Pay as you go» payment model
• Cloud Computing models:
 IaaS (Infrastructure as a Service)
o EC2, S3
 PaaS (Platform as a Service)
o Elastic Beanstalk
 SaaS (Software as a Service)
o DynamoDB, Elasticsearch
DATA STORAGE ON AWS
• There are many different data storage services for different purposes
• Key features for storage:
 Durability
 Security
 Cost
 Scalability
 Access Performance
 Availability
 Reliability
• Data storage services on AWS:
 S3
 Glacier
S3
• Secure, durable and highly-available storage service
• Bucket: Containers for objects
 Globally unique regardless of the AWS region
• Object: Stored entities
 Uniquely identified within a bucket by a name and a version ID
• Supports versioning
• WORM (Write once, read many)
• Consistency models
 «read-after-write» consistency for puts of new objects
 «eventual consistency» for puts/deletes of existing objects
 Updates to a single object are atomic.
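The consistency rules above (as S3 behaved in 2016) can be illustrated with a toy in-memory model: a brand-new key is visible immediately, while an overwrite may keep serving the old value until propagation completes. `ToyS3Bucket` and `settle()` are illustrative names, not AWS APIs.

```python
# Toy model of 2016-era S3 consistency:
#  - «read-after-write» for puts of NEW objects
#  - «eventual consistency» for puts of EXISTING objects

class ToyS3Bucket:
    def __init__(self):
        self._live = {}      # what a read may still return
        self._latest = {}    # the last write, visible after propagation

    def put(self, key, value):
        if key not in self._latest:
            self._live[key] = value   # new object: visible immediately
        self._latest[key] = value     # overwrite: visible after settle()

    def get(self, key):
        return self._live.get(key)

    def settle(self):
        self._live = dict(self._latest)   # propagation completes
```

Until `settle()` runs, a `get` after an overwrite returns the previous value — exactly the «eventual consistency» case on the slide.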
GLACIER
• Storage service optimized for infrequently used data
• Archives are stored in containers named «vaults»
• Uploading is a synchronous operation, with data replicated on write
• Downloading is an asynchronous operation through a retrieval job
• Immutable
• Encrypted by default
• «S3» data can be archived into «Glacier» by life-cycle rules
• 4x cheaper than «S3»
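The archiving-by-lifecycle-rule bullet above maps to an S3 lifecycle configuration document. A sketch of its shape (bucket prefix and transition age are hypothetical values):

```python
import json

# Sketch of an S3 lifecycle rule that transitions objects under the
# hypothetical "logs/" prefix into Glacier after 30 days.
lifecycle = {
    "Rules": [{
        "ID": "archive-logs-to-glacier",
        "Filter": {"Prefix": "logs/"},
        "Status": "Enabled",
        "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
    }]
}

print(json.dumps(lifecycle, indent=2))
```

A document of this shape is what gets attached to a bucket as its lifecycle configuration; after the transition, objects are billed at Glacier rates but retrieved through Glacier's asynchronous job model.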
DATA ANALYTICS & QUERYING ON AWS
• There are different data analytics & querying services for different purposes:
 NoSQL
 SQL
 Text search
 Analytics / Visualizing
• Data analytics & querying services on AWS:
 DynamoDB
 Redshift
 RDS
 Elasticsearch (+ Kibana)
 CloudSearch
 QuickSight
DYNAMODB
• Key-value or document store based NoSQL database
• Consistency model:
 Eventual consistency by default
 Strong consistency as an option
• Supports:
 Secondary indexes
 Batch operations
 Conditional writes
 Atomic counters
 «DynamoDB Streams» for tracking changes in near real time
• Integrated with:
 EMR
 Elasticsearch
 Redshift
 Lambda
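Two of the DynamoDB write features named on this slide — conditional writes and atomic counters — can be sketched against a toy in-memory table. `ToyTable` and its methods are illustrative stand-ins, not the DynamoDB API.

```python
# Toy model of two DynamoDB write patterns:
#  - conditional write: rejected unless the stored value matches expectations
#  - atomic counter: an increment applied as a single update, no lost writes

class ConditionFailed(Exception):
    pass

class ToyTable:
    def __init__(self):
        self.items = {}

    def conditional_put(self, key, value, expected):
        """Write only if the current value equals `expected` (None = absent)."""
        if self.items.get(key) != expected:
            raise ConditionFailed(key)
        self.items[key] = value

    def atomic_increment(self, key, by=1):
        """Increment a numeric attribute and return the new value."""
        self.items[key] = self.items.get(key, 0) + by
        return self.items[key]
```

In real DynamoDB the same ideas appear as condition expressions on `PutItem` and `ADD` update expressions.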
REDSHIFT
• SQL data warehouse solution
• Fault tolerant by
 Replicating to other nodes in the cluster
 Backing up to «S3» continuously and automatically
• Scalable
 Vertically, by switching to a cluster with larger nodes
 Horizontally, by adding new nodes to the cluster
• Integrated with:
 S3
 EMR
 DynamoDB
 Kinesis
 Data Pipeline
RDS
• Makes it easier to set up, operate and scale a relational
database in the cloud
• Manages backups, software patching, automatic failure
detection and recovery
• Supports
 MySQL
 PostgreSQL
 Oracle
 Microsoft SQL Server
 Aurora (Amazon’s MySQL-compatible RDBMS, claiming up to 5x MySQL performance)
ELASTICSEARCH
• Makes it easy to deploy, operate and scale «Elasticsearch» in the
AWS cloud
• Supports
 Taking snapshots for backup
 Restoring from backups
 Replicating domains across Availability Zones
• Integrated with
 «Kibana» for data visualization
 «Logstash» to process logs and load them into «ES»
 «S3», «Kinesis» and «DynamoDB» for loading data into «ES»
CLOUDSEARCH
• AWS’s managed and high-performance search solution
• Reliable by replicating the search domain across multiple Availability Zones
• Automatic monitoring and recovery
• Auto scalable
• Supports popular search features, in 34 languages, such as:
 Free-text search
 Faceted search
 Highlighting
 Autocomplete suggestions
 Geospatial search
 …
QUICKSIGHT
• AWS’s new very fast, cloud-powered business intelligence (BI)
service
• Uses a new, Super-fast, Parallel, In-memory Calculation Engine
(«SPICE») to perform advanced calculations and render
visualizations rapidly
• Supports creating analysis stories and sharing them
• Claimed to cost 1/10th of traditional BI solutions
• Native access on major mobile platforms
• Integrated/supported data sources:
 EMR
 Kinesis
 S3
 DynamoDB
 Redshift
 RDS
 3rd party (e.g. Salesforce)
 Local file upload
 …
DATA PROCESSING ON AWS
• Allows processing huge amounts of data in a highly scalable, distributed way
• Supports both
 Batch
 Stream
data processing concepts
• Data processing services on AWS:
 EMR (Elastic Map-Reduce)
 Kinesis
 Lambda
 Machine Learning
EMR (ELASTIC MAPREDUCE)
• Big data service/infrastructure to process large amounts of
data efficiently in highly-scalable manner
• Supported Hadoop distributions:
 Amazon
 MapR
• Low cost by «on-demand», «spot» or «reserved» instances
• Reliable by retrying failed tasks and automatically replacing
poorly performing or failed instances
• Elastic by resizing cluster on the fly
• Integrated with:
 S3
 DynamoDB
 Data Pipeline
 Redshift
 RDS
 Glacier
 VPC
 …
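The MapReduce model that EMR runs at cluster scale can be shown in-process with the classic word count: a map phase emits (word, 1) pairs, a reduce phase sums them per word. A minimal single-machine sketch:

```python
from collections import Counter
from itertools import chain

# Map phase: each input line becomes a stream of (word, 1) pairs.
def map_phase(lines):
    return chain.from_iterable(
        ((word, 1) for word in line.split()) for line in lines
    )

# Reduce phase: group by key (the word) and sum the counts.
def reduce_phase(pairs):
    counts = Counter()
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

result = reduce_phase(map_phase(["big data", "big cloud"]))
```

On EMR the same two phases run as distributed Hadoop tasks, with the shuffle between them routing all pairs for a given word to the same reducer.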
EMR – CLUSTER COMPONENTS
• Master Node
 Tracks, monitors and manages the cluster
 Runs the «YARN Resource Manager» and
the «HDFS Name Node Service»
• Core Node [Slave Node]
 Runs tasks and stores data in the HDFS
 Runs the «YARN Node Manager Service»
and «HDFS Data Node Daemon» service
• Task Node [Slave Node]
 Only runs tasks
 Runs the «YARN Node Manager Service»
EMR – BEST PRACTICES FOR INSTANCE GROUPS

Project                 | Master Instance Group | Core Instance Group | Task Instance Group
Long-running clusters   | On-Demand             | On-Demand           | Spot
Cost-driven workloads   | Spot                  | Spot                | Spot
Data-critical workloads | On-Demand             | On-Demand           | Spot
Application testing     | Spot                  | Spot                | Spot
EMR – CLUSTER LIFECYCLE
EMR - STORAGE
• The storage layer includes the different file systems for
different purposes as follows:
 HDFS (Hadoop Distributed File System)
 EMRFS (Elastic MapReduce File System)
 Local Storage
EMR - HDFS
• Distributed, scalable file system for Hadoop
• Stores multiple copies to prevent data loss on failure
• It is an ephemeral storage because it is reclaimed when cluster
terminates
• Used by the master and core nodes
• Useful for intermediate results during Map-Reduce processing
or for workloads which have significant random I/O
• Prefix: «hdfs://»
EMR - EMRFS
• An implementation of HDFS which allows clusters to store data
on «AWS S3»
• Eventually consistent by default on top of «AWS S3»
• Consistent view can be enabled with the following supports:
 Read-after-write consistency
 Delete consistency
 List consistency
• Client/Server side encryption
• Most often, «EMRFS» (S3) is used to store input and output
data and intermediate results are stored in «HDFS»
• Prefix: «s3://»
EMR – LOCAL STORAGE
• Refers to an ephemeral volume directly attached to an EC2
instance
• Ideal for storing temporary data that is continually changing,
such as
 Buffers
 Caches
 Scratch data
 And other temporary content
• Prefix: «file://»
EMR – DATA PROCESSING FRAMEWORKS
• Hadoop
• Spark
• Hive
• Pig
• HBase
• HCatalog
• Zookeeper
• Presto
• Impala
• Sqoop
• Oozie
• Mahout
• Phoenix
• Tez
• Ganglia
• Hue
• Zeppelin
• …
KINESIS
• Collects and processes large streams of data records in real time
• Reliable by synchronously replicating streaming data across
facilities in an AWS Region
• Elastic by increasing shards based on the volume of input data
• Has connectors for
 S3
 DynamoDB
 Elasticsearch
• Also integrated with «Storm» and «Spark»
KINESIS – HIGH LEVEL ARCHITECTURE
KINESIS – KEY CONCEPTS[1]
• Streams
 Ordered sequence of data records
 All data is stored for 24 hours (can be increased to 7 days)
• Shards
 A uniquely identified sequence of data records in a stream
 Kinesis streams are scaled by adding or removing shards
 Provides 1MB/sec data input and 2MB/sec data output capacity
 Supports up to 1000 TPS (put records per second)
• Producers
 Put records into streams
• Consumers
 Get records from streams and process them
KINESIS – KEY CONCEPTS[2]
• Partition Key
 Used to group data by shard within a stream
 Used to determine which shard a given data record belongs to
• Sequence Number
 Each record has a unique sequence number in the owned shard
 Increases over time
 Specific to a shard within a stream, not across all shards
• Record
 Unit of data stored
 Composed of <partition-key, sequence-number, data-blob>
 Data size can be up to 1 MB
 Immutable
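How a partition key selects a shard can be sketched concretely: Kinesis hashes the partition key with MD5, reads the digest as a 128-bit integer, and matches it against each shard's hash-key range. The sketch below assumes an evenly split stream (equal-width ranges), which is the simple case; real shards carry explicit ranges that can differ after splits and merges.

```python
import hashlib

# Map a partition key to a shard index for a stream of `num_shards`
# shards whose hash-key ranges evenly split the 128-bit MD5 space.
def shard_for(partition_key, num_shards):
    digest = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_width = 2 ** 128 // num_shards
    # min() guards the top of the space when 2**128 isn't divisible evenly.
    return min(digest // range_width, num_shards - 1)
```

Because the mapping is a pure function of the key, all records sharing a partition key land in the same shard — which is what preserves their ordering.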
KINESIS – GETTING DATA IN
• Producers use «PUT» call to store data in
a stream
• A partition key is used to distribute the
«PUT»s across shards
• A unique sequence # is returned to the
producer for each event
• Data can be ingested at 1MB/second or
1000 Transactions/second per shard
KINESIS – GETTING DATA OUT
• Kinesis Client Library (KCL) simplifies reading from the stream by abstracting away individual shards
• Automatically starts a worker thread for
each shard
• Increases and decreases thread count as
number of shards changes
• Uses checkpoints to keep track of a
thread’s location in the stream
• Restarts threads & workers if they fail
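The checkpointing idea above — record the last sequence number handled so a restarted worker resumes where its predecessor stopped — can be modeled with a toy store. `ToyCheckpointStore` is an illustrative stand-in for the DynamoDB lease table the KCL actually uses.

```python
# Toy model of KCL-style checkpointing per shard.

class ToyCheckpointStore:
    def __init__(self):
        self._checkpoints = {}

    def checkpoint(self, shard_id, sequence_number):
        self._checkpoints[shard_id] = sequence_number

    def resume_after(self, shard_id):
        return self._checkpoints.get(shard_id)

def process(records, store, shard_id="shard-0"):
    """Handle (sequence_number, data) pairs, skipping already-seen records."""
    last = store.resume_after(shard_id)
    for seq, data in records:
        if last is not None and seq <= last:
            continue                      # already processed before a restart
        store.checkpoint(shard_id, seq)   # mark progress after handling data
```

Re-running `process` over the same records is harmless: everything at or below the checkpoint is skipped, which is how the KCL achieves at-least-once processing across worker restarts.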
LAMBDA
• Compute service that runs your uploaded code on your behalf using AWS infrastructure
• Stateless
• Auto-scalable
• Pay per use
• Use cases
 As an event-driven compute service triggered by actions
on «S3», «Kinesis», «DynamoDB», «SNS», «CloudWatch»
 As a compute service to run your code in response to HTTP
requests using «AWS API Gateway»
• Supports «Java», «Python» and «Node.js» languages
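A Python Lambda function is just a handler with the `(event, context)` signature. The sketch below pulls object keys out of an S3 trigger event; the event excerpt shows the relevant fields only, not the full S3 notification schema.

```python
import json

# Minimal Lambda handler shape: receives the trigger event and a runtime
# context, returns a JSON-serializable result.
def handler(event, context):
    # For an S3 trigger, each record carries the bucket/object that changed.
    keys = [rec["s3"]["object"]["key"] for rec in event.get("Records", [])]
    return {"statusCode": 200, "body": json.dumps({"keys": keys})}
```

Deployed as a Lambda, this runs once per S3 notification with no servers to manage — the pattern the demo uses later to index stored tweets into Elasticsearch.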
MACHINE LEARNING
• Machine learning service for building ML models and
generating predictions
• Provides implementations of common ML data transformations
• Supports industry-standard ML model types such as
 binary classification
 multi-class classification
 regression
• Supports batch and real-time predictions
• Integrated with
• S3
• RDS
• Redshift
DATA FLOW ON AWS
• Allows transferring data between data
storage/processing points
• Data flow services on AWS:
 Firehose
 Data Pipeline
 DMS (Database Migration Service)
 Snowball
FIREHOSE
• Makes it easy to capture and load massive volumes of streaming data into AWS
• Supports endpoints
 S3
 Redshift
 Elasticsearch
• Supports buffering
 By size: Ranges from 1 MB to 128 MB. Default 5 MB
 By interval: Ranges from 60 seconds to 900 seconds.
Default 5 minutes
• Supports encryption and compression
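The two buffering hints above — flush on size or on interval, whichever fires first — can be simulated with a toy buffer. `ToyBuffer` and its `elapsed` bookkeeping are illustrative; real Firehose tracks wall-clock time itself.

```python
# Toy simulation of Firehose buffering: records accumulate until either
# the size threshold or the time threshold is reached, then flush as a batch.

class ToyBuffer:
    def __init__(self, max_bytes=5 * 1024 * 1024, max_seconds=300):
        self.max_bytes, self.max_seconds = max_bytes, max_seconds
        self.records, self.size, self.age = [], 0, 0

    def add(self, record, elapsed=0):
        """Add one record; return the flushed batch if a threshold tripped."""
        self.records.append(record)
        self.size += len(record)
        self.age += elapsed
        if self.size >= self.max_bytes or self.age >= self.max_seconds:
            return self.flush()
        return None

    def flush(self):
        batch, self.records, self.size, self.age = self.records, [], 0, 0
        return batch
```

The defaults mirror the slide's 5 MB / 5 minute figures; tightening either threshold trades delivery latency against the number (and cost) of writes to the destination.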
DATA PIPELINE
• Automates the movement and transformation of data
• Composed of workflows of tasks
• Flows can be scheduled
• Has a visual designer with drag-and-drop support
• Definitions can be exported/imported
• Supports custom activities and preconditions
• Retries the flow on failure and sends notifications
• Integrated with:
 EMR
 S3
 DynamoDB
 Redshift
 RDS
DATABASE MIGRATION SERVICE
• Helps to migrate databases into AWS easily and securely
• Supports
 Homogeneous migrations (e.g. Oracle => Oracle)
 Heterogeneous migrations (e.g. Oracle => PostgreSQL)
• Also used for:
 Database consolidation
 Continuous data replication
• Keeps your applications running during migration:
 Start a replication instance
 Connect source and target databases
 Let «DMS» load data and keep them in sync
 Switch applications over to the target at your convenience
SNOWBALL
• Import/Export service that accelerates transferring large
amounts of data into and out of AWS using physical storage
appliances, bypassing the Internet
• Data is transported securely in encrypted form
• Data is transferred via a 10G connection
• Has 50TB or 80TB storage options
• Steps:
 Job Create
 Processing
 In transit to you
 Delivered to you
 In transit to AWS
 At AWS
 Importing
 Completed
DEMO TIME
Source code and instructions are available at
github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo
STEPS OF DEMO
• Storing
• Searching
• Batch Processing
• Stream Processing
• Analyzing
STORING
[STORING]
CREATING DELIVERY STREAM
• Create an «AWS S3» bucket to store tweets
• Create an «AWS Firehose» delivery stream to push tweets into for storage
• Attach the created bucket as the stream's endpoint
Then:
• Listened tweet data can be pushed into this stream to be stored
• Instructions:
github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-storing-firehose
[STORING]
CRAWLING TWEETS
• Configure the stream name to be used by the crawler application
• Build the «bigdata-demo-storing-crawler» application
• Deploy it into «AWS Elastic Beanstalk»
Then:
• The application listens to tweets via the «Twitter Streaming API»
• The application pushes tweet data into the «AWS Firehose» stream
• The «AWS Firehose» stream stores tweets in the «AWS S3» bucket
• Source code and instructions:
github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-storing-crawler
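One hypothetical shape for the records the crawler pushes into the Firehose stream: each tweet serialized as one line of JSON, so the S3 objects Firehose writes can later be split line by line in the batch jobs. The field names here are illustrative, not the Twitter API schema.

```python
import json

# Serialize one tweet as a newline-delimited JSON record (bytes), the
# form Firehose will concatenate into S3 objects for downstream jobs.
def to_firehose_record(tweet):
    return (json.dumps(tweet, separators=(",", ":")) + "\n").encode("utf-8")

record = to_firehose_record({"id": 1, "text": "hello big data"})
```

Newline-delimited JSON matters because Firehose concatenates records into a single object; without the trailing delimiter the Hadoop and Hive steps could not split the file into individual tweets.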
SEARCHING
[SEARCHING]
CREATING SEARCH DOMAIN
• Create an «AWS Elasticsearch» domain
• Wait until it becomes active and ready to use
Then:
• Tweet data can also be indexed to be searched
• Instructions:
github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-searching-elasticsearch
[SEARCHING]
INDEXING INTO SEARCH DOMAIN
• Create an «AWS Lambda» function with the given code
• Configure the created «AWS S3» bucket as its trigger
• Wait until it becomes active and ready to use
Then:
• Stored tweet data will also be indexed to be searched
• Source code and instructions:
github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-searching-lambda
BATCH PROCESSING
[BATCH PROCESSING]
CREATING EMR CLUSTER
• Create an «AWS EMR» cluster
• Wait until it becomes active and ready to use
Then:
• Stored tweet data can be processed in batch
• Instructions:
github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-batchprocessing-emr
[BATCH PROCESSING]
CREATING TABLE TO STORE BATCH RESULT
• Create an «AWS DynamoDB» table to save batch results
• Wait until it becomes active and ready to use
Then:
• Analysis results can be saved into this table
• Instructions:
github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-batchprocessing-dynamodb
[BATCH PROCESSING]
PROCESSING TWEETS VIA HADOOP
• Build & upload «bigdata-demo-batchprocessing-hadoop» to «AWS S3»
• Run the uploaded JAR as a «Custom JAR» step on the «AWS EMR» cluster
• Wait until the step has completed
Then:
• Processed tweet data is in the «AWS S3» bucket
• Analysis results are in the «AWS DynamoDB» table
• Instructions:
github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-batchprocessing-hadoop
[BATCH PROCESSING]
QUERYING TWEETS VIA HIVE
• Upload the Hive query file to «AWS S3»
• Run the uploaded query as a «Hive» step on the «AWS EMR» cluster
• Wait until the step has completed
Then:
• Query results are dumped to «AWS S3»
• Instructions:
github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-batchprocessing-hive
[BATCH PROCESSING]
SCHEDULING WITH DATA PIPELINE
• Import the «AWS Data Pipeline» definition
• Activate the pipeline
Then:
• Every day, all batch processing jobs (Hadoop + Hive) will be executed automatically
• Instructions:
github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-batchprocessing-datapipeline
STREAM PROCESSING
[STREAM PROCESSING]
CREATING REALTIME STREAM
• Create an «AWS Kinesis» stream to push tweets into for analysis
• Wait until it becomes active and ready to use
Then:
• Listened tweet data can be pushed into this stream to be analyzed
• Instructions:
github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-streamprocessing-kinesis
[STREAM PROCESSING]
PRODUCING STREAM DATA
• Configure the stream name to be used by the producer application
• Build the «bigdata-demo-streamprocessing-kinesis-producer» application
• Deploy it into «AWS Elastic Beanstalk»
Then:
• The application listens to tweets via the «Twitter Streaming API»
• Listened tweets are pushed into the «AWS Kinesis» stream
• Source code and instructions:
github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-streamprocessing-kinesis-producer
[STREAM PROCESSING]
CREATING TABLE TO SAVE STREAM RESULT
• Create an «AWS DynamoDB» table to save realtime results
• Wait until it becomes active and ready to use
Then:
• Analysis results can be saved into this table
• Instructions:
github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-streamprocessing-dynamodb
[STREAM PROCESSING]
CONSUMING STREAM DATA
• Configure
 the «AWS Kinesis» stream name to consume tweets from
 the «AWS DynamoDB» table name to save results into
• Build the «bigdata-demo-streamprocessing-consumer» application
• Deploy it into «AWS Elastic Beanstalk»
Then:
• The application consumes tweets from the «AWS Kinesis» stream
• Sentiment analysis is applied to tweets via «Stanford NLP»
• Analysis results are saved into the «AWS DynamoDB» table in realtime
• Source code and instructions:
github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-streamprocessing-kinesis-consumer
ANALYZING
[ANALYZING]
ANALYTICS VIA QUICKSIGHT
• Create a data source from «AWS S3»
• Create an analysis based on the data source from «AWS S3»
• Wait until data import and analysis have completed
Then:
• You can do your
 Query
 Filter
 Visualization
 Story (a flow of visualizations, like a presentation)
work on your dataset and share it
THE BIG PICTURE
END OF DEMO
THANKS!!!

Introduction Computer Science - Software Design.pdfFerryKemperman
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprisepreethippts
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesŁukasz Chruściel
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtimeandrehoraa
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样umasea
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Matt Ray
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanyChristoph Pohl
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based projectAnoyGreter
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Natan Silnitsky
 

Dernier (20)

Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. Salesforce
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprise
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based project
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
 

Big data on aws

• «S3» data can be archived into «Glacier» by life-cycle rules
• 4x cheaper than «S3»
16
DATA ANALYTICS & QUERYING ON AWS
• There are different data analytics & querying services for different purposes:
 NoSQL
 SQL
 Text search
 Analytics / Visualizing
• Data analytics & querying services on AWS:
 DynamoDB
 Redshift
 RDS
 Elasticsearch (+ Kibana)
 CloudSearch
 QuickSight
18
DYNAMODB
• Key-value or document storage based NoSQL database
• Consistency model:
 Eventual consistency by default
 Strong consistency as optional
• Supports:
 Secondary indexes
 Batch operations
 Conditional writes
 Atomic counters
• «DynamoDB Streams» for tracking changes in near real time
• Integrated with:
 EMR
 Elasticsearch
 Redshift
 Lambda
19
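The conditional writes and atomic counters listed above can be illustrated with a small local sketch. This is not DynamoDB or its SDK — just a toy in-memory table mimicking the semantics (in real DynamoDB these map to a condition expression on `PutItem` and an `ADD` update action); all table and attribute names here are made up.

```python
# Local sketch of DynamoDB-style conditional writes and atomic counters.
# Real code would call an AWS SDK; this only illustrates the semantics.

class ConditionalWriteError(Exception):
    """Raised when a conditional write fails, analogous to DynamoDB's
    ConditionalCheckFailedException."""

class MiniTable:
    def __init__(self):
        self.items = {}  # key -> attribute dict

    def put_if_absent(self, key, item):
        # Conditional write: succeed only if the key does not exist yet.
        if key in self.items:
            raise ConditionalWriteError(f"item {key!r} already exists")
        self.items[key] = dict(item)

    def atomic_increment(self, key, attr, by=1):
        # Atomic counter: the update is applied in one step, so concurrent
        # writers never read-modify-write a stale value.
        item = self.items.setdefault(key, {})
        item[attr] = item.get(attr, 0) + by
        return item[attr]

table = MiniTable()
table.put_if_absent("tweet#1", {"text": "hello"})
table.atomic_increment("stats", "tweet_count")
table.atomic_increment("stats", "tweet_count")
```

A second `put_if_absent("tweet#1", ...)` would raise, which is exactly how idempotent inserts are built on conditional writes.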
REDSHIFT
• SQL data warehouse solution
• Fault tolerant by:
 Replicating to other nodes in the cluster
 Backing up to «S3» continuously and automatically
• Scalable:
 Vertically, by allowing switching to a new cluster
 Horizontally, by adding new nodes to the cluster
• Integrated with:
 S3
 EMR
 DynamoDB
 Kinesis
 Data Pipeline
20
RDS
• Makes it easier to set up, operate and scale a relational database in the cloud
• Manages backups, software patching, automatic failure detection and recovery
• Supports:
 MySQL
 PostgreSQL
 Oracle
 Microsoft SQL Server
 Aurora (Amazon’s MySQL-based RDBMS with 5x performance)
21
ELASTICSEARCH
• Makes it easy to deploy, operate and scale «Elasticsearch» in the AWS cloud
• Supports:
 Taking snapshots for backup
 Restoring from backups
 Replicating domains across Availability Zones
• Integrated with:
 «Kibana» for data visualization
 «Logstash» to process logs and load them into «ES»
 «S3», «Kinesis» and «DynamoDB» for loading data into «ES»
22
CLOUDSEARCH
• AWS’s managed and high-performance search solution
• Reliable by replicating the search domain into multiple AZs
• Automatic monitoring and recovery
• Auto scalable
• Supports popular search features, on 34 languages, such as:
 Free-text search
 Faceted search
 Highlighting
 Autocomplete suggestion
 Geospatial search
 …
23
QUICKSIGHT
• AWS’s new very fast, cloud-powered business intelligence (BI) service
• Uses a new, Super-fast, Parallel, In-memory Calculation Engine («SPICE») to perform advanced calculations and render visualizations rapidly
• Supports creating analysis stories and sharing them
• Integrated/Supported data sources:
 EMR
 Kinesis
 S3
 DynamoDB
 Redshift
 RDS
 3rd party (i.e. Salesforce)
 Local file upload
 …
• Claimed to be 1/10th the cost of traditional BI solutions
• Native access on major mobile platforms
24
DATA PROCESSING ON AWS
• Allows processing of huge amounts of data in a highly scalable way, as distributed
• Supports both:
 Batch
 Stream data processing concepts
• Data processing services on AWS:
 EMR (Elastic MapReduce)
 Kinesis
 Lambda
 Machine Learning
26
EMR (ELASTIC MAPREDUCE)
• Big data service/infrastructure to process large amounts of data efficiently in a highly-scalable manner
• Supported Hadoop distributions:
 Amazon
 MapR
• Low cost by «on-demand», «spot» or «reserved» instances
• Reliable by retrying failed tasks and automatically replacing poorly performing or failed instances
• Elastic by resizing the cluster on the fly
• Integrated with:
 S3
 DynamoDB
 Data Pipeline
 Redshift
 RDS
 Glacier
 VPC
 …
27
EMR – CLUSTER COMPONENTS
• Master Node
 Tracks, monitors and manages the cluster
 Runs the «YARN Resource Manager» and the «HDFS Name Node Service»
• Core Node [Slave Node]
 Runs tasks and stores data in the HDFS
 Runs the «YARN Node Manager Service» and the «HDFS Data Node Daemon»
• Task Node [Slave Node]
 Only runs tasks
 Runs the «YARN Node Manager Service»
28
EMR – BEST PRACTICES FOR INSTANCE GROUPS
Project                  Master Instance Group  Core Instance Group  Task Instance Group
Long-running clusters    On-Demand              On-Demand            Spot
Cost-driven workloads    Spot                   Spot                 Spot
Data-critical workloads  On-Demand              On-Demand            Spot
Application testing      Spot                   Spot                 Spot
29
EMR – CLUSTER LIFECYCLE
30
EMR – STORAGE
• The storage layer includes the different file systems for different purposes as follows:
 HDFS (Hadoop Distributed File System)
 EMRFS (Elastic MapReduce File System)
 Local Storage
31
EMR – HDFS
• Distributed, scalable file system for Hadoop
• Stores multiple copies to prevent data loss on failure
• It is an ephemeral storage because it is reclaimed when the cluster terminates
• Used by the master and core nodes
• Useful for intermediate results during MapReduce processing or for workloads which have significant random I/O
• Prefix: «hdfs://»
32
EMR – EMRFS
• An implementation of HDFS which allows clusters to store data on «AWS S3»
• Eventually consistent by default on top of «AWS S3»
• Consistent view can be enabled with the following supports:
 Read-after-write consistency
 Delete consistency
 List consistency
• Client/Server side encryption
• Most often, «EMRFS» (S3) is used to store input and output data, and intermediate results are stored in «HDFS»
• Prefix: «s3://»
33
EMR – LOCAL STORAGE
• Refers to an ephemeral volume directly attached to an EC2 instance
• Ideal for storing temporary data that is continually changing, such as:
 Buffers
 Caches
 Scratch data
 And other temporary content
• Prefix: «file://»
34
EMR – DATA PROCESSING FRAMEWORKS
• Hadoop
• Spark
• Hive
• Pig
• HBase
• HCatalog
• Zookeeper
• Presto
• Impala
• Sqoop
• Oozie
• Mahout
• Phoenix
• Tez
• Ganglia
• Hue
• Zeppelin
• …
35
KINESIS
• Collects and processes large streams of data records in real time
• Reliable by synchronously replicating streaming data across facilities in an AWS Region
• Elastic by increasing shards based on the volume of input data
• Has connectors for:
 S3
 DynamoDB
 Elasticsearch
• Also integrated with «Storm» and «Spark»
36
KINESIS – HIGH LEVEL ARCHITECTURE
37
KINESIS – KEY CONCEPTS[1]
• Streams
 Ordered sequence of data records
 All data is stored for 24 hours (can be increased to 7 days)
• Shards
 A uniquely identified group of data records in a stream
 Kinesis streams are scaled by adding or removing shards
 Each provides 1 MB/sec data input and 2 MB/sec data output capacity
 Supports up to 1000 TPS (put records per second)
• Producers
 Put records into streams
• Consumers
 Get records from streams and process them
38
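Given the per-shard limits above (1 MB/sec in, 2 MB/sec out, 1000 put records/sec), a stream's shard count can be estimated. The sizing function below is a minimal sketch read directly off those limits, not an official AWS calculator.

```python
import math

def required_shards(in_mb_per_sec, out_mb_per_sec, records_per_sec):
    # Each shard gives 1 MB/s input, 2 MB/s output and 1000 put records/s,
    # so the stream needs enough shards to cover the tightest of the three.
    return max(
        math.ceil(in_mb_per_sec / 1.0),    # input bandwidth limit
        math.ceil(out_mb_per_sec / 2.0),   # output bandwidth limit
        math.ceil(records_per_sec / 1000.0),  # put-rate limit
        1,                                 # a stream has at least one shard
    )

# e.g. 5 MB/s in, 8 MB/s out, 3500 records/s -> input-bound at 5 shards
print(required_shards(5, 8, 3500))  # → 5
```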
KINESIS – KEY CONCEPTS[2]
• Partition Key
 Used to group data by shard within a stream
 Used to determine which shard a given data record belongs to
• Sequence Number
 Each record has a unique sequence number in the owned shard
 Increases over time
 Specific to a shard within a stream, not across all shards
• Record
 Unit of data stored
 Composed of <partition-key, sequence-number, data-blob>
 Data size can be up to 1 MB
 Immutable
39
KINESIS – GETTING DATA IN
• Producers use the «PUT» call to store data in a stream
• A partition key is used to distribute the «PUT»s across shards
• A unique sequence # is returned to the producer for each event
• Data can be ingested at 1 MB/second or 1000 transactions/second per shard
40
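How a partition key selects a shard can be sketched as follows: Kinesis hashes the partition key with MD5 into a 128-bit hash key space that is divided among the shards. The even split below is an assumption for illustration — real shards carry explicit hash key ranges that may be merged or split.

```python
import hashlib

def shard_for_key(partition_key, shard_count):
    # MD5 maps the partition key to a 128-bit integer; the record goes to
    # the shard owning that point of the hash key space. Here [0, 2^128)
    # is assumed to be split evenly across shard_count shards.
    h = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = 2 ** 128 // shard_count
    return min(h // range_size, shard_count - 1)

# The same key always lands on the same shard, which is what keeps
# per-key ordering within a stream.
shard = shard_for_key("user-42", 4)
```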
KINESIS – GETTING DATA OUT
• The Kinesis Client Library (KCL) simplifies reading from the stream by abstracting you from individual shards
• Automatically starts a worker thread for each shard
• Increases and decreases the thread count as the number of shards changes
• Uses checkpoints to keep track of a thread’s location in the stream
• Restarts threads & workers if they fail
41
LAMBDA
• Compute service that runs your uploaded code on your behalf using AWS infrastructure
• Stateless
• Auto-scalable
• Pay per use
• Use cases:
 As an event-driven compute service triggered by actions on «S3», «Kinesis», «DynamoDB», «SNS», «CloudWatch»
 As a compute service to run your code in response to HTTP requests using «AWS API Gateway»
• Supports «Java», «Python» and «Node.js» languages
42
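A minimal sketch of a Python Lambda handler for the S3-trigger use case above. The event layout follows the S3 notification format; the processing itself is stubbed out, and the bucket/key names in the fake event are made up.

```python
# Shape of a Python Lambda handler for an S3 trigger. A real function
# would fetch each object and act on it (e.g. index it); here we just
# collect what would be processed.

def handler(event, context):
    processed = []
    for record in event.get("Records", []):
        # S3 notification events carry bucket and object info per record.
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        processed.append((bucket, key))
    return {"processed": len(processed), "objects": processed}

# Lambda invokes handler(event, context); locally we can fake an event:
fake_event = {"Records": [
    {"s3": {"bucket": {"name": "tweets"}, "object": {"key": "2016/07/t1.json"}}},
]}
result = handler(fake_event, None)
```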
MACHINE LEARNING
• Machine learning service for building ML models and generating predictions
• Provides implementations of common ML data transformations
• Supports industry-standard ML algorithms such as:
 Binary classification
 Multi-class classification
 Regression
• Supports batch and real-time predictions
• Integrated with:
 S3
 RDS
 Redshift
43
DATA FLOW ON AWS
44
DATA FLOW ON AWS
• Allows transferring data between data storage/processing points
• Data flow services on AWS:
 Firehose
 Data Pipeline
 DMS (Database Migration Service)
 Snowball
45
FIREHOSE
• Makes it easy to capture and load massive volumes of streaming data into AWS
• Supported endpoints:
 S3
 Redshift
 Elasticsearch
• Supports buffering:
 By size: ranges from 1 MB to 128 MB, default 5 MB
 By interval: ranges from 60 seconds to 900 seconds, default 5 minutes
• Supports encryption and compression
46
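The size-or-interval buffering described above can be sketched as follows. This is a toy buffer, not the Firehose implementation: the delivery target is replaced by a plain callback, and the defaults mirror the slide's 5 MB / 5 minutes.

```python
import time

class BufferedDelivery:
    """Firehose-style buffering sketch: flush when either the size
    threshold or the time interval is reached, whichever comes first."""

    def __init__(self, flush, max_bytes=5 * 1024 * 1024, max_seconds=300,
                 clock=time.monotonic):
        self.flush_cb = flush          # stands in for S3/Redshift/ES delivery
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.clock = clock
        self.buffer = []
        self.size = 0
        self.started = clock()

    def put(self, record: bytes):
        self.buffer.append(record)
        self.size += len(record)
        # Flush on whichever threshold is hit first: size or interval.
        if (self.size >= self.max_bytes
                or self.clock() - self.started >= self.max_seconds):
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_cb(self.buffer)
        self.buffer, self.size = [], 0
        self.started = self.clock()

batches = []
# Tiny thresholds so the behavior is visible: 12 bytes in -> one flush.
d = BufferedDelivery(batches.append, max_bytes=10, max_seconds=9999)
for rec in (b"aaaa", b"bbbb", b"cccc"):
    d.put(rec)
```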
DATA PIPELINE
• Automates the movement and transformation of data
• Composed of workflows of tasks
• The flow can be scheduled
• Has a visual designer with drag-and-drop support
• The definition can be exported/imported
• Supports custom activities and preconditions
• Retries the flow in case of failure and sends notifications
• Integrated with:
 EMR
 S3
 DynamoDB
 Redshift
 RDS
47
DATABASE MIGRATION SERVICE
• Helps to migrate databases into AWS easily and securely
• Supports:
 Homogeneous migrations (i.e. Oracle <=> Oracle)
 Heterogeneous migrations (i.e. Oracle <=> PostgreSQL)
• Also used for:
 Database consolidation
 Continuous data replication
• Keeps your applications running during migration:
 Start a replication instance
 Connect source and target databases
 Let «DMS» load data and keep them in sync
 Switch applications over to the target at your convenience
48
SNOWBALL
• Import/Export service that accelerates transferring large amounts of data into and out of AWS using physical storage appliances, bypassing the Internet
• Data is transported securely, as encrypted
• Data is transferred via a 10G connection
• Has 50 TB or 80 TB storage options
• Steps:
 Job created
 Processing
 In transit to you
 Delivered to you
 In transit to AWS
 At AWS
 Importing
 Completed
49
Source code and instructions are available at
github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo
51
STEPS OF DEMO
• Storing
• Searching
• Batch Processing
• Stream Processing
• Analyzing
52
[STORING] CREATING DELIVERY STREAM
• Create an «AWS S3» bucket to store tweets
• Create an «AWS Firehose» stream to push tweets into to be stored
• Attach the created bucket as endpoint to the stream
Then;
• Listened tweet data can be pushed into this stream to be stored
• Instructions: github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-storing-firehose
54
[STORING] CRAWLING TWEETS
• Configure the stream name to be used by the crawler application
• Build the «bigdata-demo-storing-crawler» application
• Deploy into «AWS Elastic Beanstalk»
Then;
• The application listens to tweets via the «Twitter Streaming API»
• The application pushes tweet data into the «AWS Firehose» stream
• The «AWS Firehose» stream stores tweets in the «AWS S3» bucket
• Source code and instructions: github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-storing-crawler
55
[SEARCHING] CREATING SEARCH DOMAIN
• Create an «AWS Elasticsearch» domain
• Wait until it becomes active and ready to use
Then;
• Tweet data can also be indexed to be searched
• Instructions: github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-searching-elasticsearch
57
[SEARCHING] INDEXING INTO SEARCH DOMAIN
• Create an «AWS Lambda» function with the given code
• Configure the created «AWS S3» bucket as trigger
• Wait until it becomes active and ready to use
Then;
• Stored tweet data will also be indexed to be searched
• Source code and instructions: github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-searching-lambda
58
[BATCH PROCESSING] CREATING EMR CLUSTER
• Create an «AWS EMR» cluster
• Wait until it becomes active and ready to use
Then;
• Stored tweet data can be processed as batch
• Instructions: github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-batchprocessing-emr
60
[BATCH PROCESSING] CREATING TABLE TO STORE BATCH RESULT
• Create an «AWS DynamoDB» table to save batch results
• Wait until it becomes active and ready to use
Then;
• Analysis results can be saved into this table
• Instructions: github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-batchprocessing-dynamodb
61
[BATCH PROCESSING] PROCESSING TWEETS VIA HADOOP
• Build & upload «bigdata-demo-batchprocessing-hadoop» to «AWS S3»
• Run the uploaded jar as a «Custom JAR» step on the «AWS EMR» cluster
• Wait until the step has completed
Then;
• Processed tweet data is in the «AWS S3» bucket
• The analysis result is in the «AWS DynamoDB» table
• Instructions: github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-batchprocessing-hadoop
62
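The logic of such a batch step can be sketched as a map/reduce pair. The actual demo job is a Java Hadoop jar run on EMR; this pure-Python hashtag count is only an illustration of the same idea, and the sample tweets are made up.

```python
from collections import defaultdict

def map_phase(tweets):
    # Map: emit (hashtag, 1) for every hashtag in every tweet text.
    for text in tweets:
        for word in text.split():
            if word.startswith("#"):
                yield word.lower(), 1

def reduce_phase(pairs):
    # Reduce: sum the counts per hashtag, which is what Hadoop does
    # for each key after the shuffle/sort phase.
    counts = defaultdict(int)
    for tag, n in pairs:
        counts[tag] += n
    return dict(counts)

tweets = ["Big data on #AWS", "#aws #EMR rocks", "plain tweet"]
result = reduce_phase(map_phase(tweets))  # → {"#aws": 2, "#emr": 1}
```

In the real job the map and reduce functions run on different nodes and the framework handles grouping by key; the per-record logic stays this simple.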
[BATCH PROCESSING] QUERYING TWEETS VIA HIVE
• Upload the Hive query file to «AWS S3»
• Run the uploaded query as a «Hive» step on the «AWS EMR» cluster
• Wait until the step has completed
Then;
• Query results are dumped to «AWS S3»
• Instructions: github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-batchprocessing-hive
63
[BATCH PROCESSING] SCHEDULING WITH DATA PIPELINE
• Import the «AWS Data Pipeline» definition
• Activate the pipeline
Then;
• Every day, all batch processing jobs (Hadoop + Hive) will be executed automatically
• Instructions: github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-batchprocessing-datapipeline
64
[STREAM PROCESSING] CREATING REALTIME STREAM
• Create an «AWS Kinesis» stream to push tweets into to be analyzed
• Wait until it becomes active and ready to use
Then;
• Listened tweet data can be pushed into this stream to be analyzed
• Instructions: github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-streamprocessing-kinesis
66
[STREAM PROCESSING] PRODUCING STREAM DATA
• Configure the stream name to be used by the producer application
• Build the «bigdata-demo-streamprocessing-kinesis-producer» application
• Deploy into «AWS Elastic Beanstalk»
Then;
• The application listens to tweets via the «Twitter Streaming API»
• Listened tweets are pushed into the «AWS Kinesis» stream
• Source code and instructions: github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-streamprocessing-kinesis-producer
67
[STREAM PROCESSING] CREATING TABLE TO SAVE STREAM RESULT
• Create an «AWS DynamoDB» table to save realtime results
• Wait until it becomes active and ready to use
Then;
• Analysis results can be saved into this table
• Instructions: github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-streamprocessing-dynamodb
68
[STREAM PROCESSING] CONSUMING STREAM DATA
• Configure:
 The «AWS Kinesis» stream name for consuming tweets from
 The «AWS DynamoDB» table name for saving results into
• Build the «bigdata-demo-streamprocessing-consumer» application
• Deploy into «AWS Elastic Beanstalk»
Then;
• The application consumes tweets from the «AWS Kinesis» stream
• Sentiment analysis is applied to tweets via «Stanford NLP»
• Analysis results are saved into the «AWS DynamoDB» table in realtime
• Source code and instructions: github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-streamprocessing-kinesis-consumer
69
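The consumer's per-batch work can be sketched as below. The real demo uses the Kinesis Client Library and the Stanford NLP sentiment model; the keyword-based `classify` here is a toy stand-in, and the resulting tally would be written to DynamoDB rather than returned.

```python
from collections import Counter

def classify(text):
    # Stand-in for the Stanford NLP sentiment model used in the demo:
    # a toy keyword rule, purely for illustration.
    positive = {"great", "love", "rocks"}
    negative = {"bad", "hate", "awful"}
    words = set(text.lower().split())
    if words & positive:
        return "POSITIVE"
    if words & negative:
        return "NEGATIVE"
    return "NEUTRAL"

def consume(records):
    # Per batch of stream records: classify each tweet text and keep
    # per-sentiment counters (saved to a DynamoDB table in the demo).
    return dict(Counter(classify(r) for r in records))

tally = consume(["I love AWS", "this is bad", "just a tweet"])
```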
[ANALYZING] ANALYTICS VIA QUICKSIGHT
• Create a data source from «AWS S3»
• Create an analysis based on the data source from «AWS S3»
• Wait until importing data and the analysis have completed
Then;
• You can do your:
 Query
 Filter
 Visualization
 Story (flow of visualizations, like a presentation)
stuff on your dataset and share them
71
73