Public Sector Summit
Washington DC
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Cyber Data Lake: How CIS Analyzes
Billions of Network Traffic Records
per Day
Brian Calkin
Chief Technology Officer
Center for Internet Security
3 0 2 6 3 9
Oliver Atoa
Senior Consultant
AWS/WWPS Proserve
Bob Strahan
Principal Consultant
AWS/WWPS Proserve
Agenda
1. Center for Internet Security (CIS) Netflow Challenge (Brian)
2. Our Solution (Bob and Oliver)
3. Results (Brian)
4. Q&A
Session Goals
1. Educate you with useful architecture concepts
2. Empower you to explore similar approaches for your own business
Familiarity with Data Lakes on AWS presumed
– this is not a Data Lake pitch!
Center for Internet Security
Non-profit using the power of a global
community to develop best practices for
securing IT systems and data
Mission: Identify, develop, validate,
promote, and sustain best practice solutions
for cyber defense. Build and lead
communities to enable an environment of
trust in cyberspace.
CIS Benchmarks and CIS Controls
CIS Benchmarks and CIS Controls are the global standard
and recognized best practices for securing IT systems and data
against the most pervasive attacks
CIS’ proven guidelines are continuously refined and verified by a
volunteer, global community of experienced IT professionals
Multi-State & Elections Infrastructure
Information Sharing and Analysis Centers (MS-ISAC & EI-ISAC)
The MS-ISAC has been designated by DHS as the key
resource for cyber threat prevention, protection, response,
and recovery for the nation’s state, local, tribal, and
territorial governments
Through the EI-ISAC, election agencies gain access to an
elections-focused cyber defense suite, including sector-
specific threat intelligence products, incident response and
remediation, threat and vulnerability monitoring,
cybersecurity awareness and training products, and tools for
implementing security best practices.
~6,000 Member
Organizations
CIS Network Monitoring - Albert
• Network intrusion detection
sensor
• Fully monitored and managed
• State, local, tribal and territorial
government focused
• Open source software on
commodity hardware
• Alert data analyzed 24x7
• ~350 Sensors deployed
nationwide
Albert - NetFlow
• Generated on sensors
• Passive DNS data also collected
• Data is valuable for performing ad-hoc queries
Fields in each record:
• Source IP
• Destination IP
• Source Port
• Destination Port
• TCP Flags
• Number of bytes of traffic sent/received
• Timestamp
Albert – Network Intrusion Detection
• Based on Suricata
• Alerts generated based on known signatures
• ~27,000 Signatures per Sensor
• ~10,000 Albert events analyzed per month
• ~5,000 Albert events escalated to SLTT entities per month
The Challenge
- 48 Million Records Per Minute
- Rate of incoming traffic is not consistent and continually
increasing
- Several petabytes (and growing) of NetFlow data
- Local SAN Storage Full
- Ad-hoc queries take way too long to run
- Hours, Days, Weeks…
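As a rough scale check, the per-minute rate can be converted to per-second and per-day figures. This is only arithmetic on the number quoted above and assumes the rate were sustained all day, which the slide says it is not.

```python
# Rate quoted on the slide; treated here as if it were sustained.
RECORDS_PER_MINUTE = 48_000_000

records_per_second = RECORDS_PER_MINUTE // 60
records_per_day = RECORDS_PER_MINUTE * 60 * 24

print(f"{records_per_second:,} records/second")
print(f"{records_per_day:,} records/day")
```

At that rate the daily total is in the tens of billions, which matches the "billions of records per day" in the session title.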
The Desired Solution
- Store Six Months of NetFlow Data
- High Performance Ad-Hoc SQL Queries
- Goal: Get from Days and Weeks to Seconds and Minutes
- Cost Effective
- Highly Secure
- Extensible Features for Future Enhancements and
Growth
*Looked at Both On-Premises & Cloud Solutions
Our mission
• Ingest binary netflow and DPI records from hundreds of sensors, each generating millions of records every couple of minutes.
• Transform and enrich these records, and save them to cost-efficient storage.
• Provide analysts with fast, seamless SQL query access to all the records; newest (few minutes latency) to oldest (months back).
• Make it secure, cost effective, well instrumented, reliable, scalable to handle future growth, and extensible as a foundation for advanced automated analytics and machine learning.
Data Lake
Ingestion
Ingest binary netflow and DPI records from hundreds of sensors each
generating millions of records every couple of minutes.
• Receiver service on AWS
• Amazon Elastic Compute Cloud (Amazon EC2) Auto Scaling, Multi-AZ, NLB with EIP
• SCP file transfer with client auth (keypairs)
• IP whitelisting
• Receivers convert each incoming sensor file to CSV files and upload each file to Amazon Simple Storage Service (Amazon S3) (not yet enriched or query ready)
• Receiver cluster scales up and down as incoming record volumes fluctuate.
Enrichment and Amazon S3 data lake storage
Transform and enrich flow records and save them to cost efficient storage.
We use an AWS Lambda function
• Triggered immediately as each new CSV
file arrives in Amazon S3 from Receiver
• For each record:
• Detect corrupted records and fix or reject
them
• Enrich good records with additional useful
fields (e.g. IP ASN, directionality, etc.)
• Save records to Amazon S3 (stage0), with
prefixes that define Hive partitions:
• p_sensor=<sensorname>/p_year=YYYY/p_date=YYYY-MM-DD/p_hour=YYYY-MM-DD_hh/<file>.csv.gz
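A minimal sketch of how a key with this partition layout could be built from a sensor name and a flow timestamp. The function name, the sample sensor and file names, and the choice of UTC are assumptions for illustration, not details from the talk.

```python
from datetime import datetime, timezone


def stage0_key(sensor: str, flow_time: datetime, filename: str) -> str:
    """Build an S3 key whose prefixes follow the Hive partition layout above."""
    t = flow_time.astimezone(timezone.utc)  # assumption: partitions use UTC
    return (
        f"p_sensor={sensor}/"
        f"p_year={t:%Y}/"
        f"p_date={t:%Y-%m-%d}/"
        f"p_hour={t:%Y-%m-%d_%H}/"
        f"{filename}.csv.gz"
    )


# Hypothetical sensor and file names.
key = stage0_key(
    "sensor-001",
    datetime(2019, 6, 11, 14, 5, tzinfo=timezone.utc),
    "flows-0001",
)
print(key)
```

Because every partition value is derived from fields in the record itself, the same record always lands under the same prefix no matter when it is ingested.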
Near real-time SQL access
Provide analysts with fast seamless SQL query access to all the records;
newest (few minutes latency) to oldest (months back).
The Enrich AWS Lambda function also adds new
partitions to predefined AWS Glue catalog
tables
Partitions optimize query cost and speed; filters
on sensorname and/or flow timestamp use
‘partition pruning’
Records are accessible to analysts via Amazon
Athena SQL
Latency (sensor to SQL) < 5min
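One way to register a new partition is an Athena `ALTER TABLE ... ADD IF NOT EXISTS PARTITION` statement; the talk does not say whether the Lambda function uses DDL or the AWS Glue API, so treat this as a sketch. The table name, bucket name, `stage0/` prefix, and string-typed partition columns are all assumptions.

```python
def add_partition_ddl(table: str, bucket: str, sensor: str,
                      year: str, day: str, hour: str) -> str:
    """Return DDL that registers one hourly partition (all names hypothetical)."""
    spec = (
        f"p_sensor='{sensor}', p_year='{year}', "
        f"p_date='{day}', p_hour='{hour}'"
    )
    location = (
        f"s3://{bucket}/stage0/p_sensor={sensor}/p_year={year}/"
        f"p_date={day}/p_hour={hour}/"
    )
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({spec}) LOCATION '{location}'"
    )


ddl = add_partition_ddl(
    "netflow_stage0", "example-data-lake",
    "sensor-001", "2019", "2019-06-11", "2019-06-11_14",
)
print(ddl)
```

Once the partition is registered, a query that filters on `p_sensor` or `p_date` only reads the matching prefixes, which is the partition pruning mentioned above.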
Optimize for efficient long term queries
Provide analysts with fast seamless SQL query access to all the records;
newest (few minutes latency) to oldest (months back).
Small files (micro-batches) minimize the
latency for NRT queries (stage0).
But we need to optimize for large time
span queries (stage1)
Stage 1 table has same columns and
partitions as stage0, but:
• Columnar file format (parquet)
• Bucketed by srcIP (faster queries)
• Minimize files per hourly partition
Deep dive coming up!
Seamless SQL access
Provide analysts with fast seamless SQL query access to all the records
Recap
• Stage 0 optimized for NRT queries (Enrich Lambda)
– lots of small files per partition
• Stage 1 optimized for historical queries (ETL)
– fewer optimized files per partition
Views combine (UNION ALL) stage0 & stage1
tables to give the best of both worlds
• Low latency for recent data from stage0 (<6hrs)
• Efficiency (cost + speed) from stage1 (>6hrs)
• Views updated every hour (scheduled Lambda)
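A sketch of what the hourly view refresh might generate, assuming a six-hour cutoff on the `p_hour` partition column. The view and table names are hypothetical, and the real scheduled Lambda function may decide the cutoff differently.

```python
from datetime import datetime, timedelta, timezone


def build_view_sql(now: datetime, cutoff_hours: int = 6) -> str:
    """Return a view definition that reads old hours from stage1, recent from stage0."""
    # p_hour values are fixed-width (YYYY-MM-DD_hh), so string comparison
    # orders them chronologically.
    cutoff = (now - timedelta(hours=cutoff_hours)).strftime("%Y-%m-%d_%H")
    return (
        "CREATE OR REPLACE VIEW flows AS "
        f"SELECT * FROM flows_stage1 WHERE p_hour < '{cutoff}' "
        "UNION ALL "
        f"SELECT * FROM flows_stage0 WHERE p_hour >= '{cutoff}'"
    )


view_sql = build_view_sql(datetime(2019, 6, 11, 14, 30, tzinfo=timezone.utc))
print(view_sql)
```

The two branches never overlap, so each record is returned once even while stage1 is still catching up on recent hours.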
Well Architected!
Make it secure, cost effective, well instrumented, reliable,
scalable to handle future growth, and extensible as a foundation
for advanced automated analytics and machine learning.
Data Lake on Amazon S3 +
AWS Glue catalog: foundation for
future enhancements and
innovations (using the ‘right tool for
the job’ – Amazon SageMaker,
Amazon QuickSight, Amazon
Redshift, etc.)
Automated
Deployment:
AWS CodePipeline,
AWS CodeBuild, and
AWS CloudFormation
Storage Retention & Costs:
Amazon S3 Intelligent Tiering and
Lifecycle Management
Stage 1 ETL – Deep Dive
Make queries faster and cheaper
Stage 0:
• CSV
• Variable number of small files in NRT
Stage 1:
• Columnar file format - Parquet
• Bucketed by source IP (fixed number of larger files)
Stage 1 ETL – Deep Dive
• Stage0 puts partition values in the ETL Amazon Simple Queue Service (Amazon SQS) queue
• Auto Scaling group driven by Amazon SQS queue depth
• Each Amazon EC2 instance runs a Dask script that processes one partition at a time - avoid network shuffling
• Maintain Stage0 partition structure
[Diagram: inside a VPC in the AWS Cloud, EC2 instances in an Auto Scaling group take a partition message from the ETL SQS queue, read all CSV files in that stage0 partition of the data lake bucket, write bucketed Parquet to stage1, then get table info and update the partition in the data catalog.]
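A simplified sketch of scaling on queue depth: the desired instance count is the number of queued partition messages divided by how many partitions one instance is expected to handle, clamped to the group limits. The function and all of its default values are hypothetical; the talk does not describe the actual scaling policy.

```python
import math


def desired_instances(queue_depth: int,
                      partitions_per_instance: int = 10,
                      min_size: int = 0,
                      max_size: int = 20) -> int:
    """Map SQS queue depth to an instance count (all defaults are hypothetical)."""
    wanted = math.ceil(queue_depth / partitions_per_instance)
    # Clamp to the Auto Scaling group's configured bounds.
    return max(min_size, min(max_size, wanted))


for depth in (0, 25, 1000):
    print(depth, "messages ->", desired_instances(depth), "instances")
```

Since each message names one partition and each worker handles one partition at a time, adding instances increases throughput without any data moving between workers.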
Stage 1 ETL – Deep Dive
Why not Amazon Kinesis or Firehose?
Initially iterated using established streaming patterns with
Kinesis and Firehose. As we optimized, we needed greater
control for this particular case:
• Avoid data shuffling. The file micro-batches being ingested
are inherently partitioned in the way we needed
• Firehose partitions files in Amazon S3 by ingestion time. To
optimize NRT queries we need to partition by fields in the
payload: e.g. sensorname and flow timestamp
• File format optimizations. E.g. Parquet row group size and
Hive Bucketing
Stage 1 ETL – Deep Dive
Why Dask and not Amazon EMR, Spark, or AWS Glue?
Initially iterated using AWS Glue and Amazon EMR. Awesome tools!
Landed on Dask for greater low-level control based on a combination
of optimizations and features for this particular case:
• Skewed partitions – larger sensors can cause stragglers
• Reduce network data shuffling
• Amazon S3 multipart uploads. Avoid staging; output is visible only when
the upload succeeds
• Automatic retry of failed partitions
• Dynamic partition overwrite. Backfill or redo of partitions
• Partition management – make them available when needed
• Hive Bucketing
Stage 1 ETL – Deep Dive
Bucketing
• Hive Partitions for small set of values
• Hive Bucketing for a large number of unique values
(e.g., IP addresses: billions)
• Bucketing provides significant
performance improvement for queries
using bucketed fields
• Hash field and apply mod function based
on number of buckets
• Fixed number of evenly distributed files
• We implemented it in Dask: hash(bucketed field) %
number of buckets
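The hash-and-mod step can be sketched in plain Python. This uses CRC32 only because it is stable across runs; it is not Hive's hash function, and a query engine can only exploit bucketing if the writer uses the hash it expects. The bucket count and sample IP addresses are made up.

```python
import zlib
from collections import Counter

NUM_BUCKETS = 16  # hypothetical; the talk does not give the bucket count


def bucket_for(src_ip: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Assign a source IP to a bucket: stable hash, then mod by bucket count."""
    # Python's built-in hash() is randomized per process for strings,
    # so a deterministic hash is used instead.
    return zlib.crc32(src_ip.encode("utf-8")) % num_buckets


# Synthetic addresses, only to show that rows spread over a fixed set of files.
ips = [f"10.0.{i // 256}.{i % 256}" for i in range(10_000)]
files = Counter(bucket_for(ip) for ip in ips)
print(len(files), "bucket files used")
print("rows per bucket:", min(files.values()), "to", max(files.values()))
```

Every row for a given source IP lands in the same bucket, so a query filtering on that IP needs to read one file per partition instead of all of them.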
Stage 1 ETL – Deep Dive
• Partitioning targets a
specific folder
• Bucketing targets a
specific file
• Columnar format can
skip data within the file
The results are in…
You may be asking yourself, did it all work out?
The answer is…
Yes!
Some Examples
Count of all records generated by a single
sensor
• The “old way” – 15 minutes
• The AWS way – 2 minutes
7.5x faster
Count of all records generated by all sensors
• Old way – 36 hours
• AWS – 3 minutes
720x faster
Athena Query 1 – Flows Byte Aggregation 1HR
A couple more…
Query for all traffic, destined to a specific IP
address and port, over a one week time
period
• Old way – 48 hours
• AWS – 19 minutes
150x faster
Query for all traffic, destined to a set of IP
addresses over port 80, over a one week
time period
• Old way – 72 hours
• AWS – 12 minutes
360x faster
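The speedup factors on the example slides follow directly from the reported timings. The short check below recomputes them; the dictionary labels are paraphrases, and the timings are the ones quoted on the slides.

```python
# (old time in minutes, new time in minutes), as reported on the slides
timings = {
    "count, single sensor": (15, 2),
    "count, all sensors": (36 * 60, 3),
    "one IP and port, one week": (48 * 60, 19),
    "IP set on port 80, one week": (72 * 60, 12),
}


def speedup(old_minutes: float, new_minutes: float) -> float:
    """Ratio of old query time to new query time."""
    return old_minutes / new_minutes


for name, (old, new) in timings.items():
    print(f"{name}: {speedup(old, new):.1f}x faster")
```

Three of the four ratios match the slides exactly; the third works out to roughly 152x, which the slide rounds to 150x.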
Athena Query 2 – All Traffic to a Single IP & Port – 1 Week
Next Steps
• Leverage Amazon Redshift Spectrum
• Building UI for broader usability
• Collect additional datatypes for improved alerting and correlation
• Leverage Artificial Intelligence/Machine Learning to identify malicious
activity
Thank you!
Bob Strahan
strahanr@amazon.com
Oliver Atoa
oatoa@amazon.com
Brian Calkin
Brian.Calkin@cisecurity.org
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

Plus de Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

CIS Cyber Data Lake Analytics

  • 1. PUBLIC SECTOR SUMMIT, WASHINGTON DC
  • 2. Cyber Data Lake: How CIS Analyzes Billions of Network Traffic Records per Day
      • Brian Calkin, Chief Technology Officer, Center for Internet Security
      • Oliver Atoa, Senior Consultant, AWS/WWPS Proserve
      • Bob Strahan, Principal Consultant, AWS/WWPS Proserve
      • 302639
      • © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. PUBLIC SECTOR SUMMIT
  • 3. Agenda
      1. Center for Internet Security (CIS) Netflow Challenge (Brian)
      2. Our Solution (Bob and Oliver)
      3. Results (Brian)
      4. Q&A
  • 4. Session Goals
      1. Educate you with useful architecture concepts
      2. Empower you to explore similar approaches for your own business
      Familiarity with Data Lakes on AWS is presumed; this is not a Data Lake pitch!
  • 6. Center for Internet Security
      • Non-profit using the power of a global community to develop best practices for securing IT systems and data
      • Mission: Identify, develop, validate, promote, and sustain best practice solutions for cyber defense. Build and lead communities to enable an environment of trust in cyberspace.
  • 7. CIS Benchmarks and CIS Controls
      • CIS Benchmarks and CIS Controls are the global standard and recognized best practices for securing IT systems and data against the most pervasive attacks
      • CIS’ proven guidelines are continuously refined and verified by a volunteer, global community of experienced IT professionals
  • 8. Multi-State & Elections Infrastructure Information Sharing and Analysis Centers (MS-ISAC & EI-ISAC)
      • The MS-ISAC has been designated by DHS as the key resource for cyber threat prevention, protection, response, and recovery for the nation’s state, local, tribal, and territorial governments
      • Through the EI-ISAC, election agencies gain access to an elections-focused cyber defense suite, including sector-specific threat intelligence products, incident response and remediation, threat and vulnerability monitoring, cybersecurity awareness and training products, and tools for implementing security best practices
      • ~6,000 Member Organizations
  • 9. CIS Network Monitoring - Albert
      • Network intrusion detection sensor
      • Fully monitored and managed
      • State, local, tribal and territorial government focused
      • Open source software on commodity hardware
      • Alert data analyzed 24x7
      • ~350 Sensors deployed nationwide
  • 10. Albert - NetFlow
      • Generated on sensors
      • Passive DNS data also collected
      • Data is valuable for performing ad-hoc queries
      • Fields: Source IP, Destination IP, Source Port, Destination Port, TCP Flags, Number of bytes of traffic sent/received, Timestamp
  • 11. Albert – Network Intrusion Detection
      • Based on Suricata
      • Alerts generated based on known signatures
      • ~27,000 Signatures per Sensor
      • ~10,000 Albert events analyzed per month
      • ~5,000 Albert events escalated to SLTT entities per month
  • 12. The Challenge
      • 48 Million Records Per Minute
      • Rate of incoming traffic is inconsistent and continually increasing
      • Several petabytes (and growing) of NetFlow data
      • Local SAN Storage Full
      • Ad-hoc queries take far too long to run: Hours, Days, Weeks…
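For a rough sense of scale, the per-minute figure on the challenge slide can be extended to a full day. This is a back-of-the-envelope sketch only: it assumes the quoted rate is sustained for 24 hours, which the slide itself says is not the case, so treat the result as an order of magnitude rather than a measured volume.

```python
# Rate quoted on the slide; assumed constant here purely for illustration.
RECORDS_PER_MINUTE = 48_000_000


def records_per_day(per_minute: int) -> int:
    """Extend a per-minute record rate to a 24-hour day."""
    return per_minute * 60 * 24


if __name__ == "__main__":
    print(f"{records_per_day(RECORDS_PER_MINUTE):,} records/day")
```

Under that assumption the daily volume is in the tens of billions of records, which is consistent with the "billions of records per day" in the talk title.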
  • 13. The Desired Solution
      • Store Six Months of NetFlow Data
      • High Performance Ad-Hoc SQL Queries
      • Goal: Get from Days and Weeks to Seconds and Minutes
      • Cost Effective
      • Highly Secure
      • Extensible Features for Future Enhancements and Growth
      *Looked at Both On-Premises & Cloud Solutions
  • 15. Our mission: Ingest binary netflow and DPI records from hundreds of sensors, each generating millions of records every couple of minutes.
  • 16. Our mission: Transform and enrich these records, and save them to cost efficient storage.
  • 17. Our mission: Provide analysts with fast seamless SQL query access to all the records; newest (few minutes latency) to oldest (months back).
  • 18. Our mission: Make it secure, cost effective, well instrumented, reliable, scalable to handle future growth, and extensible as a foundation for advanced automated analytics and machine learning.
  • 19. Data Lake
  • 20. Ingestion
      Ingest binary netflow and DPI records from hundreds of sensors, each generating millions of records every couple of minutes.
      • Receiver service on AWS
      • Amazon Elastic Compute Cloud (Amazon EC2) Autoscaling, Multi-AZ, NLB with EIP
      • SCP file transfer with client auth (keypairs)
      • IP whitelisting
      • Receivers convert each incoming sensor file to CSV files and upload each file to Amazon Simple Storage Service (Amazon S3) (not yet enriched or query ready)
      • Receiver cluster scales up and down as incoming record volumes fluctuate
  • 21. Enrichment and Amazon S3 data lake storage
      Transform and enrich flow records and save them to cost efficient storage.
      We use an AWS Lambda function:
      • Triggered immediately as each new CSV file arrives in Amazon S3 from the Receiver
      • For each record:
          • Detect corrupted records and fix or reject them
          • Enrich good records with additional useful fields (e.g. IP ASN, directionality, etc.)
      • Save records to Amazon S3 (stage0), with prefixes that define Hive partitions:
          • p_sensor=<sensorname>/p_year=YYYY/p_date=YYYY-MM-DD/p_hour=YYYY-MM-DD_hh/<file>.csv.gz
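The partition prefix layout on this slide can be sketched as a small helper. This is an illustrative sketch, not the presenters' Lambda code: the function name `stage0_prefix` is hypothetical, and normalizing timestamps to UTC is an assumption, since the deck does not say which time zone the partition values use.

```python
from datetime import datetime, timezone


def stage0_prefix(sensor: str, flow_time: datetime) -> str:
    """Build the Hive-style partition prefix for one flow record.

    flow_time must be timezone-aware; it is converted to UTC (an assumption).
    """
    if flow_time.tzinfo is None:
        raise ValueError("flow_time must be timezone-aware")
    t = flow_time.astimezone(timezone.utc)
    return (
        f"p_sensor={sensor}/"
        f"p_year={t:%Y}/"
        f"p_date={t:%Y-%m-%d}/"
        f"p_hour={t:%Y-%m-%d_%H}/"
    )


if __name__ == "__main__":
    when = datetime(2019, 6, 11, 14, 30, tzinfo=timezone.utc)
    # The sensor name below is a made-up example value.
    print(stage0_prefix("sensor-a", when) + "<file>.csv.gz")
```

Because every level is a `key=value` folder, query engines that understand Hive partitioning can map each prefix directly to a partition of the table.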
  • 22. Near real-time SQL access
      Provide analysts with fast seamless SQL query access to all the records; newest (few minutes latency) to oldest (months back).
      • The Enrich AWS Lambda function also adds new partitions to predefined AWS Glue catalog tables
      • Partitions optimize query cost and speed; filters on sensorname and/or flow timestamp use ‘partition pruning’
      • Records are accessible to analysts via Amazon Athena SQL
      • Latency (sensor to SQL) < 5min
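One way to register a new partition is a Hive-style `ALTER TABLE ... ADD PARTITION` statement, which Athena accepts. The deck does not say whether the Lambda uses DDL or the Glue API, so the sketch below is only an illustration of the idea; the table name, bucket path, and function name are hypothetical, and the sketch only builds the statement text rather than submitting it.

```python
from typing import Mapping


def add_partition_ddl(table: str, data_root: str, partition: Mapping[str, str]) -> str:
    """Return DDL that registers one partition and points it at its S3 folder.

    `partition` must list keys in the same order as the folder hierarchy.
    """
    spec = ", ".join(f"{key}='{value}'" for key, value in partition.items())
    path = "/".join(f"{key}={value}" for key, value in partition.items())
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({spec}) LOCATION '{data_root}/{path}/'"
    )


if __name__ == "__main__":
    print(
        add_partition_ddl(
            "flows_stage0",                 # hypothetical table name
            "s3://example-bucket/stage0",   # hypothetical bucket and prefix
            {
                "p_sensor": "sensor-a",
                "p_year": "2019",
                "p_date": "2019-06-11",
                "p_hour": "2019-06-11_14",
            },
        )
    )
```

Once a partition is registered, a query that filters on `p_sensor` or `p_hour` only reads the matching folders, which is the partition pruning the slide refers to.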
  • 23. Optimize for efficient long term queries
      Provide analysts with fast seamless SQL query access to all the records; newest (few minutes latency) to oldest (months back).
      Small files (micro-batches) minimize the latency for NRT queries (stage0), but we need to optimize for large time span queries (stage1).
      The stage1 table has the same columns and partitions as stage0, but:
      • Columnar file format (Parquet)
      • Bucketed by srcIP (faster queries)
      • Minimize files per hourly partition
      Deep dive coming up!
  • 24. Seamless SQL access
      Provide analysts with fast seamless SQL query access to all the records.
      Recap:
      • Stage 0 optimized for NRT queries (Enrich Lambda): lots of small files per partition
      • Stage 1 optimized for historical queries (ETL): fewer optimized files per partition
      Views combine (UNION ALL) stage0 & stage1 tables to give the best of both worlds:
      • Low latency for recent data from stage0 (<6hrs)
      • Efficiency (cost + speed) from stage1 (>6hrs)
      • Views updated every hour (scheduled Lambda)
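The combined view can be sketched as generated SQL that a scheduled job rewrites each hour. This is a minimal sketch under stated assumptions: the view and table names are hypothetical, the cutoff is applied on the `p_hour` partition column, and the exact predicate the presenters use is not shown in the deck. It relies on the fact that `YYYY-MM-DD_hh` strings sort in chronological order.

```python
from datetime import datetime, timedelta, timezone


def combined_view_sql(
    now: datetime,
    view: str = "flows",            # hypothetical view name
    stage0: str = "flows_stage0",   # hypothetical table names
    stage1: str = "flows_stage1",
    cutoff_hours: int = 6,
) -> str:
    """Return a CREATE OR REPLACE VIEW statement that stitches both stages.

    Older partitions come from stage1, the most recent hours from stage0.
    """
    cutoff = (now - timedelta(hours=cutoff_hours)).strftime("%Y-%m-%d_%H")
    return (
        f"CREATE OR REPLACE VIEW {view} AS\n"
        f"SELECT * FROM {stage1} WHERE p_hour < '{cutoff}'\n"
        f"UNION ALL\n"
        f"SELECT * FROM {stage0} WHERE p_hour >= '{cutoff}'"
    )


if __name__ == "__main__":
    print(combined_view_sql(datetime(2019, 6, 11, 14, 30, tzinfo=timezone.utc)))
```

Because the two branches cover disjoint ranges of `p_hour`, no record is returned twice, and analysts query a single name regardless of the age of the data.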
  • 25. Well Architected!
      Make it secure, cost effective, well instrumented, reliable, scalable to handle future growth, and extensible as a foundation for advanced automated analytics and machine learning.
      • Data Lake on Amazon S3 + AWS Glue catalog: foundation for future enhancements and innovations (using the ‘right tool for the job’: Amazon SageMaker, Amazon QuickSight, Amazon Redshift, etc.)
      • Automated Deployment: AWS CodePipeline, AWS CodeBuild, and AWS CloudFormation
      • Storage Retention & Costs: Amazon S3 Intelligent Tiering and Lifecycle Management
  • 26. Stage 1 ETL – Deep Dive
      Make queries faster and cheaper.
      • From: CSV; variable number of small files in NRT
      • To: columnar file format (Parquet); bucketed by source IP (fixed number of larger files)
  • 27. Stage 1 ETL – Deep Dive
      • Stage0 puts partition values in the ETL Amazon Simple Queue Service (Amazon SQS) queue
      • AutoScaling Group driven by Amazon SQS queue depth
      • Each Amazon EC2 instance runs a Dask script that processes one partition at a time, to avoid network shuffling
      • Maintain Stage0 partition structure
      • Diagram labels: AWS Cloud, VPC, Auto Scaling group with EC2 instances, ETL SQS Queue, Partition Message, Data Lake Bucket (stage0, stage1), Data catalog; steps: read all CSV files in partition, write bucketed Parquet, get table info & update partition
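The phrase "driven by Amazon SQS queue depth" can be illustrated with a simple backlog-per-instance rule. The deck does not describe the actual scaling policy, so everything here is an assumption for illustration: the function name, the idea of a fixed number of partitions each instance should have waiting, and the group size limits.

```python
import math


def desired_instances(
    queue_depth: int,
    partitions_per_instance: int,
    min_size: int,
    max_size: int,
) -> int:
    """Pick an instance count so each worker has a bounded backlog.

    queue_depth is the number of partition messages waiting in the queue.
    The result is clamped to the group's [min_size, max_size] range.
    """
    if partitions_per_instance <= 0:
        raise ValueError("partitions_per_instance must be positive")
    wanted = math.ceil(queue_depth / partitions_per_instance)
    return max(min_size, min(max_size, wanted))


if __name__ == "__main__":
    # Hypothetical numbers: 37 partitions waiting, 4 per worker, group of 1 to 20.
    print(desired_instances(37, 4, min_size=1, max_size=20))
```

Since each message names exactly one stage0 partition, workers never need to exchange data with each other, which matches the slide's goal of avoiding network shuffling.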
  • 28. Stage 1 ETL – Deep Dive: Why not Amazon Kinesis or Firehose?
      Initially iterated using established streaming patterns with Kinesis and Firehose. As we optimized, we needed greater control for this particular case:
      • Avoid data shuffling. The file micro-batches being ingested are inherently partitioned in the way we needed
      • Firehose partitions files in Amazon S3 by ingestion time. To optimize NRT queries we need to partition by fields in the payload, e.g. sensorname and flow timestamp
      • File format optimizations, e.g. Parquet row group size and Hive Bucketing
      • Services pictured: Amazon Simple Queue Service, Amazon EC2, Amazon Kinesis, Amazon Kinesis Data Streams, Amazon Kinesis Data Firehose
  • 29. Stage 1 ETL – Deep Dive: Why Dask and not Amazon EMR, Spark, or AWS Glue?
      Initially iterated using AWS Glue and Amazon EMR. Awesome tools! Landed on Dask for greater low-level control, based on a combination of optimizations and features for this particular case:
      • Skewed partitions: larger sensors can cause stragglers
      • Reduce network data shuffling
      • Amazon S3 multi-part uploads: avoid staging; output is visible only when the upload succeeds
      • Automatic retry of failed partitions
      • Dynamic partition overwrite: backfill or redo of partitions
      • Partition management: make them available when needed
      • Hive Bucketing
      • Services pictured: Amazon EMR, AWS Glue, Amazon EC2
  • 30. Stage 1 ETL – Deep Dive: Bucketing
      • Hive Partitions for fields with a small set of values
      • Hive Bucketing for fields with a large number of unique values (e.g. IP addresses, billions)
      • Bucketing provides significant performance improvement for queries using bucketed fields
      • Hash the field and apply the mod function based on the number of buckets: hash(bucketed field) % number of buckets
      • Fixed number of evenly distributed files
      • We implemented it in Dask
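The hash-and-mod rule above can be sketched in a few lines. This is not the presenters' Dask code and it does not reproduce Hive's own hash function: `zlib.crc32` is used purely as a stand-in deterministic hash, and the field name `src_ip` and the bucket count are assumptions. A real pipeline that wants a query engine to exploit the buckets would have to use the hash that the engine expects.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List


def bucket_for(src_ip: str, num_buckets: int) -> int:
    """Map a source IP to a bucket number in [0, num_buckets).

    CRC32 is a stand-in hash for illustration, not Hive's hash function.
    """
    return zlib.crc32(src_ip.encode("utf-8")) % num_buckets


def split_into_buckets(records: Iterable[dict], num_buckets: int) -> Dict[int, List[dict]]:
    """Group the records of one partition by bucket; each group becomes one file."""
    buckets: Dict[int, List[dict]] = defaultdict(list)
    for record in records:
        buckets[bucket_for(record["src_ip"], num_buckets)].append(record)
    return buckets


if __name__ == "__main__":
    # Made-up sample records for one hourly partition.
    sample = [
        {"src_ip": "10.0.0.1", "bytes": 120},
        {"src_ip": "10.0.0.2", "bytes": 64},
        {"src_ip": "10.0.0.1", "bytes": 900},
    ]
    for bucket, rows in sorted(split_into_buckets(sample, num_buckets=16).items()):
        print(bucket, rows)
```

The property that matters is that every record with the same source IP lands in the same bucket, so the number of files per partition is fixed no matter how many distinct IPs appear.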
  • 31. Stage 1 ETL – Deep Dive
      • Partitioning targets a specific folder
      • Bucketing targets a specific file
      • Columnar format can skip data within the file
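The first two levels of narrowing on this slide can be shown with a toy file listing. This is a sketch under loudly stated assumptions: the `bucket-NNNNN.parquet` file naming, the bucket count, the sensor names, and the CRC32 stand-in hash are all hypothetical, and the third level (skipping data inside a columnar file) is handled by the Parquet reader and is not modelled here.

```python
import zlib
from typing import List

NUM_BUCKETS = 8  # hypothetical bucket count


def bucket_of(src_ip: str) -> int:
    """Stand-in hash-and-mod rule (CRC32 is an assumption, not Hive's hash)."""
    return zlib.crc32(src_ip.encode("utf-8")) % NUM_BUCKETS


def keys_to_read(keys: List[str], sensor: str, hour: str, src_ip: str) -> List[str]:
    """Return only the object keys a query on (sensor, hour, src_ip) must open."""
    folder = f"p_sensor={sensor}/"                            # partitioning: pick the folder
    hour_part = f"/p_hour={hour}/"
    file_name = f"/bucket-{bucket_of(src_ip):05d}.parquet"    # bucketing: pick the file
    return [
        key
        for key in keys
        if key.startswith(folder) and hour_part in key and key.endswith(file_name)
    ]


# A made-up listing: 2 sensors x 2 hours x 8 buckets = 32 files.
all_keys = [
    f"p_sensor={s}/p_year=2019/p_date=2019-06-11/p_hour={h}/bucket-{b:05d}.parquet"
    for s in ("sensor-a", "sensor-b")
    for h in ("2019-06-11_13", "2019-06-11_14")
    for b in range(NUM_BUCKETS)
]

if __name__ == "__main__":
    hits = keys_to_read(all_keys, "sensor-a", "2019-06-11_14", "10.0.0.1")
    print(f"{len(hits)} of {len(all_keys)} files opened:", hits)
```

In this toy listing, a query that names one sensor, one hour, and one source IP opens a single file out of 32; partitioning and bucketing each remove a different share of the work.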
  • 33. You may be asking yourself, did it all work out?? The results are in… The answer is… Yes!
  • 34. Some Examples
      • Count of all records generated by a single sensor
          • The “old way”: 15 minutes
          • The AWS way: 2 minutes (7.5x faster)
      • Count of all records generated by all sensors
          • Old way: 36 hours
          • AWS: 3 minutes (720x faster)
  • 35. Athena Query 1 – Flows Byte Aggregation 1HR
  • 36. A couple more…
      • Query for all traffic destined to a specific IP address and port, over a one week time period
          • Old way: 48 hours
          • AWS: 19 minutes (150x faster)
      • Query for all traffic destined to a set of IP addresses over port 80, over a one week time period
          • Old way: 72 hours
          • AWS: 12 minutes (360x faster)
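The speedup factors quoted on the two results slides follow directly from the before and after timings. The short check below recomputes them from the figures in the deck; only the helper name `speedup` and the short labels are introduced here.

```python
def speedup(old_minutes: float, new_minutes: float) -> float:
    """Ratio of the old query time to the new query time."""
    return old_minutes / new_minutes


# (old time in minutes, new time in minutes), taken from the results slides.
cases = {
    "count, single sensor": (15, 2),
    "count, all sensors": (36 * 60, 3),
    "one IP and port, one week": (48 * 60, 19),
    "IP set on port 80, one week": (72 * 60, 12),
}

if __name__ == "__main__":
    for name, (old, new) in cases.items():
        print(f"{name}: {speedup(old, new):.1f}x")
```

Three of the four ratios match the slides exactly; the 48 hours versus 19 minutes case works out to a little over 151x, which the slide rounds to 150x.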
  • 37. Athena Query 2 – All Traffic to a Single IP & Port – 1 Week
  • 38. Next Steps
      • Leverage Amazon Redshift Spectrum
      • Build a UI for broader usability
      • Collect additional datatypes for improved alerting and correlation
      • Leverage Artificial Intelligence/Machine Learning to identify malicious activity
  • 40. Thank you!
      • Bob Strahan, strahanr@amazon.com
      • Oliver Atoa, oatoa@amazon.com
      • Brian Calkin, Brian.Calkin@cisecurity.org