2. Strata New York 2018
EDW core data
1PB
Incremental data
4TB/DAY
On-line data storage
5PB
>600M customers
>2,000M accounts
Big data – How big is big?
CCB - 2nd biggest bank in China.
About China Construction Bank (CCB)
3. Strata New York 2018
Tactical
Decision
Makers
General Business
Users
Strategic Decision
Makers
Operational Decision
Makers
Headquarters
Source Systems
ALS
CLPM
CCMIS
EDW
Teradata 5450
(6 nodes), 18T
ERPF
CCBS
SMIS
Material DSS Database
OCRM…
Cube
CMIS
CMIS
CCDA…1104
Operational
Data Storage
ODS
Historical data
Branches
Source Systems
100+
reports
100+ Users
1st Generation EDW (2004)
4. Strata New York 2018
Dining Room
Readily Accessible to End Users
(and BI Developers)
Safe, Hospitable Environment
Data Assets “Ready for Primetime”
Dimensionally Structured
Kitchen
Off Limits to End Users
Data Professionals Only Please
Dangerous / Inhospitable Environment
Data Assets “Not Ready for Primetime”
Structured Variably For Data Processing
Dimensional Semantic Layer
Dimensional Tier
[Physical or Virtual (CIF or Data Vault)]
(Virtual or Physical)
Un/Semi-Structured Data Movement
Un/Semi-Structured Source Data
Persistent Un/Semi-Structured Staging Area
Unstructured ->
Structured Data
Discovery
Processing
Structured Data Movement
Structured Source Data
Persistent Structured Data
Repository
Insight
Generation /
Data Mining
Big Data Blueprint (2012)
5. Strata New York 2018
Tactical
Decision
Makers
General Business
Users
Strategic Decision
Makers
Operational Decision
Makers
Presentation Layer
Headquarters
Source System
ALS
CLPM
CCMIS
EDW
Teradata 6650
(10+10 nodes), 600T
Big Data Analytics Platform
Hadoop
Legacy Data Marts
OCRM…
1000+
Cube
CMIS
Historical
Data
SOR
MPP DB
ERPF
CCBS CCDA…1104
Operational
Data Storage
Branch
ODSB
Performance
Marketing
EDW
Teradata 2750
(32 nodes), 750T
Branches
Source Systems
25,000+
reports
2,000+
Data
Mining
Themes
SQL Translation between different databases is a big lesson.
100,000+ Users
User Experience Challenges:
Data latency
High-performance
EDW Challenges:
System I/O
Maintenance and data lineage
Big Data Transformation (2016)
9. Strata New York 2018
100,000+ users
1,200+ million records
PB-level data storage
Millisecond-level response
Metrics can be published by sub-organizations
and subscribed to by end users with a tap
Intelligent Eyes (1st version, Sept 2016)
A mobile product brought an opportunity
10. Strata New York 2018
Benefits
1 TCO
• Teradata capacity no longer increased
• Cost of unit storage ↓ 66%
• Delivery cycle time ↓ from 6 months to 1 month
2 Performance
• Mobile users ↑ from 0 to 100,000+
• Active PC users ↓ 90%
• Page views (PV) up to 1,000,000 daily
• Real-time applications emerged
• Data latency ↓ from 48 hours to 7 hours
• Millisecond-level response
3 User Experience
• Access data anywhere and anytime
• 25,000+ reports ↓ to 5,000 reports and 800 mobile data metrics
• Data silo ("vertical shaft") problems eliminated
11. Strata New York 2018
How to re-engineer a legacy EDW into a data lake
• Discover what users value by collecting their usage records.
• Enable end users to join the data game.
• Build a data conformance bus on Hive.
• Rebuild the analytics layer with Apache Kylin.
• Test-driven development.
12. Strata New York 2018
AUDIO & VIDEO IMAGES TEXT WEB & SOCIAL MACHINE LOGS CRM SCM ERP
L2 Cache Oracle Database
DATA MARTS
TEST/
DEV
ANALYTICAL
ARCHIVE
CAPTURE | STORE | REFINE
MDX RESTFUL
SERVICE
DATA LAB
INDEPENDENT
DATA MART
DUAL
SYSTEMS
TD 66XX
TD 2700
L2 Cache HBase
GP
L1 Cache Redis
ETL
EDW has evolved to Data Ecosystem
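The diagram labels Redis as the L1 cache and HBase and Oracle Database as L2 caches, but it does not say how a lookup moves between them. A minimal read-through sketch, assuming an L1 miss falls back to L2 and the value is then copied into L1. Plain dictionaries stand in for Redis and HBase, and `TieredCache`, `get`, and the metric keys are hypothetical names:

```python
from typing import Dict, Optional


class TieredCache:
    """Toy two-tier read-through cache; dicts stand in for Redis (L1) and HBase (L2)."""

    def __init__(self, l1: Dict[str, str], l2: Dict[str, str]) -> None:
        self.l1 = l1  # small, fast tier (assumed role of Redis)
        self.l2 = l2  # larger, slower tier (assumed role of HBase)

    def get(self, key: str) -> Optional[str]:
        # Serve from L1 when the key is already hot.
        if key in self.l1:
            return self.l1[key]
        # Fall back to L2 and promote the value into L1 for the next read.
        value = self.l2.get(key)
        if value is not None:
            self.l1[key] = value
        return value


cache = TieredCache(l1={}, l2={"branch_0001:deposits": "12345.67"})
first = cache.get("branch_0001:deposits")    # L1 miss, served from L2
second = cache.get("branch_0001:deposits")   # now served from L1
missing = cache.get("branch_9999:deposits")  # absent from both tiers
```

Eviction and expiry are left out on purpose; a real deployment would need both.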
16. Strata New York 2018
About Apache Kylin
• Leading Open Source OLAP for Big Data
• Open sourced by eBay in 2014
• Graduated to Apache Top-Level Project in 2015
• 1000+ adoptions worldwide
• 2015 InfoWorld Bossie Awards
• 2016 InfoWorld Bossie Awards
17. Strata New York 2018
Presentation
Visualization
Data
Lake
Data
Source
o Too many options
o Low performance
o Long learning curve
o Compatibility issue
o Technology vs Data
OLAP: The Missing Part of Big Data
Hive Impala Spark
SQL
Drill
MapReduce …Spark
18. Strata New York 2018
Presentation
Visualization
Data
Lake
Data
Source
o SQL Acceleration for Big Data
o Semantic Layer
o Speed up Analytics
o ANSI SQL Interface
o High Performance and High
Concurrency
Apache Kylin: Bring OLAP back to Big Data
OLAP
Data Mart
Hive Impala Spark SQL Drill
MapReduce …Spark
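Kylin gets its query speed by precomputing aggregates for combinations of dimensions (cuboids) ahead of time, so a query becomes a lookup rather than a scan. A minimal in-memory sketch of that idea, not Kylin's implementation; the rows, dimension names, and `build_cube` helper are made up for illustration:

```python
from collections import defaultdict
from itertools import combinations
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]
Cuboid = Dict[Tuple[Any, ...], float]


def build_cube(rows: List[Row], dims: List[str], measure: str) -> Dict[Tuple[str, ...], Cuboid]:
    """Precompute SUM(measure) for every subset of dims (one cuboid per subset)."""
    cube: Dict[Tuple[str, ...], Cuboid] = {}
    for size in range(len(dims) + 1):
        for group in combinations(dims, size):
            totals: Cuboid = defaultdict(float)
            for row in rows:
                key = tuple(row[d] for d in group)
                totals[key] += float(row[measure])
            cube[group] = dict(totals)
    return cube


rows = [
    {"branch": "Beijing", "product": "loan", "amount": 100.0},
    {"branch": "Beijing", "product": "deposit", "amount": 250.0},
    {"branch": "Shanghai", "product": "loan", "amount": 75.0},
]
cube = build_cube(rows, dims=["branch", "product"], measure="amount")

# A "GROUP BY branch" aggregate is now a dictionary lookup, not a scan.
by_branch = cube[("branch",)]
grand_total = cube[()][()]
```

With n dimensions this builds 2^n cuboids, which is why production systems prune the combinations they materialize.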
21. Strata New York 2018
Featured Customers
Trusted by Fortune 500
Lenovo
#226 of Fortune 500
OPPO
#4 Smart Phone Vendor
Global
Lufax
#1 Fintech in China
CPIC
#252 of Fortune 500
SAIC
#41 of Fortune 500
#47 of Fortune 500
Huawei
#83 of Fortune 500
Huatai Securities
Top Securities in China
Top 3 Telecom in China
McDonald’s
#436 Fortune 500
China UnionPay
#3 Payment Network
Data from Fortune Global 500 year 2017:
http://fortune.com/global500/list/
#33 of Fortune 500
22. Strata New York 2018
Partners
Global Ecosystem
Microsoft Azure Partner
Amazon Web Service Technology Partner
Tableau Technology Partner
Cloudera Silver Partner
MapR Converge Partner
Hortonworks Community Partner
Huawei Solution Partner
23. Evolution of Data Warehousing
Data Mart
Orders
Payments
Contacts
Products
Customers
Data Warehouse
Contacts
Orders
Payments
Products
Data
Warehouse
Data Lake
Contacts
Orders
Payments
Products
Data
Warehouse
Contacts
Orders
Payments
Products
Next Generation
Cloud
Contacts
Orders
Payments
Products
Data
Warehouse
Products
Contacts
Orders
Payments ?
26. Historical Real time
Fusion of Historical &
Real-time Data
Fusion of
Local and Cloud
On-premises Cloud
EDW Data Lake
Fusion of
Traditional DW & Big Data
Fusional DW Architecture
Kyligence Enterprise
Product Screenshot
29. Strata New York 2018
Kyligence Position in Big Data Ecosystem
Fill the gap between business and technology
Kyligence Enterprise
powered by Apache Kylin
BI
Visualization
OLAP
Data Mart
Data Lake
Source
Data
HDFS YARN MapReduce Spark Kafka …Spark SQL
• Fusional
• Unified EDW & Data Lake
• Unified Realtime and Historical
• Unified On-Prem and Cloud
• Intelligent
• Machine Learning-augmented
modeling
• High Performance
• Sub-second query speed on
massive datasets
• High Concurrency
• Web-scale OLAP query
30. Evolution of Data Warehousing
Data Mart
Orders
Payments
Contacts
Products
Customers
Data Warehouse
Contacts
Orders
Payments
Products
Data
Warehouse
Data Lake
Contacts
Orders
Payments
Products
Data
Warehouse
Contacts
Orders
Payments
Products
Fusional &
Intelligent DW
Cloud
Contacts
Orders
Payments
Products
Data
Warehouse
Products
Contacts
Orders
Payments
31. Strata New York 2018
Kyligence Cloud
Transforming Big Data Analytics to Cloud
Kyligence Cloud
ANSI SQL
Dashboard OLAP
Hadoop
Customer Cloud Account
client
cloud
Kyligence Enterprise Platform
streaming
Cluster Deploy
Account Management
Diagnosis &
Optimization
Queries & Reporting
cloud
storage
tables, logs, files
RDBMS
(metadata)
ANSI SQL
Cloud Data
Warehouse
Cluster Management
32. Strata New York 2018
Kyligence Cloud
Transforming Big Data Analytics to Cloud
One-click
provisioning
Auto Scaling
High
Performance
Seamless
Integration
Intelligent
Ops
Deploy globally in 30
minutes
Scale cluster
automatically for
different workloads
Powered by Kyligence
Analytics Platform
Connect to cloud data
sources
Enterprise ODBC driver
for BI
Online diagnosis and
continuous
optimization
Speed Up OLAP analysis and mission-critical queries to interactive speed
35. Strata New York 2018
SQL Acceleration for Big Data
Kyligence Enterprise
Powered by Apache Kylin
ANSI SQL
Kyligence
Storage
Hadoop Platform
T-SQL Oracle SQL PostgreSQL
Ingestion SQL Pushdown
Impala
Query
Analytics
36. Strata New York 2018
SQL Acceleration for Big Data
< 1s
DB
line_orders
buyer_accounts
seller_accounts
product_items
…
√
√
√
SQL SQL
37. Strata New York 2018
SQL Acceleration for Big Data
Intelligent Cubing
Kyligence Enterprise
ANSI SQL
Pushdown
For Ad-Hoc
Aggregation
& Index query
Solution
• Speed up SQL on Hadoop automatically
• Supports Hive, Impala, and Spark SQL, with more to come
• High performance and high concurrency OLAP
Benefits
• Unified analytics platform for aggregation and ad-hoc
query
• Self-service enables analysts to work without IT
SQL on
Hadoop
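The slide describes two paths: aggregation and index queries are answered from the cube, while ad-hoc queries are pushed down to a SQL-on-Hadoop engine. A toy router under that assumption. The decision rule (a query is cube-eligible when all of its grouping columns and measures are covered by the cube definition) and every name here are illustrative, not Kyligence's actual logic:

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class CubeDefinition:
    dimensions: FrozenSet[str]
    measures: FrozenSet[str]


@dataclass(frozen=True)
class Query:
    group_by: FrozenSet[str]
    measures: FrozenSet[str]


def route(query: Query, cube: CubeDefinition) -> str:
    """Return "cube" when the precomputed cube can answer the query, else "pushdown"."""
    covered = query.group_by <= cube.dimensions and query.measures <= cube.measures
    return "cube" if covered else "pushdown"


cube = CubeDefinition(
    dimensions=frozenset({"branch", "product", "day"}),
    measures=frozenset({"sum_amount", "count_orders"}),
)
aggregate_query = Query(frozenset({"branch"}), frozenset({"sum_amount"}))
ad_hoc_query = Query(frozenset({"customer_id"}), frozenset({"sum_amount"}))

aggregate_target = route(aggregate_query, cube)  # covered by the cube
ad_hoc_target = route(ad_hoc_query, cube)        # customer_id is not a cube dimension
```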
39. Strata New York 2018
Powering Excel for Big Data
Extend big data analytics to every analyst's desktop
Analyze Your Big Data LIVE with
Excel
MDX/ANSI SQL Interface
Self-service Big Data from On-Prem to Cloud
40. Strata New York 2018
LIVE
No data import is needed
Slice and dice your big data
Your Excel can fully leverage
Kyligence Cube capability
42. Strata New York 2018
Anywhere
Desktop
Website
Mobile
Kyligence currently supports pinning your Excel report to Power BI mobile
47. Strata New York 2018
Streaming OLAP
Consume Streaming Data via
Kafka
MDX/ANSI SQL Interface
Batch & Streaming together
Data Source
HDFS
(Recent data)
Kyligence Enterprise
Pushdown Cube Access
Build Cube
Loading
Processing
Kafka Topic
Monitor
Prediction
Alerts …
BI
MOLAP …
Cube
(Full history data)
Near Real-time
(On recent data)
Historical
(On full history data)
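The "Batch & Streaming together" idea can be pictured as answering one query from two places: the cube for full history and the recently loaded streaming data for the latest interval, then combining the partial results. A minimal sketch, assuming additive measures and a single cutoff timestamp that separates the two sources; the function and field names are hypothetical:

```python
from typing import Dict, Iterable, Tuple

Event = Tuple[int, str, float]  # (epoch_seconds, dimension value, amount)


def hybrid_sum(
    cube_totals: Dict[str, float],
    recent_events: Iterable[Event],
    cube_built_until: int,
) -> Dict[str, float]:
    """Merge precomputed historical totals with events newer than the cube."""
    merged = dict(cube_totals)
    for ts, key, amount in recent_events:
        # Skip events the cube already covers so nothing is counted twice.
        if ts <= cube_built_until:
            continue
        merged[key] = merged.get(key, 0.0) + amount
    return merged


cube_totals = {"mobile": 1000.0, "pc": 400.0}  # built from full history
recent = [
    (90, "mobile", 5.0),    # already inside the cube, ignored
    (110, "mobile", 20.0),  # newer than the cube
    (120, "atm", 7.5),      # key seen only in the stream
]
result = hybrid_sum(cube_totals, recent, cube_built_until=100)
```

The cutoff check is the important design point: without it, events that have already been folded into the cube would be double counted.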
Good afternoon, everyone! I’m Zhu Zhi from China Construction Bank. Today I would like to share our experience with big data migration. We started this work in 2012, but at first we approached it reactively, still thinking in data warehouse terms. About two years ago, driven by a mobile data app, we began transforming the data warehouse into our data lake, and it worked well. Encouraged by Mr. Han, I am fortunate to have the opportunity to share that experience here.
China Construction Bank is the second-largest bank in China, serving more than 600 million retail customers and more than 10 million corporate customers. More than 2 billion accounts generate 4 terabytes (TB) of incremental data every day. To date, we have stored more than 5 petabytes (PB) of online data, 20% of which belongs to the data warehouse.
Many companies, especially Internet companies, can use a new big data technology stack like Hadoop from the beginning. But our company can’t directly replace traditional applications with new technologies, since we have a long history of building enterprise data warehouses (EDW). In 2004, we built our first-generation Teradata data warehouse, which held only 16 terabytes (TB) of data and served just over 100 users and 100 reports. By 2012, the data volume had grown about 35 times, to as much as 600 TB; the number of users had grown nearly 500 times and the number of apps 250 times, with reports reaching a total of 25,000.
When the concept of big data emerged in 2011, our company faced great challenges. First, the total cost of ownership (TCO) remained very high. Second, more and more semi-structured and unstructured data was coming into use. Third, business intelligence (BI)-based data apps needed to be upgraded with advanced analysis algorithms to satisfy users’ growing requirements. Therefore, we put forward the idea of a big data blueprint.
As you can see, the blueprint was divided into two parts. The upper part was the dining room, aimed at our many business users; we hoped they could use the data in a self-service way. The lower part was the kitchen, for technical engineers; compared with a traditional data warehouse, it could also handle unstructured data and advanced data insights. At that point we were aware of the rapid rise of the Hadoop and Spark ecosystems, but because retraining the technical team was difficult, we kept our Perl + SQL development mode. Given that, we tried to work through the problems by choosing an open MPP database and migrating the data and programs to it.
In 2016, we finally realized that this approach could not solve the core problem. Our development speed could not keep up with the changing requirements of business users. In addition, the open MPP database could neither cope with the growing data volume nor replace Teradata completely. The code became harder to maintain and the problems multiplied. For example, the picture shows a very common, automatically translated SQL statement that is five pages long. Statements like this flooded our entire data warehouse. When we scanned the code into this data graph, I finally understood what a spaghetti-like system was. It was a very painful experience.
To make things more convenient for users, my team developed a mobile data app in 2016. Initially, we only wanted to simplify users’ access to data. After meeting Luke, I suddenly realized that this app was the key to transforming a data warehouse into a data lake. This picture shows the earliest version of our app, the minimum viable product (MVP). It is similar to Straight Flush, a well-known app in China. We thought outside the box and gradually released data cubes from the data warehouse onto mobile phones. End users subscribed to the data they cared about on their phones, so by counting their clicks we could tell which data mattered most. We then moved the most valuable content from the traditional architecture to a data lake using Kylin and MapReduce. The app not only met the needs of a large number of users but also made the overall architecture far more maintainable and reduced TCO.
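Counting clicks to find the most essential data can be sketched as a frequency count over usage records, followed by picking the smallest set of metrics that covers most of the traffic. The record format, the 80% coverage threshold, and the `top_metrics` helper are assumptions for illustration, not the bank's actual process:

```python
from collections import Counter
from typing import Iterable, List


def top_metrics(clicks: Iterable[str], coverage: float = 0.8) -> List[str]:
    """Return the most-clicked metrics that together reach the coverage share."""
    counts = Counter(clicks)
    total = sum(counts.values())
    selected: List[str] = []
    covered = 0
    for metric, n in counts.most_common():
        # Stop once the metrics chosen so far account for enough of the clicks.
        if covered / total >= coverage:
            break
        selected.append(metric)
        covered += n
    return selected


# One entry per click on a subscribed metric (hypothetical metric names).
usage_log = ["deposits"] * 6 + ["loans"] * 3 + ["fx_rate"] * 1
priority = top_metrics(usage_log, coverage=0.8)
```

Here `priority` holds the metrics worth migrating first; anything outside it is a candidate for consolidation or retirement.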
It turned out that this mobile data app outperformed our previous design. In terms of TCO, Teradata capacity no longer increased, and the unit cost of storage fell by 66%. The delivery cycle for new apps was cut from 6 months to 1 month. In terms of performance, mobile users grew from 0 to more than 100,000, while active PC users dropped by 90%. Page views (PV) reached one million per day. In addition, real-time applications emerged as data latency fell from 48 hours to 7 hours and responses reached the millisecond level. As for user experience, users could access data anytime and anywhere, 25,000 reports were consolidated into 5,000 reports and 800 mobile data metrics, and data silo problems were eliminated.
Here is what we learned through the process. First, discover what users value by collecting their usage records; this let us transform the logic of 25,000 reports into about 1,000 cube apps. Second, let end users join the data game: Kylin’s self-service system enables them to take part and share, which greatly reduces development costs. Third, build the data conformance bus on Hive to eliminate the data silo problem. Fourth, rebuild the analytics layer with Apache Kylin to make programs highly maintainable. Fifth, apply test-driven development to refactor the calculation logic.
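The test-driven approach to refactoring calculation logic can be sketched as a parity check: treat the legacy result as the expected value and require the rewritten calculation to reproduce it on the same input. Both functions below are hypothetical stand-ins for a legacy report and its rewrite, and the figures are invented:

```python
from typing import Dict, List, Tuple

Txn = Tuple[str, float]  # (branch, amount)


def legacy_branch_totals(txns: List[Txn]) -> Dict[str, float]:
    """Stand-in for the legacy report logic: explicit loop with running totals."""
    totals: Dict[str, float] = {}
    for branch, amount in txns:
        if branch not in totals:
            totals[branch] = 0.0
        totals[branch] += amount
    return totals


def refactored_branch_totals(txns: List[Txn]) -> Dict[str, float]:
    """Stand-in for the rewritten logic that must match the legacy output."""
    branches = {branch for branch, _ in txns}
    return {b: sum(a for br, a in txns if br == b) for b in branches}


def outputs_match(txns: List[Txn]) -> bool:
    """Parity check: the refactored result must equal the legacy result."""
    return refactored_branch_totals(txns) == legacy_branch_totals(txns)


sample = [("Beijing", 100.0), ("Shanghai", 40.0), ("Beijing", 2.5)]
parity = outputs_match(sample)
```

Writing the check before the rewrite gives each migrated report a pass/fail criterion that does not depend on reading the old code.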
Two years later, the data warehouse has evolved into a vibrant data ecosystem, in which Hadoop gathers all structured data into the data lake, and Kylin’s self-service system, together with our data service system, continuously delivers the most valuable data to various apps. I appreciate Mr. Luke’s support for our project. Next, Mr. Luke will introduce some ideas about the enterprise intelligent data warehouse.