Big Data in Banking focuses on the use of big data and Hadoop in the Canadian banking sector. The key points are:
1) The RDARR regulatory project is driving major investments in data management by the big six Canadian banks, totaling around $800 million over three years. This has led banks to implement Hadoop data hubs to centralize data.
2) Adoption of Hadoop for risk applications is still in early stages, with a focus on regulatory reporting. Capital markets has led adoption so far.
3) Lessons learned include choosing flexible Hadoop distributions, using native Hadoop tools for best performance, and designing hubs for data engineers rather than casual users. Infrastructure must have a fast network backbone between nodes, and a complete Hadoop grid should cost less than $1,000/TB.
TechConnex Big Data Series - Big Data in Banking
1. Big Data in Banking
Risk Systems Perspective
Andre Langevin
langevin@utilis.ca
www.swi.com
2. Agenda
Ø Big Data at the Big 6
Ø RDARR Data Hubs
Ø Lessons Learned (so far)
Ø Technology Themes in 2016
An important note about this presentation: in order to respect the commercial interests and privacy of my clients, I have refrained from using specific company names unless the information is publicly available.
4. RDARR Drives Big 6 Adoption
Ø RDARR is a mandatory regulatory project:
v Regulatory response to 2008 credit crisis
v Requires re-build of data gathering and regulatory reporting to implement
measurable data quality, operational metadata and auditable data lineage
v Regulatory enforcement starts in 2017
Ø Big 6 IT spend of ~$800MM over three years on RDARR
v Combined Big 6 IT spend on all Risk Systems projects is ~$400MM per year
v RDARR spend has largely been incremental – other regulatory initiatives have
continued to drive project spend separate from RDARR
Ø Hadoop data hub is a typical RDARR solution element
“The investment spend by G-SIBs on RDARR is very significant, averaging US$230MM per bank. These investment costs are likely to increase.”
– Oliver Wyman, “BCBS 239: Learning from the Prime Movers”
All of Canada’s Big 6 banks were designated as Domestic Systemically Important Banks (D-SIBs) by OSFI, meaning they must fully comply with BCBS 239.
8. Deployment Patterns
Ø Mix of virtual and physical server deployments:
v Cisco UCS and VMware vSphere are leading infrastructure choices
Ø Many banks report using multiple grids aligned to business units*:
v Tools to manage multi-tenancy on Hadoop are still nascent
v Organizational issues (cost allocation, support team alignments) inhibit shared deployments
Ø Vendor community has invested heavily in cloud deployment tools:
v One-click deployments of all major Hadoop distributions are available on public clouds
Ø Banks looking at “hub and sandbox” deployments on private clouds:
v Popular pattern in established US deployments
v The Big 6 have all built an internal private cloud or have access to one through a major infrastructure provider
v A notable S3/AWS deployment by US regulator FINRA sets the standard
* Hortonworks CAB
10. Typical RDARR Data Hub
Ø RDARR focus drives Data Hub solution characteristics:
v RDARR objective is auditable batch reporting – tied in to central lineage and metadata solutions
v Little consideration of unstructured or real-time data sources
v Often characterized as a raw-data landing zone for otherwise inaccessible mainframe data
v Resistance to fully adopting Hadoop as a data hub – it is often paired with legacy database hubs
Ø Retail data focus drives emphasis on security
v PIPEDA/GLBA compliance deemed critical despite little to no use of PII/PCI data in reports
v SOX compliance mandatory
Ø Architecture teams hold the dominant view in data hub projects
v Business sponsor is often a newly established Data Management Office
v Focus on cost and process optimization of data flows to downstream reporting solutions
Ø Internal build – low to no adoption of commercial hub solutions
11. RDARR Data Hub Challenges
Ø Hadoop Data Governance is early stage and poorly integrated:
v No good Hadoop solution to data governance (yet)
v Data lineage is at the file level in Hadoop – not suitable for RDARR critical data element traceability
v Policy-based data access solutions still in development (e.g. Navigator, Atlas)
Ø Enterprise ETL tools not Hadoop enabled:
v Many tools unable to push transformation work to Hadoop (or only as rudimentary Hive SQL)
v Performance of established ETL tools often poor on Hadoop
Ø Early mover penalty: Hadoop 2.x included solutions to many early security
and operational problems “in the box”:
v Projects with 2013 start dates were based on Hadoop 1.x – and so are usually Cloudera-based
v Established US banking shops are usually on Cloudera or MapR implementations for same reason
14. Choosing a Hadoop Distribution
Ø Maximize your exposure to change:
v Hadoop moves at a very fast pace: expect to deploy a meaningful update every 3-6 months
v Avoid designs and products that try to encapsulate Hadoop – they fall behind faster than you can
recover your investment
Ø Legacy tool compatibility is important:
v SAS compatibility is critical (even though SAS doesn’t integrate well with Hadoop)
v Does your organization have DB2 or PL/SQL skills to preserve?
Ø It’s not as easy to switch distributions as you think
Ø Wait for the features you like to become free:
v Strong history of the open-source distribution incorporating features that were previously
proprietary – newer vendors attack incumbents by producing open-source replacements for
proprietary extensions
15. Data Engineering
Ø Risk modelling is often very inefficient:
v A quantitative modeler typically spends 80% of their time gathering and preparing data
v Specialized data preparation is often difficult to repeat in production environments
Ø Data Engineering accelerates quantitative modelling:
v Advanced research labs hire data engineers to support their quantitative modelers
v Data Engineers are a hybrid of computer programmer and mathematician: they use IT-friendly tools
to source and package data into forms that are tailored to the modeler’s tool set (e.g. smoothing a
time series – see the sketch after this slide)
v Marketing teams use a 1:5 ratio of modelers to data engineers – but 10:1 is common on the “buy
side” and so is a better staffing target for a bank
Ø Data hubs should target data engineers as users:
v Build sophisticated tools for expert consumers, rather than rudimentary tools for casual users
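To make the data engineering role concrete, here is a minimal sketch of the packaging task mentioned above – smoothing a daily time series with Python and pandas. The file and column names are illustrative assumptions, not details from any client project.

import pandas as pd

# Hypothetical input: daily P&L observations with calendar gaps.
raw = pd.read_csv("daily_pnl.csv", parse_dates=["date"], index_col="date")

# Re-index to a complete business-day calendar and forward-fill gaps,
# so the modeler receives an evenly spaced series.
series = raw["pnl"].asfreq("B").ffill()

# Smooth with a 20-day rolling mean – one simple choice; an exponentially
# weighted window would be an equally plausible request from the modeler.
smoothed = series.rolling(window=20, min_periods=1).mean()

# Package the result in the form the modeler’s tool set expects.
smoothed.to_csv("daily_pnl_smoothed.csv", header=["pnl_smoothed"])

The value of scripting the preparation this way is repeatability: the same ten lines can be promoted to a production schedule, addressing the production-repeatability problem noted above.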
17. Risk Architecture Insights
Ø Hadoop is a compute grid:
v YARN is functionally equivalent to DataSynapse or Platform Symphony
Ø You can wrap most computations using map/reduce:
v Writing a map/reduce wrapper to feed data to your C#, Java, C++, or
Python applications is surprisingly easy – a hundred lines of code usually
does it (see the sketch after this slide)
Ø Use Hadoop to bring the computation to the data:
v Re-process your data files into computationally efficient HDFS blocks
v Eliminating movement of data in a compute-centric risk application
improves performance dramatically
v Still need caching of intermediate valuation products (e.g. zero curves)
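To illustrate how small such a wrapper can be, below is a sketch of a Hadoop Streaming mapper in Python – Streaming being the standard way to feed records over stdin/stdout to executables written in other languages. The pricing binary name and the tab-separated record layout are illustrative assumptions.

#!/usr/bin/env python
# price_mapper.py – minimal Hadoop Streaming mapper (illustrative).
# Submitted with something like:
#   hadoop jar hadoop-streaming.jar -input trades -output values \
#     -mapper price_mapper.py -file price_mapper.py -file price_engine
import subprocess
import sys

for line in sys.stdin:
    # Assumed record layout: trade_id <TAB> trade payload
    trade_id, payload = line.rstrip("\n").split("\t", 1)
    # Hand the record to the existing C++/C#/Java valuation executable.
    result = subprocess.run(
        ["./price_engine"], input=payload,
        capture_output=True, text=True, check=True,
    )
    # Emit trade_id <TAB> value for downstream aggregation.
    print(f"{trade_id}\t{result.stdout.strip()}")

The “bring the computation to the data” point works in the same spirit: re-encode source files once into a splittable columnar format such as Parquet, so each mapper scans local HDFS blocks rather than pulling data across the network.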
18. Infrastructure Lessons Learned
Ø Pay attention to the network:
v Hadoop needs a fast network backbone between nodes
v Applications and databases that draw data from Hadoop (e.g.
Tableau) should be co-located
Ø Hadoop grids should cost less than $1,000/TB:
v Including hardware and support subscription for a major Hadoop
distribution
v Hadoop reference configurations are based on mid-price commodity
hardware, so use that
v Virtualization will provide cheaper infrastructure, but higher node
counts offset savings by driving up support subscription costs
Storage Costs ($/TB)
Hadoop     $1,000
SAN        $5,000
Database  $12,000
Source: InformationWeek, 07/27/2012
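A budgeting nuance behind the $1,000/TB figure: HDFS keeps three copies of each block by default, so cost per usable terabyte is roughly three times cost per raw terabyte. A back-of-the-envelope sketch in Python, with every node count and price an illustrative assumption:

# Illustrative grid-sizing arithmetic – every input is an assumption.
nodes = 20
raw_tb_per_node = 48      # disk per data node
cost_per_node = 15_000    # hardware plus support subscription
replication = 3           # HDFS default block replication factor

raw_tb = nodes * raw_tb_per_node
usable_tb = raw_tb / replication
total_cost = nodes * cost_per_node

print(f"cost per raw TB:    ${total_cost / raw_tb:,.2f}")     # $312.50
print(f"cost per usable TB: ${total_cost / usable_tb:,.2f}")  # $937.50

Whether a $1,000/TB target refers to raw or usable capacity is therefore worth pinning down before comparing vendor quotes.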
21. Technology Themes for 2016
Ø Mix-and-match SQL engines:
v Native Hadoop SQL engines lack many advanced features in database SQL engines
v Oracle and IBM are unbundling their Hadoop implementations of PL/SQL and DB2
v Oracle’s PL/SQL engine for Hadoop runs on Cloudera and could be available on Hortonworks
v IBM is releasing BigSQL (DB2) for ODP – meaning it won’t be available on Cloudera
Ø Open Data Platform: FUD or fantastic?
v Pivotal has used ODP to partner with Hortonworks and focus on their other tools
v IBM has promised to release all of their data science tools for ODP, but has been slow to deliver
Ø IBM “all in” on Spark:
v IBM’s data science tools (e.g. BigR) complement typical Spark use cases (e.g. clustering)
Ø Tableau displacing Cognos & BOBJ