Big Data in Banking focuses on the use of big data and Hadoop in the Canadian banking sector. The key points are:
1) The RDARR regulatory project is driving major investments in data management by the big six Canadian banks, totaling around $800 million over three years. This has led banks to implement Hadoop data hubs to centralize data.
2) Adoption of Hadoop for risk applications is still in early stages, with a focus on regulatory reporting. Capital markets has led adoption so far.
3) Lessons learned include choosing flexible Hadoop distributions, using native Hadoop tools for best performance, and designing hubs for data engineers rather than casual users. Infrastructure must have a fast network backbone between nodes, and a complete Hadoop grid should cost less than $1,000/TB.
TechConnex Big Data Series - Big Data in Banking
1. Big Data in Banking
Risk Systems Perspective
Andre Langevin
langevin@utilis.ca
www.swi.com
2. Agenda
Ø Big Data at the Big 6
Ø RDARR Data Hubs
Ø Lessons Learned (so far)
Ø Technology Themes in 2016
An important note about this presentation: in order to respect the commercial interests and privacy of my clients, I have refrained from using specific company names unless the information is publicly available.
4. RDARR Drives Big 6 Adoption
Ø RDARR is a mandatory regulatory project:
v Regulatory response to 2008 credit crisis
v Requires re-build of data gathering and regulatory reporting to implement
measurable data quality, operational metadata and auditable data lineage
v Regulatory enforcement starts in 2017
Ø Big 6 IT spend of ~$800MM over three years on RDARR
v Combined Big 6 IT spend on all Risk Systems projects is ~$400MM per year
v RDARR spend has largely been incremental – other regulatory initiatives have
continued to drive project spend separate from RDARR
Ø Hadoop data hub is a typical RDARR solution element
“The investment spend by G-SIBs on RDARR is very significant, averaging US$230MM per bank. These investment costs are likely to increase.”
– Oliver Wyman, “BCBS 239: Learning from the Prime Movers”
All of Canada’s Big 6 banks were designated as Domestic Systemically Important Banks (D-SIBs) by OSFI, meaning they must fully comply with BCBS 239.
8. Deployment Patterns
Ø Mix of virtual and physical server deployments:
v Cisco UCS and VMware vSphere are leading infrastructure choices
Ø Many banks report using multiple grids aligned to business units*:
v Tools to manage multi-tenancy on Hadoop are still nascent
v Organizational issues (cost allocation, support team alignments) inhibit shared deployments
Ø Vendor community has invested heavily in cloud deployment tools:
v One-click deployments of all major Hadoop distributions are available on public clouds
Ø Banks looking at “hub and sandbox” deployments on private clouds:
v Popular pattern in established US deployments
v The Big 6 have all built an internal private cloud or have access to one through a major infrastructure provider
v A notable S3/AWS deployment by US regulator FINRA sets the standard
* Hortonworks CAB
10. Typical RDARR Data Hub
Ø RDARR focus drives Data Hub solution characteristics:
v RDARR objective is auditable batch reporting – tied in to central lineage and metadata solutions
v Little consideration of unstructured or real-time data sources
v Often characterized as a raw-data landing zone for otherwise inaccessible mainframe data
v Resistance to fully adopting Hadoop as a data hub – it is often paired with legacy database hubs
Ø Retail data focus drives emphasis on security
v PIPEDA/GLBA compliance deemed critical despite little to no use of PII/PCI data in reports
v SOX compliance mandatory
Ø Architecture teams hold the dominant view in data hub projects
v Business sponsor is often a newly established Data Management Office
v Focus on cost and process optimization of data flows to downstream reporting solutions
Ø Internal build – low to no adoption of commercial hub solutions
11. RDARR Data Hub Challenges
Ø Hadoop Data Governance is early stage and poorly integrated:
v No good Hadoop solution to data governance (yet)
v Data lineage is at the file level in Hadoop – not suitable for RDARR critical data element traceability
v Policy-based data access solutions still in development (e.g. Navigator, Atlas)
Ø Enterprise ETL tools not Hadoop enabled:
v Many tools unable to push transformation work to Hadoop (or only as rudimentary Hive SQL)
v Performance of established ETL tools often poor on Hadoop
Ø Early mover penalty: Hadoop 2.x included solutions to many early security
and operational problems “in the box”:
v Projects with 2013 start dates were based on Hadoop 1.x – and so are usually Cloudera-based
v Established US banking shops are usually on Cloudera or MapR implementations for same reason
14. Choosing a Hadoop Distribution
Ø Maximize your exposure to change:
v Hadoop moves at a very fast pace: expect to deploy a meaningful update every 3-6 months
v Avoid designs and products that try to encapsulate Hadoop – they fall behind faster than you can
recover your investment
Ø Legacy tool compatibility is important:
v SAS compatibility is critical (even though SAS doesn’t integrate well with Hadoop)
v Does your organization have DB2 or PL/SQL skills to preserve?
Ø It’s not as easy to switch distributions as you think
Ø Wait for the features you like to become free:
v Strong history of the open-source distribution incorporating features that were previously
proprietary – newer vendors attack incumbents by producing open-source replacements for
proprietary extensions
15. Data Engineering
Ø Risk modelling is often very inefficient:
v A quantitative modeler typically spends 80% of their time gathering and preparing data
v Specialized data preparation is often difficult to repeat in production environments
Ø Data Engineering accelerates quantitative modelling:
v Advanced research labs hire data engineers to support their quantitative modelers
v Data Engineers are a hybrid of computer programmer and mathematician: they use IT-friendly tools
to source and package data into forms that are tailored to the modeler’s tool set (e.g. smoothing a
time series – see the sketch after this slide)
v Marketing teams use a 1:5 ratio of modelers to data engineers – but 10:1 is common on the “buy
side” and so is a better staffing target for a bank
Ø Data hubs should target data engineers as users:
v Build sophisticated tools for expert consumers, rather than rudimentary tools for casual users
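To make the data engineering role concrete, here is a minimal sketch of the packaging task mentioned above – smoothing a daily time series with Python and pandas. The file and column names are illustrative assumptions, not details from any client project.

import pandas as pd

# Hypothetical input: daily P&L observations with calendar gaps.
raw = pd.read_csv("daily_pnl.csv", parse_dates=["date"], index_col="date")

# Re-index to a complete business-day calendar and forward-fill gaps,
# so the modeler receives an evenly spaced series.
series = raw["pnl"].asfreq("B").ffill()

# Smooth with a 20-day rolling mean – one simple choice; an exponentially
# weighted window would be an equally plausible request from the modeler.
smoothed = series.rolling(window=20, min_periods=1).mean()

# Package the result in the form the modeler’s tool set expects.
smoothed.to_csv("daily_pnl_smoothed.csv", header=["pnl_smoothed"])

The value of scripting the preparation this way is repeatability: the same ten lines can be promoted to a production schedule, addressing the production-repeatability problem noted above.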
17. Risk Architecture Insights
Ø Hadoop is a compute grid:
v YARN is functionally equivalent to DataSynapse or Platform Symphony
Ø You can wrap most computations using map/reduce:
v Writing a map/reduce wrapper to feed data to your C#, Java, C++, or
Python applications is surprisingly easy – a hundred lines of code usually
does it (see the sketch after this slide)
Ø Use Hadoop to bring the computation to the data:
v Re-process your data files into computationally efficient HDFS blocks
v Eliminating movement of data in a compute-centric risk application
improves performance dramatically
v Still need caching of intermediate valuation products (e.g. zero curves)
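To illustrate how small such a wrapper can be, below is a sketch of a Hadoop Streaming mapper in Python – Streaming being the standard way to feed records over stdin/stdout to executables written in other languages. The pricing binary name and the tab-separated record layout are illustrative assumptions.

#!/usr/bin/env python
# price_mapper.py – minimal Hadoop Streaming mapper (illustrative).
# Submitted with something like:
#   hadoop jar hadoop-streaming.jar -input trades -output values \
#     -mapper price_mapper.py -file price_mapper.py -file price_engine
import subprocess
import sys

for line in sys.stdin:
    # Assumed record layout: trade_id <TAB> trade payload
    trade_id, payload = line.rstrip("\n").split("\t", 1)
    # Hand the record to the existing C++/C#/Java valuation executable.
    result = subprocess.run(
        ["./price_engine"], input=payload,
        capture_output=True, text=True, check=True,
    )
    # Emit trade_id <TAB> value for downstream aggregation.
    print(f"{trade_id}\t{result.stdout.strip()}")

The “bring the computation to the data” point works in the same spirit: re-encode source files once into a splittable columnar format such as Parquet, so each mapper scans local HDFS blocks rather than pulling data across the network.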
18. Infrastructure Lessons Learned
Ø Pay attention to the network:
v Hadoop needs a fast network backbone between nodes
v Applications and databases that draw data from Hadoop (e.g.
Tableau) should be co-located
Ø Hadoop grids should cost less than $1,000/TB:
v Including hardware and support subscription for a major Hadoop
distribution
v Hadoop reference configurations are based on mid-price commodity
hardware, so use that
v Virtualization will provide cheaper infrastructure, but higher node
counts offset savings by driving up support subscription costs
Storage Costs ($/TB)
Hadoop     $1,000
SAN        $5,000
Database  $12,000
Source: InformationWeek, 07/27/2012
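A budgeting nuance behind the $1,000/TB figure: HDFS keeps three copies of each block by default, so cost per usable terabyte is roughly three times cost per raw terabyte. A back-of-the-envelope sketch in Python, with every node count and price an illustrative assumption:

# Illustrative grid-sizing arithmetic – every input is an assumption.
nodes = 20
raw_tb_per_node = 48      # disk per data node
cost_per_node = 15_000    # hardware plus support subscription
replication = 3           # HDFS default block replication factor

raw_tb = nodes * raw_tb_per_node
usable_tb = raw_tb / replication
total_cost = nodes * cost_per_node

print(f"cost per raw TB:    ${total_cost / raw_tb:,.2f}")     # $312.50
print(f"cost per usable TB: ${total_cost / usable_tb:,.2f}")  # $937.50

Whether a $1,000/TB target refers to raw or usable capacity is therefore worth pinning down before comparing vendor quotes.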
21. Technology Themes for 2016
Ø Mix-and-match SQL engines:
v Native Hadoop SQL engines lack many advanced features in database SQL engines
v Oracle and IBM are unbundling their Hadoop implementations of PL/SQL and DB2
v Oracle’s PL/SQL engine for Hadoop runs on Cloudera and could be available on Hortonworks
v IBM is releasing BigSQL (DB2) for ODP – meaning it won’t be available on Cloudera
Ø Open Data Platform: FUD or fantastic?
v Pivotal has used ODP to partner with Hortonworks and focus on their other tools
v IBM has promised to release all of their data science tools for ODP, but has been slow to deliver
Ø IBM “all in” on Spark:
v IBM’s data science tools (e.g. BigR) complement typical Spark use cases (e.g. clustering)
Ø Tableau displacing Cognos & BOBJ