Intel big data analytics in health and life sciences personalized medicine

The Age of Data-Driven
Personalized Medicine
Ketan Paranjape
Worldwide Director, Health & Life Sciences
Intel Corporation
www.intel.com/healthcare/bigdata

Notice and Disclaimers
• Notice: This document contains information on products in the design phase of development. The information here is
subject to change without notice. Do not finalize a design with this information. Contact your local Intel sales office or
your distributor to obtain the latest specification before placing your product order.
• INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. EXCEPT AS PROVIDED
IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER,
AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY RELATING TO SALE AND/OR USE OF INTEL PRODUCTS,
INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR
INFRINGEMENT OF ANY PATENT, COPYRIGHT, OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products are not
intended for use in medical, life saving, or life sustaining applications. Intel may make changes to specifications,
product descriptions, and plans at any time, without notice.
• All products, dates, and figures are preliminary for planning purposes and are subject to change without notice.
• Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or
"undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or
incompatibilities arising from future changes to them.
• Performance tests and ratings are measured using specific computer systems and/or components and reflect the
approximate performance of Intel products as measured by those tests. Any difference in system hardware or
software design or configuration may affect actual performance.
• The Intel products discussed herein may contain design defects or errors known as errata which may cause the
product to deviate from published specifications. Current characterized errata are available on request.
• Knights Corner, Knights Ferry, Aubrey Isle and other code names featured are used internally within Intel to identify
products that are in development and not yet publicly announced for release. Customers, licensees and other third
parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or
services and any such use of Intel's internal code names is at the sole risk of the user.
• Copies of documents which have an order number and are referenced in this document, or other Intel literature, may
be obtained by calling 1-800-548-4725, or by visiting Intel's website at http://www.intel.com.
• Intel®, Itanium®, Xeon®, Pentium®, and the Intel logo are trademarks or registered trademarks of Intel
Corporation or its subsidiaries in the United States and other countries.
• Copyright © 2011-13, Intel Corporation. All rights reserved.
• *Other names and brands may be claimed as the property of others.

Compute for Personalized Medicine
a.k.a. Big Data Analytics in Healthcare and Life Sciences

ACTIONABLE EHR ANALYTICS - PAYER, PROVIDER

Regional Health Information Network
RHIN – China (Jinzhou, Pop 3M)
• Challenge: RHIN has
challenges with scalability,
performance and
maintenance. Data storage
is expensive
• Solution: EMR data and
healthcare services running
on Intel Hadoop Distribution
and Xeon E5 servers.
• Benefits: High
performance and scalability
demonstrated via POC and
stress testing. Significantly
reduced storage cost
• 1/5 Reduction in
Response Time; 5x
Concurrent Users
Data processing flow of RHIN platform
http://hadoop.intel.com/pdfs/IntelChinaHealthyCityAnalyticsCaseStudy.pdf

GE-Medical Quality Improvement
Consortium (MQIC)
• Challenge – Gaining value from
data in EMRs/EHRs and other
digital health information tools
• Solution – De-identified data
from Centricity EMRs; Analytics
capabilities to enhance their
quality and reporting activities.
• 1.6 billion documents
representing 30 million de-
identified patient records and
209 million office visits.
• Benefits - Physician practices
and ambulatory care clinics
deliver their best care more
efficiently, along with
population-based research and
public health activities.
DEMO - http://visualization.geblogs.com/visualization/network/

NHS Trust – Leeds Teaching
Hospitals
• Challenge – Capture data at the
point of admission, throughout the
patient care cycle and use natural
language processing (NLP) to make
sense of unstructured care notes and
combine with structured care data for
analysis
• Solution – Partnering with ISVs –
Ascribe, Two10degrees, Microsoft
and machines powered by Intel Xeon
processor E5 family;
• 30M patients, > 7M attendances
each year worth of records
• Benefits – Billing optimizations
(doctors log the correct data),
Resource Optimizations (learning
patient trends for resource planning)
DEMO - http://visualization.geblogs.com/visualization/network/
“The use of big data analysis on
our patient care notes enables us
to prove things our clinical intuition
was telling us. In the new world
anecdotal evidence isn’t enough.
What we think isn’t sufficient to
spend money. We need proof.”
Iain MacBrairdy,
Business Manager,
Emergency Medicine,
Leeds Teaching Hospitals

Charité “Real-time” Cancer Analysis –
Matching proper therapies to patients
• Challenge: Real-time
analysis of cancer patients
using the in-memory SAP
HANA Oncolyzer database
that is running on mission
critical Intel Xeon family
infrastructure (3.5M data
points per patient, up to 20
TB of data per patient)
• Solution: Collecting and
analyzing tables of structured
and unstructured data used
to take up to two days;
it now takes seconds
• Benefits: Improves medical
quality in a disruptive way for
– Patient
– Doctor
– Hospital
– Research
http://moss.ger.ith.intel.com/sites/SAP/SAP%20account%20team%20documents/Marketing/SAP%20HANA/SAPHANA_Charite_case_study_HI.PDF

HANA Oncolyzer
• Ad-hoc Analysis of heterogeneous
tumor data for cancer research
• Medical records from decades of tens
of thousands of patients
• Structured and unstructured data
(records, time series, free text, etc.)
Solution
• Integrated into condensed but
exhaustive view
• On-the-fly analyses
(e.g. Kaplan-Meier estimation, cohort statistics)
• Includes external data sources
(e.g. PubMed, pharmaceutical databases)
• Attributes can be native, views,
freetext-extracted, calculated
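
The Kaplan-Meier estimation named above can be sketched in a few lines. This is a minimal illustration, not the Oncolyzer implementation; the function name `kaplan_meier` and the input layout (one follow-up time and one event flag per patient) are assumptions made for the example.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival probability)] at each distinct event time.

    durations: follow-up time per patient (any consistent unit)
    observed:  True if the event occurred at that time, False if censored
    """
    data = sorted(zip(durations, observed))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        events = 0
        leaving = 0
        # Group all patients who share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            events += 1 if data[i][1] else 0
            leaving += 1
            i += 1
        if events:
            # Product-limit step: S(t) = S(t-) * (1 - d_t / n_t)
            survival *= 1 - events / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


if __name__ == "__main__":
    # Four hypothetical patients; the second one is censored at time 2.
    for t, s in kaplan_meier([1, 2, 3, 4], [True, False, True, True]):
        print(t, s)
```

Censored patients leave the risk set without lowering the curve, which is why the estimator suits cohorts with incomplete follow-up.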

LIFE SCIENCES, PHARMA,
GENOMICS

Life Sciences: At the intersection of
transformative forces
HPC: Enabling exascale
computing on massive
data sets
Cloud: Helping enterprises build
open interoperable clouds
Open Source: Contributing code and
fostering ecosystem

Genomics Is A Big Data Problem
[Figure labels: Affecting Factors, Cell Response]
313 Exabytes
if everyone in the
US has their genes
sequenced
495 Exabytes
if every cancer
patient in the US
has their genes
sequenced every 2
weeks
A complex
interaction of varied
& changing intrinsic
and extrinsic factors
determine cell
response
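
As a rough check on the first figure, the per-person data volume it implies can be worked out directly. The population value below is an assumption (approximately the US population around 2012) and does not come from the slide; decimal units are assumed as well.

```python
EXABYTE = 10**18   # decimal units, an assumption
TERABYTE = 10**12


def bytes_per_person(total_exabytes, population):
    """Average bytes per person implied by a total data volume."""
    return total_exabytes * EXABYTE // population


if __name__ == "__main__":
    us_population = 313_000_000  # assumption, not stated on the slide
    per_person = bytes_per_person(313, us_population)
    print(per_person / TERABYTE, "TB per person")
```

Under that assumption the 313-exabyte figure corresponds to about one terabyte of sequencing data per person.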

Life Sciences:
Key Industry Challenges and Solutions
• Challenge: Many (most) applications are single-threaded, single address space
  Solution: Intel is delivering optimizations working with open source community, developing NGS+HPC curriculum
• Challenge: Some algorithms scale quadratically with the size of the problem. Large data sets exceed available memory and storage
  Solution: Innovations in acceleration, compute, storage, networking, security, and *-as-a-service.
• Challenge: International collaboration is an imperative, bioinformatics expertise is scarce
  Solution: Intel is working closely with the ecosystem to address enterprise to cloud transmission of terabyte payloads
• Challenge: Databases are distributed, data is siloed and will likely stay that way
  Solution: Tools like Hadoop, Lustre, Graphlab, In-Memory Analytics, etc.
Need for Balanced Compute Infrastructure

Dell Active Infrastructure for
HPC Life Sciences
• Challenge: Experiment processing takes
7 days with current infrastructure.
Delays treatment for sick patients
• Solution: Dell Next Generation
Sequencing Appliance
– Single Rack Solution
– 9 Teraflops of Sandy Bridge Processors
– Lustre File Storage
– Intel SW tools and engineers
• Benefits: RNA-Seq processing
reduced to 4 hours
• Includes everything you need for NGS -
compute, storage, software, networking,
infrastructure, installation, deployment,
training, service & support
[Rack diagram. Actual placement in racks may vary.]
– Dell HSS (Lustre), up to 360TB: HSS Metadata Pair, HSS OSS Pair, HSS User Data
– Dell NSS (NFS), up to 180TB: NSS-HA Pair, NSS User Data
– M420 (Compute), up to 32 nodes
– Infrastructure: Dell PE, PC & F10
– 2U Plenum

IBM, CLC bio Genomics Sequencing
Analytics Solution
• Challenge: Need for processing power and
storage capacity in order to correlate the
variants in the genome with the relevant patient
symptoms
• Solution: IBM®, CLC Genomics server SW,
Genomics Workbench client SW; Small (48
Cores, 192 GB), Medium, Large (192 Cores,
768 GB) Analytics Solutions
• Benefits:
– Reference Mapping for 37x coverage human
genome – ~9hr (1 node) to ~30mins (37 nodes)
– Variant Calling and annotation for 37x coverage –
~40 hrs (1 node) to ~3hrs (23 nodes)
• Infrastructure
– IBM System x® 3550 M4, E5-2650; 48 CPU cores and 192
GBs of memory to 192 CPU cores and 768 GBs of memory
– IBM Storwize® V7000
– CLC Genomics Server 5.0.2 , Workbench 6.0.1
– 7x 3TB SAS 6 Gbps HDD (16 TB usable)
http://www-148.ibm.com/bin/newsletter/tool/landingPage.cgi?lpId=6155
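
The scaling figures quoted under Benefits can be turned into speedup and parallel efficiency. This is a small sketch using only the approximate numbers on the slide; the function name is illustrative.

```python
def speedup_and_efficiency(single_node_hours, multi_node_hours, nodes):
    """Speedup over one node, and the fraction of ideal linear scaling achieved."""
    speedup = single_node_hours / multi_node_hours
    return speedup, speedup / nodes


if __name__ == "__main__":
    # (single-node hours, multi-node hours, node count), approximate slide values
    runs = {
        "Reference mapping": (9.0, 0.5, 37),
        "Variant calling and annotation": (40.0, 3.0, 23),
    }
    for name, (t1, tn, n) in runs.items():
        s, e = speedup_and_efficiency(t1, tn, n)
        print(f"{name}: {s:.1f}x on {n} nodes, {e:.0%} parallel efficiency")
```

Efficiency well below 100% is expected here, since the quoted times are rounded and the workloads include steps that do not distribute perfectly.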

NGS Appliances
BioTeam “SlipStream”
• Challenge: Significant IT overhead,
limited bioinformatics support, changing
landscape
• Solution: “Slipstream” Appliance
• Benefits:
– Minimize lab IT startup costs
– Integrate and standardize data management
including security, easily traceable results
– Adaptable to any Laboratory, Workflow-
based Lab Management
– Seamless Sequencer Integration
• Infrastructure
– Dell PowerEdge T620 Desktop Server
– 2x Intel Xeon 8 Core Processors (16 cores)
– 16x 32GB RAM (512GB), 1x 100GB SSD
– 7x 3TB SAS 6 Gbps HDD (16 TB usable)

Convey Computing’s Hybrid Core Architecture
to Accelerate Algorithms
• Challenge: Advances in sequencing
technology have significantly increased data
generation and require similar computational
advances for bioinformatics analysis
• Solution: Convey Hybrid-Core (HC)
architecture - Intel® x86 microprocessors
with a coprocessor comprised of
reconfigurable hardware (FPGAs)
• Benefits: Accelerated BWA pipeline up to
18x compared to a standard x86 system
• Project Characteristics:
HC-1: Intel L5408, Xilinx Virtex-5 FPGAs, 1TB
SATA disks
HC-2: Intel X5670, Xilinx Virtex-5 FPGAs, 1TB
SATA disks
HC-2ex: 128GB (host), 64GB (coprocessor),
1TB SATA disks

Genomics & Health Analytics Appliances
Scale through independent solutions,
each targeting a different segment & usage model

Ultra High-Speed
Networking Optimizations – Aspera Labs
• Challenge: Improving big data transfer to
and from the backend data center
• Solution: Optimize ultra high-speed (10
Gbps and beyond) data transfer solutions
built on Aspera’s FASP™ transport
technology and Intel’s innovative hardware
platform
• Benefits with Intel Xeon E5-2600
(DDIO, SR-IOV)
– 300% improvement in Aspera transfer
throughput
– Same transfer speed performance in both
physical and virtualized computing
environments
– Both LAN and WAN transfer speeds had
similar results
• Infrastructure and Data Characteristics:
– Xeon E5 2687, 32GB DDR3 with Non-Uniform Memory Access (NUMA)
Data Direct IO (DDIO), Intel 910 SSD, Intel 82599EB 10 GbE
– Aspera Enterprise server 3.1.1.66573, Aspera Performance Automation
Suite

High-Performance Interconnect (InfiniBand)
and HPC – Intel® True Scale Fabric
• Challenge: Can high performance interconnect
technology (InfiniBand) keep up with the increase in the
number of processor cores?
• Workloads: VASP, WIEN2K
• Benchmarks: MVAPICH (MPI over InfiniBand),
IMPI (Intel MPI)
• Results:
– Scale-up research – 5- to 10-fold
improvement in performance when scaling
from a single node to 16 nodes
– Intel® True Scale Fabric QDR-40 shows
excellent price/performance results
• Infrastructure and Data Characteristics:
– 1 Head + 16 compute nodes, Dual Xeon® E5 2680 2.7GHz p/node
– 32GB of RAM 1666MHz p/node
– RHEL, Compiler, MPI variations available
– Intel® Cluster Suite, Intel® Fabric Suite
Data Life Cycle Management
with iRODS – EMC, RENCI
• 4.3M WGS for all US newborns/yr. ~= 100 PB*
• Can I describe it?
• Can I find it?
• Can I access it?
• Can I move it?
* Chris Mason, Weill Cornell Medical College, WGA Mtg. Nov. 2012

High Performance Scale-out Storage for
Wellcome Trust Sanger Institute
Challenge: Exponential increases in the volume of data being generated – but
storage budgets are flat or growing slowly.
Large data sets are difficult to proactively manage, and can easily overwhelm
storage resources. Un-optimized storage has a direct, negative impact on
application performance – slowing the time for breakthrough results.
Solution: Exploit the power and scale of HPC-class storage, powered by Intel®
Enterprise Edition for Lustre* software for unprecedented performance with
unmatched management simplicity.
Benefits of storage solutions powered by Intel EE for Lustre software:
– Openness – Developed and enhanced by the Lustre experts
– Global namespace – all clients can access all data
– Performance – Upwards of 1 TB/s
– Virtually unlimited file system and per file sizes
– Management simplicity using Intel® Manager for Lustre*

Heterogeneous Clusters for Biomedical Computing at
Virginia Bioinformatics Institute (VBI)
• Challenge: Rapid data growth and the need
to run varied applications call for a
scalable, novel computing infrastructure.
• Solution: Combination of Intel® Xeon®,
Intel® True Scale QDR InfiniBand and SGI’s
InfiniteStorage platform was deployed to
deliver a 300% speedup. Overall reduction
in cost resulted in the purchase of additional
compute nodes.
• VBI Cluster – Symmetric multiprocessing
(SMP) nodes (large memory Xeon E7) with
1 TB of RAM, massively parallel processing
(MPP) nodes (Xeon E5) with 64 GB. 50 PB of
tape storage, 600 TB of HDD. Using SGI’s
IS16000 platform and Intel TrueScale
fabric, VBI moves data through the storage
systems at 2 GB per second.
“The amazing thing is that we see
almost a three times performance
increase on 48 nodes compared to 56
nodes of the previous generation,
even though the processors are
slightly slower clock speed. The Intel®
QuickPath Interconnect and Intel®
TrueScale Fabric have had a big
impact.”
Dr. Kevin Shinpaugh,
Director of IT and HPC,
Virginia Bioinformatics Institute

Top-5 Pharmaceutical Company -
SAS Grid
• Challenge: Need to accelerate and
optimize “time to results” clinical trial
simulation environment; resource
allocation and job prioritization was
manual/ad-hoc
• Solution: “Scale-Out” architecture:
– SAS Visual Analytics, Enterprise Miner, Grid
Manager
– Red Hat Enterprise Linux
– Xeon E5 servers (HP)
• Benefits: Clinical trial simulation
exercises reduced from hours to < 5
minutes; registration decisions
accelerated with multi-hundred million
USD impact
http://www.intel.com/content/www/us/en/cloud-computing/cloud-computing-xeon-e5-carestream-imaging-brief.html

Mitsui Knowledge Industry (MKI)
• Challenge: Reduce the amount of
time it takes to do complete genomic
analysis and deliver results to
patients
• Solution: Real-Time Big Data
Platform
– R (Revolution Analytics)
– SAP HANA
– Hadoop
• Benefits: Genomic analysis
shortened from several days to 20
minutes; performance for some
queries improved 400,000 X
http://www.intel.com/content/www/us/en/cloud-computing/cloud-computing-xeon-e5-carestream-imaging-brief.html
http://www.saphana.com/docs/DOC-3641

Data-Intensive Discovery: Genomics
Intel Distribution
Value
• Enable researchers to discover biomarkers
and drug targets by correlating genomic data
sets
• 90% gain in throughput; 6X data compression
Analytics
• Provide curated data sets with pre-computed
analysis (classification, correlation,
biomarkers)
• Provide APIs for applications to combine and
analyze public and private data sets
Data Management
• Use Hive and Hadoop for query and search
• Dynamically partition and scale HBase
• 10-node cluster / Intel Xeon E5 processors
• 10GbE network
Intel Confidential
Big Data, Bioinformatics
Problem Statement:
Back in 2008 a genome research team
faced compute and scalability issues in
comparing all pairs of 4 million
proteins; the BLAST search results
overwhelmed a single database table.
Today they need to compare 14 million
proteins, and this requirement cannot
be addressed with existing
technology.
• Solution: Intel Distribution for
Hadoop (IDH), MapReduce,
HBase, Hive
• Benefits: Ability to compare 14
million proteins and more,
reducing the processing time
from days to hours.
• Project Characteristics:
Hadoop: 5-node cluster
Storage: 16TB (internal
storage) per server
Servers: Xeon E5 2 socket 8
cores, 64GB RAM
SLA: reducing processing time
from 30 days to less than a
day and scale to 4x4 million
samples comparison
Data: Multi-Terabyte database
[Diagram labels: Team website, Blast Program, Genome data, Proteins comparison, High performance scalable Hadoop/HBase cluster]
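
The jump from 4 million to 14 million proteins is larger than it looks, because an all-pairs comparison grows quadratically with the number of proteins. The sketch below counts unordered pairs as an illustration; counting ordered pairs instead would double both totals and leave the ratio unchanged.

```python
def unordered_pairs(n):
    """Number of distinct pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    pairs_2008 = unordered_pairs(4_000_000)
    pairs_today = unordered_pairs(14_000_000)
    print(f"4M proteins:  {pairs_2008:,} pairs")
    print(f"14M proteins: {pairs_today:,} pairs")
    print(f"growth: {pairs_today / pairs_2008:.2f}x for 3.5x more proteins")
```

A 3.5x increase in input therefore means roughly 12x more comparisons, which is the kind of growth that motivates a scale-out store over a single database table.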

High Throughput Science:
Embracing Cloud-based Analytics
• Challenge: Team of cancer
researchers had to screen a drug
concept with a list of tens of millions
of molecules working with a tight
deadline, a fixed budget, and strict
security and compliance requirements.
Schrödinger’s existing in-house
servers would be tied up for weeks
• Solution: Schrödinger leveraged
software from AWS partner, Cycle
Computing, to provision a fully
secured cluster of 50,000 cores,
powered by the Intel® Xeon®
processor E5 family.
– This configuration enabled the
team to run 16 million
molecular simulations an hour.
– Developed 1000 molecule list
in < 8hrs.

High Throughput Science:
Large Scale Computational Chemistry Simulation
• Challenge: Sustaining access to
50000+ compute cores for large scale
computational chemistry simulation
results in under a week. Ability to
monitor and re-launch jobs, no
additional capital expenditure with
internal HPCC already running at
capacity.
• Solution: Novartis leveraged software
from AWS partner, Cycle Computing,
and MolSoft to provision a fully
secured cluster of 30,000 CPUs,
powered by the Intel® Xeon®
processor E5 family.
– Completed screening of 3.2
million compounds in
approximately 9 hrs, compared
to 4-14 days on existing
resources.
Virtual Screening

Goals and Current applications target
• Focus on improving
genomics pipelines
• Optimize individual
applications
• Work with code
authors to release
optimizations
• Intel® Xeon®
processor focus
• Selectively experiment with
Intel® Xeon Phi™ coprocessor
DOMAIN: Genomics
Applications            Intel® Architecture Target
Bowtie 1*, Bowtie 2*    Xeon® processor
BWA*                    Xeon® processor
BLAST*                  Xeon® processor
GATK*                   Xeon® processor
HMMER*                  Xeon® processor, Xeon® Phi™ coprocessor
Abyss*                  Xeon® processor
Velvet*                 Xeon® processor
*Other names and brands may be claimed as the property of others.

TGen* RNA sequencing pipeline
Partnership between Intel®, DELL*, TGen*
1.8x
** 2-socket Intel(R) Xeon(R) CPU E5-2687W / 3.1 GHz

Goals and Current applications target
• Optimize for Intel®
Xeon® processor and
Intel® Xeon Phi™
coprocessor (node
and cluster)
• Increase availability
of applications on
Intel® Xeon Phi™
coprocessor
• Work with code
authors to release
optimizations
DOMAIN: Molecular Dynamics/Chemistry
Intel® Architecture Targets: Xeon® processor, Xeon® Phi™ coprocessor
Applications: AMBER*, NAMD*, GROMACS*, GAMESS*, Quantum Espresso*, Gaussian*, VASP*, CP2K*, QBOX*, CPMD*, LAMMPS*

Intel® Xeon® processor: new platforms,
architecture improve Life Science applications
2-socket “Ivybridge vs. Sandybridge”
Ivybridge 12c/24T 2.7GHz, Sandybridge 8c/16T 2.9GHz
http://www.intel.com/content/www/us/en/benchmarks/server/xeon-e5-2600-v2/xeon-e5-v2-hpc-life-sciences.html

Optimizing/Accelerating the DNA Pipeline
Compression – IPP library – HW Acceleration – Custom library
FPGA Acceleration

Incorporating Intel IPP Deflater into Picard Tools

Picard MarkDuplicates Optimizations
Two Fold Approach:
1. Added optional tag ‘MC’ to SAM Specification
• Tag ‘MC’ is used to store Mate Cigar for a Paired Read, where mate is mapped.
• SAM JDK extended to support tag ‘MC’
• MergeBamAlignment modified to include the new ‘MC’ tag within each relevant
record of the SAM/BAM file
2. Redesign of MarkDuplicates
• Inclusion of ‘MC’ tag provides opportunity for algorithmic redesign of
MarkDuplicates
• Overall speedup ~2x for MergeBamAlignment/MarkDuplicates
Additional Gains: Enables streaming of records for the entire pre-GATK phase (from
‘bwa mem’ to ‘MarkDuplicates’ ) in a typical bwa_mem+GATK workflow

MarkDuplicates
BASELINE
[Figure: records RdX_1, RdY_1, RdX_2, RdY_2 in file order, each carrying its own Cigar]
1) Store information per each read:
Used to determine unclipped 5’ coordinate for
both ends & orientation of pair
2) Sort reads within the entire file by unclipped
5’ coordinate and MarkDuplicates
3) Write out BAM file
OPTIMIZED
[Figure: the same records, each carrying its Cigar plus an MC: tag; pair labels PairX, PairY]
1) Sort reads within a small window by
unclipped 5’ coordinate:
- MarkDuplicates
- Write out
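
To make the role of the ‘MC’ tag concrete, here is a minimal sketch of how an unclipped 5’ coordinate can be derived from a position and a CIGAR string, and how a pair key can then be built from a single record once the mate's CIGAR is available. It is an illustration only, not Picard's code: the function names are hypothetical, positions are assumed to be 1-based SAM POS values, and both reads are assumed to map to the same reference sequence.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REFERENCE_CONSUMING = set("MDN=X")
CLIPPING = set("SH")


def unclipped_five_prime(pos, cigar, reverse):
    """Unclipped 5' coordinate of a read.

    pos:     1-based leftmost aligned position (SAM POS)
    cigar:   CIGAR string, e.g. "5S45M"
    reverse: True if the read aligns to the reverse strand
    """
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    if not reverse:
        # Forward strand: the 5' end is on the left, so undo leading clips.
        leading = 0
        for length, op in ops:
            if op not in CLIPPING:
                break
            leading += length
        return pos - leading
    # Reverse strand: the 5' end is on the right, so take the alignment end
    # and add trailing clips.
    reference_span = sum(length for length, op in ops if op in REFERENCE_CONSUMING)
    trailing = 0
    for length, op in reversed(ops):
        if op not in CLIPPING:
            break
        trailing += length
    return pos + reference_span - 1 + trailing


def pair_signature(pos, cigar, reverse, mate_pos, mate_cigar, mate_reverse):
    """Duplicate-detection key for a pair, computed from one record.

    mate_cigar stands for the value of the record's MC tag; without it,
    the mate's record would have to be looked up elsewhere in the file.
    """
    ends = [
        (unclipped_five_prime(pos, cigar, reverse), reverse),
        (unclipped_five_prime(mate_pos, mate_cigar, mate_reverse), mate_reverse),
    ]
    return tuple(sorted(ends))


if __name__ == "__main__":
    print(pair_signature(100, "5S45M", False, 300, "45M5S", True))
```

Because either record of the pair yields the same key, pairs can be grouped while streaming through a small window instead of after a whole-file sort.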

DNA Pipeline: BWA+GATK: Whole Genome Sample: ~65x Coverage
Collaborating with our Partners and Medical community
Pipeline before optimization:
Step (Tool)                              # of Threads   Runtime (hours)
Read Alignment (bwa)                     16             8
Sampe (bwa)                              1              24
Import (samtools)                        1              11
Sort + Index (samtools)                  1              14.5
MarkDuplicates (picardtools) + Index     1              11.5
UnifiedGenotyper* (GATK)                 16             7.5
SomaticIndelDetector (GATK)              1              3
RealignerTargetCreator (GATK)            16             0.8
IndelRealigner* (GATK) + Index           1              17.5
BaseRecalibrator* (GATK)                 1              62
PrintReads* (GATK) + Index + Flagstat    1              25
TOTAL (hours)                                           177

Process-level Parallelism, Thread-level Parallelism

Pipeline after parallelization:
Step (Tool)                              # of Threads   Runtime (hours)
Read Alignment (bwa mem)                 24             7
View (samtools)                          24             2
Sort + Index (samtools)                  24             3
MarkDuplicates (picardtools) + Index     1              11
RealignerTargetCreator (GATK)            24             1
IndelRealigner* (GATK) + Index           24             6.5
BaseRecalibrator (GATK)                  24             1.3
PrintReads* (GATK) + Index + Flagstat    24             12.3
TOTAL (hours)                                           44

Algorithmic Improvement: Redesign of MarkDuplicates + MergeBamAlignment: 30-36 hours

6X improvement so far: 4X without major code changes and the rest with code changes.
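
The 44-hour table shows why MarkDuplicates is the next target: it is the only step still listed with a single thread. The sketch below totals the step runtimes copied from that table and reports the single-threaded share; the bound it prints assumes that step stays unchanged while all other steps become arbitrarily fast.

```python
# (step, threads, runtime in hours), copied from the 44-hour table on this slide
STEPS = [
    ("Read Alignment (bwa mem)", 24, 7.0),
    ("View (samtools)", 24, 2.0),
    ("Sort + Index (samtools)", 24, 3.0),
    ("MarkDuplicates (picardtools) + Index", 1, 11.0),
    ("RealignerTargetCreator (GATK)", 24, 1.0),
    ("IndelRealigner (GATK) + Index", 24, 6.5),
    ("BaseRecalibrator (GATK)", 24, 1.3),
    ("PrintReads (GATK) + Index + Flagstat", 24, 12.3),
]


def single_threaded_share(steps):
    """Return (total hours, single-threaded hours, single-threaded fraction)."""
    total = sum(hours for _, _, hours in steps)
    serial = sum(hours for _, threads, hours in steps if threads == 1)
    return total, serial, serial / total


if __name__ == "__main__":
    total, serial, share = single_threaded_share(STEPS)
    print(f"total: {total:.1f} h, single-threaded: {serial:.1f} h ({share:.0%})")
    # Amdahl-style bound: the run can never be shorter than the serial part.
    print(f"best case if only the threaded steps improve: {total / serial:.1f}x")
```

About a quarter of the run is spent in that one step, so further gains depend on redesigning it rather than adding cores.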

Profiling: Single Instance Run – Lower Latency
# of Machines = 1
# of cores/Machine = 24
Temporary Storage – RAID0 2x4TB HDD
Input Dataset: G15512.HCC1954.1, coverage: 65x
Average CPU utilization is very low. Most cores not being used
Average I/O bandwidth is very low. Application not I/O bound
Average memory footprint is small. Application not using memory available in newer systems
There is a lot of room to improve

Smith Waterman Acceleration
Working on accelerating two versions of Smith Waterman:
1. Simplified version where gap open, gap extension, and mismatch penalties are identical
2. Affine gap penalty (as implemented in BWA-MEM)
Initial results on #1 seem promising
Speed up measured in terms of
throughput for these runs.
Banded Smith Waterman
implementation
Bitwise parallelism:
Packed32: 32-bit uint
Packed64: 64-bit uint
AVX: 256-bit vector
Xeon Phi: 512-bit vector
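
For reference, variant 1 (a single penalty shared by mismatches and gaps) restricted to a band can be sketched in scalar form as follows. This is a plain reference version for illustration, not the bit-parallel or vectorized implementation described on this slide; the function name and default scores are assumptions.

```python
def banded_smith_waterman(a, b, match=1, penalty=1, band=8):
    """Best local alignment score of a and b using cells with |i - j| <= band.

    Mismatches and gaps all cost the same `penalty` (no separate gap-open
    or gap-extension cost).
    """
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        lo = max(1, i - band)
        hi = min(len(b), i + band)
        for j in range(lo, hi + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else -penalty)
            up = prev[j] - penalty        # gap in b
            left = curr[j - 1] - penalty  # gap in a
            # Local alignment: a score never drops below zero. Cells outside
            # the band hold 0, so they cannot contribute a positive score.
            curr[j] = max(0, diag, up, left)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(banded_smith_waterman("ACGTACGT", "ACGTTACGT"))
    print(banded_smith_waterman("ACGTACGT", "ACGTTACGT", band=0))
```

Narrowing the band reduces the number of cells computed per row, at the cost of missing alignments whose gaps drift outside it, as the second call with `band=0` shows.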

Optimizing/Accelerating
Compression – IPP library – HW Acceleration – Custom library
FPGA Acceleration

Genomics - Big Data Problem
Affecting Factors
Cell Response
313 Exabytes
if everyone in the US has
their genes sequenced
495 Exabytes
if every cancer patient in the US has
their genes sequenced every 2 weeks.
Images, Assays and Drug
response data will push it
further up as shown in Blue line
Complex interaction of
varied & changing intrinsic
and extrinsic factors
determine cell response
Source: Knights Cancer Institute, Oregon Health Sciences University & Intel
Proliferation
Apoptosis
Differentiation
DNA Repair
Motility
Senescence
With genomic data growing rapidly, hospitals and research centers need to access both their local data (the data that is not shared) and
the centralized public/private data for analysis and analytics in genomic research, development and medicine.
Compute has to be done “where data is” and needs to be consistent locally and in the cloud.
Energy and Total Cost of Operation are key
Invasion, Metastasis &
therapeutic response
The day when every newborn gets their DNA sequenced is not far away: http://www.nih.gov/news/health/sep2013/nhgri-04.htm.

[Figure: matrix cells numbered 1 to 5 along successive anti-diagonals]
PairHMM Matrix Dependencies
Wave-Front Computation in AVX

Pair HMM Acceleration using AVX
• Computation kernel and bottleneck in GATK Haplotype Caller
• AVX enables 8 floating point SIMD operations in parallel
• 2 Ways to vectorize HMM computation
  – Intra-Sequence – Parallelize computation within one HMM matrix
    operation. Run multiple (8) computations concurrently along diagonal
  – Inter-Sequence – Perform multiple (8) HMM matrix operations at once

Configuration                 Time (seconds)   Speedup vs. C++ / vs. Java
Serial C++                    1540             1x / 9x
1 core with AVX (Intra)       340              4.5x / 40.7x
1 core with AVX (Inter)       285              5.4x / 48.6x
24 cores with AVX (Inter)     14.3             108x / 970x
24 cores hybrid (Inter)       15.7             98x / 882x
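
The intra-sequence approach relies on the fact that cells on one anti-diagonal of the matrix do not depend on each other. The sketch below shows that traversal order on a toy recurrence (longest common subsequence) standing in for the PairHMM update; the function names are illustrative, and the inner loop over one anti-diagonal is the part a SIMD version would pack into vector lanes.

```python
def antidiagonals(rows, cols):
    """Yield the cells (i, j) of a rows x cols matrix grouped by wavefront i + j."""
    for d in range(rows + cols - 1):
        yield [(i, d - i) for i in range(max(0, d - cols + 1), min(rows, d + 1))]


def lcs_wavefront(a, b):
    """Longest common subsequence length, filled in wavefront order.

    Each cell needs only its upper, left and upper-left neighbours, which
    all lie on earlier anti-diagonals, so the cells of one anti-diagonal
    are independent of each other.
    """
    rows, cols = len(a), len(b)
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for cells in antidiagonals(rows, cols):
        for i, j in cells:  # independent work items within one wavefront
            if a[i] == b[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])
    return table[rows][cols]


if __name__ == "__main__":
    print(list(antidiagonals(2, 3)))
    print(lcs_wavefront("AGGTAB", "GXTXAYB"))
```

The inter-sequence approach needs no such reordering: each vector lane runs the ordinary row-by-row fill for a different matrix.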

Policy – United States, European
Union
Snapshot of US, EU Recommendations
Develop an ICT-enabled European Strategy for Personalised Medicine 2014-2020
Driving research to unleash the potential of ICT at the point-of-care
EU R&D initiatives must address:
– Interoperability of technical standards for managing and sharing sequence data in
research and clinical samples;
– Development of hardware, software and workflow algorithms to accelerate cost
efficient analysis of genetic abnormalities that cause cancer and other complex
diseases;
– Research to ensure convergence of Big Data and Cloud Computing infrastructure to
meet the requirements of High Performance Computing and data throughout the life
sciences and healthcare value chains
The eHealth Action Plan 2020 should include Personalised Medicine as a
priority
– Gain knowledge of the challenges and barriers (technical, organizational, legal and
political) to the adoption of ICT in support of Personalised Medicine leveraged by
genomic information;
– Evaluate how to change workflows and education requirements to facilitate adoption
of ICT mediated personalized medicine in clinical practice;
– Expand collaboration with other regions of the world in matters of common interest,
e.g. by leveraging the eHealth MoU with the United States of America;
– Study, evaluate and disseminate technology neutral risk assessment frameworks for
data privacy and security, covering the entire ICT enabled Personalised Medicine
delivery chain;
– Develop effective methods for enabling the use of medical information for public health
and research

Intel Assets for Life Sciences
Intel Xeon E5
• Up to 80% greater performance
• Up to 70% more energy efficiency
• Up to 30% less network latency
• Hardware-accelerated security (AES-NI)
• Broad industry adoption
• Consistent Performance Gains each generation
Intel Xeon Phi
• Performance and programmability for highly-parallel workloads
• Programming continuity and scalable parallel programming models: common source code and software tools between multicore Intel® Xeon® and manycore Intel® Xeon Phi™
• Partner ecosystem continues growing and making progress
Intel Fabric
• Intel® True Scale Fabric designed from the ground up for HPC
• QDR-40 and QDR-80 deliver performance that scales - high MPI message rates and end-to-end latency that stays low at scale
• Optimized support for Intel® Xeon® E5 and Xeon® Phi processors
• Intel Fabric Suite – IB Fabric Management & FastFabric Management tools
Intel Storage
• Intel® Xeon® processors and platforms are enabled with beneficial storage optimizations
• Solid State Drives (SSD) and other NVM technologies improve storage performance
• Intel® Cache Acceleration Software
• Intel’s open source Lustre file-system support/development and Chroma management/provisioning tools
Intel Software
• Intel® Cluster Studio XE compilers, libraries, analysis tools, OpenMP and MPI
• Intel® Hadoop Distribution
• Intel® Data Center Manager and Intel® Node Manager (NM)
• Intel® Expressway Service Gateway for Cloud usage models

Summary
• Enabling ecosystem of partners to innovate and make
Personalized Medicine vision a reality
• Delivering hardware-enhanced capabilities and software to
accelerate science, translate results, deliver today.
• Looking for collaboration opportunities to take
Personalized Medicine mainstream by 2020
• Big Data/Analytics in Health & Life Sciences
• www.intel.com/healthcare
• hadoop.intel.com

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY
ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS
DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR
IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING
TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT,
COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel
microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer
systems, components, software, operations and functions. Any change to any of those factors may cause the results
to vary. You should consult other information and performance tests to assist you in fully evaluating your
contemplated purchases, including the performance of that product when combined with other products.
Copyright © , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and Cilk are trademarks
of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that
are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and
other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on
microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended
for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for
Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information
regarding the specific instruction sets covered by this notice.
Notice revision #20110804
Legal Disclaimer & Optimization Notice
Copyright© 2012, Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.
