Trends from the Trenches: 2019

Trends from the Trenches
2019 Bio-IT World Conference
Chris Dagdigian
https://bioteam.net

Want these slides?
slideshare.net/chrisdag or
https://bioteam.net

Image by Deanna & Amy; used with permission
https://metoostem.com/
● Seems appropriate to include this
● Recurring 2019 theme for me has
been listening to the stories of
women forced out of early-career
academic paths or jobs because of
systemic harassment & bias in
STEM fields

@chris_dag - https://bioteam.net
I’m Chris. I work for
BioTeam
● Failed scientist turned infrastructure nerd
● 20 years working on infrastructure for life
science research; Now I’m old & lame
● As a consultant I get to see how many
different groups of smart people tackle similar
challenges
● Often I’m allowed to talk about what I see so I
collect trends, observations and common pain
points
● Started talking at BioIT in 2010 and they
won’t let me gracefully retire

Thought Excretor
Magic Quadrant
Competence / Domain Insight
Can talk bluntly
in public
@hpc_guru
@fdmnts
@glennklockwood
… you get
the idea
@{ many smart
people }
@{ vendor shills }
@mndoci

Tune Me Out or Filter My Words Accordingly
● Not a pundit
● Not a “thought leader”
● Not pretending to speak on behalf of our
huge and diverse industry
● This is a personal talk delivered through the
prism of prior work, clients, projects and
conversations
● Lots of industry/government work recently
● My observations have the same
diversity/inclusion problems as
science/workplaces in general
● Heavily influenced by past and current
projects and the interesting people I’ve
spoken or interacted with

2019 Catch-all: Observations, Anecdotes & Emergent Stuff

01: We’ve done OK entering “data intensive science” era
Turbulent for sure but we’ve managed
...
● Compute
○ Physical, virtual and cloud based computing
is a tractable problem at most scales
● Networking
○ > 10-Gbps still painful and expensive
○ Science DMZ design patterns are working
● Storage
○ Large capacity is a solved problem
○ Consumption rate still scary
One of the biggest unsolved problems
● Data Management, Discovery, Cataloging
and Classification
● It’s easy to store vast piles of data; we are still
terrible at understanding what we have
Question: How many vendors and
products did you see this week at BioIT’19
explicitly focusing on data management,
curation, metadata or discovery?

02: Scientific Computing: Still Undervalued By Leadership
Prior Talks / Younger Me
● “Computers are digital benchtops, not the
simple business process endpoints that
Enterprise IT treats them as”
● “HPC capability is essential for R&D; we
need leadership and investment parity with
the wetlab folk”
Today’s Talk / Older, Heftier & Wiser
● Still viewed as cost center to be minimized,
optimized and “value engineered”
● Only a few treat “extract insight & value
from data” as core competitive differentiator
beyond vapid words in mission statements
● HR not touting as major recruitment and
retention asset
Incompetence in this space is an existential survival threat to your company or organization.

03: Scientific Computing: User Trends
User Base Climbing Rapidly
● HPC and analytic capabilities are
extending from discovery and spreading
across the enterprise
● Pervasive need for HPC and analytic
competence across the entire
organization
○ % of staff outgrowing “laptop scale”
analysis is climbing fast
○ Competitive differentiator
○ Recruitment/Retention resource
○ Survival requirement in 2019
Users getting MORE and LESS
Sophisticated
● Users forced away from laptop-scale methods
require significant training and onboarding
● Yet new hires and early-career recruits often
show up with prior HPC and cloud expertise
● In general we are still pretty bad at training,
best practice propagation and knowledge
transfer
○ Especially in helping “intermediate” level users
become experts
Past talks ...
Today ...

04: Definition of HPC Being Stretched In Extreme Ways
● “Beyond laptop scale” computing requirement is
becoming pervasive across organizations
● We are often the only multi-user / shared-service
entities with large scale compute, storage, memory,
GPU and visualization capabilities
● HPC in danger of becoming the dumping ground for
any problem that does not fit on a laptop
● Parasitic usage causes:
○ Infrastructure tuned & biased for generic workflows
○ Support org becomes even more overwhelmed
○ Angry users demanding high-touch support and
special accommodations for niche stuff
“Any analysis that can’t
run on a cheap leased
laptop must require
HPC”
-- league of bad mgmt

05: Compilers & Toolchains: Mini Trend?
Coming out of a relatively stable era
● Intel dominated compute
● Genomics/informatics dominated workload
● Hardware & software well characterized
● 10+ years since I had to mess with
commercial compilers
This may be changing …
● Ludicrous rate of innovation seen in the
instrument space is starting to appear in our
tooling & applications
● Now?
○ Software in rapid improve/innovate cycles
○ Kernels and kernel modules matter
○ Compiler & glibc versions matter
○ Conservative RHEL/CentOS Linux distributions
may be moving too slowly for some scientific
domains
● May be time to re-evaluate some of our
foundational environments & toolchains

05: Compilers & Toolchains: Mini Trend?
The latest/fastest CPUs are expensive.
GPUs are expensive.
NVLINK is expensive.
DGX-2 list price is $400,000/ea
A reasonable investment in compiler and toolchain optimizations could pay
significant dividends

06: Compilers & Toolchains : Relion CryoEM Homework
Try this at home, kids! (if you have CPUs with AVX-512 support)
● Download latest Relion codebase from https://github.com/3dem/relion
● Test #1
○ Build using stock compiler and developer tools with CPU acceleration enabled
○ Run & time “relion_refine” using the common benchmark data set and commands
● Test #2
○ Repeat build with upgraded compiler and developer tools (e.g., GCC 7 on CentOS/RHEL 7)
○ Time how long the run takes
● Test #3 (if possible)
○ Repeat work using Intel ICC compiler (Intel Parallel Studio)
○ Time how long the run takes
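
The three tests above amount to timing the same command against different builds. A minimal sketch of such a harness follows. It assumes nothing about Relion itself: `time_command`, `summarize` and `speedup` are hypothetical helper names, and the real `relion_refine` command line should be taken from the benchmark instructions rather than from this example.

```python
import subprocess
import time


def time_command(argv, repeats=1):
    """Run argv `repeats` times and return the wall-clock seconds of each run."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        # check=True raises on a non-zero exit code, so a crashed build is
        # never mistaken for a fast one.
        subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return timings


def summarize(results):
    """Given {build_label: [seconds, ...]}, return {build_label: best_seconds}."""
    return {label: min(times) for label, times in results.items()}


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate build is than the baseline build."""
    return baseline_seconds / candidate_seconds
```

Run the same benchmark command once per build (stock compiler, upgraded GCC, ICC if available), keep the best time per build, and compare each against the stock-compiler baseline with `speedup`.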

07: Call To Action - Bigger Relion CryoEM Benchmark Sets
● The most prevalent/popular Relion benchmark uses a ~50GB input data set
● Everyone appears to be using it; especially vendors trying to sell you stuff
○ Small enough to fit in RAM and hit the caching effect of almost all storage systems
○ We (BioTeam) do not believe this is a realistic test in 2019
■ … for anything other than getting compiler and CPU/GPU optimizations correct
● Seeking multi-terabyte CryoEM data organized for Relion 2D or 3D classification
○ We think the community needs MUCH larger benchmarking resources and data sets
○ We will happily host, share & re-distribute
○ We will publish our own results testing against this data
○ Contact chris@bioteam.net
How many scientists do you know with CryoEM experimental data sets that are less than 60GB in size?

08: Machine Learning & Training Data - Awesome and Ugly
● Proper ML/AI requires lots of training data
● Need “training” & “validation” sets
● The data engineering work is non-trivial
● Metadata is essential; bad data will sink you
● Competitors with “better” data will beat you
The race to acquire, generate or license
training data has the potential to be both
awesome and ugly
● Significant opportunities for both
innovation and abuse
Innovation Example
● We are starting to see organizations doing
really interesting things to acquire the
training data they need
● An Example:
Publish/Host useful tools on the cloud
○ Users get access to sophisticated analysis
resources they do not have locally
○ Opt-in data sharing process generates ...
○ 30,000 de-identified MRI scans per week

Topic: Lean Times & Resource Scarcity

"Unit cost of storage is decreasing but not as fast as data production
is increasing. Our computing costs grow ~10%/year while budget
grows at ~3% so we've had to cut [research] mission to preserve
essential capability "
-- Scientific leader @ nationally recognized institution
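
The arithmetic behind that quote can be sketched in a few lines. The ~10% and ~3% growth rates come from the quote; the equal starting point and annual compounding are assumptions made for illustration, and the function names are hypothetical.

```python
def project(cost_growth, budget_growth, years, start=1.0):
    """Yield (year, cost, budget), assuming both start equal and compound annually."""
    cost = budget = start
    for year in range(years + 1):
        yield year, cost, budget
        cost *= 1.0 + cost_growth
        budget *= 1.0 + budget_growth


def years_until_ratio(cost_growth, budget_growth, ratio, max_years=100):
    """First whole year in which cost exceeds `ratio` times budget, or None."""
    for year, cost, budget in project(cost_growth, budget_growth, max_years):
        if cost > ratio * budget:
            return year
    return None


# With the rates from the quote, cost passes 1.5x budget in year 7.
print(years_until_ratio(0.10, 0.03, 1.5))
```

The gap compounds: each year the cost-to-budget ratio grows by a factor of 1.10 / 1.03, which is why a seven-point difference in growth rates forces cuts within a few budget cycles.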

Lean Times: Prior Talks
● Cheaper to repeat experiment than store the data over full lifecycle
● Unit cost of storage out of sync with ease of data generation
● Petabytes of open access data easily available; & valid reasons to use it
● IT knows you haven’t touched that data in years
Also:
● Deleting raw and derived scientific data is OK
● Performing data triage is OK
● … as long as data deletion decisions are made by scientists, not IT

Lean Times: Today
● Data management still a source of existential dread for Bio-IT
● Core problem has seeped beyond “it is easier to acquire vast piles of
scientific data than it is to sensibly and safely store it over time”
● Today we see single scientists asking research questions that can totally
consume a leadership-class supercomputer or system like ANTON-2
For biotech/pharma this means our researchers can easily swamp any
system of any size or capability we can reasonably deliver. That is … not
sustainable.

Lean Times: What this could mean in coming years
● We stop half-assing governance in discovery-oriented Bio-IT?
● Scientific computing orgs tighten scope, scale & supported services
● HPC resource allocation explicitly under control of scientific leadership
○ Remember it is *never* appropriate for IT to make these types of decisions
● What about moonshots and open-ended research?
○ Maybe we adopt DOE/NSF national lab model and hand out
internal credits, grants or allocations for researchers to “spend”
however they see fit ...

Lean Times: Effective Operation Principles
Required:
● Governance driven by Science (not IT groups) becomes essential
● Honest & transparent operational cost data spanning cloud/on-prem
● Full transparency of usage and resource allocation metrics
● Good logging of scientific tools and codes being invoked
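
For the last requirement, a minimal sketch of invocation logging: a wrapper that runs a tool and appends one JSON record per run. This is an illustration under assumptions, not a recommended product; `logged_run`, the record fields and the JSON-lines log format are all hypothetical choices. Many sites get similar data from scheduler accounting or environment-module hooks instead.

```python
import json
import os
import subprocess
import time


def logged_run(argv, log_path):
    """Run argv, then append one JSON line describing the invocation."""
    started = time.time()
    completed = subprocess.run(argv)
    record = {
        # Fall back to "unknown" when the USER variable is not set.
        "user": os.environ.get("USER", "unknown"),
        "argv": list(argv),
        "started": started,
        "wall_seconds": time.time() - started,
        "exit_code": completed.returncode,
    }
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")
    return completed.returncode
```

One record per line keeps the log easy to aggregate later into the usage and allocation metrics listed above.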

Lean Times: It’s not all doom and gloom
Talking to people who have lived this before:
● Forced hard examination of bespoke/custom/standalone systems (silos)
● Helped push for internal agreement and alignment re: adopting
common platforms, APIs and shared services/sysadmin operations
● “Made us think hard about how to run technological operations in a
different way”

Topic: Silicon Matters Again

Silicon Matters Again: Then & Now
Prior Talks / Younger Me
● “Compute is commodity”
● “Intel x86 rules the world”
● “GPU usage starting to differentiate between
visualization and MD/Chemistry”
Today’s Talk / Older, Larger & Wiser
● Ahh crap …
● CPUs, GPUs, FPGAs and custom silicon
are back on the table again and it’s getting
messy
Bottom Line:
● Significantly more benchmark & eval work
● Developer preference vs. Cost/ROI analysis
● GIANT EXCEPTION
○ Serverless folk don’t care

Silicon Matters Again:
CPU
● AMD is back with EPYC
● … it’s benchmarking time again
GPUs
● Increasingly complicated landscape
● Needed for: VDI, Viz, MD/Chem/Structure,
ML/AI and CryoEM
● Pain points
○ Need different products (VDI vs. Science)
○ Need various GPU memory configs
○ Need various #s of GPUs per chassis
○ NVLINK - when, where & how much?
○ Will cloud have them when you need them?
TPUs, FPGAs & Custom Silicon
● Many trying to differentiate in ML/AI space
via custom devices
● Clouds now have proprietary accelerators
● More benchmarking required
● SDK/Framework decisions required
● Deeper engagement with IT required

Topic: Facility & Infrastructure

Facility & Infrastructure: General Observations
● Yes, you still have to do hybrid vs. cloud vs. on-prem analysis & math
● Economics still favor on-prem or colo for 24x7 scientific workloads
○ When other capability or business requirements don’t supersede cost concerns
● Why?
○ Cloud-based on-demand elastic computing is easy and well understood
○ Serverless is effing transformative; both for capability and cost, but ...
○ … Persistent, accessible petascale cloud storage is still expensive month over month
○ At petascale egress fees start to matter
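
The "analysis & math" can start as simple as the sketch below. Every price is a parameter because real rates vary by provider, tier and contract; the function names are hypothetical and nothing here is a quote from any vendor.

```python
def monthly_storage_cost(capacity_tb, price_per_gb_month):
    """Monthly bill for keeping capacity_tb online (1 TB = 1000 GB)."""
    return capacity_tb * 1000 * price_per_gb_month


def egress_cost(egress_tb, price_per_gb):
    """Cost of moving egress_tb out of the cloud."""
    return egress_tb * 1000 * price_per_gb


def annual_cloud_cost(capacity_tb, price_per_gb_month,
                      egress_tb_per_month, egress_price_per_gb):
    """Twelve months of persistent storage plus recurring egress."""
    monthly = (monthly_storage_cost(capacity_tb, price_per_gb_month)
               + egress_cost(egress_tb_per_month, egress_price_per_gb))
    return 12 * monthly


def breakeven_months(on_prem_capex, on_prem_monthly_opex, cloud_monthly):
    """Months until cumulative cloud spend passes on-prem spend, or None."""
    if cloud_monthly <= on_prem_monthly_opex:
        return None  # cloud never costs more under these inputs
    return on_prem_capex / (cloud_monthly - on_prem_monthly_opex)
```

Plug in your own negotiated rates and honest on-prem operating costs. The point is that the storage term recurs every month whether or not anyone touches the data.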

Definite Trend: Colocation Suites & Cabinets
● Seeing this actively in 2019
○ I’ve got an active on-prem to colo (Markley Group) project right now
● Sign of the times:
○ Steve Lister from Novartis is now CTO for HPC & Data Analytics @ Markley Group
● Drivers
○ Cost of new-build or upgrades to on-premise facilities
○ Poor cloud economics for certain 24x7 workloads and use cases
○ Growth, merger & consolidation activities
○ Colocation is the new “Network Hub” for
■ Offsite backup and data continuity efforts
■ Flexible aggregation of cloud connectivity (single cloud or multi-cloud)
■ Speciality links to partners, collaborators and high speed research networks (Internet2)
■ Bespoke IaaS, PaaS, SaaS offerings from colo operators

Story: “Innovation Space” Horror Show

Dumbest Thing I Saw in 2018
● Where: Boston, Massachusetts
● What: Shiny new top tier incubator/innovation space for life science startups
● Wow: Office, lab space, managed stockroom/chem service, etc.
● WTF:
○ No IT/infrastructure space for tenants. At all.
■ Big Data? Exotic instruments? Data intensive science? Eff you.
■ Don’t want to place a tower server deskside or in the wet lab? Eff you.
■ Shared internet circuit & firewall (logical tenant isolation & traffic shaping though)
○ Any changes require 3-party negotiations (Space Operator, Floor leaseholder, Building owners)

I’m not kidding - Dumbest thing I saw in 2018
● Brand new incubator space targeting life science startups in the middle of Boston
● … did a new facility build with the assumption that biotech/pharma startups need
nothing but laptops, 1Gig network drops and a bit of cloud + managed IT services
● Any physical IT infrastructure not owned by the space operator or managed IT
service provider has to live under-desk or inside the wet lab space
Yes. A subset of agile, fast-moving startups need nothing more than internet and a set of cloud-based wifi &
domain controllers. Building a new facility that caters only to these shops means you’ve guzzled a bit too
much telecom vendor and cloud marketing (or you fell asleep on a pile of “CIO Magazine”)

Topic: Org Charts & Scientific Support

Support: Data Intensive Life Science Is A Different Beast
Other HPC / Supercomputing :
● Modest set of dominant, well profiled & well
optimized domain-specific codes
● May have large user base or very extreme
HPC needs but the domain and application
landscape is approachable
In life science HPC …
● 5 - 5000 users
● 600+ applications spanning 10+ domains
○ Molecular Dynamics, Fluid Dynamics,
Structural Biology, Chemistry, Genomics,
Bioinformatics, Medical/Clinical, Optical
Imaging, EM Imaging, Sensor/IoT, etc. etc.
○ Each of these breaks down into specialities
typed by species, disease, organ, pathway etc.
etc.

I hate to bust out the “... but life science is SPECIAL and UNIQUE …” take
But … If you survey commodity supercomputing and capability
supercomputing environments at both small and “national lab” scale you will
see stark differences
Domain and workload diversity (and crap code) are our distinguishing
characteristics

Not a trend because org charts vary
wildly by mission & org but ...
● Domain expertise needs to spread to the
edge of the org while Scientific Computing
groups retain and grow the expertise that
spans groups/projects/orgs and domains
Domain-Specific Expertise:
● Embedded within the group, lab, program or
R&D organization
Cross-Domain/Cross-Org Expertise:
● Science Gateways, Portals, Middleware & APIs
● User & Workflow Optimization
● High Value Application Optimization
● Data Engineering
● Data Science & Analytics
● Data Visualization
● CUDA / ML / AI Expertise
● Training
Broadly useful cross-org capabilities get
consolidated within Scientific Computing; Exotic
domain expertise moves to stakeholder teams.

Model We Like: Service Oriented Delivery
● Large scientific computing shops
reorganize around scientific use
cases and end-user requirements
● … not on technological expertise
● Great way to blow away traditional IT
silos
● “Team of teams” approach to service
delivery

Topic: Storage

Storage Landscape: Prior Talks
● Everyone needs peta-capable storage
● { insert scary growth of storage graph } OMG OMG OMG
● In tough times it is OK to favor storage capacity over performance
● Scale-out NAS is best storage platform for most
● Parallel file systems only when the workload requires them, due to higher ops
burden
● Object Storage is the future of scientific data at rest
● It is easier to generate/acquire vast piles of data than it is to safely and
sensibly store and manage it over a full lifecycle - this is a big problem
Legacy Dag

Storage Landscape: Today [1]
Major fundamental changes
● The capacity|performance calculus has swung the other way
● We now need very fast storage to handle machine learning, AI and
image-based workflow requirements
● ML training & validation requires persistent access to the “old” data so
we still need massive storage capacity
● Dominant file type at the moment is image-based, no longer genomes
Current
Dag

Storage Landscape: Today [2]
Contributing Factors
● Lots of deployed storage is nearing EOL or end of support contract
● Some really interesting next-gen storage companies have launched
● Parallel storage is a lot more attractive w/ performance as key driver
● Operational benefits of scale-out NAS slightly less valuable in context
Current
Dag

Storage Landscape: Data As Currency
● Your organization has a big problem when the default stance among
leadership or users is “all our data is important” - way worse than “we
don’t know how to figure out what is important …”
● Not understanding the true value of data leads to hoarding, massive
inefficiencies and inability to properly leverage the data at hand
● Data management, scientifically-relevant metadata, tracking the use,
derivative uses, and amount of repeated uses of data could totally
change how we approach scientific data storage
● It's about the data, not the storage platform
Current
Dag
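
A crude first step toward knowing what data is actually used is to report files nobody has touched in a long time. The sketch below assumes a POSIX-style filesystem and uses the later of access and modification time as "last touched"; `find_stale` is a hypothetical name. Access times are unreliable on filesystems mounted with `noatime`, so treat the output as a conversation starter for scientists, not a deletion list.

```python
import os
import time


def find_stale(root, max_idle_days, now=None):
    """Return (path, size_bytes, idle_days) for files idle longer than max_idle_days."""
    now = time.time() if now is None else now
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.stat(path, follow_symlinks=False)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            # Use the most recent of access and modification time.
            last_touched = max(info.st_atime, info.st_mtime)
            idle_days = (now - last_touched) / 86400
            if idle_days > max_idle_days:
                stale.append((path, info.st_size, idle_days))
    return stale
```

Summing `size_bytes` per lab or project turns this into a report that scientific leadership, not IT, can act on.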

Storage Tiering & Namespaces: Prior Talks
● Single namespace storage is really important
● If we don’t give users a single view of storage we end up with:
○ Multiple islands of data
○ Scientists store the same data in N different locations
○ Nasty data location and data provenance issues
● Seamless tiering within the namespace is desirable if possible
Legacy Dag

Storage Tiering & Namespaces: Today [1]
● We've done a bad job at encouraging active data handling
● Data is currency; IT training focuses on "spend wisely" not "manage
effectively"
● This is a multi-partner, multi-platform, multi-cloud world
● Global data protection methods hedge against disaster but not
personal/group/lab/publication needs
● Still inappropriate for IT to make data classification decisions
Current
Dag

Current
Dag
My biggest attitude change:
IT attempts to make seamless namespaces and automatic tiering
generally have failed to meet expectations; also hard to do
efficiently or without researcher input anyway
We need to place data management responsibilities back onto
the end-users*
Users whining about having to move/manage data when their
career and publications are based on “data intensive science” will
no longer be coddled. SCIENTIFIC DATA IS YOUR JOB.

* Some Exceptions:
● There ARE data management tasks that are a waste of time and skill
for highly trained scientists
● Biggest example: large-scale physical data ingest and export - scientists
should not be dealing with portable hard drives beyond a certain scale
● Large-scale physical data movement needs written SOPs and a
process owned by IT
Current
Dag

New social contract between IT and Users
IT Provides:
● Storage that meets business and scientific requirements
○ Including scratch, active, nearline, archive and object
○ Durable, available and reliable
● Metrics, monitoring, reporting and tools that enable user self-service
End User Responsibilities:
● Users responsible for scientific data management through full lifecycle
○ Including classifying, curating, organizing and moving it
Current
Dag

Storage: What this all means
● The new requirement for speed + capacity is deeply scary
● Image workloads and ML/AI mean we can’t trade away performance in
exchange for larger capacity any more
● Enterprise IT has more justification to transition platforms:
○ Conservative shops can buy the faster flash-powered levels of Scale-out NAS
○ Conservative shops can go IBM Spectrum Scale (managed GPFS)
○ Forward-looking shops will bring in new platforms and vendors
○ BeeGFS, Ceph & Lustre will find new audiences
● I’m cool with tiers, namespaces and making end-users more responsible
Current
Dag

Storage: Interesting Players
Metadata, Discovery, Data Protection
● Starfish Storage, https://starfishstorage.com/
● Atavium, https://www.atavium.com/
● Arcitecta, https://www.arcitecta.com/
● Igneous, https://www.igneous.io/
Next-Gen / Flash Storage Architectures
● VAST Data, https://www.vastdata.com/
● WekaIO, https://www.weka.io/
● Pure Storage, https://www.purestorage.com/
Current
Dag

Storage: Interesting Players, Continued
Data Movement
● Globus, https://www.globus.org/
● Datadobi, https://datadobi.com/
● Zettar, https://www.zettar.com/
Current
Dag

Topic: Networking

Networking: Still the #1 hassle but little change since 2018
Still the #1 IT infrastructure problem in data intensive life science
● Still have trouble moving scientific data at scale across networks
● We still lag in deploying 40-gig and 100-gig networking
● Enterprise IT still focusing on datacenter rather than edge & lab
● We still need to separate business network traffic from science data
traffic using Science DMZ design patterns
● Our connections to the Internet and Cloud are still too small
● Our firewalls and security controls are still designed for business
traffic and not monster “elephant” flows
● Biggest new thing was Nvidia purchasing Mellanox!
Past &
current!
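
Why link speed matters at this scale is easy to show with back-of-envelope arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes) and a single sustained flow; the `efficiency` parameter is a hypothetical fudge factor for protocol overhead, firewalls and contention, and `transfer_days` is an illustrative name.

```python
def transfer_days(data_tb, link_gbps, efficiency=1.0):
    """Days needed to move data_tb terabytes over a link_gbps link."""
    bits = data_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 86400


# One petabyte (1000 TB) at full line rate:
print(round(transfer_days(1000, 10), 1))   # 9.3 days on 10-gig
print(round(transfer_days(1000, 100), 1))  # 0.9 days on 100-gig
```

Those figures are best case. A firewall tuned for business traffic can push real-world efficiency well below 1.0, which is the argument for Science DMZ patterns.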

Topic: Cloud

Cloud: Meta issues still the same but some changes ...
Past &
current!
Consistent message for 10 years now
● Cloud is a capability play for
life science research organizations
● Saving money is not the primary driver*

* About that “not a cost saving thing” message …
● Serverless Computing is transformational for capability
● Serverless Computing is transformational for cost
Read this:
https://rise.cs.berkeley.edu/blog/a-berkeley-view-on-serverless-computing/
Search engine shortcut: “berkeley view on serverless 2019”
● Primary caveat is that discovery oriented science still relies heavily on
interactive human efforts with bespoke tooling. A large chunk of our Bio-IT
landscape cannot be codified into APIs & service mesh architectures

● Microsoft acquisition of Cycle Computing is really starting to become
apparent on Azure Cloud - lots of interesting HPC and storage
offerings
● Cloud efforts to build bespoke accelerated hardware for AI/ML and
inference are of some concern. What used to be a simple cost or
capability eval now will require deep IT interaction with end-users to
learn their preferences and needs for SDKs, frameworks and tooling
● Scarcity of GPU resources on AWS has been a consistent trend across
multiple projects. We can’t get them at all, let alone within a placement
group!

Wrapping Up

Recap - Bottom Line 2019 Summary
1. Unit cost of storage vs. consumption rate
will force hard choices and new governance
2. Data discovery, management, curation and
movement are still major concerns
3. Storage selection pendulum has moved in a
big way. We now have to be BIG and FAST.
This will have a major impact
4. Responsibility for scientific data
management must lie with the end user and
not IT
5. Compilers, toolchains and silicon matter
again; it’s time to resurrect the benchmark
and eval crew
6. Science users can now swamp systems of any
scale with valid research questions; expect
governance and service scope constraints to
become more prevalent
7. Colo Facilities are being used more often
8. Life Science stands apart in the HPC and
supercomputing worlds for the sheer size
and diversity of our domains and workloads

Crowdsourcing thanks!
Sincere thanks to the folk who responded online
with comments and suggestions.
Including:
● Philippe Neron
● Matthew Trunnell
● Glenn Lockwood
● Tim Cutts
● Nick Weber
● Tom Bolton
● Dirk Petersen
● Gregg TeHennepe
● Eduardo Zaborowski
● Remy Evard
● Tom Plasterer
● Joe Stanganelli
● Jason Tetrault
2020 is the 10-year BioIT World
anniversary! The conference organizers are
very interested in what you’d like to see
and hear to make next year very special.

End; Thanks!;

Portrait commissioned from the artist who did the
illustrations for the “Heroines of JavaScript Trading
Cards”.
Want your own?
https://twitter.com/mirlu_exe
