Big Data Standards - Workshop, ExpBio, Boston, 2015
1. Big Data Standards: how to set the bar?
Susanna-Assunta Sansone, PhD
@biosharing
@isatools
Experimental Biology, Big Data Workshop, 28 March, 2015
Data Consultant,
Honorary Academic Editor
Associate Director,
Principal Investigator
http://www.slideshare.net/SusannaSansone
4. Is open data understandable, reusable?
“Reproducing the method took several
months of effort, and required using new
versions and new software that posed
challenges to reconstructing and validating
the results”
5. Is open data understandable, reusable?
Not always…but why?
• Outputs are multi-dimensional, diverse, not always well cited / stored
• Software, code, workflows etc. are hard(er) to get hold of
• Data often distributed and fragmented to fit (siloed) databases
o Often lacks enough information for others to understand it
• Uneven level of details and annotation across different databases
o Specialized, generalist, public and institutional
• Data curation activities are perceived as time consuming
o Collection and harmonization of detailed methods and experimental
steps is done/rushed at publication stage
7. Responsibilities lie across several stakeholder groups
Understand the benefits of sharing
FAIR datasets and enact them
Engage and assist researchers to enable
them to share FAIR datasets
Release or endorse practices
and policies, but also incentive
and credit mechanisms for
researchers, curators and
developers
10. • We need to report sufficient
information to reuse the dataset
• We must strike a balance between
depth and breadth of information
Without context data is meaningless
13. …how not to report the experimental information!
• LS1 = liver sample 1
• C2 = compound 2
• LD = low dose
• TP2 = time point 2
• P1 = protocol 1
• file1.gz = compressed data file with phenotypic and other information on this sample
Sample name (?!)     Data file
LS1_C2_LD_TP2_P1     file1.gz
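A name like LS1_C2_LD_TP2_P1 is only readable by someone who already holds the legend. As a minimal sketch of making that annotation explicit, the Python below expands the name into labelled parts. The code table is copied from the legend on this slide; the function name `decode_sample_name` and the `code`/`meaning` keys are made up for illustration.

```python
# Legend taken from the slide; in practice it is often not written down at all.
LEGEND = {
    "LS1": "liver sample 1",
    "C2": "compound 2",
    "LD": "low dose",
    "TP2": "time point 2",
    "P1": "protocol 1",
}


def decode_sample_name(name, legend=LEGEND):
    """Split an underscore-delimited sample name and look up each code.

    Unknown codes raise a KeyError rather than being silently guessed.
    """
    return [{"code": code, "meaning": legend[code]} for code in name.split("_")]


if __name__ == "__main__":
    for entry in decode_sample_name("LS1_C2_LD_TP2_P1"):
        print(f"{entry['code']:>4}  {entry['meaning']}")
```

The sketch only works because the legend is available as data; without it, neither a person nor a program can recover the experimental context from the file name.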
14. The International Conference on Systems Biology (ICSB), 22-28 August, 2008 Susanna-Assunta
Sansone www.ebi.ac.uk/net-project
• To make any dataset ‘FAIR’, one must have standards, tools and best practices to:
§ report sufficient details
§ capture all salient features of the experimental workflow
§ make annotation explicit and discoverable
§ structure the descriptions for consistency
§ ensure/regulate access
§ deposit and publish
§ etc.
15. …breadth and depth of the experimental context
…is pivotal
…and has to be both human and machine readable
16. nature.com/scientificdata
A new category of publication that provides detailed descriptors of scientifically valuable
datasets. They are a highly effective link between traditional research articles and data repositories
Introducing the
Data Descriptor
18. Experimental metadata or structured component (in-house curated, machine-readable format)
Article or narrative component (PDF and HTML)
Data Description narrative and structured components
19. A curated, structured component - why?
• Supplements the scientific discourse
o natural language has a degree of ambiguity
• Brings clarity in reporting research methods and procedures
o no trimming, no cooking
o clear samples to data files links and relation to methods
• Provides the basis for search and discovery features
[Figure: several SciData Data Descriptors (DD), each with structured content, linked to one another by shared metadata (same tissue, same organism, same assay) and to community data repositories]
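Once each descriptor carries a structured component, links such as "same tissue", "same organism" or "same assay" become simple lookups. The following is a rough Python sketch of that idea only. The records, their IDs and their values are invented for illustration and do not come from any real Data Descriptor.

```python
from collections import defaultdict

# Hypothetical structured components; IDs and values are invented for illustration.
DESCRIPTORS = [
    {"id": "DD-1", "organism": "Mus musculus", "tissue": "liver",
     "assay": "transcription profiling"},
    {"id": "DD-2", "organism": "Mus musculus", "tissue": "liver",
     "assay": "metabolite profiling"},
    {"id": "DD-3", "organism": "Homo sapiens", "tissue": "blood",
     "assay": "transcription profiling"},
]


def group_by(descriptors, field):
    """Index descriptor IDs by the value of one structured field."""
    index = defaultdict(list)
    for record in descriptors:
        index[record[field]].append(record["id"])
    return dict(index)


if __name__ == "__main__":
    for field in ("tissue", "organism", "assay"):
        print(field, group_by(DESCRIPTORS, field))
```

The same grouping is not possible on the narrative component alone, because free text may describe the same tissue or assay in many different ways.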
20. Seven week old C57BL/6N mice were treated
with low-fat diet.
Liver was dissected out, hepatocytes prepared…
From natural language to ‘computable’ concepts
Data Curation Editor
Responsible for creating the structured
component, ensuring that the most
appropriate metadata is being captured.
22. Age value
Unit
Strain name
Subject of the experiment
Type of diet and
experimental condition
Anatomy part
Seven week old C57BL/6N mice were treated
with low-fat diet.
Liver was dissected out, hepatocytes prepared …
From natural language to ‘computable’ concepts
Type of protocol – cell preparation
Type of protocol - sample treatment
Type of protocol – liver preparation
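As an illustration of what the "computable" version of that sentence might look like, here is a small Python sketch that writes the annotation as a nested record and serializes it to JSON. The field names are hypothetical, and the pairing of each label with a word in the sentence is one reading of the slide, not an official schema. A real structured component would typically point to ontology terms rather than plain strings.

```python
import json

# One possible structured rendering of the sentence on the slide.
# Field names are illustrative; the label-to-word pairing is an assumption.
sample_annotation = {
    "subject": "mice",
    "strain": "C57BL/6N",
    "age": {"value": 7, "unit": "week"},
    "diet": "low-fat diet",
    "anatomy_part": "liver",
    "protocols": [
        {"type": "sample treatment", "description": "treated with low-fat diet"},
        {"type": "liver preparation", "description": "liver was dissected out"},
        {"type": "cell preparation", "description": "hepatocytes prepared"},
    ],
}


def to_json(annotation):
    """Serialize the annotation so that software can read it back unchanged."""
    return json.dumps(annotation, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(to_json(sample_annotation))
```

Splitting the age into a value and a unit, for example, is what lets a query such as "all mice younger than ten weeks" run without parsing prose.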
23. Including minimum
information reporting
requirements, or
checklists to report the
same core, essential
information
Including controlled
vocabularies, taxonomies,
thesauri, ontologies etc. to
use the same word and
refer to the same ‘thing’
Including conceptual
model, conceptual
schema from which an
exchange format is derived
to allow data to flow from
one system to another
Community-developed content standards
To structure and enrich the description of datasets, facilitating
understanding, sharing and reuse
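To make two of these standard types concrete, the Python sketch below checks a record against a minimum-information checklist and a controlled vocabulary. Both the checklist (`REQUIRED_FIELDS`) and the vocabulary (`CONTROLLED_TERMS`) are hypothetical stand-ins invented for this example; real community standards are far richer.

```python
# Hypothetical minimum-information checklist and controlled vocabulary.
REQUIRED_FIELDS = ("organism", "tissue", "assay")
CONTROLLED_TERMS = {
    "organism": {"Mus musculus", "Homo sapiens"},
    "tissue": {"liver", "blood"},
}


def validate(record):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    # Checklist: the same core, essential information must be present.
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            problems.append(f"missing required field: {field}")
    # Vocabulary: the same word must be used for the same 'thing'.
    for field, allowed in CONTROLLED_TERMS.items():
        value = record.get(field)
        if value and value not in allowed:
            problems.append(f"{field}: '{value}' is not a controlled term")
    return problems


if __name__ == "__main__":
    print(validate({"organism": "mouse", "tissue": "liver"}))
```

A record that passes both checks can then be written to an agreed exchange format, which is the third type of standard on the slide.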
24. de jure de facto
grass-roots
groups
standard
organizations
Community mobilization, some examples
• Structural and operational differences
§ organization types (open, closed to members, society, WG etc.)
§ standards development (how to formulate, conduct and maintain)
§ adoption, uptake, outreach (link to journals, funders and commercial sector)
§ funds (sponsors, memberships, grants, volunteering)
26. A web-based, curated and searchable registry ensuring that
standards are registered, informative and discoverable; monitoring their
development and evolution and their use in databases,
and the adoption of both in data policies.
Launched Jan 2011
27. Core functionalities:
• search and filtering, e.g. by
funder
• submission forms to add
new records
• “claim” functionality for
existing records
• person’s profile (as
maintainer of records)
associated with the ORCID
profile (for credit, as
incentive)
• visualization and views of
content
Search, filter, claim, view and more
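The functions listed above can be pictured with a toy in-memory registry. This Python sketch is purely illustrative and is not how the actual registry is implemented: the class names, fields, record names, funders and the all-zero ORCID placeholder are all assumptions made up for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Record:
    name: str
    kind: str  # e.g. "standard" or "database"
    funders: List[str] = field(default_factory=list)
    maintainer_orcid: Optional[str] = None  # set when someone claims the record


class Registry:
    def __init__(self):
        self.records = []

    def submit(self, record):
        """Add a new record (the 'submission form')."""
        self.records.append(record)

    def search(self, text="", funder=None):
        """Case-insensitive name search, optionally filtered by funder."""
        text = text.lower()
        return [
            r for r in self.records
            if text in r.name.lower() and (funder is None or funder in r.funders)
        ]

    def claim(self, name, orcid):
        """Attach a maintainer's ORCID to an existing, unclaimed record."""
        for r in self.records:
            if r.name == name:
                if r.maintainer_orcid is not None:
                    raise ValueError(f"{name} is already claimed")
                r.maintainer_orcid = orcid
                return r
        raise KeyError(name)


if __name__ == "__main__":
    reg = Registry()
    reg.submit(Record("Example checklist", "standard", ["Funder A"]))
    reg.submit(Record("Example database", "database", ["Funder B"]))
    reg.claim("Example checklist", "0000-0000-0000-0000")  # placeholder ORCID
    print(reg.search(funder="Funder A"))
```

Tying a claimed record to an ORCID is what turns maintenance work into something that can be credited to a named person.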
29. Advisory Board and Working Group - core members and adopters
Operational Team
30. Standards as an area of research - still a lot to do! E.g.:
1. Create relation or “usage maps and guides”, e.g. the relationship among popular standard formats for pathway information (Demir, et al., The BioPAX community standard for pathway data sharing, Nat Biotech. 2010)
2. Metrics of maturity, usability and popularity
3. Embed in the ecosystem of complementary registries
31. [Figure: technologically-delineated views of the world (arrays, scanning, columns, gels, MS, FTIR, NMR) set against biologically-delineated views of the world (transcriptomics, proteomics, metabolomics, plant biology, epidemiology, microbiology)]
Generic features (common core):
- description of source biomaterial
- experimental design components
To compare and integrate data we need interoperable standards
How do we address fragmentation, duplications and gaps?
36. • Most researchers understand the value of standardized descriptions when using third-party datasets
• But when asked to structure their datasets, they view requests for even “minimal” information as burdensome
→ There is an urgent need to lower the bar for authoring good metadata
Researchers hate standards!