How open data contribute to improving the world. The life science use case. The technical, social, ethical issues.
This talk was given within the iGEM 2020 programme by the Imperial College London student team (https://2020.igem.org/Team:Imperial_College), in a webinar organised by the SOAPLab group on the topic of Ethics of Automation. The other speaker of the day was the excellent Dr Brandon Sepulvado.
2. Hello!
• Geek since the 1980s and the C=64 days
• Started working with life science data in 2003
• at Univ. of Milano-Bicocca, EMBL-EBI
• and now Rothamsted Research
• Meanwhile, (h)activism in open source and open data
3. A Long History
Mankind and Data
• Gather knowledge
• Know how things work, make predictions
• Improve our lives
• (in addition to being good in itself)
Egypt, 2500BC (https://brewminate.com/census-taking-in-the-ancient-world/)
4. In the Past 20 Years or So
Economist, 2010
(https://www.economist.com/node/21521548)
8. The Cause for Open Data/Knowledge
• Data portals, policies, standards
• https://www.data.gov/, https://data.gov.uk/
• https://www.europeandataportal.eu/en
• https://ec.europa.eu/digital-single-market/en/european-legislation-reuse-public-sector-information
• https://joinup.ec.europa.eu/
• In science
• https://fairsharing.org/
• https://www.nature.com/sdata/
• Data and activism
• DBPedia, aka Wikipedia as data (https://wiki.dbpedia.org/about)
• Wikidata (https://www.wikidata.org/)
• Open Street Map (https://www.openstreetmap.org/about)
9. Open Data Cause: The Life Science Use Case
https://evaprofecmc.jimdofree.com/unit-4-the-genetic-revolution/2-2-chromosomes-and-genes/
10. So, sequencing was (and is) pretty important...
Source: https://boydfuturist.wordpress.com/tag/human-genome-project/
(also an interesting read)
11. ...indeed
Recommended:
• The race to sequence the human genome
https://www.youtube.com/watch?v=AhsIF-cmoQQ
• The Human Genome Project Race
https://genomics-old.soe.ucsc.edu/research/hgp_race
• How to sequence human genome
https://www.youtube.com/watch?v=MvuYATh7Y74
15. The Cause for Open Data: Practical Reasons
• Allows for reuse
• no need to regenerate
• less expensive
• Allows for integration between heterogeneous data
• different entities (genes, proteins, chemistry, species, literature...)
• different scales (cells, organs, individuals, populations)
• New discoveries, novel uses
• Reproducible science
• and quality improvement
16. The Cause for Open Data: Ethical Reasons
• Publicly funded data are ours
• Savings opportunities add up
• (but giving them out for free has a cost)
• Data are ours anyway (eg, genetic data)
• Transparency (and again, reproducibility)
• Public benefits outweigh private interests
17. But, how?
Based on publications, which genes are related to yellow
rust? In which biological processes are their encoded
proteins involved?
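A question like this is, in essence, a path query over a knowledge graph that links publications, genes, proteins and biological processes. Below is a minimal sketch in plain Python. The entity names (PMID:1, GENE_A, PROT_A...), the relation names and the tiny in-memory triple list are all made up for illustration; this is not KnetMiner data, nor its API.

```python
from collections import defaultdict

# Hypothetical (subject, relation, object) triples: placeholders, not real data.
TRIPLES = [
    ("PMID:1", "mentions", "yellow rust"),
    ("PMID:1", "mentions", "GENE_A"),
    ("PMID:2", "mentions", "yellow rust"),
    ("PMID:2", "mentions", "GENE_B"),
    ("PMID:3", "mentions", "GENE_C"),
    ("GENE_A", "encodes", "PROT_A"),
    ("GENE_B", "encodes", "PROT_B"),
    ("GENE_C", "encodes", "PROT_C"),
    ("PROT_A", "participates_in", "defence response"),
    ("PROT_B", "participates_in", "signal transduction"),
    ("PROT_C", "participates_in", "photosynthesis"),
]


def index_by_relation(triples):
    """Build relation -> subject -> set of objects."""
    idx = defaultdict(lambda: defaultdict(set))
    for subj, rel, obj in triples:
        idx[rel][subj].add(obj)
    return idx


def genes_and_processes(triples, trait):
    """Follow publication -> gene -> protein -> process for papers mentioning the trait."""
    idx = index_by_relation(triples)
    result = {}
    for paper, mentioned in idx["mentions"].items():
        if trait not in mentioned:
            continue  # this paper does not talk about the trait
        for entity in mentioned:
            if entity not in idx["encodes"]:
                continue  # not a gene in this toy graph
            processes = set()
            for protein in idx["encodes"][entity]:
                processes |= idx["participates_in"].get(protein, set())
            result.setdefault(entity, set()).update(processes)
    return result


answer = genes_and_processes(TRIPLES, "yellow rust")
for gene in sorted(answer):
    print(gene, "->", sorted(answer[gene]))
```

In a real setting the same traversal would typically be written in a graph query language (eg, SPARQL or Cypher) against a shared, standardised dataset, which is where the interoperability discussed next comes in.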
18. Good Data Principles: Interoperability through Standards
https://tinyurl.com/y5e6kfa2
https://doi.org/10.1186/s41074-019-0055-1
https://tinyurl.com/y3h9c65k
https://tinyurl.com/y2wzlwbk
19. Data Standards: schema.org example
https://www.bbcgoodfood.com/recipes/classic-potato-salad
Source & recommended read: https://www.slideshare.net/NiallBeard/bioschemas-workshop
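To make the recipe example concrete, here is a minimal sketch of how a web page can describe a recipe with schema.org terms, serialised as JSON-LD. `Recipe`, `recipeIngredient`, `recipeInstructions` and `HowToStep` are schema.org terms; the values are placeholders and are not taken from the BBC Good Food page, and the helper `to_script_tag` is a hypothetical name used only here.

```python
import json

# Placeholder recipe described with schema.org vocabulary.
recipe = {
    "@context": "https://schema.org",
    "@type": "Recipe",
    "name": "Example potato salad",
    "recipeIngredient": ["potatoes", "mayonnaise", "chives"],
    "recipeInstructions": [
        {"@type": "HowToStep", "text": "Boil the potatoes until tender."},
        {"@type": "HowToStep", "text": "Mix with the other ingredients."},
    ],
}


def to_script_tag(doc):
    """Wrap a JSON-LD document in the script element used to embed it in HTML."""
    body = json.dumps(doc, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


html_snippet = to_script_tag(recipe)
print(html_snippet)
```

Because the structure is standard, a search engine or any other consumer can read the ingredients and steps without scraping the page layout.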
20. schema.org used for KnetMiner and Agrifood Data
github.com/Rothamsted/agri-schemas https://tinyurl.com/y44a5lj9
21. References
• Brandizi et al, 2018, https://europepmc.org/article/med/30085931
• IB2018 presentation https://tinyurl.com/yaq8nt5e
• AgriSchemas and data standards, IB 2019
• Reusing KnetMiner data with Python/Jupyter
• https://tinyurl.com/yyhnkuyk
• https://tinyurl.com/y446y979
22. Good Data Principles: FAIR
• Findable
• Eg, give your dataset a DOI that resolves to a schema.org descriptor, and register it on datasetsearch.research.google.com
• Accessible
• Eg, a resolvable DOI makes it accessible; wrap it with access control as needed
• Interoperable
• Eg, data described with schema.org, GO and other OBO ontologies
• Query protocols/standards (eg, SPARQL, GraphQL APIs, JSON Schema APIs, JSON-LD APIs)
• Reusable
• Clear licence
• Ideally, a machine-readable licence (eg, CCREL)
Source and recommended read: https://tinyurl.com/yxocd3b9
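As an illustration of the points above, here is a sketch of a schema.org `Dataset` descriptor and a rough check of which FAIR-related fields it carries. The DOI, URLs and dataset name are placeholders. The mapping from FAIR letters to fields (`FAIR_HINTS`) is an informal assumption made for this example, not an official FAIR assessment.

```python
import json

# Placeholder descriptor: the DOI and URLs below do not point to a real dataset.
dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example wheat gene annotations",
    "description": "Placeholder description of a hypothetical dataset.",
    "identifier": "https://doi.org/10.1234/example",
    "url": "https://example.org/datasets/wheat-genes",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "distribution": {
        "@type": "DataDownload",
        "contentUrl": "https://example.org/datasets/wheat-genes.csv",
        "encodingFormat": "text/csv",
    },
}

# Informal, assumed mapping from FAIR principles to descriptor fields.
FAIR_HINTS = {
    "Findable": ["identifier", "name", "description"],
    "Accessible": ["url", "distribution"],
    "Interoperable": ["@context", "@type"],
    "Reusable": ["license"],
}


def missing_fields(doc):
    """For each principle, list the hinted fields that are absent or empty."""
    return {
        principle: [field for field in fields if not doc.get(field)]
        for principle, fields in FAIR_HINTS.items()
    }


report = missing_fields(dataset)
print(json.dumps(report, indent=2))
```

A descriptor without a `license` field would be reported as lacking under "Reusable", which mirrors the "clear licence" point above.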
23. Issues: Easier Said than Done
https://tinyurl.com/yxsftwvy
https://xkcd.com/927/
24. Issues: Common Good vs Private Interests
• "...Parts of the standard that are not priorities for Google are not well documented anywhere. If they are priorities for Google, however, Google itself provides excellent documentation about how information should be specified in schema.org so that Google can use it. Because schema.org’s documentation is poor, the focus of attention stays on Google."
(Time to end Google’s domination of schema.org, https://tinyurl.com/y6j7ke8u)
• Not everyone wants data published, eg, failed clinical trials
• A balance is needed between research needs and private lives, eg,
• The Immortal Life of Henrietta Lacks, Rebecca Skloot
• k-anonymity, mediation approaches
(Brandizi et al, 2017, https://doi.org/10.1186/s12911-017-0424-6)
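To give an idea of what k-anonymity means in practice: a table is k-anonymous if every combination of quasi-identifier values (attributes that could be combined to re-identify someone) is shared by at least k records. The sketch below computes that k for a toy table. The records and column names are invented, and this is the textbook definition, not necessarily the approach used in the paper cited above.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group of records sharing the same
    quasi-identifier values (0 for an empty table)."""
    if not records:
        return 0
    groups = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return min(groups.values())


# Invented records, for illustration only.
records = [
    {"age_band": "30-39", "postcode": "AB1", "diagnosis": "X"},
    {"age_band": "30-39", "postcode": "AB1", "diagnosis": "Y"},
    {"age_band": "40-49", "postcode": "AB1", "diagnosis": "X"},
    {"age_band": "40-49", "postcode": "AB1", "diagnosis": "Z"},
    {"age_band": "40-49", "postcode": "AB2", "diagnosis": "Y"},
]

k_fine = k_anonymity(records, ["age_band", "postcode"])
k_coarse = k_anonymity(records, ["age_band"])
print("k with age band and postcode:", k_fine)
print("k with age band only:", k_coarse)
```

With both attributes, one person is alone in their group (k = 1) and could be singled out; releasing only the age band raises k to 2. Coarser data protect privacy better but are less useful for research, which is the balance mentioned above.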
25. Issues: Data are Power
http://www.tylervigen.com/spurious-correlations
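A spurious correlation is easy to reproduce: any two quantities that both happen to grow over time will correlate strongly, even if they have nothing to do with each other. The sketch below uses two invented yearly series (the names and numbers are made up for this example) and a hand-written Pearson coefficient.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Invented figures: both series simply grow year after year.
years = list(range(2010, 2018))
ice_cream_sales = [10, 12, 15, 19, 22, 26, 31, 35]
phd_graduates = [100, 104, 111, 115, 124, 130, 133, 142]

r_between = pearson(ice_cream_sales, phd_graduates)
r_time_a = pearson(years, ice_cream_sales)
r_time_b = pearson(years, phd_graduates)
print("ice cream vs PhDs:", round(r_between, 3))
print("each series vs time:", round(r_time_a, 3), round(r_time_b, 3))
```

Both series correlate with each other only because both correlate with time, the common driver. The coefficient says nothing about one causing the other.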
26. Issues: Data are Power
• My son was a typically developing toddler. ... He received his first MMR at 19 months of
age. The change in him was almost immediate. He did not regress in development, but
his social skills became extremely compromised. Noises became unbearable...
MMR vaccine caused my son's autism, https://tinyurl.com/y2udlfcb
It's sad, but this is a spurious correlation: vaccines do not cause autism.
27. Issues: Are We in Control?
https://www.nature.com/articles/d41586-020-01874-9
https://tinyurl.com/yxay8w2j
https://www.bbc.com/news/business-42959755
https://tinyurl.com/ydykjugt
https://tinyurl.com/hu3lh32
29. So...
• The future is even more digital
• And even more data-intensive
• Everyone should at least have an idea
• Especially if you want to become a scientist
• About producing data (eg, FAIR, formats, standards)
• And consuming data (eg, data resources, graph DB query languages)
• And more (eg, Python, Pandas, graph DBs, APIs)
https://tinyurl.com/y5rdq7qx
30. So...
• We probably need better management and (a bit of, international) regulation
• of technical aspects (eg, PA standards, research data publishing)
• of ethical aspects (eg, open access, algorithms, censorship)
• But also more grassroots participation
• we are all responsible, especially as scientists
• Data science is cool!
https://tinyurl.com/y5rdq7qx
31. Acknowledgements
Ajit Singh, Software Engineer
• Joseph Hearnshaw, software engineer
• Samiul Haque, Ed Eyles, IT admins
• Alice Minotto, Earlham Inst, hosting providers
• William Brown, Ricardo Gregorio, IT admins
• Monika Mistry, Master's student, data curator
• Sandeep Amberkar, bioinformatician, data curator
• Madhu Donepudi, Richard Holland, ext contractors, developers
Keywan Hassani-Pak, KnetMiner Team Leader
Chris Rawlings, Head of Computational & Analytical Sciences
Jeremy Parsons, Bioinformatics Scientist
And You!
34. The Cause for Open Data/Knowledge
• Open data is the idea that some data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control (https://en.wikipedia.org/wiki/Open_data)
• Popularised by Obama in 2009 [1], Tim Berners-Lee [2] and Hans Rosling [3] (recommended readings/watches)
• [1] https://www.govtech.com/data/What-Obama-Did-for-Tech-Transparency-and-Open-Data.html
• [2] https://www.ted.com/talks/tim_berners_lee_the_next_web?language=en
• [3] https://www.ted.com/talks/hans_rosling_the_best_stats_you_ve_ever_seen
35. IBM Watson
• Not the first time that an AI outperformed humans at a hard intellectual task (eg, Deep Blue at chess, 1996)
• But a big milestone (in 2011, winning Jeopardy!) for knowledge management
• Specialisations are possible, eg, IBM Watson Health
Mini documentary at
https://www.youtube.com/watch?v=P18EdAKuC1U
36. Surprising Data Insights
• Couples who argue often are more likely to last long (90% accuracy)
• If you want such a life...
• Many other examples of surprising data: 9 Bizarre and Surprising Insights from Data Science (https://tinyurl.com/yywgr2rv)
https://www.businessinsider.com/mathematical-secret-to-lasting-relationships-2015-6
37. Issues: Data are Power
Source and recommended read:
https://theconversation.com/five-maps-that-will-change-how-you-see-the-world-74967