4. Defining 'big data'
Data that is too large or too complex to process
manually or with a desktop computer, because of:
– Number of records
– Size of files
– Mixed formats
– Unstructured data
– Relationships between datasets
5. Defining 'big data' - Gartner
'Volume. Data that have grown to an immense size,
prohibiting analysis with traditional tools
Variety. Multiple formats of structured and
unstructured data—such as social-media posts,
location data from mobile devices, call center
recordings, and sensor updates—that require fresh
approaches to collection, storage, and management
Velocity. Data that need to be processed in real or
near-real time in order to be of greatest value, such as
instantly providing a coupon to customers standing in
the cereal aisle based on their past cereal purchases'
https://www.bcgperspectives.com/content/articles/it_strategy_retail_how_to_get_started_with_big_data/
7. The challenges of scale
• The British Library (BL) holds 180-200 million items, including:
• 8 million stamps
• 310,000 manuscript volumes
• Over 4 million maps
• Legal deposit material including pamphlets,
magazines, newspapers, sheet music and maps
• Television and radio recordings
• Websites, e-books, e-journals
• Over 3 million new items are added every year
• Only 1-2% of collections digitised
8. The impact of scale
My experience at Cooper Hewitt: 20% of my
residency was spent 'dealing with the sheer size of
the dataset: it's tricky to load 60mb worth of
270,000 rows into tools that are limited by:
• the number of rows (Excel),
• rows/columns (Google Docs) or
• size of file (Google Refine, ManyEyes)'
'search-and-replace cleaning takes a long time'
https://labs.cooperhewitt.org/2012/exploring-shape-collections-draft/
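One practical workaround (my sketch, not from the Cooper Hewitt post) is to stream a large CSV in chunks with pandas, so no tool ever has to hold all 270,000 rows at once; the file name and column are hypothetical stand-ins:

```python
# Sketch: summarise a large collection CSV chunk by chunk, sidestepping
# the row and file-size limits of spreadsheet tools. 'collection.csv'
# and the 'title' column are hypothetical stand-ins for a real export.
import pandas as pd
from collections import Counter

row_count = 0
title_counts = Counter()
for chunk in pd.read_csv("collection.csv", chunksize=50_000):
    row_count += len(chunk)
    title_counts.update(chunk["title"].dropna())

print(f"{row_count} rows processed; {len(title_counts)} distinct titles")
```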
9. A splendid assortment of Gceloag
and West of England. Tweed ; also
Black Doeakin Woollen Cloths
alwaya on hand. Snit made to
order in six hoars' notice, on most
reaainable terms. Mr. M'Mohon,
Cutter.
Mysteries of Melbourne life
by Cameron, Donald, 1848?-1888.
Published 1873
Usage Public Domain Mark 1.0
Topics Australia -- Fiction
10. Different data, different uses
• Datasets about our collections - bibliographic
datasets relating to our published and archival
holdings
• Datasets for content mining - content suitable
for use in text and data mining research
• Datasets for image analysis - image collections
suitable for large-scale image-analysis-based
research
• Datasets from UK Web Archive - data and API
services available for accessing UK Web Archive
collections
• Digital mapping - geospatial data, cartographic
applications, digital aerial photography and
scanned-in historic map materials
http://bl.uk/digital
15. Machine learning, artificial intelligence
and big data
Computational techniques that learn from
examples and/or data without being explicitly
programmed in advance
e.g.
• Recruitment - shortlisting CVs against job ads
• Ecommerce - Netflix, Amazon, Spotify
recommendations
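A toy illustration (mine, not from the slides) of 'learning from examples': a scikit-learn classifier is given labelled texts and then predicts a label for text it has never seen, with no hand-written rules:

```python
# Toy machine-learning example: no explicit rules are programmed;
# the model generalises from a handful of labelled examples.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["great fast delivery", "loved it, will buy again",
         "awful slow service", "terrible, never again"]
labels = ["positive", "positive", "negative", "negative"]

vectorizer = CountVectorizer()
features = vectorizer.fit_transform(texts)
model = MultinomialNB().fit(features, labels)

print(model.predict(vectorizer.transform(["slow and terrible"])))
# expected: ['negative']
```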
17. Medical
Personalised treatment plans for cancer patients
• IBM Watson is used by oncologists at Memorial
Sloan-Kettering Cancer Center; its suggestions are
'informed by data from 600,000 medical evidence
reports, 1.5 million patient records and clinical
trials, and two million pages of text from medical
journals'
• Microsoft similarly use machine learning and
natural language processing to sort through
research data
http://news.microsoft.com/stories/computingcancer/
https://www.mskcc.org/blog/msk-trains-ibm-watson-help-doctors-make-better-treatment-choices
http://www.oxfordmartin.ox.ac.uk/publications/view/1883
19. Translation
• New version of Google Translate uses
'recurrent neural networks' to translate
sentences as a whole
https://research.googleblog.com/2016/09/a-neural-network-for-machine.html
23. Planning for big data: stages
• Identify potential sources
• Digitising (unless everything is already
available as digital text/images)
• Collecting (unless everything is already
centralised)
• Reformatting (unless everything is ready to be
loaded into software)
• Storage, backup, software licences
24. Stages: reviewing permissions
Possible issues include:
• terms of use when data collected,
• data protection,
• copyright,
• commercial in confidence,
• proprietary systems,
• other licences
25. Stages: what skills do you need?
• Domain knowledge
• Analytical skills
• Technical skills
26. Stages: cleaning
(unless your data is already consistent)
• These are not the same place (if you're a
computer):
– U.S.
– U.S.A
– U.S.A.
– USA
– United States of America
– United States (plus case variations of each)
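A minimal normalisation sketch (my illustration, not from the slides); the lookup table collapses the variants above into one canonical value and would need extending for real data:

```python
# Map country-name variants to one canonical form so a computer
# treats them as the same place. The table is illustrative only.
VARIANTS = {
    "u.s.": "United States of America",
    "u.s.a": "United States of America",
    "u.s.a.": "United States of America",
    "usa": "United States of America",
    "united states": "United States of America",
    "united states of america": "United States of America",
}

def normalise_place(value: str) -> str:
    # Lower-casing also handles the case variations noted above.
    return VARIANTS.get(value.strip().lower(), value)

assert normalise_place("U.S.A.") == "United States of America"
assert normalise_place("Paris") == "Paris"  # unknown values pass through
```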
29. Stages: cleaning
Challenge: time-consuming
Opportunity: time to get to know the data
e.g. Google Maps only understood museum
records that used 'United Kingdom'; tens of
thousands of records that used Great Britain,
England, Scotland, Wales, Northern Ireland etc.
weren't mapped
30. Stages: cleaning
Some 'fuzziness' is unavoidable.
• Unexpectedly complex objects e.g. 'Begun in
Kiryu, Japan, finished in France'
• Permanent uncertainty e.g. 'Bali? Java?
Mexico?'
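Where exact lookups fail, fuzzy matching can suggest candidates while leaving the uncertainty visible for a human to resolve; a sketch using Python's standard-library difflib (my illustration; the reference list is made up):

```python
# Suggest close matches for a messy place name rather than silently
# overwriting it - 'Bali? Java? Mexico?' should stay a question for
# a curator, not be resolved automatically.
import difflib

KNOWN_PLACES = ["United Kingdom", "Indonesia", "Japan", "France", "Mexico"]

def suggest(value, cutoff=0.6):
    return difflib.get_close_matches(value, KNOWN_PLACES, n=3, cutoff=cutoff)

print(suggest("Untied Kingdm"))  # ['United Kingdom']
print(suggest("Kiryu"))          # likely [] - flag for manual review
```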
33. Stages: verifying
Reality check results
• Are they accurate?
• Could they do anyone any harm?
• Do they under- or over-report any factors?
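One simple reality check (my sketch, with made-up data) is to compare how often each category appears in your results against the source dataset; big gaps flag under- or over-reporting:

```python
# Compare category shares in the analysed subset vs the full dataset;
# large gaps suggest some factor is under- or over-reported.
from collections import Counter

full_dataset = ["map", "map", "book", "book", "book", "stamp"]  # made up
analysed = ["map", "map", "map", "book"]                        # made up

full_counts, subset_counts = Counter(full_dataset), Counter(analysed)
for category in full_counts:
    source_share = full_counts[category] / len(full_dataset)
    result_share = subset_counts.get(category, 0) / len(analysed)
    print(f"{category}: {source_share:.0%} of source, "
          f"{result_share:.0%} of results")
```

Here 'stamp' vanishes from the results entirely - exactly the kind of gap worth explaining before dissemination.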
34. Stages: dissemination
• How can you contextualise, explain any
limitations of your analysis? e.g.
– provenance and qualities of original dataset(s);
– how it was transformed, cleaned to fit into
software;
– how confident you are in matches, results;
– what's left out of the analysis, and why?
38. The ethics of convenience?
• More data is digital
• More data is retained
• More data contains identifiers
It's easier than ever before to make creepy
decisions
39. Question: what ethical issues might
arise with big data in your field? How
can you resolve them?
Volume. Big data uses massive datasets
Variety. Big data often involves bringing together data from different sources, e.g. tweets and sales data
Velocity. In some contexts, it is important to analyse data as quickly as possible, even in real time, e.g. when your bank texts you about a possibly fraudulent transaction
https://www.bcgperspectives.com/content/articles/it_strategy_retail_how_to_get_started_with_big_data/
'Big data' in cultural heritage
What kinds of data are we talking about? At the very least, photographs of pages, which can then be transcribed as text. Institutions can then offer collections of metadata, of text, of images, for reading individually or mining as a dataset. The shift from reading pages to reading a dataset enables entirely new research questions.
If you look at dates and names, you can see that they're sometimes fuzzy and messy - must they be flattened to fit into precise, specific systems?
Image and data: the garbled OCR transcription of this page shows how errors creep in. https://archive.org/details/MysteriesOfMelbourneLife
Messy data - lots of different formats; not everything uses standard vocabularies, so it's hard to be certain exactly which people or entities in the world they refer to
Thousands of UK websites have been collected since 2004. As at 30 November: 239.46 GB of data; 15,112 archived websites; 79,276 'instances', i.e. snapshots
The UK Web Archive is a good example of variety - web pages and sites have multiple elements, and meaning is often contained in links
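Since meaning often sits in the links, a first step in mining archived pages is extracting the link graph; a minimal sketch with Python's standard-library HTMLParser (my illustration, not UK Web Archive code):

```python
# Pull outgoing links from an HTML page - in web-archive research the
# link graph can carry as much meaning as the page text itself.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

parser = LinkExtractor()
parser.feed('<p>See <a href="https://www.bl.uk/">the British Library</a>.</p>')
print(parser.links)  # ['https://www.bl.uk/']
```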
What makes it complex or hard to process?
AKA why do people get excited about it? Examples from different domains.
e.g. document review and assisting in pre-trial research; 'pre-crime' detection, sentencing recommendations
'Symantec's eDiscovery platform is able to perform all tasks "from legal hold and collections through analysis, review, and production", and proved capable of analysing and sorting more than 570,000 documents in two days' Markoff (2011) in http://www.oxfordmartin.ox.ac.uk/downloads/reports/Citi_GPS_Technology_Work.pdf
Memorial Sloan-Kettering Cancer Center [Bassett (2014)] personalise a treatment plan with reference to a given patient's individual symptoms, genetics, family and medication history
http://blogs.bl.uk/digital-scholarship/2016/11/sherlocknet-update-millions-of-tags-and-thousands-of-captions-added-to-the-bl-flickr-images.html
Vs https://www.captionbot.ai/
What do project managers need to know about it?
I've unpacked some of the stages so you can think about what's required at each one. Cleaning is often 80% of the task once you have the data, but getting the data can also take time.
Cleaning is time consuming but means you'll get familiar with the data.
Cleaning could also be called linking, identifying or adding structure.
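'Adding structure' can be as small as pulling machine-readable fields out of free text; a sketch (my illustration) extracting four-digit years from catalogue-style date strings like those on the Mysteries of Melbourne record:

```python
# Extract four-digit years from free-text date fields - one small way
# of adding structure to messy catalogue data.
import re

dates = ["Published 1873", "1848?-1888.", "circa 1900"]
for value in dates:
    years = re.findall(r"\b1[5-9]\d{2}\b", value)
    print(value, "->", years)
# Published 1873 -> ['1873']
# 1848?-1888. -> ['1848', '1888']
# circa 1900 -> ['1900']
```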
Have you ever been creeped out by websites or marketing that seems to know a bit too much about you?
Ethics - discussion - what ethical dilemmas have you encountered in your own work, or heard of in other contexts? Should you use data just because it's now more convenient to do so? Scale and convenience are pushing at ethics.