Planning for big data
Dr. Mia Ridge, @mia_out
Digital Curator, British Library
digitalresearch@bl.uk @BL_DigiSchol
Outline
• What is big data?
• How is it used?
• How do you prepare for working with it?
What is big data?
Defining 'big data'
Data that is too large or too complex to process
manually / with a desktop computer
– Number of records
– Size of files
– Mixed formats
– Unstructured data
– Relationships between datasets
Defining 'big data' - Gartner
'Volume. Data that have grown to an immense size,
prohibiting analysis with traditional tools
Variety. Multiple formats of structured and
unstructured data—such as social-media posts,
location data from mobile devices, call center
recordings, and sensor updates—that require fresh
approaches to collection, storage, and management
Velocity. Data that need to be processed in real or near-
real time in order to be of greatest value, such as
instantly providing a coupon to customers standing in
the cereal aisle based on their past cereal purchases'
https://www.bcgperspectives.com/content/articles/it_strategy_retail_how_to_get_started_with_big_data/
'Big data' in cultural heritage
The challenges of scale
• The BL holds 180-200 million items, including:
• 8 million stamps
• 310,000 manuscript volumes
• Over 4 million maps
• Legal deposit material including pamphlets,
magazines, newspapers, sheet music and maps
• Television and radio recordings
• Websites, e-books, e-journals
• Over 3 million new items are added every year
• Only 1-2% of collections digitised
The impact of scale
My experience at Cooper Hewitt: 20% of my
residency 'dealing with the sheer size of the
dataset: it's tricky to load 60mb worth of 270,000
rows into tools that are limited by:
• the number of rows (Excel),
• rows/columns (Google Docs) or
• size of file (Google Refine, ManyEyes)
'search-and-replace cleaning takes a long time'
https://labs.cooperhewitt.org/2012/exploring-shape-collections-draft/
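One way around row and file-size limits is to stream the file one row at a time instead of opening it in a spreadsheet. The sketch below is a minimal illustration using Python's standard library; the column name `country` and the in-memory sample are assumptions made for the example, not part of the Cooper Hewitt dataset.

```python
import csv
import io
from collections import Counter

def count_values(rows, column):
    """Tally the values in one column, reading one row at a time."""
    counts = Counter()
    for row in rows:
        # Short rows yield None for missing fields, so fall back to "".
        counts[(row.get(column) or "").strip()] += 1
    return counts

# Hypothetical sample standing in for a large export. In practice you would
# pass open("collection.csv", newline="", encoding="utf-8") instead.
sample = io.StringIO(
    "id,country\n"
    "1,USA\n"
    "2,U.S.A.\n"
    "3,France\n"
    "4,USA\n"
)
reader = csv.DictReader(sample)
country_counts = count_values(reader, "country")
print(country_counts.most_common())
```

Because only one row is held in memory at a time, the same function should cope with files far larger than a spreadsheet can open.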
A splendid assortment of Gceloag
and West of England. Tweed ; also
Black Doeakin Woollen Cloths
alwaya on hand. Snit made to
order in six hoars' notice, on most
reaainable terms. Mr. M'Mohon,
Cutter.
Mysteries of Melbourne life
by Cameron, Donald, 1848?-1888.
Published 1873
Usage Public Domain Mark 1.0
Topics Australia -- Fiction
Different data, different uses
• Datasets about our collections: bibliographic datasets relating to our published and archival holdings
• Datasets for content mining: content suitable for use in text and data mining research
• Datasets for image analysis: image collections suitable for large-scale image-analysis-based research
• Datasets from UK Web Archive: data and API services available for accessing UK Web Archive collections
• Digital mapping: geospatial data, cartographic applications, digital aerial photography and scanned-in historic map materials
http://bl.uk/digital
#messy data
http://museum-api.pbworks.com/w/page/21933420/Museum%C2%A0APIs
Question: what kinds of big data are
you interested in working with?
What makes it 'big'?
How is big data used?
Machine learning, artificial intelligence
and big data
Computational techniques that learn from
examples and/or data without being
programmed in advance
e.g.
• Recruitment - shortlisting CVs to job ads
• Ecommerce - Netflix, Amazon, Spotify
recommendations
Legal
https://www.veritas.com/content/dam/Veritas/docs/white-papers/21198622_GA_ENT_WP-Early-Case-Assessment-in-Electronic-Discovery_EN.pdf
Veritas case study, 'Early Case Assessment in Electronic Discovery'
Medical
Personalised treatment plans for cancer patients
• IBM Watson is used by oncologists at Memorial
Sloan-Kettering Cancer Center; its suggestions are
'informed by data from 600,000 medical evidence
reports, 1.5 million patient records and clinical
trials, and two million pages of text from medical
journals'
• Microsoft similarly use machine learning and
natural language processing to sort through
research data
http://news.microsoft.com/stories/computingcancer/
https://www.mskcc.org/blog/msk-trains-ibm-watson-help-doctors-make-better-treatment-choices
http://www.oxfordmartin.ox.ac.uk/publications/view/1883
Politics, finance
http://www.opensecrets.org/resources/learn/anomalies.php
Translation
• New version of Google Translate uses
'recurrent neural networks' to translate
sentences as a whole
https://research.googleblog.com/2016/09/a-neural-network-for-machine.html
Enhancing records: SherlockNet
http://bit.ly/sherlocknet
Question: what kinds of decisions could
you support by analysing big data?
What value would that add?
Working with big data
Planning for big data: stages
• Identify potential sources
• Digitising (unless everything is already
available as digital text/images)
• Collecting (unless everything is already
centralised)
• Reformatting (unless everything is ready to be
loaded into software)
• Storage, backup, software licences
Stages: reviewing permissions
Possible issues include:
• terms of use when data collected,
• data protection,
• copyright,
• commercial in confidence,
• proprietary systems,
• other licences
Stages: what skills do you need?
• Domain knowledge
• Analytical skills
• Technical skills
Stages: cleaning
(unless your data is already consistent)
• These are not the same place (if you're a
computer):
– U.S.
– U.S.A
– U.S.A.
– USA
– United States of America
– United States (case)
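The variants above can be reconciled with a small lookup table. This is a minimal sketch, assuming 'United States' is the preferred form; the `CANONICAL` table and both function names are hypothetical, and a real project would build the table from the values actually found in its data.

```python
import re

# Hypothetical lookup: keys are simplified spellings, values the preferred form.
CANONICAL = {
    "us": "United States",
    "usa": "United States",
    "united states": "United States",
    "united states of america": "United States",
}

def simplify(value):
    """Lower-case the value, drop full stops and collapse whitespace."""
    value = value.lower().replace(".", "")
    return re.sub(r"\s+", " ", value).strip()

def normalise_country(value):
    """Return the preferred form, or the original value if no rule matches."""
    return CANONICAL.get(simplify(value), value)

variants = ["U.S.", "U.S.A", "U.S.A.", "USA",
            "United States of America", "united states"]
normalised = {normalise_country(v) for v in variants}
print(normalised)
```

Values with no matching rule are returned unchanged, so nothing is silently rewritten.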
Stages: cleaning
http://openrefine.org/
...but be careful
Stages: cleaning
Challenge: time-consuming
Opportunity: time to get to know the data
e.g. Google Maps only understood museum
records that used 'United Kingdom'; tens of
thousands of records that used Great Britain,
England, Scotland, Wales, Northern Ireland etc
weren't mapped
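A cautious way to handle this kind of mismatch is to leave the original value alone and add a second field holding the form the mapping tool accepts. The sketch below assumes a hypothetical `country` field and a hand-made lookup table. Treating 'Great Britain' as 'United Kingdom' is a simplification that may be acceptable for plotting points on a map but not for other purposes.

```python
# Hypothetical mapping from values found in the records to the single
# form a mapping tool accepts.
GEOCODER_COUNTRY = {
    "Great Britain": "United Kingdom",
    "England": "United Kingdom",
    "Scotland": "United Kingdom",
    "Wales": "United Kingdom",
    "Northern Ireland": "United Kingdom",
    "United Kingdom": "United Kingdom",
}

def add_mapping_country(record):
    """Return a copy of the record with an extra 'country_for_map' field."""
    enriched = dict(record)
    enriched["country_for_map"] = GEOCODER_COUNTRY.get(record["country"])
    return enriched

records = [
    {"id": 1, "country": "Scotland"},
    {"id": 2, "country": "United Kingdom"},
    {"id": 3, "country": "Bali? Java?"},
]
enriched = [add_mapping_country(r) for r in records]

# Records the lookup could not place are listed for manual review.
unmapped = [r["id"] for r in enriched if r["country_for_map"] is None]
print(unmapped)
```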
Stages: cleaning
Some 'fuzziness' is unavoidable.
• Unexpectedly complex objects e.g. 'Begun in
Kiryu, Japan, finished in France'
• Permanent uncertainty e.g. 'Bali? Java?
Mexico?'
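Uncertainty like this can be recorded rather than cleaned away. The sketch below assumes that a question mark marks a doubtful attribution; the `PlaceValue` structure is hypothetical. It does not attempt to split multi-stage descriptions such as 'Begun in Kiryu, Japan, finished in France', which are kept as a single value.

```python
from dataclasses import dataclass

@dataclass
class PlaceValue:
    original: str      # the text exactly as it appears in the record
    candidates: list   # one or more possible places
    uncertain: bool    # True when the record itself signals doubt

def parse_place(text):
    """Split a '?'-separated list of guesses and flag it as uncertain."""
    uncertain = "?" in text
    if uncertain:
        candidates = [part.strip() for part in text.split("?") if part.strip()]
    else:
        candidates = [text.strip()]
    return PlaceValue(original=text, candidates=candidates, uncertain=uncertain)

parsed = parse_place("Bali? Java? Mexico?")
print(parsed.candidates, parsed.uncertain)
```

Keeping the original string alongside the parsed candidates means the cataloguer's wording is never lost.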
Cleaning: don't forget!
• Versioning
• Documentation
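Versioning and documentation can start as simply as a log entry for each cleaning step. The sketch below records a description of the change and a checksum of the data before and after; the field names and the `log_step` helper are assumptions for illustration, and a version control system would serve the same purpose.

```python
import hashlib
import json
from datetime import datetime, timezone

def checksum(data: bytes) -> str:
    """Return the SHA-256 hex digest of the data."""
    return hashlib.sha256(data).hexdigest()

def log_step(log, description, before: bytes, after: bytes):
    """Append one cleaning step to the log and return the log."""
    log.append({
        "when": datetime.now(timezone.utc).isoformat(),
        "description": description,
        "sha256_before": checksum(before),
        "sha256_after": checksum(after),
    })
    return log

# Hypothetical two-line dataset standing in for a real export.
before = b"id,country\n1,U.S.A.\n"
after = before.replace(b"U.S.A.", b"United States")
cleaning_log = log_step(
    [], "Replaced 'U.S.A.' with 'United States' in country column", before, after
)
print(json.dumps(cleaning_log, indent=2))
```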
Stages: enhancing
http://nlp.stanford.edu:8080/ner/
Stages: verifying
Reality check results
• Are they accurate?
• Could they do anyone any harm?
• Do they under- or over-report any factors?
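One rough check for under- or over-reporting is to compare how often each category appears in the source data with how often it appears in the results. The sketch below is an assumption about how such a check might look, not a method from the talk; the category values are invented and the input lists are assumed to be non-empty.

```python
from collections import Counter

def shares(values):
    """Return each category's share of the total (values must be non-empty)."""
    counts = Counter(values)
    total = sum(counts.values())
    return {key: count / total for key, count in counts.items()}

def representation_gap(source_values, result_values):
    """Share in the results minus share in the source, per source category."""
    src = shares(source_values)
    res = shares(result_values)
    return {key: res.get(key, 0.0) - src[key] for key in src}

# Hypothetical data: ten source records, of which five survived processing.
source = ["France"] * 5 + ["Japan"] * 5
results = ["France"] * 4 + ["Japan"] * 1
gaps = representation_gap(source, results)
print(gaps)
```

A positive gap suggests a category is over-represented in the results, and a negative gap suggests it is under-represented.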
Stages: dissemination
• How can you contextualise and explain any
limitations of your analysis? e.g.
– provenance and qualities of original dataset(s);
– how it was transformed, cleaned to fit into
software;
– how confident you are in matches, results;
– what's left out of the analysis, and why?
The only way is Ethics
ICO: Big data and data protection
https://ico.org.uk/media/for-organisations/documents/1541/big-data-and-data-protection.pdf
The ethics of convenience?
• More data is digital
• More data is retained
• More data contains identifiers
It's easier than ever before to make creepy
decisions
Question: what ethical issues might
arise with big data in your field? How
can you resolve them?
Thank you!
Questions?
Dr. Mia Ridge, @mia_out
Digital Curator, British Library
digitalresearch@bl.uk @BL_DigiSchol

Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
detection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxdetection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptx
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 

Planning for big data (lessons from cultural heritage)

  • 1. Planning for big data Dr. Mia Ridge, @mia_out Digital Curator, British Library digitalresearch@bl.uk @BL_DigiSchol
  • 2. Outline • What is big data? • How is it used? • How do you prepare for working with it?
  • 3. What is big data?
  • 4. Defining 'big data' Data that is too large or too complex to process manually / with a desktop computer – Number of records – Size of files – Mixed formats – Unstructured data – Relationships between datasets
  • 5. Defining 'big data' - Gartner 'Volume. Data that have grown to an immense size, prohibiting analysis with traditional tools Variety. Multiple formats of structured and unstructured data—such as social-media posts, location data from mobile devices, call center recordings, and sensor updates—that require fresh approaches to collection, storage, and management Velocity. Data that need to be processed in real or near-real time in order to be of greatest value, such as instantly providing a coupon to customers standing in the cereal aisle based on their past cereal purchases' https://www.bcgperspectives.com/content/articles/it_strategy_retail_how_to_get_started_with_big_data/
  • 6. 'Big data' in cultural heritage
  • 7. The challenges of scale • The BL holds 180-200 million items, including: • 8 million stamps • 310,000 manuscript volumes • Over 4 million maps • Legal deposit material including pamphlets, magazines, newspapers, sheet music and maps • Television and radio recordings • Websites, e-books, e-journals • Over 3 million new items are added every year • Only 1-2% of collections digitised
  • 8. The impact of scale My experience at Cooper Hewitt: 20% of my residency 'dealing with the sheer size of the dataset: it's tricky to load 60mb worth of 270,000 rows into tools that are limited by: • the number of rows (Excel), • rows/columns (Google Docs) or • size of file (Google Refine, ManyEyes) 'search-and-replace cleaning takes a long time' https://labs.cooperhewitt.org/2012/exploring-shape-collections-draft/
  • 9. A splendid assortment of Gceloag and West of England. Tweed ; also Black Doeakin Woollen Cloths alwaya on hand. Snit made to order in six hoars' notice, on most reaainable terms. Mr. M'Mohon, Cutter. Mysteries of Melbourne life by Cameron, Donald, 1848?-1888. Published 1873 Usage Public Domain Mark 1.0 Topics Australia -- Fiction
  • 10. Different data, different uses Datasets about our collections Bibliographic datasets relating to our published and archival holdings Datasets for content mining Content suitable for use in text and data mining research Datasets for image analysis Image collections suitable for large-scale image-analysis-based research Datasets from UK Web Archive Data and API services available for accessing UK Web Archive collections Digital mapping Geospatial data, cartographic applications, digital aerial photography and scanned-in historic map materials http://bl.uk/digital
  • 13. Question: what kinds of big data are you interested in working with? What makes it 'big'?
  • 14. How is big data used?
  • 15. Machine learning, artificial intelligence and big data Computational techniques that learn from examples and/or data without being programmed in advance e.g. • Recruitment - shortlisting CVs to job ads • Ecommerce - Netflix, Amazon, Spotify recommendations
  • 17. Medical Personalised treatment plans for cancer patients • IBM Watson is used by oncologists at Memorial Sloan-Kettering Cancer Center; its suggestions are 'informed by data from 600,000 medical evidence reports, 1.5 million patient records and clinical trials, and two million pages of text from medical journals' • Microsoft similarly use machine learning and natural language processing to sort through research data http://news.microsoft.com/stories/computingcancer/ https://www.mskcc.org/blog/msk-trains-ibm-watson-help-doctors-make-better-treatment-choices http://www.oxfordmartin.ox.ac.uk/publications/view/1883
  • 19. Translation • New version of Google Translate uses 'recurrent neural networks' to translate sentences as a whole https://research.googleblog.com/2016/09/a-neural-network-for-machine.html
  • 21. Question: what kinds of decisions could you support by analysing big data? What value would that add?
  • 23. Planning for big data: stages • Identify potential sources • Digitising (unless everything is already available as digital text/images) • Collecting (unless everything is already centralised) • Reformatting (unless everything is ready to be loaded into software) • Storage, backup, software licences
  • 24. Stages: reviewing permissions Possible issues include: • terms of use when data collected, • data protection, • copyright, • commercial in confidence, • proprietary systems, • other licences
  • 25. Stages: what skills do you need? • Domain knowledge • Analytical skills • Technical skills
  • 26. Stages: cleaning (unless your data is already consistent) • These are not the same place (if you're a computer): – U.S. – U.S.A – U.S.A. – USA – United States of America – United States (case)
  • 29. Stages: cleaning Challenge: time-consuming Opportunity: time to get to know the data e.g. Google Maps only understood museum records that used 'United Kingdom'; tens of thousands of records that used Great Britain, England, Scotland, Wales, Northern Ireland etc weren't mapped
  • 30. Stages: cleaning Some 'fuzziness' is unavoidable. • Unexpectedly complex objects e.g. 'Begun in Kiryu, Japan, finished in France' • Permanent uncertainty e.g. 'Bali? Java? Mexico?'
  • 31. Cleaning: don't forget! • Versioning • Documentation
  • 33. Stages: verifying Reality check results • Are they accurate? • Could they do anyone any harm? • Do they under- or over-report any factors?
  • 34. Stages: dissemination • How can you contextualise, explain any limitations of your analysis? e.g. – provenance and qualities of original dataset(s); – how it was transformed, cleaned to fit into software; – how confident you are in matches, results; – what's left out of the analysis, and why?
  • 35. The only way is Ethics
  • 36. ICO: Big data and data protection https://ico.org.uk/media/for-organisations/documents/1541/big-data-and-data-protection.pdf
  • 37. ICO: Big data and data protection
  • 38. The ethics of convenience? • More data is digital • More data is retained • More data contains identifiers It's easier than ever before to make creepy decisions
  • 39. Question: what ethical issues might arise with big data in your field? How can you resolve them?
  • 40. Thank you! Questions? Dr. Mia Ridge, @mia_out Digital Curator, British Library digitalresearch@bl.uk @BL_DigiSchol
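
The cleaning problem on slides 26 and 29 (many spellings of one country, and records a mapping tool will not recognise) can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the talk: the lookup table `CANONICAL` and the function names are hypothetical, and a real project would build the table from the values actually found in the data.

```python
import re

# Hypothetical lookup table: normalised key -> preferred label.
# The entries are illustrative, taken from the examples on slides 26 and 29.
CANONICAL = {
    "us": "United States",
    "usa": "United States",
    "united states": "United States",
    "united states of america": "United States",
    "uk": "United Kingdom",
    "united kingdom": "United Kingdom",
    "great britain": "United Kingdom",
    "england": "United Kingdom",
    "scotland": "United Kingdom",
    "wales": "United Kingdom",
    "northern ireland": "United Kingdom",
}


def normalise_key(raw):
    """Lower-case the value, drop full stops and collapse whitespace."""
    key = raw.strip().lower().replace(".", "")
    return re.sub(r"\s+", " ", key)


def clean_country(raw):
    """Return (label, matched); unrecognised values come back unchanged."""
    label = CANONICAL.get(normalise_key(raw))
    if label is None:
        return raw, False
    return label, True


variants = ["U.S.", "U.S.A", "U.S.A.", "USA",
            "United States of America", "united states"]
cleaned = [clean_country(v) for v in variants]
for original, (label, matched) in zip(variants, cleaned):
    print(f"{original!r} -> {label!r} (matched: {matched})")
```

Values the table does not know, such as 'Bali? Java? Mexico?', are returned untouched and flagged rather than guessed at. Mapping 'Scotland' to 'United Kingdom' discards detail, so it is safer to write the cleaned label to a new column and keep the original value beside it.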

Editor's Notes

  1. Some thoughts based on my experience
  2. Some thoughts based on my experience
  3. Volume: big data uses massive datasets. Variety: big data often involves bringing together data from different sources, e.g. tweets and sales data. Velocity: in some contexts it is important to analyse data as quickly as possible, even in real time, e.g. when your bank texts you about a possible fraudulent transaction. https://www.bcgperspectives.com/content/articles/it_strategy_retail_how_to_get_started_with_big_data/
  4. e.g. stuff historians might want to look at
  5. Image: The storage void of the new British Library National Newspaper Building at Boston Spa in West Yorkshire. Photo © Kippa Matthews
  6. 'Big data' in cultural heritage: what kinds of data are we talking about? At the very least, providing photographs of pages, which can then be transcribed as text. Can then offer collections of metadata, of text, of images, for reading individually or mining as a dataset. A shift from reading pages to reading a dataset enables entirely new research questions. If you look at dates and names, you can see that the data is sometimes fuzzy and messy; must it be flattened to fit into precise, specific systems? Image, data. https://archive.org/details/MysteriesOfMelbourneLife
  7. Messy data - lots of different formats, and not everything uses standard vocabs, so it's hard to be certain exactly who or what entities in the world they mean
  8. Thousands of UK websites have been collected since 2004. As at 30 Nov: 239.46GB; number of archived websites: 15,112; 79,276 'instances', i.e. snapshots. The UK Web Archive is a good example of variety - web pages / sites have multiple elements, and meaning is often contained in links
  9. What makes it complex or hard to process?
  10. AKA why do people get excited about it? Examples from different domains.
  11. http://www.oxfordmartin.ox.ac.uk/downloads/reports/Citi_GPS_Technology_Work.pdf
  12. e.g. document review and to assist in pre-trial research; pre-crime detection, sentencing recommendations 'Symantec's eDiscovery platform is able to perform all tasks "from legal hold and collections through analysis, review, and production", and proved capable of analysing and sorting more than 570,000 documents in two days' Markoff (2011) in http://www.oxfordmartin.ox.ac.uk/downloads/reports/Citi_GPS_Technology_Work.pdf
  13. Memorial Sloan-Kettering Cancer Center [Bassett (2014)] personalise a treatment plan with reference to a given patient's individual symptoms, genetics, family and medication history
  14. http://www.opensecrets.org/resources/learn/anomalies.php
  15. https://research.googleblog.com/2016/09/a-neural-network-for-machine.html
  16. http://blogs.bl.uk/digital-scholarship/2016/11/sherlocknet-update-millions-of-tags-and-thousands-of-captions-added-to-the-bl-flickr-images.html Vs https://www.captionbot.ai/
  17. What do project managers need to know about it?
  18. I've unpacked some of the stages so you can think about what's required at each stage. Cleaning is often 80% of the task once you have the data, but getting the data can also take time.
  19. Cleaning is time consuming but means you'll get familiar with the data.
  20. Could also be called linking, identifying or adding structure
  21. https://ico.org.uk/media/for-organisations/documents/1541/big-data-and-data-protection.pdf
  22. Have you ever been creeped out by websites or marketing that seems to know a bit too much about you?
  23. Ethics - discussion - what ethical dilemmas have you encountered in your own work, or heard of in other contexts? Should you use data just because it's now more convenient? Scale and convenience pushing at ethics.
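
Slide 8 describes a 270,000-row export that was awkward to open in spreadsheet tools, and slide 29 treats cleaning as a chance to get to know the data. One way to do both, sketched here under assumptions, is to read the file one row at a time and tally the values in a single column. The column name `country` and the sample rows are invented for illustration, and the function `count_values` is a hypothetical helper.

```python
import csv
import io
from collections import Counter


def count_values(lines, column):
    """Tally the values in one column, reading one row at a time."""
    counts = Counter()
    for row in csv.DictReader(lines):
        # Missing or empty cells are counted under the empty string.
        counts[(row.get(column) or "").strip()] += 1
    return counts


# Tiny in-memory stand-in for a large export. With a real file, pass
# open(path, newline="", encoding="utf-8") instead of the StringIO object.
sample = io.StringIO(
    "id,country\n"
    "1,USA\n"
    "2,U.S.A.\n"
    "3,United Kingdom\n"
    "4,USA\n"
    "5,\n"
)

counts = count_values(sample, "country")
for value, n in counts.most_common():
    print(repr(value), n)
```

Because only the tally is held in memory, the file size matters much less than it does in a spreadsheet, and the resulting list of distinct values is a useful starting point for deciding what needs cleaning.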