4. Defining 'big data'
Data that is too large or too complex to process
manually or with a desktop computer, because of:
– Number of records
– Size of files
– Mixed formats
– Unstructured data
– Relationships between datasets
5. Defining 'big data' - Gartner
'Volume. Data that have grown to an immense size,
prohibiting analysis with traditional tools
Variety. Multiple formats of structured and
unstructured data—such as social-media posts,
location data from mobile devices, call center
recordings, and sensor updates—that require fresh
approaches to collection, storage, and management
Velocity. Data that need to be processed in real or
near-real time in order to be of greatest value, such as
instantly providing a coupon to customers standing in
the cereal aisle based on their past cereal purchases'
https://www.bcgperspectives.com/content/articles/it_strategy_retail_how_to_get_started_with_big_data/
7. The challenges of scale
• The British Library (BL) holds 180-200 million items, including:
• 8 million stamps
• 310,000 manuscript volumes
• Over 4 million maps
• Legal deposit material including pamphlets,
magazines, newspapers, sheet music and maps
• Television and radio recordings
• Websites, e-books, e-journals
• Over 3 million new items are added every year
• Only 1-2% of collections digitised
8. The impact of scale
My experience at Cooper Hewitt: 20% of my
residency was spent 'dealing with the sheer size of
the dataset: it's tricky to load 60mb worth of
270,000 rows into tools that are limited by:
• the number of rows (Excel),
• rows/columns (Google Docs) or
• size of file (Google Refine, ManyEyes)'
'search-and-replace cleaning takes a long time'
https://labs.cooperhewitt.org/2012/exploring-shape-collections-draft/
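One practical workaround (my sketch, not from the Cooper Hewitt post) is to stream a large CSV in chunks with pandas, so no tool ever has to hold all 270,000 rows at once; the file name and column are hypothetical stand-ins:

```python
# Sketch: summarise a large collection CSV chunk by chunk, sidestepping
# the row and file-size limits of spreadsheet tools. 'collection.csv'
# and the 'title' column are hypothetical stand-ins for a real export.
import pandas as pd
from collections import Counter

row_count = 0
title_counts = Counter()
for chunk in pd.read_csv("collection.csv", chunksize=50_000):
    row_count += len(chunk)
    title_counts.update(chunk["title"].dropna())

print(f"{row_count} rows processed; {len(title_counts)} distinct titles")
```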
9. A splendid assortment of Gceloag
and West of England. Tweed ; also
Black Doeakin Woollen Cloths
alwaya on hand. Snit made to
order in six hoars' notice, on most
reaainable terms. Mr. M'Mohon,
Cutter.
Mysteries of Melbourne life
by Cameron, Donald, 1848?-1888.
Published 1873
Usage Public Domain Mark 1.0
Topics Australia -- Fiction
10. Different data, different uses
• Datasets about our collections - bibliographic
datasets relating to our published and archival
holdings
• Datasets for content mining - content suitable
for use in text and data mining research
• Datasets for image analysis - image collections
suitable for large-scale image-analysis-based
research
• Datasets from UK Web Archive - data and API
services available for accessing UK Web Archive
collections
• Digital mapping - geospatial data, cartographic
applications, digital aerial photography and
scanned-in historic map materials
http://bl.uk/digital
15. Machine learning, artificial intelligence
and big data
Computational techniques that learn from
examples and/or data without being explicitly
programmed in advance
e.g.
• Recruitment - shortlisting CVs against job ads
• Ecommerce - Netflix, Amazon, Spotify
recommendations
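A toy illustration (mine, not from the slides) of 'learning from examples': a scikit-learn classifier is given labelled texts and then predicts a label for text it has never seen, with no hand-written rules:

```python
# Toy machine-learning example: no explicit rules are programmed;
# the model generalises from a handful of labelled examples.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["great fast delivery", "loved it, will buy again",
         "awful slow service", "terrible, never again"]
labels = ["positive", "positive", "negative", "negative"]

vectorizer = CountVectorizer()
features = vectorizer.fit_transform(texts)
model = MultinomialNB().fit(features, labels)

print(model.predict(vectorizer.transform(["slow and terrible"])))
# expected: ['negative']
```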
17. Medical
Personalised treatment plans for cancer patients
• IBM Watson is used by oncologists at Memorial
Sloan-Kettering Cancer Center; its suggestions are
'informed by data from 600,000 medical evidence
reports, 1.5 million patient records and clinical
trials, and two million pages of text from medical
journals'
• Microsoft similarly use machine learning and
natural language processing to sort through
research data
http://news.microsoft.com/stories/computingcancer/
https://www.mskcc.org/blog/msk-trains-ibm-watson-help-doctors-make-better-treatment-choices
http://www.oxfordmartin.ox.ac.uk/publications/view/1883
19. Translation
• New version of Google Translate uses
'recurrent neural networks' to translate
sentences as a whole
https://research.googleblog.com/2016/09/a-neural-network-for-machine.html
23. Planning for big data: stages
• Identify potential sources
• Digitising (unless everything is already
available as digital text/images)
• Collecting (unless everything is already
centralised)
• Reformatting (unless everything is ready to be
loaded into software)
• Storage, backup, software licences
24. Stages: reviewing permissions
Possible issues include:
• terms of use when data collected,
• data protection,
• copyright,
• commercial in confidence,
• proprietary systems,
• other licences
25. Stages: what skills do you need?
• Domain knowledge
• Analytical skills
• Technical skills
26. Stages: cleaning
(unless your data is already consistent)
• These are not the same place (if you're a
computer):
– U.S.
– U.S.A
– U.S.A.
– USA
– United States of America
– United States (plus case variations of each)
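A minimal normalisation sketch (my illustration, not from the slides); the lookup table collapses the variants above into one canonical value and would need extending for real data:

```python
# Map country-name variants to one canonical form so a computer
# treats them as the same place. The table is illustrative only.
VARIANTS = {
    "u.s.": "United States of America",
    "u.s.a": "United States of America",
    "u.s.a.": "United States of America",
    "usa": "United States of America",
    "united states": "United States of America",
    "united states of america": "United States of America",
}

def normalise_place(value: str) -> str:
    # Lower-casing also handles the case variations noted above.
    return VARIANTS.get(value.strip().lower(), value)

assert normalise_place("U.S.A.") == "United States of America"
assert normalise_place("Paris") == "Paris"  # unknown values pass through
```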
29. Stages: cleaning
Challenge: time-consuming
Opportunity: time to get to know the data
e.g. Google Maps only understood museum
records that used 'United Kingdom'; tens of
thousands of records that used Great Britain,
England, Scotland, Wales, Northern Ireland etc.
weren't mapped
30. Stages: cleaning
Some 'fuzziness' is unavoidable.
• Unexpectedly complex objects e.g. 'Begun in
Kiryu, Japan, finished in France'
• Permanent uncertainty e.g. 'Bali? Java?
Mexico?'
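Where exact lookups fail, fuzzy matching can suggest candidates while leaving the uncertainty visible for a human to resolve; a sketch using Python's standard-library difflib (my illustration; the reference list is made up):

```python
# Suggest close matches for a messy place name rather than silently
# overwriting it - 'Bali? Java? Mexico?' should stay a question for
# a curator, not be resolved automatically.
import difflib

KNOWN_PLACES = ["United Kingdom", "Indonesia", "Japan", "France", "Mexico"]

def suggest(value, cutoff=0.6):
    return difflib.get_close_matches(value, KNOWN_PLACES, n=3, cutoff=cutoff)

print(suggest("Untied Kingdm"))  # ['United Kingdom']
print(suggest("Kiryu"))          # likely [] - flag for manual review
```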
33. Stages: verifying
Reality check results
• Are they accurate?
• Could they do anyone any harm?
• Do they under- or over-report any factors?
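One simple reality check (my sketch, with made-up data) is to compare how often each category appears in your results against the source dataset; big gaps flag under- or over-reporting:

```python
# Compare category shares in the analysed subset vs the full dataset;
# large gaps suggest some factor is under- or over-reported.
from collections import Counter

full_dataset = ["map", "map", "book", "book", "book", "stamp"]  # made up
analysed = ["map", "map", "map", "book"]                        # made up

full_counts, subset_counts = Counter(full_dataset), Counter(analysed)
for category in full_counts:
    source_share = full_counts[category] / len(full_dataset)
    result_share = subset_counts.get(category, 0) / len(analysed)
    print(f"{category}: {source_share:.0%} of source, "
          f"{result_share:.0%} of results")
```

Here 'stamp' vanishes from the results entirely - exactly the kind of gap worth explaining before dissemination.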
34. Stages: dissemination
• How can you contextualise, explain any
limitations of your analysis? e.g.
– provenance and qualities of original dataset(s);
– how it was transformed, cleaned to fit into
software;
– how confident you are in matches, results;
– what's left out of the analysis, and why?
38. The ethics of convenience?
• More data is digital
• More data is retained
• More data contains identifiers
It's easier than ever before to make creepy
decisions
39. Question: what ethical issues might
arise with big data in your field? How
can you resolve them?
Volume. Big data uses massive datasets
Variety. Big data often involves bringing together data from different sources, e.g. tweets and sales data
Velocity. In some contexts, it is important to analyse data as quickly as possible, even in real time, e.g. when your bank texts you about a possibly fraudulent transaction
https://www.bcgperspectives.com/content/articles/it_strategy_retail_how_to_get_started_with_big_data/
'Big data' in cultural heritage
What kinds of data are we talking about? At the very least, photographs of pages, which can then be transcribed as text. Institutions can then offer collections of metadata, of text, of images, for reading individually or mining as a dataset. The shift from reading pages to reading a dataset enables entirely new research questions.
If you look at dates and names, you can see that they're sometimes fuzzy and messy - must they be flattened to fit into precise, specific systems?
Image and data: the garbled OCR transcription of this page shows how errors creep in. https://archive.org/details/MysteriesOfMelbourneLife
Messy data - lots of different formats; not everything uses standard vocabularies, so it's hard to be certain exactly which people or entities in the world they refer to
Thousands of UK websites have been collected since 2004. As at 30 November: 239.46 GB of data; 15,112 archived websites; 79,276 'instances', i.e. snapshots
The UK Web Archive is a good example of variety - web pages and sites have multiple elements, and meaning is often contained in links
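Since meaning often sits in the links, a first step in mining archived pages is extracting the link graph; a minimal sketch with Python's standard-library HTMLParser (my illustration, not UK Web Archive code):

```python
# Pull outgoing links from an HTML page - in web-archive research the
# link graph can carry as much meaning as the page text itself.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

parser = LinkExtractor()
parser.feed('<p>See <a href="https://www.bl.uk/">the British Library</a>.</p>')
print(parser.links)  # ['https://www.bl.uk/']
```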
What makes it complex or hard to process?
AKA why do people get excited about it? Examples from different domains.
e.g. document review and assisting in pre-trial research; 'pre-crime' detection, sentencing recommendations
'Symantec's eDiscovery platform is able to perform all tasks "from legal hold and collections through analysis, review, and production", and proved capable of analysing and sorting more than 570,000 documents in two days' Markoff (2011) in http://www.oxfordmartin.ox.ac.uk/downloads/reports/Citi_GPS_Technology_Work.pdf
Memorial Sloan-Kettering Cancer Center [Bassett (2014)] personalise a treatment plan with reference to a given patient's individual symptoms, genetics, family and medication history
http://blogs.bl.uk/digital-scholarship/2016/11/sherlocknet-update-millions-of-tags-and-thousands-of-captions-added-to-the-bl-flickr-images.html
Vs https://www.captionbot.ai/
What do project managers need to know about it?
I've unpacked some of the stages so you can think about what's required at each one. Cleaning is often 80% of the task once you have the data, but getting the data can also take time.
Cleaning is time consuming but means you'll get familiar with the data.
Cleaning could also be called linking, identifying or adding structure.
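'Adding structure' can be as small as pulling machine-readable fields out of free text; a sketch (my illustration) extracting four-digit years from catalogue-style date strings like those on the Mysteries of Melbourne record:

```python
# Extract four-digit years from free-text date fields - one small way
# of adding structure to messy catalogue data.
import re

dates = ["Published 1873", "1848?-1888.", "circa 1900"]
for value in dates:
    years = re.findall(r"\b1[5-9]\d{2}\b", value)
    print(value, "->", years)
# Published 1873 -> ['1873']
# 1848?-1888. -> ['1848', '1888']
# circa 1900 -> ['1900']
```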
Have you ever been creeped out by websites or marketing that seems to know a bit too much about you?
Ethics - discussion - what ethical dilemmas have you encountered in your own work, or heard of in other contexts? Should you use data just because it's now more convenient to do so? Scale and convenience are pushing at ethics.