Presentation given at the 'Open Science Infrastructures for Big Cultural Data' - Advanced International Masterclass in Plovdiv, Bulgaria. Dec. 13-15, 2018
Research Data Management in GLAM: Managing Data for Cultural Heritage
1. Research Data Management in ‘GLAM’:
Managing Data for Cultural Heritage
Sarah A. Stewart, The British Library
@Biostew
‘Open Science Infrastructures for Big Cultural Data’ Masterclass,
Dec. 13-15th, Plovdiv, Bulgaria
2. www.bl.uk
Outline
Thank you for coming today!
• Introduction and Challenges: Data in Cultural Heritage
• Research Data Management ‘In a nutshell’: Key Concepts
• Software as Data
• PIDs for RDM – DataCite and DOIs
• RDM at the British Library – Developing Infrastructure and
Service around Data
• Conclusion and Questions?
4. www.bl.uk
Digital Transformations in ‘GLAM’:
The ‘Inside-Out’ Museum
• GLAM institutions are ‘everting’ their collections and
research to the (open) web – ‘Collections without Walls’
• Dynamic, changing research landscape – development
of new tools and techniques for digital research
• New infrastructures to support digital collections,
research and scholarship
• Changing materiality of research – from ‘analog’ to digital
• Greater role for data and metadata
• Research Data Management will play a crucial role!
7. “Challenges” (Opportunities?)
• Research is digital, are we?
• Are we still needed for
discovery?
• In an open world, do we still
have a role for access to
digital content?
• Will print become invisible?
• Global content grows so
fast, our collections are
shrinking (relatively)
• Resources? Funding, Time,
Labour
9. www.bl.uk
Big Data, Little Data, No Data…(Borgman, 2014)
• Language of ‘Data’ taken from the Sciences, but can be
defined and managed more broadly in all disciplines
• Big Data requires computational methods for analysis and
visualisation – Volume, Velocity, Variety
• Cultural Heritage Data might be ‘messy’ or ‘dirty’ – may be
incomplete, have gaps or require additional metadata (e.g.
‘Box of 19th Century Theatre Posters’)
• Sensitive data(?) Can still occur in CH!
• Broad definition of ‘data’ in Cultural Heritage
10. www.bl.uk
UKRI Concordat on Open Research Data (2016)
• Data are ‘evidence that underpins the answer to the
research question, and can be used to validate findings
regardless of its form (e.g. print, digital, or physical).’
• The primary purpose of research data is to provide the
information necessary to support or validate a research
project's observations, findings or outputs
11. www.bl.uk
Why Manage Research Data?
• Make data Findable, Accessible,
Interoperable and Re-Useable (FAIR Data)
• Preserve data for long-term use and re-use
• Make Research Transparent/Open
(Validation of Research!) and Reproducible
• Funder and Publisher mandates
• Good Research Practice – GLAM
Institutions are Research Institutions!
12. www.bl.uk
Why is Research Data Management Important?
Good Professional Practice:
• Funder mandates and requirements
• Supports institutional integrity
• Supports collaboration through data sharing and re-use
• Reduces redundancy in research
Value to you as a Researcher/Institution:
• Reduce the risk of data loss
• Increased efficiency
• Validated and replicable research
• Increased sharing and re-use (increased possibilities for collaboration)
• Increased citations
• Increased Research Impact!
14. www.bl.uk
What does Managed Data Look like?
Well-managed data is:
• intelligible and verifiable, because it is well-documented
• findable, because it is well organised and uses useful
filenames
• protected against loss, corruption and unauthorised access,
because it is backed up and secured appropriately
• easy to share, because mechanisms for protecting
confidentiality and intellectual property have been considered
• maintainable, because it is managed in a way that suits the
research group that uses it
• compliant with relevant laws and policies
16. www.bl.uk
TO BE FINDABLE:
F1. (meta)data are assigned a globally unique and eternally
persistent identifier.
F2. data are described with rich metadata.
F3. (meta)data are registered or indexed in a searchable
resource.
F4. metadata specify the data identifier.
TO BE ACCESSIBLE:
A1. (meta)data are retrievable by their identifier using a
standardized communications protocol.
A1.1. the protocol is open, free, and universally implementable.
A1.2. the protocol allows for an authentication and authorization
procedure, where necessary.
A2. metadata are accessible, even when the data are no longer
available.
17. www.bl.uk
TO BE INTEROPERABLE:
I1. (meta)data use a formal, accessible, shared, and broadly
applicable language for knowledge representation.
I2. (meta)data use vocabularies that follow FAIR principles.
I3. (meta)data include qualified references to other (meta)data.
TO BE RE-USABLE:
R1. (meta)data have a plurality of accurate and relevant
attributes.
R1.1. (meta)data are released with a clear and accessible data
usage license.
R1.2. (meta)data are associated with their provenance.
R1.3. (meta)data meet domain-relevant community standards.
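As a rough illustration of how a few of these principles might be checked in practice, the sketch below tests a metadata record for the fields that F1/F4, F2, R1.1 and R1.2 imply. The field names, the mapping to principles and the example record are all assumptions made for illustration; they are not part of the FAIR principles or of any particular metadata schema, and the identifier is a placeholder rather than a registered DOI.

```python
# Hypothetical field names mapped to the FAIR principle each one loosely supports.
REQUIRED_FIELDS = {
    "identifier": "F1/F4: persistent identifier for the data",
    "title": "F2: descriptive metadata",
    "description": "F2: descriptive metadata",
    "license": "R1.1: clear data usage licence",
    "provenance": "R1.2: where the data came from",
}


def missing_fair_fields(record: dict) -> list[str]:
    """List the required fields that are absent or blank in a metadata record."""
    return [
        name
        for name in REQUIRED_FIELDS
        if not str(record.get(name) or "").strip()
    ]


# Illustrative record only; the identifier is a placeholder, not a real DOI.
posters = {
    "identifier": "10.5555/example",
    "title": "Box of 19th Century Theatre Posters",
    "description": "Digitised theatre posters with partial cataloguing",
    "license": "",
    "provenance": "Digitised from a physical collection",
}

for field in missing_fair_fields(posters):
    print(f"Missing {field} ({REQUIRED_FIELDS[field]})")
```

A check like this only covers whether fields are present; whether the metadata is rich enough, or meets community standards (R1.3), still needs human judgement.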
20. www.bl.uk
Research Data Management (in a Nutshell)
• Data Management Planning
• Data Preservation (Both Short- and Long-Term)
• Data Sharing (and Sensitive Data)
• Data Discovery, Access and Re-Use
21. www.bl.uk
Data Creation: Data Management Plans
• Should have a data management plan in place at the
beginning of a project, as standard practice
• Data management plans should provide an outline of uses,
responsibilities, ownership, access and sharing (licensing),
storage, maintenance and archiving (even disposal) of
research data and software
• Online tools for data management plans include DMPonline
(https://dmponline.dcc.ac.uk/)
22. www.bl.uk
Data Sharing: Why Share Data?
• Funder and publisher mandates
• Collaborations – Interdisciplinary and (often) International
• Validation/ Transparency, Support for
Open Research
• Citations = Research Impact!
• Sensitive data – Not All Data Can be Shared!
24. www.bl.uk
Software as Data
• ‘Software is used to create, interpret, present, manipulate and
manage data’ (Software Sustainability Institute)
• Data: ‘recorded factual material commonly retained by and
accepted…as necessary to validate research findings’
(EPSRC)
• Software = Data!
26. www.bl.uk
Software should be preserved if:
• Software can’t be separated from the data or digital object.
• Software is classified as a research output
• Software has intrinsic value
• More resources available at the Software Sustainability
Institute:
https://www.software.ac.uk/software-sustainability-institute
27. www.bl.uk
Digital Preservation (Software and Data)
• Software is a digital object and is also often a vital prerequisite
for the preservation of other digital objects
• Storage, Retrieval, Reconstruction and Replay are all
complicated by code libraries, dependencies and
software engineering practice overall
• Planning is essential for subsequent preservation
• Software management should be part of a broader plan for
research data management.
28. www.bl.uk
Some Strategies for Digital Preservation
• Data integrity and file fixity checks (using checksums) for
source code
• Media and format migrations
• Refreshing (reduces bit-rot)
• Emulation (‘simulates’ the conditions of a legacy system)
• Replication – ‘Lots of Copies Keeps Stuff Safe’
• Encapsulation – linking content with all information required for
operation – rich metadata approach (e.g. ‘README’ file and
annotation)
• Version Control – metadata to support versioning of software
and data
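The fixity checks mentioned above can be sketched in a few lines of Python. This is a minimal example under stated assumptions: SHA-256 is chosen as the checksum algorithm, and the manifest is a plain dictionary of relative path to digest; the function names and the manifest layout are hypothetical and do not describe any British Library system.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory: Path) -> dict[str, str]:
    """Record a checksum for every file under `directory`."""
    return {
        str(p.relative_to(directory)): sha256_of(p)
        for p in sorted(directory.rglob("*"))
        if p.is_file()
    }


def verify_manifest(directory: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or no longer match their checksum."""
    failures = []
    for relative_path, expected in manifest.items():
        target = directory / relative_path
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(relative_path)
    return failures
```

In use, a manifest would be built when the data or source code is deposited and stored alongside it; re-running `verify_manifest` after a migration or at regular intervals then flags any file that has been lost or altered.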
29. Open Data and where to Find It
(and Store/Archive It, too…)
• Re3data.org – Directory of subject-specific repositories
• Zenodo.org – Open Data repository run by CERN
• GitHub – Software and code repository
30. www.bl.uk
Why Use Persistent Identifiers?
• Use of persistent identifiers has
increased as scholarly
communications become
increasingly digital.
• ORCID iDs and DOIs support open
science by enabling
interoperability across research
infrastructures.
• For instance, DataCite and
Crossref can use DOIs and
ORCID iDs, in addition to other
metadata, to map and link
documents, data and
researchers (Linked Open Data).
31. www.bl.uk
DataCite and DataCite UK
• Non-profit organisation which provides infrastructure for DOIs
(Digital Object Identifiers)
• DOIs make data discoverable, citable and link datasets with
other related research outputs
• The British Library is the DataCite hub for DOI creation in the
UK.
• https://www.datacite.org/
• ‘To help the research community locate, identify, and cite
research data with confidence.’
32. www.bl.uk
DOIs (Digital Object Identifiers)
• Persistent identifier used to uniquely identify objects (datasets,
software, journal articles, theses), standardised by the
International Organization for Standardization (ISO)
• Presented as an alphanumeric code consisting of a prefix and
suffix separated by a slash ‘/’. The ‘10’ at the start of the prefix
identifies the string as a DOI within the Handle System namespace. E.g.
10.1037/rmh0000008
• Built on the Handle System, in which a DOI is ‘resolvable’
because metadata (such as a URL) is bound to the DOI and
returned when the DOI is looked up.
• DOI is persistent, so it is the publisher’s responsibility to
update the metadata attached to the DOI, otherwise, the DOI
will resolve to a dead link.
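The prefix/suffix structure described above can be shown with a short Python sketch that splits a DOI and builds a resolvable link. The function and class names are hypothetical. The `https://doi.org/` proxy is not mentioned in the slides; it is added here as the commonly used public resolver. The validation is deliberately loose and should not be taken as a complete DOI syntax check.

```python
from typing import NamedTuple
from urllib.parse import quote


class Doi(NamedTuple):
    prefix: str  # e.g. "10.1037"
    suffix: str  # e.g. "rmh0000008"


def parse_doi(doi: str) -> Doi:
    """Split a DOI name into prefix and suffix at the first slash."""
    prefix, separator, suffix = doi.strip().partition("/")
    if not separator or not suffix:
        raise ValueError(f"not a DOI (missing '/' or suffix): {doi!r}")
    if not prefix.startswith("10."):
        raise ValueError(f"DOI prefix must start with '10.': {doi!r}")
    return Doi(prefix, suffix)


def resolver_url(doi: str) -> str:
    """Build a link through the doi.org proxy, percent-encoding the suffix."""
    parsed = parse_doi(doi)
    return f"https://doi.org/{parsed.prefix}/{quote(parsed.suffix, safe='/')}"


print(parse_doi("10.1037/rmh0000008"))
print(resolver_url("10.1037/rmh0000008"))
```

Following the generated link relies on the publisher keeping the URL bound to the DOI up to date, which is the responsibility described in the last bullet above.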
34. www.bl.uk
FREYA Ambassadors’ Programme
• 3-Year EU-funded Project to advance infrastructure for persistent identifiers as a
core component of Open Research
• For more info, or to join, contact info@project-freya.eu
• https://www.project-freya.eu/en/activities/ambassador-programme
• Funded partners of FREYA include: STFC, PANGAEA, DANS, DataCite,
CERN and the British Library
35. www.bl.uk
Build Bridges, Not Siloes!
• Use FAIR Data Principles, Open Metadata and Persistent
Identifiers for Data!
36. www.bl.uk
The British Library in Context
• National Library for
the United Kingdom
• Second Largest
Library in the World –
over 150 million items
in most known
languages
• Over 16,000 visitors
per day (on-site and
on-line)
• Legal Deposit
37. The British Library response to challenges
• Living Knowledge articulates the
vision of the British Library in
2023 as the most open, creative
and innovative institution of its
kind in the world.
• A new Service Strategy for
research and a new Content
Strategy.
• New approach for delivery that
brings together the researcher-facing
departments in a joined-up
roadmap.
• Everything Available is a
strategic change management
portfolio designed to deliver the
transformation of the Library’s
services to researchers and
research organisations.
38. Six strategic priorities
Find
• Unified discovery workflow
Use
• Unified access workflow
• Registration and identity management
• Workspaces and tools
Share
• Digital collection unification
• Collection management as a service
40. Digital service elements
Digitisation
• On demand
• For institutions
Metadata
• Enhance content
• Provide identifiers
• Build semantic links
• Licensing support
Preservation
• Born digital
• Digitised
• Print
• Preservation as a service
Discovery
• BL & external content
• Feed external services (e.g. Google)
• Discovery as a service
• Single Digital Presence for public libraries
Analysis
• Text and data mining
• Machine interfaces
• Visualisation
• Machine learning
• Dedicated staff support
Access
• Shared platform
• Institutional portals
• Machine interfaces
• Feed external platforms
41. www.bl.uk
The UK ‘Research Data’ Landscape…
• UKRI – Data underpinning
research and policy must be
archived for 10+ years
• Data must be made as openly
available as possible (with
constraints for sensitive data)
• Data must have appropriate
metadata and be citable
42. Vision – Data Collections and Services
Our vision for the British Library is that research
data are as integrated into our collections,
research and services as text is today.
The British Library's users will be able to
consume research data online through tools that
enable it to be analysed, visualised and
understood by non-specialists.
43. www.bl.uk
British Library Data Strategy (2017)
• “All will be easy to discover and linked to
related research outputs, be they text, data or
multimedia.”
44. www.bl.uk
Data Services at the British Library
• Development of Infrastructure to support research data
management for data use and re-use at the British Library
• DataCite UK
• FREYA Project for Persistent Identifier (PID) Infrastructure
• Data in the Research Repository
• Discovery Services for Research Data
• Software and Data Carpentry and Software as Data Initiatives
(TBA)
46. Data Management
• Data management training
• Data Management Plan engagement
• British Library Data Management Plans
• Documented Data Management Processes
Jo, BL staff member
I was working on a grant proposal for ESRC.
They require a data management plan, so when
I was given an outline plan that set out the
Library’s processes for data management, I was
able to reuse that and save myself days of extra
work!
47. Data Creation
• Engaging and linking with others
• Clarify approach to data collection
Sonja, Epidemiologist
I was able to use the British Library web archive
as a dataset, correlating positive and negative
messages about statin use with NHS
prescription data. The subset of data I extracted
is really useful to others, so I offered it to the
Library who now make it available alongside
their other datasets.
48. Data Archiving and Preservation
• Digital shared storage
• Data preservation services for third parties
Robin, Consultant
We produce valuable reports and data on the
political environment of emerging market
economies. Now that the British Library is
archiving that data, we can ensure others get to
use it even if our consultancy closes down. We
can also give them DOIs, and track the impact
of the work we produce.
49. SHARE: Developing a repository platform
• Single BL repository platform
• Refresh national preservation
system (>5m items, petabyte-scale)
• Access layer with multiple
repositories, shared service model
• Repository pilot developed with:
[Diagram: Preservation Layer; Services Layer; Access Layer; EThOS; Data.bl.uk; BL Institutional Repository; Partner Repositories]
53. Data Discovery, Access, Reuse
• New models of data access
• Third-party data discovery
• Discovery for Library data
Rosslyn, Social Historian
My research on perceptions of gender involves
looking at if and how gender-specific words evolve
into derogatory terms. The British Library gave me
great advice on which collections I could use, and
how to connect tools to them. This allowed me to
automate analysis and visualisation of the data,
finding things I didn’t expect.
54. Data Discovery, Access, Reuse
• Tools and skills for data exploration
• Widening access
• DataCite UK
Alice, Post-Graduate Researcher
Being able to persistently identify my data with a
DOI means I can make my research
reproducible. It also means that I can track when
my data is cited, which is really helpful when it
comes to looking at my research impact.
55. www.bl.uk
‘Take-Home’ Points
• Data in cultural heritage may be very broadly defined.
• Use FAIR Data Principles as best practice
• Plan for data management following the research data lifecycle
• Data Discovery – Build Bridges, not Siloes
• Consider software as ‘data’ in RDM
• Persistent Identifiers to build robust, citable and discoverable
metadata and link outputs
57. www.bl.uk
Thank you for giving us your time!
Thank You!
Questions?
Email: sarah.stewart@bl.uk
datasets@bl.uk
@Biostew
Editor's notes
Many Types of Data - Data can come in many forms – What types of data are there? What types of data do you use/generate? Please give some examples here (make these appear on slide) - digital, spatial, physical (in the form of specimens) and even software can be considered to be data.
What kind of data will you be generating?
Why share data? Data sharing may be mandated by your funder. Another researcher may want to use your data for their work and collaborate with you or cite your data. Data may be shared to validate your published results. Sharing can increase your citations and impact. Not all data can be shared: sensitive data may carry ethical constraints, as with medical data, personal identifiers or commercially sensitive data. These types of data are typically restricted and cannot always be shared.
The best way to make content available to our users is to help other organisations to manage and share their content.
The Library needs to make data core to what it does. And this is the ultimate aim – being able to find, access and use research data at the British Library should eventually become business as usual.
Includes software not just data, and is one part of one of the Library’s strategic change programmes about opening up content – Everything Available.
The strategy is built on four themes, each of which is split into more specific areas of work.
I’m going to briefly introduce each theme to give a flavour of the activities they cover. Each theme also comes with a scenario that is the kind of activity we hope to be able to support if the vision is achieved. These are all in a nice shiny booklet we have about the strategy, come and see me if you want a copy!
The data management theme largely has an internal focus. Its aim is to meet our data management and data management planning obligations as a funding recipient. We’re an independent research organisation and we get funding from AHRC as well as the EU; both require data management plans.
If we have documented data management processes and plans in place, any BL staff participating in research will be able to take advantage of them. This will go hand in hand with training and engagement.
Even then our aim is not to hold every bit of data in the UK. But we want to link any data that we have with data held by others. Data derived from our collections can help to provide important context for that held elsewhere, as shown in our case study. The breadth of our collections can provide important social, geographical and other contexts – both historical and contemporary.
The strategy does not explicitly define potential services such as these because the landscape is moving and we want to be able to predict and respond, rather than tie ourselves down to a service that may be relevant this week, but not in 18 months’ time.
However, some of the proposed work may relate directly to DataCite services, for instance helping to support the persistence of DataCite DOIs by bolting preservation on to the existing DataCite service, which has persistence as a core requirement.
Finally, discovery, access and reuse of data is the largest theme in the strategy. As a consequence of creating new datasets, we will need to make sure that users can not only find them, but access and use them in an appropriate way. We should also be ensuring that users are able to find data no matter who holds it and where.
Within this theme, there is also an opportunity to widen access to data. We want to look at how we can provide access mechanisms and environments for restricted data that not only meet the requirements of data stewards, but also allow access for the non-academic but still bona fide researchers that we see in the reading rooms.
We will also continue to support data accessibility and sharing through our work on DataCite.