An increasing number of researchers rely on computational methods to generate the results described in their publications. Research software created to this end is heterogeneous (e.g., scripts, libraries, packages, notebooks) and usually difficult to find, reuse, compare and understand, because its documentation is dispersed across manuals, readme files, web sites, and code comments, and because it lacks structured metadata. In this talk I will describe the main challenges for finding, comparing and reusing research software; how structured metadata can help address some of them; the best practices proposed by the community; and current initiatives to aid their adoption by researchers within EOSC.
Impact: The talk addresses an important aspect of the EOSC infrastructure for quality research software by ensuring that software contributed to the EOSC ecosystem can be found, compared and reused by researchers. The talk also aims to address metadata quality of current research products, which is critical for successful adoption.
Presented at the EOSC symposium
Towards Reusable Research Software with Automated Metadata Extraction
1. Towards Reusable Research Software
Daniel Garijo Verdejo
@dgarijov
daniel.garijo@upm.es
Ontology Engineering Group
Departamento de Inteligencia Artificial
Facultad de Informática
Universidad Politécnica de Madrid
2. Reproducibility: Open Research Data, Software and Methods
Scientific publication
Research Data Research Software Research Methods
EOSC Symposium: Infrastructure for quality research software
3. Challenges for (Re)using and Sharing Research Software
Software user
• What does the software component do? Which of its methods should I use?
• How to transform my data to use the software component?
• How to interpret the results produced by the software component?
• How to invoke the software component?
• How to configure the software component with the right parameters?
• How to compare against similar methods?
Software designer
• How to ease capturing the dependencies and installation instructions of my software?
• How to encapsulate my software so it can be used with other data?
• How to describe my software so it can be used by others?
• How to test if my software is ready to be used by others?
4. Community Initiatives and Standards
• Describing Research Software
  • Schema.org & Codemeta
  • Common Workflow Language (I/O)
• Packaging Research Artefacts (incl. software)
  • Research Objects (RO-Crate)
• Aggregators (OpenAIRE, EOSC)
• General (e.g., Zenodo) & domain-specific registries
  • Scicodes (https://scicodes.net/)
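As an illustration of what a Codemeta description looks like, the following is a minimal sketch that builds a Codemeta 2.0 record in Python. The property names (`name`, `codeRepository`, `license`, `programmingLanguage`, `author`) come from the Codemeta and Schema.org vocabularies; the software, its URL, and its author are hypothetical placeholders, and the helper `build_codemeta` is an assumption made for this example.

```python
import json

# JSON-LD context published by the Codemeta project (version 2.0).
CODEMETA_CONTEXT = "https://doi.org/10.5063/schema/codemeta-2.0"


def build_codemeta(name, description, repository, license_url, authors, languages):
    """Return a dict that can be serialized as a codemeta.json file.

    `authors` is a list of (given_name, family_name) tuples.
    """
    return {
        "@context": CODEMETA_CONTEXT,
        "@type": "SoftwareSourceCode",
        "name": name,
        "description": description,
        "codeRepository": repository,
        "license": license_url,
        "programmingLanguage": languages,
        "author": [
            {"@type": "Person", "givenName": given, "familyName": family}
            for given, family in authors
        ],
    }


# Hypothetical software component; every value below is a placeholder.
record = build_codemeta(
    name="example-tool",
    description="A hypothetical tool that converts CSV files to JSON.",
    repository="https://example.org/example-tool",
    license_url="https://spdx.org/licenses/MIT",
    authors=[("Ada", "Example")],
    languages=["Python"],
)

codemeta_json = json.dumps(record, indent=2)

if __name__ == "__main__":
    print(codemeta_json)
```

Saving the serialized record as `codemeta.json` at the root of a repository is the usual convention, so that registries and aggregators can pick it up.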
Nine Best Practices for Research Software Registries and Repositories: A Concise Guide https://arxiv.org/abs/2012.13117
5. Adopting annotation vocabularies: where are we at?
Machine-readable software metadata is not abundant
Can you please describe your software component with metadata?
I already did! Did you read the project readme?
Did you see the online documentation?
Perhaps you saw the paper?
Many domain-specific registries are hand-curated by experts
6. Automated Software Metadata Extraction
SOMEF: SOftware Metadata Extraction Framework
https://github.com/KnowledgeCaptureAndDiscovery/somef/
[Mao et al 2019]: SoMEF: A Framework for Capturing Software Metadata from its Documentation. 2019 IEEE BigData REU Symposium. Los Angeles, 2019
Code repository (readme) → machine-readable file with software metadata:
• > 20 common metadata fields
• Installation instructions, description, invocation command, license, author, citation, requirements, examples, documentation, notebooks, etc.
• Analysis of readme and supplementary files (e.g., notebooks, Dockerfiles)
• JSON, RDF (graph), Codemeta, RO (in progress)
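To make the idea of readme-based extraction concrete, here is a minimal sketch that maps Markdown section headers to metadata fields by keyword matching. It is not SOMEF's implementation or API: the keyword table, the function names, and the sample readme are all assumptions made for this example, and a real extractor needs far more robust techniques than header keywords.

```python
import re

# Hypothetical mapping from metadata field to header keywords.
# This table is an assumption for illustration, not taken from SOMEF.
HEADER_KEYWORDS = {
    "installation": ("install", "setup"),
    "invocation": ("usage", "getting started"),
    "citation": ("cite", "citation"),
    "license": ("license",),
    "requirements": ("requirement", "dependenc"),
}

HEADER_PATTERN = re.compile(r"^#{1,6}\s+(.*)$")


def split_sections(readme_text):
    """Split a Markdown readme into (header, body) pairs."""
    sections = []
    header, body = "", []
    for line in readme_text.splitlines():
        match = HEADER_PATTERN.match(line)
        if match:
            sections.append((header, "\n".join(body).strip()))
            header, body = match.group(1).strip(), []
        else:
            body.append(line)
    sections.append((header, "\n".join(body).strip()))
    return sections


def classify_header(header):
    """Return the metadata field suggested by a header, or None."""
    lowered = header.lower()
    for field, keywords in HEADER_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return field
    return None


def extract_metadata(readme_text):
    """Build a field -> text dict from the sections of a readme."""
    metadata = {}
    for header, body in split_sections(readme_text):
        if not body:
            continue
        field = classify_header(header)
        # Treat the first non-empty, unmatched section as the description.
        if field is None and not metadata:
            field = "description"
        if field is not None:
            metadata.setdefault(field, body)
    return metadata


# Hypothetical readme used only to exercise the sketch.
SAMPLE_README = """# example-tool
A hypothetical tool that converts CSV files to JSON.

## Installation
pip install example-tool

## Usage
example-tool input.csv output.json

## License
MIT
"""

if __name__ == "__main__":
    for field, text in extract_metadata(SAMPLE_README).items():
        print(f"{field}: {text}")
```

The resulting dictionary could then be serialized to JSON or mapped to Codemeta properties, which is the kind of machine-readable output the slide lists.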
7. Leveraging Software Metadata to create Knowledge Graphs
Explore input/output variables (interoperability)
Explore software I/O files (composition)
Knowledge Graphs can link research software (RS) and its components.
OKG-Soft: machine-readable Software Metadata:
• (From Schema.org) Attribution, license, funding, usage examples...
• Executable software components
• Software invocation
• Input & output files, variables and units
• Containers used to encapsulate and run software components
• Parameter validation and suggestion
[Garijo et al 2019]: OKG-Soft: An Open Knowledge Graph with Machine Readable Scientific Software Metadata. International Conference on eScience, San Diego, USA. 2019
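The composition idea can be sketched with a handful of triples: if one component outputs a variable that another takes as input, the two are candidates for chaining. The example below is a plain-Python approximation, not the OKG-Soft vocabulary or API; the predicate names `hasInput` and `hasOutput`, the component names, and the variables are hypothetical.

```python
from collections import defaultdict

# Hypothetical (subject, predicate, object) triples describing
# which variables each software component consumes and produces.
TRIPLES = [
    ("rainfall_model", "hasOutput", "precipitation"),
    ("flood_model", "hasInput", "precipitation"),
    ("flood_model", "hasOutput", "flood_depth"),
    ("crop_model", "hasInput", "precipitation"),
    ("crop_model", "hasInput", "temperature"),
]


def composable_pairs(triples):
    """Return sorted (producer, variable, consumer) tuples.

    A pair is composable when the producer outputs a variable
    that the consumer declares as an input.
    """
    producers = defaultdict(set)
    consumers = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == "hasOutput":
            producers[obj].add(subject)
        elif predicate == "hasInput":
            consumers[obj].add(subject)

    pairs = set()
    for variable, sources in producers.items():
        for source in sources:
            for target in consumers.get(variable, ()):
                if source != target:
                    pairs.add((source, variable, target))
    return sorted(pairs)


if __name__ == "__main__":
    for producer, variable, consumer in composable_pairs(TRIPLES):
        print(f"{producer} --{variable}--> {consumer}")
```

In a real knowledge graph the same question would typically be asked with a graph query over standardized variable identifiers, so that matching does not depend on components using identical free-text names.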
8. Conclusions
Research Software Metadata should be actionable and useful for:
• Understanding the differences between two or more software components
• Supporting portability (ROs)
• Adding components to workflows (CWL + ROs)
• Linking similar software methods
• Building automated comparison benchmarks
• Reducing the time needed to understand and adopt an existing software component
• Crediting authors
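As a small illustration of the first point, structured metadata makes differences between components computable. The sketch below compares two metadata records field by field; the records, their field names, and the function `compare_metadata` are hypothetical and only meant to show the principle.

```python
def compare_metadata(first, second):
    """Summarize how two flat metadata records agree and differ."""
    shared = first.keys() & second.keys()
    return {
        "same": {k: first[k] for k in sorted(shared) if first[k] == second[k]},
        "different": {
            k: (first[k], second[k]) for k in sorted(shared) if first[k] != second[k]
        },
        "only_in_first": sorted(first.keys() - second.keys()),
        "only_in_second": sorted(second.keys() - first.keys()),
    }


# Hypothetical records for two software components.
TOOL_A = {"license": "MIT", "programmingLanguage": "Python", "citation": "Example 2019"}
TOOL_B = {"license": "Apache-2.0", "programmingLanguage": "Python", "container": "docker"}

if __name__ == "__main__":
    report = compare_metadata(TOOL_A, TOOL_B)
    for section, content in report.items():
        print(section, content)
```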