Presentation given at EuropeanaTech 2018 in Rotterdam, The Netherlands. Summarizes insights gained from about a decade of work on challenges related to temporal aspects of the web and persistence.
Herb Van de Sompel presentation on preserving links and content over time
1. Herbert Van de Sompel @hvdsomp
EuropeanaTech 2018, Rotterdam, The Netherlands, 15/05/18
Herbert Van de Sompel
Los Alamos National Laboratory
@hvdsomp
Perseverance on Persistence
a future-note about the past
2. Herbert Van de Sompel @hvdsomp
OAI-ORE
3. Herbert Van de Sompel @hvdsomp
2006
• OAI-ORE observation: Scholarly assets are
rapidly becoming compound, consisting of
multiple resources with various:
• Relationships
• Interdependencies
• How to convey this compound-ness in an
interoperable manner so that applications
can access and consume such assets?
http://www.openarchives.org/ore/1.0/toc
4. Herbert Van de Sompel @hvdsomp
Address interoperability challenges from the perspective of the web
• The resource at the center of the universe
• The notion of a repository (or even of a web server) does
not exist in the architecture of the web
• Neither does the notion of a Digital Object
• The tools of the interoperability trade are the primitives of the
web
ORE Insight 1 - Web-Centric Interoperability Paradigm
5. Herbert Van de Sompel @hvdsomp
Tools of the Web-Centric Interoperability Trade
• Resource
• URI
• HTTP as the API: HEAD/GET, POST, PUT, DELETE
• Representation
• Media Type
• Link
• Content Negotiation
• Typed Link
• Controlled Vocabularies for Typed Links
W3C Architecture of the World Wide Web
RDF, RDFS, OWL
7. Herbert Van de Sompel @hvdsomp
OAI-ORE in EDM
Europeana v1.0 2009
8. Herbert Van de Sompel @hvdsomp
The web-centric ORE approach allowed using off-the-shelf web
tools to archive evolving compound objects
• Evolving versions of Resource Maps, Aggregated Resources
were captured in a web archive
• But how to use the URI of the Aggregation or Resource Map to
see the status of an Aggregation at a specific moment in the
past?
ORE Insight 2 – How to Access Temporal State of an Aggregation
H. Van de Sompel (2007) Compound Information Object Prototype Demonstration
https://www.dropbox.com/s/dd7xd427y90q4jx/CT_Watch_hvds_20070703.mov?dl=0
9. Herbert Van de Sompel @hvdsomp
H. Van de Sompel, M. L. Nelson, R. Sanderson (2013) RFC 7089 - HTTP Framework for
Time-Based Access to Resource States – Memento. https://tools.ietf.org/html/rfc7089
Memento
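In RFC 7089, a client asks a TimeGate for a past state of a resource by sending an Accept-Datetime request header, and a Memento reports its archival datetime in a Memento-Datetime response header. The following is a minimal sketch of producing and reading those header values; the function names are illustrative assumptions, and no network request is made.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def accept_datetime_header(moment: datetime) -> dict:
    """Build the request header for datetime negotiation with a TimeGate.

    A naive datetime is interpreted as local time by astimezone().
    """
    # RFC 7089 uses an RFC 1123 style date expressed in GMT.
    as_gmt = moment.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(as_gmt, usegmt=True)}


def memento_datetime(response_headers: dict) -> datetime:
    """Read the archival datetime a Memento reports about itself."""
    return parsedate_to_datetime(response_headers["Memento-Datetime"])


headers = accept_datetime_header(
    datetime(2007, 3, 20, 12, 0, tzinfo=timezone.utc)
)
print(headers)
```

The resulting dictionary could be passed as request headers to any HTTP client when contacting a TimeGate.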
10. Herbert Van de Sompel @hvdsomp
Tools of the Web-Centric Interoperability Trade – HTTP Stack
• Resource
• URI
• HTTP as the API
• Representation
• Media Types
• Link
• Content Negotiation, e.g. for preferred Media Type
• Typed Link
• Controlled Vocabularies for Typed Links
W3C Architecture of the World Wide Web
HTTP Links, IANA link relation registry, community link relation types
HATEOAS – Hypermedia As The Engine Of Application State
http://en.wikipedia.org/wiki/HATEOAS
11. Herbert Van de Sompel @hvdsomp
Original Resource and Mementos
12. Herbert Van de Sompel @hvdsomp
Bridge from Present to Past
13. Herbert Van de Sompel @hvdsomp
Bridge from Present to Past
14. Herbert Van de Sompel @hvdsomp
Bridge from Past to Present
15. Herbert Van de Sompel @hvdsomp
timegate Link: Link to Your Own History
Can link to preferred web
archive, but also:
• Maintain your own
resource version history
• timegate link to your
own history
• Distributed management of
resource history
• Uniform access to
resource history across
systems
• Follow links across
systems subject to time
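A server that maintains its own version history can advertise it with a "timegate" link in the HTTP Link header of the original resource. The sketch below shows how a client might pull link targets out of such a header, grouped by relation type. The parser is deliberately simplified (it is not a full RFC 8288 implementation), and the example.org URIs are hypothetical.

```python
import re

# One link-value: <target> followed by ;name=value parameters.
LINK_RE = re.compile(
    r'<([^>]+)>((?:\s*;\s*[\w-]+\s*=\s*(?:"[^"]*"|[^\s;,]+))*)'
)
PARAM_RE = re.compile(r'([\w-]+)\s*=\s*(?:"([^"]*)"|([^\s;,]+))')


def links_by_relation(link_header: str) -> dict:
    """Map each relation type in a Link header to its target URIs."""
    found = {}
    for target, params in LINK_RE.findall(link_header):
        for name, quoted, bare in PARAM_RE.findall(params):
            if name.lower() == "rel":
                # A rel value may hold several space-separated types.
                for rel in (quoted or bare).split():
                    found.setdefault(rel.lower(), []).append(target)
    return found


example_header = (
    '<http://example.org/history/timegate/page>; rel="timegate", '
    '<http://example.org/history/20070320/page>; rel="first memento"; '
    'datetime="Tue, 20 Mar 2007 12:00:00 GMT"'
)
relations = links_by_relation(example_header)
print(relations["timegate"])
```

A client that finds a "timegate" entry would use it for datetime negotiation; a client that finds none falls back on a TimeGate of its own choosing.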
16. Herbert Van de Sompel @hvdsomp
No timegate Link – Client Intelligence
Client uses TimeGate of its
preferred web archive, but:
• Internet Archive is
massive, yet other archives
hold substantial unique
materials
• Introduce aggregated
TimeGate: Memento
Aggregator
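The core selection step of an aggregated TimeGate can be sketched as follows: given the Mementos that several archives report for one original URI, return the one whose datetime is closest to the requested datetime. This is an illustration of the idea only, not the Memento Aggregator's code; the archive names and URIs are hypothetical.

```python
from datetime import datetime, timezone


def closest_memento(requested, candidates):
    """Pick the candidate nearest in time to the requested datetime.

    candidates: iterable of (archive_name, memento_uri, memento_datetime).
    Returns None when no archive reported a Memento.
    """
    return min(candidates, key=lambda c: abs(c[2] - requested), default=None)


requested = datetime(2009, 6, 1, tzinfo=timezone.utc)
candidates = [
    ("archive-a", "http://archive-a.example.org/20090508/page",
     datetime(2009, 5, 8, tzinfo=timezone.utc)),
    ("archive-b", "http://archive-b.example.org/20090827/page",
     datetime(2009, 8, 27, tzinfo=timezone.utc)),
]
best = closest_memento(requested, candidates)
print(best[0], best[1])
```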
17. Herbert Van de Sompel @hvdsomp
Routing TimeGate Requests Using Machine Learning
Bornand, N., Balakireva, L., Van de Sompel, H. (2016) Routing Memento Requests Using Binary
Classifiers. JCDL16. https://arxiv.org/abs/1606.09136
• Memento Aggregator covers 20+ web archives
• Distributed systems problem: As the number of archives (and
incoming requests) grows, sending requests to each archive for
every incoming request is not feasible
• Response times
• Load on distributed archives
• After various optimization attempts, devised an approach using
binary classifiers per web archive:
• Trained on the basis of cached URIs, using URI features only
• Operational since 2016: 80% fewer queries, response times
reduced by one third, recall of 85%
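The routing step can be sketched as follows: derive features from the request URI alone, ask one binary classifier per archive whether that archive is likely to hold the URI, and query only the archives that answer yes. The features and the stand-in classifiers below are assumptions made for illustration; they are not the features or trained models described in the cited paper.

```python
from urllib.parse import urlsplit


def uri_features(uri: str) -> dict:
    """Derive simple features from the URI string only (illustrative)."""
    parts = urlsplit(uri)
    host = parts.hostname or ""
    return {
        "tld": host.rsplit(".", 1)[-1],
        "path_depth": len([seg for seg in parts.path.split("/") if seg]),
        "length": len(uri),
    }


def route(uri: str, classifiers: dict) -> list:
    """Return the archives whose classifier predicts a hit for this URI."""
    features = uri_features(uri)
    return [name for name, predicts_hit in classifiers.items()
            if predicts_hit(features)]


# Stand-ins for trained per-archive models (hypothetical rules).
classifiers = {
    "archive-a": lambda f: f["tld"] == "uk",
    "archive-b": lambda f: True,
}
print(route("http://www.example.co.uk/a/b", classifiers))
```

In a real deployment each lambda would be replaced by a model trained on URIs that the archive was previously found to hold or not hold.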
18. Herbert Van de Sompel @hvdsomp
[Screenshot: date selection for captures from the Internet Archive; options shown: Today, Mar 20 2007, Apr 03 2007]
Various Memento Tools (client/server)
https://github.com/machawk1/awesome-memento
19. Herbert Van de Sompel @hvdsomp
Pockets of Persistence
20. Herbert Van de Sompel @hvdsomp
Creating Pockets of Persistence
• With Memento’s time travel capability in place, what would it take to
support faithfully navigating the web of the Past?
• There are two major forces that hinder achieving this goal:
• Link rot: A link stops working altogether
• Content drift: The linked content changes over time and may
eventually no longer be representative of the content that was
originally linked
• Without these forces at work, the web of the Present would be the
same as the web of the Past
• But that clearly is not the case
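The two forces named above can be told apart mechanically, as this rough sketch suggests: a failed request counts as link rot, while a successful request whose content differs too much from what was originally linked counts as content drift. The similarity measure (difflib) and the 0.9 threshold are assumptions chosen for illustration, not the method of the studies cited in this presentation.

```python
from difflib import SequenceMatcher


def link_state(status_code, text_when_linked, text_now, min_similarity=0.9):
    """Classify a link as 'link rot', 'content drift', or 'intact'.

    status_code is None when the request failed before any response.
    """
    if status_code is None or status_code >= 400:
        return "link rot"
    similarity = SequenceMatcher(None, text_when_linked, text_now).ratio()
    return "content drift" if similarity < min_similarity else "intact"


print(link_state(404, "table of results", "table of results"))
print(link_state(200, "table of results", "table of results"))
print(link_state(200, "neutrino detector status", "domain for sale"))
```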
21. Herbert Van de Sompel @hvdsomp
Hyperlinks in Theory
22. Herbert Van de Sompel @hvdsomp
Hyperlinks in Reality
23. Herbert Van de Sompel @hvdsomp
Hyperlinks in Reality
24. Herbert Van de Sompel @hvdsomp
Link Rot
25. Herbert Van de Sompel @hvdsomp
Link Rot - PMC
Martin Klein, Herbert Van de Sompel, Robert Sanderson, Harihar Shankar, et al. (2014) Scholarly
context not found. In: PLOS ONE https://doi.org/10.1371/journal.pone.0115253
26. Herbert Van de Sompel @hvdsomp
Hyperlinks in Reality
27. Herbert Van de Sompel @hvdsomp
Content Drift
28. Herbert Van de Sompel @hvdsomp
Content Drift
29. Herbert Van de Sompel @hvdsomp
Content Drift
http://icecube.wisc.edu/ on May 8 2009 (left) and August 27 2009 (right)
30. Herbert Van de Sompel @hvdsomp
No Content Drift
http://www.ifa.hawaii.edu/~cowie/k_table.html on June 9 1997 (left) and March 2016 (right)
31. Herbert Van de Sompel @hvdsomp
Content Drift - PMC
Shawn Jones, Herbert Van de Sompel, Harihar Shankar, Martin Klein, et al. (2016) Scholarly context
adrift. In: PLOS ONE https://doi.org/10.1371/journal.pone.0167475
32. Herbert Van de Sompel @hvdsomp
Creating Pockets of Persistence
• What would it take to really support faithfully navigating the web of
the Past?
• This challenge exists for the entire web. Some communities with well
managed collections care about addressing it:
• Scholarly communication
• Cultural heritage
• Legal publications
• Journalism
• Wikipedia
• Why?
• Link Rot: Quality of Service
• Content Drift: integrity of the record, reliable evidence, revisiting
the state of knowledge, transparency of editorial process, …
33. Herbert Van de Sompel @hvdsomp
US Supreme Court Opinion – Link Rot Activism
http://ssnat.com
34. Herbert Van de Sompel @hvdsomp
Two Types of Links from a Managed Collection
35. Herbert Van de Sompel @hvdsomp
Take 1 – PID Approach
PID for B
36. Herbert Van de Sompel @hvdsomp
Managed Collection => Managed Collection
37. Herbert Van de Sompel @hvdsomp
PID Approach
Combat:
• Link Rot: Link to PID;
Redirect to current location
• Content Drift: Mint a PID
per version; Link to version
PID
With PID links:
• Web of Present = Web of
Past
39. Herbert Van de Sompel @hvdsomp
URI References - PMC
Herbert Van de Sompel, Martin Klein, and Shawn Jones (2016) Persistent URIs Must Be Used to Be Persistent.
In: WWW2016. http://arxiv.org/1602.09102
40. Herbert Van de Sompel @hvdsomp
cite-as Relation Type
Herbert Van de Sompel et al. (2018) cite-as: A Link Relation to Convey a Preferred URI for
Referencing. https://datatracker.ietf.org/doc/draft-vandesompel-citeas/
http://signposting.org
41. Herbert Van de Sompel @hvdsomp
PID Approach – Division of Labor
42. Herbert Van de Sompel @hvdsomp
Managed Collection => Web at Large
44. Herbert Van de Sompel @hvdsomp
PID Approach
45. Herbert Van de Sompel @hvdsomp
Take 2 – Robust Links Approach
46. Herbert Van de Sompel @hvdsomp
Managed Collection => Web at Large
47. Herbert Van de Sompel @hvdsomp
Snapshot Approach
Combat:
• Link Rot & Content Drift:
Custodian of A creates
snapshot of B, in web
archive or locally
Regarding links:
• Intuition suggests linking to
the snapshot of B …
48. Herbert Van de Sompel @hvdsomp
Linking to Snapshot of B = Potentially Creating a Rotten Link
• Existing practice for linking to snapshots:
<a href="URL of snapshot of B">
• Problems with existing practice:
o Impossible to visit the original URI, if desired
o Requires the permanent existence/uptime of the archive that
holds the snapshot
- One link rot problem replaced by another
http://robustlinks.mementoweb.org/about/
49. Herbert Van de Sompel @hvdsomp
Permanent Existence/Uptime of Archives?
Remnant of discontinued web archive http://mummify.it captured on February 14 2014
https://web.archive.org/web/20140214233752/https://www.mummify.it/
50. Herbert Van de Sompel @hvdsomp
Permanent Existence/Uptime of Archives?
http://www.themoscowtimes.com/news/article/russia-bans-wayback-machine-internet-archive-over-islamic-state-video/510074.html
51. Herbert Van de Sompel @hvdsomp
Permanent Existence/Uptime of Archives?
http://web.archive.org/web/20121101043952/http://vogin.nl on March 6 2017 at 15:59 CET
52. Herbert Van de Sompel @hvdsomp
Decorate the Link
• Proposed practice for linking to captures:
<a href="URL of snapshot of B"
   data-originalurl="B"
   data-versiondate="datetime of snapshot of B">
<a href="B"
   data-versionurl="URL of snapshot of B"
   data-versiondate="datetime of snapshot of B">
http://robustlinks.mementoweb.org/spec/
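The two link decoration patterns above can be generated programmatically. The sketch below builds either variant from the original URI, the snapshot URI, and the snapshot datetime; the function name, the link_to parameter, and the example URIs are illustrative assumptions rather than part of the Robust Links specification.

```python
from html import escape


def robust_link(original_url, snapshot_url, version_date, text,
                link_to="original"):
    """Return an HTML anchor decorated with Robust Links attributes."""
    if link_to == "original":
        # href points at B; the snapshot is carried as decoration.
        href, extra_name, extra_value = (
            original_url, "data-versionurl", snapshot_url)
    elif link_to == "snapshot":
        # href points at the snapshot; B is carried as decoration.
        href, extra_name, extra_value = (
            snapshot_url, "data-originalurl", original_url)
    else:
        raise ValueError("link_to must be 'original' or 'snapshot'")
    return '<a href="{}" {}="{}" data-versiondate="{}">{}</a>'.format(
        escape(href, quote=True),
        extra_name,
        escape(extra_value, quote=True),
        escape(version_date, quote=True),
        escape(text),
    )


anchor = robust_link(
    "http://example.org/b",
    "https://archive.example.net/20180515/http://example.org/b",
    "2018-05-15",
    "B",
)
print(anchor)
```

Either way, the original URI, the snapshot URI, and the datetime all remain available to the reader and to scripts, so the link does not depend on the archive alone.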
53. Herbert Van de Sompel @hvdsomp
Robust Links: Link Decoration in Action
Van de Sompel, H. & Nelson, M.L. (2015) Reminiscing about 15 years of interoperability efforts. In:
D-Lib Magazine. https://doi.org/10.1045/november2015-vandesompel
JavaScript makes the
link decorations actionable
54. Herbert Van de Sompel @hvdsomp
Robust Links: Refuse to Die
56. Herbert Van de Sompel @hvdsomp
Snapshot Approach – Division of Labor
57. Herbert Van de Sompel @hvdsomp
Managed Collection => Managed Collection
58. Herbert Van de Sompel @hvdsomp
Cool URI Approach
Combat:
• Link Rot: Link to B;
Redirect to current location
• Content Drift: Generic URI;
Version URIs
With Cool URI links:
• Tension between linking to
generic URI and version
URI
59. Herbert Van de Sompel @hvdsomp
Robust Links: Refuse to Die
61. Herbert Van de Sompel @hvdsomp
Cool URI Approach – Division of Labor
62. Herbert Van de Sompel @hvdsomp
Robust Links Approach
63. Herbert Van de Sompel @hvdsomp
Summary
[Table: PID and RL approaches compared on division of labor]
64. Herbert Van de Sompel @hvdsomp
Robust Links for Linked Data?
Sanderson, R., Ciccarese, P., and Young, B. (2017) Web Annotation Vocabulary.
W3C Recommendation, 23 February 2017. https://www.w3.org/TR/annotation-vocab/
65. Herbert Van de Sompel @hvdsomp
Handling Resource Versions, Captures
[Figure: resource B and its versions or captures B(t1) and B(t2)]
66. Herbert Van de Sompel @hvdsomp
Systems with Resource Versions
67. Herbert Van de Sompel @hvdsomp
DBpedia Snapshot Archive Using HDT, TPF, Memento
Vander Sande, M., Verborgh, R., Hochstenbach, P., and Van de Sompel, H. (2017) Towards
sustainable publishing and querying of distributed Linked Data archives.
Temporal: subject URI access; ?s ?p ?o queries; SPARQL queries
68. Herbert Van de Sompel @hvdsomp
Memento Tracer
http://tracer.mementoweb.org
69. Herbert Van de Sompel @hvdsomp
Resource Capture: Tension Between Scale and Quality
• Web crawling: optimized for scale
• Problems with capturing resources accessible via interactive
affordances
• webrecorder.io: optimized for quality
• Personal archiving
• User records web navigation session
• Not used for archiving at scale
• LOCKSS: optimized for scholarly journals
• Pages in Publisher/Journal portals share layout, affordances
• Heuristics per publisher/journal to improve capture quality
70. Herbert Van de Sompel @hvdsomp
Memento Tracer: New Sweet Spot Between Scale and Quality
• ~ web crawling: server side process to capture resources
• ~ LOCKSS: leverages insight that web publications in any given
portal are based on same template:
• share layout
• share interactive affordances
• ~ webrecorder.io: human guidance to achieve quality
• But, with Memento Tracer:
• user does not record a specific web publication
• user records heuristics that apply to a class of web publications
71. Herbert Van de Sompel @hvdsomp
Memento Tracer
72. Herbert Van de Sompel @hvdsomp
A Trace for slideshare Presentations
{ "portal_url_match": "(slideshare.net)/([^/]+)/([^/]+)",
  "actions": [
    { "action_order": "1",
      "value": "div.j-next-btn.arrow-right",
      "type": "CSSSelector",
      "action": "repeated_click",
      "repeat_until": {
        "condition": "changes",
        "type": "resource_url"
      }
    },
    { "action_order": "2",
      "value": "div.notranslate.transcript.add-padding-right.j-transcript a",
      "type": "CSSSelector",
      "action": "click"
    }
  ], …
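A capture process would first need to decide whether a Trace applies to a given URL and in which order to replay its actions. The sketch below does both using only the fields visible in the Trace above; the remainder of the Trace is elided on the slide and is not reconstructed here. The helper names and the example URLs are hypothetical, and this is not Memento Tracer's own code.

```python
import re

# Only the fields shown on the slide; actions listed out of order on purpose.
trace = {
    "portal_url_match": "(slideshare.net)/([^/]+)/([^/]+)",
    "actions": [
        {"action_order": "2",
         "value": "div.notranslate.transcript.add-padding-right"
                  ".j-transcript a",
         "type": "CSSSelector",
         "action": "click"},
        {"action_order": "1",
         "value": "div.j-next-btn.arrow-right",
         "type": "CSSSelector",
         "action": "repeated_click",
         "repeat_until": {"condition": "changes",
                          "type": "resource_url"}},
    ],
}


def trace_applies(trace: dict, url: str) -> bool:
    """True when the URL matches the Trace's portal URL pattern."""
    return re.search(trace["portal_url_match"], url) is not None


def ordered_actions(trace: dict) -> list:
    """Actions sorted by their numeric action_order."""
    return sorted(trace["actions"], key=lambda a: int(a["action_order"]))


url = "https://www.slideshare.net/someuser/some-presentation"
if trace_applies(trace, url):
    for step in ordered_actions(trace):
        print(step["action_order"], step["action"], step["value"])
```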
73. Herbert Van de Sompel @hvdsomp
Memento Tracer: Experimental
• Promising results, thus far
• Currently investigating challenges, including:
• User interface to support recording Traces for complex
sequences of interactions.
• Limitations of the browser event listener approach for recording
Traces.
• Language used to express Traces.
• Organization of the shared repository for Traces.
• Selection of a Trace for capturing a web publication in cases
where different page layouts and interactive affordances are
available for web publications that share a URI pattern.
74. Herbert Van de Sompel @hvdsomp
Demo: Recording a Trace for a Web Publication
https://github.com/www.gorillatoolkit/pkg/mux
75. Herbert Van de Sompel @hvdsomp
Demo: Capturing another Web Publication Using the Trace
https://github.com/mementoweb/node-solid-server
76. Herbert Van de Sompel @hvdsomp
Demo: Capturing another Web Publication Using the Trace
https://github.com/mementoweb/node-solid-server
77. Herbert Van de Sompel @hvdsomp
Demo: Playing Back the Captured Web Publication
Capture of https://github.com/mementoweb/node-solid-server
78. Herbert Van de Sompel @hvdsomp
Herbert Van de Sompel
Los Alamos National Laboratory
@hvdsomp
Perseverance on Persistence
a future-note about the past