This talk provides an overview of my work towards enabling Data DJs. That is, enabling users to create, remix, record, and share their data analyses as easily as DJs make and share mixes. The talk touches on a variety of topics including linked data, scientific workflows, provenance, enterprise mashups, and Facebook. It draws these topics into a unified research framework and discusses future research directions.
1. Paul Groth | Vrije Universiteit Amsterdam | pgroth@few.vu.nl Image: http://www.flickr.com/photos/tomk32/2988993409 / All images are under a Creative Commons license
22. Title: BLASTP with simplified results returned Description: This workflow performs a blastp search on a protein sequence, extracts the sequence ids within the blast report, and retrieves the corresponding sequences.
50. The Community http://www.flickr.com/photos/dunechaser/142079357/sizes/o/
Speaker notes
Title: I want to be a Data DJ!
Because I want an audience… not really…
Records
Simple components (effects, fades) chained together: a workflow
Whole albums of DJ creativity (throw on a new record, backtrack): fast to novelty
You can continually improve because it’s easy to revisit and remix
The ability to remix enables combinatorial innovation
Intuitively…
1800: Interchangeable parts
1900: Gasoline engine
1960: Integrated circuits
1995-now: Internet
“Web services” is lowercase because this is not about SOA… Flickr, Google Maps, Twitter,
Not easy enough for the user… or for developers
Records = data and data discovery
Turntables = components and composition
Recording = capturing what’s gone on
Data
Common APIs: SPARQL and RDF
Things like Factual and YQL
Machine-readable data on the web
Common APIs: SPARQL and APIs
I see that there is a technique called “drive across country” and I go ahead and import it.
Also, if we extract information, it is exposed as its own RDF triple (see the references field).
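The triple model behind this can be sketched in a few lines of Python. The names and the pattern-matching helper below are hypothetical, purely to illustrate how extracted facts become queryable (subject, predicate, object) triples in the spirit of a SPARQL basic graph pattern:

```python
# Minimal sketch of the RDF triple model: extracted facts become
# (subject, predicate, object) triples. All names are hypothetical.
triples = {
    ("ex:TripPlan", "ex:usesTechnique", "ex:DriveAcrossCountry"),
    ("ex:TripPlan", "dc:references", "ex:RoadAtlas"),
    ("ex:DriveAcrossCountry", "rdf:type", "ex:Technique"),
}

def match(pattern, triples):
    """Return triples matching an (s, p, o) pattern; None is a
    wildcard, as in a SPARQL basic graph pattern."""
    s, p, o = pattern
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

# "Which techniques does the trip plan use?"
print(match(("ex:TripPlan", "ex:usesTechnique", None), triples))
```

A real deployment would of course use an RDF store with a SPARQL endpoint rather than an in-memory set, but the data model is the same.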
RDF Query Answering using Evolutionary Algorithms
Fault-tolerance
Data movement
Provenance tracking
Validation
Component discovery
Reproduction
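The core idea of chaining simple components while tracking provenance can be sketched as follows; this is an illustration under assumed names, not any real workflow system, and the components are stand-ins for the effects a DJ might chain:

```python
import hashlib
import json

def run_workflow(data, components):
    """Run components in sequence, recording a provenance entry
    (component name, input/output hashes) for each step."""
    provenance = []
    for comp in components:
        result = comp(data)
        provenance.append({
            "component": comp.__name__,
            "input_hash": hashlib.sha256(repr(data).encode()).hexdigest()[:8],
            "output_hash": hashlib.sha256(repr(result).encode()).hexdigest()[:8],
        })
        data = result
    return data, provenance

# Hypothetical components, chained like effects on a mix
def uppercase(x):
    return x.upper()

def reverse(x):
    return x[::-1]

result, prov = run_workflow("data dj", [uppercase, reverse])
print(result)                       # JD ATAD
print(json.dumps(prov, indent=2))   # one provenance record per step
```

The hashes make each step's inputs and outputs checkable later, which is the seed of the reproduction and validation concerns listed above.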
A proliferation of box-and-arrow diagrams
Natural instruction…
How do people “naturally” describe workflows? A study with myExperiment workflows
- Workflow for estimating the maximum accuracy of a model for a set of test data
Linked data + mashup (workflow) = a cool new application, but then what? The need for provenance
The iPod has 451 parts provided by 10 suppliers… but Apple trusts all of them. http://pcic.merage.uci.edu/papers/2007/AppleiPod.pdf http://people.ischool.berkeley.edu/~hal/people/hal/NYTimes/2007-06-28.html The problem is not mixing and matching components; the problem is the need for provenance.
Get applications to record process documentation! Log data! But the key here is to structure that data…
Guarantees that documentation will be captured… Attributable, finalizable, process-reflecting. You can also just use log4j.
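A rough sketch of what structured process documentation looks like, as opposed to free-text log lines: each record is attributable (who did it) and process-reflecting (what step, using and generating what). The field names and values below are hypothetical; real provenance models are considerably richer:

```python
import json
import time

def record_step(log, actor, activity, inputs, outputs):
    """Append one structured documentation record: who performed
    which activity, on which inputs, producing which outputs."""
    log.append({
        "actor": actor,        # attributable
        "activity": activity,  # process-reflecting
        "used": inputs,
        "generated": outputs,
        "time": time.time(),
    })

documentation = []
record_step(documentation, "pgroth", "blastp_search",
            inputs=["protein.fasta"], outputs=["blast_report.xml"])
record_step(documentation, "pgroth", "extract_ids",
            inputs=["blast_report.xml"], outputs=["ids.txt"])

print(json.dumps(documentation, indent=2))
```

Because step two's input is step one's output, the records chain together into a queryable account of the process, which plain unstructured logging cannot guarantee.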
Say it’s an
Condor DAG… number of jobs
How many people have cell phones? How many people understand their cell phone contract?
I trust the contract because people I know have told me the
Mechanism design, trust because of enforcement
Trust based on the artifact itself
Availability of support for example
Trust based on experience… what you’ve seen before
Note that this is not to say these can’t work together