3. Envisioning a New Era of Research Reporting
Imagine…
• Live research reports that had multiple end-user ‘views’ and which could
dynamically tailor their presentation to each user
• An authoring environment that absorbs and encapsulates research workflows
and outputs from the lab experiments
• A report that can be dropped into an electronic lab workbench in order to
reconstitute an entire experiment
• A researcher working with multiple reports on a Surface and having the
ability to mash up data and workflows across experiments
• The ability to apply new analyses and visualizations and to perform new
in silico experiments
(Slide graphic labels: Reproducible Research, Interactive Data, Collaboration,
Dynamic Documents, Reputation & Influence)
4. Words & Pictures
• Papers/reports today describe chemical reactions/entities in a variety
of ways:
– common (or brand-name) labels
– identifiers and shorthand notations
– chemical formulae
– two- (and three-) dimensional graphical images of molecular structure.
• Describing chemical data becomes an exercise in typesetting and/or
graphics, and cross- and re-referencing existing chemical entities is
labor intensive.
– The resulting text is usually interpretable by humans but chemical data are
lost in the process, making it difficult to programmatically extract
meaningful information from such reports.
• The goals of Chem4Word are to:
– simplify the task of authoring a chemical document,
– do so in a way that produces a semantically meaningful document, facilitating
downstream tasks such as publishers' workflows, entity extraction, and semantic
applications.
5. Chemistry Add-in for Word
aka Chem4Word
• Chem4Word allows chemists to create, edit and manipulate
chemistry in the Word environment, by
– Providing a built-in dictionary of chemical structures
– Enabling online lookup of further structures via web services (e.g. PubChem)
– Facilitating linking/embedding chemical structures inside a Word document
– Supporting modification of chemical structures & representations of those structures
• Authoring is backed by semantic data in
Chemical Markup Language (CML), enabling:
– novel functionality in data checking during the authoring process
– chemistry-centric article reading support
– data-mining applications.
• Open source project (Outercurve Foundation); Apache 2.0 license
• ~500K downloads to date
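As a rough illustration of why CML-backed authoring makes data checking possible, the sketch below parses a small hand-written CML fragment with Python's standard library, counts atoms by element, and flags bonds that refer to undefined atoms. The water molecule, the function names, and the particular checks are assumptions made for illustration; this is not Chem4Word's own code.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# CML namespace; the "cml" prefix is only a local alias for the XPath queries.
CML_NS = {"cml": "http://www.xml-cml.org/schema"}

# Hand-written sample molecule (water), for illustration only.
WATER_CML = """
<molecule xmlns="http://www.xml-cml.org/schema" id="m1">
  <atomArray>
    <atom id="a1" elementType="O"/>
    <atom id="a2" elementType="H"/>
    <atom id="a3" elementType="H"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a1 a3" order="1"/>
  </bondArray>
</molecule>
"""


def element_counts(cml_text: str) -> Counter:
    """Count atoms per element symbol in a CML molecule."""
    root = ET.fromstring(cml_text.strip())
    atoms = root.findall("cml:atomArray/cml:atom", CML_NS)
    return Counter(atom.get("elementType") for atom in atoms)


def dangling_bond_refs(cml_text: str) -> list:
    """Return atom ids that bonds refer to but that no atom defines."""
    root = ET.fromstring(cml_text.strip())
    atom_ids = {a.get("id") for a in root.findall("cml:atomArray/cml:atom", CML_NS)}
    missing = []
    for bond in root.findall("cml:bondArray/cml:bond", CML_NS):
        for ref in bond.get("atomRefs2", "").split():
            if ref not in atom_ids:
                missing.append(ref)
    return missing


if __name__ == "__main__":
    print(dict(element_counts(WATER_CML)))
    print(dangling_bond_refs(WATER_CML))
```

Because the structure is stored as data rather than as a picture, checks of this kind can run while the author is still writing, and the same data remains available later for extraction and mining.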
17. Programmer View of Open XML Files
• ZIP Archive
• Document Parts
– XML Parts
– Binary Parts
– Typed (each part has a content type, using RFC 2616 media-type syntax)
• Relationships
– Connections between parts
• Content Type Stream
– A specially-named stream
– Defines mappings from part names to content types
– Not itself a part, not URI addressable
• Folder structure for convenience only
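The package layout above can be explored with nothing more than a ZIP library and an XML parser. The sketch below builds a tiny package in memory (illustrative only, not a valid Word document), reads the content type stream `[Content_Types].xml`, and resolves each part name to its content type. The helper names are assumptions made for this example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"
CT_STREAM = "[Content_Types].xml"  # the specially-named content type stream


def build_sample_package() -> bytes:
    """Build a minimal, illustrative package in memory (not a valid .docx)."""
    content_types = (
        f'<Types xmlns="{CT_NS}">'
        '<Default Extension="rels" '
        'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
        '<Default Extension="png" ContentType="image/png"/>'
        '<Override PartName="/word/document.xml" ContentType="application/'
        'vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
        '</Types>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr(CT_STREAM, content_types)
        zf.writestr("_rels/.rels", "<Relationships/>")      # placeholder content
        zf.writestr("word/document.xml", "<document/>")     # placeholder XML part
        zf.writestr("word/media/image1.png", b"\x89PNG\r\n\x1a\n")  # binary part
    return buffer.getvalue()


def describe_parts(package: bytes) -> dict:
    """Map each part name to its content type via the content type stream."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        root = ET.fromstring(zf.read(CT_STREAM))
        defaults = {
            e.get("Extension").lower(): e.get("ContentType")
            for e in root.findall(f"{{{CT_NS}}}Default")
        }
        overrides = {
            e.get("PartName"): e.get("ContentType")
            for e in root.findall(f"{{{CT_NS}}}Override")
        }
        parts = {}
        for name in zf.namelist():
            if name == CT_STREAM:
                continue  # the stream is not itself a part
            part_name = "/" + name  # part names are rooted; ZIP entry names are not
            extension = name.rsplit(".", 1)[-1].lower() if "." in name else ""
            # An Override for the exact part name wins over an extension Default.
            parts[part_name] = overrides.get(part_name) or defaults.get(extension)
        return parts


if __name__ == "__main__":
    for part, content_type in describe_parts(build_sample_package()).items():
        print(part, "->", content_type)
```

The lookup depends only on the content type stream, never on folder names, which is the sense in which the folder structure is for convenience only.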
48. To conclude..
Current publishing … is broken for data-rich science
• Data publication difficult and unsupported
• Insufficient data to fully support research
With Chem4Word … the cycle is closed
• Data preparation integrated into user workflow
• Open Standards promote Open Semantic Science
49. Important Details
• Project Site
– http://research.microsoft.com/chem4word
• Binaries and source code
– http://chem4word.codeplex.com
• Facebook Page
– http://www.facebook.com/groups/186300551397797/
• Outercurve Foundation
– http://www.outercurve.org
50. Contributors
University of Cambridge
• Peter Murray-Rust
• Jim Downing
• Joe Townsend
Microsoft Research
• Alex D. Wade
• Savas Parastatidis
• Oscar Naim
• Pablo Fernicola
• Murray Sargent
• Geraldine Wade
• Tola Chhoeun
• Anthony Hanses
• Jim McGill
Editor's Notes
We’ll start by taking a look at two documents. The one on the left is a binary document, representative of the kind of binary document that is ubiquitous today. A search on the internet reveals that .DOC is the most widely deployed document format on the web, not counting what exists behind corporate firewalls. There are literally billions of documents stored in binary format. On the right we have the same document after it has been migrated to Ecma Open XML format.
This is what the binary document looks like as rendered by Microsoft Word. It is a biography of William Shakespeare. The contents of this biography come from Wikipedia.
Here we see Word 2007 rendering the Office Open XML version of the same document. As you can see, it looks the same as the one you saw before.
However, what is inside the file is completely different. This is what the binary version of the document looks like. This type of file requires specialized, one-of-a-kind software to read it. It is also complex, and it is quite likely that any programmer trying to read and write these files will make a mistake.
Now let’s take a look inside the Office Open XML version of the document. XML is a universal data interchange format that has proven itself in the enterprise and on the web. As you can see, the contents of the document are in human-readable XML format. It’s still XML, and you have to play by the rules of XML; however, it will work with any of the XML tools that exist on the widest range of platforms.
As we shall see, an Open XML document is an industry-standard ZIP file that stores XML parts.
Open XML keeps media in its native format, such as JPEG, PNG, GIF, etc.
This is Word; note the added “Chemistry” tab at the top
This is the Chemistry “ribbon”
There are multiple ways to insert chemistry into a document:
1. The built-in Chemistry “gallery”, a handy place to store the structures you use often
2. Insert from a local (CML) file
3. Insert from a web service such as OPSIN or PubChem
Perform a string search against OPSIN
And the CML file is sent from OPSIN, and inserted into the document
To change the way that structure looks in the document, you can double-click it to launch the 2D editor, remove the labels on the atoms, flip, rotate, etc.
And that will modify the CML file and the image in the document
You can also change to other “views” of the molecule
The Chemistry Navigator pane shows you all of the chemical objects in the current file (and lets you jump to that section of the document)
The Chemistry Navigator also allows you to create a Linked Chemistry Zone (i.e. copy by reference, creating another chemical reference in the document that is backed by the same CML structure), or an Unlinked Chemistry Zone (i.e. copy by value, making a new copy of the underlying CML that can be modified independently of the first one)
Linked Chemistry Zones allow you to have as many references as you want in the document that are all backed by a single CML molecule, which is stored in the DOCX (ZIP) file
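The difference between linked and unlinked zones is the familiar one between copying by reference and copying by value. The toy model below stores each molecule once and lets zones point at it by id. The class and method names are hypothetical and the molecule is a plain dictionary standing in for CML, so this is a sketch of the idea rather than Chem4Word's actual data model.

```python
import copy
import itertools


class ChemistryDocument:
    """Toy model: molecules are stored once; zones refer to them by id."""

    def __init__(self):
        self.molecules = {}  # molecule id -> CML stand-in (a dict here)
        self.zones = []      # each zone holds the id of the molecule behind it
        self._ids = itertools.count(1)

    def _store(self, molecule: dict) -> str:
        mol_id = f"m{next(self._ids)}"
        self.molecules[mol_id] = molecule
        return mol_id

    def insert(self, molecule: dict) -> int:
        """Insert a new molecule and return the index of its zone."""
        self.zones.append(self._store(molecule))
        return len(self.zones) - 1

    def add_linked_zone(self, zone: int) -> int:
        """Copy by reference: the new zone shares the same stored molecule."""
        self.zones.append(self.zones[zone])
        return len(self.zones) - 1

    def add_unlinked_zone(self, zone: int) -> int:
        """Copy by value: the new zone gets its own independent copy."""
        duplicate = copy.deepcopy(self.molecules[self.zones[zone]])
        self.zones.append(self._store(duplicate))
        return len(self.zones) - 1

    def molecule_for(self, zone: int) -> dict:
        return self.molecules[self.zones[zone]]


if __name__ == "__main__":
    doc = ChemistryDocument()
    original = doc.insert({"name": "benzene", "show_atom_labels": True})
    linked = doc.add_linked_zone(original)
    unlinked = doc.add_unlinked_zone(original)

    # Editing the original changes the linked zone but not the unlinked one.
    doc.molecule_for(original)["show_atom_labels"] = False
    print(doc.molecule_for(linked)["show_atom_labels"])
    print(doc.molecule_for(unlinked)["show_atom_labels"])
```

In this model any number of linked zones add only another id to the list of zones, while each unlinked zone adds a second stored molecule.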