Automated Metadata Creation:
    Possibilities and Pitfalls


     Presented by Wilhelmina Randtke
              June 10, 2012
           Nashville, Tennessee
At the annual meeting of the North American
           Serials Interest Group.


          Materials posted at
www.randtke.com/presentations/NASIG.html
Teaser: Preview of the sample project.




http://www.fsulawrc.com
Background: What is “metadata”?
Metadata = any indexing information
Examples:
  MARC records
  color, size, etc. to allow clothes shopping on a
    website
  writing on the spine of a book
  food labels
What we'll cover
   Automated indexing:
       Human vs machine indexing
       Range of tools for automated metadata creation:
        Techy and less techy.
       Sample projects
   A little background on relational databases
       Database design for a looseleaf (a resource that
        changes state over time).
   Sample project: The Florida Administrative
    Code 1970-1983
Automated Indexing: What’s easy for computers?
Computers like black and white decisions.
Computers are bad with discretion.
Word search vs. Subject headings
One Trillion


1,000,000,000,000

   web pages indexed by Google
        … 4 years ago …
Nevertheless…

… Human indexing is alive and well
How to fund indexing?
http://www.ebay.com/sch/Dresses-/63861/i.html?_nkw=summer+dress
Who made the metadata: Human or Machine?




How GoogleBooks gets its metadata: http://go-to-hellman.blogspot.com/2010/01/google-exposes-book-metadata-privates.html
Not automated indexing, but a related concept….


Always try to think about how to reuse existing metadata.
High Tech automated metadata creation
The high end: Assigning subject headings with computer code
Some technologies:
• UIMA (Unstructured Information Management
  Architecture)
• GATE (General Architecture for Text
  Engineering)
• KEA (Keyphrase Extraction Algorithm)
Person’s role:
   Select an appropriate ontology.
   Configure the program so that it’s looking at outside sources.
   Review the results and make sure the assigned subject headings are good.

Program’s role:
   Take the ontology or thesaurus and apply it to each item to give subject headings.

(Slide diagram: an item plus an ontology or thesaurus feed into the computer program for automated indexing, which outputs subject headings.)
http://www.nzdl.org/Kea/examples1.html
The lower end: Deterministic fields
There’s an app for that
Scripts for extracting fields from a thesis, posted on GitHub: https://github.com/ao5357/thesisbot
Batch OCR
Many tools exist to extract text from PDFs to Excel
Walkthrough – examining the extracted spreadsheets


http://fsulawrc.com/excelVBAfiles/index.html
How to plan the program
• Look for patterns
• Write step-by-step instructions about how to
  process the Excel file
  • Remember: NO DISCRETION. Computers do not take well to discretion.
  • Good steps (a minimal sketch of these appears after this list):
     • Go to the last line of the worksheet
     • Look for the letter a or A
     • Copy starting from the first number in the cell, up to and including the last number in the cell.
  • Bad steps:
     • Find the author’s name (this step needs to be broken into small “stupid” steps)
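
A minimal Excel VBA sketch of the three “good steps” above. The worksheet layout here is an assumption for illustration (extracted text in column A, result written to column B of the same row); the actual script linked later in this talk is more involved.

Sub ExtractNumbersFromLastRow()
    Dim lastRow As Long, i As Long
    Dim firstNum As Long, lastNum As Long
    Dim cellText As String

    ' Step 1: go to the last line of the worksheet (column A assumed).
    lastRow = Cells(Rows.Count, 1).End(xlUp).Row
    cellText = Cells(lastRow, 1).Value

    ' Step 2: look for the letter a or A (case-insensitive search).
    If InStr(1, cellText, "a", vbTextCompare) = 0 Then Exit Sub

    ' Step 3: copy starting from the first number in the cell,
    ' up to and including the last number in the cell.
    For i = 1 To Len(cellText)
        If Mid(cellText, i, 1) Like "#" Then
            If firstNum = 0 Then firstNum = i
            lastNum = i
        End If
    Next i
    If firstNum > 0 Then
        Cells(lastRow, 2).Value = Mid(cellText, firstNum, lastNum - firstNum + 1)
    End If
End Sub

Every step is “stupid” in exactly the sense above: find a row, test for a letter, scan for digits. Nothing requires discretion.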
Writing the program
• Identify appropriate advisors.
  • Remember, most IT staff on a campus just install computers in offices, etc. Programming and database planning are rare skills. The worst IT personnel will not realize that they do not have these skills.
  • If an IT staff member tells you they do not know how to do something, that honesty shows self-awareness: go back to that person for advice on all future projects.
• Try to find entry level material on coding.
  • (Sadly, most computer programming instructions already
    assume you know some programming.)
• If outsourcing or collaborating, remember that the index is the ultimate goal. Someone on the project needs to understand the index, and that someone will probably have to be you.
Finding Advisors: Most campus IT
  is about carrying heavy objects
Perfection?
How close to perfection can you get?
Let’s run some code:
  A spreadsheet with extracted text: http://fsulawrc.com/excelVBAfiles/23batch6A.xls
  Visual Basic script: http://fsulawrc.com/excelVBAfiles/VBAscriptForFAC.docx
  The files: You can retrieve some of these same files by searching 6A-1 in the main search for the database at www.fsulawrc.com
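
The linked .docx holds the actual script. As a rough illustration of the kind of deterministic rule it applies, a Florida Administrative Code citation such as “6A-1” splits at the dash into the two chapter fields. This is a simplified sketch only; the real script handles many more cases.

' Simplified sketch: split a citation like "6A-1" at the dash.
Function ChaptBeforeDash(citation As String) As String
    Dim dashPos As Long
    dashPos = InStr(citation, "-")
    If dashPos > 0 Then ChaptBeforeDash = Left(citation, dashPos - 1)
End Function

Function ChaptAfterDash(citation As String) As String
    Dim dashPos As Long
    dashPos = InStr(citation, "-")
    If dashPos > 0 Then ChaptAfterDash = Mid(citation, dashPos + 1)
End Function

' ChaptBeforeDash("6A-1") returns "6A"; ChaptAfterDash("6A-1") returns "1".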
How much metadata was missing?
(27,992 fields total, after preliminary removal of blank pages)

Field | Number of empty fields | Percent of field filled
Chapter no. before dash | 183 | 99.3%
Chapter no. after dash | 2,179 | 92.2%
Page no. | 1,766 | 93.6%
Supplement no. (i.e., date the page went into the looseleaf) | 3,242 | 88.4%
Replacing supplement (i.e., date the page was removed from the looseleaf) | All (however, 105 fields were entered manually in order to demonstrate the interface and get funding for manual metadata creation) | 0%
Cheap and fast
                  and incomplete
This is a search engine built on an index of the automated metadata only:
http://fsulawrc.com/automatedindex.php


It’s better than a shuffled pile of 30,000 pages.
It’s not very good.
If you are thousands of miles away, then this is
   better than print. If you are in the same room as
   organized print, print might be better.
Filling in the gaps
Code helps speed the workflow, but it is still time consuming.




http://fsulawrc.com/phptest/chaptbeforedashfill.php
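
The page linked above is a PHP form for human review. On the Excel side, one gap-filling pass can be sketched as a fill-down: where a chapter number is blank, guess the value from the page above and leave it for a person to confirm. Column B as the chapter-number column is an assumption for illustration.

Sub FillDownChapterNumbers()
    Dim lastRow As Long, r As Long
    lastRow = Cells(Rows.Count, 2).End(xlUp).Row
    For r = 2 To lastRow
        ' A blank chapter number probably matches the page above it,
        ' but a person still needs to review these guesses.
        If Trim(Cells(r, 2).Value) = "" Then
            Cells(r, 2).Value = Cells(r - 1, 2).Value
        End If
    Next r
End Sub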
Last step: Auditing for missing pages, by comparing instruction sheets that went out with supplements




www.fsulawrc.com/supplementinstructionsheets.pdf
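
Mechanically, that audit is a set comparison. A rough VBA sketch, assuming page identifiers from the instruction sheets sit in column A of a sheet named "Expected" and identifiers captured in the database sit in column A of a sheet named "Captured" (the sheet names and columns are assumptions for illustration):

Sub FlagMissingPages()
    Dim expected As Worksheet, captured As Worksheet
    Dim lastRow As Long, r As Long
    Set expected = Worksheets("Expected")   ' pages the instruction sheets say should exist
    Set captured = Worksheets("Captured")   ' pages actually found in the database

    lastRow = expected.Cells(expected.Rows.Count, 1).End(xlUp).Row
    For r = 1 To lastRow
        ' Flag any expected page with no match among the captured pages.
        If IsError(Application.Match(expected.Cells(r, 1).Value, _
                                     captured.Columns(1), 0)) Then
            expected.Cells(r, 2).Value = "MISSING"
        End If
    Next r
End Sub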
Task | Hours spent | Category of work
Inspecting looseleaf and planning a database | 20 (high skill, high training) | Database work
Digitization with sheetfed scanner | 35 (low skill, low training) | Digitization
Planning the code for automated indexing | 20 (high skill, high training) | Database work
Coding for the automated indexing | 35 (would be faster for someone with a programming background) | Automated metadata
Running script and cleaning up metadata | 35 (skilled staff) | Automated metadata
Loading database and metadata on a server | 10 (would be about twice as fast for someone with more database design experience) | Database work
Coding online forms to speed data entry | 15 (skilled staff) | Manual metadata
Training on documents and database design | 15 (unskilled staff, but done before the student assistant got set up with computer forms and permissions) | Manual metadata
Metadata entry for fields the computer didn’t get | 98.25 (unskilled staff) | Manual metadata
Auditing the database against instruction sheets which went out with supplements | 342.75 (skilled staff; includes training time for student assistant) | Auditing
Where did the time go? Tasks and Hours

[Pie chart of hours by category: Database Work, Digitization, Auditing, Manual Metadata Creation, Automated Metadata Creation]
Error rates
Automated metadata for Supplement Number: 2.4%
Human metadata for Supplement Number: 0.8%

Automated metadata for Page Number
  with the systematic error: 1.0%
  with the systematic error removed: 0.3%
Human metadata for Page Number: 3.1%

Error rates for the thesis indexer on GitHub: 5% - 6%
Do error rates matter?
For the computer’s error rates, we might really be measuring OCR quality.
Most metadata will be words, not numbers.
• Words are easier for a computer to pull out.
  Misspellings are obvious when reviewing output.
• Words are easier for a person to pull out. Less
  fatigue.
Recommendations
• For practitioners:
  • Consider automating a process. Is it possible to
    index this without human involvement?
  • Understand what IT support is available. Support
    can be someone who picks the appropriate tool,
    then you apply it.
• For administrators:
  • Allow work time for this type of experimentation.
Good resources to get started
• A-PDF to Excel Extractor
  • A program that takes text from PDFs and puts it in Excel.
  • www.a-pdf.com/to-excel/download.htm
  • This is an easy start to get source material into a format
    you can work with.
• Excel Visual Basic (VBA) Tutorials by Pan Pantziarka
  • Almost all training material on coding assumes you already
    know how to code. These tutorials are good, because they
    assume you do not already know something.
  • www.techbookreport.com/tutorials/excel_vba1.html
  • For more advanced instructions, use a search engine to
    read message boards.
Good resources to get started
• eHow instructions for turning on the Developer Ribbon in Excel 2007
  • http://www.ehow.com/how_7175501_turn-developer-tab-excel-2007.html
     (use these same instructions for Excel 2010; older versions of Excel have the developer ribbon turned on by default)
  • How to get to the tab where you can do simple coding.
• How to Build a Search Engine
  • http://www.udacity.com/overview/Course/cs101/CourseRev/apr2012
  • Takes you through how webcrawlers work, using the
    programming language Python. (A website is a string of
    text only, nothing more, so these concepts are similar to
    metadata extraction.)
Good resources to get started
• Wikipedia section on string processing algorithms.
  • http://en.wikipedia.org/wiki/String_%28computer_science%29#String_processing_algorithms
  • These six links go to lists of all the things you can do to strings. (Remember, a string is a string of letters – it’s what you will be working with.)
  • Use the terminology from here to know what term of art to put into a search engine so that you can find instructions on how to do that in whatever code you choose. (A few of these operations appear in the VBA sketch after this list.)
• Wikipedia page on relational databases
  • http://en.wikipedia.org/wiki/Relational_database
  • It will be useful for you to understand primary keys,
    foreign keys, and tables referencing each other.
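
As promised above, a few of those string terms of art in Excel VBA, applied to a made-up sample value (the positions in the comments are only correct for this particular string):

Sub StringTermsOfArt()
    Dim s As String
    s = "  Chapter 6A-1, Page 12  "
    Debug.Print Trim(s)                    ' trimming: "Chapter 6A-1, Page 12"
    Debug.Print InStr(s, "6A")             ' searching: position of a substring (11)
    Debug.Print Mid(s, 11, 4)              ' substring: "6A-1"
    Debug.Print Replace(s, "Page", "p.")   ' substitution
    Debug.Print Split(Trim(s), ",")(0)     ' splitting on a delimiter: "Chapter 6A-1"
End Sub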
Automated Metadata Creation:
    Possibilities and Pitfalls


     Presented by Wilhelmina Randtke
              June 10, 2012
           Nashville, Tennessee
At the annual meeting of the North American
           Serials Interest Group.


          Materials posted at
www.randtke.com/presentations/NASIG.html
Special thanks to:
  Jason Cronk
  Anna Annino

  • 1. Automated Metadata Creation: Possibilities and Pitfalls Presented by Wilhelmina Randtke June 10, 2012 Nashville, Tennessee At the annual meeting of the North American Serials Interest Group. Materials posted at www.randtke.com/presentations/NASIG.html
  • 2. Teaser: Preview of the sample project. http://www.fsulawrc.com
  • 3. Background: What is “metadata”? Metadata = any indexing information Examples: MARC records color, size, etc. to allow clothes shopping on a website writing on the spine of a book food labels
  • 4. What we'll cover  Automated indexing:  Human vs machine indexing  Range of tools for automated metadata creation: Techy and less techy.  Sample projects  A little background on relational databases  Database design for a looseleaf (a resource that changes state over time).  Sample project: The Florida Administrative Code 1970-1983
  • 5. Automated Indexing: What’s easy for computers? Computers like black and white decisions. Computers are bad with discretion.
  • 6. Word search vs. Subject headings
  • 7. One Trillion 1,000,000,000,000 webpages indexed in Google … 4 years ago …
  • 9. How to fund indexing?
  • 11. How to fund indexing?
  • 12. How to fund indexing?
  • 13. Who made the metadata: Human or Machine? How GoogleBooks gets its metadata: http://go-to-hellman.blogspot.com/2010/01/google-exposes-book-metadata-privates.html
  • 14. Not automated indexing, but a related concept…. Always try to think about how to reuse existing metadata.
  • 15. High Tech automated metadata creation
  • 16. The high end: Assigning subject headings with computer code Some technologies: • UIMA (Unstructured Information Management Architecture) • GATE (General Architecture for Text Engineering) • KEA (Keyphrase Extraction Algorithm)
  • 17. Person’s role: Computer Select an appropriate Program for ontology. Automated Configure the Indexing program so that Ontology it’s looking at Thesaurus outside sources. Review the results and make sure the assigned subject headings are Item good. Program’s role: Take ontology or thesaurus and apply it to each Subject Headings item to give subject headings.
  • 19. The lower end: Deterministic fields
  • 20.
  • 21.
  • 22. There’s an app for that Scripts for extracting fields from a thesis posted on GitHub: https://github.com/ao5357/thesisbot
  • 23.
  • 24.
  • 25.
  • 26.
  • 27.
  • 28.
  • 29.
  • 30.
  • 32.
  • 33.
  • 34.
  • 35.
  • 36.
  • 37.
  • 38.
  • 39.
  • 40. Many tools exist to extract text from PDFS to Excel
  • 41.
  • 42.
  • 43. Walkthrough – examining the extracted spreadsheets http://fsulawrc.com/excelVBAfiles/index.html
  • 44. How to plan the program • Look for patterns • Write step-by-step instructions about how to process the Excel file • Remember, NO DISCRETION, computers do not take well to discretion. • Good steps: • Go to the last line of the worksheet • Look for the letter a or A • Copy starting from the first number in the cell, up to and including the last number in the cell. • Bad steps: • Find the author’s name (this step needs to be broken into small “stupid” steps)
  • 45. Writing the program • Identify appropriate advisors. • Remember, most IT staff on a campus just install computers in offices, etc. Programming and database planning are rare skills. The worst IT personnel will not realize that they do not have these skills. • If an IT staff tells you they do not know how to do something, then go back to that person for advice on all future projects. • Try to find entry level material on coding. • (Sadly, most computer programming instructions already assume you know some programming.) • If outsourcing or collaborating, remember, the index is the ultimate goal. Understanding of the index needs to be in the picture. You probably have to bring it in.
  • 46. Finding Advisors: Most campus IT is about carrying heavy objects
  • 47. Finding Advisors: Most campus IT is about carrying heavy objects
  • 48. Perfection? How close to perfection can you get? Let’s run some code: A spreadsheet with extracted text: http://fsulawrc.com/excelVBAfiles/23batch6A.xls Visual Basic script: http://fsulawrc.com/excelVBAfiles/VBAscriptFor FAC.docx The files: You can retrieve some of these same files by searching 6A-1 in the main search for the database at www.fsulawrc.com
  • 49. How much metadata was missing? Field Number of empty fields Percent of Field filled (27,992 fields total, after preliminary removal of blank pages) Chapt. No before dash 183 99.3% Chapt no after dash 2179 92.2% Page no. 1766 93.6% Supp no (ie. Date page 3242 88.4% went into the looseleaf) Replacing supplement (ie. All 0% Data page was removed (however, 105 fields were from the looseleaf) entered manually in order to demonstrate the interface and get funding for manual metadata creation)
  • 50. Cheap and fast and incomplete This is a search engine build on an index for the automated metadata only: http://fsulawrc.com/automatedindex.php It’s better than a shuffled pile of 30,000 pages. It’s not very good. If you are thousands of miles away, then this is better than print. If you are in the same room as organized print, print might be better.
  • 51. Filling in the gaps Code helps speed workflow, but still time consuming. http://fsulawrc.com/phptest/chaptbeforedashfill.php
  • 52. Last step: Auditing for missing pages, by comparing instruction sheets that went out with supplements www.fsulawrc.com/supplementinstructionsheets.pdf
  • 53. Task Hours spent Category of work Inspecting looseleaf and 20 (high skill, high training) Database work planning a database Digitization with sheetfed 35 (low skill, low training) Digitization scanner Planning the code for 20 hours (high skill, high training) Database work automated indexing Coding for the automated 35 hours (would be faster for someone Automated metadata indexing with a programming background) Running script, and cleaning 35 hours (skilled staff) Automated metadata up metadata Loading database and 10 hours (would be about twice as fast for Database work metadata on a server someone with more database design experience) Coding online forms to speed 15 hours (skilled staff) Manual metadata data entry Training on documents and 15 hours (unskilled staff, but done before Manual metadata database design the student assistant got setup with computer forms and permissions) Metadata entry for fields the 98.25 hours (unskilled staff) Manual metadata computer didn’t get Auditing the database against 342.75 hours (skilled staff; includes Auditing instruction sheets which went training time for student assistant)
  • 54. Where did the time go? Tasks and Hours Database Work Digitization Auditing Manual Metadata Creation Automated Metadata Creation
  • 55. Error rates Automated metadata for Supplement Number: 2.4% Human metadata for Supplement Number: 0.8% Automated metadata for Page Number with systematic error: 1.0% with the systematic error removed: 0.3% Human metadata for Page Number: 3.1% Error rates for the thesis indexer on GitHub: 5% - 6%
  • 56. Do error rates matter? For computer rates, might be measuring OCR. Most metadata will be words, not numbers. • Words are easier for a computer to pull out. Misspellings are obvious when reviewing output. • Words are easier for a person to pull out. Less fatigue.
  • 57. Recommendations • For practitioners: • Consider automating a process. Is it possible to index this without human involvement? • Understand what IT support is available. Support can be someone who picks the appropriate tool, then you apply it. • For administrators • Allow work time for this type of experimentation.
• 58. Good resources to get started • A-PDF to Excel Extractor • A program that takes text from PDFs and puts it in Excel. • www.a-pdf.com/to-excel/download.htm • This is an easy start to get source material into a format you can work with. • Excel Visual Basic (VBA) Tutorials by Pan Pantziarka • Almost all training material on coding assumes you already know how to code. These tutorials are good because they do not assume prior knowledge. • www.techbookreport.com/tutorials/excel_vba1.html • For more advanced instructions, use a search engine to read message boards.
• 59. Good resources to get started • eHow instructions for turning on the Developer Ribbon in Excel 2007 • http://www.ehow.com/how_7175501_turn-developer-tab-excel-2007.html (use these same instructions for Excel 2010; older versions of Excel have the developer ribbon turned on by default) • How to get to the tab where you can do simple coding. • How to Build a Search Engine • http://www.udacity.com/overview/Course/cs101/CourseRev/apr2012 • Takes you through how webcrawlers work, using the programming language Python. (A website is a string of text only, nothing more, so these concepts are similar to metadata extraction.)
• 60. Good resources to get started • Wikipedia section on string processing algorithms. • http://en.wikipedia.org/wiki/String_%28computer_science%29#String_processing_algorithms • These links go to lists of all the things you can do to strings. (Remember, a string is a string of letters – it’s what you will be working with.) • Use the terminology from here to know what term of art to put into a search engine so that you can find instructions on how to do that in whatever code you choose. • Wikipedia page on relational databases • http://en.wikipedia.org/wiki/Relational_database • It will be useful for you to understand primary keys, foreign keys, and tables referencing each other.
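For example – and this is just an illustrative sketch, with a made-up sample string – the most common string operations look like this in Excel VBA, with the term of art for each one noted in a comment so you know what to put into a search engine:

Sub StringBasics()
    Dim s As String
    s = "Supp. No. 47"              ' hypothetical sample string

    ' "substring search": position of one string inside another
    Debug.Print InStr(s, "No.")     ' prints 7

    ' "substring extraction": take part of a string
    Debug.Print Mid$(s, 11, 2)      ' prints 47

    ' "splitting" / "tokenizing": cut a string at a delimiter
    Debug.Print Split(s, " ")(2)    ' prints 47

    ' "string replacement"
    Debug.Print Replace(s, "Supp.", "Supplement")
End Sub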
  • 61. Automated Metadata Creation: Possibilities and Pitfalls Presented by Wilhelmina Randtke June 10, 2012 Nashville, Tennessee At the annual meeting of the North American Serials Interest Group. Materials posted at www.randtke.com/presentations/NASIG.html
  • 62. Special thanks to: Jason Cronk Anna Annino
  • 63. Automated Metadata Creation: Possibilities and Pitfalls Presented by Wilhelmina Randtke June 10, 2012 Nashville, Tennessee At the annual meeting of the North American Serials Interest Group. Materials posted at www.randtke.com/presentations/NASIG.html

Editor's Notes

1. This is a short overview of the project: I built indexing for a state resource which is similar to the Code of Federal Regulations. Who here is familiar with the Code of Federal Regulations? The Code of Federal Regulations is a resource with all government rules. Laws are passed by Congress. Rules are the agency interpretations of those laws. They are binding just like law. The Florida Administrative Code is the Florida version of this.

They are both resources that change over time. The Code of Federal Regulations is printed once, but rules are continually changing; every week there is a change. The Florida Administrative Code was printed in 3-ring binders in 1970, and then monthly supplements were put out until 1983. There were 127 supplements total. So, each month, some pages were taken out of the binders, and others were put in.

The government did not keep a master copy of the pages as they were removed from the binder. Only two universities kept these materials: the Florida State University law library, and the University of Miami. Neither kept a complete set. Because of this, the resource is both binding law and also very difficult to access. The Florida State University Law Library gets about 5 to 6 requests each year for old versions of the rules. In print, these were bound according to date removed. So, you have to start at the current version, look at the date of amendment closest after the date you want to find, and then go through all pages removed that month. You also have to do this in the law library, and you have to do this after figuring out how the pages are arranged. There are no instructions for searching the bound pages in the law library.

To build the online database, I had to do two things. First, I had to get all this indexing information for each page separately. Any given page could stay in the binder only one month, or could stay in the binder for the whole 14 years. So, two adjacent pages on one date might not be adjacent on the next. Each page needed full indexing information. And… there were over 30,000 pages to index.

Second, I had to design a database to hold this type of resource. All the digital library platforms – DSpace, EPrints, Digital Commons, CONTENTdm – hold objects which are static, so they don’t change state over time. I had to design a database which allows the resource to be pulled as it appeared on a specific date. The metadata I pulled out is not Dublin Core or any other standard schema. It is metadata to locate the page within the larger resource.

(!! Demo search of 6A-1 !!)
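A minimal sketch of that “pull the resource as it appeared on a specific date” idea – in Excel VBA for illustration rather than the relational database the project actually used; the worksheet name, column layout, and target supplement number are all hypothetical:

Sub PagesInForceAtSupplement()
    Dim ws As Worksheet
    Dim target As Long
    Dim lastRow As Long, r As Long
    Dim suppIn As Variant, suppOut As Variant

    ' Hypothetical layout: one row per page, column B = supplement
    ' number when the page went in, column C = supplement number
    ' when it came out (blank if it never came out).
    Set ws = Worksheets("pages")
    target = 47                       ' hypothetical supplement to reconstruct

    lastRow = ws.Cells(ws.Rows.Count, 1).End(xlUp).Row
    For r = 2 To lastRow
        suppIn = ws.Cells(r, 2).Value
        suppOut = ws.Cells(r, 3).Value
        ' A page belongs in the reconstructed binder if it went in on
        ' or before the target and came out after it (or never did).
        If suppIn <= target And (IsEmpty(suppOut) Or suppOut > target) Then
            ws.Cells(r, 4).Value = "in force"
        End If
    Next r
End Sub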
2. Where this comes out, and we have all seen it before, is in the different feel of keyword search versus subject headings. A full text search is something that has only been possible with computers (other than maybe the Bible, which got a concordance long, long ago). Meanwhile… subject headings often give really great results, but people tend to go for the keyword search.

In research, the problem is: how to get all the relevant documents? No one will solve that. In librarianship and indexing, the problem is: how to fund making the subject headings?

One of the current issues bouncing around a listserv I’m on is the U.S. government Office of the Law Revision Counsel in the U.S. House of Representatives thinking about canceling the creation of the General Index of the U.S. Code – that is where all federal laws are codified – so, to stop assigning subject headings to parts of federal law. The problem is, without that, keyword search is basically useless for finding the law you want, because laws tend to be wordy and use the same stock words over and over again. It’s a last step in pulling law. The problem also is, someone who doesn’t have a good background in research may not realize this, and so maybe you can’t have this subsidized – you can’t pay to organize things centrally, and instead you have to pay much more by having a slower search for all the end users.

In libraries, the problem you may have is that you do have staff for assigning subject headings, but then you can’t go as fast as the computer. So, subject headings for 2000 documents is better than keyword search only for 2000 documents. But… subject headings for 200 documents is probably really not as good as keyword search for 2000 documents. With Google, you have keyword search for more than you can grasp. In 1998, Google had indexed 26 million pages. In 2000, it passed the 1 billion mark. In 2008, it passed the one trillion mark. (Numbers from Google’s blog at http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html )

(!! Miscellaneous: MySQL has built-in full text indexing which gives a horrible, horrible search. At a slightly better level, Swish-e does indexing for a document set. http://en.wikipedia.org/wiki/SWISH-E !!)
  3. If you try to manually assign records, you will never keep up.
4. If there is existing metadata, like a MARC record, or anything else in electronic form, then try to think: how can I process this automatically? It is possible to:
– Pair a MARC record with a digital object. So look at the object, identify a possible MARC record, and match the two up. Then send it to a person for a sanity check.
– Dump the information from that MARC record into a different schema, like Dublin Core.
5. These are some “high end” technologies for assigning subject headings and doing automated indexing. High end because the technology is more complicated: you probably have to have a good IT background, or access to someone with an IT background who you can work with. (And remember, most IT staff in a library are focused on making sure the computers on the ground are working. Almost no one in a university builds the things people use that computer to go to and interact with. Almost all just make sure the computer will turn on and connect to whatever is there.)

So, these are the more techy programs for indexing documents. They all work in the same general way.

UIMA projects: http://www.ncbi.nlm.nih.gov/pubmed/22541592 Open semantic annotation of scientific publications using DOMEO: indexing medical documents using UIMA and developing an ontology (Annotation Ontology (AO)) for indexing these. The BBC used UIMA to index their web pages on TV programs. The US Department of Defense used UIMA to index internal documents in order to meet regulatory requirements on record keeping.
6. All of these work about the same way. The computer looks at the words in the document, in a big mash. Then you point that computer program at an ontology – so, a word list – which is specific to the really broad topic that the paper is about. The ontology will look at word combinations from that discipline; a thesaurus can tie in with that, so the thesaurus will take synonyms and merge them. The configuration says “Even though this word only happens a few times in the document, it’s more important to the total meaning of the document,” or “All these words mean the same thing.” The computer program applies the ontology to the item, then assigns subject headings.

Some important points: this still has a person involved. The person has to select the ontology. Knowing which one to use to index an entire collection is similar to assigning a subject heading to an item. So, in traditional cataloging, you look at the item, examine it, then assign subject headings. In this type of cataloging, you look at the collection of items, examine them, then you assign an ontology to the collection, and you need expertise in ontologies to do this.

The person still has to understand how to assign subject headings to be able to do this well. So, after running the program and assigning subject headings, it’s a good idea to look back at several items and subject headings and check that the computer assigned those correctly. If not, then you have to check what went wrong, and reconfigure which ontologies and thesaurus the computer is using.

Out of these three things the person is doing, only one is an IT role – that is configuring the program. The other two are not about IT skills; they are about expertise in subject headings.
7. This is an example comparison of human-assigned and computer-assigned subject headings from the Keyphrase Extraction Algorithm website. These look pretty good. Now, let’s go to the website, because we can see the whole example webpage: http://www.nzdl.org/Kea/examples1.html

OK, so I’m on tab 1, for the FAO’s Agrovoc thesaurus. All of these really well-assigned subject headings are a result of a good pairing between the collection and the thesaurus. In tab 2, you can see a good pairing with the Medical Subject Headings thesaurus. Once again, it’s a good pairing. Tab 3 is another good pairing with a thesaurus. Tab 4 is KEA without using an ontology or thesaurus. The results are not very good.

You cannot have the KEA program run all by its lonesome, because it will not get meaningful results. It runs with a thesaurus. Someone picks that thesaurus. And someone had to make that thesaurus (many of them are proprietary, and you have to pay to tie in with it and use the thesaurus).
8. The higher end was on assigning fields that need a value judgment. There is no judgment involved in pulling the words out of a document; computers are good at that, and automatic keyword search generation is even built into MySQL, the popular database platform. Judgment is needed to do the high-tech thing we just looked at: assigning subject headings.

Now, we will look at the lower end. Subject headings are important, but we also index documents using many features which don’t take choice to identify and assign. Some other fields, like title, author, and date, will be printed in the document and are an entire field in a MARC record, in a Dublin Core record, and in many different indexing schemes.

A lot of indexing time is spent typing in deterministic fields. Those fields are easy for the computer to get. Because these are printed in the document, they are easy for the computer to pull out.
9. And it turns out, there’s an app for that. This is a JavaScript program written to look at the first 5 pages of a PDF thesis and pull out this basic indexing information. It is written for theses at the institution which created it, so you can’t run it on your theses and get the same results; you would have to modify it a little. The point is that it is possible to pull out these fields. According to the programmer, this has a 5%-6% error rate in pulling out information. So, it’s pretty reliable.
10. Both programs use the same methodology: a person will see the PDF as an image of a page; the computer will see the PDF as a long string of text. Then you give it rules to process that text.
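As a concrete illustration of what such a rule can look like, here is a minimal Excel VBA sketch; the label “Page No.” and the sample string are made-up stand-ins, not the project’s actual script:

Sub ExtractAfterLabel()
    Dim fullText As String
    Dim label As String
    Dim startPos As Long
    Dim token As String

    ' Hypothetical extracted text for one page; real pages varied.
    fullText = "CHAPTER 6A-1  Page No. 112  Supp. No. 47"
    label = "Page No."

    ' Rule 1: find where the label starts (0 means not found).
    startPos = InStr(fullText, label)
    If startPos > 0 Then
        ' Rule 2: take everything after the label, then keep only
        ' the first space-delimited token.
        token = Trim$(Mid$(fullText, startPos + Len(label)))
        token = Split(token, " ")(0)
        Debug.Print "Page number: " & token      ' prints 112
    Else
        Debug.Print "Label not found - flag for manual review"
    End If
End Sub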
11. Both of these programs – the one for indexing dissertations, and the one I wrote to index pages of the looseleaf – work similarly. The one I wrote put content from PDFs into Microsoft Excel format, then read the text in the Excel spreadsheet. The thesisbot used JavaScript to go into the PDF header and read the text in the PDF.
12. It is just as easy for the computer to see the text in a spreadsheet as in a text file.
  13. Both are equally easy to create. The computer can very easily see the PDF looking like this. So, the choice is which is easier for you to work with.
14. I have about 200 folders. Each folder has many PDFs in it.
15. First you need to get text into the PDF. It can’t be just an image; it has to have an image, and also the text in computer-readable format. To do this, you will run Optical Character Recognition (OCR) on the files. You can use whatever software you have. I used Adobe Acrobat 9, and will give instructions for it, because most institutions already have it.

Adobe Acrobat 9 is better at batch processing than version X or versions 8 and prior. Version X will not let you process a folder and all subfolders at once, so you must configure the program more frequently. For this project, Adobe Acrobat 9 needed to be configured only once, and then it processed all files in all project folders. Adobe Acrobat X would have needed to be configured once per folder, so more than 200 times. That is much more labor intensive.

The goal with this batch processing is that a spare computer runs the OCR while the person does other tasks. Later, when all files are processed, you can come pick up the files. The OCR for this project took about a week.

*Note* These screenshots are a huge part of the slideshow, but only about 2 min of the presentation. This is because I hadn’t requested the Acrobat software be available at the conference.
  16. Create a new sequence.
  17. Name it.
  18. Select Commands
  19. Add Recognize Text using OCR.
  20. Click OK, to save the sequence.
  21. Edit sequence to select the folder you will run it on.
22. Select folder. The folder you select can have subfolders, so you can click just once and start the program working on a very large project.
  23. Click run sequence.
24. The reason for extracting to Excel is that Microsoft’s products have simple scripting tools built in. It is possible to go directly into the PDF with JavaScript, Python, or another full programming language. However, there is a steep learning curve to do this. If you have not programmed before, you may spend weeks trying to get into the format, and you haven’t even gotten to the meat of what you want to do.

Microsoft has Visual Basic, a simple scripting language that calls functions which are built into Excel and other programs. So, to print, you type a line that says “Print” in it and then identifies the file. If you don’t choose a printer, it goes to whatever the default printer in MS Word or Excel is set to; you don’t have to write a print driver. When you manipulate a spreadsheet, you use commands like “Worksheet” and “Cell”. You cannot do as much with Visual Basic in Excel as with a full language, but the learning curve is much shallower. I do not have a programming background, and this was what was possible for me. (I was much slower than a programmer. It took me about two weeks to write the script; a programmer would take about 45 minutes.)

Many, many tools exist to do a batch extraction of text from PDFs to Excel. The tool I used is here: http://www.a-pdf.com/to-excel/download.htm
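For a taste of how shallow that learning curve is, here is a minimal sketch using only those built-in objects. This is not the project script itself; the sheet name and columns are hypothetical:

Sub MarkEmptyCells()
    Dim ws As Worksheet
    Dim lastRow As Long
    Dim r As Long

    ' "Worksheets" and "Cells" are built into Excel; no setup needed.
    Set ws = Worksheets("Sheet1")     ' hypothetical sheet name

    ' Find the last used row in column A.
    lastRow = ws.Cells(ws.Rows.Count, 1).End(xlUp).Row

    ' Visit every row and note where the extraction came up empty.
    For r = 1 To lastRow
        If Len(Trim$(CStr(ws.Cells(r, 1).Value))) = 0 Then
            ws.Cells(r, 2).Value = "EMPTY - needs manual entry"
        End If
    Next r
End Sub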
25. You can find the extracted spreadsheets at that URL, so you can click into them and look at what patterns were able to come out. I looked for patterns: I compared several PDFs to the corresponding worksheets, and looked for patterns in the worksheet that I could use to identify fields.
26. Here is some background on IT staff in universities. The university is not Google. It is not building a system that people access through computers. No, most IT in universities is focused on making sure that the computers in campus buildings work. That involves installing software and hauling heavy equipment around campus. It does not involve programming.

When you look for advice, be aware of that limitation. If IT staff tell you something is hard to do, it may actually be easy to do. Most of the time, that person is telling you that they don’t know how to do it. If IT staff tell you that they don’t know how to do it, then that’s a sign of honesty and good self-awareness. You should continue to go to that person for advice about future projects. At worst, they will say they don’t know (it’s what they will usually do), but they will not send you down a blind alley. They will not call something impossible when it’s actually pretty easy to do.
27. Photo by Arto Teräs, http://ajt.iki.fi/travel/debconf5/ The IT staff you are most likely to see in an organization are people who carry heavy things, make sure computers in the building work, etc. They are focused on workstations, not on the tools and resources that people use those workstations to access.

This type of IT staff is unlikely to know programming, and if he does, it is just a lucky chance. His professional contacts will probably be people who work with the same issue: laying cable. Then, if he needs more cable, or a connector, he can borrow it. So, he isn’t even likely to be able to refer you to a programmer.
28. Photo by Giorgos Fidanas. The other highly visible type of IT staff are people who video events and make up classes. Video and audio production are a specialized area of IT. These guys will know how to connect sound equipment, and their professional contacts will be people who do sound and video. They are professionally far away from computer programming.
29. You can get a file of the extracted text from Ch. 6A-1 by clicking the first link. The second link will bring up the script I wrote to extract fields.

Column B → identifier
Column C → chapter number before the dash
Column D → chapter number after the dash
Column E → page number
Column F → supplement number (corresponds with the date that the page went into the binder)
Column G → file name

I didn’t extract the date the page came out of the binder. This was a handwritten field which was not available for all pages. It had to be manually entered, or derived from instruction sheets.
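To show how the “before the dash” and “after the dash” fields could be cut apart, here is a minimal VBA sketch; the citation format is a simplified assumption for illustration, not the actual layout of the looseleaf pages:

Sub SplitRuleCitation()
    Dim citation As String
    Dim parts() As String

    ' Hypothetical, simplified citation: chapter, dash, rule number.
    citation = "6A-1.01"

    parts = Split(citation, "-")
    If UBound(parts) = 1 Then
        ' Unqualified Range writes to the active sheet.
        Range("C2").Value = parts(0)                 ' before the dash: "6A"
        Range("D2").Value = Split(parts(1), ".")(0)  ' after the dash: "1"
    Else
        Range("C2").Value = "NO DASH - flag for review"
    End If
End Sub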
30. Some points: chapter number before the dash could be extracted from the way the pages were physically arranged. So, pages from the same chapter were adjacent in the binders. If pages had been shuffled, this would be much lower.

Supplement number is the only field that you really cannot at all infer from the physical arrangement of pages. Therefore, this is a better representation of what you would get if you had a pretty predictable format for a field (i.e., only numbers) and your documents were randomly arranged, so you couldn’t “cheat” and capture fields based on nearby documents.
31. The reason this search is bad is that there are too many missing fields. (!! Demo with search of 6A-1. !!) If I search and don’t find my page, then I have to check several errata areas with partial matches. If I don’t have an exact match, then I have to open many, many files to get to what I want.
32. This is the form for chapter number before the dash. The student assistant manually filled in missing fields using an online form. She entered an error code for blank pages, for indexing errors in other fields if she noticed them, and for scanning errors such as pages folded in half or very crooked. 500 was a code for illegible, and 900 was a code for any other problem – like a page folded in half, etc. Then I could go through and examine all records with a 900 recorded in any field. There was a form like this for each other field.
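A minimal sketch of how those 900 codes could be rounded up for review, assuming a hypothetical sheet and column layout (the actual review process may have worked against the database instead):

Sub ListRowsFlagged900()
    Dim ws As Worksheet
    Dim lastRow As Long
    Dim r As Long, c As Long
    Dim report As String

    Set ws = Worksheets("metadata")   ' hypothetical sheet name

    lastRow = ws.Cells(ws.Rows.Count, 1).End(xlUp).Row
    For r = 2 To lastRow              ' assume row 1 holds headers
        For c = 2 To 6                ' hypothetical field columns B through F
            If CStr(ws.Cells(r, c).Value) = "900" Then
                report = report & "Row " & r & vbCrLf
                Exit For              ' one mention per record is enough
            End If
        Next c
    Next r

    If Len(report) > 0 Then Debug.Print "Records to review:" & vbCrLf & report
End Sub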
33. These 590 pages of instructions, which went out with 127 supplements over 14 years, were compared to the database. Ideally, everywhere a page went in, we have a slot for a page (many pages were missing), and everywhere a page came out we have a slot for a page (the audit for pages out was not completed, but was done up to and including supplement number 19). (!! Demo and open really long PDF. !!)

So, in the final database of pages, there is a record for each page saying when it went into the binder and when it came out. Auditing involved locating each page that came out of the binder, and then locating each page that went into the binder. If a page is missing, you still get a record: you pull up a page that says “missing PDF”, and you can prove the absence.
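The matching idea behind that audit can be sketched in VBA with a Dictionary for fast lookups. The actual audit was done against the database; the two worksheets here (“index” for pages that made it in, “expected” for pages the instruction sheets list) are hypothetical:

Sub AuditExpectedPages()
    Dim have As Object
    Dim wsIndex As Worksheet, wsExpected As Worksheet
    Dim lastRow As Long, r As Long
    Dim key As String

    ' A Dictionary gives fast "is this identifier present?" lookups.
    Set have = CreateObject("Scripting.Dictionary")

    ' Pass 1: load every page identifier that made it into the index.
    Set wsIndex = Worksheets("index")          ' hypothetical sheet
    lastRow = wsIndex.Cells(wsIndex.Rows.Count, 1).End(xlUp).Row
    For r = 2 To lastRow
        have(CStr(wsIndex.Cells(r, 1).Value)) = True
    Next r

    ' Pass 2: walk the pages the instruction sheets say should exist,
    ' and mark any identifier that has no matching record.
    Set wsExpected = Worksheets("expected")    ' hypothetical sheet
    lastRow = wsExpected.Cells(wsExpected.Rows.Count, 1).End(xlUp).Row
    For r = 2 To lastRow
        key = CStr(wsExpected.Cells(r, 1).Value)
        If Not have.Exists(key) Then
            wsExpected.Cells(r, 2).Value = "MISSING"
        End If
    Next r
End Sub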
  34. This is how much time it took to complete each step of the Florida Administrative Code database.
35. And here is that information visually. Time in:
Database work = 50 hours = 8%
Digitization = 35 hours = 6%
Automated metadata = 70 hours = 11%
Manual metadata = 128.25 hours = 20%
Auditing = 342.75 hours = 55%
(626 hours total)

Auditing involved checking for errors when a page couldn’t be found. So, a found record took only a few seconds to check, but a missing record took a few minutes. Here, about 10% of pages (about 3,000 pages) were missing. This meant that there were many more “stops” along the way. A complete set of pages probably would have had fewer errors.
36. MANY INDEXING “ERRORS” ARE FROM MISPRINTS ON THE PAGE, especially for supplement number. Errors from misprints were counted as errors here; both a person and a computer would get these wrong.

These two fields were compared because the physical page organization in the print material was not a predictor for them, so they better show indexing purely by looking at the page.

For page number, there was one systematic error: a run of about 163 pages all had several X-es in a row. All those were the same error, so they were easily detected and corrected at once. Even when those are included, the error rate is only 1% for the computer, and is lower than for the person.

Error rates for the person and the computer are comparable, but the computer could not detect all the fields and records, while the person could. And finally, the error rate for the thesis indexing project on GitHub was five to six percent, according to the author.
37. Most errors were omissions of 1s (they look like ls) and 0s (they look like Os), or mixing up 5 with 6, or 3 with 8. For this project, “11” was not an obvious misspelling of “112”. If you were extracting words, “a1l” would be an obvious misspelling of “all”. So, most projects will have obvious errors, which are easier to correct than the errors in this project. Words are also easier for a person to pull out.

Errors in the human-generated metadata are probably because this is something that involves mental fatigue. The online form for entering metadata was something where the person looks at a number, types it in a box, hits submit, and repeats over and over for hours. Lifeguards who watch a pool get a 15-minute break out of the hour: they watch the pool for 45 minutes, then they get a 15-minute break, then they watch the pool. That break is required by insurance, because if you have to stare for too long, it’s dangerous. The lifeguard will get mental fatigue, and someone can drown while they are looking and they won’t notice it. So, indexing something like this, where it’s meaningless numbers, is not easy for a person. Something where they read, make a decision, and are engaged will probably have fewer errors.

Another reason errors don’t matter as much is that there were some typos on the actual pages. For example, a supplement might be labeled with the earlier supplement number, because the typesetting was not corrected. Typos and errors in printing were not the norm, but they came up regularly. Usually, the clerk who switched out the pages had made a pencil note, and this helped to detect and resolve indexing errors from typos.
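One cheap sanity check that makes letter-for-digit confusions visible (though not omissions like “11” for “112”) is to test whether a numeric field still parses as a number. A hypothetical sketch, with made-up sheet and column choices:

Sub FlagSuspectNumbers()
    Dim ws As Worksheet
    Dim lastRow As Long, r As Long
    Dim v As String

    Set ws = Worksheets("metadata")   ' hypothetical sheet name

    ' Hypothetical numeric field in column E.
    lastRow = ws.Cells(ws.Rows.Count, 5).End(xlUp).Row
    For r = 2 To lastRow
        v = CStr(ws.Cells(r, 5).Value)
        ' A "1" read as "l", or a "0" read as "O", leaves letters in
        ' a field that should be all digits - easy to test for.
        If Len(v) > 0 And Not IsNumeric(v) Then
            ws.Cells(r, 6).Value = "check OCR"
        End If
    Next r
End Sub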
38. For librarians and project managers:

Consider automation. I did not know coding when I started, and actually a big barrier for me was figuring out how to get into the Visual Basic screen in Excel (this eHow tells you how to do it: http://www.ehow.com/how_7175501_turn-developer-tab-excel-2007.html ). It was still faster for me to learn a little Visual Basic, get to the messy spreadsheet, and run the script many times to clean up many messy spreadsheets, than it was to index manually. (So, about twice as much time went to automated metadata, but then that got 90 percent of many of those fields, so it was more efficient.)

Understand IT resources. People who install computers or record video of events are not programmers; those are very different skillsets. If you cannot get a collaborator, try to find an advisor who can assess your skill level and point you to a tool YOU can use (not a perfect tool for a perfect programmer).

For administrators:

If you have never had coding projects before, understand the pacing of coding projects. Some projects look good as they approach completion: each step along the way is sequential and looks “pretty”. Other projects look messy along the way, then clean at the end, and that is the natural progression for scripting. If you have not seen this type of project before, you need to understand that the incremental steps to get to the end tend to look bad. They don’t increment; nothing happens for a long time, and then a lot of things happen fast.

It took me about two weeks of part-time work to write the script to pull out these fields. When you have a script that almost works, you try to run it and get a crash. No matter how close you are, until it is perfect, you get a crash. My supervisors (reference librarians without a technical services background) did not seem to understand this. To get work time for the scripting part of this project, I reported that digitizing material and planning an unrelated workshop took longer than they actually took, and hid some hours of scripting in that, because when I reported that I had spent a week working on scripting, it didn’t go over well, since the script still crashed. Even though data entry went slower, there was no scrutiny on it, because it had incremental results; but, from the assessment, it was much less efficient.

Scripting might be viewed like a renovation project where a wall has severe water damage. When the wall is removed, the room looks really bad during that intermediate stage. Then at the end of the project, the room looks much better, because there is no more damaged drywall. If you were to demand pretty intermediate stages, then all the construction crew could do is repaint and leave the water damage underneath. Many other projects are similar, with ugly steps, so try to put scripting into that category.
39. Jason Cronk provided advice on the project structure. Anna Annino provided assistance with metadata.