Automated Metadata Creation:
    Possibilities and Pitfalls


     Presented by Wilhelmina Randtke
              June 10, 2012
           Nashville, Tennessee
At the annual meeting of the North American
           Serials Interest Group.


          Materials posted at
www.randtke.com/presentations/NASIG.html
Teaser: Preview of the sample project.




http://www.fsulawrc.com
Background: What is “metadata”?
Metadata = any indexing information
Examples:
  MARC records
  color, size, etc. to allow clothes shopping on a
    website
  writing on the spine of a book
  food labels
What we'll cover
   Automated indexing:
       Human vs machine indexing
       Range of tools for automated metadata creation:
        Techy and less techy.
       Sample projects
   A little background on relational databases
       Database design for a looseleaf (a resource that
        changes state over time).
   Sample project: The Florida Administrative
    Code 1970-1983
Automated Indexing: What’s easy for computers?
Computers like black and white decisions.
Computers are bad with discretion.
Word search vs. Subject headings
One Trillion


1,000,000,000,000

   web pages indexed by Google
        … 4 years ago …
Nevertheless…

… Human indexing is alive and well
How to fund indexing?
http://www.ebay.com/sch/Dresses-/63861/i.html?_nkw=summer+dress
Who made the metadata: Human or Machine?




How GoogleBooks gets its metadata: http://go-to-hellman.blogspot.com/2010/01/google-exposes-book-metadata-privates.html
Not automated indexing, but a related concept….


Always try to think about how to reuse existing metadata.
High Tech automated metadata creation
The high end: Assigning subject headings with computer code
Some technologies:
• UIMA (Unstructured Information Management
  Architecture)
• GATE (General Architecture for Text
  Engineering)
• KEA (Keyphrase Extraction Algorithm)
Person’s role:
   Select an appropriate ontology.
   Configure the program so that it’s looking at outside sources.
   Review the results and make sure the assigned subject headings are good.

Program’s role:
   Take the ontology or thesaurus and apply it to each item to give subject headings.

(Slide diagram: an item plus an ontology or thesaurus feed into the computer program for automated indexing, which outputs subject headings.)
http://www.nzdl.org/Kea/examples1.html
The lower end: Deterministic fields
There’s an app for that
Scripts for extracting fields from a thesis, posted on GitHub: https://github.com/ao5357/thesisbot
Batch OCR
Many tools exist to extract text from PDFs to Excel
Walkthrough – examining the extracted spreadsheets


http://fsulawrc.com/excelVBAfiles/index.html
How to plan the program
• Look for patterns
• Write step-by-step instructions about how to
  process the Excel file
  • Remember: NO DISCRETION. Computers do not take well to discretion.
  • Good steps (a minimal sketch of these appears after this list):
     • Go to the last line of the worksheet
     • Look for the letter a or A
     • Copy starting from the first number in the cell, up to and including the last number in the cell.
  • Bad steps:
     • Find the author’s name (this step needs to be broken into small “stupid” steps)
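
A minimal Excel VBA sketch of the three “good steps” above. The worksheet layout here is an assumption for illustration (extracted text in column A, result written to column B of the same row); the actual script linked later in this talk is more involved.

Sub ExtractNumbersFromLastRow()
    Dim lastRow As Long, i As Long
    Dim firstNum As Long, lastNum As Long
    Dim cellText As String

    ' Step 1: go to the last line of the worksheet (column A assumed).
    lastRow = Cells(Rows.Count, 1).End(xlUp).Row
    cellText = Cells(lastRow, 1).Value

    ' Step 2: look for the letter a or A (case-insensitive search).
    If InStr(1, cellText, "a", vbTextCompare) = 0 Then Exit Sub

    ' Step 3: copy starting from the first number in the cell,
    ' up to and including the last number in the cell.
    For i = 1 To Len(cellText)
        If Mid(cellText, i, 1) Like "#" Then
            If firstNum = 0 Then firstNum = i
            lastNum = i
        End If
    Next i
    If firstNum > 0 Then
        Cells(lastRow, 2).Value = Mid(cellText, firstNum, lastNum - firstNum + 1)
    End If
End Sub

Every step is “stupid” in exactly the sense above: find a row, test for a letter, scan for digits. Nothing requires discretion.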
Writing the program
• Identify appropriate advisors.
  • Remember, most IT staff on a campus just install computers in offices, etc. Programming and database planning are rare skills. The worst IT personnel will not realize that they do not have these skills.
  • If an IT staff member tells you they do not know how to do something, that honesty shows self-awareness: go back to that person for advice on all future projects.
• Try to find entry level material on coding.
  • (Sadly, most computer programming instructions already
    assume you know some programming.)
• If outsourcing or collaborating, remember that the index is the ultimate goal. Someone on the project needs to understand the index, and that someone will probably have to be you.
Finding Advisors: Most campus IT
  is about carrying heavy objects
Perfection?
How close to perfection can you get?
Let’s run some code:
  A spreadsheet with extracted text: http://fsulawrc.com/excelVBAfiles/23batch6A.xls
  Visual Basic script: http://fsulawrc.com/excelVBAfiles/VBAscriptForFAC.docx
  The files: You can retrieve some of these same files by searching 6A-1 in the main search for the database at www.fsulawrc.com
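
The linked .docx holds the actual script. As a rough illustration of the kind of deterministic rule it applies, a Florida Administrative Code citation such as “6A-1” splits at the dash into the two chapter fields. This is a simplified sketch only; the real script handles many more cases.

' Simplified sketch: split a citation like "6A-1" at the dash.
Function ChaptBeforeDash(citation As String) As String
    Dim dashPos As Long
    dashPos = InStr(citation, "-")
    If dashPos > 0 Then ChaptBeforeDash = Left(citation, dashPos - 1)
End Function

Function ChaptAfterDash(citation As String) As String
    Dim dashPos As Long
    dashPos = InStr(citation, "-")
    If dashPos > 0 Then ChaptAfterDash = Mid(citation, dashPos + 1)
End Function

' ChaptBeforeDash("6A-1") returns "6A"; ChaptAfterDash("6A-1") returns "1".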
How much metadata was missing?
(27,992 fields total, after preliminary removal of blank pages)

Field | Number of empty fields | Percent of field filled
Chapter no. before dash | 183 | 99.3%
Chapter no. after dash | 2,179 | 92.2%
Page no. | 1,766 | 93.6%
Supplement no. (i.e., date the page went into the looseleaf) | 3,242 | 88.4%
Replacing supplement (i.e., date the page was removed from the looseleaf) | All (however, 105 fields were entered manually in order to demonstrate the interface and get funding for manual metadata creation) | 0%
Cheap and fast
                  and incomplete
This is a search engine built on an index of the automated metadata only:
http://fsulawrc.com/automatedindex.php


It’s better than a shuffled pile of 30,000 pages.
It’s not very good.
If you are thousands of miles away, then this is
   better than print. If you are in the same room as
   organized print, print might be better.
Filling in the gaps
Code helps speed the workflow, but it is still time consuming.




http://fsulawrc.com/phptest/chaptbeforedashfill.php
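
The page linked above is a PHP form for human review. On the Excel side, one gap-filling pass can be sketched as a fill-down: where a chapter number is blank, guess the value from the page above and leave it for a person to confirm. Column B as the chapter-number column is an assumption for illustration.

Sub FillDownChapterNumbers()
    Dim lastRow As Long, r As Long
    lastRow = Cells(Rows.Count, 2).End(xlUp).Row
    For r = 2 To lastRow
        ' A blank chapter number probably matches the page above it,
        ' but a person still needs to review these guesses.
        If Trim(Cells(r, 2).Value) = "" Then
            Cells(r, 2).Value = Cells(r - 1, 2).Value
        End If
    Next r
End Sub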
Last step: Auditing for missing pages, by comparing instruction sheets that went out with supplements




www.fsulawrc.com/supplementinstructionsheets.pdf
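
Mechanically, that audit is a set comparison. A rough VBA sketch, assuming page identifiers from the instruction sheets sit in column A of a sheet named "Expected" and identifiers captured in the database sit in column A of a sheet named "Captured" (the sheet names and columns are assumptions for illustration):

Sub FlagMissingPages()
    Dim expected As Worksheet, captured As Worksheet
    Dim lastRow As Long, r As Long
    Set expected = Worksheets("Expected")   ' pages the instruction sheets say should exist
    Set captured = Worksheets("Captured")   ' pages actually found in the database

    lastRow = expected.Cells(expected.Rows.Count, 1).End(xlUp).Row
    For r = 1 To lastRow
        ' Flag any expected page with no match among the captured pages.
        If IsError(Application.Match(expected.Cells(r, 1).Value, _
                                     captured.Columns(1), 0)) Then
            expected.Cells(r, 2).Value = "MISSING"
        End If
    Next r
End Sub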
Task | Hours spent | Category of work
Inspecting looseleaf and planning a database | 20 (high skill, high training) | Database work
Digitization with sheetfed scanner | 35 (low skill, low training) | Digitization
Planning the code for automated indexing | 20 (high skill, high training) | Database work
Coding for the automated indexing | 35 (would be faster for someone with a programming background) | Automated metadata
Running script and cleaning up metadata | 35 (skilled staff) | Automated metadata
Loading database and metadata on a server | 10 (would be about twice as fast for someone with more database design experience) | Database work
Coding online forms to speed data entry | 15 (skilled staff) | Manual metadata
Training on documents and database design | 15 (unskilled staff, but done before the student assistant got set up with computer forms and permissions) | Manual metadata
Metadata entry for fields the computer didn’t get | 98.25 (unskilled staff) | Manual metadata
Auditing the database against instruction sheets which went out with supplements | 342.75 (skilled staff; includes training time for student assistant) | Auditing
Where did the time go? Tasks and Hours

[Pie chart of hours by category: Database Work, Digitization, Auditing, Manual Metadata Creation, Automated Metadata Creation]
Error rates
Automated metadata for Supplement Number: 2.4%
Human metadata for Supplement Number: 0.8%

Automated metadata for Page Number
  with the systematic error: 1.0%
  with the systematic error removed: 0.3%
Human metadata for Page Number: 3.1%

Error rates for the thesis indexer on GitHub: 5% - 6%
Do error rates matter?
For the computer’s error rates, we might really be measuring OCR quality.
Most metadata will be words, not numbers.
• Words are easier for a computer to pull out.
  Misspellings are obvious when reviewing output.
• Words are easier for a person to pull out. Less
  fatigue.
Recommendations
• For practitioners:
  • Consider automating a process. Is it possible to
    index this without human involvement?
  • Understand what IT support is available. Support
    can be someone who picks the appropriate tool,
    then you apply it.
• For administrators:
  • Allow work time for this type of experimentation.
Good resources to get started
• A-PDF to Excel Extractor
  • A program that takes text from PDFs and puts it in Excel.
  • www.a-pdf.com/to-excel/download.htm
  • This is an easy start to get source material into a format
    you can work with.
• Excel Visual Basic (VBA) Tutorials by Pan Pantziarka
  • Almost all training material on coding assumes you already
    know how to code. These tutorials are good, because they
    assume you do not already know something.
  • www.techbookreport.com/tutorials/excel_vba1.html
  • For more advanced instructions, use a search engine to
    read message boards.
Good resources to get started
• eHow instructions for turning on the Developer Ribbon in Excel 2007
  • http://www.ehow.com/how_7175501_turn-developer-tab-excel-2007.html
     (use these same instructions for Excel 2010; older versions of Excel have the developer ribbon turned on by default)
  • How to get to the tab where you can do simple coding.
• How to Build a Search Engine
  • http://www.udacity.com/overview/Course/cs101/CourseRev/apr2012
  • Takes you through how webcrawlers work, using the
    programming language Python. (A website is a string of
    text only, nothing more, so these concepts are similar to
    metadata extraction.)
Good resources to get started
• Wikipedia section on string processing algorithms.
  • http://en.wikipedia.org/wiki/String_%28computer_science%29#String_processing_algorithms
  • These six links go to lists of all the things you can do to strings. (Remember, a string is a string of letters – it’s what you will be working with.)
  • Use the terminology from here to know what term of art to put into a search engine so that you can find instructions on how to do that in whatever code you choose. (A few of these operations appear in the VBA sketch after this list.)
• Wikipedia page on relational databases
  • http://en.wikipedia.org/wiki/Relational_database
  • It will be useful for you to understand primary keys,
    foreign keys, and tables referencing each other.
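
As promised above, a few of those string terms of art in Excel VBA, applied to a made-up sample value (the positions in the comments are only correct for this particular string):

Sub StringTermsOfArt()
    Dim s As String
    s = "  Chapter 6A-1, Page 12  "
    Debug.Print Trim(s)                    ' trimming: "Chapter 6A-1, Page 12"
    Debug.Print InStr(s, "6A")             ' searching: position of a substring (11)
    Debug.Print Mid(s, 11, 4)              ' substring: "6A-1"
    Debug.Print Replace(s, "Page", "p.")   ' substitution
    Debug.Print Split(Trim(s), ",")(0)     ' splitting on a delimiter: "Chapter 6A-1"
End Sub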
Automated Metadata Creation:
    Possibilities and Pitfalls


     Presented by Wilhelmina Randtke
              June 10, 2012
           Nashville, Tennessee
At the annual meeting of the North American
           Serials Interest Group.


          Materials posted at
www.randtke.com/presentations/NASIG.html
Special thanks to:
  Jason Cronk
  Anna Annino

  • 1. Automated Metadata Creation: Possibilities and Pitfalls Presented by Wilhelmina Randtke June 10, 2012 Nashville, Tennessee At the annual meeting of the North American Serials Interest Group. Materials posted at www.randtke.com/presentations/NASIG.html
  • 2. Teaser: Preview of the sample project. http://www.fsulawrc.com
  • 3. Background: What is “metadata”? Metadata = any indexing information Examples: MARC records color, size, etc. to allow clothes shopping on a website writing on the spine of a book food labels
  • 4. What we'll cover  Automated indexing:  Human vs machine indexing  Range of tools for automated metadata creation: Techy and less techy.  Sample projects  A little background on relational databases  Database design for a looseleaf (a resource that changes state over time).  Sample project: The Florida Administrative Code 1970-1983
  • 5. Automated Indexing: What’s easy for computers? Computers like black and white decisions. Computers are bad with discretion.
  • 6. Word search vs. Subject headings
  • 7. One Trillion 1,000,000,000,000 webpages indexed in Google … 4 years ago …
  • 9. How to fund indexing?
  • 11. How to fund indexing?
  • 12. How to fund indexing?
  • 13. Who made the metadata: Human or Machine? How GoogleBooks gets its metadata: http://go-to-hellman.blogspot.com/2010/01/google-exposes-book-metadata-privates.html
  • 14. Not automated indexing, but a related concept…. Always try to think about how to reuse existing metadata.
  • 15. High Tech automated metadata creation
  • 16. The high end: Assigning subject headings with computer code Some technologies: • UIMA (Unstructured Information Management Architecture) • GATE (General Architecture for Text Engineering) • KEA (Keyphrase Extraction Algorithm)
  • 17. Person’s role: Computer Select an appropriate Program for ontology. Automated Configure the Indexing program so that Ontology it’s looking at Thesaurus outside sources. Review the results and make sure the assigned subject headings are Item good. Program’s role: Take ontology or thesaurus and apply it to each Subject Headings item to give subject headings.
  • 19. The lower end: Deterministic fields
  • 20.
  • 21.
  • 22. There’s an app for that Scripts for extracting fields from a thesis posted on GitHub: https://github.com/ao5357/thesisbot
  • 23.
  • 24.
  • 25.
  • 26.
  • 27.
  • 28.
  • 29.
  • 30.
  • 32.
  • 33.
  • 34.
  • 35.
  • 36.
  • 37.
  • 38.
  • 39.
  • 40. Many tools exist to extract text from PDFS to Excel
  • 41.
  • 42.
  • 43. Walkthrough – examining the extracted spreadsheets http://fsulawrc.com/excelVBAfiles/index.html
  • 44. How to plan the program • Look for patterns • Write step-by-step instructions about how to process the Excel file • Remember, NO DISCRETION, computers do not take well to discretion. • Good steps: • Go to the last line of the worksheet • Look for the letter a or A • Copy starting from the first number in the cell, up to and including the last number in the cell. • Bad steps: • Find the author’s name (this step needs to be broken into small “stupid” steps)
  • 45. Writing the program • Identify appropriate advisors. • Remember, most IT staff on a campus just install computers in offices, etc. Programming and database planning are rare skills. The worst IT personnel will not realize that they do not have these skills. • If an IT staff tells you they do not know how to do something, then go back to that person for advice on all future projects. • Try to find entry level material on coding. • (Sadly, most computer programming instructions already assume you know some programming.) • If outsourcing or collaborating, remember, the index is the ultimate goal. Understanding of the index needs to be in the picture. You probably have to bring it in.
  • 46. Finding Advisors: Most campus IT is about carrying heavy objects
  • 47. Finding Advisors: Most campus IT is about carrying heavy objects
  • 48. Perfection? How close to perfection can you get? Let’s run some code: A spreadsheet with extracted text: http://fsulawrc.com/excelVBAfiles/23batch6A.xls Visual Basic script: http://fsulawrc.com/excelVBAfiles/VBAscriptFor FAC.docx The files: You can retrieve some of these same files by searching 6A-1 in the main search for the database at www.fsulawrc.com
  • 49. How much metadata was missing? Field Number of empty fields Percent of Field filled (27,992 fields total, after preliminary removal of blank pages) Chapt. No before dash 183 99.3% Chapt no after dash 2179 92.2% Page no. 1766 93.6% Supp no (ie. Date page 3242 88.4% went into the looseleaf) Replacing supplement (ie. All 0% Data page was removed (however, 105 fields were from the looseleaf) entered manually in order to demonstrate the interface and get funding for manual metadata creation)
  • 50. Cheap and fast and incomplete This is a search engine build on an index for the automated metadata only: http://fsulawrc.com/automatedindex.php It’s better than a shuffled pile of 30,000 pages. It’s not very good. If you are thousands of miles away, then this is better than print. If you are in the same room as organized print, print might be better.
  • 51. Filling in the gaps Code helps speed workflow, but still time consuming. http://fsulawrc.com/phptest/chaptbeforedashfill.php
  • 52. Last step: Auditing for missing pages, by comparing instruction sheets that went out with supplements www.fsulawrc.com/supplementinstructionsheets.pdf
  • 53. Task Hours spent Category of work Inspecting looseleaf and 20 (high skill, high training) Database work planning a database Digitization with sheetfed 35 (low skill, low training) Digitization scanner Planning the code for 20 hours (high skill, high training) Database work automated indexing Coding for the automated 35 hours (would be faster for someone Automated metadata indexing with a programming background) Running script, and cleaning 35 hours (skilled staff) Automated metadata up metadata Loading database and 10 hours (would be about twice as fast for Database work metadata on a server someone with more database design experience) Coding online forms to speed 15 hours (skilled staff) Manual metadata data entry Training on documents and 15 hours (unskilled staff, but done before Manual metadata database design the student assistant got setup with computer forms and permissions) Metadata entry for fields the 98.25 hours (unskilled staff) Manual metadata computer didn’t get Auditing the database against 342.75 hours (skilled staff; includes Auditing instruction sheets which went training time for student assistant)
  • 54. Where did the time go? Tasks and Hours Database Work Digitization Auditing Manual Metadata Creation Automated Metadata Creation
  • 55. Error rates Automated metadata for Supplement Number: 2.4% Human metadata for Supplement Number: 0.8% Automated metadata for Page Number with systematic error: 1.0% with the systematic error removed: 0.3% Human metadata for Page Number: 3.1% Error rates for the thesis indexer on GitHub: 5% - 6%
  • 56. Do error rates matter? For computer rates, might be measuring OCR. Most metadata will be words, not numbers. • Words are easier for a computer to pull out. Misspellings are obvious when reviewing output. • Words are easier for a person to pull out. Less fatigue.
  • 57. Recommendations • For practitioners: • Consider automating a process. Is it possible to index this without human involvement? • Understand what IT support is available. Support can be someone who picks the appropriate tool, then you apply it. • For administrators • Allow work time for this type of experimentation.
• 58. Good resources to get started • A-PDF to Excel Extractor • A program that takes text from PDFs and puts it in Excel. • www.a-pdf.com/to-excel/download.htm • This is an easy start to get source material into a format you can work with. • Excel Visual Basic (VBA) Tutorials by Pan Pantziarka • Almost all training material on coding assumes you already know how to code. These tutorials are good because they do not assume prior knowledge. • www.techbookreport.com/tutorials/excel_vba1.html • For more advanced instructions, use a search engine to read message boards.
• 59. Good resources to get started • eHow instructions for turning on the Developer Ribbon in Excel 2007 • http://www.ehow.com/how_7175501_turn-developer-tab-excel-2007.html (use these same instructions for Excel 2010; older versions of Excel have the developer ribbon turned on by default) • How to get to the tab where you can do simple coding. • How to Build a Search Engine • http://www.udacity.com/overview/Course/cs101/CourseRev/apr2012 • Takes you through how webcrawlers work, using the programming language Python. (A website is a string of text only, nothing more, so these concepts are similar to metadata extraction.)
• 60. Good resources to get started • Wikipedia section on string processing algorithms. • http://en.wikipedia.org/wiki/String_%28computer_science%29#String_processing_algorithms • These links go to lists of all the things you can do to strings. (Remember, a string is a string of letters – it’s what you will be working with.) • Use the terminology from here to know what term of art to put into a search engine so that you can find instructions on how to do that in whatever code you choose. • Wikipedia page on relational databases • http://en.wikipedia.org/wiki/Relational_database • It will be useful for you to understand primary keys, foreign keys, and tables referencing each other.
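For example – and this is just an illustrative sketch, with a made-up sample string – the most common string operations look like this in Excel VBA, with the term of art for each one noted in a comment so you know what to put into a search engine:

Sub StringBasics()
    Dim s As String
    s = "Supp. No. 47"              ' hypothetical sample string

    ' "substring search": position of one string inside another
    Debug.Print InStr(s, "No.")     ' prints 7

    ' "substring extraction": take part of a string
    Debug.Print Mid$(s, 11, 2)      ' prints 47

    ' "splitting" / "tokenizing": cut a string at a delimiter
    Debug.Print Split(s, " ")(2)    ' prints 47

    ' "string replacement"
    Debug.Print Replace(s, "Supp.", "Supplement")
End Sub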
  • 61. Automated Metadata Creation: Possibilities and Pitfalls Presented by Wilhelmina Randtke June 10, 2012 Nashville, Tennessee At the annual meeting of the North American Serials Interest Group. Materials posted at www.randtke.com/presentations/NASIG.html
  • 62. Special thanks to: Jason Cronk Anna Annino
  • 63. Automated Metadata Creation: Possibilities and Pitfalls Presented by Wilhelmina Randtke June 10, 2012 Nashville, Tennessee At the annual meeting of the North American Serials Interest Group. Materials posted at www.randtke.com/presentations/NASIG.html

Editor's Notes

1. This is a short overview of the project: I built indexing for a state resource which is similar to the Code of Federal Regulations. Who here is familiar with the Code of Federal Regulations? The Code of Federal Regulations is a resource with all government rules. Laws are passed by Congress. Rules are the agency interpretations of those laws. They are binding just like law. The Florida Administrative Code is the Florida version of this.

They are both resources that change over time. The Code of Federal Regulations is printed once, but rules are continually changing; every week there is a change. The Florida Administrative Code was printed in 3-ring binders in 1970, and then monthly supplements were put out until 1983. There were 127 supplements total. So, each month, some pages were taken out of the binders, and others were put in.

The government did not keep a master copy of the pages as they were removed from the binder. Only two universities kept these materials: the Florida State University law library, and the University of Miami. Neither kept a complete set. Because of this, the resource is both binding law and also very difficult to access. The Florida State University Law Library gets about 5 to 6 requests each year for old versions of the rules. In print, these were bound according to date removed. So, you have to start at the current version, look at the date of amendment closest after the date you want to find, and then go through all pages removed that month. You also have to do this in the law library, and you have to do this after figuring out how the pages are arranged. There are no instructions for searching the bound pages in the law library.

To build the online database, I had to do two things. First, I had to get all this indexing information for each page separately. Any given page could stay in the binder only one month, or could stay in the binder for the whole 14 years. So, two adjacent pages on one date might not be adjacent on the next. Each page needed full indexing information. And… there were over 30,000 pages to index.

Second, I had to design a database to hold this type of resource. All the digital library platforms – DSpace, EPrints, Digital Commons, CONTENTdm – hold objects which are static, so they don’t change state over time. I had to design a database which allows the resource to be pulled as it appeared on a specific date. The metadata I pulled out is not Dublin Core or any other standard schema. It is metadata to locate the page within the larger resource.

(!! Demo search of 6A-1 !!)
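A minimal sketch of that “pull the resource as it appeared on a specific date” idea – in Excel VBA for illustration rather than the relational database the project actually used; the worksheet name, column layout, and target supplement number are all hypothetical:

Sub PagesInForceAtSupplement()
    Dim ws As Worksheet
    Dim target As Long
    Dim lastRow As Long, r As Long
    Dim suppIn As Variant, suppOut As Variant

    ' Hypothetical layout: one row per page, column B = supplement
    ' number when the page went in, column C = supplement number
    ' when it came out (blank if it never came out).
    Set ws = Worksheets("pages")
    target = 47                       ' hypothetical supplement to reconstruct

    lastRow = ws.Cells(ws.Rows.Count, 1).End(xlUp).Row
    For r = 2 To lastRow
        suppIn = ws.Cells(r, 2).Value
        suppOut = ws.Cells(r, 3).Value
        ' A page belongs in the reconstructed binder if it went in on
        ' or before the target and came out after it (or never did).
        If suppIn <= target And (IsEmpty(suppOut) Or suppOut > target) Then
            ws.Cells(r, 4).Value = "in force"
        End If
    Next r
End Sub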
2. Where this comes out, and we have all seen it before, is in the different feel of keyword search versus subject headings. A full text search is something that has only been possible with computers (other than maybe the Bible, which got a concordance long, long ago). Meanwhile… subject headings often give really great results, but people tend to go for the keyword search.

In research, the problem is: how to get all the relevant documents? No one will solve that. In librarianship and indexing, the problem is: how to fund making the subject headings?

One of the current issues bouncing around a listserv I’m on is the U.S. government Office of the Law Revision Counsel in the U.S. House of Representatives thinking about canceling the creation of the General Index of the U.S. Code – that is where all federal laws are codified – so, to stop assigning subject headings to parts of federal law. The problem is, without that, keyword search is basically useless for finding the law you want, because laws tend to be wordy and use the same stock words over and over again. It’s a last step in pulling law. The problem also is, someone who doesn’t have a good background in research may not realize this, and so maybe you can’t have this subsidized – you can’t pay to organize things centrally, and instead you have to pay much more by having a slower search for all the end users.

In libraries, the problem you may have is that you do have staff for assigning subject headings, but then you can’t go as fast as the computer. So, subject headings for 2000 documents is better than keyword search only for 2000 documents. But… subject headings for 200 documents is probably really not as good as keyword search for 2000 documents. With Google, you have keyword search for more than you can grasp. In 1998, Google had indexed 26 million pages. In 2000, it passed the 1 billion mark. In 2008, it passed the one trillion mark. (Numbers from Google’s blog at http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html )

(!! Miscellaneous: MySQL has built-in full text indexing which gives a horrible, horrible search. At a slightly better level, Swish-e does indexing for a document set. http://en.wikipedia.org/wiki/SWISH-E !!)
  3. If you try to manually assign records, you will never keep up.
4. If there is existing metadata, like a MARC record, or anything else in electronic form, then try to think: how can I process this automatically? It is possible to:
– Pair a MARC record with a digital object. So look at the object, identify a possible MARC record, and match the two up. Then send it to a person for a sanity check.
– Dump the information from that MARC record into a different schema, like Dublin Core.
5. These are some “high end” technologies for assigning subject headings and doing automated indexing. High end because the technology is more complicated: you probably have to have a good IT background, or access to someone with an IT background who you can work with. (And remember, most IT staff in a library are focused on making sure the computers on the ground are working. Almost no one in a university builds the things people use that computer to go to and interact with. Almost all just make sure the computer will turn on and connect to whatever is there.)

So, these are the more techy programs for indexing documents. They all work in the same general way.

UIMA projects: http://www.ncbi.nlm.nih.gov/pubmed/22541592 Open semantic annotation of scientific publications using DOMEO: indexing medical documents using UIMA and developing an ontology (Annotation Ontology (AO)) for indexing these. The BBC used UIMA to index their web pages on TV programs. The US Department of Defense used UIMA to index internal documents in order to meet regulatory requirements on record keeping.
6. All of these work about the same way. The computer looks at the words in the document, in a big mash. Then you point that computer program at an ontology – so, a word list – which is specific to the really broad topic that the paper is about. The ontology will look at word combinations from that discipline; a thesaurus can tie in with that, so the thesaurus will take synonyms and merge them. The configuration says “Even though this word only happens a few times in the document, it’s more important to the total meaning of the document,” or “All these words mean the same thing.” The computer program applies the ontology to the item, then assigns subject headings.

Some important points: this still has a person involved. The person has to select the ontology. Knowing which one to use to index an entire collection is similar to assigning a subject heading to an item. So, in traditional cataloging, you look at the item, examine it, then assign subject headings. In this type of cataloging, you look at the collection of items, examine them, then you assign an ontology to the collection, and you need expertise in ontologies to do this.

The person still has to understand how to assign subject headings to be able to do this well. So, after running the program and assigning subject headings, it’s a good idea to look back at several items and subject headings and check that the computer assigned those correctly. If not, then you have to check what went wrong, and reconfigure which ontologies and thesaurus the computer is using.

Out of these three things the person is doing, only one is an IT role – that is configuring the program. The other two are not about IT skills; they are about expertise in subject headings.
7. This is an example comparison of human-assigned and computer-assigned subject headings from the Keyphrase Extraction Algorithm website. These look pretty good. Now, let’s go to the website, because we can see the whole example webpage: http://www.nzdl.org/Kea/examples1.html

OK, so I’m on tab 1, for the FAO’s Agrovoc thesaurus. All of these really well-assigned subject headings are a result of a good pairing between the collection and the thesaurus. In tab 2, you can see a good pairing with the Medical Subject Headings thesaurus. Once again, it’s a good pairing. Tab 3 is another good pairing with a thesaurus. Tab 4 is KEA without using an ontology or thesaurus. The results are not very good.

You cannot have the KEA program run all by its lonesome, because it will not get meaningful results. It runs with a thesaurus. Someone picks that thesaurus. And someone had to make that thesaurus (many of them are proprietary, and you have to pay to tie in with it and use the thesaurus).
8. The higher end was on assigning fields that need a value judgment. There is no judgment involved in pulling the words out of a document; computers are good at that, and automatic keyword search generation is even built into MySQL, the popular database platform. Judgment is needed to do the high-tech thing we just looked at: assigning subject headings.

Now, we will look at the lower end. Subject headings are important, but we also index documents using many features which don’t take choice to identify and assign. Some other fields, like title, author, and date, will be printed in the document and are an entire field in a MARC record, in a Dublin Core record, and in many different indexing schemes.

A lot of indexing time is spent typing in deterministic fields. Those fields are easy for the computer to get. Because these are printed in the document, they are easy for the computer to pull out.
9. And it turns out, there’s an app for that. This is a JavaScript program written to look at the first 5 pages of a PDF thesis and pull out this basic indexing information. It is written for theses at the institution which created it, so you can’t run it on your theses and get the same results; you would have to modify it a little. The point is that it is possible to pull out these fields. According to the programmer, this has a 5%-6% error rate in pulling out information. So, it’s pretty reliable.
10. Both programs use the same methodology: a person will see the PDF as an image of a page; the computer will see the PDF as a long string of text. Then you give it rules to process that text.
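As a concrete illustration of what such a rule can look like, here is a minimal Excel VBA sketch; the label “Page No.” and the sample string are made-up stand-ins, not the project’s actual script:

Sub ExtractAfterLabel()
    Dim fullText As String
    Dim label As String
    Dim startPos As Long
    Dim token As String

    ' Hypothetical extracted text for one page; real pages varied.
    fullText = "CHAPTER 6A-1  Page No. 112  Supp. No. 47"
    label = "Page No."

    ' Rule 1: find where the label starts (0 means not found).
    startPos = InStr(fullText, label)
    If startPos > 0 Then
        ' Rule 2: take everything after the label, then keep only
        ' the first space-delimited token.
        token = Trim$(Mid$(fullText, startPos + Len(label)))
        token = Split(token, " ")(0)
        Debug.Print "Page number: " & token      ' prints 112
    Else
        Debug.Print "Label not found - flag for manual review"
    End If
End Sub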
11. Both of these programs – the one for indexing dissertations, and the one I wrote to index pages of the looseleaf – work similarly. The one I wrote put content from PDFs into Microsoft Excel format, then read the text in the Excel spreadsheet. The thesisbot used JavaScript to go into the PDF header and read the text in the PDF.
12. It is just as easy for the computer to see the text in a spreadsheet as in a text file.
  13. Both are equally easy to create. The computer can very easily see the PDF looking like this. So, the choice is which is easier for you to work with.
14. I have about 200 folders. Each folder has many PDFs in it.
15. First you need to get text into the PDF. It can’t be just an image; it has to have an image, and also the text in computer-readable format. To do this, you will run Optical Character Recognition (OCR) on the files. You can use whatever software you have. I used Adobe Acrobat 9, and will give instructions for it, because most institutions already have it.

Adobe Acrobat 9 is better at batch processing than version X or versions 8 and prior. Version X will not let you process a folder and all subfolders at once, so you must configure the program more frequently. For this project, Adobe Acrobat 9 needed to be configured only once, and then it processed all files in all project folders. Adobe Acrobat X would have needed to be configured once per folder, so more than 200 times. That is much more labor intensive.

The goal with this batch processing is that a spare computer runs the OCR while the person does other tasks. Later, when all files are processed, you can come pick up the files. The OCR for this project took about a week.

*Note* These screenshots are a huge part of the slideshow, but only about 2 min of the presentation. This is because I hadn’t requested the Acrobat software be available at the conference.
  16. Create a new sequence.
  17. Name it.
  18. Select Commands
  19. Add Recognize Text using OCR.
  20. Click OK, to save the sequence.
  21. Edit sequence to select the folder you will run it on.
22. Select folder. The folder you select can have subfolders, so you can click just once and start the program working on a very large project.
  23. Click run sequence.
24. The reason for extracting to Excel is that Microsoft’s products have simple scripting tools built in. It is possible to go directly into the PDF with JavaScript, Python, or another full programming language. However, there is a steep learning curve to do this. If you have not programmed before, you may spend weeks trying to get into the format, and you haven’t even gotten to the meat of what you want to do.

Microsoft has Visual Basic, a simple scripting language that calls functions which are built into Excel and other programs. So, to print, you type a line that says “Print” in it and then identifies the file. If you don’t choose a printer, it goes to whatever the default printer in MS Word or Excel is set to; you don’t have to write a print driver. When you manipulate a spreadsheet, you use commands like “Worksheet” and “Cell”. You cannot do as much with Visual Basic in Excel as with a full language, but the learning curve is much shallower. I do not have a programming background, and this was what was possible for me. (I was much slower than a programmer. It took me about two weeks to write the script; a programmer would take about 45 minutes.)

Many, many tools exist to do a batch extraction of text from PDFs to Excel. The tool I used is here: http://www.a-pdf.com/to-excel/download.htm
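For a taste of how shallow that learning curve is, here is a minimal sketch using only those built-in objects. This is not the project script itself; the sheet name and columns are hypothetical:

Sub MarkEmptyCells()
    Dim ws As Worksheet
    Dim lastRow As Long
    Dim r As Long

    ' "Worksheets" and "Cells" are built into Excel; no setup needed.
    Set ws = Worksheets("Sheet1")     ' hypothetical sheet name

    ' Find the last used row in column A.
    lastRow = ws.Cells(ws.Rows.Count, 1).End(xlUp).Row

    ' Visit every row and note where the extraction came up empty.
    For r = 1 To lastRow
        If Len(Trim$(CStr(ws.Cells(r, 1).Value))) = 0 Then
            ws.Cells(r, 2).Value = "EMPTY - needs manual entry"
        End If
    Next r
End Sub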
25. You can find the extracted spreadsheets at that URL, so you can click into them and look at what patterns were able to come out. I looked for patterns: I compared several PDFs to the corresponding worksheets, and looked for patterns in the worksheet that I could use to identify fields.
26. Here is some background on IT staff in universities. The university is not Google. It is not building a system that people access through computers. No, most IT in universities is focused on making sure that the computers in campus buildings work. That involves installing software and hauling heavy equipment around campus. It does not involve programming.

When you look for advice, be aware of that limitation. If IT staff tell you something is hard to do, it may actually be easy to do. Most of the time, that person is telling you that they don’t know how to do it. If IT staff tell you that they don’t know how to do it, then that’s a sign of honesty and good self-awareness. You should continue to go to that person for advice about future projects. At worst, they will say they don’t know (it’s what they will usually do), but they will not send you down a blind alley. They will not call something impossible when it’s actually pretty easy to do.
27. Photo by Arto Teräs, http://ajt.iki.fi/travel/debconf5/ The IT staff you are most likely to see in an organization are people who carry heavy things, make sure computers in the building work, etc. They are focused on workstations, not on the tools and resources that people use those workstations to access.

This type of IT staff is unlikely to know programming, and if he does, it is just a lucky chance. His professional contacts will probably be people who work with the same issue: laying cable. Then, if he needs more cable, or a connector, he can borrow it. So, he isn’t even likely to be able to refer you to a programmer.
28. Photo by Giorgos Fidanas. The other highly visible type of IT staff are people who video events and make up classes. Video and audio production are a specialized area of IT. These guys will know how to connect sound equipment, and their professional contacts will be people who do sound and video. They are professionally far away from computer programming.
29. You can get a file of the extracted text from Ch. 6A-1 by clicking the first link. The second link will bring up the script I wrote to extract fields.

Column B → identifier
Column C → chapter number before the dash
Column D → chapter number after the dash
Column E → page number
Column F → supplement number (corresponds with the date that the page went into the binder)
Column G → file name

I didn’t extract the date the page came out of the binder. This was a handwritten field which was not available for all pages. It had to be manually entered, or derived from instruction sheets.
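To show how the “before the dash” and “after the dash” fields could be cut apart, here is a minimal VBA sketch; the citation format is a simplified assumption for illustration, not the actual layout of the looseleaf pages:

Sub SplitRuleCitation()
    Dim citation As String
    Dim parts() As String

    ' Hypothetical, simplified citation: chapter, dash, rule number.
    citation = "6A-1.01"

    parts = Split(citation, "-")
    If UBound(parts) = 1 Then
        ' Unqualified Range writes to the active sheet.
        Range("C2").Value = parts(0)                 ' before the dash: "6A"
        Range("D2").Value = Split(parts(1), ".")(0)  ' after the dash: "1"
    Else
        Range("C2").Value = "NO DASH - flag for review"
    End If
End Sub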
30. Some points: chapter number before the dash could be extracted from the way the pages were physically arranged. So, pages from the same chapter were adjacent in the binders. If pages had been shuffled, this would be much lower.

Supplement number is the only field that you really cannot at all infer from the physical arrangement of pages. Therefore, this is a better representation of what you would get if you had a pretty predictable format for a field (i.e., only numbers) and your documents were randomly arranged, so you couldn’t “cheat” and capture fields based on nearby documents.
31. The reason this search is bad is that there are too many missing fields. (!! Demo with search of 6A-1. !!) If I search and don’t find my page, then I have to check several errata areas with partial matches. If I don’t have an exact match, then I have to open many, many files to get to what I want.
32. This is the form for chapter number before the dash. The student assistant manually filled in missing fields using an online form. She entered an error code for blank pages, for indexing errors in other fields if she noticed them, and for scanning errors such as pages folded in half or very crooked. 500 was a code for illegible, and 900 was a code for any other problem – like a page folded in half, etc. Then I could go through and examine all records with a 900 recorded in any field. There was a form like this for each other field.
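A minimal sketch of how those 900 codes could be rounded up for review, assuming a hypothetical sheet and column layout (the actual review process may have worked against the database instead):

Sub ListRowsFlagged900()
    Dim ws As Worksheet
    Dim lastRow As Long
    Dim r As Long, c As Long
    Dim report As String

    Set ws = Worksheets("metadata")   ' hypothetical sheet name

    lastRow = ws.Cells(ws.Rows.Count, 1).End(xlUp).Row
    For r = 2 To lastRow              ' assume row 1 holds headers
        For c = 2 To 6                ' hypothetical field columns B through F
            If CStr(ws.Cells(r, c).Value) = "900" Then
                report = report & "Row " & r & vbCrLf
                Exit For              ' one mention per record is enough
            End If
        Next c
    Next r

    If Len(report) > 0 Then Debug.Print "Records to review:" & vbCrLf & report
End Sub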
33. These 590 pages of instructions, which went out with 127 supplements over 14 years, were compared to the database. Ideally, everywhere a page went in, we have a slot for a page (many pages were missing), and everywhere a page came out we have a slot for a page (the audit for pages out was not completed, but was done up to and including supplement number 19). (!! Demo and open really long PDF. !!)

So, in the final database of pages, there is a record for each page saying when it went into the binder and when it came out. Auditing involved locating each page that came out of the binder, and then locating each page that went into the binder. If a page is missing, you still get a record: you pull up a page that says “missing PDF”, and you can prove the absence.
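The matching idea behind that audit can be sketched in VBA with a Dictionary for fast lookups. The actual audit was done against the database; the two worksheets here (“index” for pages that made it in, “expected” for pages the instruction sheets list) are hypothetical:

Sub AuditExpectedPages()
    Dim have As Object
    Dim wsIndex As Worksheet, wsExpected As Worksheet
    Dim lastRow As Long, r As Long
    Dim key As String

    ' A Dictionary gives fast "is this identifier present?" lookups.
    Set have = CreateObject("Scripting.Dictionary")

    ' Pass 1: load every page identifier that made it into the index.
    Set wsIndex = Worksheets("index")          ' hypothetical sheet
    lastRow = wsIndex.Cells(wsIndex.Rows.Count, 1).End(xlUp).Row
    For r = 2 To lastRow
        have(CStr(wsIndex.Cells(r, 1).Value)) = True
    Next r

    ' Pass 2: walk the pages the instruction sheets say should exist,
    ' and mark any identifier that has no matching record.
    Set wsExpected = Worksheets("expected")    ' hypothetical sheet
    lastRow = wsExpected.Cells(wsExpected.Rows.Count, 1).End(xlUp).Row
    For r = 2 To lastRow
        key = CStr(wsExpected.Cells(r, 1).Value)
        If Not have.Exists(key) Then
            wsExpected.Cells(r, 2).Value = "MISSING"
        End If
    Next r
End Sub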
  34. This is how much time it took to complete each step of the Florida Administrative Code database.
35. And here is that information visually. Time in:
Database work = 50 hours = 8%
Digitization = 35 hours = 6%
Automated metadata = 70 hours = 11%
Manual metadata = 128.25 hours = 20%
Auditing = 342.75 hours = 55%
(626 hours total)

Auditing involved checking for errors when a page couldn’t be found. So, a found record took only a few seconds to check, but a missing record took a few minutes. Here, about 10% of pages (about 3,000 pages) were missing. This meant that there were many more “stops” along the way. A complete set of pages probably would have had fewer errors.
36. MANY INDEXING “ERRORS” ARE FROM MISPRINTS ON THE PAGE, especially for supplement number. Errors from misprints were counted as errors here; both a person and a computer would get these wrong.

These two fields were compared because the physical page organization in the print material was not a predictor for them, so they better show indexing purely by looking at the page.

For page number, there was one systematic error: a run of about 163 pages all had several X-es in a row. All those were the same error, so they were easily detected and corrected at once. Even when those are included, the error rate is only 1% for the computer, and is lower than for the person.

Error rates for the person and the computer are comparable, but the computer could not detect all the fields and records, while the person could. And finally, the error rate for the thesis indexing project on GitHub was five to six percent, according to the author.
37. Most errors were omissions of 1s (they look like ls) and 0s (they look like Os), or mixing up 5 with 6, or 3 with 8. For this project, “11” was not an obvious misspelling of “112”. If you were extracting words, “a1l” would be an obvious misspelling of “all”. So, most projects will have obvious errors, which are easier to correct than the errors in this project. Words are also easier for a person to pull out.

Errors in the human-generated metadata are probably because this is something that involves mental fatigue. The online form for entering metadata was something where the person looks at a number, types it in a box, hits submit, and repeats over and over for hours. Lifeguards who watch a pool get a 15-minute break out of the hour: they watch the pool for 45 minutes, then they get a 15-minute break, then they watch the pool. That break is required by insurance, because if you have to stare for too long, it’s dangerous. The lifeguard will get mental fatigue, and someone can drown while they are looking and they won’t notice it. So, indexing something like this, where it’s meaningless numbers, is not easy for a person. Something where they read, make a decision, and are engaged will probably have fewer errors.

Another reason errors don’t matter as much is that there were some typos on the actual pages. For example, a supplement might be labeled with the earlier supplement number, because the typesetting was not corrected. Typos and errors in printing were not the norm, but they came up regularly. Usually, the clerk who switched out the pages had made a pencil note, and this helped to detect and resolve indexing errors from typos.
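One cheap sanity check that makes letter-for-digit confusions visible (though not omissions like “11” for “112”) is to test whether a numeric field still parses as a number. A hypothetical sketch, with made-up sheet and column choices:

Sub FlagSuspectNumbers()
    Dim ws As Worksheet
    Dim lastRow As Long, r As Long
    Dim v As String

    Set ws = Worksheets("metadata")   ' hypothetical sheet name

    ' Hypothetical numeric field in column E.
    lastRow = ws.Cells(ws.Rows.Count, 5).End(xlUp).Row
    For r = 2 To lastRow
        v = CStr(ws.Cells(r, 5).Value)
        ' A "1" read as "l", or a "0" read as "O", leaves letters in
        ' a field that should be all digits - easy to test for.
        If Len(v) > 0 And Not IsNumeric(v) Then
            ws.Cells(r, 6).Value = "check OCR"
        End If
    Next r
End Sub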
38. For librarians and project managers:

Consider automation. I did not know coding when I started, and actually a big barrier for me was figuring out how to get into the Visual Basic screen in Excel (this eHow tells you how to do it: http://www.ehow.com/how_7175501_turn-developer-tab-excel-2007.html ). It was still faster for me to learn a little Visual Basic, get to the messy spreadsheet, and run the script many times to clean up many messy spreadsheets, than it was to index manually. (So, about twice as much time went to automated metadata, but then that got 90 percent of many of those fields, so it was more efficient.)

Understand IT resources. People who install computers or record video of events are not programmers; those are very different skillsets. If you cannot get a collaborator, try to find an advisor who can assess your skill level and point you to a tool YOU can use (not a perfect tool for a perfect programmer).

For administrators:

If you have never had coding projects before, understand the pacing of coding projects. Some projects look good as they approach completion: each step along the way is sequential and looks “pretty”. Other projects look messy along the way, then clean at the end, and that is the natural progression for scripting. If you have not seen this type of project before, you need to understand that the incremental steps to get to the end tend to look bad. They don’t increment; nothing happens for a long time, and then a lot of things happen fast.

It took me about two weeks of part-time work to write the script to pull out these fields. When you have a script that almost works, you try to run it and get a crash. No matter how close you are, until it is perfect, you get a crash. My supervisors (reference librarians without a technical services background) did not seem to understand this. To get work time for the scripting part of this project, I reported that digitizing material and planning an unrelated workshop took longer than they actually took, and hid some hours of scripting in that, because when I reported that I had spent a week working on scripting, it didn’t go over well, since the script still crashed. Even though data entry went slower, there was no scrutiny on it, because it had incremental results; but, from the assessment, it was much less efficient.

Scripting might be viewed like a renovation project where a wall has severe water damage. When the wall is removed, the room looks really bad during that intermediate stage. Then at the end of the project, the room looks much better, because there is no more damaged drywall. If you were to demand pretty intermediate stages, then all the construction crew could do is repaint and leave the water damage underneath. Many other projects are similar, with ugly steps, so try to put scripting into that category.
39. Jason Cronk provided advice on the project structure. Anna Annino provided assistance with metadata.