Cross-Language Information Retrieval
University of Arizona

Sumin Byeon

Overview
안드로이드 이메일 암호화
-> Matching algorithm (with bilingual corpus database)
-> Android email encryption
-> Google Search
-> Results in English

Background
• Corpus - a collection of written text: single words, multiple words, or even phrases and sentences
• Comparable corpus - a collection of text from a pair of languages referring to the same domain [1]; stored as (source text, target text) pairs
• N-gram - an n-character or n-word slice of a longer string [2]. Here, the term n-gram refers to n-character slices. We use 4-grams (four-grams, or quad-grams)
• Source language - the language of the original phrases
• Target language - the language into which CLIR translates the original phrases

[1]: Picchi, Eugenio, and Carol Peters. Cross-Language Information Retrieval: A System for Comparable Corpus Querying. Vol. 2. N.p.: Springer US, 1998. Print. 1387-5264.
[2]: Cavnar, William B., and John M. Trenkle. "N-Gram-Based Text Categorization." (1994) Print.
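
The n-gram definition can be illustrated with a short sketch. The helper name `ngrams` and the sample string are illustrative assumptions, not taken from the original system; the same slicing covers both n-character and n-word slices.

```python
def ngrams(items, n):
    """Return every contiguous slice of length n from a sequence."""
    return [items[i:i + n] for i in range(len(items) - n + 1)]


text = "global variable example"
char_quad_grams = ngrams(text, 4)        # n-character slices (strings)
word_bigrams = ngrams(text.split(), 2)   # n-word slices (lists of words)

print(char_quad_grams[:3])  # ['glob', 'loba', 'obal']
print(word_bigrams)         # [['global', 'variable'], ['variable', 'example']]
```

A string shorter than n yields no slices, so very short phrases would need separate handling in a real index.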

Motivation
• People want to acquire information even when it is not sufficiently available in their native language
• A survey has shown that people have higher foreign-language proficiency in reading than in writing
• CLIR may bridge the gap between the desire to obtain information and the unavailability or under-availability of that information in the user's native language

Goals
• Allow users to query for domain-specific (i.e., computer science and software engineering) information in their native language
• Present relevant search results in the target language, that is, the language in which the largest amount of information is available

Components
• Domain-specific bilingual corpus extraction from multiple sources
• Corpus indexing
• Querying and string matching

Corpus Extraction

Corpus Indexing
(S, T) -> (i1, h1), (i2, h2), …, (in, hn)

• Quad-grams (k=4)
• Fingerprint overlapping is okay, although it is not the most space-efficient way

Corpus pairs and their fingerprints, shown as i: quad-gram (h):

Java - 자바
    0: Java (20451)
global variable - 전역 변수
    3: bal_ (14870)
    8: aria (14269)
example - 예제
    1: xamp (20451)

[Figure: frequency chart; y-axis "Frequency" from 0 to 50000, x-axis ticks from 1 to 103]
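
The mapping from a corpus pair to (position, hash) fingerprints could be sketched as follows. The slides do not name the hash function or say which quad-grams are kept, so this sketch assumes a 16-bit truncation of CRC-32 and indexes every quad-gram of the first element of each pair; `fingerprints` and `build_index` are hypothetical names, and the hash values will not match the ones on the slide.

```python
import zlib
from collections import defaultdict

K = 4  # quad-grams, as on the slide


def fingerprints(text, k=K):
    """Return (position, hash) pairs for every k-character slice of text."""
    # Assumption: CRC-32 truncated to 16 bits stands in for the unknown hash.
    return [(i, zlib.crc32(text[i:i + k].encode("utf-8")) & 0xFFFF)
            for i in range(len(text) - k + 1)]


def build_index(pairs):
    """Map each hash to the (source, target, position) entries that contain it."""
    index = defaultdict(list)
    for source, target in pairs:
        for position, h in fingerprints(source):
            index[h].append((source, target, position))
    return index


corpus = [("Java", "자바"), ("global variable", "전역 변수"), ("example", "예제")]
index = build_index(corpus)
print(fingerprints("Java"))  # a single (0, hash) pair, since "Java" has 4 characters
```

Keeping every overlapping quad-gram costs space, which is consistent with the slide's remark that overlapping fingerprints are acceptable but not the most space-efficient choice.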

Querying & Matching
Query: Java global variable example

Query fingerprints:
    0: Java (20451)
    1: ava_ (24085)
    …
    8: bal_ (14870)
    …
    13: aria (14269)
    …
    22: xamp (20451)

Matched corpus entries:
Java - 자바
    0: Java (20451)
global variable - 전역 변수
    3: bal_ (14870)
    8: aria (14269)
example - 예제
    1: xamp (20451)
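
A minimal sketch of the lookup step, under the same assumptions as before (truncated CRC-32 as a stand-in hash, every quad-gram indexed, hypothetical function names): each query fingerprint is looked up in the index, and the difference between the query position and the entry position gives the offset at which the corpus entry would start. The final string comparison is an added safeguard against hash collisions and is not described on the slides.

```python
import zlib
from collections import defaultdict


def fingerprints(text, k=4):
    """Return (position, hash) pairs for every k-character slice of text."""
    return [(i, zlib.crc32(text[i:i + k].encode("utf-8")) & 0xFFFF)
            for i in range(len(text) - k + 1)]


def build_index(pairs):
    """Map each hash to the (source, target, position) entries that contain it."""
    index = defaultdict(list)
    for source, target in pairs:
        for pos, h in fingerprints(source):
            index[h].append((source, target, pos))
    return index


def find_matches(query, index):
    """Return sorted (start, source, target) for corpus entries found in query."""
    found = set()
    for q_pos, h in fingerprints(query):
        for source, target, e_pos in index.get(h, []):
            start = q_pos - e_pos  # where the entry would begin in the query
            # Compare the text itself, since different quad-grams can share a hash.
            if start >= 0 and query[start:start + len(source)] == source:
                found.add((start, source, target))
    return sorted(found)


index = build_index([("Java", "자바"), ("global variable", "전역 변수"), ("example", "예제")])
for match in find_matches("Java global variable example", index):
    print(match)
# (0, 'Java', '자바')
# (5, 'global variable', '전역 변수')
# (21, 'example', '예제')
```

The offsets agree with the slide: "bal_" sits at position 8 in the query and position 3 in the entry, and "aria" at 13 and 8, so both point to an entry starting at offset 5.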

Multiple Candidates
• Longest match first
• Confidence: how many times does this comparable corpus pair appear in a set of documents?
• Outcome of matching depends on the domain of the documents stored in the database

Candidates:
global variable - 전역 변수
    3: bal_ (14870)
    8: aria (14269)
global - 세계적인
    0: loba (25848)
variable - 변수
    1: aria (14269)
variable - 가변적인
    1: aria (14269)
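
One way to read the two rules above is as a sort key: prefer the longest matched phrase, and break ties with the confidence count. The sketch below assumes that reading; the `Candidate` type, the function name, and the confidence values are made up for illustration and do not come from the original data.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    source: str
    target: str
    confidence: int  # number of documents containing this pair (made-up values)


def best_candidate(candidates):
    """Prefer the longest source phrase; break ties by higher confidence."""
    return max(candidates, key=lambda c: (len(c.source), c.confidence))


candidates = [
    Candidate("global variable", "전역 변수", 12),
    Candidate("global", "세계적인", 40),
    Candidate("variable", "변수", 30),
    Candidate("variable", "가변적인", 5),
]

print(best_candidate(candidates).target)      # 전역 변수
print(best_candidate(candidates[2:]).target)  # 변수
```

Because confidence is counted over the stored documents, the same query could resolve differently against a database drawn from another domain.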

Indexing and Querying Recap

Query: 자바 전역 변수 예제

Candidate matches:
자바 : Java
전역 : transfer
전역 : all parts (of)
전역 변수 : global variable
변수 : variable
예제 : example

Result: Java global variable example
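
The recap can be reproduced with a greedy longest-match-first pass over the query. This is a sketch under assumptions: it splits the query on whitespace, uses a plain dictionary in place of the fingerprint index, and always takes the first listed translation. `PHRASES` and `translate` are hypothetical names.

```python
# Phrase table copied from the candidate matches on the slide.
PHRASES = {
    "자바": ["Java"],
    "전역": ["transfer", "all parts (of)"],
    "전역 변수": ["global variable"],
    "변수": ["variable"],
    "예제": ["example"],
}


def translate(query, phrases=PHRASES):
    """Greedy longest-match-first translation over whitespace-separated tokens."""
    tokens = query.split()
    output = []
    i = 0
    while i < len(tokens):
        # Try the longest span starting at token i first.
        for j in range(len(tokens), i, -1):
            phrase = " ".join(tokens[i:j])
            if phrase in phrases:
                output.append(phrases[phrase][0])
                i = j
                break
        else:
            output.append(tokens[i])  # no match: keep the source token as-is
            i += 1
    return " ".join(output)


print(translate("자바 전역 변수 예제"))  # Java global variable example
```

Matching 전역 변수 as one phrase avoids the unrelated single-word senses of 전역 ("transfer", "all parts (of)").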

Relationship with Content Addressability

자바 전역 변수 예제

자바 -> Java
전역 변수 -> global variable
예제 -> example

Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Quisque id Java tristique nunc. Vestibulum sit amet tortor
ullamcorper, pretium augue ac, facilisis quam. Ut convallis
suscipit mauris, at porta erat vulputate in. Nulla vitae
consectetur risus. global variable Aenean justo risus, mollis
sed condimentum sed, sagittis eget nisl. Phasellus sem leo,
commodo at dignissim vitae, ullamcorper nec metus. Proin
pretium porta lectus nec example pulvinar. Nulla non
elementum nisi, vel hendrerit quam. Curabitur bibendum
lobortis tincidunt. Proin vel velit porta, tempus ligula a,
interdum leo. Aenean lorem nibh, facilisis ut porta sit amet,
ornare quis ligula.

Evaluation
• Matching
  • Did it translate all the search terms to the target language properly?
  • Did it preserve domain-specific information?
• Searching
  • Hit ratio: # of relevant web pages / # of results on the first page
  • Total number of search results
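
The hit ratio defined above is a simple quotient. The sketch below is only an illustration: the function name is hypothetical, and the relevance judgments themselves still have to be made by hand.

```python
def hit_ratio(relevant, shown):
    """Relevant web pages divided by results shown on the first page."""
    if shown == 0:
        raise ValueError("the first page has no results")
    return relevant / shown


print(hit_ratio(6, 10))  # 0.6
```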

Evaluation
• 재귀 열거 집합 - recursively enumerable sets
  • (3/3, 1/1)
• 배낭 문제 시간 복잡도 - 배낭 issue the time complexity
  • (3/4, 1/2)
• 가상화를 통한 데이터센터 에너지 효율 극대화 - through virtualization datacenter energy efficiency maximization
  • (7/7, 4/4)

Evaluation
• Query in source language “재귀 열거 집합”
  • (6/10, 15,300)
• Query in target language “recursively enumerable sets”
  • (10/10, 105,000)
• Google Translate result “Set of recursive enumeration”
  • (10/10, 1,990,000)

Evaluation
• Query in source language “배낭 문제 시간 복잡도”
  • (10/10, 31,200)
• Query in target language “배낭 issue time complexity”
  • (2/6, 2,270)
• Google Translate result “Knapsack problem, the time complexity”
  • (10/10, 206,000)

Evaluation
• Query in source language “가상화를 통한 데이터센터 에너지 효율 극대화”
  • (5/10, 36,100)
• Query in target language “through virtualization datacenter energy efficiency maximization”
  • (8/10, 264,000)
• Google Translate result “Maximize energy efficiency through data center virtualization”
  • (10/10, 284,000)

Conclusion & Future Work
• Preliminary results look satisfactory
• Machine-translation-based CLIR appears to be more useful in many cases
• Evaluation factors may not reflect the actual quality of the system
• Labor-intensive evaluation process - need for an automated evaluation
• Fuzzy matching based on lexical information (e.g., call, calls)
• Fuzzy matching based on semantic information (e.g., maximize, maximizing, maximization, maximum)
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 

Cross-Language Information Retrieval

  • 3. Background
      • Corpus - a collection of written text; a single word or multiple words, or even phrases and sentences
      • Comparable corpus - a collection of text from pairs of languages referring to the same domain[1]; a (source text, target text) pair
      • N-gram - an n-character or n-word slice of a longer string[2]. We use the term n-gram for n-character slices, and we use 4-grams (four-grams or quad-grams)
      • Source language - the language of the original phrases
      • Target language - the language into which CLIR translates the original phrases
    [1]: Picchi, Eugenio, and Carol Peters. Cross-Language Information Retrieval: A System for Comparable Corpus Querying. Vol. 2. N.p.: Springer US, 1998. Print. 1387-5264.
    [2]: Cavnar, William B., and John M. Trenkle. "N-Gram-Based Text Categorization." (1994) Print.
  • 4. Motivation
      • People want to acquire information even when it is not sufficiently available in their native language
      • A survey has shown that people have a higher foreign-language proficiency level in reading than in writing
      • CLIR may bridge the gap between their desire to obtain information and the unavailability or under-availability of such information in their native language
  • 5. Goals
      • Allow users to query for domain-specific (i.e., computer science and software engineering) information in their native language
      • Present relevant search results in the target language, the language in which the largest amount of information is available
  • 6. Components
      • Domain-specific bilingual corpus extraction from multiple sources
      • Corpus indexing
      • Querying and string matching
  • 8. Corpus Indexing
      • (S, T) -> (i1, h1), (i2, h2), …, (in, hn)
      • Quad-grams (k=4)
      • Fingerprint overlapping is okay, although it is not the most space-efficient way
      • Example fingerprints, written as position: quad-gram (hash), with "_" marking a space:
          • Java / 자바 - 0: Java (20451)
          • global variable / 전역 변수 - 3: bal_ (14870), 8: aria (14269)
          • example / 예제 - 1: xamp (20451)
      • [Figure: frequency histogram; vertical axis "Frequency", 0 to 50,000]
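
The mapping from a corpus pair to (position, hash) fingerprints can be sketched roughly as follows. This is only an illustration under stated assumptions: the slides do not say which hash function is used, how fingerprints are selected, or which side of the pair is keyed, so the 16-bit CRC, the choice to keep every quad-gram, and the names `quad_grams`, `fingerprint`, and `build_index` are all hypothetical. The hash values it produces will not match the numbers on the slide.

```python
import zlib

K = 4  # quad-grams (k=4), as stated on the slide


def quad_grams(text, k=K):
    """Return (position, gram) for every k-character slice of text.

    Spaces are shown as '_' to mirror the slide's notation.
    """
    s = text.replace(" ", "_")
    return [(i, s[i:i + k]) for i in range(len(s) - k + 1)]


def fingerprint(text, k=K):
    """Return (position, hash) pairs for text.

    Assumption: a CRC-32 truncated to 16 bits stands in for the
    unspecified hash function of the original system.
    """
    return [(i, zlib.crc32(g.encode("utf-8")) & 0xFFFF)
            for i, g in quad_grams(text, k)]


def build_index(pairs, k=K):
    """Map each hash to the corpus pairs whose first text contains it.

    Assumption: the first element of each pair is the side that gets
    fingerprinted; the slides do not state this explicitly.
    """
    index = {}
    for keyed_text, other_text in pairs:
        for _, h in fingerprint(keyed_text, k):
            index.setdefault(h, set()).add((keyed_text, other_text))
    return index


if __name__ == "__main__":
    print(quad_grams("global variable"))
    index = build_index([("Java", "자바"), ("global variable", "전역 변수")])
    print(len(index), "distinct hashes indexed")
```

Keeping every quad-gram is the simplest choice; the slide's examples list only a few positions per phrase, which suggests the original system stores a selected subset.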
  • 9. Querying & Matching
      • Query: Java global variable example
      • Query fingerprints: 0: Java (20451), 1: ava_ (24085), …, 8: bal_ (14870), …, 13: aria (14269), …, 22: xamp (20451)
      • Matching corpus entries:
          • Java / 자바 - 0: Java (20451)
          • global variable / 전역 변수 - 3: bal_ (14870), 8: aria (14269)
          • example / 예제 - 1: xamp (20451)
  • 10. Multiple Candidates
      • Longest match first
      • Confidence: how many times does this comparable corpus pair appear in a set of documents?
      • Outcome of matching depends on the domain of the documents stored in the database
      • Candidates for "global variable":
          • global variable / 전역 변수 - 3: bal_ (14870), 8: aria (14269)
          • global / 세계적인 - 0: loba (25848)
          • variable / 변수 - 1: aria (14269)
          • variable / 가변적인 - 1: aria (14269)
  • 11. Indexing and Querying Recap
      • Query: 자바 전역 변수 예제
      • Corpus entries:
          • 자바 : Java
          • 전역 : transfer
          • 전역 : all parts (of)
          • 전역 변수 : global variable
          • 변수 : variable
          • 예제 : example
      • Result: Java global variable example
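
A minimal sketch of the "longest match first" rule with a confidence tie-break, using the entries from the recap. It makes several assumptions: lookup is by exact whitespace-separated tokens rather than by quad-gram fingerprints, the occurrence counts in `LEXICON` are invented for illustration, and the function name `translate` is hypothetical. Terms with no entry are passed through untranslated.

```python
def translate(query, lexicon):
    """Translate query greedily, always trying the longest phrase first.

    lexicon maps a source phrase to a list of (translation, count) pairs;
    count plays the role of the confidence on the slide (how often the
    pair appears in the document set).
    """
    tokens = query.split()
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest remaining span first, then shrink it.
        for j in range(len(tokens), i, -1):
            phrase = " ".join(tokens[i:j])
            candidates = lexicon.get(phrase)
            if candidates:
                # Among several candidates, keep the most frequent one.
                best = max(candidates, key=lambda c: c[1])[0]
                out.append(best)
                i = j
                break
        else:
            # No entry at all: keep the source term as-is.
            out.append(tokens[i])
            i += 1
    return " ".join(out)


# Entries from the recap slide; the counts are hypothetical.
LEXICON = {
    "자바": [("Java", 5)],
    "전역": [("transfer", 1), ("all parts (of)", 1)],
    "전역 변수": [("global variable", 3)],
    "변수": [("variable", 4)],
    "예제": [("example", 2)],
}

if __name__ == "__main__":
    print(translate("자바 전역 변수 예제", LEXICON))
```

Because the two-word entry 전역 변수 is tried before the single word 전역, the ambiguous candidates "transfer" and "all parts (of)" are never considered for this query.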
  • 12. Relationship with Content Addressability
      • Query: 자바 전역 변수 예제
      • 자바 - Java; 전역 변수 - global variable; 예제 - example
      • Sample document containing the translated terms:
        Lorem ipsum dolor sit amet, consectetur adipiscing elit. Quisque id Java tristique nunc. Vestibulum sit amet tortor ullamcorper, pretium augue ac, facilisis quam. Ut convallis suscipit mauris, at porta erat vulputate in. Nulla vitae consectetur risus. global variable Aenean justo risus, mollis sed condimentum sed, sagittis eget nisl. Phasellus sem leo, commodo at dignissim vitae, ullamcorper nec metus. Proin pretium porta lectus nec example pulvinar. Nulla non elementum nisi, vel hendrerit quam. Curabitur bibendum lobortis tincidunt. Proin vel velit porta, tempus ligula a, interdum leo. Aenean lorem nibh, facilisis ut porta sit amet, ornare quis ligula.
  • 13. Evaluation
      • Matching
          • Did it translate all the search terms to the target language properly?
          • Did it preserve domain-specific information?
      • Searching
          • Hit ratio: # of relevant web pages / # of results on the first page
          • Total number of search results
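
The hit ratio defined above is a simple fraction; a small sketch, assuming relevance has already been judged by hand for each result on the first page (the function name `hit_ratio` and the list-of-booleans input are illustrative choices, not part of the original system):

```python
def hit_ratio(relevance):
    """Return (# of relevant results) / (# of results on the first page).

    relevance is a list with one boolean per first-page result,
    True when the page was judged relevant.
    """
    if not relevance:
        raise ValueError("the first page has no results")
    return sum(relevance) / len(relevance)


if __name__ == "__main__":
    # A first page with 6 results, 2 of them judged relevant.
    first_page = [True, True, False, False, False, False]
    print(hit_ratio(first_page))
```

The denominator is the number of results actually shown on the first page, so a query that returns fewer than ten results is scored out of that smaller number.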
  • 14. Evaluation
      • 재귀 열거 집합 - recursively enumerable sets
          • (3/3, 1/1)
      • 배낭 문제 시간 복잡도 - 배낭 issue the time complexity
          • (3/4, 1/2)
      • 가상화를 통한 데이터센터 에너지 효율 극대화 - through virtualization datacenter energy efficiency maximization
          • (7/7, 4/4)
  • 15. Evaluation
      • Query in source language “재귀 열거 집합”
          • (6/10, 15,300)
      • Query in target language “recursively enumerable sets”
          • (10/10, 105,000)
      • Google Translate result “Set of recursive enumeration”
          • (10/10, 1,990,000)
  • 16. Evaluation
      • Query in source language “배낭 문제 시간 복잡도”
          • (10/10, 31,200)
      • Query in target language “배낭 issue time complexity”
          • (2/6, 2,270)
      • Google Translate result “Knapsack problem, the time complexity”
          • (10/10, 206,000)
  • 17. Evaluation
      • Query in source language “가상화를 통한 데이터센터 에너지 효율 극대화”
          • (5/10, 36,100)
      • Query in target language “through virtualization datacenter energy efficiency maximization”
          • (8/10, 264,000)
      • Google Translate result “Maximize energy efficiency through data center virtualization”
          • (10/10, 284,000)
  • 18. Conclusion & Future Work
      • Preliminary results look satisfactory
      • Machine-translation-based CLIR appears to be more useful in many cases
      • Evaluation factors may not reflect the actual quality of the system
      • Labor-intensive evaluation process - need for an automated evaluation
      • Fuzzy matching based on lexical information (e.g., call, calls)
      • Fuzzy matching based on semantic information (e.g., maximize, maximizing, maximization, maximum)