Leveraging Semantic and Lexical Matching to Improve the
Recall of Document Retrieval Systems: A Hybrid Approach
PR-285
https://arxiv.org/pdf/2010.01195.pdf
1. Research Background
Information Retrieval (IR)
• Lexical approach
[Figures: inverted index; retrieval task with 1) retrieval stage and 2) re-ranking stage. Sources: https://giyatto.tistory.com/2, https://devopedia.org/information-retrieval]
Lexical approach (e.g., BM25): matches query terms against document terms
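The lexical matching described above can be sketched as BM25 scoring over a small inverted index. This is a minimal illustration, not the Anserini implementation: whitespace tokenization, the toy documents, and the k1 and b values are all assumptions, and the idf term follows the common Lucene-style variant.

```python
import math
from collections import Counter, defaultdict

def build_index(docs):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    lengths = {}
    for doc_id, text in docs.items():
        tokens = text.lower().split()  # assumption: whitespace tokenization
        lengths[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            index[term][doc_id] = tf
    return index, lengths

def bm25_search(query, index, lengths, k1=0.9, b=0.4):
    """Score only documents that share at least one term with the query."""
    n_docs = len(lengths)
    avg_len = sum(lengths.values()) / n_docs
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue
        df = len(postings)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for doc_id, tf in postings.items():
            norm = tf + k1 * (1 - b + b * lengths[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / norm
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical toy collection
docs = {
    "d1": "weather related fatalities fell last year",
    "d2": "one driver died in a pileup on an icy highway",
    "d3": "stock markets rose",
}
index, lengths = build_index(docs)
results = bm25_search("weather related fatalities", index, lengths)
```

In this toy collection d2 is topically related to the query but shares no term with it, so it never receives a score. That is the vocabulary mismatch problem in miniature.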
Lexical approach …
• Vocabulary mismatch problem
With roads closed and highways iced over, at least one driver died in a 17-vehicle pileup in Idaho,
a tour bus crashed on an icy stretch of a Sierra Nevada highway,
and a 100-car accident occurred near Seattle.
Oklahoma and South Carolina each recorded 3 fatalities.
Arizona, Kentucky, Missouri, Utah, and Virginia each had two.
Those recording a single lightning death during the year were
Washington D.C.; Kansas, Montana, North Dakota,
• The lexical approach retrieves only documents that contain the query terms.
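Semantic models in the retrieval stage
• Semantic models tend to have lower recall
• Using neural networks for retrieval had a very high cost
- Query embeddings are compared against document embeddings
- MIPS (maximum inner product search) finds the documents for a query embedding (see PR-272; figure from [CVPR 2013])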
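To make the cost argument concrete, here is a brute-force maximum inner product search (MIPS) sketch. The vectors and document ids are made up for illustration. The exhaustive scan over every document embedding is what approximate methods such as quantization try to avoid.

```python
def inner_product(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def mips(query_vec, doc_vecs, top_k=2):
    """Exhaustive MIPS: score every document, keep the top_k highest."""
    scored = [(doc_id, inner_product(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]

# Hypothetical 3-dimensional embeddings
doc_vecs = {
    "d1": [0.9, 0.1, 0.0],
    "d2": [0.7, 0.3, 0.1],
    "d3": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.2, 0.0]
top = mips(query_vec, doc_vecs)
```

The scan costs one inner product per document per query, which grows linearly with collection size.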
Related work
• Semantic approach
- Optimized Product Quantization [CVPR 2013]
• Lexical-semantic hybrid approach
- BERT-based re-ranking models (query-document relevance scored with BERT)
- QA systems, conversational agents, and product search: lexical approach
- BERT combined with an inverted index
- Neural-network query expansion
- Latent Semantic Indexing (LSI)
- Neural-network-based document embedding -> kNN-based search
Objective & Approach : a hybrid retrieval approach
• We propose a lexical-semantic hybrid retrieval approach.
• Commercial system hybrid
• end-to-end weakly supervised approach
• lexical-only approach
• Lexical model, semantic model, combination model
2. Methods
The hybrid retrieval approach
• Lexical-based approach: BM25 (Anserini toolkit); see https://littlefoxdiary.tistory.com/12
• Weakly supervised learning: no labeled training data is needed.
• Approximate kNN search keeps system latency low.
• Built on open-source components.
[Figure: hybrid system combining BM25 with a neural network (NN) semantic retrieval model; list size c]
BERT based semantic retrieval model
• BERT architecture : 6 layers, a hidden size of 256, and 4 attention heads
• Training params : Adam optimizer, learning rate of 5e-4 and a batch size of 32 for 5 million training steps.
• Vocabulary : 7500 words
Document – Query dataset
Example queries: “Weather Related Fatalities”, “Information Retrieval”, …, “Sunghoon Joo”
BM25: 50 documents, 20 documents, 3 documents
Extract 5 sentences
1. Query ( 5 tri-grams bi-grams)
2. BM25 document 10 query
3. query terms 5 : Doc-Query pairing
(BERT ,
.)
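The bi-gram and tri-gram queries mentioned in the steps above can be enumerated with a sliding window. This sketch only lists candidate n-grams from a piece of text. How the authors pick which n-grams become queries is not recoverable from the slide, so no selection step is shown, and the function names are hypothetical.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def candidate_queries(text):
    """Enumerate bi-gram and tri-gram query candidates from raw text."""
    tokens = text.lower().split()  # assumption: whitespace tokenization
    return {"bi-grams": ngrams(tokens, 2), "tri-grams": ngrams(tokens, 3)}

queries = candidate_queries("weather related fatalities are down")
```

Each candidate could then be issued to BM25 to collect documents for weakly supervised document-query pairs.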
Document – Query dataset
“Weather Related Fatalities” Over the last five years, weather-related fatalities are down 19% from 2015 …
Over the last five years, Sunghoon Joo are down 19% from 2015 …
1
0
Over the last five years, traffic related fatalities are down 19% from 2015 …
1
0
1 0.65
Over the last five years, traffic related projects are down 19% from 2015 …
1 0.55
“Information Retrieval” Information dataset (IR) is finding material (usually documents) of an unstructured nature …
1 0.6
Document – Query dataset
• A TREC collection (disks 1&2)
• 441,676 news-wire documents
• Queries: TREC topics 51-200.
• Training data set: 3.8M bi-gram queries, 1.7M tri-gram queries, and about 1B training examples (passage-query pairs).
Hybrid Merging – RM3
• RM3 is applied to the semantic result list.
• Open source in the Anserini toolkit.
• Query processing time
[Figure: hybrid system with relevance model RM3*; list size c]
*Nasreen Abdul-Jaleel et al., UMass at TREC 2004: Novelty and HARD
https://www.cl.cam.ac.uk/teaching/1617/InfoRtrv/lecture7-relevance-feedback.pdf
• An RM3 model is induced from the lexical result list of size c, and c documents are kept from the 2c candidates.
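The merging idea can be sketched with a simplified relevance model. This is not the full RM3: RM3 additionally interpolates the feedback model with the original query and weights feedback documents by their retrieval scores. Here a plain unigram distribution is estimated from the lexical result list and used to score the pooled candidates. All document texts and names are hypothetical.

```python
from collections import Counter

def relevance_model(lexical_docs):
    """Unigram term distribution estimated from the lexical result list."""
    counts = Counter()
    for text in lexical_docs.values():
        counts.update(text.lower().split())
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}

def merge(lexical_docs, semantic_docs, c):
    """Score the pooled candidates (up to 2c) and keep the top c doc ids."""
    model = relevance_model(lexical_docs)
    pool = {**semantic_docs, **lexical_docs}

    def score(text):
        tokens = text.lower().split()
        # Length-normalized sum of term probabilities under the model
        return sum(model.get(t, 0.0) for t in tokens) / max(len(tokens), 1)

    ranked = sorted(pool, key=lambda doc_id: score(pool[doc_id]), reverse=True)
    return ranked[:c]

# Hypothetical result lists of size c = 2
lexical_docs = {
    "l1": "weather fatalities report",
    "l2": "weather channel schedule tonight",
}
semantic_docs = {
    "s1": "weather fatalities",
    "s2": "stock market news",
}
merged = merge(lexical_docs, semantic_docs, c=2)
```

In this toy case the semantic document s1 displaces the weaker lexical document l2, while the final list keeps the original size c.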
3. Experimental Results
Experiment 1: Hybrid approach vs. lexical approach
• Lexical approach semantic approach (neural model re-ranking stage )
• Hybrid approach Lexical approach c (semantic approach list )
[Figure: lexical result list (c) and semantic result list (c)]
Results in the lexical result list that are unrelated to the query are replaced with results from the semantic result list.
Experiment 2: Hybrid approach
• RM3 re-ranking ? (1000 lexical list 500 RM3 )
Semantic result list fixed at 2000 documents → experiments with varying lexical result list sizes (∈ {500, 1000, 1500, 2000}) → merge (the final number of documents is kept equal to the initial lexical-based results)
• RM3 Merging : Hybrid approach
• Semantic result set Lexical result list merge ,
Experiment 3: Query hybrid approach
• queries hybrid approach (40%) (50%) .
• Hybrid approach query robust
• Query set (Q1) Hybrid approach
.
Query sets Q1, Q2, Q3, and Q4 are formed according to retrieval performance
df (document frequency): high for common terms (the, an, a …)
idf: inverse document frequency
• Query Hybrid approach
(Neural net. )
• Query (high idf) lexical model
, hybrid approach
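The df and idf quantities used in this analysis can be computed as follows. This is a minimal sketch using the plain idf = log(N / df) form on a made-up collection. The slides do not say which idf variant the authors use.

```python
import math

def idf_table(docs):
    """Map each term to log(N / df), where df counts documents containing it."""
    n_docs = len(docs)
    df = {}
    for text in docs:
        for term in set(text.lower().split()):
            df[term] = df.get(term, 0) + 1
    return {term: math.log(n_docs / count) for term, count in df.items()}

# Hypothetical three-document collection
docs = [
    "the storm closed the highway",
    "the market rose",
    "the bus crashed on the highway",
]
idf = idf_table(docs)
```

A term that appears in every document, such as "the", gets an idf of zero, while a rare term such as "storm" gets the highest value.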
Experiment 4: Lexical results vs. semantic results
• Hybrid approach topic coverage .
• Lexical approach Semantic approach .
Tf-idf vectorization → t-SNE visualization
[Figure: panels (a), (b), (c)]
Select 50 queries → 5 documents per query (lexical and semantic each) → build term lists → compute the Jaccard index
• The lexical and semantic approaches retrieve documents with little term overlap (low Jaccard index).
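The Jaccard index between two term lists is the size of their intersection divided by the size of their union. A minimal sketch with hypothetical term lists:

```python
def jaccard(terms_a, terms_b):
    """Jaccard index of two term collections; 0.0 if both are empty."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical term lists from lexical and semantic results for one query
lexical_terms = "weather related fatalities report".split()
semantic_terms = "storm deaths weather".split()
overlap = jaccard(lexical_terms, semantic_terms)
```

Here only "weather" is shared out of six distinct terms, so the index is 1/6, the kind of low value the slide refers to.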
• Lexical approach (BM25).
• Semantic approach .
Select 50 queries → 5 documents per query (lexical and semantic each) → length analysis over the 250 documents from each approach
4. Conclusion
Thank you.
• Retrieval stage Lexical approach Semantic
approach .
• Hybrid approach ,
Hybrid approach Lexical, Semantic approach
.
• Future works:
1) Hybrid approach
2)
3) QA system, recommendation, conversational agents information
retrieval task

PR-285 Leveraging Semantic and Lexical Matching to Improve the Recall of Document Retrieval Systems: A Hybrid Approach

  • 1. Leveraging Semantic and Lexical Matching to Improve the Recall of Document Retrieval Systems: A Hybrid Approach PR-285 https://www.flaticon.com/kr/authors/freepik https://arxiv.org/pdf/2010.01195.pdf
  • 3. 1. Research Background: Information Retrieval (IR) • Lexical approach • Inverted index; retrieval task: 1) retrieval stage, 2) re-ranking stage (https://giyatto.tistory.com/2, https://devopedia.org/information-retrieval) • Lexical approach (e.g. BM25): matches query terms against document terms
  • 4. 1. Research Background: Lexical approach … • Vocabulary mismatch problem • Example passages: With roads closed and highways iced over, at least one driver died in a 17-vehicle pileup in Idaho, a tour bus crashed on an icy stretch of a Sierra Nevada highway, and a 100-vehicle accident occurred near Seattle. / Oklahoma and South Carolina each recorded three fatalities. Arizona, Kentucky, Missouri, Utah, and Virginia each had two. Those recording a single lightning death for the year were Washington D.C.; Kansas, Montana, North Dakota, … • The lexical approach depends on the query terms.
  • 5. 1. Research Background: Retrieval stage, semantic model • Semantic models tend to have lower recall [CVPR 2013] • Using neural networks for retrieval had a very high cost - Query embedding, document embedding, MIPS (maximum inner product search), query (PR-272)
  • 6. 1. Research Background: Related work • Semantic approach: Optimized Product Quantization [CVPR 2013] • Lexical-semantic hybrid approach - BERT-based re-ranking models (Q-D relevance score, BERT); QA systems, conversational agents, and product search; lexical approach - BERT, inverted index - Neural network query expansion: query - Latent Semantic Indexing (LSI) - Neural network based document embedding -> kNN based search
  • 7. 1. Research Background: Objective & Approach: a hybrid retrieval approach • We propose a lexical-semantic hybrid retrieval approach. • Commercial system, hybrid • End-to-end weakly supervised approach • Lexical-only approach • Lexical model, semantic model, combination model
  • 9. 2. Methods: The hybrid retrieval approach • Lexical based approach (BM25 (Anserini toolkit); https://littlefoxdiary.tistory.com/12). • Weakly supervised learning, training data. • Approximate kNN search, system latency. • Open-source. • Figure: hybrid system (NN for semantic retrieval model, BM25)
  • 10. 2. Methods: BERT based semantic retrieval model (NN for semantic retrieval model) • BERT architecture: 6 layers, a hidden size of 256, and 4 attention heads • Training params: Adam optimizer, learning rate of 5e-4, and a batch size of 32 for 5 million training steps • Vocabulary: 7500 words
  • 11. 2. Methods: Document-Query dataset • Example queries: “Weather Related Fatalities”, “Information Retrieval”, …, “Sunghoon Joo” • BM25: 50 documents, 20 documents, 3 documents; extract 5 sentences • 1. Query (5, tri-grams, bi-grams) 2. BM25, document, 10, query 3. Query terms, 5: Doc-Query pairing (BERT)
  • 12. 2. Methods: Document-Query dataset • “Weather Related Fatalities”: Over the last five years, weather-related fatalities are down 19% from 2015 … Over the last five years, Sunghoon Joo are down 19% from 2015 … 1 0 Over the last five years, traffic related fatalities are down 19% from 2015 … 1 0 1 0.65 Over the last five years, traffic related projects are down 19% from 2015 … 1 0.55 • “Information Retrieval”: Information dataset (IR) is finding material (usually documents) of an unstructured nature … 1 0.6
  • 13. 2. Methods: Document-Query dataset • A TREC collection (disks 1&2) • 441,676 news-wire documents • TREC topics 51-200 used as queries. • Training data set: 3.8M bi-gram queries, 1.7M tri-gram queries, and about 1B training examples (passage-query pairs).
  • 14. 2. Methods: Hybrid Merging – RM3 • RM3, semantic result list. • Anserini toolkit, open source. • Query processing time • Figure: hybrid system with relevance model RM3* • c, Lexical Result List, RM3, 2c, c. • *Nasreen Abdul-Jaleel et al., UMass at TREC 2004: Novelty and HARD; https://www.cl.cam.ac.uk/teaching/1617/InfoRtrv/lecture7-relevance-feedback.pdf
  • 16. 3. Experimental Results: Experiment 1: Hybrid approach vs. lexical approach • Lexical approach, semantic approach (neural model, re-ranking stage) • Hybrid approach, lexical approach, c (semantic approach list) • Lexical Result List (c), Semantic Result List (c): results in the Lexical Result List that are unrelated to the query are replaced with results from the Semantic Result List
  • 17. 3. Experimental Results: Experiment 2: Hybrid approach • RM3 re-ranking? (1000 lexical list, 500, RM3) • Semantic result list fixed at 2000 → experiments vary the lexical result list size (∈ {500, 1000, 1500, 2000}) → Merge (the final number of documents equals that of the initial lexical based results) • RM3 Merging: Hybrid approach • Semantic result set, Lexical result list, merge
  • 18. 3. Experimental Results: Experiment 3: Query, hybrid approach • Queries, hybrid approach (40%) (50%). • Hybrid approach, query, robust
  • 19. 3. Experimental Results: Experiment 3: Query, hybrid approach • Query set (Q1), hybrid approach. • Query sets Q1, Q2, Q3, and Q4 are formed according to retrieval performance • df (document frequency): (the, an, a, …); idf • Query, hybrid approach (neural net.) • Query (high idf), lexical model, hybrid approach
  • 20. 3. Experimental Results: Experiment 4: lexical result, semantic result • Hybrid approach, topic coverage. • Lexical approach, semantic approach. • Tf-idf vectorization → t-SNE visualization (panels (a), (b), (c))
  • 21. 3. Experimental Results: Experiment 4: lexical result, semantic result • Select 50 queries → 5 documents per query (for lexical and semantic each) → build term lists → compute the Jaccard index • Lexical approach, semantic approach (low Jaccard index).
  • 22. 3. Experimental Results: Experiment 4: lexical result, semantic result • Lexical approach (BM25). • Semantic approach. • Select 50 queries → 5 documents per query (for lexical and semantic each) → length analysis over the 250 documents of each
  • 24. 4. Conclusions • Retrieval stage, lexical approach, semantic approach. • Hybrid approach; hybrid approach, lexical, semantic approach. • Future works: 1) Hybrid approach 2) 3) QA system, recommendation, conversational agents, information retrieval task • Thank you.
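
The merging step from the Hybrid Merging slide and the Jaccard comparison from Experiment 4 could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names `merge_result_lists` and `jaccard`, the toy document ids, and the `toy_scores` table are hypothetical, and the `score` callback is a stand-in for a relevance model such as RM3, which is not implemented here. The sketch pools a lexical and a semantic result list, re-scores the pool, keeps the top c documents so that the final list has the same size as the initial lexical list, and measures the overlap between two term lists with the Jaccard index.

```python
from typing import Callable, Iterable, List


def merge_result_lists(
    lexical: List[str],
    semantic: List[str],
    score: Callable[[str], float],
    c: int,
) -> List[str]:
    """Pool two ranked lists of document ids, re-score them, keep the top c.

    `score` is a placeholder for a relevance model (assumption: higher is
    better); it is NOT an RM3 implementation.
    """
    # dict.fromkeys drops duplicate ids while preserving first-seen order.
    pool = list(dict.fromkeys(lexical + semantic))
    # sorted() is stable, so tied documents keep their pooled order
    # (lexical results come before semantic ones).
    ranked = sorted(pool, key=score, reverse=True)
    return ranked[:c]


def jaccard(terms_a: Iterable[str], terms_b: Iterable[str]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| of two term lists."""
    set_a, set_b = set(terms_a), set(terms_b)
    union = set_a | set_b
    if not union:
        # Convention chosen for this sketch: two empty lists have overlap 0.
        return 0.0
    return len(set_a & set_b) / len(union)


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    lexical = ["d1", "d2", "d3"]
    semantic = ["d3", "d4", "d5"]
    toy_scores = {"d1": 0.9, "d2": 0.1, "d3": 0.8, "d4": 0.7, "d5": 0.2}

    merged = merge_result_lists(lexical, semantic, lambda d: toy_scores[d], c=3)
    print(merged)  # ['d1', 'd3', 'd4']

    print(jaccard(["storm", "road", "ice"], ["storm", "fatality"]))  # 0.25
```

In the toy run, the low-scoring lexical result `d2` is displaced by the semantic result `d4`, which mirrors the idea that lexical results unrelated to the query are replaced from the semantic list. A low Jaccard value, as reported in Experiment 4, indicates that the two approaches retrieve documents with largely different terms.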