PR-285: Leveraging Semantic and Lexical Matching to Improve the Recall of Document Retrieval Systems: A Hybrid Approach
[Saar Kuzi et al., 2020]
Paper link: https://arxiv.org/pdf/2010.01195.pdf
Video presentation link: https://youtu.be/QfkcN4SZ1Po
reviewed by Sunghoon Joo (주성훈)
1. Leveraging Semantic and Lexical Matching to Improve the
Recall of Document Retrieval Systems: A Hybrid Approach
3. 1. Research Background
Information Retrieval (IR)
• Lexical approach
Inverted Index Retrieval task
https://giyatto.tistory.com/2 https://devopedia.org/information-retrieval 1) Retrieval stage 2) re-ranking stage
Lexical approach (e.g., BM25): matches the terms of the query against the terms of each document
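To make the inverted-index and BM25 idea concrete, here is a minimal sketch in plain Python. It is not the Anserini implementation: the whitespace tokenizer, the toy documents, the parameter defaults `k1=1.2` and `b=0.75`, and the IDF formula (one common variant) are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict

def build_index(docs):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    lengths = {}
    for doc_id, text in docs.items():
        tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
        lengths[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            index[term][doc_id] = tf
    return index, lengths

def bm25_search(query, index, lengths, k1=1.2, b=0.75):
    """Score only the documents that share at least one term with the query."""
    n_docs = len(lengths)
    avg_len = sum(lengths.values()) / n_docs
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue
        # One common BM25 IDF variant; always positive.
        idf = math.log(1 + (n_docs - len(postings) + 0.5) / (len(postings) + 0.5))
        for doc_id, tf in postings.items():
            norm = tf + k1 * (1 - b + b * lengths[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / norm
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy collection (hypothetical data).
docs = {
    "d1": "weather related fatalities fell last year",
    "d2": "a tour bus crashed on an icy highway",
    "d3": "lightning deaths were recorded in several states",
}
index, lengths = build_index(docs)
results = bm25_search("weather related fatalities", index, lengths)
```

Only `d1` is returned here, because the index is consulted per query term and documents that share no term with the query never receive a score.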
4. 1. Research Background
Lexical approach …
• Vocabulary mismatch problem
Closed roads and icy highways: a 17-car pileup in Idaho
killed at least one driver,
a tour bus crashed on an icy stretch of a Sierra Nevada highway,
and a 100-car accident occurred near Seattle
Oklahoma and South Carolina each recorded 3 fatalities.
Arizona, Kentucky, Missouri, Utah, and Virginia each had two.
Those recording a single lightning death during the year were
Washington D.C.; Kansas, Montana, North Dakota,
• The lexical approach misses relevant documents that do not contain the query terms.
5. 1. Research Background
Applying a semantic model at the retrieval stage
• Semantic models tend to have lower recall
[CVPR 2013]
• Using neural networks for retrieval had a very high cost
- Query embedding and document embedding
- MIPS (maximum inner product search) for each query
PR-272
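The embedding-based retrieval mentioned above can be illustrated with an exact, brute-force MIPS over a handful of vectors. This is only a sketch: the three-dimensional vectors and document IDs are made up, and a real system would use learned embeddings and an approximate index rather than a full scan.

```python
import heapq

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def mips(query_vec, doc_vecs, k):
    """Exact maximum inner product search: scan every document vector."""
    scored = ((dot(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items())
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]

# Hypothetical precomputed document embeddings.
doc_vecs = {
    "d1": [0.9, 0.1, 0.0],
    "d2": [0.1, 0.8, 0.3],
    "d3": [0.7, 0.2, 0.1],
}
query_vec = [1.0, 0.0, 0.0]  # hypothetical query embedding
top2 = mips(query_vec, doc_vecs, k=2)
```

The scan touches every document for every query, which is the cost that approximate nearest-neighbor methods are designed to reduce.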
9. 2. Methods
The hybrid retrieval approach
• Lexical-based approach: BM25 (Anserini toolkit; https://littlefoxdiary.tistory.com/12).
• Weakly supervised learning: the training data is generated automatically.
• Approximate kNN search keeps system latency low.
• Built from open-source components.
Hybrid system: BM25 and the NN for semantic retrieval each produce a result list of size c
10. 2. Methods
BERT based semantic retrieval model
• BERT architecture : 6 layers, a hidden size of 256, and 4 attention heads
• Training params : Adam optimizer, learning rate of 5e-4 and a batch size of 32 for 5 million training steps.
• Vocabulary : 7500 words
11. 2. Methods
Document – Query dataset
“Weather Related Fatalities”
“Information Retrieval”
…
“Sunghoon Joo”
BM25
50 documents
20 documents
3 documents
Extract 5 sentences
1. Query ( 5 tri-grams bi-grams)
2. BM25 document 10 query
3. query terms 5 : Doc-Query pairing
(BERT ,
.)
12. 2. Methods
Document – Query dataset
“Weather Related Fatalities” Over the last five years, weather-related fatalities are down 19% from 2015 …
Over the last five years, Sunghoon Joo are down 19% from 2015 …
1
0
Over the last five years, traffic related fatalities are down 19% from 2015 …
1
0
1 0.65
Over the last five years, traffic related projects are down 19% from 2015 …
1 0.55
“Information Retrieval” Information dataset (IR) is finding material (usually documents) of an unstructured nature …
1 0.6
13. 2. Methods
Document – Query dataset
• A TREC collection (disks 1&2)
• 441,676 newswire documents
• TREC topics 51-200 are used as queries.
• Training data set: 3.8M bi-gram queries, 1.7M tri-gram queries, and about 1B training examples (passage-query pairs).
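As a rough illustration of how bi-gram and tri-gram queries could be harvested from a collection, the sketch below counts n-grams and keeps the frequent ones. The whitespace tokenizer, the toy documents, and the `min_count` threshold are assumptions for this example, not the settings used in the paper.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def generate_queries(docs, min_count=2):
    """Collect bi-grams and tri-grams that occur at least min_count times."""
    counts = Counter()
    for text in docs:
        tokens = text.lower().split()  # naive tokenizer (assumption)
        for n in (2, 3):
            counts.update(ngrams(tokens, n))
    return sorted(q for q, c in counts.items() if c >= min_count)

# Toy collection (hypothetical data).
docs = [
    "weather related fatalities are down",
    "weather related fatalities rose in the south",
    "information retrieval finds relevant documents",
]
queries = generate_queries(docs)
```

Each generated query would then be paired with passages from retrieved documents to form passage-query training examples.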
14. 2. Methods
Hybrid Merging – RM3
• RM3 is used to merge in the semantic result list.
• Available in the open-source Anserini toolkit.
• Query processing time
Hybrid system with relevance model RM3 *
*Nasreen Abdul-Jaleel et al., UMass at TREC 2004: Novelty and HARD
https://www.cl.cam.ac.uk/teaching/1617/InfoRtrv/lecture7-relevance-feedback.pdf
• RM3 is built from the lexical result list of size c; of the 2c merged documents, c are kept.
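The merging step can be sketched as: take the union of the two result lists, re-score every document, and keep the top c. In the sketch below `score_fn` is a placeholder for whatever re-scoring is used (RM3 in the slides); the toy scores and document IDs are hypothetical, and no actual RM3 computation is performed.

```python
def hybrid_merge(lexical, semantic, score_fn, c):
    """Union two ranked lists, re-score every document, keep the top c."""
    # dict.fromkeys removes duplicates while keeping first-seen order.
    merged = list(dict.fromkeys(lexical + semantic))
    ranked = sorted(merged, key=score_fn, reverse=True)
    return ranked[:c]

lexical = ["d1", "d2", "d3"]    # hypothetical lexical result list (c = 3)
semantic = ["d3", "d4", "d5"]   # hypothetical semantic result list (c = 3)
# Stand-in relevance scores; a real system would compute these with RM3.
toy_scores = {"d1": 0.9, "d2": 0.2, "d3": 0.7, "d4": 0.8, "d5": 0.1}
final = hybrid_merge(lexical, semantic, toy_scores.get, c=3)
```

Here `d4`, found only by the semantic list, displaces the low-scoring `d2` from the lexical list, while the final list keeps the same length c.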
16. 3. Experimental Results
Experiment 1: Hybrid approach vs. Lexical approach
• Lexical approach vs. semantic approach (neural models in the re-ranking stage)
• Hybrid approach vs. lexical approach at list size c (with the semantic approach's list)
Lexical Result List (c) and Semantic Result List (c)
Results in the Lexical Result List that are unrelated to the query are replaced with results from the Semantic Result List
17. 3. Experimental Results
Experiment 2: Hybrid approach
• Is the gain due only to RM3 re-ranking? (a lexical list of 1000 re-ranked down to 500 with RM3)
Semantic result list fixed at 2000 documents → lexical result list size varied across experiments (∈ {500, 1000, 1500, 2000}) → merge (the final number of documents equals that of the initial lexical-based results)
• RM3 Merging : Hybrid approach
• Semantic result set Lexical result list merge ,
21. 3. Experimental Results
50 queries selected → 5 documents per query (from each of lexical and semantic) → build term lists → compute the Jaccard index
• The lexical and semantic approaches retrieve documents with little term overlap (low Jaccard index).
Experiment 4: lexical results vs. semantic results
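For reference, the Jaccard index of two term lists is the size of their intersection divided by the size of their union. The sketch below computes it for two small, made-up term lists; the terms are illustrative only and are not taken from the experiment.

```python
def jaccard(a, b):
    """Jaccard index of two term collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical term lists from lexical and semantic results.
lexical_terms = ["weather", "fatalities", "storm", "lightning"]
semantic_terms = ["fatalities", "accident", "highway", "crash"]
overlap = jaccard(lexical_terms, semantic_terms)  # 1 shared term out of 7 distinct terms
```

A value near 0 means the two result sets use largely different vocabulary, and a value of 1 means identical term sets.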
22. 3. Experimental Results
• Lexical approach (BM25).
• Semantic approach .
50 queries selected → 5 documents per query (from each of lexical and semantic) → length analysis over each set of 250 documents
Experiment 4: lexical results vs. semantic results