IMPROVED QUERY REFORMULATION FOR
CONCEPT LOCATION USING CODERANK AND
DOCUMENT STRUCTURES
Mohammad Masudur Rahman, Chanchal K. Roy
Department of Computer Science
University of Saskatchewan, Canada
International Conference on Automated Software Engineering
(ASE 2017), Urbana-Champaign, IL, USA
AN EXAMPLE CHANGE REQUEST
Field Content
Issue ID 31110
Product eclipse.jdt.debug
Title Debbugger Source Lookup does not work with variables
Description In the Debugger Source Lookup dialog I can also select
variables for source lookup. (Advanced... > Add
Variables). I selected the variable which points to the
archive containing the source file for the type, but the
debugger still claims that he cannot find the source
SEARCH KEYWORD SELECTION
CHANGE REQUEST TO CODE MAPPING
BASELINE SEARCH QUERIES
Technique | Query | QE
Baseline | debugger source lookup | 79
Baseline | debugger source lookup work variables | 77
Baseline query
Baseline + Expansion terms
Pseudo-relevance Feedback
TRADITIONAL QUERY REFORMULATIONS
Technique | Reformulated Query | QE
RSV 1990 | debugger source lookup work variables + launch configuration jdt java debug | 30
Sisman & Kak 2013 | debugger source lookup work variables + test exception suite core code | 51
Refoqus 2013 | debugger source lookup work variables + launch jdt configuration classpath project | 12
Technique | Query | QE
Baseline | debugger source lookup | 79
Baseline | debugger source lookup work variables | 77
BIG PICTURE: TERM WEIGHTING
$$TF\text{-}IDF(t) = \sum_{d \in D_{RF}} \left(1 + \log(tf_{t,d})\right) \times \log\frac{|D|}{n_t}$$
Baseline query
Baseline + Expansion terms
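The expansion step on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the conventional TF-IDF form, (1 + log tf) × log(N / df), summed over the feedback documents, and the names `tfidf_scores` and `expand_query` are hypothetical.

```python
import math
from collections import Counter


def tfidf_scores(feedback_docs, corpus_docs):
    """Sum a conventional TF-IDF weight for each term over the feedback documents."""
    n_docs = len(corpus_docs)
    # Document frequency: number of corpus documents that contain each term.
    doc_freq = Counter()
    for doc in corpus_docs:
        doc_freq.update(set(doc))

    scores = Counter()
    for doc in feedback_docs:
        for term, freq in Counter(doc).items():
            if doc_freq[term] == 0:
                continue  # term never seen in the corpus, so no IDF is defined
            idf = math.log(n_docs / doc_freq[term])
            scores[term] += (1 + math.log(freq)) * idf
    return scores


def expand_query(baseline_terms, feedback_docs, corpus_docs, k=5):
    """Append the k highest-weighted feedback terms that the query lacks."""
    scores = tfidf_scores(feedback_docs, corpus_docs)
    ranked = sorted(scores, key=scores.get, reverse=True)
    expansion = [t for t in ranked if t not in baseline_terms][:k]
    return list(baseline_terms) + expansion


# Toy corpus (assumed data): each document is a list of already-split terms.
corpus = [
    ["launch", "classpath", "launch"],
    ["launch", "debug"],
    ["test", "suite"],
    ["test", "debug"],
]
feedback = corpus[:2]  # pretend these were the top results of the baseline query
print(expand_query(["debugger", "source"], feedback, corpus, k=2))
```

Here the feedback documents stand in for the top results of the baseline query, which is the usual pseudo-relevance feedback setup.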
BIG PICTURE: TERM WEIGHTING
• Different semantics
• Different structures
OUR CONTRIBUTIONS (2)
• Novel term weighting method: CodeRank
• Novel query reformulation technique: ACER
CODERANK: TERM WEIGHTING FOR SOURCE
CODE TERMS
CODERANK CALCULATION: STEP I
CODERANK CALCULATION: STEP II
resolveRuntimeClasspathEntry
Resolve Runtime Classpath Entry
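The identifier split shown above can be sketched with a regular expression. This is an illustrative helper under assumed rules (camelCase, acronym, digit and underscore boundaries); `split_identifier` is a hypothetical name, and the slides do not say how the authors' tool tokenizes identifiers.

```python
import re

# Matches, in order: an acronym followed by a capitalized word ("XML" in
# "XMLParser"), a capitalized or lowercase word, a trailing acronym, digits.
_TOKEN = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def split_identifier(identifier):
    """Split a camelCase or snake_case identifier into lowercase terms."""
    terms = []
    for chunk in identifier.split("_"):
        terms.extend(_TOKEN.findall(chunk))
    return [term.lower() for term in terms]


print(split_identifier("resolveRuntimeClasspathEntry"))
# ['resolve', 'runtime', 'classpath', 'entry']
```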
CODERANK CALCULATION: STEP III
$$S(V_i) = (1 - \phi) + \phi \sum_{j \in In(V_i)} \frac{S(V_j)}{|Out(V_j)|} \quad (0 \le \phi \le 1)$$
Most important face
in this crowd
1. resolve
2. required
3. launch
4. classpath
5. runtime
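The recursive score on this slide can be computed by simple iteration, in the style of PageRank. The sketch below rests on assumptions the slides do not confirm: terms adjacent in a sequence are linked in both directions, the damping value 0.85 is only the conventional PageRank default, and `code_rank` is a hypothetical name.

```python
from collections import defaultdict


def code_rank(term_sequences, phi=0.85, max_iterations=100, tolerance=1e-6):
    """Iterate S(V_i) = (1 - phi) + phi * sum(S(V_j) / |Out(V_j)|) until stable."""
    # Assumption: adjacent terms are linked both ways, so In(V) and Out(V)
    # are the same neighbour set for every term.
    neighbours = defaultdict(set)
    for sequence in term_sequences:
        for left, right in zip(sequence, sequence[1:]):
            if left != right:
                neighbours[left].add(right)
                neighbours[right].add(left)
    if not neighbours:
        return {}

    scores = {term: 1.0 for term in neighbours}
    for _ in range(max_iterations):
        updated = {}
        for term, linked in neighbours.items():
            # Each neighbour shares its score equally among its own links.
            votes = sum(scores[other] / len(neighbours[other]) for other in linked)
            updated[term] = (1 - phi) + phi * votes
        change = max(abs(updated[term] - scores[term]) for term in scores)
        scores = updated
        if change < tolerance:
            break
    return scores


# Toy term sequences (assumed data), e.g. split method signatures.
signatures = [
    ["resolve", "runtime", "classpath", "entry"],
    ["launch", "resolve", "classpath"],
    ["resolve", "required", "classpath"],
]
scores = code_rank(signatures)
for term in sorted(scores, key=scores.get, reverse=True):
    print(f"{term}: {scores[term]:.3f}")
```

Well-connected terms such as `resolve` and `classpath` collect votes from many neighbours, so they rise to the top of this toy ranking.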
ACER: QUERY REFORMULATION USING
CODERANK & MACHINE LEARNING
ACER: SCHEMATIC DIAGRAM
ACER: SELECTION OF THE BEST QUERY
REFORMULATION
Ref. candidate (method sig.)
Ref. candidate (field sig.)
Ref. candidate (method + field sigs.)
Data re-sampling
Machine learning (Ensemble learning)
Selection of the best reformulation
Reformulated query
ACER: QUERY REFORMULATIONS
Technique | Query | QE
Baseline | debugger source lookup | 79
Baseline | debugger source lookup work variables | 77
Refoqus 2013 | debugger source lookup work variables + launch jdt configuration classpath project | 12
CodeRank (method) | debugger source lookup work variables + launch debug resolve required classpath | 02
CodeRank (field) | debugger source lookup work variables + label classpath system resolution launch | 06
CodeRank (both) | debugger source lookup work variables + java type launch classpath label | 16
ACER | debugger source lookup work variables + launch debug resolve required classpath | 02
EXPERIMENTAL DATASET
8 projects (Apache + Eclipse)
GitHub commits & change sets
Bugzilla + JIRA issues
1,675 change requests
EXPERIMENTAL SETUP
Change request
Baseline query
Reformulated query
Code search
Our ranks
Baseline ranks
Compare
Query Effectiveness (QE)
Mean Reciprocal Rank (MRR)
Top-K Accuracy
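The three metrics can be sketched as below. The definitions are the usual ones and are assumptions here, since the slides only name the metrics: Query Effectiveness as the rank of the first correct result, MRR as the mean of reciprocal first ranks, and Top-K accuracy as the share of queries whose first correct result is within the top K. All function names are hypothetical.

```python
def query_effectiveness(ranked_results, gold_set):
    """Return the 1-based rank of the first correct result, or None if absent."""
    for rank, item in enumerate(ranked_results, start=1):
        if item in gold_set:
            return rank
    return None


def mean_reciprocal_rank(first_ranks):
    """Average 1/rank over all queries; a query with no hit contributes 0."""
    if not first_ranks:
        return 0.0
    return sum(1.0 / r for r in first_ranks if r is not None) / len(first_ranks)


def top_k_accuracy(first_ranks, k):
    """Fraction of queries whose first correct result is within the top k."""
    if not first_ranks:
        return 0.0
    hits = sum(1 for r in first_ranks if r is not None and r <= k)
    return hits / len(first_ranks)


# Toy data (assumed): first-hit ranks for four queries; None means no hit.
ranks = [2, None, 1, 12]
print(query_effectiveness(["A.java", "B.java", "C.java"], {"C.java"}))  # 3
print(round(mean_reciprocal_rank(ranks), 4))
print(top_k_accuracy(ranks, k=10))  # 0.5
```

A lower QE value is better under this definition, which matches the tables above where the reformulated queries have smaller QE than the baselines.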
RESEARCH QUESTIONS (5)
• RQ1: Does ACER improve baseline queries significantly?
• RQ2: Does CodeRank perform better than the traditional term weights (e.g., TF-IDF)?
• RQ3: Does document structure make a difference in query reformulation?
• RQ4: How do stemming, query length and relevance feedback size affect our performance?
• RQ5: Does ACER outperform the state-of-the-art in query reformulation for concept location?
ANSWERING RQ1: QUERY EFFECTIVENESS OVER
BASELINE
Query Pairs | Improved (MRD) | Worsened (MRD) | P-value | Preserved
CodeRank (method) vs. Baseline | 58.93% (-61) | 37.99% (+131) | 0.007* | 3.08%
CodeRank (field) vs. Baseline | 52.51% (-51) | 44.57% (+151) | 0.063 | 2.91%
CodeRank (both) vs. Baseline | 58.62% (-51) | 38.19% (+136) | 0.018* | 3.20%
ACER vs. Baseline | 71.05% (-81) | 2.51% (+104) | <0.001* | 26.44%
* = Significant difference between improvements and worsening; MRD = Mean Rank Difference
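The columns of this table can be derived from paired ranks roughly as follows. This is a sketch under assumptions: a query counts as improved when its reformulated rank is smaller than its baseline rank, and MRD is taken as the mean of (reformulated rank minus baseline rank) within each group, which matches the negative sign shown for improvements. The name `compare_query_pairs` is hypothetical.

```python
from statistics import mean


def compare_query_pairs(baseline_ranks, reformulated_ranks):
    """Classify each query pair and report percentages and mean rank differences."""
    improved, worsened, preserved = [], [], 0
    for base, reform in zip(baseline_ranks, reformulated_ranks):
        diff = reform - base  # negative: the correct result moved up the list
        if diff < 0:
            improved.append(diff)
        elif diff > 0:
            worsened.append(diff)
        else:
            preserved += 1

    total = len(improved) + len(worsened) + preserved
    if total == 0:
        raise ValueError("no query pairs to compare")
    return {
        "improved_pct": 100.0 * len(improved) / total,
        "worsened_pct": 100.0 * len(worsened) / total,
        "preserved_pct": 100.0 * preserved / total,
        "mrd_improved": mean(improved) if improved else 0.0,
        "mrd_worsened": mean(worsened) if worsened else 0.0,
    }


# Toy ranks (assumed data) for four change requests.
summary = compare_query_pairs([79, 77, 10, 5], [2, 12, 10, 9])
print(summary)
```

The significance test behind the P-value column is not shown here, since the slides do not name the test that was used.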
ANSWERING RQ2: CODERANK VS. TRADITIONAL
TERM WEIGHTS
ANSWERING RQ3: DO SOURCE DOCUMENT
STRUCTURES MATTER?
ANSWERING RQ4: IMPACT OF
REFORMULATION LENGTH
RQ5: COMPARISON WITH EXISTING METHODS
*Our performance is significantly higher for each metric than the state-of-the-art
1. CodeRank
2. Document contexts
3. Data re-sampling
TAKE-HOME MESSAGES
• Reformulating a search query is highly challenging for developers and costs considerable effort.
• Traditional term weights are not sufficient.
• We provide CodeRank, which exploits source term semantics and source document contexts.
• We provide ACER, which selects the best from a set of reformulation candidates prepared by CodeRank.
• Experiments with 1,675 change requests from 8 OSS systems of Apache & Eclipse.
• 71% of queries improved, only 3% worsened by ACER.
• Comparison with five methods, including the state-of-the-art, validates our approach.
THANK YOU !!! QUESTIONS?
More details on CodeRank & ACER:
http://www.usask.ca/~masud.rahman/acer/
Contact: masud.rahman@usask.ca
Masud Rahman
RQ5: COMPARISON WITH EXISTING METHODS
Our Top-K accuracy is clearly higher for various K-values than the state-of-the-art

Contenu connexe

Similaire à ACER-ASE2017-slides

Mksong proposal-slide
Mksong proposal-slideMksong proposal-slide
Mksong proposal-slidemksong
 
Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLionel Briand
 
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)Serban Tanasa
 
Agile_goa_2013_clean_code_tdd
Agile_goa_2013_clean_code_tddAgile_goa_2013_clean_code_tdd
Agile_goa_2013_clean_code_tddSrinivasa GV
 
IA3_presentation.pptx
IA3_presentation.pptxIA3_presentation.pptx
IA3_presentation.pptxKtonNguyn2
 
The Use of Development History in Software Refactoring Using a Multi-Objectiv...
The Use of Development History in Software Refactoring Using a Multi-Objectiv...The Use of Development History in Software Refactoring Using a Multi-Objectiv...
The Use of Development History in Software Refactoring Using a Multi-Objectiv...Ali Ouni
 
A preliminary study on using code smells to improve bug localization
A preliminary study on using code smells to improve bug localizationA preliminary study on using code smells to improve bug localization
A preliminary study on using code smells to improve bug localizationkrws
 
Keynote: Machine Learning for Design Automation at DAC 2018
Keynote:  Machine Learning for Design Automation at DAC 2018Keynote:  Machine Learning for Design Automation at DAC 2018
Keynote: Machine Learning for Design Automation at DAC 2018Manish Pandey
 
The Pill for Your Migration Hell
The Pill for Your Migration HellThe Pill for Your Migration Hell
The Pill for Your Migration HellDatabricks
 
1z0-419 Oracle Application Development Framework 12c Essentials Test
1z0-419 Oracle Application Development Framework 12c Essentials Test1z0-419 Oracle Application Development Framework 12c Essentials Test
1z0-419 Oracle Application Development Framework 12c Essentials TestHollandLillian
 
Agile Development in .NET
Agile Development in .NETAgile Development in .NET
Agile Development in .NETdanhermes
 
Using Compass to Diagnose Performance Problems
Using Compass to Diagnose Performance Problems Using Compass to Diagnose Performance Problems
Using Compass to Diagnose Performance Problems MongoDB
 
Using Compass to Diagnose Performance Problems in Your Cluster
Using Compass to Diagnose Performance Problems in Your ClusterUsing Compass to Diagnose Performance Problems in Your Cluster
Using Compass to Diagnose Performance Problems in Your ClusterMongoDB
 
Data Modeling, Normalization, and De-Normalization | PostgresOpen 2019 | Dimi...
Data Modeling, Normalization, and De-Normalization | PostgresOpen 2019 | Dimi...Data Modeling, Normalization, and De-Normalization | PostgresOpen 2019 | Dimi...
Data Modeling, Normalization, and De-Normalization | PostgresOpen 2019 | Dimi...Citus Data
 
Euro python 2015 writing quality code
Euro python 2015   writing quality codeEuro python 2015   writing quality code
Euro python 2015 writing quality coderadek_j
 
QUICKAR-ASE2016-Singapore
QUICKAR-ASE2016-SingaporeQUICKAR-ASE2016-Singapore
QUICKAR-ASE2016-SingaporeMasud Rahman
 
Mca5033 open source db systems
Mca5033 open source db systemsMca5033 open source db systems
Mca5033 open source db systemssmumbahelp
 

Similaire à ACER-ASE2017-slides (20)

Mksong proposal-slide
Mksong proposal-slideMksong proposal-slide
Mksong proposal-slide
 
STRICT-SANER2017
STRICT-SANER2017STRICT-SANER2017
STRICT-SANER2017
 
STRICT-SANER2015
STRICT-SANER2015STRICT-SANER2015
STRICT-SANER2015
 
Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and Repair
 
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
 
Agile_goa_2013_clean_code_tdd
Agile_goa_2013_clean_code_tddAgile_goa_2013_clean_code_tdd
Agile_goa_2013_clean_code_tdd
 
IA3_presentation.pptx
IA3_presentation.pptxIA3_presentation.pptx
IA3_presentation.pptx
 
The Use of Development History in Software Refactoring Using a Multi-Objectiv...
The Use of Development History in Software Refactoring Using a Multi-Objectiv...The Use of Development History in Software Refactoring Using a Multi-Objectiv...
The Use of Development History in Software Refactoring Using a Multi-Objectiv...
 
I explore
I exploreI explore
I explore
 
A preliminary study on using code smells to improve bug localization
A preliminary study on using code smells to improve bug localizationA preliminary study on using code smells to improve bug localization
A preliminary study on using code smells to improve bug localization
 
Keynote: Machine Learning for Design Automation at DAC 2018
Keynote:  Machine Learning for Design Automation at DAC 2018Keynote:  Machine Learning for Design Automation at DAC 2018
Keynote: Machine Learning for Design Automation at DAC 2018
 
The Pill for Your Migration Hell
The Pill for Your Migration HellThe Pill for Your Migration Hell
The Pill for Your Migration Hell
 
1z0-419 Oracle Application Development Framework 12c Essentials Test
1z0-419 Oracle Application Development Framework 12c Essentials Test1z0-419 Oracle Application Development Framework 12c Essentials Test
1z0-419 Oracle Application Development Framework 12c Essentials Test
 
Agile Development in .NET
Agile Development in .NETAgile Development in .NET
Agile Development in .NET
 
Using Compass to Diagnose Performance Problems
Using Compass to Diagnose Performance Problems Using Compass to Diagnose Performance Problems
Using Compass to Diagnose Performance Problems
 
Using Compass to Diagnose Performance Problems in Your Cluster
Using Compass to Diagnose Performance Problems in Your ClusterUsing Compass to Diagnose Performance Problems in Your Cluster
Using Compass to Diagnose Performance Problems in Your Cluster
 
Data Modeling, Normalization, and De-Normalization | PostgresOpen 2019 | Dimi...
Data Modeling, Normalization, and De-Normalization | PostgresOpen 2019 | Dimi...Data Modeling, Normalization, and De-Normalization | PostgresOpen 2019 | Dimi...
Data Modeling, Normalization, and De-Normalization | PostgresOpen 2019 | Dimi...
 
Euro python 2015 writing quality code
Euro python 2015   writing quality codeEuro python 2015   writing quality code
Euro python 2015 writing quality code
 
QUICKAR-ASE2016-Singapore
QUICKAR-ASE2016-SingaporeQUICKAR-ASE2016-Singapore
QUICKAR-ASE2016-Singapore
 
Mca5033 open source db systems
Mca5033 open source db systemsMca5033 open source db systems
Mca5033 open source db systems
 

Plus de Masud Rahman

HereWeCode 2022: Dalhousie University
HereWeCode 2022: Dalhousie UniversityHereWeCode 2022: Dalhousie University
HereWeCode 2022: Dalhousie UniversityMasud Rahman
 
The Forgotten Role of Search Queries in IR-based Bug Localization: An Empiric...
The Forgotten Role of Search Queries in IR-based Bug Localization: An Empiric...The Forgotten Role of Search Queries in IR-based Bug Localization: An Empiric...
The Forgotten Role of Search Queries in IR-based Bug Localization: An Empiric...Masud Rahman
 
PhD Seminar - Masud Rahman, University of Saskatchewan
PhD Seminar - Masud Rahman, University of SaskatchewanPhD Seminar - Masud Rahman, University of Saskatchewan
PhD Seminar - Masud Rahman, University of SaskatchewanMasud Rahman
 
PhD proposal of Masud Rahman
PhD proposal of Masud RahmanPhD proposal of Masud Rahman
PhD proposal of Masud RahmanMasud Rahman
 
PhD Comprehensive exam of Masud Rahman
PhD Comprehensive exam of Masud RahmanPhD Comprehensive exam of Masud Rahman
PhD Comprehensive exam of Masud RahmanMasud Rahman
 
Doctoral Symposium of Masud Rahman
Doctoral Symposium of Masud RahmanDoctoral Symposium of Masud Rahman
Doctoral Symposium of Masud RahmanMasud Rahman
 
Supporting Source Code Search with Context-Aware and Semantics-Driven Code Se...
Supporting Source Code Search with Context-Aware and Semantics-Driven Code Se...Supporting Source Code Search with Context-Aware and Semantics-Driven Code Se...
Supporting Source Code Search with Context-Aware and Semantics-Driven Code Se...Masud Rahman
 
ICSE2018-Poster-Bug-Localization
ICSE2018-Poster-Bug-LocalizationICSE2018-Poster-Bug-Localization
ICSE2018-Poster-Bug-LocalizationMasud Rahman
 
CodeInsight-SCAM2015
CodeInsight-SCAM2015CodeInsight-SCAM2015
CodeInsight-SCAM2015Masud Rahman
 
RACK-Tool-ICSE2017
RACK-Tool-ICSE2017RACK-Tool-ICSE2017
RACK-Tool-ICSE2017Masud Rahman
 
CORRECT-ToolDemo-ASE2016
CORRECT-ToolDemo-ASE2016CORRECT-ToolDemo-ASE2016
CORRECT-ToolDemo-ASE2016Masud Rahman
 
Code-Review-COW56-Meeting
Code-Review-COW56-MeetingCode-Review-COW56-Meeting
Code-Review-COW56-MeetingMasud Rahman
 
NLP2API: Replication package accepted by ICSME 2018
NLP2API: Replication package accepted by ICSME 2018NLP2API: Replication package accepted by ICSME 2018
NLP2API: Replication package accepted by ICSME 2018Masud Rahman
 

Plus de Masud Rahman (20)

HereWeCode 2022: Dalhousie University
HereWeCode 2022: Dalhousie UniversityHereWeCode 2022: Dalhousie University
HereWeCode 2022: Dalhousie University
 
The Forgotten Role of Search Queries in IR-based Bug Localization: An Empiric...
The Forgotten Role of Search Queries in IR-based Bug Localization: An Empiric...The Forgotten Role of Search Queries in IR-based Bug Localization: An Empiric...
The Forgotten Role of Search Queries in IR-based Bug Localization: An Empiric...
 
PhD Seminar - Masud Rahman, University of Saskatchewan
PhD Seminar - Masud Rahman, University of SaskatchewanPhD Seminar - Masud Rahman, University of Saskatchewan
PhD Seminar - Masud Rahman, University of Saskatchewan
 
PhD proposal of Masud Rahman
PhD proposal of Masud RahmanPhD proposal of Masud Rahman
PhD proposal of Masud Rahman
 
PhD Comprehensive exam of Masud Rahman
PhD Comprehensive exam of Masud RahmanPhD Comprehensive exam of Masud Rahman
PhD Comprehensive exam of Masud Rahman
 
Doctoral Symposium of Masud Rahman
Doctoral Symposium of Masud RahmanDoctoral Symposium of Masud Rahman
Doctoral Symposium of Masud Rahman
 
Supporting Source Code Search with Context-Aware and Semantics-Driven Code Se...
Supporting Source Code Search with Context-Aware and Semantics-Driven Code Se...Supporting Source Code Search with Context-Aware and Semantics-Driven Code Se...
Supporting Source Code Search with Context-Aware and Semantics-Driven Code Se...
 
ICSE2018-Poster-Bug-Localization
ICSE2018-Poster-Bug-LocalizationICSE2018-Poster-Bug-Localization
ICSE2018-Poster-Bug-Localization
 
MSR2017-Challenge
MSR2017-ChallengeMSR2017-Challenge
MSR2017-Challenge
 
MSR2017-RevHelper
MSR2017-RevHelperMSR2017-RevHelper
MSR2017-RevHelper
 
MSR2015-Challenge
MSR2015-ChallengeMSR2015-Challenge
MSR2015-Challenge
 
MSR2014-Challenge
MSR2014-ChallengeMSR2014-Challenge
MSR2014-Challenge
 
CodeInsight-SCAM2015
CodeInsight-SCAM2015CodeInsight-SCAM2015
CodeInsight-SCAM2015
 
CMPT-842-BRACK
CMPT-842-BRACKCMPT-842-BRACK
CMPT-842-BRACK
 
RACK-Tool-ICSE2017
RACK-Tool-ICSE2017RACK-Tool-ICSE2017
RACK-Tool-ICSE2017
 
RACK-SANER2016
RACK-SANER2016RACK-SANER2016
RACK-SANER2016
 
CORRECT-ToolDemo-ASE2016
CORRECT-ToolDemo-ASE2016CORRECT-ToolDemo-ASE2016
CORRECT-ToolDemo-ASE2016
 
CORRECT-ICSE2016
CORRECT-ICSE2016CORRECT-ICSE2016
CORRECT-ICSE2016
 
Code-Review-COW56-Meeting
Code-Review-COW56-MeetingCode-Review-COW56-Meeting
Code-Review-COW56-Meeting
 
NLP2API: Replication package accepted by ICSME 2018
NLP2API: Replication package accepted by ICSME 2018NLP2API: Replication package accepted by ICSME 2018
NLP2API: Replication package accepted by ICSME 2018
 

Dernier

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 

Dernier (20)

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 

ACER-ASE2017-slides

  • 1. IMPROVED QUERY REFORMULATION FOR CONCEPT LOCATION USING CODERANK AND DOCUMENT STRUCTURES Mohammad Masudur Rahman, Chanchal K. Roy Department of Computer Science University of Saskatchewan, Canada International Conference on Automated Software Engineering (ASE 2017), Urbana-Champaign, IL, USA
  • 2. AN EXAMPLE CHANGE REQUEST 2 Field Content Issue ID 31110 Product eclipse.jdt.debug Title Debbugger Source Lookup does not work with variables Description In the Debugger Source Lookup dialog I can also select variables for source lookup. (Advanced... > Add Variables). I selected the variable which points to the archive containing the source file for the type, but the debugger still claims that he cannot find the source
  • 3. SEARCH KEYWORD SELECTION 3 Field Content Issue ID 31110 Product eclipse.jdt.debug Title Debbugger Source Lookup does not work with variables Description In the Debugger Source Lookup dialog I can also select variables for source lookup. (Advanced... > Add Variables). I selected the variable which points to the archive containing the source file for the type, but the debugger still claims that he cannot find the source.
  • 4. CHANGE REQUEST TO CODE MAPPING 4 Field Content Issue ID 31110 Product eclipse.jdt.debug Title Debbugger Source Lookup does not work with variables Description In the Debugger Source Lookup dialog I can also select variables for source lookup. (Advanced... > Add Variables). I selected the variable which points to the archive containing the source file for the type, but the debugger still claims that he cannot find the source
  • 5. BASELINE SEARCH QUERIES 5 Technique Query QE Baseline debugger source lookup 79 Baseline debugger source lookup work variables 77 Baseline query Baseline + Expansion terms Pseudo-relevance Feedback
  • 6. TRADITIONAL QUERY REFORMULATIONS 6 Technique Reformulated Query QE RSV 1990 debugger source lookup work variables + launch configuration jdt java debug 30 Sisman & Kak 2013 debugger source lookup work variables + test exception suite core code 51 Refoqus 2013 debugger source lookup work variables + launch jdt configuration classpath project 12 Technique Query QE Baseline debugger source lookup 79 Baseline debugger source lookup work variables 77
  • 7. BIG PICTURE: TERM WEIGHTING 7   RFDd t t n D dftIDFTF log)),log(1()( Baseline query Baseline + Expansion terms
  • 8. BIG PICTURE: TERM WEIGHTING 8   RFDd t t n D dftIDFTF log)),log(1()( • Different semantics • Different structures
  • 9. OUR CONTRIBUTIONS (2)  Novel term weighting method – CodeRank  Novel query reformulation technique -- ACER 9
  • 10. CODERANK: TERM WEIGHTING FOR SOURCE CODE TERMS 10
  • 12. CODERANK CALCULATION: STEP II 12 resolveRuntimeClasspathEntry Resolve Runtime Classpath Entry
  • 13. CODERANK CALCULATION: STEP III 13   )( )10( |)(| )( )1()( iVInj j j i VOut VS VS  Most important face in this crowd 1. resolve 2. required 3. launch 4. classpath 5. runtime
  • 14. ACER: QUERY REFORMULATION USING CODERANK & MACHINE LEARNING 14
  • 16. ACER: SELECTION OF THE BEST QUERY REFORMULATION 16 Ref. candidate (method sig.) Ref. candidate (field sig.) Ref. candidate (method + field sigs) Data re-samplingMachine learning (Ensemble learning) Select of the best reformulation Reformulated query
  • 17. ACER: QUERY REFORMULATIONS
      Technique | Query | QE
      Baseline | debugger source lookup | 79
      Baseline | debugger source lookup work variables | 77
      Refoqus 2013 | debugger source lookup work variables + launch jdt configuration classpath project | 12
      CodeRank (method) | debugger source lookup work variables + launch debug resolve required classpath | 02
      CodeRank (field) | debugger source lookup work variables + label classpath system resolution launch | 06
      CodeRank (both) | debugger source lookup work variables + java type launch classpath label | 16
      ACER | debugger source lookup work variables + launch debug resolve required classpath | 02
  • 18. EXPERIMENTAL DATASET 8 Projects (Apache + Eclipse); GitHub commits & change sets; BugZilla + JIRA issues; 1,675 change requests
  • 19. EXPERIMENTAL SETUP Diagram labels: Change request; Baseline query; Reformulated query; Code search; Our ranks; Baseline ranks; Compare. Metrics: Query Effectiveness (QE); Mean Reciprocal Rank (MRR); Top-K Accuracy
  • 20. RESEARCH QUESTIONS (5)
      • RQ1: Does ACER improve baseline queries significantly?
      • RQ2: Does CodeRank perform better than the traditional term weights (e.g., TF-IDF)?
      • RQ3: Does document structure make a difference in query reformulation?
      • RQ4: How do stemming, query length and relevance feedback size affect our performance?
      • RQ5: Does ACER outperform the state-of-the-art in query reformulation for concept location?
  • 21. ANSWERING RQ1: QUERY EFFECTIVENESS OVER BASELINE
      Query Pairs | Improved (MRD) | Worsened (MRD) | P-value | Preserved
      CodeRank (method) vs. Baseline | 58.93% (-61) | 37.99% (+131) | 0.007* | 3.08%
      CodeRank (field) vs. Baseline | 52.51% (-51) | 44.57% (+151) | 0.063 | 2.91%
      CodeRank (both) vs. Baseline | 58.62% (-51) | 38.19% (+136) | 0.018* | 3.20%
      ACER vs. Baseline | 71.05% (-81) | 2.51% (+104) | <0.001* | 26.44%
      * = Significant difference between improvements and worsening, MRD = Mean Rank Difference
  • 22. ANSWERING RQ2: CODERANK VS. TRADITIONAL TERM WEIGHTS
  • 23. ANSWERING RQ3: DO SOURCE DOCUMENT STRUCTURES MATTER?
  • 24. ANSWERING RQ3: DO SOURCE DOCUMENT STRUCTURES MATTER?
  • 25. ANSWERING RQ4: IMPACT OF REFORMULATION LENGTH
  • 26. RQ5: COMPARISON WITH EXISTING METHODS *Our performance is significantly higher for each metric than the state-of-the-art. 1. CodeRank 2. Document contexts 3. Data re-sampling
  • 27. TAKE-HOME MESSAGES
      • Reformulating a search query is highly challenging for developers and costs a lot of effort.
      • Traditional term weights are not sufficient.
      • We provide CodeRank, which exploits source term semantics and source document contexts.
      • We provide ACER, which selects the best from a set of reformulation candidates prepared by CodeRank.
      • Experiments with 1,675 change requests from 8 OSS systems of Apache & Eclipse.
      • 71% of queries improved, only 3% worsened by ACER.
      • Comparison with five methods including the state-of-the-art validates our approach.
  • 28. THANK YOU! QUESTIONS? More details on CodeRank & ACER: http://www.usask.ca/~masud.rahman/acer/ Contact: Masud Rahman, masud.rahman@usask.ca
  • 29. RQ5: COMPARISON WITH EXISTING METHODS Our Top-K accuracy is clearly higher for various K-values than the state-of-the-art
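
The splitting and scoring steps on slides 12 and 13 can be sketched in Python as follows. This is only an illustration, not the authors' implementation: the function names (`split_identifier`, `build_term_graph`, `code_rank`), the rule that links every pair of terms taken from the same identifier, the damping value 0.85, the starting score and the stopping threshold are all assumptions made for this example.

```python
import re
from collections import defaultdict


def split_identifier(name):
    """Split a camelCase identifier into lowercase terms."""
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", name)
    return [p.lower() for p in parts]


def build_term_graph(identifiers):
    """Link every pair of distinct terms that share an identifier.

    Assumption: co-occurrence inside one identifier stands in for the
    semantic dependencies mentioned in the talk. Links go both ways.
    """
    graph = defaultdict(set)
    for name in identifiers:
        terms = set(split_identifier(name))
        for term in terms:
            graph[term] |= terms - {term}
    return graph


def code_rank(graph, damping=0.85, iterations=100, tol=1e-6):
    """PageRank-style scoring: S(Vi) = (1 - d) + d * sum(S(Vj) / |Out(Vj)|)."""
    if not graph:
        return []
    scores = {term: 0.25 for term in graph}  # arbitrary starting score
    for _ in range(iterations):
        updated = {}
        for term in graph:
            # Links are symmetric here, so neighbours serve as both In and Out.
            votes = sum(scores[n] / len(graph[n]) for n in graph[term])
            updated[term] = (1 - damping) + damping * votes
        change = max(abs(updated[t] - scores[t]) for t in graph)
        scores = updated
        if change < tol:
            break
    return sorted(scores, key=scores.get, reverse=True)


# Only the first identifier comes from the slides; the others are made up.
identifiers = ["resolveRuntimeClasspathEntry", "resolveSourcePath", "resolveVariable"]
ranked_terms = code_rank(build_term_graph(identifiers))
print(ranked_terms[:5])
```

The top of `ranked_terms` would then be appended to the baseline query as expansion terms, in the way the CodeRank rows of the slide 17 table extend "debugger source lookup work variables".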

Editor's notes

  1. Good morning, everyone. Introduce yourself. Today, I am going to talk about a query reformulation technique for concept location, where we use an advanced term weighting method and machine learning.
  2. Now, this is a real software change request. These two sections are important: they contain the information about the requested change.
  3. Now, when a request like this is submitted, a developer tries to pick out the important terms. Then they use those terms to find the source code to change, probably using a search engine like Lucene.
  4. That is, they try to map the concepts discussed in the change request to the appropriate source code sections, like this. This is where the term "concept location" comes from, if you want me to define it.
  5. But concept location is NOT an easy task. For example, these two very reasonable queries from the change request do not perform well. The second one returns the correct result at the 77th position, which is of course not acceptable. So what is needed here is a reformulation of the query. Now, there is traditional tool support for doing that. What most of these tools do is submit the initial query to the search engine, collect the results, and then extract the most important terms from those results to reformulate the poor initial query.
  6. Now, these are the reformulated queries from three such existing methods. They do improve the ranking and return the results a bit closer to the top. But as you can see, that is clearly not enough. Developers want the results at the top positions, so these queries are still costly for practical use.
  7. Now, we investigated this part of the reformulation process and found that most of the existing techniques use this equation to determine the importance of a term. That is, they use TF-IDF to select the words for query reformulation. In other words, they rely on the frequency of a term as a proxy for its importance.
  8. Now, this is a metric that has been in play since the last century. It was proposed in the 70s. It is a good metric, but it was proposed for regular texts such as news articles. We, on the other hand, are dealing with source code. Regular texts and source code have different semantics and different structures; they are not the same. So metrics designed for regular texts are not appropriate for source code. This is our hypothesis.
  9. So, we make two contributions here. We propose CodeRank, a novel term weighting method appropriate for source code. We propose ACER, a novel query reformulation technique that uses this term weight.
  10. First comes CodeRank.
  11. Now, what did we do? We extract important artifacts such as method signatures, formal parameters and field signatures from the source code. We mostly used AST parsing and regular expressions for this. The idea is that signatures capture the intent better than other texts. For example, a method signature states the intent, whereas the method body implements that intent with lots of noise.
  12. Now, once such items are extracted, we split them. As we can see, these single terms share some kind of semantics and together convey a broader meaning. That is, they complement each other in this context. We capture such semantic dependencies in the source code and develop a term graph like this.
  13. Now, once the graph is developed, we use a popular graph-based algorithm called PageRank to determine the importance of each node. OK, let's go visual. In a crowd, the most important person is the one everybody is looking at. It can also be seen as votes: the person who gets the most votes is the leader. We follow the same concept in our term graph. That is, the term that is connected the most with other terms is an important term. This scoring is a recursive process, and in the end we get a ranked list of important terms that can be used as reformulation terms.
  14. Now comes ACER, the second contribution.
  15. This is the schematic diagram of our approach. So far we have talked about these parts of our approach. Now we will zoom in on this part.
  16. Once CodeRank is calculated, we collect multiple reformulation candidates for a given initial query. As we discussed, a source document has various contexts: method signatures, field signatures and so on. We make use of such contexts and develop multiple reformulation candidates. Now, since we have multiple options, we have to choose the best reformulation. To do that, we apply machine learning. In particular, we determine the quality of each candidate using 20 quality metrics that mostly come from the IR domain. Then we use a regression-tree-based classifier and suggest the best reformulated query.
  17. Now let's see the outcome. Here, we have created three reformulation candidates using CodeRank and source document contexts. Then our ML classifier picks the best option, which returns the correct result at the 2nd position. Now, if we look closely, our technique identifies two unique terms that made the real difference in performance.
  18. For the experiments, we select 8 subject systems from Apache and Eclipse. We collect 1,675 change requests/bug reports from BugZilla and JIRA. We use the report title as our query and prepare the gold set by consulting the commit history of those projects on GitHub. This is the widely accepted approach for experiments in this area.
  19. For the experiments, we collect our queries and the baseline queries and feed them to a code search engine. Then we collect their results/ranks and compare them. For evaluation/validation, we used these four performance metrics.
  20. Now, in our experiment, we answer these five research questions.
  21. In the first research question, we compare our queries with the baseline queries. As we see, the method signature based reformulation performs the best among the three candidates. However, the machine learning step selects the best among the three and provides the best performance overall. For example, our reformulation improves 71% of the queries, preserves 26% and degrades only 3% of the queries. So, clearly, we are improving far more queries than we are degrading.
  22. In the second research question, we compare CodeRank with the traditional term weights: Term Frequency and TF-IDF. We see that TF performs better than TF-IDF, which is interesting. Anyway, when compared with our CodeRank, we see that TF performs better initially, but then CodeRank outperforms it, especially for 10-15 reformulation terms. That is, a few highly frequent terms are really important, but CodeRank is more reliable than term frequency for term importance.
  23. In the third research question, we show how document structures/contexts make a difference. These are the numbers of queries improved by the various reformulation candidates. We see that 19% of the total improvements are unique to a single context. That is, if we consider only method signatures for query reformulation, we miss the improvements made by field signature based reformulations. Again, if we consider the whole texts rather than signatures, we also miss some query improvements. This holds not only for CodeRank but also when we employ term frequency in those contexts. Thus, document contexts matter for query reformulation.
  24. Now, when we look at the query improvements by ACER and term frequency as a Venn diagram, we find a 66% overlap, but ACER provides a unique set of improvements that is three times that of TF. ACER uses document structures and TF does not, and we see the difference here.
  25. In the fourth research question, we calibrate the reformulation length. We found that the best performance is achieved when the reformulation length is between 10 and 15 terms. This is where CodeRank saturates.
  26. In the fifth research question, we compare our query improvement and worsening ratios with those of the existing methods. We see that our median improvement is much higher than the others. More importantly, we degrade far fewer queries than the others. Our performance on these measures is significantly higher. Thus, according to our investigation, ACER is the winner. But we must also admit that the ML-based approach is less scalable, and we are now working on the tool.
  27. So, these are the take-home messages. Query reformulation is a challenging task for developers. Google does not work on a local source code repository. Traditional term weights are clearly not sufficient or appropriate for source code. We provide CodeRank, a novel term weight for source code. We provide ACER, an improved reformulation technique. Our technique improves about 71% of the queries and degrades only a handful of them. Comparison with the state-of-the-art shows the promise of our method.
  28. Thanks for your time and attention. I am ready to take a few questions.
  29. When we consider Top-K accuracy for various K, we get similar findings. Our method located the concepts correctly for 80% of the change requests, whereas the existing methods did so for 60% at best. This shows the potential of our technique.
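
The comparison described in notes 19 and 21 can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the authors' evaluation code: Query Effectiveness is taken here to be the 1-based rank of the first correct result, a query with no correct result is treated as ranked infinitely low, and all function names and file names are hypothetical.

```python
import math


def first_correct_rank(results, gold_set):
    """Query Effectiveness (QE): rank of the first result found in the gold set."""
    for rank, doc in enumerate(results, start=1):
        if doc in gold_set:
            return rank
    return None  # no correct result was retrieved


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries; a missing rank contributes 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


def top_k_accuracy(ranks, k=10):
    """Fraction of queries whose first correct result is within the top K."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


def compare(baseline_rank, reformulated_rank):
    """Label a query pair as improved, worsened or preserved."""
    base = math.inf if baseline_rank is None else baseline_rank
    ours = math.inf if reformulated_rank is None else reformulated_rank
    if ours < base:
        return "improved"
    if ours > base:
        return "worsened"
    return "preserved"


# Hypothetical result list and gold set for one query.
results = ["A.java", "B.java", "C.java"]
print(first_correct_rank(results, {"B.java"}))

# Ranks from the slide 17 example: the baseline query at 77, ACER at 2.
print(compare(77, 2))
```

Counting the three labels over all query pairs would give percentages of the kind reported in the slide 21 table.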