A SYSTEMATIC LITERATURE REVIEW OF
AUTOMATED QUERY REFORMULATIONS IN
SOURCE CODE SEARCH
Masud Rahman
Department of Computer Science
University of Saskatchewan, Canada
Advisor: Dr. Chanchal K. Roy
@masud2336
MASUD RAHMAN: ACADEMICS
2019: PhD (In Progress), University of Saskatchewan (Award: Dr. Keith Geddes Award)
2014: MSc, University of Saskatchewan (Award: Best MSc Thesis Nomination)
2009: BSc, Khulna University, Bangladesh (Award: President Gold Medal)
MasudRahman,UofS
TALK OUTLINE
Part 1: Background Concepts
Part 2: Literature Review
Part 3: Future Opportunities
Part 4: Q & A
Part 1: Background Concepts
BACKGROUND CONCEPTS
1. Source Code Search
2. Automated Query Reformulation
MCAS: A SOFTWARE BUG THAT KILLS
Boeing 737 MAX 8
A TALE OF SOURCE CODE SEARCH
[Diagram labels: Boeing Customer, MCAS Bug report, Boeing Developer, Code search, Query Suggestion, Query Reformulation, Boeing Codebase]
QUERY REFORMULATION: 2 WORKING CONTEXTS
Local code search (e.g., bug localization): Boeing codebase
Internet-scale code search: GitHub
VOCABULARY MISMATCH PROBLEM
Boeing Customer vs. Boeing Developer
Both are correct and wrong!
QUERY REFORMULATION: 3 TYPES
Initial query: Fix MCAS Bug
Query Expansion: Fix MCAS Bug + Lion Air
Query Reduction: MCAS Bug
Query Replacement: Boeing 737 Max Software Issue
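The three reformulation types can be illustrated with a minimal sketch. The helper names (`expand_query`, `reduce_query`, `replace_query`) are hypothetical, and the pairing of the slide's example phrases with each type is an assumption based on the slide layout.

```python
def expand_query(query, extra_terms):
    """Query expansion: append new terms that are not already in the query."""
    return query + [t for t in extra_terms if t not in query]


def reduce_query(query, noisy_terms):
    """Query reduction: drop terms that are considered noisy."""
    return [t for t in query if t not in noisy_terms]


def replace_query(query, new_query):
    """Query replacement: discard the original query in favour of a new one."""
    return list(new_query)


initial = ["fix", "mcas", "bug"]
expanded = expand_query(initial, ["lion", "air"])
reduced = reduce_query(initial, {"fix"})
replaced = replace_query(initial, ["boeing", "737", "max", "software", "issue"])

print(expanded)  # ['fix', 'mcas', 'bug', 'lion', 'air']
print(reduced)   # ['mcas', 'bug']
print(replaced)  # ['boeing', '737', 'max', 'software', 'issue']
```

In practice the hard part is choosing which terms to add or drop automatically; the functions above only show the shape of each operation.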
QUERY REFORMULATION: 3 STEPS
[Diagram: Initial Query -> Code Search -> Relevance Feedback -> Feedback documents -> Text mining -> Candidate term weighting -> Candidate term ranking; the three steps are numbered 1 to 3 on the slide]
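The pipeline on this slide (retrieve feedback documents, weight candidate terms, rank them) can be sketched as follows. This is a toy illustration, not the method of any surveyed study: the corpus, the file names, the overlap-based `search` function and the frequency-based weighting are all assumptions made for the example.

```python
from collections import Counter


def search(query, corpus, top_k=2):
    """Toy retrieval: rank documents by how many query terms they contain."""
    scored = [(sum(term in doc.split() for term in query), name)
              for name, doc in corpus.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored[:top_k] if score > 0]


def reformulate(query, corpus, top_k=2, n_terms=2):
    """Expand the query with the top-ranked terms from the feedback documents."""
    feedback = search(query, corpus, top_k)            # relevance feedback
    weights = Counter()                                # candidate term weighting
    for name in feedback:
        weights.update(t for t in corpus[name].split() if t not in query)
    ranked = sorted(weights, key=lambda t: (-weights[t], t))  # candidate term ranking
    return query + ranked[:n_terms]


# Hypothetical mini-codebase: file name -> bag of identifier words.
corpus = {
    "FlightControl.java": "mcas trim stabilizer sensor angle",
    "SensorReader.java": "sensor angle attack read mcas",
    "CabinLights.java": "cabin light dimmer switch",
}

new_query = reformulate(["mcas", "bug"], corpus)
print(new_query)  # ['mcas', 'bug', 'angle', 'sensor']
```

Here the top-ranked results are treated as feedback without any human judgment (pseudo-relevance feedback); explicit developer feedback would replace the `search` call.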
QUERY REFORMULATION: IMPLICATIONS
Automated Query Reformulation
Benefits
• 20% Improvement
• Redefine information needs
• Reduced cost & efforts
Costs
• Hurts good queries
• Topic drifting
• Difficult queries
Part 2: Systematic Literature Review
SYSTEMATIC LITERATURE REVIEW: 6 STEPS
Research questions (7 RQs) -> Search keywords -> Literature search -> Literature bulk -> Noise filtration -> Primary studies -> In-depth investigation
SYSTEMATIC LITERATURE REVIEW: PRIMARY
STUDY SELECTION
ACM DL
CrossRef
DBLP
Mendeley
Google Scholar
IEEE Xplore
ProQuest
ScienceDirect
SpringerLink
Web of Science
Wiley Online Lib
Study selection pipeline (number of results after each stage):
Initial results: 2871
Impurity removal: 2317
Filter by title: 562
Filter by abstract: 195
Merging & duplicate removal: 93
Filter by full texts: 56 primary studies

Search keywords (two groups, combined):
Information retrieval, IR, text retrieval, TR, bug localization, concept location, feature location, FLT, concern location, Internet-scale code search, code search engine, search engine, local code search, code search, source code search, and code search query.
+
Query reformulation, query expansion, query reduction, query formulation, query refinement, automated query expansion, AQE, query suggestion, query recommendation, term selection, query replacement, query difficulty, query quality, keyword selection, keyword extraction, search term identification, search query, search term, and search keyword.
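The selection funnel can be checked with a few lines of arithmetic. The pairing of each count with a stage name below is an assumption read off the slide layout (six numbers, six stage labels); only the counts themselves come from the slide.

```python
# Stage name -> number of results remaining (stage order is an assumption).
stages = [
    ("Initial results", 2871),
    ("Impurity removal", 2317),
    ("Filter by title", 562),
    ("Filter by abstract", 195),
    ("Merging & duplicate removal", 93),
    ("Filter by full texts", 56),
]


def retention(stages):
    """For each stage after the first, report the share of results it kept."""
    rows = []
    for (_, before), (name, after) in zip(stages, stages[1:]):
        rows.append((name, after, round(100 * after / before, 1)))
    return rows


rows = retention(stages)
for name, kept, percent in rows:
    print(f"{name}: {kept} kept ({percent}% of previous stage)")

overall = round(100 * stages[-1][1] / stages[0][1], 1)
print(f"Primary studies as a share of initial results: {overall}%")
```

Under this reading, roughly one in fifty of the initial search results survives as a primary study.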
OUR RESEARCH QUESTIONS
RQ1: Which methods, algorithms and data sources have been used for automated query reformulations targeting code search in the literature?
RQ2: Which methods, metrics or subject systems have been used to evaluate and validate the research on automated query reformulations?
RQ3: What are the major challenges of automated query reformulations intended for code search? How many of them have been solved to date by the literature?
RQ4: How much research on automated query reformulations has been performed to date? At which venues has this research been published?
RQ5: What are the differences and similarities between query reformulations for local code search and query reformulations for Internet-scale code search?
RQ6: Which one is more appropriate among term weighting, query-term co-occurrence and thesaurus-based approaches for query keyword selection?
RQ7: What is the scope for future work in the area of automated query reformulation targeting code search?
RQ1: WHICH METHODS & ALGORITHMS ARE
USED BY LITERATURE?
Grounded Theory
• Open coding
• Axial coding
• Selective coding
RQ1: WHICH ALGORITHMS & REFORMULATION TYPES
ARE USED BY LITERATURE?
RQ2: WHICH EVALUATION & VALIDATION
SETTINGS ARE EMPLOYED?
RQ3: WHAT ARE COMMON CHALLENGES &
LIMITATIONS OF EXISTING LITERATURE?
Grounded Theory
• Open coding
• Axial coding
• Selective coding
RQ3: WHAT ARE COMMON CHALLENGES & LIMITATIONS OF
EXISTING LITERATURE?
RQ4: PUBLICATION STATS & INTERESTS ON QUERY
REFORMULATION RESEARCH
RQ5: COMPARISON BETWEEN LOCAL CODE SEARCH &
INTERNET-SCALE CODE SEARCH
TW = Term Weighting, TQC = Term-Query Co-occurrence, TS = Thesaurus, ON = Ontology, SLM = Search Log Mining, ML = Machine Learning, HM = Heuristics & Miscellaneous
RQ5: COMPARISON BETWEEN LOCAL CODE SEARCH &
INTERNET-SCALE CODE SEARCH
CH1 = Vocabulary Mismatch Unsolved, CH2 = Extra Burden on Developers, CH3 = Lack of Generalizability, CH4 = Lack of Practical Use, CH5 = Inappropriate Use of Tools, CH6 = Human Bias + Weak Evaluation, CH7 = Unverified Assumptions
RQ6: CHALLENGES WITH THREE KEYWORD
SELECTION METHODS
Method                     #Study     CH1   CH2   CH3   CH6
Term Weighting             22 (39%)   36%   18%   91%   50%
Term-Query Co-occurrence   11 (20%)    9%   27%   64%   91%
Thesaurus                  17 (30%)   12%   12%   47%   41%
CH1 = Vocabulary Mismatch Unsolved, CH2 = Extra Burden on
Developers, CH3 = Lack of Generalizability, CH4 = Lack of Practical
Use, CH5 = Inappropriate Use of Tools, CH6 = Human Bias + Weak
Evaluation, CH7 = Unverified Assumptions
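The percentages in the #Study column are consistent with the 56 primary studies reported on the study selection slide, which a short calculation can confirm. The helper `implied_studies` is hypothetical; it only converts a reported percentage back into an approximate study count.

```python
PRIMARY_STUDIES = 56  # number of primary studies from the selection slide

method_counts = {
    "Term Weighting": 22,
    "Term-Query Co-occurrence": 11,
    "Thesaurus": 17,
}

# Share of all primary studies that use each keyword selection method.
shares = {method: round(100 * count / PRIMARY_STUDIES)
          for method, count in method_counts.items()}
print(shares)  # {'Term Weighting': 39, 'Term-Query Co-occurrence': 20, 'Thesaurus': 30}


def implied_studies(percent, method):
    """Approximate number of studies behind a reported percentage (an estimate)."""
    return round(percent * method_counts[method] / 100)


# Example: 91% of the 22 term-weighting studies face CH3 (Lack of Generalizability).
print(implied_studies(91, "Term Weighting"))  # 20
```

The three methods together cover 50 of the 56 studies, so the remaining studies presumably rely on the other method families.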
Part 3: Future Opportunities
R1: KEYWORD SELECTION FROM BUG REPORT
[Bug report fields: Title, Description]
ID   Query                                                       QE
1    Custom search results view iresource                        1331
2    Custom search results search results view                   636
3    element iresource provider level tree                       01
4    Custom search results hierarchically java search results    570
Lower QE is better
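The slide only states that a lower QE is better. A minimal sketch, assuming QE is the rank of the first relevant result in the ranked list (an assumption, not stated on the slide), could look like this; the file names are hypothetical.

```python
def query_effectiveness(ranked_results, relevant):
    """Return the 1-based rank of the first relevant result, or None if none is found."""
    for rank, doc in enumerate(ranked_results, start=1):
        if doc in relevant:
            return rank
    return None


# Hypothetical ranked list returned for one query, and the file that fixes the bug.
ranked = ["SearchView.java", "TreeProvider.java", "ResourceLevel.java"]
buggy_files = {"ResourceLevel.java"}

qe = query_effectiveness(ranked, buggy_files)
print(qe)  # 3
```

Under this definition, a QE of 1 means the first result already points at a relevant file, which is why different keyword choices from the same bug report can differ so widely.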
R2: TERM WEIGHTING FOR SOURCE CODE
$$\text{TF-IDF}(t) = \sum_{d \in D_{RF}} \left(1 + \log(tf_{t,d})\right) \times \log\frac{D}{n_t}$$
• Different syntax
• Different semantics
• Different structures
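A minimal sketch of TF-IDF style term weighting over feedback documents is given below. It assumes a common variant with log-scaled term frequency and a plain inverse document frequency; the exact variant on the slide may differ, and the toy corpus is hypothetical. It also assumes every feedback document belongs to the corpus, so each term has a non-zero document frequency.

```python
import math
from collections import Counter


def tf_idf_weights(feedback_docs, corpus):
    """Weight each term as the sum over feedback docs of (1 + log tf) * log(N / df)."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))          # count each term once per document
    weights = Counter()
    for doc in feedback_docs:
        for term, tf in Counter(doc).items():
            idf = math.log(n_docs / doc_freq[term])
            weights[term] += (1 + math.log(tf)) * idf
    return weights


# Hypothetical corpus: each document is a list of tokens.
corpus = [
    ["mcas", "trim", "trim", "sensor"],
    ["sensor", "read"],
    ["cabin", "light"],
    ["sensor", "angle"],
]
weights = tf_idf_weights(corpus[:1], corpus)
ranked_terms = sorted(weights, key=weights.get, reverse=True)
print(ranked_terms)  # ['trim', 'mcas', 'sensor']
```

The weighting treats source code tokens like natural language words, which is exactly the limitation the bullets above point to: code has different syntax, semantics and structure.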
R3: SOLVE VOCABULARY MISMATCH ISSUE
[Diagram labels: Customer, Developer, Past Developer, Bug Report, Codebase]
SOLUTION: SEMANTIC HYPERSPACE
Word 1: P(1, 5, 6, 7, ..., N)
Word 2: P(2, 4, 6, 9, ..., N)
Cosine distance = Semantic relevance
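The idea of measuring semantic relevance between two word vectors can be sketched with cosine similarity (cosine distance is 1 minus this value). The four-dimensional vectors below are toy values taken from the visible leading coordinates on the slide; real word embeddings would have N dimensions and come from a trained model, which is not shown here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


word1 = [1, 5, 6, 7]  # toy vector, truncated for illustration
word2 = [2, 4, 6, 9]  # toy vector, truncated for illustration

similarity = cosine_similarity(word1, word2)
distance = 1 - similarity
print(round(similarity, 3), round(distance, 3))
```

Two words used in similar contexts end up with a small cosine distance, which is how a customer's term and a developer's term could be matched even when they share no characters.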
R4: GENETIC ALGORITHM FOR QUERIES
[Bug report fields: Title, Description]
Method         Search Query                                           QE
Baseline       {title + description}                                  25
STRICT [140]   {tab classpath enabled buttons user entry}             86
TF-IDF         {button entry bootstrap enabled incorrectly moving}    177
GA             {open reflect tab bottom entry classpath}              01
Lower QE is better
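A genetic algorithm for query construction can be sketched as below. This is an illustrative toy, not the approach evaluated on the slide: the fitness function, the population settings and the `target` term set are all assumptions. A real setup would score each candidate query by running the search and measuring its result quality.

```python
import random


def evolve_query(vocabulary, fitness, query_len=3, pop_size=20, generations=30, seed=0):
    """Evolve a fixed-length query (list of distinct terms) that maximises `fitness`."""
    rng = random.Random(seed)
    population = [rng.sample(vocabulary, query_len) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]          # selection: keep the fitter half
        children = []
        while len(survivors) + len(children) < pop_size:
            mum, dad = rng.sample(survivors, 2)
            pool = list(dict.fromkeys(mum + dad))        # crossover: merge parents' terms
            child = rng.sample(pool, query_len)
            if rng.random() < 0.2:                       # mutation: swap in a random term
                new_term = rng.choice(vocabulary)
                if new_term not in child:
                    child[rng.randrange(query_len)] = new_term
            children.append(child)
        population = survivors + children
    return max(population, key=fitness)


# Candidate terms taken from the queries on the slide.
vocabulary = ["open", "reflect", "tab", "bottom", "entry", "classpath",
              "button", "enabled", "user", "moving"]
target = {"tab", "entry", "classpath"}  # hypothetical set of useful terms


def toy_fitness(query):
    """Toy fitness: number of query terms that fall in the hypothetical target set."""
    return len(target & set(query))


best = evolve_query(vocabulary, toy_fitness)
print(best, toy_fitness(best))
```

Keeping the fitter half of each generation means the best query found so far is never lost between generations.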
R5: STACK OVERFLOW FOR QUERY
Convert image to gray scale without losing transparency
BufferedImage, Grayscale, ImageEdit, ColorConvertOp, File, Transparency, ColorSpace, BufferedImageOp, Graphics, ImageEffects
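One way to turn a natural language query into candidate API class names from Q&A posts is sketched below. The posts, the keyword-overlap matching and the CamelCase heuristic are all assumptions made for illustration; they are not the method of any specific tool.

```python
import re
from collections import Counter

# Matches identifiers with at least two capitalised parts, e.g. BufferedImage.
CAMEL_CASE = re.compile(r"\b[A-Z][a-z0-9]+(?:[A-Z][a-z0-9]+)+\b")


def suggest_api_classes(query, posts, top_n=3):
    """Rank CamelCase names found in answers whose question title overlaps the query."""
    query_terms = set(query.lower().split())
    counts = Counter()
    for post in posts:
        title_terms = set(post["title"].lower().split())
        if query_terms & title_terms:
            counts.update(CAMEL_CASE.findall(post["answer"]))
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:top_n]]


# Hypothetical Q&A posts.
posts = [
    {"title": "convert image to grayscale in Java",
     "answer": "Use ColorConvertOp with a BufferedImage; ColorConvertOp keeps it simple."},
    {"title": "keep transparency when drawing image",
     "answer": "Create a BufferedImage of TYPE_INT_ARGB."},
    {"title": "parse json string",
     "answer": "Try JsonParser."},
]

suggestions = suggest_api_classes(
    "convert image to gray scale without losing transparency", posts)
print(suggestions)  # ['BufferedImage', 'ColorConvertOp']
```

The suggested class names can then be appended to the original query, since they are more likely to appear in source code than the words of the natural language question.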
Part 4: Q & A
Masud Rahman
Contact: masud.rahman@usask.ca
http://www.usask.ca/~masud.rahman
@masud2336
DICE, ROCCHIO, RSV
PROBABILISTIC TERM WEIGHTING
KLD

PhD Comprehensive exam of Masud Rahman

  • 1. A SYSTEMATIC LITERATURE REVIEW OF AUTOMATED QUERY REFORMULATIONS IN SOURCE CODE SEARCH Masud Rahman Department of Computer Science University of Saskatchewan, Canada Advisor: Dr. Chanchal K. Roy @masud233 6
  • 2. MASUD RAHMAN: ACADEMICS 2 2019 PhD (In Progress), University of Saskatchewan (Award: Dr. Keith Geddes Award) 2014 MSc, University of Saskatchewan (Award: Best MSc Thesis Nomination) 2009 BSc, Khulna University, Bangladesh (Award: President Gold Medal) MasudRahman,UofS
  • 3. TALK OUTLINE 3 MasudRahman,UofS Part 1: Background Concepts Part 2: Literature Review Part 4: Q & A Part 3: Future Opportunities
  • 5. 5 MasudRahman,UofS A SYSTEMATIC LITERATURE REVIEW OF AUTOMATED QUERY REFORMULATIONS IN SOURCE CODE SEARCH Source Code Search Automated Query Reformulation BACKGROUND CONCEPTS 1 2 P1 P2 P3 P4
  • 6. MCAS: A SOFTWARE BUG THAT KILLS 6 MasudRahman,UofS P1 P2 P3 P4 Boeing 737 MAX 8
  • 7. A TALE OF SOURCE CODE SEARCH 7 MasudRahman,UofS Boeing Customer MCAS Bug report Boeing Developer Code search Query Suggestion Query Reformulation Boeing Codebase P1 P2 P3 P4
  • 8. QUERY REFORMULATION: 2 WORKING CONTEXTS 8 MasudRahman,UofS Local code search (e.g., bug localization) Internet-scale code search Boeing codebase GitHub P1 P2 P3 P4
  • 9. VOCABULARY MISMATCH PROBLEM 9 MasudRahman,UofS P1 P2 P3 P4 Both are correct and wrong! Boeing Customer Boeing Developer
  • 10. QUERY REFORMULATION: 3 TYPES 10 MasudRahman,UofS Fix MCAS Bug Fix MCAS Bug Lion Air MCAS Bug Boeing 737 Max Software Issue + Query Expansion Query Reduction Query Replacement P1 P2 P3 P4
  • 11. QUERY REFORMULATION: 3 STEPS 11 MasudRahman,UofS Initial Query Code Search Relevance Feedback Feedback documents Text mining Candidate term weighting Candidate term ranking 11 1 2 22 3 P1 P2 P3 P4
  • 12. QUERY REFORMULATION: IMPLICATIONS 12 MasudRahman,UofS Automated Query Reformulation Benefits • 20% Improvement • Redefine information needs • Reduced cost & efforts Costs • Hurts good queries • Topic drifting • Difficult queries P1 P2 P3 P4
  • 13. 13 MasudRahman,UofS Part 2: Systematic Literature Review P1 P2 P3 P4
  • 14. SYSTEMATIC LITERATURE REVIEW: 6 STEPS 14 MasudRahman,UofS Research questions Search keywords Literature search Literature bulkNoise filtrationPrimary studies In-depth investigation 7 RQs P1 P2 P3 P4
  • 15. SYSTEMATIC LITERATURE REVIEW: PRIMARY STUDY SELECTION 15 MasudRahman,UofS ACM DL CrossRef DBLP Mendeley Google Scholar IEEE Xplore ProQuest ScienceDirect SpringerLink Web of Science Wiley Online Lib 2871 2317 562 Initial results Impurity removal Filter by Title 195 Filter by Abstract 93 Merging & Duplicate removal 56 Primary studies P1 P2 P3 P4 Filter by Full texts Information retrieval, IR, text retrieval, TR, bug localization, concept location, feature location, FLT, concern location, Internet- scale code search, code search engine, search engine, local code search, code search, source code search, and code search query. Query reformulation, query expansion, query reduction, query formulation, query refinement, automated query expansion, AQE, query suggestion, query recommendation, term selection, query replacement, query difficulty, query quality, keyword selection, keyword extraction, search term identification, search query, search term, and search keyword. +
  • 16. OUR RESEARCH QUESTIONS 16 MasudRahman,UofS RQ1: Which methods, algorithms and data sources have been used for automated query reformulations targeting code search in the literature? P1 P2 P3 P4 RQ2: Which methods, metrics or subject systems have been used to evaluate and validate the researches on automated query reformulations? RQ3: What are the major challenges of automated query reformulations intended for code search? How many of them have been solved to date by the literature? RQ4: How much activities of research on automated query reformulations have been performed to date? What are the venues that these researches got published at? RQ5: What are the differences and similarities between query reformulations for local code search and query reformulations for Internet-scale code search? RQ6: Which one is more appropriate among term weighting, query-term co- occurrence and thesaurus-based approaches for query keyword selection? RQ7: What are the scopes for future work in the area of automated query reformulation targeting the code search?
  • 17. RQ1: WHICH METHODS & ALGORITHMS ARE USED BY LITERATURE? 17 MasudRahman,UofS Grounded Theory • Open coding • Axial coding • Selective coding 1 2 P1 P2 P3 P4
  • 18. RQ1: WHICH ALGORITHMS & REFORMULATION TYPES ARE USED BY LITERATURE? 18 MasudRahman,UofS P1 P2 P3 P4
  • 19. RQ2: WHICH EVALUATION & VALIDATION SETTINGS ARE EMPLOYED? 19 MasudRahman,UofS P1 P2 P3 P4
  • 20. RQ3: WHAT ARE COMMON CHALLENGES & LIMITATIONS OF EXISTING LITERATURE? 20 MasudRahman,UofS Grounded Theory • Open coding • Axial coding • Selective coding 1 2 P1 P2 P3 P4
  • 21. RQ3: WHAT ARE COMMON CHALLENGES & LIMITATIONS OF EXISTING LITERATURE? 21 MasudRahman,UofS P1 P2 P3 P4
  • 22. RQ4: PUBLICATION STATS & INTERESTS ON QUERY REFORMULATION RESEARCH 22 MasudRahman,UofS P1 P2 P3 P4
  • 23. RQ5: COMPARISON BETWEEN LOCAL CODE SEARCH & INTERNET-SCALE CODE SEARCH 23 MasudRahman,UofS TW = Term Weighting, TQC = Term-Query Co-occurrence, TS = Thesaurus, ON = Ontology, SLM = Search Log Mining, ML = Machine Learning, HM = Heuristics & Miscellaneous P1 P2 P3 P4
  • 24. RQ5: COMPARISON BETWEEN LOCAL CODE SEARCH & INTERNET-SCALE CODE SEARCH 24 MasudRahman,UofS CH1 = Vocabulary Mismatch Unsolved, CH2 = Extra Burden on Developers, CH3 = Lack of Generalizability, CH4 = Lack of Practical Use, CH5 = Inappropriate Use of Tools, CH6 = Human Bias + Weak Evaluation, CH7 = Unverified Assumptions P1 P2 P3 P4
  • 25. RQ6: CHALLENGES WITH THREE KEYWORD SELECTION METHODS 25 MasudRahman,UofS Method #Study CH1 CH2 CH3 CH6 Term Weighting 22 (39%) 36% 18% 91% 50% Term-Query Co-occurrence 11 (20%) 9% 27% 64% 91% Thesaurus 17 (30%) 12% 12% 47% 41% CH1 = Vocabulary Mismatch Unsolved, CH2 = Extra Burden on Developers, CH3 = Lack of Generalizability, CH4 = Lack of Practical Use, CH5 = Inappropriate Use of Tools, CH6 = Human Bias + Weak Evaluation, CH7 = Unverified Assumptions P1 P2 P3 P4
  • 26. 26 MasudRahman,UofS Part 3: Future Opportunities P1 P2 P3 P4
  • 27. R1: KEYWORD SELECTION FROM BUG REPORT (bug report shown with Title and Description)
      ID | Query                                                    | QE
      1  | Custom search results view iresource                     | 1331
      2  | Custom search results search results view                | 636
      3  | element iresource provider level tree                    | 01
      4  | Custom search results hierarchically java search results | 570
      Lower QE is better
  • 28. R2: TERM WEIGHTING FOR SOURCE CODE $\mathrm{TF\text{-}IDF}(t) = (1 + \log f_{t,d}) \times \log\frac{D}{n_t},\ d \in D_{RF}$ • Different syntax • Different semantics • Different structures
  • 29. R3: SOLVE VOCABULARY MISMATCH ISSUE Customer Developer Past Developer Bug Report Codebase
  • 30. SOLUTION: SEMANTIC HYPERSPACE Word 1: P(1, 5, 6, 7, …, N) Word 2: P(2, 4, 6, 9, …, N) Cosine distance = Semantic relevance
  • 31. R4: GENETIC ALGORITHM FOR QUERIES (bug report shown with Title and Description)
      Method      | Search Query                                          | QE
      Baseline    | {title + description}                                 | 25
      STRICT[140] | {tab classpath enabled buttons user entry}            | 86
      TF-IDF      | {button entry bootstrap enabled incorrectly moving}   | 177
      GA          | {open reflect tab bottom entry classpath}             | 01
      Lower QE is better
  • 32. R5: STACK OVERFLOW FOR QUERY Convert image to gray scale without losing transparency BufferedImage Grayscale ImageEdit ColorConvertOp File Transparency ColorSpace BufferedImageOp Graphics ImageEffects

Editor's notes

  1. Hello everyone! Good afternoon! Thank you all for coming and attending this talk. My name is Masud Rahman. I am a PhD student in the Department of Computer Science. I work with Dr. Chanchal K. Roy. Today, I will be talking about automated query reformulations for code search.
  2. A little bit of background about me: Currently, I am a PhD student at USASK. I completed my MSc in Software Engineering at the same university in 2014. Before that, I completed my BSc in Computer Science & Engineering at Khulna University, back in 2009. I also received a couple of awards.
  3. Today, my talk will be divided into four sections. In the first section, I will provide a background overview of automated query reformulations and code search in general. In the second section, I will present a systematic literature review of automated query reformulations. In the third section, I will discuss the future research opportunities in this domain. Finally, we will have a Q&A session.
  4. Part 1: Background concepts.
  5. If we look at my talk's title, we can see two major concepts: source code search and automated query reformulation. We will go into the details shortly. But first, let's look at two recent events.
  6. You are looking at two aircraft, one from Ethiopian Airlines and one from Lion Air, Indonesia. Both are in a nose-down situation. Due to these nose-down situations, we had two fatal crashes within five months. These crashes took 346 precious human lives and cost billions of dollars. Now, the culprit is MCAS, a software component that was added to the Boeing 737 MAX 8. In summary, it is a faulty, poorly designed component that ultimately led to the crashes. That is why Boeing 737 MAX planes are grounded right now.
  7. Now, let's say a Boeing customer has submitted a bug report. A Boeing developer is then responsible for locating and repairing the faulty code that triggers the bug. As is common practice, the developer chooses a few important keywords and attempts to locate the buggy code within the Boeing codebase. But a study shows that 88% of the keywords chosen by developers are incorrect; that is, they do not return the buggy code. So, the obvious next step is to reformulate the query with automated tool support, so that the buggy code can be located. There are also tools that take a bug report and suggest appropriate search queries in the first place. Now, the developer does not only search the Boeing codebase; she might search an Internet-scale codebase such as GitHub as well.
  8. So, as discussed, code search can be done in two working contexts: in a local codebase such as Boeing's, or in a large-scale open source repository such as GitHub. Based on these contexts, query reformulation faces different challenges. The local codebase is small, domain-specific and organized. On the contrary, GitHub is huge, cross-domain and very noisy. So, yes, the two contexts need different strategies for suggesting queries.
  9. We can reformulate a search query in three ways. -- It could be query expansion by adding new keywords. -- It could be query reduction by discarding the noisy keywords. -- Or it could be total query replacement by using a new set of keywords.
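The three reformulation types can be sketched in a few lines of Python. This is only an illustration, not the method of any surveyed study: the helper names `expand_query`, `reduce_query` and `replace_query` are hypothetical, and the example keywords are taken from the "Fix MCAS Bug" slide.

```python
def expand_query(query, new_terms):
    """Query expansion: append new keywords that are not already in the query."""
    result = list(query)
    for term in new_terms:
        if term not in result:
            result.append(term)
    return result


def reduce_query(query, noisy_terms):
    """Query reduction: discard the keywords considered noisy."""
    noisy = set(noisy_terms)
    return [term for term in query if term not in noisy]


def replace_query(query, new_terms):
    """Query replacement: ignore the old keywords and use a new set."""
    return list(new_terms)


initial = ["fix", "mcas", "bug"]
expanded = expand_query(initial, ["lion", "air"])
reduced = reduce_query(initial, ["fix"])
replaced = replace_query(initial, ["boeing", "737", "max", "software", "issue"])
```

In practice, the hard part is deciding which terms to add or drop; the functions above only show what each operation does to the query once that decision has been made.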
  10. Now, there are many steps in query reformulation, but three major steps are common. First, you need to collect feedback on the given query. The query is executed, and the top-K results are collected for developer inspection. The developer marks each of them as relevant or irrelevant. This is step 1. In the second step, these annotated results are mined using various text mining tools, and candidate keywords are selected using various keyword selection methods. In the third step, the most important keywords are returned to the developer for query expansion.
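The three steps above can be sketched as a toy relevance-feedback loop. Everything here is an assumption made for illustration: the tiny corpus, the overlap-based `search` function, the simulated developer feedback, and the use of raw term frequency as the weighting scheme.

```python
from collections import Counter


def search(query, corpus, k=2):
    """Step 1: rank documents by how many query terms they contain; return top-k ids."""
    scores = {doc_id: len(set(query) & set(text.split()))
              for doc_id, text in corpus.items()}
    ranked = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return ranked[:k]


def suggest_terms(feedback_docs, corpus, query, top_n=2):
    """Steps 2 and 3: mine the feedback documents, weight candidate terms
    by frequency, and return the highest-ranked ones."""
    counts = Counter()
    for doc_id in feedback_docs:
        counts.update(t for t in corpus[doc_id].split() if t not in query)
    ranked = sorted(counts, key=lambda term: (-counts[term], term))
    return ranked[:top_n]


# Hypothetical corpus; each "document" stands for the tokens of a source file.
corpus = {
    "d1": "sensor angle attack trim stabilizer",
    "d2": "trim stabilizer sensor fault handler",
    "d3": "cabin light switch panel",
}
query = ["sensor", "fault"]

top_k = search(query, corpus, k=2)
# Simulated developer feedback: both top results are marked relevant.
relevant = [doc_id for doc_id in top_k if doc_id in {"d1", "d2"}]
suggestions = suggest_terms(relevant, corpus, query)
expanded_query = query + suggestions
```

A real tool would replace the overlap-based search with a proper retrieval engine and the raw frequency with one of the keyword selection methods discussed in the talk.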
  11. Now, there are automated query reformulations and semi-automated query reformulations. Positives: They can improve code search performance by up to 20%, which is significant. They help to redefine the information needs. Developers are often not sure which keywords to choose, and automated suggestions can help them. They also reduce the cost and effort of code search, i.e., of software maintenance tasks such as bug fixing. Negatives: Automated reformulation might degrade queries that are already good. If you already have a good one, you need to stop. There is also a chance of topic drift. That is, through reformulations, the original topic might be lost.
  12. Now, we are done with Background concepts, Part 1. Now, we are going into Part 2 -- Systematic Literature Review.
  13. A systematic literature review starts with several research questions. We ask 7 research questions in our survey about automated query reformulations. These questions are broken down into search keywords. Then we use these keywords to retrieve a bulk of literature from various publication databases. Then we perform several noise filtration steps and select a set of primary studies on automated query reformulations. Then we investigate these primary studies in depth. Now, let's take a closer look at this section.
  14. We choose 11 publication databases and collect about 3K results from these databases on automated query reformulations. Then we remove the impurities from the results. Sometimes, keyword matching can produce unexpected results. For example, these results contain studies from database management systems, multimedia retrieval or image retrieval. Since we are looking for query reformulation for code search, we only keep results for code search and discard the rest. This step leaves ~2300 results. That is still huge. Then we filter the results by title and abstract. That is, we look at the title and abstract, and determine whether they are related to code search and query reformulation or not. These steps leave us with 195 results. Then, we merge the results and remove duplicates. The topics of a few results were still not clear to us, so we read their full texts, especially the Introduction. Finally, we arrive at a collection of 56 studies after all these filtration steps. We call these the primary studies on automated query reformulations.
  15. We answer 7 research questions in our systematic survey. We answer three general questions about methodology, evaluation and challenges/limitations from the existing literature. We answer one statistical question. Then we answer three specialized questions including future research opportunities.
  16. In the first research question, we identify which underlying algorithms and methodologies are used by the existing literature. In order to do that, we used the Grounded Theory approach. It is a well-known method for qualitative research. How do we do that? Well, we read the Introduction and methodology sections of each of the 56 primary studies, and identify the algorithms and technologies used. Since we are trying to develop a theory about the existing literature, we apply the three types of coding in Grounded Theory. -- Open coding: In this stage, we describe each study with a list of appropriate key phrases. The idea is to keep an open mind and use as many key phrases as possible. -- Axial coding: In this stage, we try to make connections among different key phrases, and color-code similar phrases. We basically look for topical similarity. -- Selective coding: In this stage, we develop the underlying variables for a dependent variable. Thus, we develop a mental model of the existing literature based on our qualitative analysis of the primary studies. Then we perform various quantitative analyses using this theory.
  17. For example, we discover that seven major methodologies and algorithms are employed in query reformulation. About 40% of the studies use term weighting approaches such as TF-IDF. About 30% of the studies use a thesaurus such as WordNet for query expansion with synonyms. Besides, 50% of the studies employ various advanced heuristics and ad hoc methods for query reformulation. We also identify that 70% of the studies perform query expansion, which is the highest share. Another 15%-25% of the studies perform query reduction or replacement. The majority of the approaches do not collect any feedback during query reformulation. We also discover that 40% of the literature on query reformulation targets Internet-scale code search. The remaining studies target various code searches within a local codebase for bug localization, concept location and feature location.
  18. In the second research question, we investigate which evaluation and validation approaches were used by the existing literature on query reformulations. We found that 50% of the studies used more than two performance metrics. 50% of the studies used at least 2 subject systems for their experiments. About 38% of the studies involve developers in their experiments, and 50% of them use fewer than 16 developers. Most of the studies used some means to validate their work. 50% of the studies compare with at least 2 existing works. In terms of search queries, we found that 50% of the studies used at least 74 queries for their evaluation and validation. Now, we see that the subject systems and validation targets are not sufficient, which often leads to a lack of generalizability.
  19. In the third research question, we identify the threats and limitations of the existing literature. To do that, we consult the methodology and threats-to-validity sections of each of the primary studies. In particular, we check the threats or issues reported by the authors, and identify several further issues through inference. As in RQ1, we apply the Grounded Theory approach and identify the common challenges, issues and limitations of the existing literature.
  20. We found seven major challenges/limitations in the existing literature. The details are in the report; we are just providing the summary here. We see that 80% of the studies suffer from one or more generalizability issues. That is, they use subject systems from only a single programming platform (for example, findings from Java-based systems might not always generalize to C-based systems), the number of queries or developers involved is not sufficient, or the validation is not sufficient. We also see that 50% of the studies are affected by human bias and suffer from weak evaluation. We also found that the vocabulary mismatch problem is not solved. Well, this is a long-standing problem in any type of document search, and all query reformulation approaches attempt to solve it. But we found that 30% of the studies do a poor job of doing so. We also see that 30% of the studies impose extra cognitive burdens on the developers during query reformulation or code search.
  21. In the fourth research question, we gather statistics on the research activities conducted on automated query reformulation over the last 15 years. We see that the first work on query reformulation targeting concept location was published in 2004. Then there was some moderate activity. However, for the last 5-6 years, we see significant activity by the community. Especially since 2013, we see a major interest in this domain. In terms of venues, we see that ASE and ICSE lead, and both are A/A* conferences. So, yes, this is top-quality research. In fact, 70% of the primary studies were done in the last 5 years, which shows the promise of this domain. These are some top authors. According to our investigation, about 150 researchers have worked or are working in this domain. So, this is a well-established and promising area for research.
  22. Besides these analyses, we did a more in-depth analysis and compared local and Internet-scale code search in RQ5. We see that local code searches rely on term weighting for keyword selection, since bug reports are available. On the contrary, people use a thesaurus for query expansion in code search on the web, where no bug reports are available. More details can be found in the comprehensive paper.
  23. We also see that queries in Internet-scale code search suffer more from vocabulary mismatch issues. In this case, the developers have no supporting material to draw their queries from. So, they generally guess some keywords that define their information needs, which are often not sufficient. This leads to the vocabulary mismatch issue.
  24. In the sixth research question, we see that term weighting has some connection with the vocabulary mismatch problem. Inappropriate term weighting can choose inappropriate keywords for query reformulation. This leads to noise in the query and degraded performance. On the contrary, thesaurus and term-query co-occurrence approaches attempt to deliver synonyms or similar words. They create comparatively fewer vocabulary mismatch issues.
  25. OK! Now we are done with the literature survey. Now, we will focus on the third part, the future research opportunities.
  26. Let us see an example. This is a bug report; this is the title and this is the description. Now, developer JOE would use this bug report to localize the bug in the source code, and he chose some ad hoc queries. Which one do you think is the best here? PAUSE! Well, let's see. This one returns the correct result at this position. That means the developer needs to check 1300+ results before reaching the correct result if he tries this query. … oh… this one is the best. So, selecting appropriate keywords from the bug report is not that simple.
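The QE column in the example can be read as the rank of the first correct result in the returned list, which is how the note above describes it. A minimal sketch under that assumption follows; the function name `query_effectiveness` and the file names are hypothetical.

```python
def query_effectiveness(ranked_results, ground_truth):
    """Return the 1-based rank of the first correct result, or None if no
    correct result is returned. A lower value means a better query."""
    for rank, result in enumerate(ranked_results, start=1):
        if result in ground_truth:
            return rank
    return None


# Hypothetical search results and buggy file for one query.
results = ["ViewPart.java", "TreeProvider.java", "SearchResultPage.java"]
buggy_files = {"SearchResultPage.java"}

qe = query_effectiveness(results, buggy_files)
missing = query_effectiveness(results, {"Other.java"})
```

With this reading, a QE of 1 means the developer finds the buggy code at the very top of the list, while a QE of 1331 means 1330 irrelevant results come first.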
  27. Now, this is a metric that has been in play since the last century. It was proposed in the 70s. It is a good metric, but it was actually proposed for regular texts such as news articles. On the other hand, we are dealing with source code here. Now, regular texts and source code have different semantics and different structures. They are not the same. So, metrics for regular texts are not appropriate for source code; this is our hypothesis.
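For reference, a classic TF-IDF weight can be computed as below. This sketch assumes the common log-scaled variant, (1 + log tf) × log(N / df); the exact variant on the slide may differ, and the toy corpus is made up for illustration.

```python
import math


def tf_idf(term, doc_tokens, corpus):
    """Log-scaled TF-IDF of `term` in one document.

    tf = occurrences of the term in the document,
    df = number of documents in the corpus that contain the term.
    """
    tf = doc_tokens.count(term)
    df = sum(1 for tokens in corpus if term in tokens)
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log(tf)) * math.log(len(corpus) / df)


# Hypothetical corpus of three tokenized documents.
corpus = [
    "trim stabilizer trim sensor".split(),
    "sensor fault handler".split(),
    "cabin light switch".split(),
]
doc = corpus[0]
scores = {term: tf_idf(term, doc, corpus) for term in set(doc)}
ranked = sorted(scores, key=lambda term: (-scores[term], term))
```

The weight only looks at how often a word occurs and in how many documents. It knows nothing about syntax or code structure, which is the limitation the hypothesis above points at.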
  28. That is, each of the three people, the customer, the past developer and JOE, has their own vocabulary to describe a certain problem or concept. In fact, the probability that any two people describe the same problem with the same vocabulary is only 15%-20%. So, naturally, developer JOE finds it a great challenge to make a connection between the bug report and the buggy code. This costs development time, money and valuable effort.
  29. Here we see that burger is close to sandwich. Why? They are eaten together. I do that all the time. Well, that is not the case. They are mentioned in similar contexts by people across the whole corpus. The model recognizes such occurrences and thus puts burger and sandwich close together. Similarly, dumpling and ramen are close to each other. Now, we propose this. This is the original query, and this is the reformulated query. A good reformulated query will cluster together with the original query. A bad reformulated query will NOT be able to cluster with the original query. So, clustering tendency within the hyperspace is our weapon here. We use the Hopkins statistic and Polygon Area to calculate the clustering tendency.
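The notion of closeness in the hyperspace can be illustrated with cosine similarity between word vectors. The three-dimensional vectors below are invented for the example; real word embeddings have many more dimensions and are learned from a corpus.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings: words used in similar contexts get similar vectors.
vectors = {
    "burger": [0.9, 0.1, 0.0],
    "sandwich": [0.8, 0.2, 0.1],
    "ramen": [0.1, 0.9, 0.2],
}

burger_sandwich = cosine_similarity(vectors["burger"], vectors["sandwich"])
burger_ramen = cosine_similarity(vectors["burger"], vectors["ramen"])
```

Here `burger_sandwich` comes out much higher than `burger_ramen`, which mirrors the burger and sandwich example: a higher cosine similarity is read as stronger semantic relevance.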
  30. Now, I am not going to discuss those studies in detail. But here is a glimpse. Developers generally look for relevant code on the web using natural language queries. Please note that we are not talking about simple web search, but about source code repositories such as GitHub. Now, GitHub provides this result. You see, it tries to match the query keywords with comments and identifiers. But we are dealing with source code, right? So, we need a source-code-friendly query for a better result. So, we identify relevant API classes for this natural language query through extensive data mining and data analytics. And once again, Stack Overflow is our friend in this grand challenge.
  31. And, I am done with my talk. Thanks a lot for your attention. Now, I am ready to take some questions.