1. A SYSTEMATIC LITERATURE REVIEW OF
AUTOMATED QUERY REFORMULATIONS IN
SOURCE CODE SEARCH
Masud Rahman
Department of Computer Science
University of Saskatchewan, Canada
Advisor: Dr. Chanchal K. Roy
@masud233
2. MASUD RAHMAN: ACADEMICS
2019: PhD (in progress), University of Saskatchewan (Award: Dr. Keith Geddes Award)
2014: MSc, University of Saskatchewan (Award: Best MSc Thesis Nomination)
2009: BSc, Khulna University, Bangladesh (Award: President Gold Medal)
MasudRahman,UofS
5. BACKGROUND CONCEPTS
Two major concepts: (1) Source Code Search, and (2) Automated Query Reformulation.
6. MCAS: A SOFTWARE BUG THAT KILLS
Boeing 737 MAX 8
7. A TALE OF SOURCE CODE SEARCH
[Diagram: a Boeing customer submits an MCAS bug report; a Boeing developer performs code search over the Boeing codebase, supported by query suggestion and query reformulation.]
8. QUERY REFORMULATION: 2 WORKING CONTEXTS
Local code search (e.g., bug localization in the Boeing codebase) vs. Internet-scale code search (e.g., GitHub).
14. SYSTEMATIC LITERATURE REVIEW: 6 STEPS
Research questions (7 RQs) → Search keywords → Literature search → Noise filtration (on the literature bulk) → Primary studies → In-depth investigation
15. SYSTEMATIC LITERATURE REVIEW: PRIMARY
STUDY SELECTION
Eleven publication databases: ACM DL, CrossRef, DBLP, Mendeley, Google Scholar, IEEE Xplore, ProQuest, ScienceDirect, SpringerLink, Web of Science, and Wiley Online Library.

Selection pipeline: Initial results (2,871) → Impurity removal (2,317) → Filter by title (562) → Filter by abstract (195) → Filter by full texts (93) → Merging & duplicate removal → 56 primary studies.

Search keywords (code search): Information retrieval, IR, text retrieval, TR, bug localization, concept location, feature location, FLT, concern location, Internet-scale code search, code search engine, search engine, local code search, code search, source code search, and code search query.

Search keywords (query reformulation): Query reformulation, query expansion, query reduction, query formulation, query refinement, automated query expansion, AQE, query suggestion, query recommendation, term selection, query replacement, query difficulty, query quality, keyword selection, keyword extraction, search term identification, search query, search term, and search keyword.
16. OUR RESEARCH QUESTIONS
RQ1: Which methods, algorithms, and data sources have been used for automated query reformulations targeting code search in the literature?
RQ2: Which methods, metrics, or subject systems have been used to evaluate and validate research on automated query reformulations?
RQ3: What are the major challenges of automated query reformulations intended for code search? How many of them have been solved to date by the literature?
RQ4: How much research activity on automated query reformulations has been performed to date? At which venues has this research been published?
RQ5: What are the differences and similarities between query reformulations for local code search and query reformulations for Internet-scale code search?
RQ6: Which of term weighting, term-query co-occurrence, and thesaurus-based approaches is most appropriate for query keyword selection?
RQ7: What are the scopes for future work in the area of automated query reformulation targeting code search?
17. RQ1: WHICH METHODS & ALGORITHMS ARE
USED BY LITERATURE?
Grounded Theory:
• Open coding
• Axial coding
• Selective coding
18. RQ1: WHICH ALGORITHMS & REFORMULATION TYPES
ARE USED BY LITERATURE?
19. RQ2: WHICH EVALUATION & VALIDATION
SETTINGS ARE EMPLOYED?
20. RQ3: WHAT ARE COMMON CHALLENGES &
LIMITATIONS OF EXISTING LITERATURE?
Grounded Theory:
• Open coding
• Axial coding
• Selective coding
21. RQ3: WHAT ARE COMMON CHALLENGES & LIMITATIONS OF
EXISTING LITERATURE?
22. RQ4: PUBLICATION STATS & INTERESTS ON QUERY
REFORMULATION RESEARCH
23. RQ5: COMPARISON BETWEEN LOCAL CODE SEARCH &
INTERNET-SCALE CODE SEARCH
TW = Term Weighting, TQC = Term-Query Co-occurrence, TS = Thesaurus, ON = Ontology, SLM = Search Log Mining, ML = Machine Learning, HM = Heuristics & Miscellaneous
24. RQ5: COMPARISON BETWEEN LOCAL CODE SEARCH &
INTERNET-SCALE CODE SEARCH
CH1 = Vocabulary Mismatch Unsolved, CH2 = Extra Burden on Developers, CH3 = Lack of Generalizability, CH4 = Lack of Practical Use, CH5 = Inappropriate Use of Tools, CH6 = Human Bias + Weak Evaluation, CH7 = Unverified Assumptions
25. RQ6: CHALLENGES WITH THREE KEYWORD
SELECTION METHODS
Method                      #Study     CH1   CH2   CH3   CH6
Term Weighting              22 (39%)   36%   18%   91%   50%
Term-Query Co-occurrence    11 (20%)    9%   27%   64%   91%
Thesaurus                   17 (30%)   12%   12%   47%   41%

CH1 = Vocabulary Mismatch Unsolved, CH2 = Extra Burden on Developers, CH3 = Lack of Generalizability, CH6 = Human Bias + Weak Evaluation
27. R1: KEYWORD SELECTION FROM BUG REPORT

Bug report (Title + Description), with candidate queries and their Query Effectiveness (QE):

ID  Query                                                      QE
1.  Custom search results view iresource                       1331
2.  Custom search results search results view                  636
3.  element iresource provider level tree                      01
4.  Custom search results hierarchically java search results   570

Lower QE is better.
28. R2: TERM WEIGHTING FOR SOURCE CODE
TF-IDF(t) = (1 + log(f_t,d)) × log(D / df_t)

where f_t,d is the frequency of term t in document d, D is the total number of documents, and df_t is the number of documents containing t.

Regular text vs. source code:
• Different syntax
• Different semantics
• Different structures
Hello everyone! Good afternoon!
Thank you all for coming and attending this talk.
My name is Masud Rahman. I am a PhD student in the Department of Computer Science.
I work with Dr. Chanchal K. Roy.
Today, I will be talking about automated query reformulations for code search.
A little bit of background about me:
Currently, I am a PhD student at USask.
I completed my MSc in Software Engineering at the same university in 2014.
Before that, I completed my BSc in Computer Science & Engineering at Khulna University, back in 2009.
I received a couple of awards along the way.
Today, my talk will be divided into four sections.
In the first section, I will provide a background overview on automated query reformulations & code search in general.
In the second section, I will present a systematic literature review of automated query reformulations.
In the third section, I will discuss the future research opportunities in this domain.
Finally, we will have a Q&A session.
Part 1: Background concepts.
If we look at my talk’s title, we can see two major concepts.
Source code search
Automated query reformulations.
Now we will go into the details. But first, let's look at two recent events.
You are looking at two aircraft: Ethiopian Airlines and Lion Air Indonesia.
Both are in a nose-down situation. Due to these nose-down situations, we had two fatal crashes in a single calendar year.
These crashes took 346 precious human lives and cost billions of dollars.
Now, the culprit is MCAS. This is a software component that was added to the Boeing 737 MAX 8.
In summary, it was a faulty, poorly designed component, and it ultimately led to the crashes.
That is why Boeing 737 MAX planes are grounded right now.
Now, let's say a Boeing customer has submitted a bug report.
A Boeing developer is then responsible for locating and repairing the faulty code that triggers the bug.
As a common practice, the developer chooses a few important keywords and attempts to locate the buggy code within the Boeing codebase.
But studies show that 88% of the keywords chosen by developers are incorrect. That is, they do not return the buggy code.
So the obvious next step is to reformulate the query with automated tool support, so that the buggy code can be located.
There are also tools that take a bug report and suggest appropriate search queries in the first place.
Now, the developer does not only search the Boeing codebase; she might also search an Internet-scale codebase such as GitHub.
So, as discussed, code search happens in two working contexts.
It could be in a local codebase, such as Boeing's.
It could also be in a large-scale open-source repository such as GitHub.
Now, based on these contexts, query reformulation faces different challenges.
The local codebase is small, domain-specific, and organized.
On the contrary, GitHub is huge, cross-domain, and very noisy.
So yes, they need different strategies for suggesting queries.
We can reformulate a search query in three ways.
-- It could be query expansion by adding new keywords.
-- It could be query reduction by discarding the noisy keywords.
-- Or it could be total query replacement by using a new set of keywords.
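As a toy illustration, the three strategies can be sketched as simple list operations on a keyword query; the keywords and function names below are made up for illustration, not taken from any real tool.

```python
def expand(query, new_terms):
    """Query expansion: append new keywords to the original query."""
    return query + [t for t in new_terms if t not in query]

def reduce_query(query, noisy_terms):
    """Query reduction: discard keywords judged to be noisy."""
    return [t for t in query if t not in noisy_terms]

def replace_query(query, new_terms):
    """Query replacement: adopt an entirely new set of keywords."""
    return list(new_terms)

query = ["custom", "search", "results", "view"]
print(expand(query, ["iresource", "provider"]))
print(reduce_query(query, ["custom", "view"]))
print(replace_query(query, ["java", "search", "results"]))
```

Real reformulation tools decide which terms to add or drop using the keyword selection methods discussed later; here the decision is handed in as an argument.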
Now, there are many steps in query reformulation.
But three major steps are common.
First, you need to collect feedback on the given query: the query is executed, and the top-K results are collected for developer inspection.
The developer marks whether each result looks relevant or irrelevant. This is Step I.
In the second step, these annotated results are mined using various text mining tools, and candidate keywords are selected using various keyword selection methods.
In the third step, the most important keywords are returned to the developer for query expansion.
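The three-step feedback loop can be sketched as follows. The retrieval function and keyword selector are deliberately trivial (keyword overlap and raw term frequency) and stand in for the real IR machinery; all names and documents are illustrative.

```python
from collections import Counter

def search(query, corpus, k=3):
    # Step I: execute the query and collect the top-K results
    # (ranked here by naive keyword overlap) for developer inspection.
    ranked = sorted(corpus, key=lambda doc: -len(set(query) & set(doc.split())))
    return ranked[:k]

def select_keywords(relevant_docs, query, n=2):
    # Step II: mine the annotated results and pick candidate keywords,
    # here simply the most frequent terms not already in the query.
    counts = Counter(t for doc in relevant_docs
                     for t in doc.split() if t not in query)
    return [t for t, _ in counts.most_common(n)]

def reformulate(query, corpus):
    # Step III: return the most important keywords to the developer
    # as a query expansion.
    top = search(query, corpus)   # in practice the developer marks
    relevant = top                # these as relevant or irrelevant
    return query + select_keywords(relevant, query)

corpus = ["search results view tree",
          "iresource provider search results",
          "unrelated text here"]
print(reformulate(["search", "results"], corpus))
```

In a pseudo-relevance-feedback setting, as here, the top results are assumed relevant without developer annotation; explicit-feedback variants would replace the `relevant = top` line with the developer's judgments.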
Now, there are automated query reformulations and semi-automated query reformulations.
Positives:
It can improve code search performance by up to 20%, which is significant.
It helps to refine the information needs. Developers are often unsure which keywords to choose; automated suggestions can help them.
It also reduces the cost and effort of code search, i.e., of software maintenance tasks such as bug fixing.
Negatives:
Automated reformulation might degrade already good queries. If you already have a good one, you need to stop.
There is also a risk of topic drift. That is, through reformulations, the original topic might be lost.
Now, we are done with Background concepts, Part 1.
Now, we are going into Part 2 -- Systematic Literature Review.
A systematic literature review starts with several research questions.
We ask 7 research questions in our survey about automated query reformulations.
These questions are broken down into search keywords.
Then we use these keywords to retrieve a bulk of literature from various publication databases.
Then we perform several noise filtration steps and select a set of primary studies on automated query reformulations.
Then we do an in-depth investigation of these primary studies.
Now, let's take a closer look at this process.
We chose 11 publication databases and collected about 3K results on automated query reformulations.
Then we removed the impurities from the results. Sometimes keyword matching can produce unexpected results.
For example, these results contained studies from database management systems, multimedia retrieval, or image retrieval.
Since we are looking for query reformulation for code search, we kept only the results about code search and discarded the rest.
This step provided ~2,300 results. That is still huge.
Then we filtered the results by title and abstract. That is, we looked at the title and abstract and determined whether they were related to code search and query reformulations or not.
These steps provided us with 195 results.
Then we did merging and duplicate removal. The topics of a few results were still not clear to us.
We thus read their full texts, especially the Introduction sections.
Finally, we reached a collection of 56 studies after all these filtration steps.
We call these studies the primary studies on automated query reformulations.
We answer 7 research questions in our systematic survey.
We answer three general questions about methodology, evaluation, and challenges/limitations of the existing literature.
We answer one statistical question.
Then we answer three specialized questions, including one on future research opportunities.
In the first research question, we identify which underlying algorithms and methodologies are used by the existing literature.
To do that, we used the Grounded Theory approach, a well-known method for qualitative research.
How do we do that?
Well, we read the Introduction and Methodology sections of each of the 56 primary studies and identify the algorithms and technologies used.
Since we are trying to develop a theory about the existing literature, we apply the three types of coding in Grounded Theory.
-- Open coding: In this stage, we describe each study with a list of appropriate key phrases. The idea is to keep an open mind and use as many key phrases as possible.
-- Axial coding: In this stage, we make connections among different key phrases and color-code similar phrases. We basically look for topical similarity.
-- Selective coding: In this stage, we identify the underlying variables for a dependent variable.
Thus, we develop a mental model of the existing literature based on our qualitative analysis of the primary studies.
Then we perform various quantitative analyses using this theory.
For example, we discover that seven major methodologies and algorithms are employed in query reformulations.
About 40% of the studies use term weighting approaches such as TF-IDF.
About 30% of the studies use a thesaurus, such as WordNet, for query expansion with synonyms.
Besides, 50% of the studies employ various advanced heuristics and ad hoc methods for query reformulation.
We also identify that 70% of the studies do query expansion, which is the highest share.
Another 15%–25% of the studies do query reduction or replacement.
The majority of the approaches do not collect any feedback during query reformulation.
We also discover that 40% of the literature on query reformulation targets Internet-scale code search.
The remaining studies target code search within a local codebase, for bug localization, concept location, and feature location.
In the second research question, we investigate which evaluation and validation approaches were used by the existing literature on query reformulations.
We found that 50% of the studies used more than two performance metrics.
50% of the studies used at least 2 subject systems in their experiments.
About 38% of the studies involved developers in their experiments, and 50% of those used fewer than 16 developers.
Most of the studies used some means to validate their work; 50% compared against at least 2 existing works.
In terms of search queries, we found that 50% of the studies used at least 74 queries for their evaluation and validation.
Now, we see that the subject systems and validation targets are often insufficient, which leads to a lack of generalizability.
In the third research question, we identify the threats and limitations of the existing literature.
For that, we consult the Methodology and Threats to Validity sections of each of the primary studies.
In particular, we check the threats and issues reported by the authors, and we identify several more through inference.
As in RQ1, we apply the Grounded Theory approach and identify the common challenges, issues, and limitations of the existing literature.
We found seven major challenges/limitations. The details are in the report; we provide just the summary here.
We see that 80% of the studies suffer from one or more generalizability issues.
That is, they use subject systems from only a single programming platform (for example, findings from Java-based systems might not always generalize to C-based systems), the number of queries or developers involved is insufficient, or the validation is not thorough enough.
We also see that 50% of the studies are affected by human bias and suffer from weak evaluation.
We also found that the vocabulary mismatch problem is not solved. This is a long-standing problem in any type of document search, and all query reformulation approaches attempt to solve it, but we found that 30% of the studies do a poor job of doing so.
We also see that 30% of the studies impose extra cognitive burdens on developers during query reformulation and code search.
In the fourth research question, we gather statistics on the research activities conducted on automated query reformulation over the last 15 years.
We see that the first work on query reformulation, targeting concept location, was published in 2004.
Then there was some moderate activity. However, over the last 5–6 years, we see significant activity from the community.
Especially since 2013, there has been major interest in this domain.
In terms of venues, we see that ASE and ICSE, which are A/A* conferences, are the pioneers. So yes, this is top-quality research.
In fact, 70% of the primary studies were published in the last 5 years, which shows the promise of this domain.
These are some of the top authors. According to our investigation, about 150 researchers have worked or are working in this domain.
So this is a well-established and promising area for research.
Beyond these analyses, we did a more in-depth analysis and compared local and Internet-scale code search in RQ5.
We see that local code search involves term weighting for keyword selection, since bug reports are available.
On the contrary, people use thesauri for query expansion in code search on the web, where no bug reports exist.
More details can be found in the comprehensive paper.
We also see that queries in Internet-scale code search suffer more from vocabulary mismatch issues.
In this case, the developers have no materials from which to derive their queries.
So they generally guess some keywords that describe their information needs, which are often insufficient.
This leads to the vocabulary mismatch issue.
In the sixth research question, we see that term weighting has some connection with the vocabulary mismatch problem.
Inappropriate term weighting can choose inappropriate keywords for query reformulation.
This leads to noise in the query and degraded performance.
On the contrary, thesaurus and term-query co-occurrence approaches attempt to deliver synonyms or similar words.
They create comparatively fewer vocabulary mismatch issues.
OK! Now we are done with the literature survey.
Now, we will focus on the third part, the future research opportunities.
Let us see an example.
This is a bug report: this is the title, and this is the description.
Now, a developer, Joe, would use this bug report to localize the bug in the source code.
He chose some ad hoc queries.
Which one do you think is the best here? (Pause.)
Well, let's see. This one returns the correct result at this position. That means the developer needs to check 1,300+ results before reaching the correct one if he tries this query.
And this one is the best.
So selecting appropriate keywords from a bug report is not that simple.
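The QE measure from the earlier slide is simply the rank of the first correct (buggy) document in the result list, so it is easy to sketch; the file names below are made up.

```python
def query_effectiveness(results, buggy_docs):
    """Return the 1-based rank of the first buggy document in the
    ranked result list, or None if no buggy document was retrieved.
    Lower is better."""
    for rank, doc in enumerate(results, start=1):
        if doc in buggy_docs:
            return rank
    return None

ranked = ["FileA.java", "FileB.java", "SearchView.java", "FileC.java"]
print(query_effectiveness(ranked, {"SearchView.java"}))
```

Under this measure, the query with QE = 01 on the slide is the clear winner: the developer inspects one result instead of 1,300+.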
Now, this is a metric that has been in play since the last century; it was proposed in the 1970s.
It is a good metric, but it was actually proposed for regular texts such as news articles.
On the other hand, we are dealing with source code here.
Regular text and source code have different semantics and different structures.
They are not the same.
So our hypothesis is that metrics designed for regular text are not appropriate for source code.
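The metric in question is TF-IDF; a minimal sketch of one common sublinear-TF variant, TF-IDF(t, d) = (1 + log f_t,d) × log(D / df_t), is below. The toy documents are illustrative.

```python
import math

def tf_idf(term, doc_tokens, corpus):
    """TF-IDF(t, d) = (1 + log f_t,d) * log(D / df_t), where f_t,d is the
    raw count of the term in the document, D the corpus size, and df_t
    the number of documents containing the term."""
    f_td = doc_tokens.count(term)
    if f_td == 0:
        return 0.0
    df_t = sum(1 for d in corpus if term in d)
    return (1 + math.log(f_td)) * math.log(len(corpus) / df_t)

corpus = [["search", "view"], ["search", "tree"], ["iresource", "provider"]]
print(tf_idf("iresource", ["iresource", "provider"], corpus))
print(tf_idf("search", ["search", "view"], corpus))
```

Note how "search", appearing in two of the three documents, scores lower than the rarer "iresource"; the hypothesis above is that such frequency statistics, tuned for regular text, miss the structural signals present in source code.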
That is, the customer, the past developer, and Joe each have their own vocabulary to describe a certain problem or concept.
In fact, the probability that any two people will describe the same problem with the same vocabulary is only 15%–20%.
So, naturally, developer Joe finds it a great challenge to make a connection between the bug report and the buggy code.
This costs development time, money, and valuable effort.
Here we see that burger is close to sandwich. Why? Because they are eaten together? I do that all the time.
Well, that is not the reason.
They are mentioned in similar contexts by people across the whole corpus.
The model recognizes such co-occurrences and thus puts burger and sandwich close together.
Similarly, dumpling and ramen are close to each other.
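A toy illustration of that idea: words used in similar contexts receive nearby vectors, so their cosine similarity is high. The 3-dimensional vectors below are invented for illustration, not trained embeddings.

```python
import math

# Made-up vectors standing in for trained word embeddings.
VECS = {
    "burger":    [0.90, 0.80, 0.10],
    "sandwich":  [0.85, 0.75, 0.20],
    "iresource": [0.10, 0.20, 0.90],
}

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

print(cosine(VECS["burger"], VECS["sandwich"]))   # high: similar contexts
print(cosine(VECS["burger"], VECS["iresource"]))  # low: unrelated contexts
```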
Now, we propose this: this is the original query, and this is the reformulated query.
A good reformulated query will cluster together with the original query.
A bad reformulated query will NOT cluster with the original query.
So clustering tendency within the hyperspace is our weapon here.
We calculated the Hopkins statistic and the polygon area to measure this clustering tendency.
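A simplified sketch of the Hopkins statistic for 2-d points follows, under the usual convention that H near 1.0 indicates strong clustering tendency and H near 0.5 indicates uniformly random data. This is an illustration, not the implementation used in the study.

```python
import math
import random

def _nn_dist(p, points):
    # Nearest-neighbor distance of p within points, excluding p itself.
    return min(math.dist(p, q) for q in points if q is not p)

def hopkins(data, m=10, seed=0):
    """Hopkins statistic H = U / (U + W) for 2-d points, where U sums
    nearest-data distances of m uniform random probes over the bounding
    box, and W sums nearest-neighbor distances of m sampled data points."""
    rng = random.Random(seed)
    xs, ys = zip(*data)
    sample = rng.sample(data, m)
    probes = [(rng.uniform(min(xs), max(xs)),
               rng.uniform(min(ys), max(ys))) for _ in range(m)]
    u = sum(min(math.dist(p, q) for q in data) for p in probes)
    w = sum(_nn_dist(p, data) for p in sample)
    return u / (u + w)
```

Applied to query terms embedded in the vector space, a reformulated query whose terms yield a high H relative to the original query's terms exhibits the clustering tendency described above.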
Now, I am not going to discuss those studies in detail.
But here is a glimpse.
Developers generally look for relevant code on the web using natural language queries.
Please note that we are not talking about simple web search; rather, we are talking about source code repositories such as GitHub.
Now, GitHub provides this result. You see, it tries to match the query keywords with comments and identifiers.
But we are dealing with source code, right? So we need a source-code-friendly query for a better result.
So we identify relevant API classes for this natural language query through extensive data mining and data analytics.
And once again, Stack Overflow is our friend in this grand challenge.
And with that, I am done with my talk.
Thanks a lot for your attention.
Now, I am ready to take some questions.