QUICKAR-ASE2016-Singapore
1. QUICKAR: AUTOMATIC QUERY
REFORMULATION FOR CONCEPT
LOCATION USING CROWDSOURCED
KNOWLEDGE
Mohammad Masudur Rahman, Chanchal K. Roy
Department of Computer Science
University of Saskatchewan, Canada
31st IEEE/ACM International Conference on
Automated Software Engineering (ASE 2016), Singapore
3. CONCEPT LOCATION: REFORMULATION OF
CODE SEARCH QUERIES
[Figure: concept location workflow. A software user submits a change request to the issue database; a software developer forms an initial query and runs it through a code search engine over the code repository; if the search results are satisfactory she stops, otherwise she reformulates the query.]
Only 12.2% of the queries by developers were useful (Kevic and Fritz, ICSE 2014).
Vocabulary matched only 10%-15% of the time (Furnas et al., Commun. ACM, 1987).
Our research problem
4. QUERY REFORMULATION LITERATURE
Relevance Feedback
Gay et al, ICSM 2009
Haiduc et al, ICSE 2013
Query Quality Analysis
Haiduc et al, ASE 2012
Haiduc et al, ICSE 2012
Haiduc and Marcus, ICPC 2011
Query Context Analysis
Howard et al, MSR 2013
Yang and Tan, MSR 2012
Kevic and Fritz, MSR 2014
Our work:
QUICKAR
5. SEMANTIC SIMILARITY: USING ADJACENCY
TERM LIST
ID Stack Overflow Question Title
6470651 Creating a memory leak with Java
4948521 Easiest way to cause memory leak in Java?
1071631 Tracking down a memory leak/ garbage-collection issue in Java
Table: Duplicate questions from Stack Overflow

Term Adjacency List
Create (T1): {Memory, leak, Java}
Cause (T2): {Easiest, way, memory, leak, Java}
Track (T3): {Down, memory, leak, garbage, collection, issues, Java}
T1 ∩ T2 ≠ ∅, T2 ∩ T3 ≠ ∅, T1 ∩ T3 ≠ ∅ ⇒ T1 ≡ T2 ≡ T3
6. QUICKAR: PROPOSED TECHNIQUE FOR
QUERY REFORMULATION
QUICKAR
Construction of
adjacency list database
Reformulation of
initial search query
7. CONSTRUCTION OF
ADJACENCY LIST DATABASE
Stack Overflow questions [input] → Question title → Natural language preprocessing → Natural language tokens → Term adjacency analysis → Adjacency list database [output]
Easiest way to cause memory leak in Java → Easiest way cause memory leak Java

Term  Adjacent list
T1    {T2, T3, T6}
T2    {T1, T4}
T3    {T1, T5, T6}
11. EXPERIMENTAL RESULTS
Technique                 Improved   Worsened   Preserved
Baseline (preprocessed)   17.84%     9.90%      72.27%
QUICKAR_P                 49.15%     48.41%     2.44%
QUICKAR_SO                47.83%     49.91%     2.27%
QUICKAR_red               55.55%     24.46%     19.99%
QUICKAR_ALL               66.54%     23.65%     9.81%
Both QUICKAR_P and QUICKAR_SO perform almost equally.
QUICKAR_red clearly dominates the others.
The results from QUICKAR_P, QUICKAR_SO, and QUICKAR_red overlap, but each succeeds for 21%-42% of the queries uniquely.
Combining all three maximizes the results.
14. TAKE-HOME MESSAGE
Only 12.2% of search queries from the developers
are relevant for code change tasks, i.e., vocabulary
mismatch is a great concern.
Automatic query reformulation is essential.
Relevance feedback from the project source might not always be sufficient.
Queries can be reformulated effectively using
Stack Overflow information.
QUICKAR combines vocabulary from project
source and Stack Overflow questions.
Empirical evaluation and validation demonstrate the potential of our technique.
Hello everyone.
My name is Mohammad Masudur Rahman.
I am a PhD student from University of Saskatchewan, Canada.
I work with Dr. Chanchal Roy.
Today, I am going to talk on query reformulation for concept location where we used crowdsourced knowledge.
Concept location is a systematic process that maps concepts from a natural language text to a software engineering artifact.
For example, if the natural language item is a software change request, then the software engineering artifact will be the source code.
During software maintenance, such as feature implementation or bug fixing, such mapping is frequently done by the developers.
We provide automatic support in such mapping task.
Once a software user submits a change request (which could be a feature request or a bug report) to the issue database, a developer is assigned to that change task.
Now, what does a developer do? She selects some initial keywords to find the relevant source code that must be changed to resolve the submitted issue.
If the search does not work with that query, she reformulates that query.
Now existing studies have shown that developers face difficulties in selecting appropriate search terms. In fact, only 12.2% of the search terms were relevant according to an existing study.
This is mostly because of the vocabulary mismatch problem. That is, the concept expressed in the change request is also expressed in the source code, but using a different vocabulary.
Therefore, reformulation is always challenging; we tackle that research problem in this ongoing work.
Now, there are several works that use relevance feedback for query reformulation.
One limitation of these works is that collecting relevance feedback from developers is time-consuming and not always possible.
Haiduc et al. focus on query-quality-based reformulation, where they coin the term "query difficulty."
That is, they determine the quality of a query using linguistic and statistical analysis.
There is another group of studies that consider context of a query term in the source code.
However, this is also limited since it depends on the documentation quality of the code: if the code does not contain enough comments, these techniques will not work.
Our work is basically a combination of relevance feedback and query context.
More importantly, we add another data source, Stack Overflow, to overcome the limitations of both groups of techniques.
In Stack Overflow, there are many questions that are duplicates or deal with similar types of issues.
Developers often volunteer in marking such duplicate questions.
For example, these are duplicate or very closely related questions, and they all focus on the memory leak problem.
Now, if we consider three terms, Create, Cause, and Track, we will get these adjacent word lists.
Now, as the proverb says, "A person is known by the friends he or she keeps."
We apply that idea here. That is, we determine the semantic similarity or relatedness between any two terms by comparing their adjacent word lists.
For example, Create, Cause, and Track all share some adjacent words collected from those questions.
That means these words are semantically related to one another.
Since we are interested in query reformulation, we can apply such semantically similar/related terms for query reformulation.
This can help us beat the vocabulary mismatch problem.
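The idea can be sketched with a small example. The Dice-style overlap used here is an illustrative choice of relatedness measure, not necessarily the exact one from the paper; the adjacent-word sets come from the duplicate question titles shown earlier:

```python
# Adjacent-word lists for Create (T1), Cause (T2), and Track (T3),
# taken from the duplicate Stack Overflow question titles.
create = {"memory", "leak", "java"}
cause = {"easiest", "way", "memory", "leak", "java"}
track = {"down", "memory", "leak", "garbage", "collection", "issues", "java"}

def adjacency_similarity(a, b):
    """Relatedness of two terms as the normalized overlap
    (Dice coefficient) of their adjacent-word sets."""
    if not a or not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

print(adjacency_similarity(create, cause))  # 0.75: strongly related
print(adjacency_similarity(create, track))  # 0.6: strongly related
```

Terms that never co-occur with the query's vocabulary score zero, which is how the measure separates related terms from unrelated ones.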
So, we propose our query reformulation technique, QUICKAR, which uses Stack Overflow questions for query reformulation.
It has two major steps:
1. Construction of the adjacency list database.
2. Reformulation of an initial search query.
First, the construction of the adjacency list database.
We select 500K questions from Stack Overflow, and collect their titles.
We perform natural language preprocessing, and consider each title as an ordered list of tokens.
Then we consider a window size of two, and capture an adjacent word list for each term from each question.
Finally, we get a database like this, where each term has an adjacent word list from Stack Overflow.
We use this database for determining the semantic similarity between any two terms.
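A minimal sketch of this construction step, assuming a simplified preprocessing (lowercasing, alphabetic tokens, a toy stop-word list) in place of the paper's actual pipeline:

```python
import re
from collections import defaultdict

# Toy stop-word list; the real pipeline would use a full one.
STOP_WORDS = {"a", "an", "the", "to", "in", "of", "with"}

def preprocess(title):
    """Minimal stand-in for the natural language preprocessing:
    lowercase, keep alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", title.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def build_adjacency_db(titles, window=2):
    """For each term, record the words occurring within `window`
    positions of it across all question titles."""
    db = defaultdict(set)
    for title in titles:
        tokens = preprocess(title)
        for i, term in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    db[term].add(tokens[j])
    return db

titles = [
    "Creating a memory leak with Java",
    "Easiest way to cause memory leak in Java?",
]
db = build_adjacency_db(titles)
print(sorted(db["leak"]))  # adjacent words of "leak" across both titles
```

Built over 500K titles, this yields the term-to-adjacent-words database used for the similarity computation.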
In the query reformulation task, we collect Top 5 results returned by the initial query.
Then, we collect the source code tokens from those top results, and consider them as the candidates for reformulation.
Now, the existing studies apply different strategies, such as TF-IDF or the Dice coefficient, to extract appropriate candidates from such tokens.
However, they are subject to the vocabulary mismatch issue.
What did we do?
We determine the semantic similarity between the initial query and those candidates using the adjacency lists developed from Stack Overflow questions.
We select the most semantically related code tokens as the appropriate candidates.
We also collect the words from Stack Overflow questions that most frequently co-occur with keywords from the initial query.
Once reformulation candidates are selected from both the source code and Stack Overflow, we combine them selectively and reformulate the initial query.
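The candidate-ranking step could look roughly like this; the Dice overlap and the summing over query terms are illustrative assumptions, not the paper's exact scoring:

```python
def dice(a, b):
    """Dice overlap of two adjacent-word sets (0 when either is empty)."""
    return 2 * len(a & b) / (len(a) + len(b)) if a and b else 0.0

def rank_candidates(query_terms, candidates, adjacency_db, top_k=5):
    """Rank candidate terms by their total adjacency-list similarity
    to the initial query terms."""
    scored = [(sum(dice(adjacency_db.get(c, set()),
                        adjacency_db.get(q, set())) for q in query_terms), c)
              for c in candidates]
    scored.sort(reverse=True)
    return [c for _, c in scored[:top_k]]

def reformulate(query_terms, code_tokens, so_tokens, adjacency_db, top_k=5):
    """Expand the initial query with the top-ranked candidates drawn
    from source code tokens and Stack Overflow co-occurring words."""
    candidates = (set(code_tokens) | set(so_tokens)) - set(query_terms)
    return list(query_terms) + rank_candidates(query_terms, candidates,
                                               adjacency_db, top_k)

# Tiny hypothetical adjacency database for illustration.
db = {"memory": {"leak", "java", "cause"},
      "heap":   {"leak", "java", "memory"},
      "button": {"click", "ui"}}
print(reformulate(["memory"], ["heap"], ["button"], db))
# -> ['memory', 'heap', 'button']: 'heap' ranks above the unrelated 'button'
```

Because scoring goes through the adjacency lists rather than exact token matching, a candidate can rank highly even when it shares no literal word with the query.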
Now let's take a look at an example.
Suppose this is the raw query, and this is the preprocessed version.
This query returns the first correct result at the 65th position, which is not very good.
Now, we first perform some reduction on the query and use that to collect reformulation candidates from both the source code and Stack Overflow.
Then, in the combination, we applied an ad-hoc strategy: we count the nominal (noun) terms among the candidates.
For example, in this case, the candidates from Stack Overflow tokens are selected for query expansion.
So, we append those candidates to the reduced version of the query, and this expanded query returns the first correct result at the top position.
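The nominal-term counting could be sketched as below; the `pos_tags` dictionary is a hypothetical stand-in for the output of a real part-of-speech tagger:

```python
def count_nominals(terms, pos_tags):
    """Count candidate terms tagged as nouns. `pos_tags` maps each
    term to a Penn-Treebank-style tag (hypothetical input here)."""
    return sum(1 for t in terms if pos_tags.get(t, "").startswith("NN"))

def select_expansion(code_candidates, so_candidates, pos_tags):
    """Ad-hoc combination: pick the candidate set containing more
    nominal (noun) terms for query expansion."""
    if count_nominals(so_candidates, pos_tags) >= count_nominals(code_candidates, pos_tags):
        return so_candidates
    return code_candidates

tags = {"leak": "NN", "memory": "NN", "iterate": "VB"}
print(select_expansion(["iterate"], ["leak", "memory"], tags))
# -> ['leak', 'memory']: the Stack Overflow candidates win here
```

In the walked-through example, this is how the Stack Overflow candidates end up being chosen over the source code candidates.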
In order to test our hypothesis, we conducted preliminary experiments with 510 queries from two subject systems.
In the experiment, we compared our reformulated query with the baseline query in terms of their returned results.
If our query returns the result at a better position than the baseline, then the reformulation is considered useful.
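The comparison criterion amounts to checking rank positions query by query; a minimal sketch (the rank lists here are hypothetical):

```python
def classify_reformulations(baseline_ranks, reformulated_ranks):
    """Compare the rank of the first relevant result before and after
    reformulation for each query; a lower rank is better."""
    counts = {"improved": 0, "worsened": 0, "preserved": 0}
    for base, ref in zip(baseline_ranks, reformulated_ranks):
        if ref < base:
            counts["improved"] += 1
        elif ref > base:
            counts["worsened"] += 1
        else:
            counts["preserved"] += 1
    return counts

# e.g., one query moved from the 65th position to the 1st,
# one got worse, and one stayed unchanged.
print(classify_reformulations([65, 10, 3], [1, 20, 3]))
# -> {'improved': 1, 'worsened': 1, 'preserved': 1}
```

The Improved/Worsened/Preserved percentages in the results table are these counts normalized over all 510 queries.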
We also compare with a baseline technique called Rocchio’s expansion.
These are the findings from our preliminary experiments.
We first determine the effect of preprocessing on the baseline queries, which is not much.
Then we examine the performance of the reformulation candidates from source code tokens and Stack Overflow tokens.
Both of them improve close to 50% of the queries, but also worsen an almost equal amount.
We also notice an interesting effect of the query reduction phase, which improves a significant number of queries.
However, when we combine both query reduction and query expansion using candidate tokens from the source code and Stack Overflow, the performance improves considerably.
At this point, the performance improvement might not be statistically significant, but this is what interested us in this approach.
We see that many of the improvement cases overlap among the three aspects of our proposed reformulation strategy.
However, there exists a significant number of cases where improvement is not possible unless all three aspects are considered properly.
For example, about 30% of the improvements come from only one of the three variants of QUICKAR.
So, we combine all three aspects for maximized performance.
We compared with the baseline technique and found that our performance is significantly higher.
Even when we compare an equivalent variant of our technique with the baseline, the performance is still higher.
This is because the existing techniques cannot survive the vocabulary mismatch problem.
But, our technique provides a way out of that by computing semantic similarity using adjacency list.
So, these are the take-home messages.
Developers often face difficulties in choosing appropriate query terms during concept location.
So, we provide automatic support in their query reformulation.
In our work, we apply semantic similarity rather than lexical similarity to suggest query reformulations, and we use Stack Overflow questions for that purpose.
Our experimental findings also show that the reformulated queries actually improve on the baseline queries, and also perform better than a baseline technique.
This is ongoing work, and we are still working with a larger dataset.
Thanks for your time.
Now, I am ready to take your questions.