SUPPORTING SOURCE CODE SEARCH WITH
CONTEXT-AWARE AND SEMANTICS-DRIVEN
QUERY REFORMULATION
Masud Rahman, PhD Candidate
Department of Computer Science
University of Saskatchewan, Canada
Advisor: Dr. Chanchal Roy
@masud2336
ABOUT ME
Masud Rahman, PhD Candidate, U of S
Software Research Lab
Me
MASUD RAHMAN: ACADEMICS
2019: PhD Candidate, University of Saskatchewan (Award: Dr. Keith Geddes Award)
2014: MSc, University of Saskatchewan (Award: Best MSc Thesis Nomination)
2009: BSc, Khulna University, Bangladesh (Award: President Gold Medal)
TALK OUTLINE
Part 1: Research Problem
Part 2: PhD Thesis
Part 3: Future Works
Part 4: Q&A + Discussions
Part 1: Research Problem
Story I
Software Bugs
Story II
Software Features
Story III
Software Bugs: Bug Resolution (20%)
Software Features: Feature Improvement (60%)
BUG REPORT & CHANGE REQUEST
BUG LOCALIZATION & CONCEPT LOCATION
Q1: How can we fix software bugs using the bug reports? (Bug Localization)
Q2: How can we add/improve features in the existing software? (Concept Location)
THREE TYPES OF CODE SEARCH
(1) Bug Localization
(2) Concept Location
(3) Internet-scale Code Search
SEARCH FOR THE BUGGY CODE
[Figure: software customer, bug report, software developer, code search with query reformulation, software codebase]
* Kevic & Fritz, ICSE 2014
SEARCH FOR THE RELEVANT CODE
[Figure: developer, query reformulation, reformulated query, Internet-scale codebase]
* Bajracharya and Lopes, EMSE 2012
PART 1: SUMMARY
(1) Bug Localization (2) Concept Location
(3) Internet-scale Code Search
Part 2: PhD Thesis
SYSTEMATIC LITERATURE REVIEW
Sources: ACM DL, CrossRef, DBLP, Mendeley, Google Scholar, IEEE Xplore, ProQuest, ScienceDirect, SpringerLink, Web of Science, Wiley Online Lib
Selection steps: Initial results, Impurity removal, Filter by Title, Filter by Abstract, Merging & Duplicate removal, Filter by Full texts, Primary studies
Counts shown along the steps: 2871, 2317, 562, 195, 93, 56
Query reformulation, query expansion, query reduction, query formulation,
query refinement, automated query expansion, AQE, query suggestion,
query recommendation, term selection, query replacement, query difficulty,
query quality, keyword selection, keyword extraction, search term
identification, search query, search term, and search keyword.
PHD THESIS OVERVIEW
(1) Concept Location: S1 (STRICT), S2 (ACER)
(2) Bug Localization: S3 (BLIZZARD), S4 (BLADER)
(3) Internet-scale Code Search: S5 (RACK), S6 (NLP2API)
STRICT
Search Query Reformulation for Concept Location
using Graph-based Term Weighting
[SANER 2015 + 2017]
QUIZ TEST: QUERIES FROM A CHANGE REQUEST
[Change request: Title, Description]
ID | Query | QE
1 | Custom search results view iresource | 1331
2 | Custom search results search results view | 636
3 | element iresource provider level tree | 01
4 | Custom search results hierarchically java search results | 570
~11K documents
TF-IDF: TERM IMPORTANCE (TRADITIONAL)
University of Saskatchewan
The Saskatchewan Huskies football team
represents the University of Saskatchewan
in U Sports football that competes in the
Canada West Universities Athletic
Association conference of U Sports. The
program has won the Vanier Cup national
championship three times, in 1990, 1996
and 1998.
The Saskatchewan Huskies
became only the second U Sports team to
advance to three consecutive Vanier Cup
games, after the Saint Mary's Huskies, but
lost all three games from 2004-2006. The
team has won the most Hardy Trophy
titles in Canada West, having won a total
of 20 times. The 2006 Saskatchewan
Huskies became only the third team to
play in a Vanier Cup that their school was
hosting, when the University of
Saskatchewan hosted the 42nd Vanier
Cup. The Toronto Varsity Blues were the
first when they won two Vanier Cups in
1965 and 1993. Saskatchewan also
became the first western school to host
the national championship game.
Term | TF | IDF | TF x IDF
Saskatchewan | 6 | 0.01 | 0.06
Vanier | 5 | 0.1 | 0.5
Won | 4 | 0.1 | 0.4
Huskies | 4 | 0.1 | 0.4
Cup | 4 | 0.01 | 0.04
Team | 4 | 0.01 | 0.04
Sports | 3 | 0.02 | 0.06
Times | 2 | 0.03 | 0.06
School | 2 | 0.05 | 0.1
Championship | 2 | 0.03 | 0.06
IDF = log (N / DF), where N is the number of documents and DF is the number of documents that contain the term
Saskatchewan Huskies
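The TF, IDF, and TF x IDF columns of this slide can be sketched in a few lines of Python. This is only an illustration: the three-document toy corpus and the function name `tf_idf` are assumptions, and the natural logarithm is one of several common choices for IDF.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, using TF x log(N / DF)."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # DF: number of documents that contain each term at least once.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical toy corpus, not the data behind the slide.
corpus = [
    "huskies won the vanier cup",
    "the team won the game",
    "the school hosted the cup",
]
scores = tf_idf(corpus)
```

A term that occurs in every document (here "the") gets a weight of zero, which is why frequent but uninformative terms drop to the bottom of the TF x IDF column.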
TEXTRANK: TERM IMPORTANCE USING CO-OCCURRENCES
(MIHALCEA ET AL., EMNLP 2004)
IResource … IJavaElement (Term Co-occurrence)
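Term co-occurrence can be turned into a graph by linking terms that appear close to each other in the text. The sketch below is a minimal illustration, assuming a sliding window of two consecutive tokens; the function name `cooccurrence_graph` and the token list are hypothetical.

```python
from collections import defaultdict

def cooccurrence_graph(tokens, window=2):
    """Link terms that fall inside the same sliding window of `window` tokens."""
    graph = defaultdict(set)
    for i, term in enumerate(tokens):
        # Tokens that share a window starting at position i.
        for other in tokens[i + 1 : i + window]:
            if other != term:
                graph[term].add(other)
                graph[other].add(term)
    return graph

# Hypothetical token sequence for illustration.
tokens = ["custom", "search", "results", "view", "iresource"]
graph = cooccurrence_graph(tokens, window=2)
```

The resulting adjacency sets form an undirected graph, which is the kind of input a graph-based ranking algorithm can score.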
POSRANK: TERM IMPORTANCE USING SYNTACTIC
DEPENDENCE (BLANCO & LIOMA, INF. RETR. 2012)
Jespersen Rank Theory: Noun, Verb, Adjective
Element … reported, element … plain (Syntactic Dependence)
STRICT: QUERY KEYWORD SELECTION WITH
PAGERANK (BRIN & PAGE, 1998)
$$S(v_i) = (1 - \phi) + \phi \sum_{j \in In(v_i)} \frac{S(v_j)}{|Out(v_j)|}, \quad 0 \le \phi \le 1$$
• Element
• Iresource
• Provider
• Level
• Tree
Candidate Query 1
Candidate Query 2
PageRank Algorithm
Best Query
[Figure: term graphs over the keywords K1 to K5]
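A PageRank-style score over a term graph can be computed iteratively: each term starts with the same score and repeatedly receives a share of its neighbours' scores. The sketch below is not the STRICT implementation; the damping value of 0.85, the convergence tolerance, and the small graph are assumptions for illustration.

```python
def pagerank(graph, damping=0.85, iterations=100, tol=1e-6):
    """Score nodes with S(v_i) = (1 - d) + d * sum(S(v_j) / |Out(v_j)|)."""
    scores = {node: 1.0 for node in graph}
    for _ in range(iterations):
        new_scores = {}
        for node in graph:
            # In an undirected term graph, In(v) and Out(v) are both the neighbours.
            rank = sum(scores[nb] / len(graph[nb]) for nb in graph[node])
            new_scores[node] = (1 - damping) + damping * rank
        converged = max(abs(new_scores[n] - scores[n]) for n in graph) < tol
        scores = new_scores
        if converged:
            break
    return scores

# Hypothetical undirected term graph given as adjacency sets.
graph = {
    "element": {"iresource", "provider", "tree"},
    "iresource": {"element", "provider"},
    "provider": {"element", "iresource", "level"},
    "level": {"provider"},
    "tree": {"element"},
}
scores = pagerank(graph)
top_terms = sorted(scores, key=scores.get, reverse=True)[:3]
```

Well-connected terms end up with the highest scores, so taking the top few terms gives a candidate search query.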
EXPERIMENT, DATASET & METRICS
~3K Change Requests, Version History, Ground Truth
1. Hit@K
2. MAP@K
3. MRR@K
4. QE
7 RQs
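The metrics listed here can be sketched for a single query as follows. This is an illustration under common definitions, which may differ in detail from the ones used in the thesis: QE is taken as the rank of the first correct result, and average precision is averaged over the correct results found in the top K. MAP@K and MRR@K are then the means of the per-query values. The file names are hypothetical.

```python
def query_effectiveness(ranked, relevant):
    """QE: 1-based rank of the first correct result, or None if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return rank
    return None

def hit_at_k(ranked, relevant, k):
    """Hit@K: True if at least one correct result appears in the top K."""
    return any(doc in relevant for doc in ranked[:k])

def reciprocal_rank_at_k(ranked, relevant, k):
    """Reciprocal rank of the first correct result within the top K (0 if absent)."""
    qe = query_effectiveness(ranked[:k], relevant)
    return 1.0 / qe if qe else 0.0

def average_precision_at_k(ranked, relevant, k):
    """Mean of the precision values at each correct result within the top K."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Hypothetical ranked result list and ground truth.
ranked = ["A.java", "B.java", "C.java", "D.java"]
relevant = {"B.java", "D.java"}
```

A lower QE means the developer inspects fewer results before reaching a correct one.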
IMPROVED VS. WORSE QUERY
[Figure: position of the correct result for the baseline query (Title + Description), a worse query, and an improved query]
COMPARISON WITH THE STATE-OF-THE-ART
MY PHD THESIS STORY
(1) Concept Location: S1 (STRICT), S2 (ACER)
(2) Bug Localization: S3 (BLIZZARD), S4 (BLADER)
(3) Internet-scale Code Search: S5 (RACK), S6 (NLP2API)
BLIZZARD
Search Query Reformulation for Bug Localization
using Report Quality Dynamics & Graph-based
Term Weighting
[ESEC/FSE 2018]
BUG REPORT QUALITY: A CLOSER LOOK
5000+
ONE SIZE DOES NOT FIT ALL
Traditional Idea vs. Proposed Idea
[Figure: "Can everybody watch the game?" with one "No" and five "Yes" answers across the two ideas]
QUALITY-AWARE QUERY REFORMULATION
Bug report classes: Noisy, Poor, Rich
Equality vs. Equity
STEP I: REFORMULATING NOISY BUG REPORT
[Figure: entries i and j]
STEP II: REFORMULATING NOISY BUG REPORT
[Figure: entry i decomposed into Pi, Ci, Mi and entry j into Pj, Cj, Mj, with static and hierarchical relationships]
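One way to read the entries in these figures is as stack trace frames, with P, C, and M standing for package, class, and method. That reading is an assumption, not something the slide text states. Under it, a minimal sketch of extracting such triples from a Java stack trace could look like this; the regular expression, the function name `parse_trace`, and the sample report are all hypothetical.

```python
import re

# Matches frames such as "at org.example.util.Parser.parse(Parser.java:42)".
FRAME = re.compile(r"at\s+((?:\w+\.)*)(\w+)\.([\w$<>]+)\(")

def parse_trace(report_text):
    """Return (package, class, method) tuples for each Java stack frame found."""
    entries = []
    for match in FRAME.finditer(report_text):
        package = match.group(1).rstrip(".")
        entries.append((package, match.group(2), match.group(3)))
    return entries

# Hypothetical report text for illustration.
report = """java.lang.NullPointerException
    at org.example.core.CastHelper.cast(CastHelper.java:31)
    at org.example.ui.Editor.run(Editor.java:7)"""
entries = parse_trace(report)
```

The class and method names recovered this way could then serve as nodes of a graph, in the same spirit as the term graphs used elsewhere in the talk.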
STEP III: REFORMULATING NOISY BUG REPORT
$$S(V_i) = (1 - \phi) + \phi \sum_{j \in In(V_i)} \frac{S(V_j)}{|Out(V_j)|}, \quad 0 \le \phi \le 1$$
[Figure: graph over the nodes Ci, Cj, Mk, Mn, Cp]
SEARCH QUERY FROM NOISY BUG REPORT
Bug 31637 – should be able to cast null
NullPointerException
Ci Cj Mk Mn Cp
53 01
SEARCH QUERY FOR POOR BUG REPORT
Poor Bug Report
compliance create preference add configuration field dialog annotation
30 01
SEARCH QUERY FROM RICH BUG REPORT
Rich Bug Report
astvisitor post postvisit previsit pre file post pre astnode visitor
27 01
EXPERIMENT, DATASET & METRICS
5K+ Bug reports, Version History, Ground Truth
1. Hit@K
2. MAP@K
3. MRR@K
4. QE
4 RQs
COMPARISON WITH THE STATE-OF-THE-ART
MY PHD THESIS STORY
(1) Concept Location: S1 (STRICT), S2 (ACER)
(2) Bug Localization: S3 (BLIZZARD), S4 (BLADER)
(3) Internet-scale Code Search: S5 (RACK), S6 (NLP2API)
BLADER
Semantics-Driven Query Reformulation for
IR-Based Bug Localization
CHALLENGES OF POOR BUG REPORTS
Technique | Query | QE
Baseline | { title } | 30
Baseline | { title + description } | 12
STRICT | { IRC spam channel Bug entered rid join messages clients entry } | 26
Rocchio | { title + description } + { remoteserviceadminevent admin service feed remote synd mask writer event export } | 10
STEPS OF BLADER
(1) Semantic Hyperspace
(2) Clustering Tendency Analysis
(3) Query Reformulation
STEP I: CONSTRUCTION OF SEMANTIC
HYPERSPACE USING STACK OVERFLOW
Stack Overflow corpus
Data preprocessing
Neural Text classifier
FastText model (skip-gram)
STEP I: SEMANTIC HYPERSPACE EXPLAINED
Coffee P (1, 5, 6, 7, …, N)
Tea P (2, 4, 6, 9, …, N)
Pasta P (7, 9, 0, 1, …, N)
Semantic distance = Cosine distance
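Treating each word as a point in the hyperspace, the cosine distance between two word vectors can be computed as below. The sketch is illustrative only: it uses the first four coordinates shown on the slide as stand-ins for full embeddings, and the function names are assumptions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """Cosine distance: 0 for vectors pointing the same way, larger when they diverge."""
    return 1.0 - cosine_similarity(u, v)

# Truncated, illustrative vectors; real embeddings have many more dimensions.
coffee = [1.0, 5.0, 6.0, 7.0]
tea = [2.0, 4.0, 6.0, 9.0]
pasta = [7.0, 9.0, 0.0, 1.0]
```

With these toy vectors, `cosine_distance(coffee, tea)` is smaller than `cosine_distance(coffee, pasta)`, matching the intuition that coffee is semantically closer to tea than to pasta.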
STEP II: CLUSTERING TENDENCY ANALYSIS
[Figure: query Q and candidate clusters C1 and C2 over the words channel, join, spam, entered, connect, invitation, message, room, chat, handle, mask, remote, synd, admin]
C1 is better than C2
Hopkins Statistic (HS)
Polygon Area (PA)
STEP III: QUERY REFORMULATION
Candidate queries: Ref. candidate (HS), Ref. candidate (PA), Ref. candidate (baseline)
Data re-sampling
Machine learning (Ensemble learning)
Selection of the best reformulation
Reformulated query
Source code
BLADER QUERY PERFORMANCE
Technique | Reformulated Query | QE
Baseline | { Title + Description } | 12
STRICT | { IRC spam channel Bug entered rid join messages clients entry } | 26
Rocchio | { Title + Description } + { remoteserviceadminevent admin service feed remote synd mask writer event export } | 10
BLADER | { Title + Description } + { connect invitation handle message room chat user send } | 03
Lower QE is better
MY PHD THESIS STORY
(1) Concept Location: S1 (STRICT), S2 (ACER)
(2) Bug Localization: S3 (BLIZZARD), S4 (BLADER)
(3) Internet-scale Code Search: S5 (RACK), S6 (NLP2API)
RACK
Query Reformulation for Internet-scale Code
Search using Crowdsourced Knowledge
[SANER 2016, ICSE 2017, EMSE 2019]
CHALLENGES OF CODE SEARCH ON WEB
Convert image to gray scale without losing transparency
BufferedImage Grayscale ImageEdit ColorConvertOp File Transparency ColorSpace BufferedImageOp Graphics ImageEffects
WHAT IS CROWD KNOWLEDGE?
STEPS OF RACK
(1) Keyword-API Mapping Database
(2) API Candidate Ranking
(3) Query Reformulation
STEP I: KEYWORD-API MAPPING DATABASE
Question title → Preprocessing → NL keywords
Accepted answer → Code segment extraction → API parsing → API classes
NL keywords + API classes → Keyword-API linking → Keyword-API Mapping database
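The construction of the mapping database can be sketched as follows: keywords come from the question title, API classes come from the code in the accepted answer, and each keyword is linked to every API class it co-occurs with. This is a simplified illustration, not the RACK implementation; the stop-word list, the regular-expression "API parsing", and the sample thread are assumptions.

```python
import re
from collections import defaultdict

STOP_WORDS = {"how", "to", "in", "a", "the", "of", "with"}

def keywords(title):
    """Lower-case the question title and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", title.lower()) if w not in STOP_WORDS]

def api_classes(code):
    """Approximate API parsing: identifiers that start with a capital letter."""
    return set(re.findall(r"\b[A-Z][A-Za-z0-9]*\b", code))

def build_mapping(threads):
    """Count, for each title keyword, the API classes seen in accepted answers."""
    mapping = defaultdict(lambda: defaultdict(int))
    for title, code in threads:
        apis = api_classes(code)
        for kw in keywords(title):
            for api in apis:
                mapping[kw][api] += 1
    return mapping

# Hypothetical (question title, accepted-answer code) pair.
threads = [
    ("How to parse HTML in Java?",
     "Document doc = Jsoup.parse(html); Element body = doc.body();"),
]
mapping = build_mapping(threads)
```

A real pipeline would use a proper parser for the code segments; the regular expression here is only a stand-in to keep the example self-contained.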
STEP II: API RELEVANCE RANKING
How to parse HTML in Java?
Keywords: parse, HTML, Java
API classes: Element, Document, Jsoup
Keyword-API Co-occurrence
Keyword Pair-API Co-occurrence
Keyword-Keyword Coherence
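Of the three heuristics listed, the first (Keyword-API Co-occurrence) is the simplest to sketch: each query keyword votes for the API classes it has co-occurred with, and the votes are summed. The code below models only that heuristic, with made-up co-occurrence counts; the other two heuristics and the actual RACK scoring are not reproduced here.

```python
from collections import Counter

# Hypothetical keyword-to-API co-occurrence counts.
MAPPING = {
    "parse": {"Jsoup": 8, "Document": 6, "SimpleDateFormat": 5},
    "html": {"Jsoup": 9, "Document": 7, "Element": 6},
    "java": {"File": 3, "Document": 2},
}

def rank_apis(query_keywords, mapping, top_n=3):
    """Rank API classes by their summed, per-keyword normalised co-occurrence."""
    scores = Counter()
    for kw in query_keywords:
        counts = mapping.get(kw, {})
        total = sum(counts.values())
        for api, count in counts.items():
            scores[api] += count / total
    return [api for api, _ in scores.most_common(top_n)]

def reformulate(query_keywords, mapping, top_n=3):
    """Append the top-ranked API classes to the original keywords."""
    return list(query_keywords) + rank_apis(query_keywords, mapping, top_n)

query = ["parse", "html", "java"]
ranked = rank_apis(query, MAPPING)
```

Calling `reformulate(query, MAPPING)` returns the original keywords followed by the top-ranked API classes, which mirrors the "{keywords} + {API classes}" shape of a reformulated query.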
STEP III: SEARCH QUERY REFORMULATION
HTML parser in Java
Technique | Reformulated Query
Baseline | HTML parser Java
RACK | { HTML parser Java } + { Document Element File IOException Jsoup }
EXPERIMENT: DATASET COLLECTION
Java2s
175 Queries & Ground truth
769K Code segments
1. Hit@K
2. MAP@K
3. MRR@K
4. QE
5. MR@K
6. IDCG
COMPARISON WITH BASELINE
PART 2: SUMMARY
(1) Bug Localization: BLIZZARD, BLADER
(2) Concept Location: STRICT, ACER
(3) Internet-scale Code Search: RACK, NLP2API
76%-88%
BUGDOCTOR: LIVE DEMO
REPRODUCIBILITY & REPLICATION
https://github.com/masud-technope/BugDoctor
Part 3: Future Works
T1: BUG REPRODUCIBILITY & BUG FIXING
Bug Localization Bug Understanding Bug Fixing
Bug Reproduction
T2: IMPROVED TEXT RETRIEVAL IN
SOFTWARE ENGINEERING
$$S(v_i) = (1 - \phi) + \phi \sum_{j \in In(v_i)} \frac{S(v_j)}{|Out(v_j)|}, \quad 0 \le \phi \le 1$$
[Equation: TF-IDF term weight]
20+ SE
T3: GENETIC ALGORITHM FOR QUERIES
[Bug report: Title, Description]
Method | Search Query | QE
Baseline | {title + description} | 25
STRICT[140] | {tab classpath enabled buttons user entry} | 86
TF-IDF | {button entry bootstrap enabled incorrectly moving} | 177
GA | {open reflect tab bottom entry classpath} | 01
Lower QE is better
T4: PROGRAMMER BOT
[Figure: Developer + ProBot, best code example]
RESEARCH COLLABORATIONS
Chanchal Roy, David Lo, Raula Kula, Iman Keivanloo, Jason Collins,
Jesse Redl, A. S. M. Arif, Shamima Yeasmin, Amit Mondal, Saikat Mondal, Rodrigo Silva
MAJOR RESEARCH ACHIEVEMENTS
DPA Nomination (ICSME 2018)
President Gold Medal, Bangladesh
Keith Geddes Award 2017, U of S
ACM CAPS Award 2017
Flagship SE venues
Google Scholar: 29/301, H-Index: 9
http://www.usask.ca/~masud.rahman
https://github.com/masud-technope
Contact: masud.rahman@usask.ca
@masud2336
Masud Rahman
Part 4: Q&A
TAKE-HOME MESSAGES
RQ1
RQ2 RQ3
TF-IDF
PageRank
Equality
Equity
Stack Overflow
FastText
WordNet
Thesis
EXPERIMENT, DATASET & METRICS
Java2s
CodeJava
310 Queries & Ground truth
769K Code segments
Hit@K
MAP@K
MRR@K
MR@K
QE
NDCG
WHAT IS A GOOD SEARCH QUERY?
[Figure: position of the correct result for the baseline query (Title + Description), a worse query, and a better query]
SEMANTIC HYPERSPACE
x P (1, 5, 6, 7, …, N)
y P (2, 4, 6, 9, …, N)
y = mx + c,
x^2 +y^2 = r^2
ax^2+bx+c=0
TWO WORKING CONTEXTS: LOCAL & GLOBAL
Local code search (e.g., bug localization): Boeing codebase
Internet-scale code search: GitHub
S2: KEYWORDS SELECTION FROM SOURCE
CODE WITH CODERANK
resolveRuntimeClasspathEntry
Resolve Runtime Classpath Entry
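Splitting a camelCase identifier into its words is a common first step before weighting source code terms. A minimal sketch, assuming a regular-expression split (the function name `split_identifier` is hypothetical and the slide does not specify how the splitting is done):

```python
import re

def split_identifier(identifier):
    """Split a camelCase or PascalCase identifier into its component words."""
    # Acronym runs, capitalised or lower-case words, then digit runs.
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)

words = split_identifier("resolveRuntimeClasspathEntry")
```

Here `words` holds the four parts of the identifier; acronym runs such as the "HTML" in "parseHTMLFile" are kept together as one word.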
$$S(v_i) = (1 - \phi) + \phi \sum_{j \in In(v_i)} \frac{S(v_j)}{|Out(v_j)|}, \quad 0 \le \phi \le 1$$
RQ1 [Source Code]: Keywords selected by PageRank
are more effective for local code searches (e.g., concept
location) than those selected by TF-IDF
HOW DID WE DO?
RQ3: Appropriate query keywords can be delivered for the
code search using Stack Overflow and FastText.
R3: SOLVE VOCABULARY MISMATCH ISSUE
[Figure: customer, developer, past developer, bug report, codebase]
SOLUTION: SEMANTIC HYPERSPACE
Word 1 P (1, 5, 6, 7, …, N)
Word 2 P (2, 4, 6, 9, …, N)
Cosine distance = Semantic relevance
R4: GENETIC ALGORITHM FOR QUERIES
[Bug report: Title, Description]
Method | Search Query | QE
Baseline | {title + description} | 25
STRICT[140] | {tab classpath enabled buttons user entry} | 86
TF-IDF | {button entry bootstrap enabled incorrectly moving} | 177
GA | {open reflect tab bottom entry classpath} | 01
Lower QE is better
SEARCH QUERY FROM NOISY BUG REPORT
Bug 31637 – should be able to cast null
NullPointerException
Ci Cj Mk Mn Cp
53 01
DICE, ROCCHIO, RSV
VOCABULARY MISMATCH PROBLEM
Boeing Customer vs. Boeing Developer
Both are correct and wrong!
KEYWORDS FROM A BUG REPORT
[Bug report: Title, Description]
ID | Query | QE
1 | Custom search results view iresource | 1331
2 | Custom search results search results view | 636
3 | element iresource provider level tree | 01
4 | Custom search results hierarchically java search results | 570
Lower QE is better
PROBABILISTIC TERM WEIGHTING
KLD

PhD Seminar - Masud Rahman, University of Saskatchewan

Editor's notes

  1. Hello everyone! Good afternoon! Thanks for attending this meeting. My name is Masud Rahman. I am a PhD Candidate from the Software Research Lab. I work with Dr. Chanchal K. Roy. Today, I will be talking about automated query reformulations for code search.
  2. So, who am I? I came to Canada back in 2012 as a young graduate student. This was my first winter. I have these two lovely young ladies by my side 24/7. I am a member of the Software Research Lab, University of Saskatchewan. I work with Dr. Chanchal Roy. We have a pretty big group there, and we do a lot of things besides research.
  3. A little bit of background about me: Currently, I am a PhD Candidate at USASK. I completed my MSc in Software Engineering at the same university in 2014. Before that, I completed my BSc in Computer Science & Engineering at Khulna University, back in 2009.
  4. Today, my talk will be divided into four sections. In the first section, I will discuss the research problem I am trying to solve in my PhD. In the second section, I will discuss my PhD thesis proposals to solve that research problem. In the third section, I will summarize my PhD contributions. Finally, we will have a Q&A session and interesting discussions.
  5. Part 1: Research Problem
  6. Now, let's say a Boeing customer has submitted a bug report. A Boeing developer is then responsible for locating and repairing the faulty code that triggers the bug. As a frequent practice, the developer chooses a few important keywords and attempts to locate the buggy code within the Boeing codebase. But the study shows that 88% of the keywords chosen by the developer could be incorrect. That is, they do not return the buggy code. So, the obvious next step is to reformulate the query with automated tool support, so that the buggy code can be located. There are also tools that take a bug report and suggest appropriate search queries in the first place. We are interested in this part of the process, and my PhD focuses on it. As you can also see, Google does not have any jurisdiction in this case.
  7. So far, we have dealt with query reformulation for a single codebase. But besides this local codebase, developers also look for source code in Internet-scale, cross-domain codebases. Now, that is a whole new game. The study shows that developers might fail 88% of the time to retrieve the correct code segments. So, we did two other studies, and we extensively used Stack Overflow.
  8. Now, we are done with the background concepts, Part 1. Next, we move on to Part 2: PhD Thesis.
  9. So, what did we do? We did a systematic literature survey using 56 primary studies on query reformulation for code search. During this study, we found 3 major issues in the literature.
  10. Let us see an example. This is a bug report, this is the title and this is the description. Now, developer JOE would use this bug report to localize the bug in the source code. He chose some ad hoc queries. Which one do you think is the best here? PAUSE! Well, let's see. This one returns the correct result at this position. That means the developer needs to check 1300+ results before reaching the correct result. Then he tries this query ... oh ... this one is the best. So, selecting appropriate keywords from the bug report is not that simple.
  11. Similarly, we can see the phrases and dependencies among the terms in the bug report texts as well. Our job is to identify the keywords from these texts, right? So, what did we do? We consider the co-occurrences among the terms. That is, how terms occur with other terms within a certain context. We encode such co-occurrences as edges, and transform the texts into a graph like this.
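The co-occurrence idea in this note can be sketched in a few lines. This is only an illustration under assumptions, not the tool from the thesis: the stop-word list, the window size of 2, and the names `tokenize` and `build_cooccurrence_graph` are all made up for the example.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list for illustration; a real list would be much longer.
STOP_WORDS = {"the", "a", "an", "is", "in", "of", "to", "and", "when", "not"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def build_cooccurrence_graph(text, window=2):
    """Return an undirected graph as {term: set of neighbouring terms}.

    Two terms are linked when they fall inside the same sliding window of
    `window` consecutive tokens (window=2 links adjacent tokens only).
    """
    tokens = tokenize(text)
    graph = defaultdict(set)
    for i, term in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != term:
                graph[term].add(other)
                graph[other].add(term)
    return dict(graph)


graph = build_cooccurrence_graph(
    "Custom search results view is not refreshed in the search view"
)
```

In this toy sentence, "search" ends up linked to four different terms, which is exactly the kind of connectivity a graph-based ranking can exploit.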
  12. Besides term co-occurrences, we consider another aspect called syntactic dependencies. For this, we used Jespersen's Rank Theory, a theory developed back in 1925. According to this theory, the parts of speech of a sentence can be divided into three ranks: nouns in the first rank, verbs and adjectives in the second rank, and the rest in the third rank. According to Jespersen, verbs and adjectives modify nouns. That is, there are syntactic dependencies between "element" and "reported", and between "element" and "plain", that convey the overall meaning of the sentence. We capture such syntactic dependencies as well, and transform the report texts into a POS graph.
  13. So, we have created two graphs, right? Now, we have two graphs developed from the bug report based on two different dimensions: word co-occurrence and syntactic dependence. Once we have the graphs, we apply this famous algorithm called PageRank. This is the backbone of Google search. Now, the algorithmic details are a bit complex, but I will try to provide an overview here. Why do you think this guy is laughing? Because he is getting the maximum votes. Similarly, in the graph, the node that is connected to most of the nodes is the winner. That is, a term's importance is determined by its connectivity with other nodes. More importantly, since this is a recursive algorithm, the importance depends on the weights of the connected nodes as well. Once the computation is done, we get a reformulation candidate from each graph. What is a reformulation candidate? It is a ranked list of keywords like this. So, we collect two candidates from the two graphs, apply machine learning, and suggest the best one as our suggested query from the bug report.
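To make the voting intuition concrete, here is a minimal PageRank sketch over an undirected term graph. It assumes a symmetric graph with no isolated nodes, a damping factor of 0.85, and a fixed number of iterations; the toy graph and the function names are illustrative only, and the machine learning step that picks between the two candidates is not covered.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Score the nodes of an undirected graph given as {node: set of neighbours}.

    Each node repeatedly shares its current score equally among its
    neighbours, so a term's score depends on the scores of the terms
    connected to it (the recursive part of the algorithm).
    """
    nodes = list(graph)
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_score = {}
        for n in nodes:
            incoming = sum(score[m] / len(graph[m]) for m in graph[n])
            new_score[n] = (1 - damping) / len(nodes) + damping * incoming
        score = new_score
    return score


def top_keywords(graph, k=3):
    """Return the k highest-scored terms as a ranked keyword list."""
    score = pagerank(graph)
    return sorted(score, key=lambda n: (-score[n], n))[:k]


# Toy term graph (hypothetical): "search" is connected to every other term.
toy_graph = {
    "search": {"custom", "results", "view", "refreshed"},
    "custom": {"search"},
    "results": {"search", "view"},
    "view": {"search", "results"},
    "refreshed": {"search"},
}
candidate = top_keywords(toy_graph, k=3)
```

The ranked list returned by `top_keywords` plays the role of one reformulation candidate; running it on each of the two graphs would give the two candidates mentioned above.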
  14. Now, for the experiments, we chose 8 subject systems from Apache and Eclipse. We collect about 3000 bug reports, and try to map them to the version control history at GitHub. Through such mapping, we extract the ground truth for the bug reports. This is a standard process followed in the existing literature.
  15. Now let me explain the metrics a bit, since we will be using these a lot. Hit@K is the percentage of the queries for which at least one ground truth is found within the top K results. MAP combines the standard precision with the positions of the results. The details are more complex, and I can discuss them later. MRR is the inverse of the rank of the first ground truth within the results. QE, which stands for query effectiveness, is just the opposite of MRR: it uses the rank itself rather than its inverse, so a lower value is better.
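The metrics above can be written down as a short sketch. The function names and the toy data are assumptions for illustration. Average precision is divided here by the total number of ground-truth items, which is one common convention; other definitions exist.

```python
def first_hit_rank(results, truth):
    """QE-style value: 1-based rank of the first ground-truth item, or None."""
    for rank, item in enumerate(results, start=1):
        if item in truth:
            return rank
    return None


def hit_at_k(runs, k):
    """Percentage of queries with at least one ground truth in the top k."""
    hits = sum(1 for results, truth in runs if any(r in truth for r in results[:k]))
    return 100.0 * hits / len(runs)


def mean_reciprocal_rank(runs):
    """Mean of 1/rank of the first ground truth (0 when nothing is found)."""
    total = 0.0
    for results, truth in runs:
        rank = first_hit_rank(results, truth)
        total += 1.0 / rank if rank else 0.0
    return total / len(runs)


def average_precision(results, truth):
    """Precision measured at each ground-truth position, then averaged."""
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(results, start=1):
        if item in truth:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(truth) if truth else 0.0


def mean_average_precision(runs):
    return sum(average_precision(r, t) for r, t in runs) / len(runs)


# Each run pairs a ranked result list with its ground truth (hypothetical files).
runs = [
    (["A.java", "B.java", "C.java"], {"B.java"}),
    (["D.java", "E.java", "F.java"], {"D.java", "F.java"}),
    (["G.java", "H.java"], {"Z.java"}),
]
```

On this toy data, two of the three queries hit within the top 3, and the first query has a first-hit rank of 2, so its reciprocal rank is 1/2.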
  16. We did an empirical study with 5K+ bug reports in our ICSE poster. And we discovered that bug reports could be very different in terms of quality. There could be different types of bug reports. A report could be noisy with stack traces, which is 16%. It could be really poor and contain no structured entities, which is 30%. Or it could be a rich bug report that includes source code, test cases and other stuff, which is 54%.
  17. So, clearly there are different quality levels for bug reports. Now, what do the existing approaches do? They do not look at the bug report or its quality. Rather, they apply the same treatment to all. But one size does not fit all, as you know. So, what do we do? We propose a much more balanced way to deal with this. We prefer equity rather than equality in our treatment of bug reports. So, here comes our work!
  18. So, first we take a bug report as input. Then we apply regular expressions to identify the structured components. We then classify it as (a) a noisy report containing stack traces, (b) a poor bug report containing only regular texts, or (c) a rich bug report containing source code and texts. Once the quality level is identified, what's the next step? Well, we do query reformulation, unlike the earlier studies. We separate the signal from the noise in a noisy report, and feed the poor bug report with appropriate keywords. We mostly keep the rich bug report as is. So, that is the equity approach.
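A rough sketch of this classification step is shown below. The two regular expressions are assumptions made up for illustration (a Java-style stack trace line, and camelCase or call-like tokens as a stand-in for program entities); the actual rules used in the work are not given in these notes.

```python
import re

# Hypothetical patterns for illustration only.
# A Java-style stack trace line, e.g. "at org.foo.Bar.baz(Bar.java:42)".
STACK_TRACE = re.compile(r"^\s*at\s+[\w$.]+\.[\w$<>]+\([^)]*\)", re.MULTILINE)
# A camelCase identifier or a call such as "refresh()".
PROGRAM_ENTITY = re.compile(r"\b[a-z]+[A-Z]\w*|\b\w+\(\)")


def classify_report(text):
    """Label a bug report as 'noisy', 'rich' or 'poor'."""
    if STACK_TRACE.search(text):
        return "noisy"   # contains stack traces
    if PROGRAM_ENTITY.search(text):
        return "rich"    # contains program entities along with text
    return "poor"        # regular text only


label = classify_report("The search view is not refreshed after I change the filter.")
```

Checking for stack traces first matters, because a trace line would also match the program-entity pattern.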
  19. Now, let's go a bit deeper with this. From a noisy bug report, we first extract the stack traces. It might contain hundreds of traces like this. We choose any two consecutive traces. Let's call them I and J.
  20. Now, each trace will contain three pieces of information: a package name, a class name and a method name. We know that, in a single line, the method and the class are statically connected for sure. However, classes and methods are also hierarchically dependent across trace lines due to caller-callee relationships. We capture such static and hierarchical relationships from consecutive trace lines, and develop a trace graph like this.
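The trace graph can be sketched as follows, assuming Java-style trace lines. The edge directions are an assumption for illustration: class and method on the same line are linked both ways (static relationship), and the caller on the lower line points to the callee on the line above it (hierarchical relationship). The regular expression and function names are hypothetical.

```python
import re
from collections import defaultdict

# Splits "at pkg.sub.Class.method(File.java:N)" into package, class and method.
TRACE_LINE = re.compile(r"\s*at\s+((?:[\w$]+\.)*)([\w$]+)\.([\w$<>]+)\(")


def parse_trace(line):
    """Return (package, class, method) for a trace line, or None."""
    match = TRACE_LINE.match(line)
    if match is None:
        return None
    return match.group(1).rstrip("."), match.group(2), match.group(3)


def build_trace_graph(trace_lines):
    """Build a directed graph {node: set of successors} from trace lines."""
    entries = [e for e in map(parse_trace, trace_lines) if e is not None]
    graph = defaultdict(set)
    # Static relationship: class and method on the same line.
    for _, cls, method in entries:
        graph[cls].add(method)
        graph[method].add(cls)
    # Hierarchical relationship: in a Java trace, each line is called by
    # the line below it, so the lower line is the caller.
    for callee, caller in zip(entries, entries[1:]):
        if caller[1] != callee[1]:
            graph[caller[1]].add(callee[1])
        if caller[2] != callee[2]:
            graph[caller[2]].add(callee[2])
    return dict(graph)


trace = [
    "    at org.eclipse.ui.View.refresh(View.java:42)",
    "    at org.eclipse.ui.Workbench.update(Workbench.java:10)",
]
trace_graph = build_trace_graph(trace)
```

A graph like this can then be ranked to pick the most important class and method names from the stack traces.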
  21. So, from a noisy report, we extract the report title, the encountered exception, and the most important keywords from the stack traces. Then we do the search with this newly constructed query. For example, the baseline noisy query returns the result at the 53rd position, whereas our query returns the correct result at the topmost position.
  22. In the case of a poor bug report, we also apply a similar PageRank approach. But we collect the keywords from the source code using pseudo-relevance feedback. The details can be found in the paper. So, the bug report texts are merged with the keywords from the relevant source code. The bug report texts alone return the correct result at the 30th position. After feeding the poor report with appropriate keywords, the correct result is returned at the topmost position.
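Pseudo-relevance feedback can be illustrated with a small sketch. Note the simplifications, all of which are assumptions: documents are ranked by plain term overlap instead of a real IR engine, and the extra keywords are chosen by raw frequency rather than the PageRank-based selection mentioned above. The file names and contents are made up.

```python
import re
from collections import Counter


def tokens(text):
    return re.findall(r"[a-z]+", text.lower())


def expand_query(query, documents, top_docs=2, new_terms=3):
    """Expand a query with frequent terms from the top-ranked documents.

    `documents` maps a name to its (already split) text. The top-ranked
    documents are simply assumed to be relevant, which is the "pseudo"
    part of pseudo-relevance feedback.
    """
    query_terms = set(tokens(query))
    # Rank documents by how many query terms they share (ties by name).
    ranked = sorted(
        documents,
        key=lambda d: (-len(query_terms & set(tokens(documents[d]))), d),
    )
    counts = Counter()
    for name in ranked[:top_docs]:
        counts.update(t for t in tokens(documents[name]) if t not in query_terms)
    best = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:new_terms]
    return tokens(query) + [term for term, _ in best]


docs = {
    "SearchView.java": "search view refresh viewer refresh viewer table",
    "FilterDialog.java": "filter dialog search filter apply",
    "Logger.java": "log message write file",
}
expanded = expand_query("search view filter", docs)
```

The unrelated `Logger.java` never contributes terms, because it shares nothing with the initial query and so is not in the feedback set.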
  23. The rich bug report is inherently good for IR-based localization. But we found that selecting appropriate keywords can make it better. For example, if the bug report is reduced to these keywords, the correct result comes to the top of the list.
  25. First, we construct a semantic hyperspace using the Stack Overflow corpus. What is a hyperspace? If a space has more than 3 dimensions, we call it a hyperspace. How do we do it? First, we take the Stack Overflow data dump, which contains software-specific texts. Our corpus contains about 2.1 million questions and answers. We do pre-processing and feed the contents to FastText. Now, FastText generates a three-layer neural network model. This model essentially represents the whole vocabulary like this in a hyperspace. Now, how does it help?
  26. Here we see that burger is close to sandwich. Why? They are eaten together. I do that all the time. Well, that is not the case. They are mentioned in similar contexts by people across the whole corpus. The model recognizes such occurrences and thus puts burger and sandwich close together. Similarly, dumpling and ramen are close to each other. Now, we propose this. This is the original query, and this is the reformulated query. A good reformulated query will cluster together with the original query. A bad reformulated query will NOT be able to cluster with the original query. So, clustering tendency within the hyperspace is our weapon here. We use the Hopkins statistic and Polygon Area to measure the clustering tendency.
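For readers who have not met the Hopkins statistic, here is a minimal sketch under assumptions. It uses the convention where values near 1 mean strongly clustered points and values near 0.5 mean roughly random points (some texts define it the other way around). The 2-D toy points stand in for word vectors, which would have many more dimensions; the function names are hypothetical.

```python
import math
import random


def nearest_distance(point, others):
    """Euclidean distance from `point` to its nearest neighbour in `others`."""
    return min(math.dist(point, o) for o in others)


def hopkins(points, sample_size, seed=0):
    """Estimate clustering tendency of `points` (a list of equal-length tuples)."""
    rng = random.Random(seed)
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]
    # w: distance from each sampled real point to its nearest other real point.
    sampled = rng.sample(range(len(points)), sample_size)
    w = [nearest_distance(points[i], points[:i] + points[i + 1:]) for i in sampled]
    # u: distance from each uniformly random probe to its nearest real point.
    u = []
    for _ in range(sample_size):
        probe = tuple(rng.uniform(lows[d], highs[d]) for d in range(dims))
        u.append(nearest_distance(probe, points))
    return sum(u) / (sum(u) + sum(w))


# Two tight clusters far apart (clustered) versus an evenly spaced grid.
cluster = [(0.01 * i, 0.01 * j) for i in range(5) for j in range(5)]
clustered = cluster + [(x + 10.0, y + 10.0) for x, y in cluster]
grid = [(float(i), float(j)) for i in range(7) for j in range(7)]
h_clustered = hopkins(clustered, sample_size=10)
h_grid = hopkins(grid, sample_size=10)
```

In the clustered case, real points sit very close to each other while random probes land far away from them, which pushes the statistic towards 1.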
  27. Now, here is how we do it. We extract three reformulation candidates from the source code based on IR and data analytics. Then we apply the data analytics as a proxy for query quality, do data resampling and machine learning, and then identify the best reformulated query.
  28. Now, this is a poor bug report that does not contain any useful hints for bug localization. This is the baseline initial query. This is from the state-of-the-art. And this is our performance.
  29. Now, I am not going to discuss those studies in detail. But here is a glimpse. Developers generally look for relevant code on the web using natural language queries. Please note that we are not talking about simple web search, but rather about source code repositories such as GitHub. Now, GitHub provides this result. As you see, it tries to match the query keywords with comments and identifiers. But we are dealing with source code, right? So, we need a source code friendly query for a better result. So, we identify relevant API classes for this natural language query through extensive data mining and data analytics. And once again, Stack Overflow is our friend in this grand challenge.
  30. For the API suggestion, we collect natural language queries from four tutorial sites such as KodeJava and others. We collect 300+ queries, and we also collect the ground truth API classes from them. Then we try to determine whether our approach can suggest appropriate API classes for those queries by mining crowd knowledge from Stack Overflow. For the query reformulation part, we collect 4K code examples from GitHub, and combine them with our ground truth code segments from the tutorial sites. Then we determine whether our reformulated query actually works or not.
  31. Now, I care about the replication and reproducibility of my work, so that we can grow as a research community. So, all of my work could be replicated and potentially reproduced (hopefully). I was busy during my PhD, and my work is uploaded at GitHub. Much of it is open source. I am a member of the replication workshop and a PC member of the replication track, ICPC.
  32. Now, how far would I like to go in the next 5-10 years?
  33. Now, you might remember this diagram. So far, I focused mostly on Bug Localization in my PhD. In the next 5 years, I would focus on Bug Understanding and Fixing, the next logical steps of software debugging. Right now, I am working on bug reproducibility. That is, how to make a reported bug reproducible. Without that, a bug most probably cannot be fixed.
  34. So, I compared TF-IDF and PageRank. But so far, I did that only for query reformulation. There are 20+ software engineering tasks that use TF-IDF in one way or another. Now, a cool thing would be to investigate how PageRank can influence those tasks, given that PageRank outperformed TF-IDF in query reformulation. It would be really interesting to see.
  35. Another plan I have is to develop a programmer bot. It will accept a natural language query from developer JOE and return high quality, relevant code. Now, in the background, this will happen. First, the query will be translated into a search engine friendly query with relevant API classes, and thousands of examples will be collected. Then we will use some sophisticated machine learning to determine the best quality code example. Now, I did figure out this part. The rest is still pending, and I hope to get it done by some brilliant grad students.
  36. Over the years, I worked with experts from my domain, and learned from the best. I worked with academia as well as industry. For example, Vendasta is a leading company based in Saskatoon, and we got our NSERC Industry Engage grant in collaboration with them. So, my sincere thanks and gratitude to all the collaborators.
  37. Now, these are some achievements of my work. I got several competitive awards over the last few years. I got funding from NSERC. Thanks to U of S and my professor for all the support. My work also got accepted in flagship conferences and journals such as ICSE, FSE, ASE and EMSE.
  38. Thanks for your time and attention. Now, I am ready to take your questions.
  42. Now, let's expand and generalize the problem a bit. So far, we discussed code search within a local codebase. It could also be in a large-scale open source repository such as GitHub. Now, based on these contexts, there are different challenges in query reformulation. The local codebase is small, domain specific and organized. On the contrary, GitHub is huge, cross-domain and very noisy. So, yes, they need different strategies to suggest queries for them.
  43. Now, once such items are extracted, we split them. As we see, these single terms share some kind of semantics to convey a broader semantic. That is, they complement each other in this context. We capture such semantic dependencies in the source code, and develop a term graph like this.
  44. That is, each of the three people, the customer, the past developer and JOE, has their own vocabulary to describe a certain problem or concept. In fact, the probability that any two people will describe the same problem with the same vocabulary is only 15%-20%. So, naturally, developer JOE finds it a great challenge to make a connection between the bug report and the buggy code. This costs development time, money and valuable effort.
  47. Now the question is, why is this so challenging? The answer is the vocabulary mismatch problem. In fact, this is a common problem for any type of document search. Here we see that both guys are looking at the same object, but they are explaining it differently. That is, they are both correct from their own perspective, but wrong from the other guy's perspective. This actually happens with bug reports as well. The probability that both the customer and the developer will explain the same problem using the same terminology is only 15%. That is why selecting appropriate keywords from the bug report is very challenging.