SUPPORTING SOURCE CODE SEARCH WITH
CONTEXT-AWARE, ANALYTICS-DRIVEN
QUERY REFORMULATION
Masud Rahman
Department of Computer Science
University of Saskatchewan, Canada
Advisor: Dr. Chanchal Roy
@masud2336
TALK OUTLINE
Part 1: Research Problem
Part 2: PhD Thesis
Part 3: Contribution Summary
Part 4: Q&A + Discussions
MasudRahman,UofS
Part 1: Research Problem
MCAS: A SOFTWARE BUG THAT KILLS
Boeing 737 MAX 8
MCAS
THE SEARCH FOR THE BUGGY CODE
[Diagram: Boeing customer → MCAS bug report → Boeing developer → code search (query suggestion, query reformulation) → Boeing codebase]
SYSTEMATIC LITERATURE REVIEW
ACM DL
CrossRef
DBLP
Mendeley
Google Scholar
IEEE Xplore
ProQuest
ScienceDirect
SpringerLink
Web of Science
Wiley Online Lib
Study counts across the selection stages: 2871, 2317, 562, 195, 93, 56 (primary studies)
Stages shown: initial results, impurity removal, filter by title, filter by abstract, merging & duplicate removal, filter by full texts
Query reformulation, query expansion, query reduction, query formulation,
query refinement, automated query expansion, AQE, query suggestion,
query recommendation, term selection, query replacement, query difficulty,
query quality, keyword selection, keyword extraction, search term
identification, search query, search term, and search keyword.
I1: INAPPROPRIATE TERM WEIGHTING
$$\text{TF-IDF}(t) = \sum_{d \in D_{RF}} \bigl(1 + \log tf(t, d)\bigr) \times \log\frac{|D|}{n_t}$$
• Different syntax
• Different semantics
• Different structures
RQ1: Can TF-IDF deliver appropriate search keywords
either from source code or from bug reports? If not, how
can we improve the keyword selection?
I2: LOW QUALITY OF BUG REPORTS
5000+
Poor, Noisy, Rich
RQ2: Can we deliver appropriate keywords for IR-based bug localization (a.k.a., local code search) by incorporating the bug report quality?
Traditional Practices
I3: WORDNET FOR SEMANTIC SIMILARITY
W1  W2
RQ3: Can we deliver appropriate query keywords for
the code search using crowd knowledge (Stack
Overflow) and large data analytics (FastText)?
Part 2: PhD Thesis Proposal
PHD THESIS OVERVIEW
Thesis
RQ1 (Graph-based Term Weighting): S1 (STRICT), S3 (ACER)
RQ2 (Bug Report Quality Dimension): S2 (BLIZZARD)
RQ3 (Crowd Knowledge, Data Analytics): S4 (BLADER), S5 (RACK), S6 (NLP2API)
RQ1: Can TF-IDF deliver appropriate search
keywords either from source code or from
bug reports? If not, how can we improve
the keyword selection?
Graph-based Term
Weighting
TF-IDF: TERM IMPORTANCE (TRADITIONAL)
University of Saskatchewan
The Saskatchewan Huskies football team
represents the University of Saskatchewan
in U Sports football that competes in the
Canada West Universities Athletic
Association conference of U Sports. The
program has won the Vanier Cup national
championship three times, in 1990, 1996
and 1998.
The Saskatchewan Huskies
became only the second U Sports team to
advance to three consecutive Vanier Cup
games, after the Saint Mary's Huskies, but
lost all three games from 2004-2006. The
team has won the most Hardy Trophy
titles in Canada West, having won a total
of 20 times. The 2006 Saskatchewan
Huskies became only the third team to
play in a Vanier Cup that their school was
hosting, when the University of
Saskatchewan hosted the 42nd Vanier
Cup. The Toronto Varsity Blues were the
first when they won two Vanier Cups in
1965 and 1993. Saskatchewan also
became the first western school to host
the national championship game.
Term            TF    IDF     TF x IDF
Vanier          5     0.1     0.5
Won             4     0.1     0.4
Huskies         4     0.1     0.4
School          2     0.05    0.1
Saskatchewan    6     0.01    0.06
Championship    2     0.03    0.06
Sports          3     0.02    0.06
Times           2     0.03    0.06
Cup             4     0.01    0.04
Team            4     0.01    0.04
IDF = log (N / DF), where N is the number of documents and DF is the number of documents that contain the term
Saskatchewan Huskies
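The TF x IDF weighting on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the thesis implementation: the three-document corpus is made up, raw counts are used for TF, and the natural logarithm is assumed for IDF. The function also assumes the scored document is part of the corpus, so DF is never zero.

```python
import math
from collections import Counter

def tf_idf(doc_tokens, corpus):
    """Score each term of doc_tokens by raw term frequency times log(N / DF)."""
    n_docs = len(corpus)
    tf = Counter(doc_tokens)
    scores = {}
    for term, freq in tf.items():
        # DF: number of documents in the corpus that contain the term
        df = sum(1 for doc in corpus if term in doc)
        scores[term] = freq * math.log(n_docs / df)
    return scores

# Hypothetical toy corpus (not the Wikipedia passage from the slide)
corpus = [
    ["huskies", "won", "vanier", "cup", "vanier"],
    ["team", "won", "cup"],
    ["team", "school", "cup"],
]
scores = tf_idf(corpus[0], corpus)
top = sorted(scores, key=scores.get, reverse=True)
print(top)
```

A term that occurs in every document ("cup" here) gets a weight of zero regardless of how often it appears, which is the same effect that pushes Saskatchewan down the TF x IDF column on the slide.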
TEXTRANK: TERM IMPORTANCE USING CO-OCCURRENCES (MIHALCEA ET AL., EMNLP 2004)
IResource … IJavaElement
(Term Co-occurrence)
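A term co-occurrence graph of the kind TextRank works on can be sketched as follows. The token list and the window size are assumptions made for illustration; with `window=2` only directly adjacent terms are linked.

```python
from collections import defaultdict

def cooccurrence_graph(tokens, window=2):
    """Link each token to the tokens that follow it within `window` positions."""
    graph = defaultdict(set)
    for i, term in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != term:
                # Undirected edge: record it in both directions
                graph[term].add(other)
                graph[other].add(term)
    return graph

# Hypothetical preprocessed bug-report tokens
tokens = ["custom", "search", "results", "view", "iresource", "element"]
graph = cooccurrence_graph(tokens, window=2)
print(sorted(graph["search"]))
```

Terms that co-occur with many other terms end up with many edges, which is what a graph-based ranking then rewards.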
POSRANK: TERM IMPORTANCE USING SYNTACTIC
DEPENDENCE (BLANCO & LIOMA, INF. RETR. 2012)
Jespersen Rank Theory
(Syntactic Dependence)
Noun Verb Adjective
Element …reported, element …plain
STRICT: QUERY KEYWORD SELECTION WITH PAGERANK (BRIN & PAGE, 1998)
$$S(v_i) = (1 - \phi) + \phi \sum_{j \in In(v_i)} \frac{S(v_j)}{|Out(v_j)|}, \quad 0 \le \phi \le 1$$
• Element
• Iresource
• Provider
• Level
• Tree
Candidate Query 1
Candidate Query 2
Sergey Brin, Larry Page
PageRank Algorithm
Best Query
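The PageRank-style scoring behind the keyword selection can be sketched as an iterative update over a term graph. The graph below is hypothetical (it only reuses the five keywords from this slide), and the damping value 0.85 and the fixed iteration count are assumptions, not values taken from the slides.

```python
def term_rank(graph, damping=0.85, iterations=50):
    """Iterate S(v_i) = (1 - d) + d * sum(S(v_j) / |Out(v_j)|) over in-neighbours v_j."""
    scores = {node: 1.0 for node in graph}
    for _ in range(iterations):
        updated = {}
        for node in graph:
            # Each in-neighbour shares its score equally among its outgoing edges
            incoming = sum(
                scores[other] / len(graph[other])
                for other in graph
                if node in graph[other]
            )
            updated[node] = (1 - damping) + damping * incoming
        scores = updated
    return scores

# Hypothetical undirected term graph, stored as symmetric adjacency sets
graph = {
    "element": {"iresource", "provider", "tree"},
    "iresource": {"element", "provider"},
    "provider": {"element", "iresource"},
    "tree": {"element", "level"},
    "level": {"tree"},
}
scores = term_rank(graph)
best = sorted(scores, key=scores.get, reverse=True)[:3]
print(best)
```

The highest-scoring terms would then be taken as the search keywords; how many to keep is a separate design choice.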
ACER: KEYWORDS FROM SOURCE CODE
resolveRuntimeClasspathEntry
Resolve Runtime Classpath Entry
$$S(v_i) = (1 - \phi) + \phi \sum_{j \in In(v_i)} \frac{S(v_j)}{|Out(v_j)|}, \quad 0 \le \phi \le 1$$
RQ1: Keywords selected by PageRank are more effective for local code searches (e.g., concept location, bug localization) than those selected by TF-IDF.
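The identifier splitting shown on this slide (resolveRuntimeClasspathEntry into Resolve, Runtime, Classpath, Entry) can be sketched with a regular expression. Lower-casing the parts and keeping acronyms together are choices made for this sketch, not details taken from the slides.

```python
import re

def split_identifier(identifier):
    """Split a camelCase or PascalCase identifier into lower-case words."""
    # Acronym runs, capitalised or lower-case words, and digit runs
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)
    return [part.lower() for part in parts]

print(split_identifier("resolveRuntimeClasspathEntry"))
```

The resulting words are what would become the nodes of the term graph built from source code.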
launch
debug
resolve
required
classpath
RQ2: Can we deliver appropriate keywords
for IR-based bug localization (a.k.a., local
code search) by incorporating the bug
report quality?
Bug Report Quality
Dimension
BLIZZARD: QUALITY-AWARE SEARCH QUERIES
Noisy, Poor, Rich
Equality, Equity
SEARCH QUERY FROM NOISY BUG REPORT
Bug 31637 – should be able to cast null
NullPointerException
Ci Cj Mk Mn Cp
53 01
RQ2: High-quality keywords can be provided for IR-based bug localization (a.k.a., local code search) by considering bug report quality.
Crowd
Knowledge
Data Analytics
RQ3: Can we deliver appropriate query
keywords for the code search using crowd
knowledge (Stack Overflow) and large data
analytics (FastText)?
Semantic
Hyperspace
BLADER: QUERY REFORMULATION WITH
CROWD KNOWLEDGE & DATA ANALYTICS
Stack Overflow
(Crowd Knowledge)
Data
preprocessing
Neural Text classifier
FastText model
(skip-gram)
SEMANTIC HYPERSPACE
Word 1 P (1, 5, 6, 7, ….., N)
Word 2 P (2, 4, 6, 9, ….., N)
Word 2
CLUSTERING TENDENCY WITH DATA ANALYTICS
[Figure: query Q and candidate keyword sets C1 and C2 over the terms channel, join, spam, entered, connect, invitation, message, room, chat, handle, mask, remote, synd, admin]
C1 is better than C2
Hopkins Statistic (HS)
Polygon Area (PA)
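The slide names Polygon Area (PA) as a clustering-tendency measure but does not show how it is computed. As a hedged illustration only, assuming the candidate keywords are projected to 2D points that form a simple polygon, the area could be obtained with the standard shoelace formula:

```python
def polygon_area(vertices):
    """Area of a simple polygon given its (x, y) vertices in boundary order."""
    total = 0.0
    # Pair every vertex with its successor, wrapping around to the first one
    for (x1, y1), (x2, y2) in zip(vertices, vertices[1:] + vertices[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

# Hypothetical 2D positions of four candidate keywords
print(polygon_area([(0, 0), (1, 0), (1, 1), (0, 1)]))
```

How the area is turned into a preference between candidate sets such as C1 and C2 is not specified on the slide and is left open here.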
RQ3: Appropriate query keywords can be delivered for the
code search using Stack Overflow and FastText.
EXPERIMENT, DATASET & METRICS
5K+ bug reports, version history, ground truth
1. Hit@K
2. MAP@K
3. MRR@K
4. QE
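The four metrics can be sketched for a single query as below. The slides do not define them, so two assumptions are made: QE is taken to be the rank of the first correct result (consistent with "Lower QE is better" elsewhere in the deck), and average precision is normalised by the number of correct results found in the top K. MAP@K and MRR@K are then the means of the per-query values over all queries. File names are hypothetical.

```python
def query_effectiveness(ranked, relevant):
    """QE: 1-based rank of the first relevant result, or None if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return rank
    return None

def hit_at_k(ranked, relevant, k):
    """True if at least one relevant result appears in the top k."""
    return any(item in relevant for item in ranked[:k])

def reciprocal_rank_at_k(ranked, relevant, k):
    """1 / rank of the first relevant result within the top k, else 0."""
    rank = query_effectiveness(ranked[:k], relevant)
    return 1.0 / rank if rank else 0.0

def average_precision_at_k(ranked, relevant, k):
    """Mean of the precision values at each relevant result within the top k."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

ranked = ["A.java", "B.java", "C.java", "D.java"]  # hypothetical search results
relevant = {"B.java", "D.java"}                    # hypothetical ground truth
print(query_effectiveness(ranked, relevant), hit_at_k(ranked, relevant, 1))
```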
SEARCH CONTEXTS: LOCAL & INTERNET-SCALE
Local code search
(e.g., bug localization)
Internet-scale
code search
Boeing
codebase GitHub
76%
CROWD KNOWLEDGE & DATA ANALYTICS FOR QUERY
EXPANSION
Convert image to gray scale without losing transparency
BufferedImage Grayscale ImageEdit ColorConvertOp File
Transparency ColorSpace BufferedImageOp Graphics ImageEffects
WHAT IS CROWD KNOWLEDGE?
RACK: QUERIES USING CROWD KNOWLEDGE
MessageDigest
generate
MD5
hash
RQ3: Appropriate query keywords (e.g., relevant API classes)
can be delivered for the code search using crowd knowledge
(Stack Overflow)
Q* = Q + C
Keyword-API
Mapping DB
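The expansion Q* = Q + C with a keyword-API mapping can be sketched as a lookup followed by a vote. The mapping contents and the simple vote counting are assumptions for illustration; they stand in for, and do not reproduce, the ranking heuristics RACK applies to Stack Overflow data.

```python
from collections import Counter

# Hypothetical keyword-to-API mapping, standing in for the Keyword-API Mapping DB
KEYWORD_API = {
    "generate": ["KeyGenerator", "MessageDigest"],
    "md5": ["MessageDigest"],
    "hash": ["MessageDigest", "HashMap"],
}

def expand_query(query, mapping, top_n=1):
    """Return Q* = Q + C, where C holds the API classes most keywords point to."""
    keywords = query.lower().split()
    votes = Counter(api for kw in keywords for api in mapping.get(kw, []))
    candidates = [api for api, _ in votes.most_common(top_n)]
    return keywords + candidates

print(expand_query("generate MD5 hash", KEYWORD_API))
```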
NLP2API: QUERIES WITH DATA ANALYTICS
Semantic Proximity: if proximity(Q, A) > proximity(Q, B), then Q* = Q + A
RQ3: Appropriate query keywords (e.g., relevant API classes)
can be delivered for the code search using large-scale data
analytics (FastText).
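The semantic-proximity rule on this slide (add candidate A to the query when it is closer to Q than candidate B) can be sketched with cosine similarity over word vectors. The three-dimensional vectors are toy values; in the approach described here they would come from a FastText model, and the use of cosine similarity as the proximity function is an assumption.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def reformulate(query, query_vec, candidates):
    """Append the candidate whose vector is closest to the query vector."""
    best = max(candidates, key=lambda name: cosine(query_vec, candidates[name]))
    return query + [best]

query_vec = [1.0, 0.0, 1.0]                                 # toy vector for Q
candidates = {"A": [0.9, 0.1, 0.8], "B": [0.0, 1.0, 0.1]}   # toy vectors for A and B
print(reformulate(["q"], query_vec, candidates))
```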
Part 3: Contribution Summary
PHD PROGRESS REPORT
S1 (STRICT): SANER 2015, 2017; TSE (A*) (Under Review)
S2 (BLIZZARD): ESEC/FSE 2018 (A*)
S3 (ACER): ASE 2017 (A)
S4 (BLADER): TSE (A*) (To be submitted)
S5 (RACK): SANER 2016; ICSE 2017 (A*)
S6 (NLP2API): ICSME 2018 (A)
EMSE (A)
ICSE 2019 (A*) Doctoral Symposium, Montreal
TAKE-HOME MESSAGES
Traditional → Proposed
Term Independence (TF-IDF) → Term Dependence (PageRank)
Reliance on Auxiliary Resources (e.g., history mining) → Efficient Use of Primary Resource (e.g., Bug Reports)
Bug Report Quality (Overlooked) → Reporting Quality-Aware Bug Localization
Thesaurus-Based Similar Keyword Suggestion → Crowd Knowledge & Large Data Analytics
Cosine Similarity for Semantic Distance → Semantic Hyperspace & Clustering Tendency
http://www.usask.ca/~masud.rahman
https://github.com/masud-technope
Contact: masud.rahman@usask.ca
@masud2336
Masud Rahman
Part IV: Q & A
TAKE-HOME MESSAGES
RQ1
RQ2 RQ3
TF-IDF
PageRank
Equality
Equity
Stack Overflow
FastText
WordNet
Thesis
EXPERIMENT, DATASET & METRICS
Java2s
CodeJava
310 Queries & Ground truth
769K Code segments
Hit@K
MAP@K
MRR@K
MR@K
QE
NDCG
WHAT IS A GOOD SEARCH QUERY?
Correct Result
Baseline Query
(Title + Description)
Worse Query Better Query
Title
Description
SEMANTIC HYPERSPACE
x P (1, 5, 6, 7, ….., N)
y P (2, 4, 6, 9, ….., N)
y
y = mx + c,
x^2 +y^2 = r^2
ax^2+bx+c=0
TWO WORKING CONTEXTS: LOCAL & GLOBAL
Local code search
(e.g., bug localization)
Internet-scale
code search
Boeing
codebase GitHub
S2: KEYWORDS SELECTION FROM SOURCE
CODE WITH CODERANK
resolveRuntimeClasspathEntry
Resolve Runtime Classpath Entry
$$S(v_i) = (1 - \phi) + \phi \sum_{j \in In(v_i)} \frac{S(v_j)}{|Out(v_j)|}, \quad 0 \le \phi \le 1$$
RQ1 [Source Code]: Keywords selected by PageRank are more effective for local code searches (e.g., concept location) than those selected by TF-IDF.
HOW DID WE DO?
RQ3: Appropriate query keywords can be delivered for the
code search using Stack Overflow and FastText.
R3: SOLVE VOCABULARY MISMATCH ISSUE
Customer
Developer
Past
Developer
Bug Report
Codebase
SOLUTION: SEMANTIC HYPERSPACE
Word 1 P (1, 5, 6, 7, ….., N)
Word 2 P (2, 4, 6, 9, ….., N)
Word 2
Cosine distance = Semantic relevance
R4: GENETIC ALGORITHM FOR QUERIES
Method        Search Query                                           QE
Baseline      {title + description}                                  25
STRICT[140]   {tab classpath enabled buttons user entry}             86
TF-IDF        {button entry bootstrap enabled incorrectly moving}    177
GA            {open reflect tab bottom entry classpath}              01
Title
Description
Lower QE is better
DICE, ROCCHIO, RSV
VOCABULARY MISMATCH PROBLEM
Both are correct and wrong!
Boeing Customer, Boeing Developer
KEYWORDS FROM A BUG REPORT
Title
Description
ID   Query                                                       QE
1    Custom search results view iresource                        1331
2    Custom search results search results view                   636
3    element iresource provider level tree                       01
4    Custom search results hierarchically java search results    570
Lower QE is better
PROBABILISTIC TERM WEIGHTING
KLD
PhD proposal of Masud Rahman

  • 1. SUPPORTING SOURCE CODE SEARCH WITH CONTEXT-AWARE, ANALYTICS-DRIVEN QUERY REFORMULATION Masud Rahman Department of Computer Science University of Saskatchewan, Canada Advisor: Dr. Chanchal Roy @masud233 6
  • 2. TALK OUTLINE Part 2: PhD Thesis Part 1: Research Problem Part 4: Q&A + Discussions 2 Part 3: Contribution Summary MasudRahman,UofS
  • 3. MasudRahman,UofS Part 1: Research Problem 3 P1 P2 P4P3
  • 4. MCAS: A SOFTWARE BUG THAT KILLS MasudRahman,UofS Boeing 737 MAX 8 4 MCAS P1 P2 P4P3
  • 5. THE SEARCH FOR THE BUGGY CODE MasudRahman,UofS Boeing Customer MCAS Bug report Boeing Developer Code search Query Suggestion Query Reformulation Boeing Codebase 5 P1 P2 P4P3
  • 6. SYSTEMATIC LITERATURE REVIEW MasudRahman,UofS ACM DL CrossRef DBLP Mendeley Google Scholar IEEE Xplore ProQuest ScienceDirect SpringerLink Web of Science Wiley Online Lib 2871 2317 562 Initial results Impurity removal Filter by Title 195 Filter by Abstract 93 Merging & Duplicate removal 56 Primary studies Filter by Full texts Query reformulation, query expansion, query reduction, query formulation, query refinement, automated query expansion, AQE, query suggestion, query recommendation, term selection, query replacement, query difficulty, query quality, keyword selection, keyword extraction, search term identification, search query, search term, and search keyword. 6 3 P1 P2 P4P3
  • 7. I1: INAPPROPRIATE TERM WEIGHTING   RFDd t t n D dftIDFTF log)),log(1()( • Different syntax • Different semantics • Different structures MasudRahman,UofS 7 RQ1: Can TF-IDF deliver appropriate search keywords either from source code or from bug reports? If not, how can we improve the keyword selection? P1 P2 P4P3
  • 8. I2: LOW QUALITY OF BUG REPORTS 8 5000+ MasudRahman,UofS PoorNoisyRich RQ2: Can we deliver appropriate keywords for IR- based bug localization (a.k.a., local code search) by incorporating the bug report quality? Traditional Practices P1 P2 P4P3
  • 9. I3: WORDNET FOR SEMANTIC SIMILARITY 9 MasudRahman,UofS W1  W2 RQ3: Can we deliver appropriate query keywords for the code search using crowd knowledge (Stack Overflow) and large data analytics (FastText)? P1 P2 P4P3
  • 10. MasudRahman,UofS Part 2: PhD Thesis Proposal 10 P1 P2 P4P3
  • 11. PHD THESIS OVERVIEW 11 MasudRahman,UofS S1 (STRICT) S3 (ACER) S2 (BLIZZARD) S6 (NLP2API) S5 (RACK) S4 (BLADER) Thesis Graph-based Term Weighting Bug Report Quality Dimension Crowd Knowledge Data Analytics RQ1 RQ2 RQ3 P1 P2 P4P3
  • 12. 12 MasudRahman,UofS S1 S2 S3 S4 S5 S6 RQ1: Can TF-IDF deliver appropriate search keywords either from source code or from bug reports? If not, how can we improve the keyword selection? Graph-based Term Weighting P1 P2 P4P3
  • 13. TF-IDF: TERM IMPORTANCE (TRADITIONAL) 13 MasudRahman,UofS University of Saskatchewan The Saskatchewan Huskies football team represents the University of Saskatchewan in U Sports football that competes in the Canada West Universities Athletic Association conference of U Sports. The program has won the Vanier Cup national championship three times, in 1990, 1996 and 1998. The Saskatchewan Huskies became only the second U Sports team to advance to three consecutive Vanier Cup games, after the Saint Mary's Huskies, but lost all three games from 2004-2006. The team has won the most Hardy Trophy titles in Canada West, having won a total of 20 times. The 2006 Saskatchewan Huskies became only the third team to play in a Vanier Cup that their school was hosting, when the University of Saskatchewan hosted the 42nd Vanier Cup. The Toronto Varsity Blues were the first when they won two Vanier Cups in 1965 and 1993. Saskatchewan also became the first western school to host the national championship game. Saskatchewan:6 Vanier: 5 Won: 4 Huskies: 4 Cup: 4 Team: 4 Sports: 3 Times: 2 School: 2 Championship:2 Vanier: 0.5 Won: 0.4 Huskies: 0.4 School: 0.1 Saskatchewan: 0.06 Championship: 0.06 Sports: 0.06 Times: 0.06 Cup: 0.04 Team: 0.04 TF IDF TF x IDF Saskatchewan: .01 Vanier: 0.1 Won: 0.1 Huskies: 0.1 Cup: 0.01 Team: 0.01 Sports: 0.02 Times: 0.03 School: 0.05 Championship: .03 IDF = log (DF / N) Saskatchewan Huskies P1 P2 P4P3 S1 S2 S3 S4 S5 S6
  • 14. TEXTRANK: TERM IMPORTANCE USING CO- OCCURRENCES (MIHALCEA ET AL, EMNLP 2004) 14 MasudRahman,UofS IResource … IJavaElement IResource … IJavaElement (Term Co-occurrence) P1 P2 P4P3 S1 S2 S3 S4 S5 S6
  • 15. POSRANK: TERM IMPORTANCE USING SYNTACTIC DEPENDENCE (BLANCO & LIOMA, INF. RETR. 2012) 15 MasudRahman,UofS Jespersen Rank Theory (Syntactic Dependence) Noun Verb Adjective Element …reported, element …plain P1 P2 P4P3 S1 S2 S3 S4 S5 S6 (Syntactic Dependence)
  • 16. STRICT: QUERY KEYWORD SELECTION WITH PAGERANK (BRIN & PAGE, 1998) 16    )( )10( |)(| )( )1()( ivInj j j i vOut vS vS  •Element •Iresource •Provider •Level •Tree Candidate Query 1 Candidate Query 2 Sergey Brin Larry Page PageRank Algorithm Best Query MasudRahman,UofS P1 P2 P4P3 S1 S2 S3 S4 S5 S6 K1 K2 K3 K4 K5 K1 K2 K3 K4 K5
  • 17. ACER: KEYWORDS FROM SOURCE CODE resolveRuntimeClasspathEntry Resolve Runtime Classpath Entry    )( )10( |)(| )( )1()( ivInj j j i vOut vS vS  RQ1: Keywords selected by PageRank are more effective for local code searches (e.g., concept location, bug localization) than that of TF-IDF 17 MasudRahman,UofS P1 P2 P4P3 launch debug resolve required classpath S1 S2 S3 S4 S5 S6
  • 18. 18 MasudRahman,UofS RQ2: Can we deliver appropriate keywords for IR-based bug localization (a.k.a., local code search) by incorporating the bug report quality? Bug Report Quality Dimension P1 P2 P4P3 S1 S2 S3 S4 S5 S6
  • 19. BLIZZARD: QUALITY-AWARE SEARCH QUERIES 19 Noisy Poor Rich MasudRahman,UofS PoorNoisyRich Rich Noisy Poor Equality Equity P1 P2 P4P3 S1 S2 S3 S4 S5 S6
  • 20. SEARCH QUERY FROM NOISY BUG REPORT Bug 31637 – should be able to cast null NullPointerException Ci Cj Mk Mn Cp 53 01 20 MasudRahman,UofS P1 P2 P4P3 S1 S2 S3 S4 S5 S6 RQ2: High quality keywords can be provided for IR- based bug localization (a.k.a., local code search) by considering bug report quality.
  • 21. 21 MasudRahman,UofS Crowd Knowledge Data Analytics P1 P2 P4P3 S1 S2 S3 S4 S5 S6 RQ3: Can we deliver appropriate query keywords for the code search using crowd knowledge (Stack Overflow) and large data analytics (FastText)?
  • 22. Semantic Hyperspace BLADER: QUERY REFORMULATION WITH CROWD KNOWLEDGE & DATA ANALYTICS 22 MasudRahman,UofS Stack Overflow (Crowd Knowledge) Data preprocessing Neural Text classifier FastText model (skip-gram) P1 P2 P4P3 S1 S2 S3 S4 S5 S6
  • 23. SEMANTIC HYPERSPACE 23 MasudRahman,UofS Word 1 P (1, 5, 6, 7, ….., N) Word 2 P (2, 4, 6, 9, ….., N) Word 2 P1 P2 P4P3 S1 S2 S3 S4 S5 S6
  • 24. 24 MasudRahman,UofS channel join spam entered connect invitation message room chat handle mask remote synd admin Q C1 C2 CLUSTERING TENDENCY WITH DATA ANALYTICS C1 is better than C2 P1 P2 P4P3 S1 S2 S3 S4 S5 S6 Hopkins Statistic (HS) Polygon Area (PA) RQ3: Appropriate query keywords can be delivered for the code search using Stack Overflow and FastText.
  • 25. EXPERIMENT, DATASET & METRICS 25 5K+ Bug reports Version HistoryGround Truth MasudRahman,UofS P1 P2 P4P3 S1 S2 S3 S4 S5 S6 1. Hit@K 2. MAP@K 3. MRR@K 4. QE
  • 26. SEARCH CONTEXTS: LOCAL & INTERNET-SCALE Local code search (e.g., bug localization) Internet-scale code search Boeing codebase GitHub 26 76% S1 S2 S3 S4 S5 S6 MasudRahman,UofS P1 P2 P4P3
  • 27. CROWD KNOWLEDGE & DATA ANALYTICS FOR QUERY EXPANSION MasudRahman,UofS Convert image to gray scale without losing transparency BufferedImage Grayscale ImageEdit ColorConvertOp File Transparency ColorSpace BufferedImageOp Graphics ImageEffects 27 P1 P2 P4P3 S1 S2 S3 S4 S5 S6
  • 28. WHAT IS CROWD KNOWLEDGE?
  • 29. RACK: QUERIES USING CROWD KNOWLEDGE. Example: generate MD5 hash; MessageDigest. Keyword-API Mapping DB. Q* = Q + C. RQ3: Appropriate query keywords (e.g., relevant API classes) can be delivered for code search using crowd knowledge (Stack Overflow).
  • 30. NLP2API: QUERIES WITH DATA ANALYTICS. Semantic Proximity: if proximity(Q, A) > proximity(Q, B), then Q* = Q + A. RQ3: Appropriate query keywords (e.g., relevant API classes) can be delivered for code search using large-scale data analytics (FastText).
  • 31. Part 3: Contribution Summary
  • 32. PHD PROGRESS REPORT. Studies: S1 (STRICT), S2 (BLIZZARD), S3 (ACER), S4 (BLADER), S5 (RACK), S6 (NLP2API). Venues: SANER 2015, 2017; TSE (A*) (under review); ESEC/FSE 2018 (A*); ASE 2017 (A); TSE (A*) (to be submitted); SANER 2016; ICSE 2017 (A*); ICSME 2018 (A); EMSE (A); ICSE 2019 (A*) Doctoral Symposium, Montreal.
  • 33. TAKE-HOME MESSAGES. Traditional vs. Proposed: (1) Term Independence (TF-IDF) vs. Term Dependence (PageRank); (2) Reliance on Auxiliary Resources (e.g., history mining) vs. Efficient Use of Primary Resource (e.g., Bug Reports); (3) Bug Report Quality (Overlooked) vs. Reporting Quality-Aware Bug Localization; (4) Thesaurus-Based Similar Keyword Suggestion vs. Crowd Knowledge & Large Data Analytics; (5) Cosine Similarity for Semantic Distance vs. Semantic Hyperspace & Clustering Tendency.
  • 37. EXPERIMENT, DATASET & METRICS. Sources: Java2s, CodeJava. 310 queries & ground truth; 769K code segments. Metrics: Hit@K, MAP@K, MRR@K, MR@K, QE, NDCG.
  • 38. WHAT IS A GOOD SEARCH QUERY? Title; Description; Baseline Query (Title + Description); Worse Query; Better Query; Correct Result.
  • 39. SEMANTIC HYPERSPACE. x: P (1, 5, 6, 7, ….., N). y: P (2, 4, 6, 9, ….., N). y = mx + c, x^2 + y^2 = r^2, ax^2 + bx + c = 0
  • 40. TWO WORKING CONTEXTS: LOCAL & GLOBAL. Local code search (e.g., bug localization): Boeing codebase. Internet-scale code search: GitHub.
  • 41. S2: KEYWORDS SELECTION FROM SOURCE CODE WITH CODERANK. resolveRuntimeClasspathEntry: Resolve Runtime Classpath Entry. $S(v_i) = (1 - \phi) + \phi \sum_{j \in In(v_i)} \frac{S(v_j)}{|Out(v_j)|}, \quad (0 \le \phi \le 1)$. RQ1 [Source Code]: Keywords selected by PageRank are more effective for local code searches (e.g., concept location) than those selected by TF-IDF.
  • 42. HOW DID WE DO? RQ3: Appropriate query keywords can be delivered for code search using Stack Overflow and FastText.
  • 43. R3: SOLVE VOCABULARY MISMATCH ISSUE. Customer; Developer; Past Developer; Bug Report; Codebase.
  • 44. SOLUTION: SEMANTIC HYPERSPACE. Word 1: P (1, 5, 6, 7, ….., N). Word 2: P (2, 4, 6, 9, ….., N). Cosine distance = Semantic relevance.
  • 45. R4: GENETIC ALGORITHM FOR QUERIES. Method, Search Query, QE: Baseline, {title + description}, 25; STRICT [140], {tab classpath enabled buttons user entry}, 86; TF-IDF, {button entry bootstrap enabled incorrectly moving}, 177; GA, {open reflect tab bottom entry classpath}, 01. Lower QE is better.
  • 46. SEARCH QUERY FROM NOISY BUG REPORT. Bug 31637 – should be able to cast null. NullPointerException. Ci Cj Mk Mn Cp. Result rank: 53 vs. 01.
  • 48. VOCABULARY MISMATCH PROBLEM. Boeing Customer; Boeing Developer. Both are correct and wrong!
  • 49. KEYWORDS FROM A BUG REPORT. Title; Description. ID, Query, QE: 1. Custom search results view iresource, 1331; 2. Custom search results search results view, 636; 3. element iresource provider level tree, 01; 4. Custom search results hierarchically java search results, 570. Lower QE is better.
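
The graph-based term weighting on slide 41 (and the co-occurrence graphs described in the speaker notes) can be illustrated with a minimal Python sketch. This is not the author's implementation: the sliding-window size, the damping value of 0.85, the fixed iteration count, and the sample tokens are all assumptions chosen for illustration. It only shows the general idea of scoring each term by the scores of the terms it co-occurs with.

```python
from collections import defaultdict


def build_cooccurrence_graph(tokens, window=2):
    """Link terms that share a sliding window of `window` consecutive tokens."""
    graph = defaultdict(set)
    for i, term in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != term:
                # Co-occurrence is treated as an undirected edge (assumption).
                graph[term].add(other)
                graph[other].add(term)
    return graph


def rank_terms(graph, damping=0.85, iterations=50):
    """Iterate S(v) = (1 - d) + d * sum(S(u) / |Out(u)|) over neighbours u of v."""
    scores = {term: 1.0 for term in graph}
    for _ in range(iterations):
        scores = {
            term: (1 - damping)
            + damping * sum(scores[n] / len(graph[n]) for n in neighbours)
            for term, neighbours in graph.items()
        }
    # Highest-scoring terms come first.
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical, already-preprocessed bug report tokens.
tokens = "custom search results view search results tree view provider".split()
graph = build_cooccurrence_graph(tokens)
keywords = rank_terms(graph)
print(keywords[:3])
```

Unlike TF-IDF, which scores each term in isolation, a term here gains weight from well-connected neighbours, which is the "term dependence" contrast the slides draw.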

Editor's notes

  1. Hello everyone! Good afternoon! Thanks for attending this meeting. My name is Masud Rahman. I am a PhD Candidate from the Software Research Lab. I work with Dr. Chanchal K. Roy. Today, I will be talking about automated query reformulations for code search.
  2. Today, my talk will be divided into four sections. In the first section, I will discuss the research problem I am trying to solve in my PhD. In the second section, I will discuss my PhD thesis proposals to solve that research problem. In the third section, I will summarize my PhD contributions. Finally, we will have a Q&A session and interesting discussions.
  3. Part 1: Research Problem
  4. You are looking at two aircraft, one from Ethiopian Airlines and one from Lion Air, Indonesia. This is called the nose-down situation. Due to these nose-down situations, we had two fatal crashes in less than a year. These crashes took 346 precious human lives and cost billions of dollars. Now, the culprit is MCAS. This is a software component that was added to the Boeing 737 MAX 8. The bottom line is that this is a faulty component, not well designed, and it ultimately led to the crashes. That is why Boeing 737 MAX planes are grounded right now.
  5. Now, let's say a Boeing customer has submitted a bug report. A Boeing developer is then responsible for locating and repairing the faulty code that triggers that bug. As a frequent practice, the developer chooses a few important keywords and attempts to locate the buggy code within the Boeing codebase. But studies show that 88% of the keywords chosen by the developer could be incorrect. That is, they do not return the buggy code. So, the obvious next step is to reformulate the query with automated tool support, so that the buggy code can be located. There are also tools that take a bug report and suggest appropriate search queries in the first place. We are interested in this part of the process, and my PhD focuses on it. As you can also see, Google does not have any jurisdiction in this case.
  6. So what did we do? We did a systematic literature survey of 56 primary studies on query reformulation for code search. During this study, we found 3 major issues in the literature.
  7. Now, this is a metric that has been in use since the last century. It was proposed in the 70s. It is a good metric, but it was actually proposed for regular texts such as news articles or plain text. On the other hand, we are dealing with source code here. Now, regular texts and source code have different semantics and different structures. They are not the same. So, metrics for regular texts are not appropriate for source code: this is our hypothesis. So, here is our first research question: How does TF-IDF perform? If not well, can we propose something new?
  8. We did an empirical study with 5K+ bug reports in our ICSE poster. And we discovered that bug reports could be very different in terms of quality. There could be different types of bug reports. A report could be noisy, with stack traces, which is 16%. It could be really poor, containing no structured entities, which is 30%. Or it could be a rich bug report that includes source code, test cases and other items, which is 54%. Now, what do the existing studies do? They treat all these different types of bug reports the same. So, in their approach, not everybody gets a chance to watch the game. So, here is our second research question: Can we incorporate reporting quality into bug localization and deliver better queries?
  9. Identifying similar words is very important during query reformulation. We found that WordNet has been extensively used in the literature for finding similar words. Now, it is good for regular texts. But again, we are dealing with source code here. Evidence suggests that WordNet might not work well for source code. However, those were the old days. Now we have Stack Overflow and advanced tools like FastText for semantic similarity calculation. So, here is our third RQ: Can we deliver appropriate keywords during code search using Stack Overflow and FastText?
  10. Now, we are done with the background concepts, Part 1. Now, we are going into Part 2: PhD Thesis.
  11. So, this is our thesis statement. We hypothesize that we can improve query reformulation using (1) graph-based term weighting rather than TF-IDF, (2) bug report quality and document contexts, and (3) crowdsourced knowledge, i.e., Stack Overflow, and data analytics such as word embeddings from FastText. So, to evaluate these hypotheses, we conduct six studies in the PhD. The first and second studies address RQ1, the third study addresses RQ2, and the rest answer RQ3.
  12. Similarly, we can see the phrases and dependencies among the terms in the bug report texts as well. Our job is to identify the keywords from these texts, right? So, what did we do? We consider the co-occurrences among the terms. That is, how terms occur with other terms within a certain context. We encode such co-occurrences as edges, and transform the texts into a graph like this.
  13. Besides term co-occurrences, we consider another aspect called syntactic dependencies. For this, we used Jespersen's Rank Theory, a theory developed back in 1925. According to this theory, the parts of speech of a sentence can be divided into three ranks: nouns in the first rank, verbs and adjectives in the second rank, and the rest in the third rank. According to Jespersen, verbs and adjectives modify nouns. That is, there are syntactic dependencies between "element" and "reported", and between "element" and "plain", that convey the overall meaning of the sentence. Now, we capture such syntactic dependencies as well, and transform the report texts into a POS graph.
  14. So, we have created two graphs, right? Now, we have two graphs developed from the bug report based on two different dimensions: word co-occurrence and syntactic dependence. Once we have the graphs, we apply the famous PageRank algorithm. This is the backbone of Google search. Now, the algorithmic details are a bit complex, but I will try to provide an overview here. Why do you think this guy is laughing? Because he is getting the maximum votes. Similarly, in the graph, the node that is connected to most of the nodes is the winner. That is, a term's importance is determined by its connectivity with other nodes. More importantly, since this is a recursive algorithm, the importance depends on the weights of the connected nodes as well. Once the computation is done, we get a reformulation candidate from each graph. What is a reformulation candidate? A ranked list of keywords like this. So, we collect two candidates from the two graphs, apply machine learning, and suggest the best one as our query from the bug report.
  15. Now, once such items are extracted, we split them. As we see, these single terms share some kind of semantics to convey a broader meaning. That is, they complement each other in this context. Now, we capture such semantic dependencies in the source code, and develop a term graph like this.
  16. So, first we take a bug report as input. Then we apply regular expressions to identify the structured components. We then classify whether this is (a) a noisy report containing stack traces, (b) a poor bug report containing only regular texts, or (c) a rich bug report containing source code and texts. Once the quality level is identified, what's the next step? Well, we do query reformulation, unlike the earlier studies. We separate signal from noise in the noisy report, and feed the poor bug report with appropriate keywords. We mostly keep the rich bug report as is. So, that is the equity approach.
  17. So, from a noisy report, we extract the report title, the encountered exception, and the most important keywords from the stack traces. Then we do the search with this newly constructed query. For example, the baseline noisy query returns the correct result at the 53rd position, whereas our query returns the correct result at the topmost position.
  18. First, we construct a semantic hyperspace using the Stack Overflow corpus. What is a hyperspace? If we have more than 3 dimensions, then we call that space a hyperspace. How do we do it? First, we take the Stack Overflow data dump, which contains software-specific texts. Our corpus contains about 2.1 million questions and answers. We do pre-processing and feed the contents to FastText. Now FastText generates a three-layer neural network model. This model essentially represents the whole vocabulary like this in a hyperspace. Now how does it help?
  19. Here we see that burger is close to sandwich. Why? They are eaten together. I do that all the time. Well, that is not the case. They are mentioned in similar contexts by people across the whole corpus. The model recognizes such occurrences and thus puts burger and sandwich close together. Similarly, dumpling and ramen are close to each other. Now, we propose this. This is the original query, and this is the reformulated query. Now, a good reformulated query will cluster together with the original query. A bad reformulated query will NOT be able to cluster with the original query. So, clustering tendency within the hyperspace is our weapon here. We used the Hopkins statistic and Polygon Area to calculate the clustering tendency.
  20. Now, for the experiments, we chose 8 subject systems from Apache and Eclipse. We collect about 3000 bug reports, and try to map them to the version control history at GitHub. Through such mapping we extract the ground truth for the bug reports. This is a standard process followed in the existing literature.
  21. Now let's expand and generalize the problem a bit. So far, we have discussed code search within a local codebase. The search could also be in a large-scale open source repository such as GitHub. Now, based on these contexts, there are different challenges in query reformulation. The local codebase is small, domain-specific and organized. On the contrary, GitHub is huge, cross-domain and very noisy. So, yes, they need different strategies for suggesting queries.
  22. Now, I am not going to discuss those studies in detail. But here is a glimpse. Developers generally look for relevant code on the web using natural language queries. Please note that we are not talking about simple web search, but rather about source code repositories such as GitHub. Now, GitHub provides this result. You see, it tries to match the query keywords with comments and identifiers. But we are dealing with source code, right? So, we need a source-code-friendly query for a better result. So, we identify relevant API classes for this natural language query through extensive data mining and data analytics. And once again, Stack Overflow is our friend in this grand challenge.
  23. Thanks for your time and attention. Now, I am ready to take your questions.
  24. For the API suggestion, we collect natural language queries from four tutorial sites such as KodeJava and others. We collect 300+ queries, and we also collect the ground truth API classes from them. Then we try to determine whether our approach can suggest appropriate API classes for those queries by mining crowd knowledge from Stack Overflow. For the query reformulation part, we collect 4K code examples from GitHub, and combine them with our ground truth code segments from the tutorial sites. Then we determine whether our reformulated query actually works or not.
  25. Now let me explain the metrics a bit since we will be using these a lot. Hit@K is the percentage of the queries for which at least one ground truth is found within the top K results. MAP combines standard precision with result positions. The details are more complex, which I can discuss later. MRR is the inverse of the rank of the first ground truth within the results. QE, which stands for query effectiveness, is just the opposite of MRR: it is the rank of the first ground truth within the results, so lower is better.
  29. That is, each of the three people (the customer, the past developer and JOE) has their own vocabulary to describe a certain problem or concept. In fact, the probability that any two people will describe the same problem with the same vocabulary is only 15%-20%. So, naturally, developer JOE finds it a great challenge to make a connection between the bug report and the buggy code. This costs development time, money and valuable effort.
  32. Now the question is, why is this so challenging? The answer is the vocabulary mismatch problem. In fact, this is a common problem for any type of document search. Here we see both guys are looking at the same object, but they are explaining it differently. That is, they are both correct from their own perspective, but wrong from the other guy's perspective. This actually happens with bug reports as well. The probability that both the customer and the developer will explain the same problem using the same terminology is only 15%. That is why selecting appropriate keywords from the bug report is very challenging.
  33. Let us see an example. This is a bug report, this is the title and this is the description. Now, developer JOE would use this bug report to localize the bug in the source code. Now he chose some ad hoc queries. Which one is the best, do you think? PAUSE! Well, let's see. This one returns the correct result at this position. That means the developer needs to check 1,300+ results before reaching the correct result. Then he tries this query ... oh ... this one is the best. So, selecting appropriate keywords from the bug report is not that simple.
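
The metrics explained in note 25 can be sketched in a few lines of Python. This is an illustrative sketch, not the evaluation code used in the studies. It assumes that QE is the rank of the first ground-truth result, which matches the "Lower QE is better" remark on the slides. The file names and the two sample rankings are hypothetical.

```python
def first_hit_rank(results, ground_truth):
    """Return the 1-based rank of the first ground-truth item, or None."""
    for rank, item in enumerate(results, start=1):
        if item in ground_truth:
            return rank
    return None


def hit_at_k(all_results, all_truths, k=10):
    """Fraction of queries with at least one ground-truth item in the top k."""
    hits = sum(
        1
        for results, truth in zip(all_results, all_truths)
        if first_hit_rank(results[:k], truth) is not None
    )
    return hits / len(all_results)


def mrr_at_k(all_results, all_truths, k=10):
    """Mean of 1/rank of the first ground-truth item within the top k."""
    total = 0.0
    for results, truth in zip(all_results, all_truths):
        rank = first_hit_rank(results[:k], truth)
        total += 1.0 / rank if rank else 0.0
    return total / len(all_results)


# Hypothetical ranked results for a baseline query and a reformulated query.
truth = {"C.java"}
baseline = ["A.java", "B.java", "C.java"]
reformulated = ["C.java", "A.java", "B.java"]

# QE (assumed reading): rank of the first correct result, lower is better.
print("QE baseline:", first_hit_rank(baseline, truth))
print("QE reformulated:", first_hit_rank(reformulated, truth))
print("Hit@1:", hit_at_k([baseline, reformulated], [truth, truth], k=1))
print("MRR@3:", mrr_at_k([baseline, reformulated], [truth, truth], k=3))
```

A reformulation counts as an improvement under this reading when its QE is lower than the baseline's, as in the 53 versus 1 example for the noisy bug report.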