2. How can we do searching?
We can search with the help of algorithms and data structures such
as Binary Search, Linear Search, BSTs, Tries, etc. But applying them
to strings is not very useful here, because we would have to match
exact words across a big dataset, and the length of each string is
also a factor.
3. Dataset:
• First we need some data to match our queries against.
We use the Stack Overflow data from Kaggle.
4. Data Visualization & Observations:
• We find that 30% of our data is duplicated, i.e. there are many rows
which are repeated two or three times. Some rows were repeated as many
as 6 times.
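One way to measure this kind of duplication is sketched below. The helper name `duplicate_stats` and the toy titles are made up for illustration, and counting every occurrence after the first as a duplicate is an assumption; the 30% figure above may have been computed differently.

```python
from collections import Counter


def duplicate_stats(rows):
    """Return (share of rows that repeat an earlier row, highest repeat count)."""
    counts = Counter(rows)
    total = sum(counts.values())
    if total == 0:
        return 0.0, 0
    repeats = total - len(counts)  # every occurrence after the first one
    return repeats / total, max(counts.values())


# Toy titles, made up for illustration only.
titles = [
    "static variable in c",
    "synchronization in java",
    "static variable in c",
    "view controller error in ios",
    "static variable in c",
]

share, most_repeated = duplicate_stats(titles)
print(f"{share:.0%} of rows are repeats; the most repeated row appears {most_repeated} times")
```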
5. Searching without Machine Learning
Now we want to search for questions similar to our query question:
• We pre-process our question titles to remove HTML, extra spaces and other noise.
We use the same pre-process function on our query too.
• Next we vectorise the data. We use the TF-IDF Vectorizer, as the BoW vectorizer did not
give better results.
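The pre-processing step could look roughly like the sketch below. The function name `preprocess`, the two regular expressions and the lowercasing are assumptions; the exact clean-up rules used in the project are not listed here.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")  # anything that looks like an HTML tag
SPACE_RE = re.compile(r"\s+")    # runs of whitespace


def preprocess(text):
    """Strip HTML tags and entities, lowercase, and collapse extra spaces."""
    # Remove tags before unescaping, so that escaped text such as
    # "vector&lt;int&gt;" is not mistaken for a tag and deleted.
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)   # "&amp;" becomes "&"
    text = text.lower()          # assumption: matching is case-insensitive
    return SPACE_RE.sub(" ", text).strip()


# The same function is applied to dataset titles and to the user's query.
title = preprocess("<p>Static   Variable in <b>C</b> &amp; C++</p>")
query = preprocess("  Static variable in C ")
print(title)
print(query)
```

Using one function for both sides means a title and a query that differ only in markup or spacing end up as the same tokens before vectorising.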
6. Result without Machine Learning:
• Now we run the search and find the questions most similar
to our query question.
For query = “Synchronization”, let’s see what our function returns:
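A search function of this kind might look like the sketch below. It assumes scikit-learn is installed and that similarity is measured with cosine similarity between TF-IDF vectors, which is a common choice but is not confirmed by the text. The name `search`, the `top_k` parameter, the custom `token_pattern` and the toy titles are all illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy titles, made up for illustration; the real data is the Stack Overflow set.
titles = [
    "thread synchronization in java",
    "static variable in c",
    "view controller error in ios",
    "synchronization of threads with a mutex",
]

# Assumption: the default token pattern drops one-letter tokens such as "c",
# so a pattern that keeps them is used here.
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
title_matrix = vectorizer.fit_transform(titles)


def search(query, top_k=3):
    """Return (row index, score) pairs for the titles most similar to query."""
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, title_matrix).ravel()
    ranked = scores.argsort()[::-1][:top_k]
    # Rows with a score of 0 share no term with the query, so they are dropped.
    return [(int(i), float(scores[i])) for i in ranked if scores[i] > 0]


for index, score in search("Synchronization"):
    print(index, round(score, 3), titles[index])
```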
7. Searching Using Machine Learning
When we use Stack Overflow we mostly add the programming
language to the query, like: Static Variable in C, Synchronization in
Java, View Controller error in iOS.
We have tags for our dataset. Can we use those tags to optimise
our queries?
The answer is YES. But how? We can train a model which
predicts the tag for a given query, and then add that tag to the
query.
8. Technique Used:
• We first simplify our tags by using only the first tag in each row of our
dataset.
• Then we convert our tags into numeric form.
• Then we perform TF-IDF vectorisation and train our model with LR and
SVM.
We observe that our LR model performed better with hyperparameter
tuning.
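The steps above can be sketched with scikit-learn as follows. The toy rows, the grid of `C` values, `cv=2` and the use of `LinearSVC` as the SVM are assumptions chosen to keep the example small; they are not the project's actual settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import LinearSVC

# Toy (title, tags) rows, made up for illustration only.
rows = [
    ("static variable in a class", "java oop"),
    ("thread synchronization with locks", "java multithreading"),
    ("interface vs abstract class", "java"),
    ("hashmap iteration order", "java collections"),
    ("view controller does not load", "ios swift"),
    ("table view cell reuse", "ios uitableview"),
    ("navigation controller push segue", "ios"),
    ("app crashes on launch in simulator", "ios xcode"),
    ("pointer to pointer segfault", "c pointers"),
    ("malloc and free usage", "c memory"),
    ("static variable inside a function", "c"),
    ("printf format specifier for long", "c printf"),
]
titles = [title for title, _ in rows]
first_tags = [tags.split()[0] for _, tags in rows]  # keep only the first tag

encoder = LabelEncoder()                # tags -> integers 0..n_classes-1
y = encoder.fit_transform(first_tags)

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(titles)

# Hyperparameter tuning over the regularisation strength C.
# cv=2 only because the toy set is tiny.
grid = {"C": [0.1, 1.0, 10.0]}
lr_search = GridSearchCV(LogisticRegression(max_iter=1000), grid, cv=2).fit(X, y)
svm_search = GridSearchCV(LinearSVC(), grid, cv=2).fit(X, y)
print("LR  best CV accuracy:", lr_search.best_score_)
print("SVM best CV accuracy:", svm_search.best_score_)


def predict_tag(query):
    """Predict a tag name for a query with the tuned LR model."""
    label = lr_search.predict(vectorizer.transform([query]))[0]
    return encoder.inverse_transform([label])[0]


print(predict_tag("static variable"))
```

On data this small the scores mean nothing; the sketch only shows how the pieces fit together.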
9. Models Used
• So we trained our model again on all the data with Logistic
Regression, SVM and Naïve Bayes.
• Then we add the predicted tag to our query by using this
function.
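A function of this kind could be as small as the sketch below. The names `optimise_query` and `keyword_predictor` are hypothetical, and `keyword_predictor` is only a stand-in for the trained model so that the example runs on its own. Skipping the tag when the query already contains it is a design choice of this sketch, not something stated above.

```python
def optimise_query(query, predict_tag):
    """Append the predicted tag to the query unless it is already there."""
    tag = predict_tag(query)
    if tag.lower() in query.lower().split():
        return query
    return f"{query} {tag}"


def keyword_predictor(query):
    """Hypothetical stand-in for the trained tag classifier."""
    return "java" if "synchronization" in query.lower() else "c"


print(optimise_query("Synchronization", keyword_predictor))
print(optimise_query("Synchronization in Java", keyword_predictor))
```

Passing the predictor in as an argument keeps the function independent of which model (LR, SVM or Naïve Bayes) ends up being used.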
12. Output:
• We first optimise the query, then perform TF-IDF on the query, as
the query vector must have the same shape (number of features) as
the vectors of our dataset. Then we get the indices and publish
them. Let’s try it again for query = “Synchronization”.
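The point about shape can be seen in a short sketch, assuming scikit-learn. The query has to go through `transform` on the vectorizer that was already fitted on the dataset; fitting a fresh vectorizer on the query alone gives a vector with a different number of columns. The toy titles and the already optimised query string are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy titles, made up for illustration only.
titles = [
    "thread synchronization in java",
    "static variable in c",
    "view controller error in ios",
]
vectorizer = TfidfVectorizer()
title_matrix = vectorizer.fit_transform(titles)

query = "synchronization java"  # the query after the predicted tag was added

query_vec = vectorizer.transform([query])             # same vocabulary as the dataset
refit_vec = TfidfVectorizer().fit_transform([query])  # wrong: builds a new vocabulary

print(title_matrix.shape, query_vec.shape, refit_vec.shape)

# TF-IDF rows are L2-normalised by default, so the dot product is the cosine similarity.
scores = (title_matrix @ query_vec.T).toarray().ravel()
top_indices = scores.argsort()[::-1][:2]  # indices of the best matches, best first
print(top_indices)
```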
14. Future Aspect:
• We can use w2v, TF-IDF weighted w2v or other techniques to
vectorise. I was limited by my computing power, so I couldn’t
do so.
• Using the full dataset.
• Making a better UI for the search engine.