Search Engine using Natural Language Processing

2. How can we search? We could use classic algorithms and data structures such as linear search, binary search, binary search trees (BSTs), or tries. Applying them directly to free text is not very useful, though: they match exact keys, so every word of a query would have to match exactly across a large dataset, and the length of the strings is also a factor.

3. Dataset: • First we need some data to match our queries against. We use the Stack Overflow data from Kaggle.

4. Data Visualization & Observations: • We found that 30% of our data is duplicated, i.e. many rows are repeated two or three times. Some rows were repeated as many as 6 times.
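The duplicate check described above can be sketched with the standard library alone. This is a minimal illustration, not the code behind the slide: the sample titles are hypothetical, and "fraction duplicated" is assumed here to mean the share of rows that repeat an earlier row, which may differ from how the 30% figure was computed.

```python
from collections import Counter

def duplicate_stats(rows):
    """Return (fraction of rows that repeat an earlier row, highest repeat count)."""
    counts = Counter(rows)
    # Every occurrence after the first of each distinct row is redundant.
    redundant = sum(c - 1 for c in counts.values())
    return redundant / len(rows), max(counts.values())

def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving the original order."""
    return list(dict.fromkeys(rows))

# Hypothetical sample titles, not taken from the actual dataset.
titles = [
    "What is a static variable in C?",
    "How does synchronization work in Java?",
    "What is a static variable in C?",
    "What is a static variable in C?",
    "View controller error in iOS",
]

fraction, worst = duplicate_stats(titles)
unique_titles = drop_duplicates(titles)
print(f"{fraction:.0%} redundant rows, most repeated row appears {worst} times")
```

Dropping the repeats before vectorizing keeps the same question from filling several slots in the search results.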

5. Searching without Machine Learning
Now we want to search for questions similar to our query question:
• We pre-process the question titles to remove HTML, extra spaces, and other noise. We use the same pre-processing function on the query.
• Next we vectorize the data. We use a TF-IDF vectorizer because the BoW vectorizer did not give better results.
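The pre-process, vectorize, and match steps above can be sketched as follows. This is a dependency-free illustration under stated assumptions, not the implementation from the slides: the function names (`preprocess`, `fit_tfidf`, `vectorise`, `search`) are hypothetical, the smoothed IDF formula is one common variant, cosine similarity is assumed as the matching score, and the sample titles are made up.

```python
import html
import math
import re
from collections import Counter

def preprocess(text):
    """Strip HTML tags and entities, lowercase, and collapse noise to single spaces."""
    text = re.sub(r"<[^>]+>", " ", text)               # drop HTML tags
    text = html.unescape(text)                          # decode entities such as &amp;
    text = re.sub(r"[^a-z0-9+#]+", " ", text.lower())   # keep tokens like c++ and c#
    return text.strip()

def vectorise(tokens, idf):
    """Build an L2-normalised TF-IDF vector (as a dict), ignoring unseen terms."""
    tf = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def fit_tfidf(docs):
    """Learn IDF weights from docs and return (idf, document vectors)."""
    tokenised = [preprocess(d).split() for d in docs]
    n = len(tokenised)
    df = Counter(term for tokens in tokenised for term in set(tokens))
    # Smoothed IDF: an assumed variant, chosen so no weight is zero.
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return idf, [vectorise(tokens, idf) for tokens in tokenised]

def search(query, docs, idf, doc_vecs, top_k=3):
    """Rank docs by cosine similarity to the query; drop zero-score matches."""
    q = vectorise(preprocess(query).split(), idf)
    scores = [sum(w * d.get(t, 0.0) for t, w in q.items()) for d in doc_vecs]
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [(docs[i], scores[i]) for i in ranked[:top_k] if scores[i] > 0]

# Hypothetical titles, not taken from the actual dataset.
titles = [
    "<p>Synchronization in Java threads</p>",
    "Static variable in C",
    "View controller error in iOS",
]
idf, vecs = fit_tfidf(titles)
results = search("Synchronization", titles, idf, vecs)
```

The query goes through the same `preprocess` and the same learned IDF weights as the titles, so both live in the same feature space. Because the vectors are normalised, the dot product in `search` is the cosine similarity.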

6. Result without Machine Learning: • For query = “Synchronization”, let’s see what our search function returns:

7. Searching Using Machine Learning When we use Stack Overflow, we usually include the programming language or platform in the query, e.g. “Static Variable in C”, “Synchronization in Java”, “View Controller error in iOS”. Our dataset has tags. Can we use those tags to optimise our queries? Yes: we can train a model that predicts the tag for a given query and then add that tag to the query.

8. Technique Used:
• We first simplify the tags by keeping only the first tag in each row of the dataset.
• Then we convert the tags to numeric form.
• Then we perform TF-IDF vectorization and train Logistic Regression (LR) and SVM models. We observe that the LR model performed better with hyperparameter tuning.
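The first two steps above (keeping the first tag and converting tags to numbers) can be sketched as below. It assumes each row stores its tags as one space-separated string, which may not match the actual dataset layout. `first_tag` and `encode_labels` are hypothetical helper names.

```python
def first_tag(tag_field):
    """Return the first tag of a space-separated tag string (assumed format)."""
    parts = tag_field.split()
    return parts[0] if parts else ""

def encode_labels(labels):
    """Map each distinct label to an integer id; return (ids, sorted class list)."""
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in labels], classes

# Hypothetical tag strings, not taken from the actual dataset.
raw_tags = ["java multithreading", "c static", "ios uiviewcontroller swift", "java"]
simplified = [first_tag(t) for t in raw_tags]
encoded, classes = encode_labels(simplified)
# A predicted id maps back to its tag with classes[predicted_id].
```

Keeping one tag per row turns the task into single-label classification, which is what LR and SVM classifiers expect by default.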

9. Models Used
• We then trained our models again on all the data, using Logistic Regression, SVM, and Naïve Bayes.
• Then we add the predicted tag to the query using this function.
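The function that adds the predicted tag is not reproduced in this transcript. A minimal sketch of what such a step might look like is given below. `add_predicted_tag` is a hypothetical name, and `keyword_predictor` is a toy stand-in for the trained classifier so that the example runs on its own.

```python
def add_predicted_tag(query, predict_tag):
    """Append the predicted tag to the query unless the query already contains it."""
    tag = predict_tag(query)
    if tag and tag.lower() not in query.lower().split():
        return f"{query} {tag}"
    return query

def keyword_predictor(query):
    """Toy stand-in for a trained model: maps a few keywords to tags."""
    hints = {"synchronization": "java", "static": "c", "controller": "ios"}
    for word in query.lower().split():
        if word in hints:
            return hints[word]
    return ""  # no prediction

optimised = add_predicted_tag("Synchronization", keyword_predictor)
unchanged = add_predicted_tag("Synchronization in Java", keyword_predictor)
```

In a real pipeline, `predict_tag` would vectorize the query with the fitted TF-IDF vectorizer, call the classifier, and map the predicted id back to its tag name.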

10. Precision and Recall:

11. Models Result:
Model                 Precision   Recall
Logistic Regression   0.52        0.40
SVM                   0.78        0.65
Naïve Bayes           0.79        0.71
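For reference, precision and recall for a multi-class tag predictor can be computed as in the sketch below. The slides do not say how the per-tag scores were averaged, so macro-averaging is an assumption here, and the labels are made-up examples rather than results from the models above.

```python
def macro_precision_recall(y_true, y_pred):
    """Macro-averaged precision and recall over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)  # correct hits
        fp = sum(t != label and p == label for t, p in pairs)  # wrongly predicted
        fn = sum(t == label and p != label for t, p in pairs)  # missed
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    return sum(precisions) / len(labels), sum(recalls) / len(labels)

# Hypothetical true and predicted tags.
y_true = ["java", "java", "c", "ios"]
y_pred = ["java", "c", "c", "ios"]
precision, recall = macro_precision_recall(y_true, y_pred)
```

Precision is the share of predictions of a tag that were correct. Recall is the share of questions with that tag that the model found.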

12. Output: • We first optimize the query, then apply the TF-IDF vectorizer to it, since the query vector must have the same number of features as the dataset vectors. Then we get the indices of the matching questions and display them. Let’s try it again for query = “Synchronization”.

13. Search Engine UI:

14. Future Aspect:
• We could use w2v (word2vec), TF-IDF-weighted w2v, or other techniques to vectorize; I was limited by my computing power and so could not do so.
• Using the full dataset.
• Making a better UI for the search engine.

15. Thank You
