Language Search
ElasticSearch Boston Meetup - 3/27
       Bryan Warner - Traackr
About me
● Bryan Warner - Developer @Traackr
  ○ bwarner@traackr.com

● I've worked with ElasticSearch since early 2012 ...
  before that I had worked with Lucene & Solr

● Primary background is in Java back-end development

● Shifted focus to Scala development over the past year
About Traackr
● Influencer search engine

● We track content daily & in real-time for our database of
  influential people

● We leverage ElasticSearch parent/child (top-children)
  queries to search content (i.e. the children) to surface
  the influencers who've authored it (i.e. the parents)

● Some of our back-end stack includes: ElasticSearch,
  MongoDb, Java/Spring, Scala/Akka, etc.
Overview
● Indexing / Querying strategies to support language-targeted
  searches within ES

● ES Analyzers / TokenFilters for language analysis

● Custom Analyzers / TokenFilters for ES

● Look at some OS projects that assist in language
  detection & analysis
Use Case
● We have a database of articles written in many
  languages

● We want our users to be able to search articles written
  in a particular language

● We want that search to handle the nuances of that
  particular language
Reference Schema
{
    "settings" : {
      "index": {
        "number_of_shards" : 6, "number_of_replicas" : 1
      },
      "analysis":{
        "analyzer": {}, "tokenizer": {}, "filter":{}
      }
    },
    "mappings": {
      "article": {
        "text" : {"type" : "string", "analyzer":"standard", "store":true},
        "author:" {"type" : "string", "analyzer":"simple", "store": true},
        "date": {"type" : "date", "format" : "yyyy-MM-dd'T'HH:mm:ssZ", "store":true}
      }
    }
}
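
As a quick illustration, indexing one article against this schema with the Java client could look like the sketch below (the index name "your_index" mirrors the later query slides; the client variable and field values are assumed):

...
// Hedged sketch: index an article against the reference schema (ES 0.90-era Java API)
XContentBuilder doc = XContentFactory.jsonBuilder()
    .startObject()
      .field("text", "ElasticSearch Boston meetup recap ...")
      .field("author", "bryan warner")
      .field("date", "2013-03-27T19:00:00-0400")   // matches yyyy-MM-dd'T'HH:mm:ssZ
    .endObject();

IndexResponse response = client.prepareIndex("your_index", "article")
      .setSource(doc)
      .execute().actionGet();
...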
Indexing Strategies



      Separate indices per language
                  - OR -
       Same index for all languages
Indexing Strategies
Separate Indices per language
PROS
■ Clean separation
■ Truer IDF values
  ○ IDF = log(numDocs/(docFreq+1)) + 1

CONS
■ Increased Overhead
■ Parent/Child queries -> parent document duplication
   ○ Same problem for Solr Joins
■ Maintain schema per index
Indexing Strategies
Same index for all languages
PROS
■ One index to maintain (and one schema)
■ Parent/Child queries are fine

CONS
■ Schema complexity grows
■ IDF values might be skewed
Indexing Strategies
Same index for all languages ... how?
1. Create different "mapping" types per language
   a. At indexing time, we set the right mapping based on
      the article's language

2. Create a separate field per language for each analyzed text field
   a. At indexing time, we populate the correct text field
      based on the article's language
"mappings": {
  "article_en": {
    "text" : {"type" : "string", "analyzer":"english", "store":true},
    "author:" {"type" : "string", "analyzer":"simple", "store": true}
    "date": {"type" : "date", "format" : "yyyy-MM-dd'T'HH:mm:ssZ", "store":true}
  },
  "article_fr": {
    "text" : {"type" : "string", "analyzer":"french", "store":true},
    "author:" {"type" : "string", "analyzer":"simple", "store": true}
    "date": {"type" : "date", "format" : "yyyy-MM-dd'T'HH:mm:ssZ", "store":true}
  },
  "article_de": {
    "text" : {"type" : "string", "analyzer":"german", "store":true},
    "author:" {"type" : "string", "analyzer":"simple", "store": true}
    "date": {"type" : "date", "format" : "yyyy-MM-dd'T'HH:mm:ssZ", "store":true}
  }
}
"mappings": {
  "article": {
    "text_en" : {"type" : "string", "analyzer":"english", "store":true},
    "text_fr" : {"type" : "string", "analyzer":"french", "store":true},
    "text_de" : {"type" : "string", "analyzer":"german", "store":true},
    "author:" {"type" : "string", "analyzer":"simple", "store": true}
    "date": {"type" : "date", "format" : "yyyy-MM-dd'T'HH:mm:ssZ", "store":true}
  }
}
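
At indexing time, both single-index strategies come down to routing on the article's language: pick the mapping type (option 1) or pick the text field (option 2). A hedged Java sketch follows; detectLanguage and the Article accessors are hypothetical helpers, not from the deck:

...
String lang = detectLanguage(article.getText());   // hypothetical helper -> "en", "fr", "de"

// Option 1: route to the language-specific mapping type
client.prepareIndex("your_index", "article_" + lang)
      .setSource(XContentFactory.jsonBuilder()
          .startObject()
            .field("text", article.getText())
            .field("author", article.getAuthor())
          .endObject())
      .execute().actionGet();

// Option 2: route to the language-specific field on the single "article" type
client.prepareIndex("your_index", "article")
      .setSource(XContentFactory.jsonBuilder()
          .startObject()
            .field("text_" + lang, article.getText())
            .field("author", article.getAuthor())
          .endObject())
      .execute().actionGet();
...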
Querying Strategies
How do we execute a language-targeted search?

... it all depends on our indexing strategy.
Querying Strategies
(1) Separate Indices per language
...
String targetIndex = getIndexForLanguage(languageParam);
SearchRequestBuilder request = client.prepareSearch(targetIndex)
       .setTypes("article");

QueryStringQueryBuilder query = QueryBuilders.queryString(
      "boston elasticsearch");
query.field("text");
query.analyzer(english|french|german); // pick one

request.setQuery(query);
SearchResponse searchResponse = request.execute().actionGet();
...
Querying Strategies
(2a) Same index for all languages - Diff. mappings
...
String targetMapping = getMappingForLanguage(languageParam);
SearchRequestBuilder request = client.prepareSearch("your_index")
       .setTypes(targetMapping);

QueryStringQueryBuilder query = QueryBuilders.queryString(
      "boston elasticsearch");
query.field("text");
query.analyzer(english|french|german); // pick one

request.setQuery(query);
SearchResponse searchResponse = request.execute().actionGet();
...
Querying Strategies
(2b) Same index for all languages - Diff. fields
...
SearchRequestBuilder request = client.prepareSearch("your_index")
     .setTypes("article");

QueryStringQueryBuilder query = QueryBuilders.queryString(
      "boston elasticsearch");
query.field(text_en|text_fr|text_de); // pick one
query.analyzer(english|french|german); // pick one

request.setQuery(query);
SearchResponse searchResponse = request.execute().actionGet();
...
Querying Strategies
● Will these strategies support a multi-language search?
  ○ E.g. Search by French and German
  ○ E.g. Search against all languages

● Yes! *

● In the same SearchRequest:
   ○ We can search against multiple indices
   ○ We can search against multiple "mapping" types
   ○ We can search against multiple fields

* Need to give thought to which query analyzer to use
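
For example, a hedged sketch of a two-language search under the field-per-language strategy (the field and analyzer choices here are illustrative):

...
SearchRequestBuilder request = client.prepareSearch("your_index")
      .setTypes("article");

QueryStringQueryBuilder query = QueryBuilders.queryString(
      "boston elasticsearch");
query.field("text_fr");
query.field("text_de");
query.analyzer("standard");   // a neutral pick; an explicit analyzer applies to every field

request.setQuery(query);
SearchResponse searchResponse = request.execute().actionGet();
...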
Language Analysis
● What do ElasticSearch and/or Lucene offer us for
  analyzing various languages?

● Is there a one-size-fits-all solution?
   ○ e.g. StandardAnalyzer

● Or do we need custom analyzers for each language?
Language Analysis
StandardAnalyzer - The Good
● For many languages (French, Spanish), it will get you
  95% of the way there

● Each language analyzer provides its own flavor to the
  StandardAnalyzer

● FrenchAnalyzer
  ○ Adds an ElisionFilter (l'avion -> avion)
  ○ Adds French StopWords filter
  ○ FrenchLightStemFilter
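
To see this in action, here is a small Lucene 3.6-style sketch that runs the FrenchAnalyzer by hand; the expected output noted in the comments is my assumption, not taken from the deck:

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.fr.FrenchAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class FrenchAnalysisDemo {
  public static void main(String[] args) throws Exception {
    Analyzer analyzer = new FrenchAnalyzer(Version.LUCENE_36);
    // Run the analyzer chain (elision, lowercase, stop words, light stemming) over raw text
    TokenStream ts = analyzer.tokenStream("text", new StringReader("l'avion"));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(term.toString());   // expect "avion" (the ElisionFilter strips the l')
    }
    ts.end();
    ts.close();
  }
}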
Language Analysis
StandardAnalyzer - The Bad
● For some languages, it will get you 2/3 of the way there

● German makes heavy use of compound words
     ■ das Vaterland => The fatherland
     ■ Rechtsanwaltskanzleien => Law Firms

● For best search results, these compound words should
  produce index terms for their individual parts

● GermanAnalyzer lacks a Word Compound Token Filter
Language Analysis
StandardAnalyzer - The Ugly
● For other languages (e.g. Asian languages), it will not
  get you far

● Using a Standard Tokenizer to extract tokens from
  Chinese text will not produce accurate terms
  ○ Some 3rd-party Chinese analyzers will extract
     bigrams from Chinese text and index those as if they
     were words

● Need to do your research
Language Analysis
You should also know about...
● ASCII Folding Token Filter
  ○ über => uber

● ICU Analysis Plugin
   ○ http://www.elasticsearch.org/guide/reference/index-modules/analysis/icu-plugin.html
   ○ Allows for unicode normalization, collation and folding
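
For reference, the built-in asciifolding token filter can be dropped into a custom analyzer in the same style as the deck's other settings snippets (the analyzer name here is made up):

"settings" : {
  "analysis":{
    "analyzer":{
      "folded_text":{
        "type": "custom",
        "tokenizer": "standard",
        "filter": ["standard", "lowercase", "asciifolding"]
      }
    }
  }
}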
Custom Analyzer / Token Filter
● Let's create a custom analyzer definition for German
  text (e.g. one without stemming)

● How do we go about doing this?
   ○ One way is to leverage ElasticSearch's flexible
     schema definitions
For reference: Lucene 3.6 - org.apache.lucene.analysis.de.GermanAnalyzer
Custom Analyzer / Token Filter
Create a custom German analyzer in our schema:
"settings" : {
  ....
  "analysis":{
    "analyzer":{
       "custom_text_german":{
          "type": "custom",
           "tokenizer": "standard",
           "filter": ["standard", "lowercase"], stop words, german normalization?
       }
    }
    ....
  }
}
Custom Analyzer / Token Filter
1.   Declare a schema filter for German stop words
2.   We'll also need to create a custom TokenFilter factory to wrap Lucene's
     org.apache.lucene.analysis.de.GermanNormalizationFilter
     a.   It does not come as a pre-defined ES TokenFilter
     b.   German text needs certain characters normalized, e.g. 'ae' and 'oe'
          are replaced by 'a' and 'o', respectively.

3.   Declare a schema filter for the custom GermanNormalizationFilter
package org.elasticsearch.index.analysis;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.de.GermanNormalizationFilter;
import org.elasticsearch.common.inject.Inject;
import org.elasticsearch.common.inject.assistedinject.Assisted;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.index.Index;
import org.elasticsearch.index.settings.IndexSettings;

public class GermanNormalizationFilterFactory extends AbstractTokenFilterFactory {
  @Inject
  public GermanNormalizationFilterFactory(Index index, @IndexSettings Settings indexSettings,
           @Assisted String name, @Assisted Settings settings) {
     super(index, indexSettings, name, settings);
  }
  @Override
  public TokenStream create(TokenStream tokenStream) {
     return new GermanNormalizationFilter(tokenStream);
  }
}
Custom Analyzer / Token Filter
Define new token filters in our schema:
"settings" : {
  "analysis":{
     ....
     "filter":{
       "german_normalization":{
          "type":"org.elasticsearch.index.analysis.GermanNormalizationFilterFactory"
       },
       "german_stop":{
          "type":"stop",
          "stopwords":["_german_"],
          "enable_position_increments":"true"
       }
     }
....
Custom Analyzer / Token Filter
Create a custom German analyzer:
"settings" : {
  ....
  "analysis":{
    "analyzer":{
       "custom_text_german":{
          "type":"custom",
           "tokenizer": "standard",
           "filter": ["german_normalization", "standard", "lowercase", "german_stop"],
       }
    }
    ....
  }
}
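
Querying with this analyzer then mirrors the earlier query slides; a hedged sketch (text_de assumes the field-per-language mapping shown earlier):

...
SearchRequestBuilder request = client.prepareSearch("your_index")
      .setTypes("article");

QueryStringQueryBuilder query = QueryBuilders.queryString(
      "boston elasticsearch");
query.field("text_de");
query.analyzer("custom_text_german");   // apply our custom analyzer at query time

request.setQuery(query);
SearchResponse searchResponse = request.execute().actionGet();
...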
OS Projects
Language Detection
●   https://code.google.com/p/language-detection/
     ○ Written in Java
     ○ Provides language profiles with unigram, bigram, and trigram
         character frequencies
     ○ Detector provides accuracy % for each language detected

PROS
 ■ Very fast (~4k pieces of text per second)
 ■ Very reliable for text greater than 30-40 characters

CONS
 ■ Unreliable & inconsistent for small text samples (<30 characters), e.g.
   short tweets
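
A hedged sketch of driving the library from Java (package and method names as I recall them from the project; the profile directory path is illustrative):

import java.util.ArrayList;
import com.cybozu.labs.langdetect.Detector;
import com.cybozu.labs.langdetect.DetectorFactory;
import com.cybozu.labs.langdetect.Language;

public class LangDetectDemo {
  public static void main(String[] args) throws Exception {
    DetectorFactory.loadProfile("/path/to/language-profiles");   // load profiles once per JVM
    Detector detector = DetectorFactory.create();
    detector.append("Wir beobachten Inhalte täglich und in Echtzeit.");
    String lang = detector.detect();                              // most probable language, e.g. "de"
    ArrayList<Language> candidates = detector.getProbabilities(); // all candidates with probabilities
    System.out.println(lang + " " + candidates);
  }
}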
OS Projects
German Word Decompounder
●   https://github.com/jprante/elasticsearch-analysis-decompound

●   Lucene offers two compound word token filters: a dictionary-based &
    a hyphenation-based variant
     ○ Not bundled with Lucene due to licensing issues
     ○ Require loading a word list in memory before they are run

●   The decompounder uses prebuilt Compact Patricia Tries for efficient word
    segmentation provided by the ASV toolbox
     ○ ASV Toolbox project - http://wortschatz.uni-leipzig.de/~cbiemann/software/toolbox/index.htm
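
If the plugin is installed, wiring it into the analysis settings should look roughly like the deck's other schema snippets; I'm assuming the token filter type it registers is named "decompound" (per the project's README), so treat this as a sketch:

"settings" : {
  "analysis":{
    "filter":{
      "decomp":{
        "type": "decompound"
      }
    },
    "analyzer":{
      "german_decompound_text":{
        "type": "custom",
        "tokenizer": "standard",
        "filter": ["standard", "lowercase", "decomp", "german_stop"]
      }
    }
  }
}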
