Elasticsearch and Spark
ANIMESH PANDEY
PROJECT CONSILIENCE
Agenda
 Who am I?
 Text searching
 Full text based
 Term based
 Databases vs. Search engines
 Why not simple SQL?
 Why need Lucene?
 Elasticsearch
 Concepts/APIs
 Network/Discovery
 Split-brain Issue
 Solutions
 Data Structure
 Inverted Index
 SOLR – Dataverse’s Search
 Why not SOLR for Consilience?
 Elasticsearch – Consilience’s Search
 Language integration
 Python
 Java
 Scala
 SPARK
 Why Spark?
 Where Spark?
 When Spark?
 Language support
 Conclusion and Questions
Who am I?
 Animesh Pandey
 Computer Science grad student @
Northeastern University, Boston
 Intern for Project Consilience for
Summer 2015
 Job: integration of Elasticsearch and
Spark into the existing project
Text Searching
 Text – a form of data
 Text – available from various resources
 Internet, books, articles etc.
 We are concerned with digital text or converting the traditional text to digital
 Digital text – internet, news articles, blogs, research papers
 Traditional text – any text from a physical book, manuscript, typed papers,
newspapers etc.
 Traditional text conversion to digital text
 Automatic - Optical Character Recognition (OCR), e.g. Tesseract, whose development was sponsored by Google
 Manual - type to a system
Full text based vs. Term based
 Full text based search
 Most general kind of search
 Used every day with Google, Bing or Yahoo
 In the background it is much more than a simple character-by-character match
 A lot of pre-processing is involved in a full text search
 Term based search
 Generally consists of exact term matching
 You can think of it as a SQL query that tries to find documents containing an exact match of a specified word
Databases vs. Search Engines
Both have unique strengths but also overlapping capabilities
 Similarities:
 Both can be used as data stores
 Basic updates and modifications can be done using both
 Differences:
 Search Engines
 Used for both structured as well
as unstructured data
 The results are ordered as per
the relevance of the result to
the query
 Databases
 Used for structured data
 There is no relevance ranking between the query and the results
Why not simple SQL?
 MySQL provides some ways to perform a full text search along with term based searches, BUT…
 Before MySQL 5.6, full text indexes needed the MyISAM storage engine, which was MySQL's default storage engine until version 5.5.
 MyISAM is optimized for read operations with few write operations, or maybe none.
 But you cannot avoid write (update/modify) operations.
 MyISAM creates one index for one table.
 No. of tables = No. of indices => more tables, more complexity.
 Relational DBs have locks. They block further read/write operations while one operation is already being executed.
How does a search engine help?
 Efficient indexing of data
 You don’t need multiple indices like you needed in Databases
 Index is on all fields/combinations of fields
 Analyzing data
 Text search
 Tokenizing => splitting of text
 Stemming => converting words to their root forms
 Filtering => removal of certain words
 Relevance Scoring
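The three text-analysis steps listed above can be sketched in a few lines of Python. This is only a toy illustration: the stop-word list and the suffix rules are made up for the example, and a real search engine uses far more careful stemming.

```python
import re

# Hypothetical stop-word list and suffix rules, chosen only for illustration.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Tokenizing: split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    """Filtering: drop words that carry little meaning on their own."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem(token):
    """Stemming: strip one common suffix; a toy stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Tokenize, filter, then stem."""
    return [stem(t) for t in remove_stop_words(tokenize(text))]


print(analyze("Searching the indexed documents"))
# ['search', 'index', 'document']
```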
In order to solve the problems mentioned before there are several
Open Source search engines….
APACHE LUCENE
 Information Retrieval Software Library
 Free/Open Source
 Supported by the Apache Software Foundation
 Created by Doug Cutting
 Since 1999
To use it, two Java-based search servers built on it are available…
APACHE SOLR
 Built on Lucene
 Perfect for single server search
 Part of the Lucene project (Lucene comes with Solr)
 Large user and developer base
 This is Dataverse’s search engine. We will discuss later why using Elasticsearch for Dataverse would not make a big difference
ELASTICSEARCH
{
  "status" : 200,
  "name" : "Fafnir",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "1.4.2",
    "build_hash" : "927caff6f05403e936c20bf4529f144f0c89fd8c",
    "build_timestamp" : "2014-12-16T14:11:12Z",
    "build_snapshot" : false,
    "lucene_version" : "4.10.2"
  },
  "tagline" : "You Know, for Search"
}
 Free/Open source
 Built on top of Lucene
 Created by Shay Banon @kimchy
 Current stable version is 1.6.0
 Has wrappers in many languages
 RESTful Service
 JSON API over HTTP
 Chrome plugins – Sense (Marvel) and Postman
 Can be used from Java, Python and many other languages
 High availability and clustering are very easy to set up
 Long term persistence
What does Elasticsearch add to Lucene?
Elasticsearch is a “download and use” distro
├── bin                    (Executables)
│   ├── elasticsearch
│   ├── elasticsearch.in.sh
│   └── plugin
├── config                 (Node Configs)
│   ├── elasticsearch.yml
│   └── logging.yml
├── data                   (Data Storage)
│   └── cluster1
├── lib                    (Jar Distributions)
│   ├── elasticsearch-x.y.z.jar
│   ├── ...
│   └──
└── logs                   (Log files)
    ├── elasticsearch.log
    ├── elasticsearch_index_search_slowlog.log
    └── elasticsearch_index_indexing_slowlog.log
elasticsearch.yml – Config file of Elasticsearch
 Here we can set the basic configuration required to start an ES node. The following settings are the ones generally changed.
 cluster.name – the cluster which the node will join
 node.name – name of the node
 node.master – whether the node is eligible to become master
 node.data – whether this node will hold data
 path.data – path of the index data
 path.conf – path of the config folder (scripts or any file put in this folder)
 path.logs – path of the logs
curl -XPUT "http://localhost:9200/social_media/" -d'
{
  "settings": {
    "index": {
      "number_of_shards": 3,
      "number_of_replicas": 1
    }
  }
}'
 node.master (true) and path.conf (D:/social_media/config/) are node-level settings; they belong to the node configuration, not to the index settings sent with this request
Underlying Lucene Inverted Index
 This is a term-to-document mapping
 The inverted index maps each term to all documents in which it occurs
 Every document is paired with the term frequency of the term being considered
 Sum all term frequencies to get the corpus frequency of the term
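A minimal sketch of such a structure, assuming a hypothetical three-document corpus and plain whitespace tokenization. Lucene's on-disk format is far more elaborate; this only shows the term-to-document mapping and the two frequencies mentioned above.

```python
from collections import defaultdict

# Hypothetical three-document corpus, used only for illustration.
docs = {
    1: "spark reads text",
    2: "elasticsearch indexes text",
    3: "text text everywhere",
}


def build_inverted_index(documents):
    """Map each term to {doc_id: term frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for term in text.split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def corpus_frequency(index, term):
    """Sum the per-document term frequencies of a term."""
    return sum(index.get(term, {}).values())


index = build_inverted_index(docs)
print(index["text"])                    # {1: 1, 2: 1, 3: 2}
print(corpus_frequency(index, "text"))  # 4
```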
Shards and Replicas
 Primary Shard
 Created when the index is created
 Index has 1..N primary shards
 Persistent
 This is the actual data
 Replica Shard
 Index has 0..N replicas per primary shard
 Not persistent
 This is a copy of the data
 Promoted to primary shard if the node holding the primary fails
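How a document ends up on one particular primary shard can be sketched as follows. Elasticsearch documents the rule as shard = hash(routing) % number_of_primary_shards, where the routing value defaults to the document ID. The hash used here (zlib.crc32) is only a stand-in for Elasticsearch's internal hash function, so the shard numbers printed will not match a real cluster.

```python
import zlib


def shard_for(routing_value, number_of_primary_shards):
    """Pick a primary shard as hash(routing) % number_of_primary_shards.

    zlib.crc32 is only a stand-in: Elasticsearch uses its own hash function.
    """
    return zlib.crc32(routing_value.encode("utf-8")) % number_of_primary_shards


# Document IDs act as the routing value unless a custom one is given.
for doc_id in ("616272192012165183", "616272192012165120"):
    print(doc_id, "-> shard", shard_for(doc_id, 3))
```

Because the shard number depends on the number of primary shards, that number cannot be changed after the index is created without reindexing; the number of replicas can be changed at any time.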
Nodes discovery
 Node discovery in ES uses multicast by default
 Unicast is also possible
 Can be modified by changing elasticsearch.yml
 In multicast the master node sends requests to all nodes to check which are waiting for a connection
 To switch to unicast:
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["host1", "host2:port", "host3"]
Split-brain Issue
 Suppose we have a three-node cluster with 1 master and 2 slaves
 Suppose that, for some reason, the connection to NODE 2 fails
 NODE 2 will promote its replica shards to primary shards and elect itself master
 The cluster will be in an inconsistent state
 Indexing requests sent to NODE 2 won’t be reflected on NODE 1 and NODE 3
 This results in two different indices => different results
Solving the Split-brain issue
 Specify the minimum number of master-eligible nodes that must be visible before a master can be elected
 discovery.zen.minimum_master_nodes = (N/2 + 1), using integer division, where N is the number of master-eligible nodes in the cluster
 In the three-node cluster, the partition with one node cannot elect a master, so it stops serving and production comes to know about the issue
 discovery.zen.ping.timeout should be increased on a slow network so that nodes get extra time to ping each other
 Default value is 3 seconds
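The quorum arithmetic behind minimum_master_nodes can be checked with a short sketch. The function names are hypothetical helpers written for this example; they are not part of any Elasticsearch client.

```python
def minimum_master_nodes(master_eligible_nodes):
    """Quorum size: integer division of N by 2, plus 1."""
    return master_eligible_nodes // 2 + 1


def can_elect_master(partition_size, master_eligible_nodes):
    """A network partition may elect a master only if it holds a quorum."""
    return partition_size >= minimum_master_nodes(master_eligible_nodes)


# Three-node cluster split into a 2-node side and a 1-node side.
print(minimum_master_nodes(3))  # 2
print(can_elect_master(2, 3))   # True
print(can_elect_master(1, 3))   # False
```

With a quorum of 2, at most one side of any split can elect a master, which is why the isolated node in the example stops instead of forming a second cluster.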
Elasticsearch APIs
 Elasticsearch provides a number of APIs. We will cover the ones useful to us:
 INDEX API
 SETTING API
 MAPPING API
 TERMVECTOR/MTERMVECTOR API
 BULK API
 SEARCH API
Processing of Text using Analyzers (Settings API)
 Analyzers help in manipulating the text that is to be indexed.
 Tokenizers and token filters (stemmers, for example) are the most used building blocks of analyzers.
 Analyzers are usually given a name/id so that they can be reused later with any type of text.
 There are other analyzers as well that are based on term replacement, regular-expression patterns or punctuation characters.
 Custom analyzers can also be created in ES.
curl -XPUT "http://localhost:9200/social_media/" -d'
{
  "settings": {
    "index": {
      "number_of_shards": 3,
      "number_of_replicas": 1
    },
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "type_as_payload",
            "cust_stop"
          ]
        }
      },
      "filter": {
        "cust_stop": {
          "type": "stop",
          "stopwords_path": "stoplist.txt"
        }
      }
    }
  }
}'
Mapping of Documents to be indexed (Mappings API)
curl -XPUT "http://localhost:9200/social_media/tweet/_mapping" -d'
{
  "tweet": {
    "properties": {
      "_id": {
        "type": "string",
        "store": true,
        "index": "not_analyzed"
      },
      "text": {
        "type": "multi_field",
        "fields": {
          "text": {
            "include_in_all": false,
            "type": "string",
            "store": false,
            "index": "not_analyzed"
          },
          "_analyzed": {
            "type": "string",
            "store": true,
            "index": "analyzed",
            "term_vector": "with_positions_offsets_payloads",
            "analyzer": "my_english"
          }
        }
      }
    }
  }
}'
 Elasticsearch auto-maps fields but we
can also specify the types.
 Data types provided by ES:
 String
 Number
 Boolean
 Date-time
 Geo-point (coordinates)
 Attachment (requires plugin)
 Consilience uses this for indexing PDF
files
Creation of Index
 Specifying setting and mapping and sending a PUT request to
Elasticsearch initializes the index
 Now the task is to send documents to Elasticsearch
 We have to keep in mind the mappings of each field in the document
 Document Metadata fields
 _id : identifier of the document
 _index : index name
 _type : mapping type
 _source : enabled/disabled
 _timestamp
 _ttl
 _size : size of uncompressed _source
 _version
Indexing a document (Index API)
Indexing new document
curl -XPOST "http://localhost:9200/social_media/tweet/616272192012165183" -d '{
  "_source": {
    "text": "random text",
    "exact_text": "random text"
  }
}'
For ES 1.6.0+
curl -XPOST "http://localhost:9200/social_media/tweet/616272192012165183" -d '{
  "text": "random text",
  "exact_text": "random text"
}'
Document structure
{
  "_index": "social_media",
  "_type": "tweet",
  "_id": "616272192012165120",
  "_source": {
    "text": "@bshor Thanks for the info; this will help us. Are these the 2 datasets you were uploading? https://t.co/W1M4vrQUEI https://t.co/ITRycQnPKz",
    "exact_text": "@bshor Thanks for the info; this will help us. Are these the 2 datasets you were uploading? https://t.co/W1M4vrQUEI https://t.co/ITRycQnPKz"
  }
}
Retrieving term vectors (Termvector API)
 The termvector and mtermvector APIs are used for getting the term vectors
 We can change the DSL below according to our needs
curl -XGET "http://localhost:9200/social_media/tweet/616272192012165183/_termvector" -d'
{
"fields" : ["text"],
"offsets" : true,
"payloads" : true,
"positions" : true,
"term_statistics" : true,
"field_statistics" : true
}'
{
"_index": "social_media",
"_type": "tweet",
"_id": "616272192012165183",
"_version": 1,
"found": true,
"term_vectors": {
"text": {
"field_statistics": {
"sum_doc_freq": 65,
"doc_count": 6,
"sum_ttf": 66
},
"terms": {
"random": {
"doc_freq": 1,
"ttf": 1,
"term_freq": 1,
"tokens": [
{
"position": 0,
"start_offset": 0,
"end_offset": 6,
"payload": "d29yZA=="
}
]
},
"text": {
"doc_freq": 1,
"ttf": 1,
"term_freq": 1,
"tokens": [
{
"position": 1,
"start_offset": 7,
"end_offset": 11,
"payload": "d29yZA=="
}
]
}
}
}
}
}
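The payload values in this response are base64 encoded. A quick way to read them, shown here as a small sketch, is to decode the string. For the value above the result is the token type, which is what the type_as_payload filter configured in my_english stores.

```python
import base64

payload = "d29yZA=="  # payload value copied from the response above
print(base64.b64decode(payload).decode("utf-8"))  # word
```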
Processing independent documents
 This can be done by using the Analyze API
 The analyzer my_english was defined earlier with the Settings API
 The request below analyzes the document “Text to analyze” and returns the following tokens
curl -XGET "http://localhost:9200/social_media/_analyze?analyzer=my_english&text=Text%20to%20analyze"
{
"tokens": [
{
"token": "text",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 1
},
{
"token": "analyze",
"start_offset": 8,
"end_offset": 15,
"type": "word",
"position": 3
}
]
}
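The offsets and positions in this response can be reproduced with a small sketch of the my_english chain (whitespace tokenizer, lowercase, stop filter). It assumes that stoplist.txt contains the word "to", since the real file is not shown, and it leaves out the payload step.

```python
import re

# Assumed contents of stoplist.txt; the real file is not shown in the slides.
STOP_WORDS = {"to"}


def analyze(text):
    """Whitespace-tokenize, lowercase, and drop stop words, keeping offsets."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\S+", text), start=1):
        term = match.group().lower()
        if term in STOP_WORDS:
            continue  # the removed token still consumes a position
        tokens.append({
            "token": term,
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens


for token in analyze("Text to analyze"):
    print(token)
```

The removed stop word explains why "analyze" is reported at position 3 rather than 2: positions are assigned before filtering, so phrase queries can still tell that a word used to sit between the two tokens.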
Working with Shingles
 Shingles are a way to index groups of adjacent tokens, such as bigrams, alongside unigrams
 min_shingle_size and max_shingle_size of 2 produce bigrams
"shingle_filter" : {
  "type" : "shingle",
  "min_shingle_size" : 2,
  "max_shingle_size" : 2,
  "output_unigrams": true
}
curl -XGET "http://localhost:9200/social_media/_analyze?analyzer=my_english_shingle&text=Text%20to%20analyze"
{
"tokens": [
{
"token": "text",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 1
},
{
"token": "text _",
"start_offset": 0,
"end_offset": 8,
"type": "shingle",
"position": 1
},
{
"token": "_ analyze",
"start_offset": 8,
"end_offset": 15,
"type": "shingle",
"position": 2
},
{
"token": "analyze",
"start_offset": 8,
"end_offset": 15,
"type": "word",
"position": 3
}
]
}
 This filter can be used with the termvector API to get vectors containing both unigrams and bigrams
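Where the "_" tokens in the response come from can be sketched as follows: the position left empty by the removed stop word is filled with a placeholder before the bigrams are built. This is a simplified imitation of the shingle filter written for this example, not Lucene's implementation.

```python
FILLER = "_"  # placeholder for a position whose token was removed


def shingles(tokens, size=2):
    """Emit unigrams and `size`-word shingles from (position, term) pairs."""
    by_position = dict(tokens)
    first, last = min(by_position), max(by_position)
    # Fill gaps left by removed stop words so shingles stay position-aware.
    slots = [by_position.get(p, FILLER) for p in range(first, last + 1)]
    output = []
    for i, term in enumerate(slots):
        if term != FILLER:
            output.append(term)  # unigram (output_unigrams: true)
        window = slots[i:i + size]
        if len(window) == size and any(t != FILLER for t in window):
            output.append(" ".join(window))
    return output


# Positions 1 and 3 survive analysis of "Text to analyze"; "to" was removed.
print(shingles([(1, "text"), (3, "analyze")]))
# ['text', 'text _', '_ analyze', 'analyze']
```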
Searching in Index (Search API)
 Default search: this will search for “some text”, “some” and “text”
curl -XGET "http://localhost:9200/social_media/tweet/_search" -d'
{
  "query": {
    "match": {
      "text._analyzed": "some Texts"
    }
  },
  "explain": true
}'
 Exact phrase matching: this will search for “some Texts” as a phrase
curl -XGET "http://localhost:9200/social_media/tweet/_search" -d'
{
  "query": {
    "match_phrase": {
      "text": "some Texts"
    }
  },
  "explain": true
}'
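The difference between the two queries can be sketched in plain Python. The analyzer here is a toy (lowercase, whitespace split, strip a trailing "s") invented for the example; in Elasticsearch the analysis applied depends on the mapping of the field being searched.

```python
def analyze(text):
    """Toy analyzer: lowercase, split on whitespace, strip a trailing 's'."""
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t
            for t in text.lower().split()]


def match(query, document):
    """True if the document contains at least one analyzed query term."""
    doc_terms = set(analyze(document))
    return any(term in doc_terms for term in analyze(query))


def match_phrase(query, document):
    """True if the analyzed query terms occur consecutively, in order."""
    q, d = analyze(query), analyze(document)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


doc = "some random text"
print(match("some Texts", doc))                       # True
print(match_phrase("some Texts", doc))                # False
print(match_phrase("some Texts", "some text here"))   # True
```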
Recommended Design Patterns
 Keep the number of nodes odd
 Take precautions to avoid the Split-brain issue
 Regularly refresh indices
 Add refresh_interval to settings
 Manage heap size
 ES_HEAP_SIZE <= ½ of the system’s RAM but not more than 32GB
 export ES_HEAP_SIZE=10g
 ./bin/elasticsearch -Xmx10g -Xms10g
 Use Aliases
 Searches are made through an alias that points to the underlying index
 This prevents cluster downtime or delays that may occur while the index is being updated or modified
 Delete aliases when they become old and create new ones
 You can create time-based aliases as well
 Use Routing
 A way to know which shard contains what document
 Reduces the lookup time during searches
 When bulk indexing
 Timeout after every push
 Each push should be at most 2-3 MB in size
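The bulk-indexing advice can be sketched as a generator that groups documents into Bulk API request bodies of bounded size. The Bulk API body is newline-delimited JSON: one action line followed by one source line per document. The helper name, the sample documents and the tiny 200-byte limit in the usage loop are assumptions made for the example, and sending the bodies over HTTP is left out.

```python
import json

MAX_BATCH_BYTES = 2 * 1024 * 1024  # roughly the 2-3 MB ceiling suggested above


def bulk_batches(docs, index, doc_type, max_bytes=MAX_BATCH_BYTES):
    """Yield Bulk API request bodies that stay within max_bytes.

    A single document larger than max_bytes still gets its own batch.
    """
    batch, size = [], 0
    for doc_id, source in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        entry = json.dumps(action) + "\n" + json.dumps(source) + "\n"
        entry_size = len(entry.encode("utf-8"))
        if batch and size + entry_size > max_bytes:
            yield "".join(batch)
            batch, size = [], 0
        batch.append(entry)
        size += entry_size
    if batch:
        yield "".join(batch)


# Hypothetical documents; a tiny limit is used so several batches are produced.
docs = [(str(i), {"text": "random text %d" % i}) for i in range(5)]
for body in bulk_batches(docs, "social_media", "tweet", max_bytes=200):
    print(len(body.encode("utf-8")), "bytes")
```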
Why not SOLR?
 SOLR is a better search engine than Elasticsearch
 But we need term vectors and analysis more than search
 ES provides better APIs for analytics
 termvector with field and term statistics
 mtermvector
 search with explain enabled
 function_score queries (not covered earlier)
 If you need only a search engine, go for SOLR. If you need something more than that, Elasticsearch is the best choice.
Language Support
 We have
 JAVA wrappers : org.elasticsearch.*
 Python wrapper: py-elasticsearch
 Scala wrapper : elastic4s
 Domain Specific Language (DSL) : cURL/JSON as shown in every
example previously
Let's add some SPARK to ES…
 Apache Spark is an engine for large scale data processing
 It can run programs up to 100 times faster than Hadoop MapReduce when the data fits in memory
 Has language support for Python, Java, Scala and R
 For Project Consilience:
 Earlier I had thought of making Spark both the starting and the end point of the whole application
 i.e. read files using Spark, index them using Elasticsearch and apply clustering using Spark’s MLlib
 Flat file reading is very direct in Spark
 sc.textFile() => parallel reading of the file in chunks
 sc.wholeTextFiles() => loads each file completely as one record
Let's add some SPARK to ES…
 Earlier experiments were done in Scala
 Scala gave us the advantage of functional programming along with parallel processing
 Java 8 now also supports functional programming, so choosing between Scala and Java won’t make much difference
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD operations such as reduceByKey
import org.elasticsearch.spark._ // ES-Spark connector

val conf = new SparkConf()
  .setAppName("super_spark")
  .setMaster("local[2]")
  .set("spark.executor.memory", "1g")
  .set("spark.rdd.compress", "true")
  .set("spark.storage.memoryFraction", "1")
  .set("es.index.auto.create", "true")
  .set("es.nodes", "localhost") // assumed host; SparkConf values must be strings
  .set("es.port", "9200")
// other configurations can be added as well
val sc = new SparkContext(conf)

// parallel processing of a range. Same syntax in Java and Python
val data = sc.parallelize(1 to 10000).filter(_ < 100).collect() // filter in parallel, then collect
data.foreach(println)

val textFile = sc.textFile("/home/cloudera/Documents/pg2265.txt")
val counts = textFile
  .flatMap(line => line.split(" ")) // all tokens in an array
  .filter(_.nonEmpty) // remove all empty tokens
  .map(word => (word.replaceAll("\\p{P}", "") // remove punctuation
    .toLowerCase(), 1)) // convert to lower case
  .reduceByKey(_ + _) // add as per key values
val thing = counts.collect()

// placeholder from the slide: supply the documents to index here
sc.makeRDD(<put a Mapping here>).saveToEs("spark/docs")
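For readers who do not have Spark at hand, the word-count chain above can be mirrored roughly in plain Python. This is a single-process stand-in, not PySpark: it only shows what each step (flatMap, filter, map, reduceByKey) contributes, and the sample lines are made up for the example.

```python
import re
from collections import Counter

# Matches any character that is neither a word character nor whitespace;
# a rough stand-in for the \p{P} punctuation class used in the Scala code.
PUNCTUATION = re.compile(r"[^\w\s]")


def word_count(lines):
    """Plain-Python counterpart of flatMap / filter / map / reduceByKey."""
    counts = Counter()
    for line in lines:                                  # flatMap: line -> tokens
        for token in line.split(" "):
            word = PUNCTUATION.sub("", token).lower()   # map: normalize the token
            if word:                                    # filter: drop empty tokens
                counts[word] += 1                       # reduceByKey: sum per word
    return counts


print(word_count(["To be, or not to be:", "that is the question."]))
```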
Let's add some SPARK to ES…
 Tried the Spark-Hadoop-Elasticsearch connector but noticed some overhead and unnecessary computations
 The project does not currently receive large volumes of data, and not frequently either, so fast computation isn’t really required
 What we want is features for clustering. Those features can easily be provided by Elasticsearch
 Maybe in the future Spark will be added to the first phase of the project.
 As of now Spark will be used for clustering the documents. The MLlib library provides APIs for this
THANKS!
QUESTIONS??
REFERENCES
 Learning Elasticsearch – Anurag Patel (Red Hat)
 Introduction to Elasticsearch – Roy Russo
 Apache Spark and Elasticsearch – Holden Karau UMD 2014
 Streamlining Search Indexing using Elastic Search and Spark (Holden
Karau)
 Video Link : https://www.youtube.com/watch?v=jYicnlunDQ0

 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
Business Analytics using Microsoft Excel
Business Analytics using Microsoft ExcelBusiness Analytics using Microsoft Excel
Business Analytics using Microsoft Excelysmaelreyes
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一F sss
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxdolaknnilon
 
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...ttt fff
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhYasamin16
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 

Dernier (20)

Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Business Analytics using Microsoft Excel
Business Analytics using Microsoft ExcelBusiness Analytics using Microsoft Excel
Business Analytics using Microsoft Excel
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptx
 
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 

Elasticsearch and Spark

  • 2. Agenda
      - Who am I?
      - Text searching
      - Full text based
      - Term based
      - Databases vs. Search engines
      - Why not simple SQL?
      - Why need Lucene?
      - Elasticsearch
      - Concepts/APIs
      - Network/Discovery
      - Split-brain Issue
      - Solutions
      - Data Structure
      - Inverted Index
      - SOLR – Dataverse’s Search
      - Why not SOLR for Consilience?
      - Elasticsearch – Consilience’s Search
      - Language integration
      - Python
      - Java
      - Scala
      - SPARK
      - Why Spark?
      - Where Spark?
      - When Spark?
      - Language support
      - Conclusion and Questions
  • 3. Who am I?
      - Animesh Pandey
      - Computer Science grad student @ Northeastern University, Boston
      - Intern for Project Consilience for Summer 2015
      - Job: integration of Elasticsearch and Spark into the existing project
  • 4. Text Searching
      - Text – a form of data
      - Text – available from various sources
      - Internet, books, articles etc.
      - We are concerned with digital text, or with converting traditional text to digital
      - Digital text – internet, news articles, blogs, research papers
      - Traditional text – any text from a physical book, manuscript, typed papers, newspapers etc.
      - Converting traditional text to digital text
      - Automatic – Optical Character Recognition (OCR), e.g. Tesseract (originally from HP, later sponsored by Google)
      - Manual – typing it into a system
  • 5. Full text based vs. Term based
      - Full text based search
      - Most general kind of search
      - Used every day when using Google, Bing or Yahoo
      - In the background it is much more than a simple character-by-character match
      - A lot of pre-processing is involved in a full text search
      - Term based search
      - Generally consists of exact term matching
      - You can think of it as a SQL query where you try to find documents that contain an exact match of a specified word
  • 6. Databases vs. Search Engines
      Both have unique strengths but also have overlapping capabilities
      - Similarities:
      - Both can be used as data stores
      - Basic updates and modifications can be done using both
      - Differences:
      - Search Engines
      - Used for both structured and unstructured data
      - The results are ordered by the relevance of each result to the query
      - Databases
      - Used for structured data
      - There is no relevance ranking between the query and the results
  • 7. Why not simple SQL?
      - MySQL provides some ways to perform a full text search along with term based searches, BUT...
      - It needs the MyISAM storage engine, which used to be MySQL's default storage engine (InnoDB has supported FULLTEXT indexes only since MySQL 5.6).
      - MyISAM is optimized for read operations with few or no write operations.
      - But you cannot avoid write (update/modify) operations.
      - MyISAM creates one index per table.
      - No. of tables = No. of indices => more tables, more complexity.
      - Relational DBs use locks: a read/write operation has to wait while another operation is already being executed.
  • 8. How does a search engine help?
      - Efficient indexing of data
      - You don’t need multiple indices as you did with databases
      - The index covers all fields/combinations of fields
      - Analyzing data
      - Text search
      - Tokenizing => splitting of text
      - Stemming => converting words to their root forms
      - Filtering => removal of certain words
      - Relevance scoring
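The three analysis steps on this slide (tokenizing, filtering, stemming) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the stop list and the suffix-stripping stemmer are hypothetical toys, not what Lucene or Elasticsearch actually ship.

```python
import re

# Hypothetical stop list for illustration; real analyzers use larger, per-language lists.
STOPWORDS = {"a", "an", "the", "to", "of", "and", "is"}

def tokenize(text):
    """Tokenizing: split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def stem(token):
    """Stemming (toy version): strip a few common English suffixes.
    This is NOT the Porter stemmer, just a stand-in to show the idea."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def analyze(text):
    """Tokenize, filter out stop words, then stem the surviving tokens."""
    return [stem(t) for t in tokenize(text) if t not in STOPWORDS]

print(analyze("Searching the indexed documents"))
```

The output of `analyze` is what would be stored in the index, which is why a query for "search" can match a document containing "Searching".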
  • 9. To solve the problems mentioned above, there are several open source search engines...
  • 10. APACHE LUCENE
      - Information Retrieval Software Library
      - Free/Open Source
      - Supported by the Apache Software Foundation
      - Created by Doug Cutting
      - Since 1999
      In order to use it there are two Java libraries available...
  • 11. APACHE SOLR
      - Built on Lucene
      - Perfect for single server search
      - Part of the Lucene project (Lucene comes with Solr)
      - Large user and developer base
      - This is Dataverse’s search engine. Later we will talk about why using Elasticsearch there won’t make a big difference
  • 12. ELASTICSEARCH
      - Free/Open source
      - Built on top of Lucene
      - Created by Shay Banon @kimchy
      - Current stable version is 1.6.0
      - Has wrappers in many languages

        {
          "status" : 200,
          "name" : "Fafnir",
          "cluster_name" : "elasticsearch",
          "version" : {
            "number" : "1.4.2",
            "build_hash" : "927caff6f05403e936c20bf4529f144f0c89fd8c",
            "build_timestamp" : "2014-12-16T14:11:12Z",
            "build_snapshot" : false,
            "lucene_version" : "4.10.2"
          },
          "tagline" : "You Know, for Search"
        }
  • 13. What does Elasticsearch add to Lucene?
      - RESTful Service
      - JSON API over HTTP
      - Chrome Plugins – Marvel Sense and POSTman
      - Can be used from Java, Python and many other languages
      - High availability and clustering are very easy to set up
      - Long term persistence
  • 14. Elasticsearch is a “download and use” distro

        ├── bin                (Executables)
        │   ├── elasticsearch
        │   ├── elasticsearch.in.sh
        │   └── plugin
        ├── config             (Node Configs)
        │   ├── elasticsearch.yml
        │   └── logging.yml
        ├── data               (Data Storage)
        │   └── cluster1
        ├── lib                (Jar Distributions)
        │   ├── elasticsearch-x.y.z.jar
        │   └── ...
        └── logs               (Log files)
            ├── elasticsearch.log
            ├── elasticsearch_index_search_slowlog.log
            └── elasticsearch_index_indexing_slowlog.log
  • 15. elasticsearch.yml – Config file of Elasticsearch
      - Here we can initialize the basic configuration required to start an ES node. The following settings are the ones generally changed.
      - cluster.name – the cluster which the node will join
      - node.name – name of the node
      - node.master – whether the node is a master
      - node.data – whether this node will hold data
      - path.data – path of the index
      - path.conf – path of the config folder (scripts or any file put in this folder)
      - path.logs – path of the logs

        curl -XPUT "http://localhost:9200/social_media/" -d'
        {
          "settings": {
            "node": { "master": true },
            "path": { "conf": "D:/social_media/config/" },
            "index": {
              "number_of_shards": 3,
              "number_of_replicas": 1
            }
          }
        }'
  • 16. Underlying Lucene Inverted Index
      - This is a term-to-document mapping
      - The inverted index maps each term to all documents in which it occurs
      - Every document is paired with the frequency of the term in that document
      - Summing all term frequencies gives the corpus frequency of the term
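A minimal sketch of the structure described on this slide, in plain Python. It only illustrates the term-to-document mapping with per-document term frequencies; Lucene's on-disk format is far more elaborate, and the function names and sample documents here are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def corpus_frequency(index, term):
    """Sum the per-document term frequencies to get the corpus frequency."""
    return sum(index.get(term, {}).values())

# Two tiny illustrative documents keyed by document id.
docs = {1: "spark and elasticsearch", 2: "elasticsearch search search"}
index = build_inverted_index(docs)

print(dict(index["search"]))
print(corpus_frequency(index, "search"))
```

Looking up a term is then a single dictionary access, which is what makes term queries fast compared with scanning every document.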
  • 17. Shards and Replicas
      - Primary Shard
      - Created when indexing
      - Index has 1..N primary shards
      - Persistent
      - This is the actual data
      - Replica Shard
      - Index has 0..N replicas of each primary shard
      - Not persistent
      - This is a copy of the data
      - Promoted to primary shard if the node holding the primary fails
  • 18. Nodes discovery
      - Node discovery in ES uses multicast by default
      - Unicast is also possible
      - Can be modified by changing elasticsearch.yml
      - In multicast the master node will send requests to all nodes to check which are waiting for a connection

        discovery.zen.ping.multicast.enabled: false
        discovery.zen.ping.unicast.hosts: ["host1", "host2:port", "host3"]
  • 19. Split-brain Issue
      - Suppose we have a three-node cluster with 1 master and 2 slaves
      - Suppose that, for some reason, the connection to NODE 2 fails
      - NODE 2 will promote its replica shards to primary shards and will make itself a master
      - The cluster will be in an inconsistent state
      - Indexing requests sent to NODE 2 won’t be reflected on NODE 1 and NODE 3
      - This will result in two different indices => different results
  • 20. Solving the Split-brain issue
      - Specify the minimum number of master-eligible nodes that must be visible before a master can be elected
      - discovery.zen.minimum_master_nodes = (N/2 + 1), using integer division, where N is the number of master-eligible nodes in the cluster
      - In the three-node cluster, the side with only one node cannot elect a master and stops working, so production will come to know about the issue
      - discovery.zen.ping.timeout should be increased in a slow network so that nodes get extra time to ping each other
      - Default value is 3 seconds
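The quorum rule above is simple arithmetic, sketched here in Python. The function names are hypothetical helpers for illustration; only the formula N/2 + 1 (integer division) comes from the slide.

```python
def minimum_master_nodes(master_eligible):
    """Quorum size: N // 2 + 1, where N is the number of master-eligible nodes."""
    return master_eligible // 2 + 1

def can_elect_master(partition_size, master_eligible):
    """A network partition may elect a master only if it still holds a quorum."""
    return partition_size >= minimum_master_nodes(master_eligible)

# Three-node cluster split into a lone node and a pair of nodes.
print(minimum_master_nodes(3))
print(can_elect_master(1, 3), can_elect_master(2, 3))
```

With three nodes the quorum is two, so the isolated node cannot promote itself while the remaining pair carries on, which is exactly the behaviour that prevents two diverging indices.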
  • 21. Elasticsearch APIs
      - Elasticsearch provides a number of APIs. We will cover the ones useful to us:
      - INDEX API
      - SETTING API
      - MAPPING API
      - TERMVECTOR/MTERMVECTOR API
      - BULK API
      - SEARCH API
  • 22. Processing of Text using Analyzers (Settings API)
      - Analyzers help in manipulating the text that is to be indexed.
      - Tokenizers, stemmers and token filters are the most used analyzer components.
      - Analyzers are usually given a name/id so that they can be reused later with any type of text.
      - There are other analyzers as well, based on term replacement, regular-expression patterns and punctuation characters.
      - Custom analyzers can also be created in ES.

        curl -XPUT "http://localhost:9200/social_media/tweet/_settings" -d'
        {
          "settings": {
            "index": {
              "number_of_shards": 3,
              "number_of_replicas": 1
            },
            "analysis": {
              "analyzer": {
                "my_english": {
                  "type": "custom",
                  "tokenizer": "whitespace",
                  "filter": [ "lowercase", "type_as_payload", "cust_stop" ]
                }
              },
              "filter": {
                "cust_stop": {
                  "type": "stop",
                  "stopwords_path": "stoplist.txt"
                }
              }
            }
          }
        }'
  • 23. Mapping of Documents to be indexed (Mappings API)
      - Elasticsearch auto-maps fields but we can also specify the types.
      - Data types provided by ES:
      - String
      - Number
      - Boolean
      - Date-time
      - Geo-point (coordinates)
      - Attachment (requires plugin)
      - Consilience uses this for indexing PDF files

        curl -XPUT "http://localhost:9200/social_media/tweet/_mapping" -d '
        {
          "tweet": {
            "properties": {
              "_id": {
                "type": "string",
                "store": true,
                "index": "not_analyzed"
              },
              "text": {
                "type": "multi_field",
                "fields": {
                  "text": {
                    "include_in_all": false,
                    "type": "string",
                    "store": false,
                    "index": "not_analyzed"
                  },
                  "_analyzed": {
                    "type": "string",
                    "store": true,
                    "index": "analyzed",
                    "term_vector": "with_positions_offsets_payloads",
                    "analyzer": "my_english"
                  }
                }
              }
            }
          }
        }'
  • 24. Creation of Index
      - Specifying the settings and mappings and sending a PUT request to Elasticsearch initializes the index
      - Now the task is to send documents to Elasticsearch
      - We have to keep in mind the mapping of each field in the document
      - Document metadata fields
      - _id : identifier of the document
      - _index : index name
      - _type : mapping type
      - _source : enabled/disabled
      - _timestamp
      - _ttl
      - _size : size of uncompressed _source
      - _version
  • 25. Indexing a document (Index API)

      Indexing new document

        curl -XPOST "http://localhost:9200/social_media/tweet/616272192012165183" -d '
        {
          "_source": {
            "text": "random text",
            "exact_text": "random text"
          }
        }'

      For ES 1.6.0+

        curl -XPOST "http://localhost:9200/social_media/tweet/616272192012165183" -d '
        {
          "text": "random text",
          "exact_text": "random text"
        }'

      Document structure

        {
          '_index': 'social_media',
          '_type': 'tweet',
          '_id': '616272192012165120',
          '_source': {
            'text': '@bshor Thanks for the info; this will help us. Are these the 2 datasets you were uploading? https://t.co/W1M4vrQUEI https://t.co/ITRycQnPKz',
            'exact_text': '@bshor Thanks for the info; this will help us. Are these the 2 datasets you were uploading? https://t.co/W1M4vrQUEI https://t.co/ITRycQnPKz'
          }
        }
  • 26. Retrieving term vectors (Termvector API)
      - The termvector and mtermvector APIs are used for getting the term vectors
      - We can change the DSL below according to our needs

        curl -XGET "http://localhost:9200/social_media/tweet/616272192012165183/_termvector" -d'
        {
          "fields" : ["text"],
          "offsets" : true,
          "payloads" : true,
          "positions" : true,
          "term_statistics" : true,
          "field_statistics" : true
        }'

      Response:

        {
          "_index": "social_media",
          "_type": "tweet",
          "_id": "616272192012165183",
          "_version": 1,
          "found": true,
          "term_vectors": {
            "text": {
              "field_statistics": { "sum_doc_freq": 65, "doc_count": 6, "sum_ttf": 66 },
              "terms": {
                "random": {
                  "doc_freq": 1, "ttf": 1, "term_freq": 1,
                  "tokens": [
                    { "position": 0, "start_offset": 0, "end_offset": 6, "payload": "d29yZA==" }
                  ]
                },
                "text": {
                  "doc_freq": 1, "ttf": 1, "term_freq": 1,
                  "tokens": [
                    { "position": 1, "start_offset": 7, "end_offset": 11, "payload": "d29yZA==" }
                  ]
                }
              }
            }
          }
        }
  • 27. Processing independent documents
      - This can be done by using the Analyze API
      - The analyzer my_english was defined on Slide 22
      - For the document “Text to analyze”, the DSL below gives the following result

        curl -XGET "http://localhost:9200/social_media/_analyze?analyzer=my_english&text=Text%20to%20analyze"

        {
          "tokens": [
            { "token": "text", "start_offset": 0, "end_offset": 4, "type": "word", "position": 1 },
            { "token": "analyze", "start_offset": 8, "end_offset": 15, "type": "word", "position": 3 }
          ]
        }
  • 28. Working with Shingles
      - Shingles are a way to index groups of adjacent tokens, such as unigrams, bigrams etc.
      - This filter can be used with the termvector API to get vectors containing both unigrams and bigrams

      Filter definition (min and max size of 2 give bigrams):

        "shingle_filter" : {
          "type" : "shingle",
          "min_shingle_size" : 2,
          "max_shingle_size" : 2,
          "output_unigrams": true
        }

        curl -XGET "http://localhost:9200/social_media/_analyze?analyzer=my_english_shingle&text=Text%20to%20analyze"

        {
          "tokens": [
            { "token": "text", "start_offset": 0, "end_offset": 4, "type": "word", "position": 1 },
            { "token": "text _", "start_offset": 0, "end_offset": 8, "type": "shingle", "position": 1 },
            { "token": "_ analyze", "start_offset": 8, "end_offset": 15, "type": "shingle", "position": 2 },
            { "token": "analyze", "start_offset": 8, "end_offset": 15, "type": "word", "position": 3 }
          ]
        }
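To make the idea of shingles concrete, here is a small Python sketch that builds word n-grams from a token list. It is an illustration only: the parameter names mirror the filter settings on the slide, but the function itself is hypothetical, it does not insert the "_" filler that Elasticsearch emits for removed stop words, and it lists unigrams first rather than interleaving tokens by position.

```python
def shingles(tokens, min_size=2, max_size=2, output_unigrams=True):
    """Return word n-grams (shingles) of the given sizes, optionally with unigrams."""
    out = list(tokens) if output_unigrams else []
    for size in range(min_size, max_size + 1):
        for i in range(len(tokens) - size + 1):
            # Join `size` adjacent tokens into one shingle.
            out.append(" ".join(tokens[i:i + size]))
    return out

print(shingles(["text", "to", "analyze"]))
```

Indexing bigrams alongside unigrams lets term vectors capture short phrases, at the cost of a larger index.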
  • 29. Searching in Index (Search API)
      - Default search: will search for “some text”, “some” and “text”

        curl -XGET "http://localhost:9200/social_media/tweet/_search" -d'
        {
          "query": {
            "match": {
              "text._analyzed": "some Texts"
            }
          },
          "explain": true
        }'

      - Exact phrase matching: will search for “some Texts” as a phrase

        curl -XGET "http://localhost:9200/social_media/tweet/_search" -d'
        {
          "query": {
            "match_phrase": {
              "text": "some Texts"
            }
          },
          "explain": true
        }'
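The difference between the two queries can be sketched in plain Python. This is a simplified model, not how Elasticsearch evaluates queries: it ignores stemming, analyzers and relevance scoring, and the helper names are hypothetical. It assumes the default behaviour of `match`, where a document matches if any query term is present, while `match_phrase` needs the terms adjacent and in order.

```python
def tokens(text):
    """Very rough analysis: lowercase and split on whitespace."""
    return text.lower().split()

def match(query, doc):
    """True if any query term appears in the document (OR semantics)."""
    doc_terms = set(tokens(doc))
    return any(term in doc_terms for term in tokens(query))

def match_phrase(query, doc):
    """True only if the query terms appear consecutively and in order."""
    q, d = tokens(query), tokens(doc)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))

doc = "texts about some topic"
print(match("some texts", doc), match_phrase("some texts", doc))
```

Here the document contains both words but not as a phrase, so only the `match` style check succeeds.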
  • 30. Recommended Design Patterns
      - Keep the number of nodes odd
      - Take precautions to avoid the split-brain issue
      - Regularly refresh indices
      - Add refresh_interval to settings
      - Manage heap size
      - ES_HEAP_SIZE <= ½ of the system’s RAM, but not more than 32GB
      - export ES_HEAP_SIZE=10g
      - ./bin/elasticsearch -Xmx10g -Xms10g
      - Use aliases
      - Searches go through an alias that points to an index created from the original index
      - This prevents cluster downtime or delays that may occur while the index is being updated or modified
      - Delete aliases when they become old and create new ones
      - You can create time-based aliases as well
      - Use routing
      - A way to know which shard contains which document
      - Reduces the lookup time during searches
      - When bulk indexing
      - Pause (timeout) after every push
      - Each push should be at most 2-3MB in size
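The bulk-indexing advice above (pushes of at most 2-3MB) can be sketched as a generator that groups documents into Bulk API request bodies under a byte limit. This is a sketch under assumptions: the function name and the 2MB default are illustrative, the bodies follow the newline-delimited action/source format of the Bulk API, and actually sending each body (and pausing between pushes) is left to the caller.

```python
import json

def bulk_batches(docs, index, doc_type, max_bytes=2 * 1024 * 1024):
    """Yield Bulk API bodies (newline-delimited JSON), each at most max_bytes.

    docs is an iterable of (doc_id, source_dict) pairs. A single entry that
    is larger than max_bytes is still yielded, alone in its own batch.
    """
    batch, size = [], 0
    for doc_id, source in docs:
        action = json.dumps({"index": {"_index": index, "_type": doc_type, "_id": doc_id}})
        entry = action + "\n" + json.dumps(source) + "\n"
        entry_size = len(entry.encode("utf-8"))
        if batch and size + entry_size > max_bytes:
            yield "".join(batch)
            batch, size = [], 0
        batch.append(entry)
        size += entry_size
    if batch:
        yield "".join(batch)

# Illustrative data with a deliberately small limit so the batching is visible.
docs = [(str(i), {"text": "x" * 100}) for i in range(10)]
batches = list(bulk_batches(docs, "social_media", "tweet", max_bytes=500))
print(len(batches))
```

Each yielded string would be the body of one POST to the `_bulk` endpoint; keeping bodies small avoids overloading a node with a single huge request.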
  • 31. Why not SOLR?
      - SOLR is a better search engine than Elasticsearch
      - But we need term vectors and analysis more than search
      - ES provides better APIs for analytics
      - termvector with field and term statistics
      - mtermvector
      - search with explain enabled
      - function_score queries (not covered earlier)
      - If you need only a search engine, go for SOLR. If you need something more than that, Elasticsearch is the best choice.
  • 32. Language Support
      - We have
      - Java wrappers: org.elasticsearch.*
      - Python wrapper: py-elasticsearch
      - Scala wrapper: elastic4s
      - Domain Specific Language (DSL): cURL/JSON, as shown in every previous example
  • 33. Let's add some SPARK to ES...
      - Apache Spark is an engine for large scale data processing
      - It is claimed to run programs up to 100 times faster than Hadoop MapReduce when the data fits in memory
      - Has language support for Python, Java, Scala and R
      - For Project Consilience:
      - Earlier I had thought of making Spark the start and end point of the whole application
      - i.e. read files using Spark, index them using Elasticsearch and apply clustering using Spark’s MLlib
      - Flat file reading is very direct in Spark
      - sc.textFile() => parallel reading of the file in chunks
      - sc.wholeTextFiles() => loads each complete file into memory
  • 34. Let's add some SPARK to ES...
      - Earlier experiments were done in Scala
      - Scala gave us the advantage of functional programming along with parallel processing
      - Now Java 8 also provides functional programming, so choosing between Scala and Java won’t make much difference

        import org.apache.spark.{SparkConf, SparkContext}
        import org.apache.spark.SparkContext._   // pair-RDD operations such as reduceByKey (needed before Spark 1.3)
        import org.elasticsearch.spark._         // ES-Spark connector

        val conf = new SparkConf()
          .setAppName("super_spark")
          .setMaster("local[2]")
          .set("spark.executor.memory", "1g")
          .set("spark.rdd.compress", "true")
          .set("spark.storage.memoryFraction", "1")
          .set("es.index.auto.create", "true")
          .set("es.nodes", "localhost")   // assumed host; the slide only gave port 9200
          .set("es.port", "9200")
        // other configurations can be added as well
        val sc = new SparkContext(conf)

        // parallel reading for arrays. Same syntax in Java and Python
        val data = sc.parallelize(1 to 10000).collect().filter(_ < 100)
        data.foreach(println)

        val textFile = sc.textFile("/home/cloudera/Documents/pg2265.txt")
        val counts = textFile
          .flatMap(line => line.split(" "))             // all tokens in an array
          .filter(_.nonEmpty)                           // remove all empty tokens
          .map(word => (word.replaceAll("\\p{P}", "")   // remove punctuation
            .toLowerCase(), 1))                         // convert to lower case
          .reduceByKey(_ + _)                           // add as per key values
        val thing = counts.collect()

        sc.makeRDD(<put a Mapping here>).saveToEs("spark/docs")
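For readers without a Spark installation, the word-count pipeline on this slide (split lines, drop empty tokens, strip punctuation, lowercase, sum per word) can be mirrored in plain Python. This is a single-machine sketch, not Spark code: the function name is hypothetical, and the regular expression only approximates the `\p{P}` punctuation class used in the Scala version.

```python
import re
from collections import Counter

def word_count(lines):
    """Single-process analogue of flatMap -> filter -> map -> reduceByKey."""
    counts = Counter()
    for line in lines:                                      # one "record" per line
        for token in line.split(" "):                       # flatMap: line -> tokens
            word = re.sub(r"[^\w\s]", "", token).lower()    # strip punctuation, lowercase
            if word:                                        # filter: drop empty tokens
                counts[word] += 1                           # reduceByKey: sum per word
    return dict(counts)

print(word_count(["To be, or not to be."]))
```

In Spark the same steps run in parallel over partitions of the file, with `reduceByKey` merging the partial counts from each partition.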
  • 35. Let's add some SPARK to ES...
      - Tried the Spark-Hadoop-Elasticsearch connector but noticed some overhead and unnecessary computations
      - The project currently won’t receive large volumes of data, nor receive data frequently, so fast computation isn’t really required
      - What we want are features for clustering. Those features can easily be provided by Elasticsearch
      - Maybe in the future Spark will be added to the first phase of the project
      - As of now Spark will be used for clustering the documents. The MLlib library provides APIs for this