Elasticsearch and Spark

ANIMESH PANDEY
PROJECT CONSILIENCE

Agenda
 Who am I?
 Text searching
 Full text based
 Term based
 Databases vs. Search engines
 Why not simple SQL?
 Why need Lucene?
 Elasticsearch
 Concepts/APIs
 Network/Discovery
 Split-brain Issue
 Solutions
 Data Structure
 Inverted Index
 SOLR – Dataverse’s Search
 Why not SOLR for Consilience?
 Elasticsearch – Consilience’s Search
 Language integration
 Python
 Java
 Scala
 SPARK
 Why Spark?
 Where Spark?
 When Spark?
 Language support
 Conclusion and Questions

Who am I?
 Animesh Pandey
 Computer Science grad student @
Northeastern University, Boston
 Intern for Project Consilience for
Summer 2015
 Job: integration of Elasticsearch and
Spark into the existing project

Text Searching
 Text – a form of data
 Text – available from various resources
 Internet, books, articles etc.
 We are concerned with digital text or converting the traditional text to digital
 Digital text – internet, news articles, blogs, research papers
 Traditional text – any text from a physical book, manuscript, typed papers,
newspapers etc.
 Traditional text conversion to digital text
 Automatic - Optical Character Recognition (OCR), e.g. Tesseract (development sponsored by Google)
 Manual - type to a system

Full text based vs. Term based
 Full text based search
 Most general kind of search
 Used every day when using Google, Bing or Yahoo
 In the background it is much more than a simple character by character match
 A lot of pre-processing is involved in a full text search
 Term based search
 Generally consists of exact term matching
 You can think of it as a SQL query where you try to find documents that contain the exact match of a specified word

Databases vs. Search Engines
Both have unique strengths but also have overlapping capabilities
 Similarities:
 Both can be used as data stores
 Basic updates and modifications can be done using both
 Differences:
 Search Engines
 Used for both structured as well
as unstructured data
 The results are ordered as per
the relevance of the result to
the query
 Databases
 Used for structured data
 There is no relevance ranking between the query and results

Why not simple SQL?
 MySQL provides some ways to perform a full text search along with term based searches BUT …..
 Historically needed the MyISAM storage engine (the default engine before MySQL 5.5; InnoDB gained full-text indexes only in MySQL 5.6).
 MyISAM is optimized for read operations with few or no write operations.
 But you cannot avoid write (update/modify) operations.
 MyISAM creates one index for one table.
 No. of tables = No. of indices => more tables, more complexity.
 Relational DBs have locks. A read/write operation may have to wait while another operation is being executed.

How does a search engine help?
 Efficient indexing of data
 You don’t need multiple indices like you needed in Databases
 Index is on all fields/combinations of fields
 Analyzing data
 Text search
 Tokenizing => splitting of text
 Stemming => converting words to their root forms
 Filtering => removal of certain words
 Relevance Scoring
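
The analysis steps listed above (tokenizing, stemming, filtering) can be sketched in a few lines of Python. This is only an illustration: the stop list and the suffix-stripping rule are assumptions made for the example, and real analyzers (such as Lucene's) are far more careful.

```python
import re

# Hypothetical stop list, used only for this illustration
STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "are"}

def tokenize(text):
    """Tokenizing: split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def stem(token):
    """Stemming: naive suffix stripping (a stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def analyze(text):
    """Filtering: drop stop words, then stem what is left."""
    return [stem(t) for t in tokenize(text) if t not in STOPWORDS]

print(analyze("The engines are indexing documents"))
# prints ['engin', 'index', 'document']
```

The output terms are what would be stored in the index, which is why a query for "indexed" can match a document containing "indexing" once both go through the same analysis.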

In order to solve the problems mentioned before there are several
Open Source search engines….

APACHE LUCENE
 Information Retrieval Software Library
 Free/Open Source
 Supported by the Apache Software Foundation
 Created by Doug Cutting
 Since 1999
To use it, there are two Java-based search servers built on top of it…..

APACHE SOLR
 Built on Lucene
 Perfect for single server search
 Part of the Lucene project (Lucene comes with Solr)
 Large user and developer base
 This is Dataverse’s search engine. Later we will discuss why using Elasticsearch there won’t make a big difference

{
"status" : 200,
"name" : "Fafnir",
"cluster_name" : "elasticsearch",
"version" : {
"number" : "1.4.2",
"build_hash" : "927caff6f05403e936c20bf4529f144f0c89fd8c",
"build_timestamp" : "2014-12-16T14:11:12Z",
"build_snapshot" : false,
"lucene_version" : "4.10.2"
},
"tagline" : "You Know, for Search"
}
ELASTICSEARCH
 Free/Open source
 Built on top of Lucene
 Created by Shay Banon @kimchy
 Current stable version is 1.6.0
 Has wrappers in many languages

What does Elasticsearch add to Lucene?
 RESTful Service
 JSON API over HTTP
 Chrome plugins – Sense (Marvel) and Postman
 Can be used from Java, Python and many other languages
 High availability and clustering are very easy to set up
 Long term persistence

Elasticsearch is a “download and use” distro
├── bin (Executables)
│ ├── elasticsearch
│ ├── elasticsearch.in.sh
│ └── plugin
├── config (Node Configs)
│ ├── elasticsearch.yml
│ └── logging.yml
├── data (Data Storage)
│ └── cluster1
├── lib (Jar Distributions)
│ ├── elasticsearch-x.y.z.jar
│ ├── ...
│ └──
└── logs (Log files)
  ├── elasticsearch.log
  ├── elasticsearch_index_search_slowlog.log
  └── elasticsearch_index_indexing_slowlog.log

elasticsearch.yml – Config file of Elasticsearch
 Here we can set the basic configuration required to start an ES node. The following settings are the ones most commonly changed.
 cluster.name – the cluster which the node will join
 node.name – name of the node
 node.master – whether the node is eligible to be a master
 node.data – whether this node will hold data
 path.data – path where the index data is stored
 path.conf – path of the config folder (scripts or any file put in this folder)
 path.logs – path of the logs
curl -XPUT "http://localhost:9200/social_media/" -d'
{
"settings": {
"node": {
"master": true
},
"path": {
"conf": "D:/social_media/config/"
},
"index": {
"number_of_shards": 3,
"number_of_replicas": 1
}
}
}'

Underlying Lucene Inverted Index
 This is a term to document mapping
 The inverted index maps each term to all documents in which it occurs
 Every document is paired with the term frequency of the term being considered
 Sum all term frequencies to get the corpus frequency of the term
Shards and Replicas
 Primary Shard
 Created when indexing
 Index has 1..N primary shards
 Persistent
 This is the actual data
 Replica Shard
 Index has 0..N replica shards per primary
 Number of replicas can be changed at any time
 This is a copy of the data
 Promoted to primary shard if the node holding the primary fails

Nodes discovery
 Node discovery in ES uses multicast by default
 Unicast is also possible
 Can be modified by changing elasticsearch.yml
 With multicast, a node sends a ping to the multicast group and the nodes that are listening respond
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["host1", "host2:port", "host3"]

Split-brain Issue
 Suppose we have a three-node cluster with 1 master and 2 slaves
 Suppose that, for some reason, the connection to NODE 2 fails
 NODE 2 will promote its replica shards to primary shards and will elect itself master
 The cluster will be in an inconsistent state
 Indexing requests sent to NODE 2 won’t be reflected on NODE 1 and NODE 3
 This will result in two different indices => different results

Solving the Split-brain issue
 Specify the minimum number of master-eligible nodes that must be visible to elect a master
 discovery.zen.minimum_master_nodes = (N/2 + 1), using integer division, where N is the number of master-eligible nodes in the cluster
 In the three-node cluster, the partition with only one node cannot elect a master and will fail, so operations will find out about the issue
 discovery.zen.ping.timeout should be increased in a slow network so that nodes get extra time to ping each other
 Default value is 3 seconds
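
The quorum rule above is simple arithmetic. The sketch below (function names are made up for this example) shows why the isolated node in the three-node scenario cannot elect itself master once the setting is in place:

```python
def minimum_master_nodes(master_eligible_nodes):
    """Quorum size: N/2 + 1 using integer division."""
    return master_eligible_nodes // 2 + 1

def can_elect_master(partition_size, master_eligible_nodes):
    """A partition may elect a master only if it can see a quorum of nodes."""
    return partition_size >= minimum_master_nodes(master_eligible_nodes)

print(minimum_master_nodes(3))   # prints 2
print(can_elect_master(1, 3))    # prints False (the isolated NODE 2)
print(can_elect_master(2, 3))    # prints True  (NODE 1 and NODE 3)
```

With an even number of nodes a network split into two equal halves leaves neither side with a quorum, which is one reason an odd node count is usually preferred.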

Elasticsearch APIs
 Elasticsearch provides a number of APIs. We will be covering the ones useful to us:
 INDEX API
 SETTING API
 MAPPING API
 TERMVECTOR/MTERMVECTOR API
 BULK API
 SEARCH API

Processing of Text using Analyzers (Settings API)
 Analyzers help in manipulating the
text that is to be indexed.
 Tokenizers, stemmers and token filters are the most used building blocks of analyzers.
 Analyzers are usually given a name/id
so that they can be used in future with
any type of text.
 There are other analyzers as well that
are based on term-replacement,
regular-expression pattern,
punctuation characters.
 Custom analyzers can also be
created in ES.
curl -XPUT "http://localhost:9200/social_media" -d'
{
  "settings": {
    "index": {
      "number_of_shards": 3,
      "number_of_replicas": 1
    },
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "type_as_payload",
            "cust_stop"
          ]
        }
      },
      "filter": {
        "cust_stop": {
          "type": "stop",
          "stopwords_path": "stoplist.txt"
        }
      }
    }
  }
}'

Mapping of Documents to be indexed (Mappings API)
curl -XPUT "http://localhost:9200/social_media/tweet/_mapping" -d'
{
  "tweet": {
    "properties": {
      "_id": {
        "type": "string",
        "store": true,
        "index": "not_analyzed"
      },
      "text": {
        "type": "multi_field",
        "fields": {
          "text": {
            "include_in_all": false,
            "type": "string",
            "store": false,
            "index": "not_analyzed"
          },
          "_analyzed": {
            "type": "string",
            "store": true,
            "index": "analyzed",
            "term_vector": "with_positions_offsets_payloads",
            "analyzer": "my_english"
          }
        }
      }
    }
  }
}'
 Elasticsearch auto-maps fields but we
can also specify the types.
 Data types provided by ES:
 String
 Number
 Boolean
 Date-time
 Geo-point (coordinates)
 Attachment (requires plugin)
 Consilience uses this for indexing PDF
files

Creation of Index
 Specifying the settings and mappings and sending a PUT request to Elasticsearch initializes the index
 Now the task is to send documents to Elasticsearch
 We have to keep in mind the mappings of each field in the document
 Document Metadata fields
 _id : identifier of the document
 _index : index name
 _type : mapping type
 _source : enabled/disabled
 _timestamp
 _ttl
 _size : size of uncompressed _source
 _version

Indexing a document (Index API)
Indexing new document
curl -XPOST "http://localhost:9200/social_media/tweet/616272192012165183" -d '{
  "_source": {
    "text": "random text",
    "exact_text": "random text"
  }
}'
For ES 1.6.0+
curl -XPOST "http://localhost:9200/social_media/tweet/616272192012165183" -d '{
  "text": "random text",
  "exact_text": "random text"
}'
Document structure
{
  '_index': 'social_media',
  '_type': 'tweet',
  '_id': '616272192012165120',
  '_source': {
    'text': '@bshor Thanks for the info; this will help us. Are these the 2 datasets you were uploading? https://t.co/W1M4vrQUEI https://t.co/ITRycQnPKz',
    'exact_text': '@bshor Thanks for the info; this will help us. Are these the 2 datasets you were uploading? https://t.co/W1M4vrQUEI https://t.co/ITRycQnPKz'
  }
}

Retrieving term vectors (Termvector API)
 The termvector and mtermvector APIs are used for getting the term vectors
 We can change the DSL below according to our needs
curl -XGET "http://localhost:9200/social_media/tweet/616272192012165183/_termvector" -d'
{
  "fields" : ["text"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}'
{
"_index": "social_media",
"_type": "tweet",
"_id": "616272192012165183",
"_version": 1,
"found": true,
"term_vectors": {
"text": {
"field_statistics": {
"sum_doc_freq": 65,
"doc_count": 6,
"sum_ttf": 66
},
"terms": {
"random": {
"doc_freq": 1,
"ttf": 1,
"term_freq": 1,
"tokens": [
{
"position": 0,
"start_offset": 0,
"end_offset": 6,
"payload": "d29yZA=="
}
]
},
"text": {
"doc_freq": 1,
"ttf": 1,
"term_freq": 1,
"tokens": [
{
"position": 1,
"start_offset": 7,
"end_offset": 11,
"payload": "d29yZA=="
}
]
}
}
}
}
}
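
As a rough illustration of how such a response could be turned into features, the sketch below computes a plain tf * log(N/df) weight per term and decodes a token payload. The weighting formula is an assumption chosen for the example (it is not Lucene's scoring formula), and the response dictionary is an abbreviated copy of the one shown above.

```python
import base64
import math

# Abbreviated copy of the term vector response shown above
response = {
    "term_vectors": {
        "text": {
            "field_statistics": {"sum_doc_freq": 65, "doc_count": 6, "sum_ttf": 66},
            "terms": {
                "random": {"doc_freq": 1, "ttf": 1, "term_freq": 1,
                           "tokens": [{"position": 0, "start_offset": 0,
                                       "end_offset": 6, "payload": "d29yZA=="}]},
                "text": {"doc_freq": 1, "ttf": 1, "term_freq": 1,
                         "tokens": [{"position": 1, "start_offset": 7,
                                     "end_offset": 11, "payload": "d29yZA=="}]},
            },
        }
    }
}

def tfidf_features(response, field):
    """Weight each term by term_freq * log(doc_count / doc_freq)."""
    vector = response["term_vectors"][field]
    doc_count = vector["field_statistics"]["doc_count"]
    return {
        term: stats["term_freq"] * math.log(doc_count / stats["doc_freq"])
        for term, stats in vector["terms"].items()
    }

def decode_payload(token):
    """Payloads are returned base64-encoded."""
    return base64.b64decode(token["payload"]).decode("utf-8")

features = tfidf_features(response, "text")
first_token = response["term_vectors"]["text"]["terms"]["random"]["tokens"][0]
print(features)
print(decode_payload(first_token))  # prints word
```

The decoded payload is the token type, which is what the type_as_payload filter in the my_english analyzer stores.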

Processing independent documents
 This can be done by using the Analyze API
 The analyzer my_english was defined in Slide 16
 The request below returns the following tokens for the document “Text to analyze”
curl -XGET "http://localhost:9200/social_media/_analyze?analyzer=my_english&text=Text%20to%20analyze"
{
"tokens": [
{
"token": "text",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 1
},
{
"token": "analyze",
"start_offset": 8,
"end_offset": 15,
"type": "word",
"position": 3
}
]
}

Working with Shingles
 Shingles are a way to index groups of tokens such as unigrams, bigrams etc.
 Setting both min_shingle_size and max_shingle_size to 2 produces bigrams
"shingle_filter" : {
  "type" : "shingle",
  "min_shingle_size" : 2,
  "max_shingle_size" : 2,
  "output_unigrams": true
}
curl -XGET "http://localhost:9200/social_media/_analyze?analyzer=my_english_shingle&text=Text%20to%20analyze"
{
"tokens": [
{
"token": "text",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 1
},
{
"token": "text _",
"start_offset": 0,
"end_offset": 8,
"type": "shingle",
"position": 1
},
{
"token": "_ analyze",
"start_offset": 8,
"end_offset": 15,
"type": "shingle",
"position": 2
},
{
"token": "analyze",
"start_offset": 8,
"end_offset": 15,
"type": "word",
"position": 3
}
]
}
 This filter can be used in the termvector API to get vectors containing both unigrams and bigrams
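
To make the "_" tokens in the output above less mysterious, here is a small Python emulation of a stop filter followed by a bigram shingle filter. It is a sketch only: the stop list containing "to" is an assumption (the contents of stoplist.txt are not shown), and Lucene's real shingle filter handles more edge cases.

```python
def whitespace_lowercase_stop(text, stopwords):
    """Tokenize on whitespace and lowercase; None marks a removed stop word."""
    return [None if t.lower() in stopwords else t.lower() for t in text.split()]

def shingles(tokens, size=2, filler="_", output_unigrams=True):
    """Emit unigrams and fixed-size shingles, filling removed positions with filler."""
    out = []
    for i, tok in enumerate(tokens):
        if output_unigrams and tok is not None:
            out.append(tok)
        window = tokens[i:i + size]
        if len(window) == size and any(t is not None for t in window):
            out.append(" ".join(filler if t is None else t for t in window))
    return out

# Assumed stop list for this example
tokens = whitespace_lowercase_stop("Text to analyze", {"to"})
print(shingles(tokens))
# prints ['text', 'text _', '_ analyze', 'analyze']
```

The filler appears because the stop filter removes "to" but keeps its position, so the bigrams that would have contained it still record that a word stood there.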

Searching in Index (Search API)
 Default search: will search for “some text”, “some” and “text”
curl -XGET "http://localhost:9200/social_media/tweet/_search" -d'
{
  "query": {
    "match": {
      "text._analyzed": "some Texts"
    }
  },
  "explain": true
}'
 Exact phrase matching: will search for “some Texts” as a phrase
curl -XGET "http://localhost:9200/social_media/tweet/_search" -d'
{
  "query": {
    "match_phrase": {
      "text": "some Texts"
    }
  },
  "explain": true
}'

Recommended Design Patterns
 Keep the number of nodes odd
 Take precautions to avoid the Split-brain issue
 Regularly refresh indices
 Add refresh_interval to settings
 Manage heap size
 ES_HEAP_SIZE <= ½ of the system’s RAM but not more than 32GB
 export ES_HEAP_SIZE=10g
 ./bin/elasticsearch -Xmx10g -Xms10g
 Use Aliases
 Searches are made against an alias that points to the underlying index
 This prevents cluster downtime or delays that may occur while the index is being updated or modified
 Delete aliases when they become old and create new ones
 You can create time-based aliases as well
 Use Routing
 A way to know which shard contains what document
 Reduces the lookup time during searches
 When bulk indexing
 Timeout after every push
 Push should be of maximum size 2-3MB
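
The bulk-indexing advice above can be sketched as a generator that groups documents into newline-delimited bulk bodies of a bounded size. This is an illustration under assumptions: the function name and the 2 MB default are made up for the example, and sending each body (a POST to the _bulk endpoint, followed by a pause) is left out.

```python
import json

def bulk_bodies(docs, index, doc_type, max_bytes=2 * 1024 * 1024):
    """Yield bulk request bodies; a new body starts before max_bytes is exceeded."""
    lines, size = [], 0
    for doc_id, source in docs:
        # Each document is an action line followed by its source line
        action = json.dumps({"index": {"_index": index, "_type": doc_type, "_id": doc_id}})
        pair = action + "\n" + json.dumps(source) + "\n"
        pair_size = len(pair.encode("utf-8"))
        if lines and size + pair_size > max_bytes:
            yield "".join(lines)
            lines, size = [], 0
        lines.append(pair)
        size += pair_size
    if lines:
        yield "".join(lines)

# Ten small made-up documents and a tiny limit, to show the chunking
docs = [(str(i), {"text": "x" * 100}) for i in range(10)]
bodies = list(bulk_bodies(docs, "social_media", "tweet", max_bytes=400))
print(len(bodies))
```

A single document larger than the limit still gets a body of its own, so the limit is a target rather than a hard guarantee.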

Why not SOLR?
 SOLR is a better pure search engine than Elasticsearch
 But we require term vectors and analysis more than search
 ES provides better APIs for analytics
 termvector with field and term statistics
 mtermvector
 search with explain enabled
 function_score queries (not covered earlier)
 If you need only a search engine, go for SOLR. If you need something more than that, Elasticsearch is the best choice.

Language Support
 We have
 JAVA wrappers : org.elasticsearch.*
 Python wrapper: elasticsearch-py
 Scala wrapper : elastic4s
 Domain Specific Language (DSL) : cURL/JSON as shown in every
example previously

Let’s add some SPARK to ES…
 Apache Spark is an engine for large scale data processing
 The Spark project claims it runs programs up to 100 times faster than Hadoop MapReduce when working in memory
 Has language support for Python, Java, Scala and R
 For Project Consilience:
 Earlier I had thought of making Spark the starting and end point of the whole application
 i.e. read files using Spark, index them using Elasticsearch and apply clustering using Spark’s MLlib
 Flat file reading is very direct in Spark
 sc.textFile() => parallel reading of the file, line by line, in chunks
 sc.wholeTextFiles() => loads each file whole as a (file name, content) pair

 Earlier experiments were done in Scala
 Scala gave us the advantage of functional programming along with parallel processing
 Now Java 8 also provides functional programming, so choosing between Scala and Java won’t make much difference
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD operations such as reduceByKey (needed before Spark 1.3)
import org.elasticsearch.spark._ // ES-Spark connector
val conf = new SparkConf()
  .setAppName("super_spark")
  .setMaster("local[2]")
  .set("spark.executor.memory", "1g")
  .set("spark.rdd.compress", "true")
  .set("spark.storage.memoryFraction", "1")
  .set("es.index.auto.create", "true")
  .set("es.nodes", "localhost") // Elasticsearch host (assumed to be local here)
  .set("es.port", "9200") // Elasticsearch HTTP port
// other configurations can be added as well
val sc = new SparkContext(conf)
// parallel processing of a range. Similar syntax in Java and Python
val data = sc.parallelize(1 to 10000).filter(_ < 100).collect()
data.foreach(println)
val textFile = sc.textFile("/home/cloudera/Documents/pg2265.txt")
val counts = textFile
  .flatMap(line => line.split(" ")) // all tokens
  .map(word => word.replaceAll("\\p{P}", "").toLowerCase()) // remove punctuation, convert to lower case
  .filter(_.nonEmpty) // remove all empty tokens
  .map(word => (word, 1))
  .reduceByKey(_ + _) // sum the counts per word
val thing = counts.collect()
// saveToEs expects documents, e.g. Maps of field name -> value; this Map is only an example
sc.makeRDD(Seq(Map("word" -> "example", "count" -> 1))).saveToEs("spark/docs")

 Tried the Spark-Hadoop-Elasticsearch connector but noticed some overhead and unnecessary computations
 The project currently does not receive large volumes of data, nor does it receive data frequently. So fast computation isn’t really required
 What we want is features for clustering. Those features can easily be provided by Elasticsearch
 Maybe in the future Spark will be added in the first phase of the project.
 As of now Spark will be used for clustering of the documents. The MLlib library provides APIs for this

REFERENCES
 Learning Elasticsearch – Anurag Patel (Red Hat)
 Introduction to Elasticsearch – Roy Russo
 Apache Spark and Elasticsearch – Holden Karau UMD 2014
 Streamlining Search Indexing using Elastic Search and Spark (Holden
Karau)
 Video Link : https://www.youtube.com/watch?v=jYicnlunDQ0
