An overview of Elasticsearch: main features, architecture, and limitations. It also describes how to query data using both the REST API and the elastic4s library, with particular attention to integrating the search engine with Apache Spark.
4. What is Elasticsearch?
“Elasticsearch (ES) is an open-source, broadly-distributable, readily-scalable, enterprise-grade search engine. Accessible through an extensive and elaborate API, Elasticsearch can power extremely fast searches that support your data discovery applications.”
5. ES terminology
A cluster is identified by a unique name which by default is "elasticsearch". This name
is important because a node can only be part of a cluster if the node is set up to join
the cluster by its name.
A node is identified by a name which by default is a random Universally Unique
IDentifier (UUID) that is assigned to the node at startup. You can define any node name
you want if you do not want the default.
An index is a collection of documents that have somewhat similar
characteristics. For example, you can have an index for customer data, another
index for a product catalog, and yet another index for order data. An index is
identified by a name (that must be all lowercase) and this name is used to refer
to the index when performing indexing, search, update, and delete operations
against the documents in it.
In a single cluster, you can define as many indexes as you want.
A type is a logical category/partition of your index whose semantics is completely up to
you. In general, a type is defined for documents that have a set of common fields.
A document is a basic unit of information that can be indexed. A document is
expressed in JSON.
6. ES terminology (2)
Elasticsearch provides the ability to subdivide your index into multiple
pieces called shards.
Each shard is in itself a fully-functional and independent "index" that can be
hosted on any node in the cluster.
Shards:
▷ allow you to horizontally split/scale your content volume
▷ allow you to distribute and parallelize operations (potentially on
multiple nodes) thus increasing performance/throughput
The number of shards and replicas can be defined per index at the time the
index is created. After the index is created, you may change the number of
replicas dynamically at any time, but you cannot change the number of shards
after the fact.
Each Elasticsearch shard is a Lucene index. There is a maximum
number of documents you can have in a single Lucene index. As of
LUCENE-5843, the limit is 2,147,483,519 (= Integer.MAX_VALUE - 128)
documents. You can monitor shard sizes using the _cat/shards API.
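The per-shard limit above gives a quick lower bound on the number of primary shards an index needs. A minimal sketch of that arithmetic; the object and method names are illustrative, and real sizing also depends on data volume and query load, which this ignores.

```scala
object ShardSizing {
  // Hard per-shard document limit quoted above (LUCENE-5843).
  val MaxDocsPerShard: Long = Int.MaxValue.toLong - 128L // 2,147,483,519

  // Smallest number of primary shards whose combined capacity can hold totalDocs.
  def minPrimaryShards(totalDocs: Long): Long = {
    require(totalDocs >= 0, "totalDocs must be non-negative")
    math.max(1L, (totalDocs + MaxDocsPerShard - 1) / MaxDocsPerShard)
  }
}
```

For example, ShardSizing.minPrimaryShards(5000000000L) returns 3, since two shards can hold at most 4,294,967,038 documents.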
8. Versioning
▷ ES supports content versioning.
▷ Documents can be edited. ES keeps track of
the version number (starting from 1).
▷ Version increments are atomic.
▷ ES may delete older versions as the index
grows.
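Version numbers are what makes optimistic concurrency control possible: a write can be rejected when the caller has not seen the latest version. The following is a toy, single-threaded, in-memory model of that behaviour, not ES code; Store, index and indexIfVersion are hypothetical names chosen for this sketch.

```scala
object VersionedStore {
  final case class Versioned(source: Map[String, String], version: Long)

  // Toy in-memory model of a versioned document store (not thread-safe).
  final class Store {
    private var docs = Map.empty[String, Versioned]

    // Index or replace a document: the version starts at 1 and grows by 1 per write.
    def index(id: String, source: Map[String, String]): Versioned = {
      val next = docs.get(id).map(_.version + 1).getOrElse(1L)
      val stored = Versioned(source, next)
      docs = docs.updated(id, stored)
      stored
    }

    // Write only if the caller has seen the current version.
    def indexIfVersion(id: String, expected: Long,
                       source: Map[String, String]): Either[String, Versioned] =
      docs.get(id) match {
        case Some(current) if current.version == expected =>
          Right(index(id, source))
        case Some(current) =>
          Left(s"version conflict: current=${current.version}, expected=$expected")
        case None =>
          Left(s"document $id not found")
      }
  }
}
```

A caller that read version 1 and tries to write after someone else has already produced version 2 gets a conflict instead of silently overwriting the newer data.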
9. Using ES as primary data storage
Key points:
▷ ES is an eventually consistent, near-realtime
storage engine.
▷ ES is not ACID compliant like a relational database.
▷ ES does not support transactions.
Currently, there are some (minor) issues related to
resiliency:
https://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html
10. Using ES as primary data storage (2)
▷ You can use ES as primary data storage (a lot of
people already do!), but be prepared for the
possibility of losing some of your data.
▷ You can mitigate the risk of data loss with
backup snapshots.
Alternative: use ES together with another resilient DB
(e.g. HBase) and reindex all data from it when needed.
More info on ES resiliency:
https://www.elastic.co/blog/resiliency-elasticsearch
11. ES vs. Solr: main differences
ES and Solr offer comparable performance, similar
documentation quality and APIs for the most
common languages, but:
▷ ES loves REST!
▷ The ES Query DSL syntax is really flexible and
powerful. Solr has nothing analogous.
▷ ES automagically indexes data without requiring an
explicit data mapping.
▷ ES cluster installation is easier and self-contained
(no external tools required).
▷ ES has much better data analytics capabilities (e.g.
aggregations).
13. Requirements, installation, configuration (2)
The standard recommendation is to give 50% of the available
memory to Elasticsearch heap, while leaving the other 50% free. It
won’t go unused; Lucene will happily gobble up whatever is left
over.
Try to avoid crossing the 32 GB heap boundary: above it the JVM can no longer use compressed object pointers.
Disable swap completely on your system:
> sudo swapoff -a
To run ES as a daemon:
> ./bin/elasticsearch -d -p pid
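The heap sizing rule above (half of the RAM, but below the 32 GB boundary) can be written down as a small helper. This is only a sketch: the 31 GB cap is an assumed safety margin, since the exact threshold depends on the JVM, and the object and method names are illustrative.

```scala
object HeapSizing {
  // Assumed cap just below the 32 GB boundary; the exact threshold varies by JVM.
  val HeapCapGb = 31

  // Half of the RAM, capped; the other half is left to the OS file cache used by Lucene.
  def recommendedHeapGb(totalRamGb: Int): Int = {
    require(totalRamGb > 0, "totalRamGb must be positive")
    math.max(1, math.min(totalRamGb / 2, HeapCapGb))
  }

  // Standard JVM heap flags; min and max are equal so the heap is never resized.
  def jvmHeapFlags(totalRamGb: Int): String = {
    val heap = recommendedHeapGb(totalRamGb)
    s"-Xms${heap}g -Xmx${heap}g"
  }
}
```

On a 16 GB machine this yields "-Xms8g -Xmx8g"; on a 128 GB machine the cap applies and the result is "-Xms31g -Xmx31g".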
14. Requirements, installation, configuration (3)
Configuration is stored in elasticsearch.yml. If you can’t disable
swapping, set the following in the configuration file:
> bootstrap.mlockall: true
To check that the setting is active:
GET _nodes?filter_path=**.mlockall
Let the Elasticsearch process open many file descriptors (set the limit to 64000).
Ubuntu ignores the limits.conf file for processes started by
init.d. To enable the limits.conf file, edit /etc/pam.d/su.
Development vs. production mode
▷ Binding address
17. ES: easily write your own apps
▷ Provides APIs in several languages:
○ REST Web service
○ Java, Scala, Groovy for JVM languages
○ .NET
○ Ruby, Python, PHP as scripting languages
We are mainly interested in the REST and Scala
APIs.
18. Rest API
To list indices:
> GET /_cat/indices?v
Create the index
> PUT /customer?pretty
Insert a document
> PUT /customer/external/1?pretty
{
"name": "John Doe"
}
ES automatically creates the index if it does not exist.
19. Rest API
Retrieve a document, given its index, type and ID
> GET /customer/external/1?pretty
To delete an index
> DELETE /customer?pretty
The golden rule
> <REST Verb> /<Index>/<Type>/<ID>
20. Rest API
ES has a short delay (about 1 s by default) before a newly indexed document becomes searchable.
Replace document using the ID
> PUT /customer/external/1?pretty
{
"name": "Jane Doe"
}
21. Rest API
If we don’t want to specify the ID during insertion, we use POST
> POST /customer/external?pretty
{
"name": "Jane Doe"
}
22. Rest API
Update a document
> POST /customer/external/1/_update?pretty
{
"doc": { "name": "Jane Doe", "age": 20 }
}
Delete a document
> DELETE /customer/external/2?pretty
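The _update request above sends only a partial document, which is merged into the stored one: listed fields are overwritten or added, the others are kept. A minimal sketch of that merge for flat documents, using plain Scala maps; it is a model of the behaviour rather than ES code, and it does not handle nested objects, which ES merges recursively.

```scala
object PartialUpdate {
  type Doc = Map[String, Any]

  // Fields in partial overwrite or extend those in existing; untouched fields are kept.
  def merge(existing: Doc, partial: Doc): Doc = existing ++ partial
}
```

Merging Map("name" -> "John Doe", "city" -> "Pisa") with Map("name" -> "Jane Doe", "age" -> 20) keeps city, replaces name and adds age.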
23. Rest API
To manage multiple documents at a time, use the _bulk API
> POST /customer/external/_bulk?pretty
{"index":{"_id":"1"}}
{"name": "John Doe" }
{"index":{"_id":"2"}}
{"name": "Jane Doe" }
Different operations can be mixed
> POST /customer/external/_bulk?pretty
{"update":{"_id":"1"}}
{"doc": { "name": "John Doe becomes Jane Doe" } }
{"delete":{"_id":"2"}}
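The bulk body shown above is newline-delimited JSON: one metadata line per action, followed by a payload line when the action needs one, and a final newline at the end. A small sketch that assembles such a body from typed actions; the Action types are hypothetical, payloads are passed as ready-made JSON strings, and IDs are not escaped.

```scala
object BulkBody {
  sealed trait Action
  final case class Index(id: String, sourceJson: String) extends Action
  final case class Update(id: String, partialDocJson: String) extends Action
  final case class Delete(id: String) extends Action

  // One metadata line per action, plus a payload line where the action needs one.
  private def lines(action: Action): Seq[String] = action match {
    case Index(id, src)  => Seq(s"""{"index":{"_id":"$id"}}""", src)
    case Update(id, doc) => Seq(s"""{"update":{"_id":"$id"}}""", s"""{"doc":$doc}""")
    case Delete(id)      => Seq(s"""{"delete":{"_id":"$id"}}""")
  }

  // The body is newline-delimited and must end with a newline.
  def render(actions: Seq[Action]): String =
    actions.flatMap(lines).mkString("", "\n", "\n")
}
```

The resulting string can be sent as the body of a POST to the _bulk endpoint with any HTTP client.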
24. Rest API - Search
There are two ways to query ES: with URI request parameters or with a JSON request body
> GET /bank/_search?q=*&sort=account_number:asc
> GET /bank/_search
{
"query": { "match_all": {} },
"sort": [
{ "account_number": "asc" }
]
}
25. Rest API - Search
Look for documents whose address contains "mill" or "lane"
> GET /bank/_search
{
"query": { "match": { "address": "mill lane" } }
}
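The "mill or lane" behaviour comes from the match query analyzing its input into terms and, by default, combining them with OR. The toy model below contrasts that with phrase matching. It is a sketch under simplifying assumptions: the tokens function is a crude stand-in for a real analyzer, and scoring is ignored entirely.

```scala
object MatchSemantics {
  // Crude stand-in for an analyzer: lowercase and split on non-alphanumeric characters.
  def tokens(text: String): Vector[String] =
    text.toLowerCase.split("[^a-z0-9]+").filter(_.nonEmpty).toVector

  // match (default OR operator): at least one query term occurs in the field.
  def matches(field: String, query: String): Boolean = {
    val fieldTerms = tokens(field).toSet
    tokens(query).exists(fieldTerms.contains)
  }

  // match_phrase: all query terms occur consecutively and in order.
  def matchesPhrase(field: String, query: String): Boolean = {
    val queryTerms = tokens(query)
    queryTerms.nonEmpty && tokens(field).containsSlice(queryTerms)
  }
}
```

With the query "mill lane", an address such as "990 Mill Road" matches under match semantics but not under phrase semantics.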
26. Rest API - Search
You can build complex queries by combining multiple conditions with bool queries; bool queries can
themselves be nested.
GET /bank/_search
{
"query": {
"bool": {
"must": [
{ "match": { "age": "40" } }
],
"must_not": [
{ "match": { "state": "ID" } }
]
}
}
}
27. Rest API - Search
Bool queries support filter clauses. These select documents without affecting their relevance
scores. A filter clause with a range query is typically used for numeric or date filtering.
GET /bank/_search
{
"query": {
"bool": {
"must": { "match_all": {} },
"filter": {
"range": {
"balance": {
"gte": 20000,
"lte": 30000
}
}
}
}
}
}
28. Rest API
● term - searches for a single exact term; the input is not analyzed
● match_phrase - analyzes the input with the analyzer defined
for the queried field and finds documents containing the resulting
terms as a phrase
● query_string - by default searches the _all field, which
contains the text of several text fields at once. The query is
parsed and supports operators (AND/OR...),
wildcards and so on (see the related syntax).
29. Rest API - Aggregations
GET /bank/_search
{
"size": 0,
"aggs": {
"group_by_state" : {
"terms": {
"field": "state.keyword"
}
}
}
}
Equivalent to
SELECT state, COUNT(*) FROM bank GROUP BY state ORDER BY COUNT(*) DESC
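The same grouping can be mimicked on an in-memory collection, which may help when reasoning about what the terms aggregation returns. A minimal sketch: the Account case class and its data are hypothetical, ties are broken by key for determinism, and the default of 10 buckets mirrors the default size of the terms aggregation.

```scala
object TermsAggregation {
  final case class Account(accountNumber: Int, state: String)

  // Count documents per state, order buckets by descending count, keep the top `size`.
  def groupByState(accounts: Seq[Account], size: Int = 10): Seq[(String, Int)] =
    accounts
      .groupBy(_.state)
      .map { case (state, docs) => state -> docs.size }
      .toSeq
      .sortBy { case (state, count) => (-count, state) }
      .take(size)
}
```

Unlike the SQL statement, the aggregation returns only the top buckets unless a larger size is requested.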
30. Rest API
Most APIs that take an index parameter support execution
across multiple indices, using the simple test1,test2,test3 notation
(or _all for all indices). They also support wildcards, for example
test* or *test or te*t or *test*, and the ability to "add" (+) and
"remove" (-) indices, for example +test*,-test3.
Date math index name resolution enables you to search a range of
time-series indices, rather than searching all of your time-series
indices and filtering the results or maintaining aliases.
A date math index name takes the following form:
<static_name{date_math_expr{date_format|time_zone}}>
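Because a date math index name contains characters such as < and {, it has to be percent-encoded before it is placed in a request path. The sketch below encodes such a name and shows what a daily expression would resolve to on a given date. The index prefix "logstash-" is only an illustrative name, and the default date format is assumed to be yyyy.MM.dd.

```scala
import java.net.URLEncoder
import java.time.LocalDate
import java.time.format.DateTimeFormatter

object DateMathIndex {
  // Percent-encode a date math index name for use in a request path.
  // URLEncoder does form encoding (a space would become '+'), which is fine
  // here because date math names contain no spaces.
  def encode(name: String): String = URLEncoder.encode(name, "UTF-8")

  // What <prefix{now/d}> resolves to on the given day, assuming the default format.
  def resolveDaily(prefix: String, today: LocalDate): String =
    prefix + today.format(DateTimeFormatter.ofPattern("yyyy.MM.dd"))
}
```

For example, DateMathIndex.encode("<logstash-{now/d}>") yields %3Clogstash-%7Bnow%2Fd%7D%3E, which on 22 March 2024 would address the index logstash-2024.03.22.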
31. Rest API
All REST APIs accept a filter_path parameter that can be used to
reduce the response returned by Elasticsearch. This parameter
takes a comma-separated list of filters expressed in dot
notation:
> GET /_search?q=elasticsearch&filter_path=took,hits.hits._id,hits.hits._score
You can control which parts of each document are returned
with _source filtering.
32. Rest API
You can delete multiple documents using a query:
POST twitter/_delete_by_query
{
"query": {
"match": {
"message": "some message"
}
}
}
33. Rest API - Advanced
Term vectors can be helpful to analyse text, build word clouds and
evaluate relevance of documents. ES supports them:
curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvectors?pretty=true'
34. Rest API - Advanced
The routing parameter controls which shards are searched. Its value is
used to dispatch documents across shards: documents indexed with the
same routing value are stored on the same shard, so a search that
supplies that value only has to query that shard.
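Routing is documented as a hash of the routing value taken modulo the number of primary shards, which is also why the number of shards cannot change after index creation. A toy sketch of that rule: String.hashCode is only a stand-in here, since ES uses its own hash function, so the shard numbers it produces will not match a real cluster.

```scala
object Routing {
  // Non-negative remainder of a stable hash of the routing value.
  // String.hashCode is an assumption of this sketch, not the hash ES uses.
  def shardFor(routing: String, numPrimaryShards: Int): Int = {
    require(numPrimaryShards > 0, "numPrimaryShards must be positive")
    Math.floorMod(routing.hashCode, numPrimaryShards)
  }
}
```

The same routing value always lands on the same shard for a fixed shard count, while changing the count generally moves it: "a" maps to shard 2 of 5 but to shard 1 of 3 in this model.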
Large result sets can be paged through with the scroll API, by setting the
scroll parameter in the search request. A scroll should be explicitly
cleared with the clear-scroll API as soon as it is no longer used.
35. Rest API - Advanced
Queries can be performed over different fields, types or indices. For
example, you can query across all types:
GET /twitter/_search?q=user:kimchy
… or specify multiple types...
GET /twitter/tweet,user/_search?q=user:kimchy
...or indices
GET /kimchy,elasticsearch/tweet/_search?q=tag:wow
You can use _all to match all indices
GET /_all/tweet/_search?q=tag:wow
GET /_search?q=tag:wow
36. Rest API - Templates
You can register a search template by storing it in the
config/scripts directory, in a file with the .mustache
extension. To execute the stored template, reference it by
its name under the file key:
GET /_search/template
{
"file": "storedTemplate",
"params": {
"query_string": "search for these words"
}
}
37. Rest API - Count API
ES offers a simple and fast interface to count documents:
PUT /twitter/tweet/1?refresh
{
"user": "kimchy"
}
GET /twitter/tweet/_count?q=user:kimchy
The response reports the count and the shards involved:
{
"count" : 1,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
}
}
38. Rest API - Aggregation
Aggregation can be performed using different functions or scripts:
▷ avg
▷ cardinality (distinct values). The count is approximated above a
threshold (configurable)
▷ geo
▷ max/min
▷ sum
▷ value count
39. Rest API - Histogram Aggregation
The histogram aggregation is a multi-bucket, values-source-based aggregation that can be applied to numeric values
extracted from the documents. It dynamically builds fixed-size buckets (the bucket size is the interval)
over the values.
The date histogram aggregation is a multi-bucket aggregation similar to the histogram,
except that it can only be applied to date values. Since dates are represented
internally in Elasticsearch as long values, it is possible to use the normal histogram on
dates as well, though accuracy will be compromised. See also the date range
aggregation, which lets you define custom date intervals.
{
"aggs" : {
"articles_over_time" : {
"date_histogram" : {
"field" : "date",
"interval" : "month"
}
}
}
}
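The fixed-size bucketing of the plain histogram aggregation boils down to rounding each value down to a multiple of the interval. A minimal sketch on long values, which also covers dates stored as epoch milliseconds; it ignores the optional offset and, unlike ES, does not emit empty buckets.

```scala
object HistogramBuckets {
  // Lower bound of the fixed-size bucket that contains value.
  def bucketKey(value: Long, interval: Long): Long = {
    require(interval > 0, "interval must be positive")
    Math.floorDiv(value, interval) * interval
  }

  // Document count per non-empty bucket, ordered by bucket key.
  def histogram(values: Seq[Long], interval: Long): Seq[(Long, Int)] =
    values
      .groupBy(bucketKey(_, interval))
      .map { case (key, vs) => key -> vs.size }
      .toSeq
      .sortBy(_._1)
}
```

With an interval of 5, the value 32 falls into the bucket keyed 30 and the value -1 into the bucket keyed -5. A fixed interval cannot express calendar units such as "month", whose length varies, which is where the date histogram comes in.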
41. Scripting
ES supports different scripting languages:
▷ Painless (the fastest, recommended; compatible with Java)
▷ Groovy
▷ Javascript
▷ Python
▷ …
▷ You can also write your scripts in Java
"script": {
"lang": "...",
"inline" | "id" | "file": "...",
"params": { ... }
}
42. Recommendations
▷ Don’t return large result sets
▷ Avoid sparsity
▷ Avoid putting unrelated data in the same index
▷ Use the same names for equivalent fields (normalization)
▷ Disable swapping
43. Elastic4s: a Scala DSL API for ES
A Scala wrapper around standard ES Java API:
▷ Type safe concise DSL.
▷ Integration with standard Scala futures.
▷ Integration with Scala collections library.
▷ Leverages the built-in Java client.
▷ Provides reactive-streams implementation.
▷ Currently compatible with ES 5.1.1, the latest
version.
https://github.com/sksamuel/elastic4s
44. Elastic4s: connection to ES
import com.sksamuel.elastic4s.{ElasticClient, ElasticsearchClientUri}
import org.elasticsearch.common.settings.Settings
// Simple connection to a local ES node.
val client = ElasticClient.local
// Specify more configuration parameters before connecting.
val localSettings = Settings.settingsBuilder()
  .put("http.enabled", false)
  .put("path.home", "/var/elastic/")
val client2 = ElasticClient.local(localSettings.build)
// Connection to remote nodes (values renamed so the snippet compiles as a whole).
val remoteSettings = Settings.settingsBuilder()
  .put("cluster.name", "myClusterName").build()
val remoteClient = ElasticClient.remote(remoteSettings,
  ElasticsearchClientUri("elasticsearch://somehost:9300"))
https://www.elastic.co/guide/en/elasticsearch/reference/2.3/setup-configuration.html
45. Elastic4s: index creation
import com.sksamuel.elastic4s.ElasticDsl._
// Create a new dynamic index.
client.execute { create index "places" shards 3 replicas 2 }
// Alternatively, create the index specifying metadata for the stored data.
client.execute {
  create index "places" mappings (
    "cities" as (
      "id" typed IntegerType,
      "name" typed StringType boost 4,
      "content" typed StringType analyzer StopAnalyzer
    )
  )
}.await // Operations are always asynchronous; call await to wait synchronously.
More info at https://github.com/sksamuel/elastic4s/blob/master/guide/createindex.md
46. Elastic4s: data indexing
// Index a new document.
client.execute {
index into "places" / "cities" id "uk" fields (
"name" -> "London",
"country" -> "United Kingdom",
"continent" -> "Europe",
"status" -> "Awesome"
)
}
▷ If the ID is not specified, ES generates it automatically.
▷ Bulk loads are supported.
▷ Case classes can be indexed directly.
▷ The DSL natively supports routing, version, parent, timestamp and op
type. See http://www.elasticsearch.org/guide/reference/api/index_/
for more info.
47. Elastic4s: data queries
// Query for a generic term, paging the results.
search in "places"->"cities" query "paris" start 5 limit 10
// Search for docs matching “georgia” in “state” field.
search in "places"->"cities" query { termQuery("state", "georgia") }
// Or make a complex boolean query.
search in "places"->"cities" query {
bool {
must(
regexQuery("name", ".*cester"),
termQuery("status", "Awesome")
) not (
termQuery("weather", "hot")
)
}
}
* All queries are executed with the client.execute() method.
48. Elastic4s: data queries (2)
// Sorting results.
search in "places"->"cities" query "europe" sort (
field sort "name",
field sort "status"
)
// Sub-aggregations: getting for each country the most frequent terms
// in city descriptions.
val freqWords = client.execute {
search in "places/cities" aggregations {
aggregation terms "by_country" field "country" aggs {
aggregation terms "frequent_words" field "content"
}
}
}
* Query results are returned as an array of SearchHit instances, which contain the
id, index, type, version, etc.
49. Elastic4s: get data directly
// Request a specific document.
client.execute {
get id "coldplay" from "bands" / "rock"
}
// Or request a set of known documents.
client.execute {
multiget(
get id "coldplay" from "bands/rock",
get id "keane" from "bands/rock"
)
}
50. Elastic4s: update and delete docs
// * Update a single document specifying its ID.
update(5).in("scifi/startrek").doc(
"name" -> "spock",
"race" -> "vulcan"
)
// * Delete document by ID.
delete id "u2" from "bands/rock"
// * Delete documents by query.
delete from index "bands" types "rock" where
termQuery("type", "rap")
* These operations can be passed to the client.execute() or client.bulk() methods.
51. Elastic4s: other available features
▷ Bulk operations: indexing, deleting,
updating.
▷ Helpers to reindex all data.
▷ Full access to the Java API for
operations not available in the DSL.
▷ Reactive-streams support for publisher
and subscriber operations.
52. ES: Apache Spark integration
▷ Apache Spark is a fast and
general-purpose cluster computing
system.
▷ ES provides native Spark support as of
version 2.1:
○ With classic RDD APIs.
○ With Spark SQL support and the DataFrame
abstraction.
53. ES and Apache Spark: from RDD to ES and back
import org.elasticsearch.spark._ // adds saveToEs to RDDs and esRDD to the SparkContext
// Define a case class.
case class Trip(departure: String, arrival: String)
// Create data to save into ES.
val upcomingTrip = Trip("London", "New York")
val lastWeekTrip = Trip("Rome", "Madrid")
// Create a Spark RDD from the previous data and save it to ES (sc is an existing SparkContext).
val rdd = sc.makeRDD(Seq(upcomingTrip, lastWeekTrip))
rdd.saveToEs("spark/docs")
// Read data matching the given query as an RDD.
val rdd2 = sc.esRDD("spark/docs", "?q=London")
JSON data can be read directly with the esJsonRDD method and written with saveJsonToEs.
54. ES and Apache Spark: Spark SQL read and filters support
val sqlContext = new SQLContext...
// Read index as a DataFrame.
val df = sqlContext.read.format("org.elasticsearch.spark.sql").load("spark/trips")
df.printSchema()
// root
// |-- departure: string (nullable = true)
// |-- arrival: string (nullable = true)
// |-- days: long (nullable = true)
// Filter data.
val filter = df.filter(df("arrival").equalTo("OTP").and(df("days").gt(3)))
55. ES and Apache Spark: Spark SQL write support
import org.elasticsearch.spark.sql._ // adds saveToEs to DataFrames
import sqlContext.implicits._ // enables toDF() on RDDs of case classes
// Case class used to define the DataFrame.
case class Person(name: String, surname: String, age: Int)
// Create DataFrame.
val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1), p(2).trim.toInt))
  .toDF()
// Save DataFrame to ES.
people.saveToEs("spark/people")
56. Thanks!
Any questions?
You can find us at:
Fabio Del Vigna, fabio.delvigna@iit.cnr.it
Tiziano Fagni, tiziano.fagni@iit.cnr.it