Elasticsearch, a distributed search engine with real-time analytics
Pisa, 2 December 2016
Fabio Del Vigna, Tiziano Fagni
Lab “WAFI” - “Cyber Intelligence”
IIT-CNR, Pisa
Talk outline
▷ Architectural overview
▷ Developer tools
▷ Demo time
Architectural overview
Main features and design of Elasticsearch
What is Elasticsearch?
Elasticsearch (ES) is an open-source,
broadly-distributable,
readily-scalable, enterprise-grade
search engine. Accessible through an
extensive and elaborate API,
Elasticsearch can power extremely
fast searches that support your data
discovery applications.
ES terminology
A cluster is identified by a unique name which by default is "elasticsearch". This name
is important because a node can only be part of a cluster if the node is set up to join
the cluster by its name.
A node is identified by a name which by default is a random Universally Unique
IDentifier (UUID) that is assigned to the node at startup. You can define any node name
you want if you do not want the default.
An index is a collection of documents that have somewhat similar
characteristics. For example, you can have an index for customer data, another
index for a product catalog, and yet another index for order data. An index is
identified by a name (that must be all lowercase) and this name is used to refer
to the index when performing indexing, search, update, and delete operations
against the documents in it.
In a single cluster, you can define as many indexes as you want.
A type is a logical category/partition of your index whose semantics is completely up to
you. In general, a type is defined for documents that have a set of common fields.
A document is a basic unit of information that can be indexed. It is
expressed in JSON.
ES terminology (2)
Elasticsearch provides the ability to subdivide your index into multiple
pieces called shards.
Each shard is in itself a fully-functional and independent "index" that can be
hosted on any node in the cluster.
Shards:
▷ allow you to horizontally split/scale your content volume
▷ allow you to distribute and parallelize operations (potentially on
multiple nodes) thus increasing performance/throughput
The number of shards and replicas can be defined per index at the time the
index is created. After the index is created, you can change the number of
replicas dynamically at any time, but you cannot change the number of shards
after the fact.
Each Elasticsearch shard is a Lucene index. There is a maximum
number of documents you can have in a single Lucene index. As of
LUCENE-5843, the limit is 2,147,483,519 (= Integer.MAX_VALUE - 128)
documents. You can monitor shard sizes using the _cat/shards API.
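As a rough illustration of the per-shard limit above, the sketch below computes the smallest number of primary shards an index would need to stay under it. The helper name min_primary_shards and the example document count are assumptions made for illustration; in practice shard counts are usually dictated by data volume and query load long before this hard limit matters.

```python
# Hard per-shard limit quoted above: Integer.MAX_VALUE - 128
JAVA_INT_MAX = 2**31 - 1
MAX_DOCS_PER_SHARD = JAVA_INT_MAX - 128


def min_primary_shards(total_docs):
    """Smallest shard count that keeps every shard under the Lucene limit.

    Hypothetical helper: assumes documents are spread evenly across shards.
    """
    # Integer ceiling division avoids floating-point rounding.
    return max(1, (total_docs + MAX_DOCS_PER_SHARD - 1) // MAX_DOCS_PER_SHARD)


print(MAX_DOCS_PER_SHARD)                  # 2147483519
print(min_primary_shards(10_000_000_000))  # 5
```

Because the number of shards cannot be changed after index creation, an estimate like this has to be made up front.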
ES logical architecture
Versioning
▷ ES supports content versioning.
▷ Documents can be edited. ES keeps track of
the version number (starting from 1).
▷ Version increments are atomic.
▷ ES may delete older versions as the index
grows.
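One common use of version numbers is optimistic concurrency control: a write that carries a stale version is rejected instead of silently overwriting newer data. The sketch below simulates that idea in memory. VersionedStore and VersionConflict are hypothetical names, not part of any Elasticsearch client; with the real REST API a stale version is answered with an HTTP 409 conflict.

```python
class VersionConflict(Exception):
    """Raised when the caller's expected version is stale."""


class VersionedStore:
    """In-memory stand-in for an index (illustration only)."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def index(self, doc_id, source, expected_version=None):
        current_version = self._docs.get(doc_id, (0, None))[0]
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                f"expected version {expected_version}, found {current_version}"
            )
        new_version = current_version + 1  # versions start from 1
        self._docs[doc_id] = (new_version, source)
        return new_version


store = VersionedStore()
v1 = store.index("1", {"name": "John Doe"})                       # first write
v2 = store.index("1", {"name": "Jane Doe"}, expected_version=v1)  # accepted

try:
    # A second writer still holding version 1 is rejected.
    store.index("1", {"name": "stale write"}, expected_version=v1)
    conflict = False
except VersionConflict:
    conflict = True

print(v1, v2, conflict)
```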
Using ES as primary data storage
Key points:
▷ ES is an eventually consistent, near-realtime
storage engine.
▷ ES is not ACID compliant like a relational database.
▷ ES does not support transactions.
Currently, there are some (minor) issues related to
resiliency:
https://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html
Using ES as primary data storage (2)
▷ You can use it as primary data storage (a lot of
people already do!), but be prepared for the
possibility of losing some of your data.
▷ You can mitigate the risk of data loss with
backup snapshots.
Alternative: use it together with another resilient DB
(e.g. HBase) and reindex all data if needed.
More info on ES resiliency:
https://www.elastic.co/blog/resiliency-elasticsearch
ES vs. Solr: main differences
ES and Solr have comparable performance, similar
documentation quality and APIs available in the most
common languages, but:
▷ ES loves REST!
▷ The ES Query DSL syntax is really flexible and
powerful. Solr has nothing analogous.
▷ ES automatically indexes data without requiring an
explicit data mapping.
▷ ES cluster installation is easier and self-contained
(no external tools required).
▷ ES has much better data analytics capabilities (e.g.
aggregations).
Requirements, installation,
configuration
▷ Java 8 or later
▷ Installation
> curl -L -O https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.0.0.tar.gz
> tar -xzf elasticsearch-5.0.0.tar.gz
> elasticsearch-5.0.0/bin/elasticsearch
▷ Monitoring
> curl 'http://<machine IP>:9200/_cat/health?v'
> curl 'http://<machine IP>:9200/_cat/nodes?v'
> curl 'http://<machine IP>:9200/_cat/indices?v'
Requirements, installation,
configuration (2)
The standard recommendation is to give 50% of the available
memory to Elasticsearch heap, while leaving the other 50% free. It
won’t go unused; Lucene will happily gobble up whatever is left
over.
Try to avoid crossing the 32 GB heap boundary
Disable swap completely on your system
> sudo swapoff -a
To run ES as a daemon:
> ./bin/elasticsearch -d -p pid
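The heap sizing rule from this slide (half of the available memory, but below the 32 GB boundary) can be written down as a small calculation. This is only a sketch: the function name recommended_heap_gb is hypothetical, and the 31 GB cap is an assumed safety margin, since the exact threshold depends on the JVM.

```python
def recommended_heap_gb(total_ram_gb, heap_cap_gb=31.0):
    """Give half of the RAM to the heap, capped below the 32 GB boundary.

    heap_cap_gb is an assumed margin, not an official value.
    """
    return min(total_ram_gb / 2, heap_cap_gb)


for ram_gb in (8, 64, 256):
    # The remaining memory is left to the OS file cache used by Lucene.
    print(ram_gb, "GB RAM ->", recommended_heap_gb(ram_gb), "GB heap")
```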
Requirements, installation,
configuration (3)
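Configuration is stored in elasticsearch.yml. If you can’t disable
swapping, set the following in the configuration file:
> bootstrap.memory_lock: true
(this setting was called bootstrap.mlockall before ES 5.0)
To check that it took effect:
> GET _nodes?filter_path=**.mlockall
Allow the Elasticsearch process to open many file descriptors (e.g. 64000).
Ubuntu ignores the limits.conf file for processes started by
init.d. To enable the limits.conf file, edit /etc/pam.d/su.
Development vs. production mode
▷ Binding address: binding to a non-loopback address switches ES 5 to
production mode, where the bootstrap checks are enforced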
Requirements, installation,
configuration (4)
ES offers modules for two configuration management tools,
easing deployment of large installations:
▷ Puppet
▷ Chef
Developer tools
APIs and tools available to interact with Elasticsearch
ES: easily write your own apps
▷ Provides APIs in several languages:
○ REST Web service
○ Java, Scala, Groovy for JVM languages
○ .NET
○ Ruby, Python, PHP as scripting languages
We are mainly interested in REST and Scala
APIs.
REST API
To list indices:
> GET /_cat/indices?v
Create the index
> PUT /customer?pretty
Insert a document
> PUT /customer/external/1?pretty
{
"name": "John Doe"
}
ES automatically creates the index if it does not exist.
REST API
Retrieve a document, given its index, type and ID
> GET /customer/external/1?pretty
To delete an index
> DELETE /customer?pretty
The golden rule
> <REST Verb> /<Index>/<Type>/<ID>
REST API
ES has a short delay (1 s by default) before newly indexed data becomes searchable
Replace document using the ID
> PUT /customer/external/1?pretty
{
"name": "Jane Doe"
}
REST API
If we don’t want to specify the ID during insertion, we use POST
> POST /customer/external?pretty
{
"name": "Jane Doe"
}
REST API
Update a document
> POST /customer/external/1/_update?pretty
{
"doc": { "name": "Jane Doe", "age": 20 }
}
Delete a document
> DELETE /customer/external/2?pretty
REST API
To manage multiple documents at a time, use the _bulk API
> POST /customer/external/_bulk?pretty
{"index":{"_id":"1"}}
{"name": "John Doe" }
{"index":{"_id":"2"}}
{"name": "Jane Doe" }
Different operations can be mixed
> POST /customer/external/_bulk?pretty
{"update":{"_id":"1"}}
{"doc": { "name": "John Doe becomes Jane Doe" } }
{"delete":{"_id":"2"}}
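The _bulk body is newline-delimited JSON: one action line, optionally followed by one payload line, and the whole body must end with a newline. A minimal sketch of building such a body with the standard library follows; the helper name bulk_body and the tuple layout are assumptions made for this example, and sending the request is left out.

```python
import json


def bulk_body(actions):
    """Serialize (action, metadata, payload) triples into a _bulk body."""
    lines = []
    for action, metadata, payload in actions:
        lines.append(json.dumps({action: metadata}))
        if payload is not None:  # "delete" actions have no payload line
            lines.append(json.dumps(payload))
    return "\n".join(lines) + "\n"  # the body must end with a newline


body = bulk_body([
    ("index", {"_id": "1"}, {"name": "John Doe"}),
    ("update", {"_id": "1"}, {"doc": {"name": "John Doe becomes Jane Doe"}}),
    ("delete", {"_id": "2"}, None),
])
print(body)
```

Serializing each line with json.dumps guarantees that no line contains a raw newline, which would otherwise break the format.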
REST API - Search
There are two ways to query ES: with URI parameters or with a JSON request body
> GET /bank/_search?q=*&sort=account_number:asc
> GET /bank/_search
{
  "query": { "match_all": {} },
  "sort": [
    { "account_number": "asc" }
  ]
}
REST API - Search
Look for documents whose address contains "mill" or "lane"
> GET /bank/_search
{
"query": { "match": { "address": "mill lane" } }
}
REST API - Search
You can build complex queries by combining multiple conditions in a bool query.
The example below matches accounts whose age is 40 and whose state is not ID.
> GET /bank/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "age": "40" } }
      ],
      "must_not": [
        { "match": { "state": "ID" } }
      ]
    }
  }
}
REST API - Search
Bool queries support filter clauses, which restrict the matching documents without affecting their relevance
scores. A filter clause with a range query is typically used for numeric or date filtering.
> GET /bank/_search
{
  "query": {
    "bool": {
      "must": { "match_all": {} },
      "filter": {
        "range": {
          "balance": {
            "gte": 20000,
            "lte": 30000
          }
        }
      }
    }
  }
}
REST API
● term - searches for a single exact term; the input is not analyzed
● match_phrase - analyzes the input, if an analyzer is defined
for the queried field, and finds documents containing the
resulting terms as a phrase
● query_string - searches, by default, the _all field, which
contains the text of several text fields at once. On top of
that, the query is parsed and supports operators (AND/OR...),
wildcards and so on (see the related syntax).
REST API - Aggregations
> GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword"
      }
    }
  }
}
Equivalent to
SELECT state, COUNT(*) FROM bank GROUP BY state ORDER BY COUNT(*) DESC
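To make the GROUP BY analogy concrete, the sketch below reproduces what a terms aggregation computes, using plain Python on a handful of values. The sample states are invented for illustration and terms_aggregation is a hypothetical helper; the size default of 10 mirrors the number of buckets a terms aggregation returns by default.

```python
from collections import Counter

# Hypothetical sample of the "state" field from a few bank documents.
states = ["TX", "ID", "TX", "CA", "ID", "TX"]


def terms_aggregation(values, size=10):
    """Count occurrences per distinct value, most frequent first."""
    return [
        {"key": key, "doc_count": count}
        for key, count in Counter(values).most_common(size)
    ]


buckets = terms_aggregation(states)
for bucket in buckets:
    print(bucket["key"], bucket["doc_count"])
```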
REST API
Most APIs that refer to an index parameter support execution
across multiple indices, using simple test1,test2,test3 notation
(or _all for all indices). They also support wildcards, for example
test*, *test, te*t or *test*, and the ability to "add" (+) and
"remove" (-) indices, for example +test*,-test3.
Date math index name resolution enables you to search a range of
time-series indices, rather than searching all of your time-series
indices and filtering the results or maintaining aliases.
A date math index name takes the following form:
<static_name{date_math_expr{date_format|time_zone}}>
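A date math index name following the form given above contains characters such as <, { and /, so it has to be URI-encoded before it is used in a request path. The sketch below assembles a name and encodes it; the helper date_math_index and the static name "logstash-" are assumptions chosen for illustration.

```python
from urllib.parse import quote


def date_math_index(static_name, date_math_expr, date_format=None, time_zone=None):
    """Assemble <static_name{date_math_expr{date_format|time_zone}}>."""
    inner = date_math_expr
    if date_format is not None:
        fmt = date_format if time_zone is None else f"{date_format}|{time_zone}"
        inner += "{" + fmt + "}"
    return "<" + static_name + "{" + inner + "}>"


name = date_math_index("logstash-", "now/d")
encoded = quote(name, safe="")  # encode every reserved character, "/" included

print(name)     # <logstash-{now/d}>
print(encoded)  # %3Clogstash-%7Bnow%2Fd%7D%3E
print(date_math_index("logstash-", "now/M", "YYYY.MM"))
```

The encoded name can then be placed in the request path like any other index name.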
REST API
All REST APIs accept a filter_path parameter that can be used to
reduce the response returned by Elasticsearch. This parameter
takes a comma-separated list of filters expressed in dot
notation:
> GET /_search?q=elasticsearch&filter_path=took,hits.hits._id,hits.hits._score
You can control which parts of each document are returned
with the _source parameter.
REST API
You can delete multiple documents using a query:
> POST twitter/_delete_by_query
{
  "query": {
    "match": {
      "message": "some message"
    }
  }
}
REST API - Advanced
Term vectors can be helpful to analyse text, build word clouds and
evaluate the relevance of documents. ES supports them:
> curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvectors?pretty=true'
REST API - Advanced
The routing parameter controls which shards are searched. Its value is
used to dispatch documents across shards at index time: documents with the
same routing value are stored on the same shard, so a search that supplies
that value only has to visit one shard.
Large result sets can be paged through with the scroll operation, by setting the
scroll parameter in the search request. Scrolls should be explicitly cleared with the
clear-scroll API as soon as they are no longer needed.
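Routing boils down to hashing the routing value and taking the remainder modulo the number of primary shards. The sketch below illustrates that rule under a loud assumption: Elasticsearch uses a Murmur3 hash internally, while zlib.crc32 is used here only because it ships with the standard library, so the shard numbers will not match a real cluster. The function name shard_for is hypothetical.

```python
import zlib


def shard_for(routing_value, num_primary_shards):
    """shard = hash(routing) % number_of_primary_shards.

    crc32 is a stand-in for the Murmur3 hash used by Elasticsearch.
    """
    return zlib.crc32(routing_value.encode("utf-8")) % num_primary_shards


# Documents routed by the same value always land on the same shard.
first = shard_for("kimchy", 5)
second = shard_for("kimchy", 5)
print(first == second)

# With a different shard count the same value may map elsewhere.
print(shard_for("kimchy", 5), shard_for("kimchy", 6))
```

The modulo in this rule is also why the number of primary shards cannot be changed after index creation: existing documents would no longer be found on the shard the formula points to.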
REST API - Advanced
Queries can be performed over different fields, types or indices. For
example, you can query across all types...
> GET /twitter/_search?q=user:kimchy
...or specify multiple types...
> GET /twitter/tweet,user/_search?q=user:kimchy
...or indices.
> GET /kimchy,elasticsearch/tweet/_search?q=tag:wow
You can use _all, or omit the index altogether, to match all indices
> GET /_all/tweet/_search?q=tag:wow
> GET /_search?q=tag:wow
REST API - Templates
You can register a search template by storing it in the
config/scripts directory, in a file with the .mustache
extension. To execute the stored template, reference it by
its name under the file key
> GET /_search/template
{
  "file": "storedTemplate",
  "params": {
    "query_string": "search for these words"
  }
}
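To show what parameter substitution in a search template amounts to, the sketch below renders a template locally. The template text and the "title" field are invented for this example, and render only handles plain {{name}} placeholders; it is not the Mustache engine Elasticsearch uses.

```python
import json
import re

# Hypothetical content of config/scripts/storedTemplate.mustache
TEMPLATE = '{"query": {"match": {"title": "{{query_string}}"}}}'


def render(template, params):
    """Replace each {{name}} placeholder with the JSON-escaped parameter."""
    def substitute(match):
        value = str(params[match.group(1)])
        # json.dumps escapes quotes and backslashes; strip the outer quotes.
        return json.dumps(value)[1:-1]

    return re.sub(r"\{\{(\w+)\}\}", substitute, template)


rendered = render(TEMPLATE, {"query_string": "search for these words"})
query = json.loads(rendered)
print(query)
```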
REST API - Count API
ES offers a simple and fast interface to count documents
> PUT /twitter/tweet/1?refresh
{
  "user": "kimchy"
}
> GET /twitter/tweet/_count?q=user:kimchy
Response:
{
  "count": 1,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  }
}
REST API - Aggregation
Aggregation can be performed using different functions or scripts:
▷ avg
▷ cardinality (distinct values). The count is approximated above a
threshold (configurable)
▷ geo
▷ max/min
▷ sum
▷ value count
REST API - Histogram Aggregation
A multi-bucket, values-source-based aggregation that can be applied to numeric values
extracted from the documents. It dynamically builds fixed-size (a.k.a. interval) buckets
over the values.
The Date Histogram Aggregation is a multi-bucket aggregation similar to the histogram,
except that it can only be applied to date values. Since dates are represented
internally in Elasticsearch as long values, it is possible to use the normal histogram on
dates as well, though accuracy will be compromised. See also the date range
aggregation, which lets you define intervals of dates.
{
  "aggs": {
    "articles_over_time": {
      "date_histogram": {
        "field": "date",
        "interval": "month"
      }
    }
  }
}
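For the numeric histogram, each value is assigned to the bucket whose key is the value rounded down to a multiple of the interval. The sketch below applies that rule to a few numbers. The sample values and the helper name histogram are assumptions, and unlike a real aggregation response, empty buckets are simply omitted here.

```python
import math
from collections import Counter


def histogram(values, interval, offset=0):
    """bucket_key = floor((value - offset) / interval) * interval + offset"""
    keys = (
        math.floor((value - offset) / interval) * interval + offset
        for value in values
    )
    counts = Counter(keys)
    return [{"key": key, "doc_count": counts[key]} for key in sorted(counts)]


prices = [3, 7, 12, 18, 25]  # hypothetical numeric field values
buckets = histogram(prices, interval=10)
for bucket in buckets:
    print(bucket["key"], bucket["doc_count"])
```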
Analysis
Analyzer
- 1 tokenizer (e.g. whitespace)
- Token filters
- stop words
- stemming
- ...
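The analyzer pipeline above (one tokenizer followed by a chain of token filters) can be mimicked in a few lines. This is a toy sketch: the stop word list is an arbitrary sample, and the "stemmer" merely strips a trailing "s", which is far cruder than the stemmers shipped with Elasticsearch.

```python
STOP_WORDS = {"a", "an", "and", "the", "of", "in"}  # arbitrary sample list


def whitespace_tokenizer(text):
    return text.split()


def lowercase_filter(tokens):
    return [token.lower() for token in tokens]


def stop_filter(tokens):
    return [token for token in tokens if token not in STOP_WORDS]


def toy_stem_filter(tokens):
    # Toy rule: drop a trailing "s" from tokens longer than 3 characters.
    return [
        token[:-1] if len(token) > 3 and token.endswith("s") else token
        for token in tokens
    ]


def analyze(text):
    """Run the tokenizer, then each token filter in order."""
    tokens = whitespace_tokenizer(text)
    for token_filter in (lowercase_filter, stop_filter, toy_stem_filter):
        tokens = token_filter(tokens)
    return tokens


print(analyze("The Towers of Pisa and Lucca"))
```

The order of the filters matters: lowercasing runs before stop word removal so that "The" is matched against the lowercase list.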
Scripting
ES supports different scripting languages:
▷ Painless (the fastest, recommended; compatible with Java)
▷ Groovy
▷ Javascript
▷ Python
▷ …
▷ You can also write your scripts in Java
"script": {
  "lang": "...",
  "inline" | "id" | "file": "...",
  "params": { ... }
}
Recommendations
▷ Don’t return large result sets
▷ Avoid sparsity
▷ Avoid putting unrelated data in the same
index
▷ Use the same names for equivalent fields
(normalization)
▷ Disable swapping
Elastic4s: a Scala DSL API for ES
A Scala wrapper around standard ES Java API:
▷ Type safe concise DSL.
▷ Integration with standard Scala futures.
▷ Integration with Scala collections library.
▷ Leverages the built-in Java client.
▷ Provides reactive-streams implementation.
▷ Currently compatible with ES 5.1.1, the latest
version.
https://github.com/sksamuel/elastic4s
Elastic4s: connection to ES
import com.sksamuel.elastic4s.{ElasticClient, ElasticsearchClientUri}
import org.elasticsearch.common.settings.Settings

// Simple connection to a local ES node.
val client = ElasticClient.local

// Specify more configuration parameters before connecting.
val localSettings = Settings.settingsBuilder()
  .put("http.enabled", false)
  .put("path.home", "/var/elastic/")
val client2 = ElasticClient.local(localSettings.build)

// Connection to remote nodes.
val remoteSettings = Settings.settingsBuilder()
  .put("cluster.name", "myClusterName")
  .build()
val remoteClient = ElasticClient.remote(remoteSettings,
  ElasticsearchClientUri("elasticsearch://somehost:9300"))
https://www.elastic.co/guide/en/elasticsearch/reference/2.3/setup-configuration.html
Elastic4s: index creation
// Create a new dynamic index.
client.execute { create index "places" shards 3 replicas 2 }
// Create a new index specifying metadata for stored data.
client.execute {
  create index "places" mappings (
    "cities" as (
      "id" typed IntegerType,
      "name" typed StringType boost 4,
      "content" typed StringType analyzer StopAnalyzer
    )
  )
}.await // Operations are always asynchronous; call await to wait synchronously.
More info at https://github.com/sksamuel/elastic4s/blob/master/guide/createindex.md
Elastic4s: data indexing
// Index a new document.
client.execute {
  index into "places" / "cities" id "uk" fields (
    "name" -> "London",
    "country" -> "United Kingdom",
    "continent" -> "Europe",
    "status" -> "Awesome"
  )
}
▷ The ID, if not specified, is generated automatically by ES.
▷ Possibility to use bulk loads.
▷ Possibility to use case classes.
▷ DSL native support for routing, version, parent, timestamp and op
type. See http://www.elasticsearch.org/guide/reference/api/index_/
for more info.
Elastic4s: data queries
// Query for a generic term, paging the results.
search in "places"->"cities" query "paris" start 5 limit 10
// Search for docs matching “georgia” in the “state” field.
search in "places"->"cities" query { termQuery("state", "georgia") }
// Or make a complex boolean query.
search in "places"->"cities" query {
  bool {
    must(
      regexQuery("name", ".*cester"),
      termQuery("status", "Awesome")
    ) not (
      termQuery("weather", "hot")
    )
  }
}
* All queries are executed with the client.execute() method.
Elastic4s: data queries (2)
// Sorting results.
search in "places"->"cities" query "europe" sort (
  field sort "name",
  field sort "status"
)
// Sub-aggregations: getting for each country the most frequent terms
// in city descriptions.
val freqWords = client.execute {
  search in "places/cities" aggregations {
    aggregation terms "by_country" field "country" aggs {
      aggregation terms "frequent_words" field "content"
    }
  }
}
* Query results come as an array of SearchHit instances, which contain things like the
id, index, type, version, etc.
Elastic4s: get data directly
// Request a specific document.
client.execute {
  get id "coldplay" from "bands" / "rock"
}
// Or request a set of known documents.
client.execute {
  multiget(
    get id "coldplay" from "bands/rock",
    get id "keane" from "bands/rock"
  )
}
Elastic4s: update and delete docs
// * Update a single document specifying its ID.
update(5).in("scifi/startrek").doc(
  "name" -> "spock",
  "race" -> "vulcan"
)
// * Delete a document by ID.
delete id "u2" from "bands/rock"
// * Delete documents by query.
delete from index "bands" types "rock" where
  termQuery("type", "rap")
* You can call these inside the client.execute() or client.bulk() methods.
Elastic4s: other available features
▷ Bulk operations: indexing, deleting,
updating.
▷ Helpers to reindex all data.
▷ Full access to the Java API to cover
operations unavailable in the DSL.
▷ Reactive-streams support for publisher
and subscriber operations.
ES: Apache Spark integration
▷ Apache Spark is a fast and
general-purpose cluster computing
system.
▷ ES provides native Spark support since
version 2.1:
○ With classic RDD APIs.
○ With Spark SQL support and the DataFrame
abstraction.
ES and Apache Spark: from RDD to ES and back
import org.elasticsearch.spark._ // adds saveToEs and esRDD; sc is an existing SparkContext

// Define a case class.
case class Trip(departure: String, arrival: String)
// Create data to save into ES.
val upcomingTrip = Trip("London", "New York")
val lastWeekTrip = Trip("Rome", "Madrid")
// Create a Spark RDD from the previous data and save it to ES.
val rdd = sc.makeRDD(Seq(upcomingTrip, lastWeekTrip))
rdd.saveToEs("spark/docs")
// Read data matching the given query as an RDD.
val rdd2 = sc.esRDD("spark/docs", "?q=London")
JSON data can be read directly with the esJsonRDD method and written with saveJsonToEs.
ES and Apache Spark: Spark SQL read and filters support
val sqlContext = new SQLContext...
// Read index as a DataFrame.
val df = sqlContext.read.format("org.elasticsearch.spark.sql").load("spark/trips")
df.printSchema()
// root
// |-- departure: string (nullable = true)
// |-- arrival: string (nullable = true)
// |-- days: long (nullable = true)
// Filter data.
val filter = df.filter(df("arrival").equalTo("OTP").and(df("days").gt(3)))
ES and Apache Spark: Spark SQL write support
import org.elasticsearch.spark.sql._ // adds saveToEs to DataFrame
import sqlContext.implicits._        // enables toDF(); sqlContext as created for reading

// Case class used to define the DataFrame.
case class Person(name: String, surname: String, age: Int)
// Create DataFrame.
val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1), p(2).trim.toInt))
  .toDF()
// Save DataFrame to ES.
people.saveToEs("spark/people")
Thanks!
Any questions?
You can find us at:
Fabio Del Vigna, fabio.delvigna@iit.cnr.it
Tiziano Fagni, tiziano.fagni@iit.cnr.it
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 

Elasticsearch, a distributed search engine with real-time analytics

  • 1. Elasticsearch, a distributed search engine with real-time analytics Pisa, 2 December 2016 Fabio Del Vigna, Tiziano Fagni Lab “WAFI” - “Cyber Intelligence” IIT-CNR, Pisa
  • 2. Talk outline
    ▷ Architectural overview
    ▷ Developer tools
    ▷ Demo time
  • 3. Architectural overview: main features and design of Elasticsearch
  • 4. What is Elasticsearch? "Elasticsearch (ES) is an open-source, broadly-distributable, readily-scalable, enterprise-grade search engine. Accessible through an extensive and elaborate API, Elasticsearch can power extremely fast searches that support your data discovery applications."
  • 5. ES terminology
    ▷ Cluster: identified by a unique name, which defaults to "elasticsearch". The name matters because a node can only join a cluster if it is configured with that cluster's name.
    ▷ Node: identified by a name, which defaults to a random Universally Unique IDentifier (UUID) assigned at startup. You can set any node name you like instead of the default.
    ▷ Index: a collection of documents with somewhat similar characteristics. For example, you can have one index for customer data, another for a product catalog, and yet another for order data. An index is identified by a name (which must be all lowercase), and this name is used to refer to the index when performing indexing, search, update, and delete operations against its documents. In a single cluster, you can define as many indexes as you want.
    ▷ Type: a logical category/partition of your index whose semantics are completely up to you. In general, a type is defined for documents that share a set of common fields.
    ▷ Document: the basic unit of information that can be indexed. A document is expressed in JSON.
  • 6. ES terminology (2)
    Elasticsearch lets you subdivide an index into multiple pieces called shards. Each shard is in itself a fully functional and independent "index" that can be hosted on any node in the cluster. Shards:
    ▷ allow you to horizontally split/scale your content volume
    ▷ allow you to distribute and parallelize operations (potentially on multiple nodes), thus increasing performance/throughput
    The number of shards and replicas can be defined per index when the index is created. After creation, you can change the number of replicas dynamically at any time, but you cannot change the number of shards.
    Each Elasticsearch shard is a Lucene index, and a single Lucene index can hold only a limited number of documents. As of LUCENE-5843, the limit is 2,147,483,519 (= Integer.MAX_VALUE - 128) documents. You can monitor shard sizes using the _cat/shards API.
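The per-shard limit above gives a lower bound on how many primary shards an index needs. The following Python sketch works through that arithmetic; the function name `min_primary_shards` is hypothetical, and the sketch assumes documents are spread evenly across shards, which real data may not be.

```python
# Per-shard document ceiling quoted on the slide: Integer.MAX_VALUE - 128.
JAVA_INT_MAX = 2**31 - 1
MAX_DOCS_PER_SHARD = JAVA_INT_MAX - 128


def min_primary_shards(total_docs: int) -> int:
    """Smallest shard count whose combined ceiling covers total_docs.

    Assumes a perfectly even spread of documents (an idealisation).
    """
    if total_docs < 0:
        raise ValueError("total_docs must be non-negative")
    # Integer ceiling division avoids floating-point rounding on large counts.
    return max(1, -(-total_docs // MAX_DOCS_PER_SHARD))


if __name__ == "__main__":
    print(MAX_DOCS_PER_SHARD)
    print(min_primary_shards(5_000_000_000))
```

Because the shard count is fixed at index creation, an estimate like this has to be made up front, with headroom for growth.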
  • 8. Versioning
    ▷ ES supports content versioning.
    ▷ Documents can be edited. ES keeps track of the version number (starting from 1).
    ▷ Version increments are atomic.
    ▷ ES may delete older versions as the index grows.
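The versioning behaviour can be pictured with a toy in-memory model. This is only an illustrative Python sketch, not how ES implements it: the class `VersionedStore`, the exception `VersionConflict` and the `expected_version` argument are hypothetical names, and only the latest version of each document is kept.

```python
class VersionConflict(Exception):
    """Raised when a write is based on a stale version."""


class VersionedStore:
    """Toy document store that tracks a version number per document."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def index(self, doc_id, source, expected_version=None):
        """Store a document and return its new version (1 on first write)."""
        current_version = self._docs.get(doc_id, (0, None))[0]
        # Optional check: refuse the write if the caller saw an older version.
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                f"expected version {expected_version}, found {current_version}"
            )
        new_version = current_version + 1
        self._docs[doc_id] = (new_version, dict(source))
        return new_version

    def get(self, doc_id):
        """Return (version, source) for a stored document."""
        return self._docs[doc_id]


if __name__ == "__main__":
    store = VersionedStore()
    print(store.index("1", {"name": "John Doe"}))
    print(store.index("1", {"name": "Jane Doe"}, expected_version=1))
```

A writer that passes the version it last read is rejected if someone else updated the document in between, which is the usual reason to track versions at all.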
  • 9. Using ES as primary data storage
    Key points:
    ▷ ES is an eventually consistent, near-real-time storage engine.
    ▷ ES is not ACID compliant like a relational database.
    ▷ ES does not support transactions.
    Currently, there are some (minor) issues related to resiliency: https://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html
  • 10. Using ES as primary data storage (2)
    ▷ You can use it as primary data storage (a lot of people already do!), but be prepared for the possibility of losing some of your data.
    ▷ You can mitigate the risk of data loss with backup snapshots.
    Alternative: use it together with another resilient DB (e.g. HBase) and, if needed, reindex all data.
    More info on ES resiliency: https://www.elastic.co/blog/resiliency-elasticsearch
  • 11. ES vs. Solr: main differences
    ES and Solr have comparable performance, similar documentation quality, and APIs available in the most common languages, but:
    ▷ ES loves REST!
    ▷ The ES Query DSL syntax is really flexible and powerful. Solr has nothing analogous.
    ▷ ES indexes data automatically, without requiring any explicit data mapping.
    ▷ ES cluster installation is easier and self-contained (no external tools required).
    ▷ ES has much better data analytics capabilities (e.g. aggregations).
  • 12. Requirements, installation, configuration
    ▷ Java 8 or later
    ▷ Installation
      > curl -L -O https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.0.0.tar.gz
      > tar -xvf elasticsearch-5.0.0.tar.gz
      > elasticsearch-5.0.0/bin/elasticsearch
    ▷ Monitoring
      > curl 'http://<machine IP>:9200/_cat/health?v'
      > curl 'http://<machine IP>:9200/_cat/nodes?v'
      > curl 'http://<machine IP>:9200/_cat/indices?v'
  • 13. Requirements, installation, configuration (2)
    ▷ The standard recommendation is to give 50% of the available memory to the Elasticsearch heap, while leaving the other 50% free. It won’t go unused; Lucene will happily gobble up whatever is left over.
    ▷ Try to avoid crossing the 32 GB heap boundary.
    ▷ Disable swap completely on your system:
      > sudo swapoff -a
    ▷ To run ES as a daemon:
      > ./bin/elasticsearch -d -p pid
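The two heap rules above combine into a simple calculation. A minimal Python sketch follows; the function name `recommended_heap_gb` is hypothetical, and the 31 GB cap is an assumed safety margin standing in for "stay below the 32 GB boundary", not a value taken from the slides.

```python
def recommended_heap_gb(total_ram_gb: float, heap_cap_gb: float = 31.0) -> float:
    """Half of the machine's RAM, capped so the heap stays under 32 GB.

    heap_cap_gb defaults to 31.0, an assumed margin below the boundary.
    """
    if total_ram_gb <= 0:
        raise ValueError("total_ram_gb must be positive")
    return min(total_ram_gb / 2, heap_cap_gb)


if __name__ == "__main__":
    for ram in (16, 64, 128):
        print(ram, "GB RAM ->", recommended_heap_gb(ram), "GB heap")
```

On machines with far more than 64 GB of RAM, the remainder is not wasted: as the slide notes, Lucene uses the leftover memory.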
  • 14. Requirements, installation, configuration (3)
    ▷ Configuration is stored in elasticsearch.yml.
    ▷ If you can’t disable swapping, lock the process memory by setting in the configuration file (the setting is called bootstrap.memory_lock in ES 5.0 and later; it was bootstrap.mlockall in earlier versions):
      > bootstrap.memory_lock: true
      To check that it took effect:
      > GET _nodes?filter_path=**.mlockall
    ▷ Allow the ES process to open many file descriptors (set the limit to 64000). Ubuntu ignores the limits.conf file for processes started by init.d. To enable the limits.conf file, edit /etc/pam.d/su.
    ▷ Development vs. production mode
      ○ Binding address
  • 15. Requirements, installation, configuration (4)
    Two configuration management tools are supported to ease deployment of large installations:
    ▷ Puppet
    ▷ Chef
  • 16. Developer tools: APIs and tools available to interact with Elasticsearch
  • 17. ES: easily write your own apps
    ▷ Provides APIs in several languages:
      ○ REST Web service
      ○ Java, Scala, Groovy for JVM languages
      ○ .NET
      ○ Ruby, Python, PHP as scripting languages
    We are mainly interested in the REST and Scala APIs.
  • 18. Rest API
    To list indices:
      > GET /_cat/indices?v
    Create the index:
      > PUT /customer?pretty
    Insert a document:
      > PUT /customer/external/1?pretty
        { "name": "John Doe" }
    ES automatically creates the index if it does not exist.
  • 19. Rest API
    Retrieve a document, given its index, type and ID:
      > GET /customer/external/1?pretty
    To delete an index:
      > DELETE /customer?pretty
    The golden rule:
      > <REST Verb> /<Index>/<Type>/<ID>
  • 20. Rest API
    A newly indexed document becomes searchable only after a short delay (1 s).
    Replace a document using its ID:
      > PUT /customer/external/1?pretty
        { "name": "Jane Doe" }
  • 21. Rest API
    If we don’t want to specify the ID during insertion, we use POST:
      > POST /customer/external?pretty
        { "name": "Jane Doe" }
  • 22. Rest API
    Update a document:
      > POST /customer/external/1/_update?pretty
        { "doc": { "name": "Jane Doe", "age": 20 } }
    Delete a document:
      > DELETE /customer/external/2?pretty
  • 23. Rest API
    To manage multiple documents at a time, use the _bulk API:
      > POST /customer/external/_bulk?pretty
        {"index":{"_id":"1"}}
        {"name": "John Doe" }
        {"index":{"_id":"2"}}
        {"name": "Jane Doe" }
    Different operations can be mixed:
      > POST /customer/external/_bulk?pretty
        {"update":{"_id":"1"}}
        {"doc": { "name": "John Doe becomes Jane Doe" } }
        {"delete":{"_id":"2"}}
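The _bulk body is newline-delimited JSON: one action line, optionally followed by one payload line, with a newline terminating the last line. A small Python sketch of how such a body could be assembled is shown here; the helper name `bulk_body` and its tuple-based input format are hypothetical, and sending the result to a cluster is left out.

```python
import json


def bulk_body(actions):
    """Serialise (action, metadata, payload) tuples into a _bulk request body.

    payload is None for actions such as "delete" that carry no second line.
    """
    lines = []
    for action, metadata, payload in actions:
        lines.append(json.dumps({action: metadata}))
        if payload is not None:
            lines.append(json.dumps(payload))
    # Every line, including the last one, must end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    body = bulk_body([
        ("index", {"_id": "1"}, {"name": "John Doe"}),
        ("update", {"_id": "1"}, {"doc": {"name": "Jane Doe"}}),
        ("delete", {"_id": "2"}, None),
    ])
    print(body, end="")
```

Building the body line by line, rather than dumping one JSON array, matches the format shown on the slide, where each action and each payload sits on its own line.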
  • 24. Rest API - Search
    There are two ways to query ES: with URI parameters...
      > GET /bank/_search?q=*&sort=account_number:asc
    ...or with a JSON request body:
      > GET /bank/_search
        { "query": { "match_all": {} }, "sort": [ { "account_number": "asc" } ] }
  • 25. Rest API - Search
    Look for documents whose address contains mill or lane:
      > GET /bank/_search
        { "query": { "match": { "address": "mill lane" } } }
  • 26. Rest API - Search
    Bool queries combine multiple conditions into a single, more complex query:
      > GET /bank/_search
        { "query": { "bool": {
            "must": [ { "match": { "age": "40" } } ],
            "must_not": [ { "match": { "state": "ID" } } ] } } }
  • 27. Rest API - Search
    Bool queries support filter clauses, which include or exclude documents without affecting their relevance scores. A filter clause combined with a range query is typically used for numeric or date filtering:
      > GET /bank/_search
        { "query": { "bool": {
            "must": { "match_all": {} },
            "filter": { "range": { "balance": { "gte": 20000, "lte": 30000 } } } } } }
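When queries are generated by an application, the request body is usually built as a data structure and serialised afterwards. The Python sketch below rebuilds the balance filter from slide 27; the function names are hypothetical, and `passes_range_filter` is only a local stand-in to show the yes/no nature of a filter, not ES code.

```python
import json


def balance_range_query(gte, lte):
    """Build the bool query from slide 27: match everything, filter by balance."""
    return {
        "query": {
            "bool": {
                "must": {"match_all": {}},
                "filter": {"range": {"balance": {"gte": gte, "lte": lte}}},
            }
        }
    }


def passes_range_filter(doc, field, gte, lte):
    """Local stand-in for a range filter: a yes/no test, no score involved."""
    return field in doc and gte <= doc[field] <= lte


if __name__ == "__main__":
    print(json.dumps(balance_range_query(20000, 30000), indent=2))
    print(passes_range_filter({"balance": 25000}, "balance", 20000, 30000))
```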
  • 28. Rest API
    ● term - searches for a single exact term; the input is not analyzed.
    ● match_phrase - analyzes the input with the analyzer defined for the queried field (if any) and finds documents that contain the resulting terms as a phrase.
    ● query_string - searches, by default, the _all field, which contains the text of several text fields at once. The query text is parsed and supports operators (AND/OR...), wildcards and so on (see the related syntax).
  • 29. Rest API - Aggregations
      > GET /bank/_search
        { "size": 0, "aggs": { "group_by_state": { "terms": { "field": "state.keyword" } } } }
    Equivalent to:
      SELECT state, COUNT(*) FROM bank GROUP BY state ORDER BY COUNT(*) DESC
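What the terms aggregation computes can be reproduced on a handful of in-memory documents. This Python sketch is a local imitation under stated assumptions: the function `terms_aggregation` and the sample documents are hypothetical, only the `key` and `doc_count` fields of each bucket are modelled, and ties between equal counts are not ordered the way ES would order them.

```python
from collections import Counter


def terms_aggregation(docs, field, size=10):
    """Group documents by a field and count them, largest group first.

    size mirrors the terms aggregation default of 10 buckets.
    """
    counts = Counter(doc[field] for doc in docs if field in doc)
    return [
        {"key": key, "doc_count": count}
        for key, count in counts.most_common(size)
    ]


if __name__ == "__main__":
    bank = [
        {"state": "TX"}, {"state": "TX"}, {"state": "ID"},
        {"state": "TX"}, {"state": "ID"}, {"state": "AL"},
    ]
    for bucket in terms_aggregation(bank, "state"):
        print(bucket)
```

Setting "size": 0 in the request on the slide suppresses the matching documents themselves, so only buckets like these are returned.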
  • 30. Rest API
    Most APIs that take an index parameter can run across multiple indices, using the simple test1,test2,test3 notation (or _all for all indices). They also support wildcards, for example test* or *test or te*t or *test*, and the ability to "add" (+) and "remove" (-) indices, for example +test*,-test3.
    Date math index name resolution lets you search a range of time-series indices, rather than searching all of your time-series indices and filtering the results, or maintaining aliases. A date math index name takes the following form:
      <static_name{date_math_expr{date_format|time_zone}}>
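Date math index names contain characters such as <, {, / and > that must be percent-encoded before they are placed in a request URI. A short Python sketch of the encoding step follows; the helper name `encode_date_math_index` is hypothetical, and the optional time_zone part of the syntax is left out for brevity.

```python
from urllib.parse import quote


def encode_date_math_index(static_name, date_math_expr, date_format=None):
    """Assemble <static_name{date_math_expr{date_format}}> and URI-encode it."""
    inner = date_math_expr
    if date_format is not None:
        inner = f"{date_math_expr}{{{date_format}}}"
    name = f"<{static_name}{{{inner}}}>"
    # safe="" forces "/" to be encoded too; it would otherwise split the path.
    return quote(name, safe="")


if __name__ == "__main__":
    print(encode_date_math_index("logstash-", "now/d"))
    print(encode_date_math_index("logstash-", "now/M", "YYYY.MM"))
```

The encoded value is then used in place of the index name, for example in GET /<encoded name>/_search.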
  • 31. Rest API
    All REST APIs accept a filter_path parameter that can be used to reduce the response returned by Elasticsearch. This parameter takes a comma-separated list of filters expressed in dot notation:
      > GET /_search?q=elasticsearch&filter_path=took,hits.hits._id,hits.hits._score
    You can control which parts of the documents are returned by modifying the _source fields.
  • 32. Rest API
    You can delete multiple documents using a query:
      > POST twitter/_delete_by_query
        { "query": { "match": { "message": "some message" } } }
  • 33. Rest API - Advanced
    Term vectors can be helpful to analyse text, build word clouds and evaluate the relevance of documents. ES supports them:
      > curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvectors?pretty=true'
  • 34. Rest API - Advanced
    ▷ You can control which shards are searched by providing the routing parameter. The routing value is used to dispatch documents across shards: documents with the same value are kept together on the same shard, which speeds up queries.
    ▷ Pagination can be achieved through the scroll operation, by setting the scroll parameter in the query. Scrolls should be explicitly cleared with the clear-scroll API as soon as they are no longer in use.
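Routing boils down to hashing the routing value and taking the result modulo the number of primary shards. The Python sketch below illustrates that idea only: `shard_for` is a hypothetical name, and zlib.crc32 is a stand-in hash chosen because it is deterministic and in the standard library, not the hash function ES itself uses.

```python
import zlib


def shard_for(routing: str, num_primary_shards: int) -> int:
    """Pick a shard as hash(routing) % num_primary_shards.

    zlib.crc32 is an assumed stand-in for the real hash function.
    """
    if num_primary_shards < 1:
        raise ValueError("need at least one primary shard")
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


if __name__ == "__main__":
    # The same routing value always lands on the same shard.
    print(shard_for("kimchy", 5), shard_for("kimchy", 5))
    # With a different shard count, many values would move to another shard.
    moved = sum(shard_for(f"user{i}", 5) != shard_for(f"user{i}", 6) for i in range(100))
    print(moved, "of 100 routing values change shard when going from 5 to 6 shards")
```

The modulo also shows why the number of primary shards is fixed at index creation: changing it would send existing routing values to different shards.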
  • 35. Rest API - Advanced
    Queries can be performed over different fields, types or indices. For example, you can query across all types...
      > GET /twitter/_search?q=user:kimchy
    ...or specify multiple types...
      > GET /twitter/tweet,user/_search?q=user:kimchy
    ...or indices:
      > GET /kimchy,elasticsearch/tweet/_search?q=tag:wow
    You can use _all to match all indices:
      > GET /_all/tweet/_search?q=tag:wow
      > GET /_search?q=tag:wow
  • 36. Rest API - Templates
    You can register search templates by storing them in the config/scripts directory, in files with the .mustache extension. To execute a stored template, reference it by its name under the file key:
      > GET /_search/template
        { "file": "storedTemplate", "params": { "query_string": "search for these words" } }
  • 37. Rest API - Count API
    ES offers a simple and fast interface to count documents:
      > PUT /twitter/tweet/1?refresh
        { "user": "kimchy" }
      > GET /twitter/tweet/_count?q=user:kimchy
    Response:
        { "count" : 1, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 } }
  • 38. Rest API - Aggregation Aggregation can be performed using different functions or scripts: ▷ avg ▷ cardinality (distinct values). The count is approximated above a threshold (configurable) ▷ geo ▷ max/min ▷ sum ▷ value count 38
  • 39. Rest API - Histogram Aggregation
    A multi-bucket, values-source-based aggregation that can be applied to numeric values extracted from the documents. It dynamically builds fixed-size (a.k.a. interval) buckets over the values.
    The Date Histogram Aggregation is a multi-bucket aggregation similar to the histogram, except that it can only be applied to date values. Since dates are represented internally in Elasticsearch as long values, it is possible to use the normal histogram on dates as well, though accuracy will be compromised. See also the date range aggregation, which lets you define intervals of dates.
        { "aggs" : { "articles_over_time" : { "date_histogram" : { "field" : "date", "interval" : "month" } } } }
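The fixed-size bucketing can be written down directly: each value is rounded down to the nearest multiple of the interval, and that multiple is the bucket key. The Python sketch below applies the rule to a small list of numbers; the function names are hypothetical, and unlike a real response it only lists buckets that contain at least one value.

```python
import math
from collections import Counter


def bucket_key(value, interval, offset=0):
    """Round value down to the start of its fixed-size bucket."""
    return math.floor((value - offset) / interval) * interval + offset


def histogram(values, interval):
    """Count values per bucket and return (bucket_key, doc_count) pairs in key order."""
    counts = Counter(bucket_key(v, interval) for v in values)
    return sorted(counts.items())


if __name__ == "__main__":
    prices = [1, 4, 5, 12, 14]
    for key, count in histogram(prices, 5):
        print(key, count)
```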
  • 40. Analysis Analyzer - 1 tokenizer (e.g. whitespace) - Token filters - stop words - stemming - ... 40
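The analyzer structure above, one tokenizer followed by a chain of token filters, can be mimicked in a few lines. This Python sketch is a deliberately naive illustration: every name in it is hypothetical, the stop-word list is a made-up sample, and the "stemmer" only strips a plural s, which is far cruder than a real stemming filter.

```python
STOP_WORDS = {"a", "an", "and", "the", "of", "is"}  # made-up sample list


def whitespace_tokenizer(text):
    """Tokenizer: split the text on whitespace."""
    return text.split()


def lowercase_filter(tokens):
    """Token filter: lowercase every token."""
    return [token.lower() for token in tokens]


def stop_filter(tokens):
    """Token filter: drop stop words."""
    return [token for token in tokens if token not in STOP_WORDS]


def naive_stem_filter(tokens):
    """Token filter: strip a plural 's' (a crude stand-in for stemming)."""
    return [
        token[:-1] if len(token) > 3 and token.endswith("s") and not token.endswith("ss")
        else token
        for token in tokens
    ]


def analyze(text, tokenizer, token_filters):
    """Run exactly one tokenizer, then each token filter in order."""
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


if __name__ == "__main__":
    filters = [lowercase_filter, stop_filter, naive_stem_filter]
    print(analyze("The Engines of Elasticsearch", whitespace_tokenizer, filters))
```

The order of the filters matters: lowercasing has to run before the stop-word filter here, otherwise "The" would not be recognised as a stop word.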
  • 41. Scripting
    ES supports different scripting languages:
    ▷ Painless (the fastest, recommended; compatible with Java)
    ▷ Groovy
    ▷ Javascript
    ▷ Python
    ▷ …
    ▷ You can build your script using Java
        "script": { "lang": "...", "inline" | "id" | "file": "...", "params": { ... } }
  • 42. Recommendations ▷ Don’t return large result sets ▷ Avoid sparsity ▷ Avoid putting unrelated data in the same index ▷ Use same names for fields (normalization) ▷ Disable swapping 42
  • 43. Elastic4s: a Scala DSL API for ES A Scala wrapper around standard ES Java API: ▷ Type safe concise DSL. ▷ Integration with standard Scala futures. ▷ Integration with Scala collections library. ▷ Leverages the built-in Java client. ▷ Provides reactive-streams implementation. ▷ Currently compatible with latest ES 5.1.1 version. https://github.com/sksamuel/elastic4s 43
  • 44. Elastic4s: connection to ES
    import com.sksamuel.elastic4s.{ElasticClient, ElasticsearchClientUri}
    import org.elasticsearch.common.settings.Settings
    // Simple connection to local ES node
    val client = ElasticClient.local
    // Specify more configuration parameters before connection.
    val settings = Settings.settingsBuilder()
      .put("http.enabled", false)
      .put("path.home", "/var/elastic/")
    val client2 = ElasticClient.local(settings.build)
    // Connection to remote nodes (separate names avoid redefining the vals above).
    val remoteSettings = Settings.settingsBuilder().put("cluster.name", "myClusterName").build()
    val remoteClient = ElasticClient.remote(remoteSettings, ElasticsearchClientUri("elasticsearch://somehost:9300"))
    More info at https://www.elastic.co/guide/en/elasticsearch/reference/2.3/setup-configuration.html
  • 45. Elastic4s: index creation
    // Create a new dynamic index.
    client.execute { create index "places" shards 3 replicas 2 }
    // Create a new index specifying metadata for stored data.
    client.execute {
      create index "places" mappings (
        "cities" as (
          "id" typed IntegerType,
          "name" typed StringType boost 4,
          "content" typed StringType analyzer StopAnalyzer
        )
      )
    }.await
    // Operations are always asynchronous, call await for sync wait.
    More info at https://github.com/sksamuel/elastic4s/blob/master/guide/createindex.md
  • 46. Elastic4s: data indexing
    // Index a new document.
    client.execute {
      index into "places" / "cities" id "uk" fields (
        "name" -> "London",
        "country" -> "United Kingdom",
        "continent" -> "Europe",
        "status" -> "Awesome"
      )
    }
    ▷ The ID, if not specified, is generated automatically by ES.
    ▷ Possibility to use bulk loads.
    ▷ Possibility to use case classes.
    ▷ DSL native support for routing, version, parent, timestamp and op type. See http://www.elasticsearch.org/guide/reference/api/index_/ for more info.
  • 47. Elastic4s: data queries
    // Query for generic term paging the results.
    search in "places"->"cities" query "paris" start 5 limit 10
    // Search for docs matching “georgia” in “state” field.
    search in "places"->"cities" query { termQuery("state", "georgia") }
    // Or make a complex boolean query.
    search in "places"->"cities" query {
      bool {
        must(
          regexQuery("name", ".*cester"),
          termQuery("status", "Awesome")
        ) not (
          termQuery("weather", "hot")
        )
      }
    }
    * All queries are executed with the client.execute() method.
  • 48. Elastic4s: data queries (2)
    // Sorting results.
    search in "places"->"cities" query "europe" sort (
      field sort "name",
      field sort "status"
    )
    // Sub-aggregations: getting for each country the most frequent terms
    // in city descriptions.
    val freqWords = client.execute {
      search in "places/cities" aggregations {
        aggregation terms "by_country" field "country" aggs {
          aggregation terms "frequent_words" field "content"
        }
      }
    }
    * Query results come as an array of SearchHit instances, which contain things like the id, index, type, version, etc.
  • 49. Elastic4s: get data directly
    // Request a specific document.
    client.execute { get id "coldplay" from "bands" / "rock" }
    // Or request a set of known documents.
    client.execute {
      multiget(
        get id "coldplay" from "bands/rock",
        get id "keane" from "bands/rock"
      )
    }
  • 50. Elastic4s: update and delete docs
    // * Update a single document specifying its ID.
    update(5).in("scifi/startrek").doc(
      "name" -> "spock",
      "race" -> "vulcan"
    )
    // * Delete document by ID.
    delete id "u2" from "bands/rock"
    // * Delete documents by query.
    delete from index "bands" types "rock" where termQuery("type", "rap")
    * You can call these inside the client.execute() or client.bulk() methods.
  • 51. Elastic4s: other available features
    ▷ Bulk operations: indexing, deleting, updating.
    ▷ Helpers to reindex all data.
    ▷ Full access to the Java API to cover operations unavailable in the DSL.
    ▷ Reactive-streams support for publisher and subscriber operations.
  • 52. ES: Apache Spark integration
    ▷ Apache Spark is a fast and general-purpose cluster computing system.
    ▷ ES provides native Spark support from version 2.1.
      ○ With classic RDD APIs.
      ○ With Spark SQL support and DataFrame abstraction.
  • 53. ES and Apache Spark: from RDD to ES and back
    // Brings saveToEs and esRDD into scope.
    import org.elasticsearch.spark._
    // Define a case class.
    case class Trip(departure: String, arrival: String)
    // Create data to save into ES.
    val upcomingTrip = Trip("London", "New York")
    val lastWeekTrip = Trip("Rome", "Madrid")
    // Create Spark RDD from previous data and save it to ES.
    val rdd = sc.makeRDD(Seq(upcomingTrip, lastWeekTrip))
    rdd.saveToEs("spark/docs")
    // Read data matching the given query as RDD.
    val rdd2 = sc.esRDD("spark/docs", "?q=London")
    JSON data can be read and written directly by using the esJsonRDD method.
  • 54. ES and Apache Spark: Spark SQL read and filters support
    val sqlContext = new SQLContext...
    // Read index as a DataFrame.
    val df = sqlContext.read.format("org.elasticsearch.spark.sql").load("spark/trips")
    df.printSchema()
    // root
    // |-- departure: string (nullable = true)
    // |-- arrival: string (nullable = true)
    // |-- days: long (nullable = true)
    // Filter data.
    val filter = df.filter(df("arrival").equalTo("OTP").and(df("days").gt(3)))
  • 55. ES and Apache Spark: Spark SQL write support
    // Needed for toDF() and for saveToEs on a DataFrame.
    import sqlContext.implicits._
    import org.elasticsearch.spark.sql._
    // Case class used to define the DataFrame.
    case class Person(name: String, surname: String, age: Int)
    // Create DataFrame.
    val people = sc.textFile("people.txt")
      .map(_.split(","))
      .map(p => Person(p(0), p(1), p(2).trim.toInt))
      .toDF()
    // Save DataFrame to ES.
    people.saveToEs("spark/people")
  • 56. Thanks! Any questions? You can find us at: Fabio Del Vigna, fabio.delvigna@iit.cnr.it Tiziano Fagni, tiziano.fagni@iit.cnr.it 56