Solr vs. Elasticsearch - Case by Case

Solr vs. Elasticsearch
Case by Case
Alexandre Rafalovitch @arafalov
@SolrStart
www.solr-start.com

Meet the FRENEMIES
Friends (common)
• Based on Lucene
• Full-text search
• Structured search
• Queries, filters, caches
• Facets/stats/enumerations
• Cloud-ready
Elasticsearch*
* Elasticsearch is a trademark of Elasticsearch BV,
registered in the U.S. and in other countries.
Enemies (differences)
• Download size
• AdminUI vs. Marvel
• Configuration vs. Magic
• Nested documents
• Chains vs. Plugins
• Types and Rivers
• OpenSource vs. Commercial
• Etc.

This used to be Solr (now in Lucene/ES)
• Field types
• Dismax/eDismax
• Many of the analysis filters (WordDelimiterFilter, Soundex, Regex, HTML, kstem, Trim…)
• Multi-valued field cache
• …. (source: http://heliosearch.org/lucene-solr-history/ )
• Disclaimer: Nowadays, Elasticsearch hires awesome Lucene hackers

Basically - sisters
Source: https://www.flickr.com/photos/franzfume/11530902934/
[Bar chart: Solr vs. Elasticsearch size at three stages (Download, Expanded, First run); axis 0 to 300]

Solr: Chubby or Rubenesque?
[Bar chart: size of Elasticsearch+plugins vs. Solr; axis 0.00 to 300.00; components: Code, Examples, Documentation, ES-Admin, ES-ICU, Extract/Tika, UIMA, Map-Reduce, Test Framework]

Elasticsearch setup
Source: https://www.flickr.com/photos/deborah-is-lola/6815624125/
• Admin UI:
bin/plugin -i elasticsearch/marvel/latest
• Tika/Extraction:
bin/plugin -install elasticsearch/elasticsearch-mapper-attachments/2.4.1
• ICU (Unicode components):
bin/plugin -install elasticsearch/elasticsearch-analysis-icu/2.4.1
• JDBC River (like DataImportHandler subset):
bin/plugin --install jdbc --url http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-river-jdbc/1.3.4.4/elasticsearch-river-jdbc-1.3.4.4-plugin.zip
• JavaScript scripting support:
bin/plugin -install elasticsearch/elasticsearch-lang-javascript/2.4.1
• On each node….
• Without dependency management (jars = rabbits)

Index a document - Elasticsearch
1. Set up an index/collection
2. Define fields and types
3. Index content (using Marvel Sense):
POST /test1/hello
{
"msg": "Happy birthday",
"names": ["Alex", "Mark"],
"when": "2014-11-01T10:09:08"
}
Alternative:
PUT /test1/hello/id1
{
"when": "2014-11-01T10:09:08"
}
An index, type and definitions are created automatically
So, where is our document:
GET /test1/hello/_search
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "test1",
"_type": "hello",
"_id": "AUmIk4LDF4XvfpxnVJ2g",
"_score": 1,
"_source": {
"names": [
"Alex",
"Mark"
],
"when": "2014-11-01T10:09:08"
}}
]
}}

Behind the scenes
…..
{
"_index": "test1",
"_type": "hello",
"_score": 1,
"_source": {
"names": [
"Alex",
"Mark"
],
"when": "2014-11-01T10:09:08"
}
….
GET /test1/hello/_mapping
{
"test1": {
"mappings": {
"hello": {
"properties": {
"msg": {
"type": "string"
},
"names": {
"type": "string"
},
"when": {
"type": "date",
"format": "dateOptionalTime"
}}}}}}

Basic search in Elasticsearch
…..
{
"_index": "test1",
"_type": "hello",
"_score": 1,
"_source": {
"names": [
"Alex",
"Mark"
],
"when": "2014-11-01T10:09:08"
}
….
• GET /test1/hello/_search?q=foobar – no results
• GET /test1/hello/_search?q=Alex – YES on names?
• GET /test1/hello/_search?q=alex – YES lower case
• GET /test1/hello/_search?q=happy – YES on msg?
• GET /test1/hello/_search?q=2014 – YES???
• GET /test1/hello/_search?q="birthday alex" – YES
• GET /test1/hello/_search?q="birthday mark" – NO
Issues:
1. Where are we actually searching?
2. Why do lower-case searches work?
3. What's so special about Alex?

All about _all and why strings are tricky
• By default, we search in the field _all
• What's an _all field in Solr terms?
<field name="_all" type="es_string" multiValued="true" indexed="true" stored="false"/>
<copyField source="*" dest="_all"/>
• And the default mapping for Elasticsearch "string" type is like:
<fieldType name="es_string" class="solr.TextField" multiValued="true" positionIncrementGap="0" >
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
• Elasticsearch equivalent to Solr's solr.StrField is:
{"type" : "string", "index" : "not_analyzed"}
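Why "birthday alex" matches as a phrase while "birthday mark" does not comes down to positionIncrementGap="0" on the _all equivalent: values copied in from different fields sit at adjacent token positions. The sketch below is a simplified model, not Lucene code. The regex tokenizer is only a rough stand-in for StandardTokenizer plus LowerCaseFilter, the function names are made up for illustration, and it assumes msg is copied into _all before names.

```python
import re

def tokenize(text):
    """Lowercase and split on non-alphanumerics (rough stand-in for the real analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def index_positions(values, position_increment_gap=0):
    """Map each term to the set of positions it occupies in a multivalued field."""
    positions = {}
    position = -1
    for i, value in enumerate(values):
        if i > 0:
            # Extra distance inserted between consecutive values of the field.
            position += position_increment_gap
        for term in tokenize(value):
            position += 1
            positions.setdefault(term, set()).add(position)
    return positions

def phrase_matches(positions, phrase):
    """True if the phrase terms occur at consecutive positions."""
    terms = tokenize(phrase)
    if not terms:
        return False
    return any(
        all(start + offset in positions.get(term, set())
            for offset, term in enumerate(terms))
        for start in positions.get(terms[0], set())
    )

# Assumed copy order into _all: msg first, then each value of names.
values = ["Happy birthday", "Alex", "Mark"]
gap0 = index_positions(values, position_increment_gap=0)
gap100 = index_positions(values, position_increment_gap=100)

print(phrase_matches(gap0, "birthday alex"))
print(phrase_matches(gap0, "birthday mark"))
print(phrase_matches(gap100, "birthday alex"))
```

With a gap of 0, "alex" lands directly after "birthday", so the phrase matches across the field boundary; "mark" is one position further away and does not. With a gap of 100 (as in the text_fr example) neither phrase would match.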

Can Solr do the same kind of magic?
• curl 'http://localhost:8983/solr/collection1/update/json/docs' -H 'Content-type: application/json' -d @msg.json
curl 'http://localhost:8983/solr/collection1/select'
{
"responseHeader":{
"status":0,
"QTime":18,
"params":{}},
"response":{"numFound":1,"start":0,"docs":[
{
"msg":["Happy birthday"],
"names":["Alex", "Mark"],
"when":["2014-11-01T10:09:08Z"],
"_id":"e9af682d-e775-42f2-90a5-c932b5fbb691",
"_version_":1484096406012559360}]
}}
curl 'http://localhost:8983/solr/collection1/schema/fields'
{
"responseHeader":{
"status":0,
"QTime":1},
"fields":[
{"name":"_all", "type":"es_string",
"multiValued":true,
"indexed":true, "stored":false},
{"name":"_id", "type":"string",
"multiValued":false,
"indexed":true, "required":true,
"stored":true, "uniqueKey":true},
{"name":"_version_", "type":"long",
"indexed":true, "stored":true},
{"name":"msg", "type":"es_string"},
{"name":"names", "type":"es_string"},
{"name":"when", "type":"tdates"}]}
• Output slightly reformatted

Nearly the same magic
<updateRequestProcessorChain name="add-unknown-fields-to-the-schema">

<processor class="solr.UUIDUpdateProcessorFactory" />
<processor class="solr.LogUpdateProcessorFactory"/>
<processor class="solr.DistributedUpdateProcessorFactory"/>
<processor class="solr.RemoveBlankFieldUpdateProcessorFactory"/>
<processor class="solr.ParseBooleanFieldUpdateProcessorFactory"/>
<processor class="solr.ParseLongFieldUpdateProcessorFactory"/>
<processor class="solr.ParseDoubleFieldUpdateProcessorFactory"/>
<processor class="solr.ParseDateFieldUpdateProcessorFactory">
<arr name="format">
<str>yyyy-MM-dd'T'HH:mm:ss</str>
<str>yyyyMMdd'T'HH:mm:ss</str>
</arr>
</processor>
<processor class="solr.AddSchemaFieldsUpdateProcessorFactory">
<str name="defaultFieldType">es_string</str>
<lst name="typeMapping">
<str name="valueClass">java.lang.Boolean</str>
<str name="fieldType">booleans</str>
</lst>
<lst name="typeMapping">
<str name="valueClass">java.util.Date</str>
<str name="fieldType">tdates</str>
</lst>
</processor>
<processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
Not quite the same magic:
• URP chain happens before copyField
• Date/Ints are converted first
• copyField converts content back to string
• _all field also gets copy of _id and _version_
• All auto-mapped fields HAVE to be multivalued
• No (ES-Style) types, just collections
• Unable to reproduce cross-field search
• Still rough around the edges
• Requires dynamic schema, so adding new types
becomes a challenge
• Auto-mapping is NOT recommended for production
• Dynamic fields solution is still more mature

Explicit mapping - Solr
• In schema.xml (or dynamic equivalent)
• Uses Java Factories
• Related content (e.g. stopwords) is usually in separate files (REST-managed resources were added recently)
• French example:
<fieldType name="text_fr" class="solr.TextField"
positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ElisionFilterFactory" ignoreCase="true"
articles="lang/contractions_fr.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="lang/stopwords_fr.txt" format="snowball" />
<filter class="solr.FrenchLightStemFilterFactory"/>
</analyzer>
</fieldType>

Explicit mapping - Elasticsearch
• Created through PUT command
• Also can be stored in config/default-mapping.json or
config/mappings/[index_name]
• Mappings for all types in one index should be compatible to avoid problems
• Usually uses predefined mapping names. Has many names, including for
languages
• Explicit mapping uses named cross-references, rather than an in-place stack duplicated in each field type (as in Solr)
• Related content is usually also in the definition. Sometimes in file (e.g.
stopwords_path – needs to be on all nodes)
• French example (next slide):

Explicit mapping – Elasticsearch - French
{
"settings": {
"analysis": {
"filter": {
"french_elision": {
"type": "elision",
"articles": [ "l", "m", "t", "qu",
"n", "s", "j", "d", "c", "jusqu", "quoiqu",
"lorsqu", "puisqu"
]
},
"french_stop": {
"type": "stop",
"stopwords": "_french_"
},
"french_keywords": {
"type": "keyword_marker",
"keywords": []
},
"french_stemmer": {
"type": "stemmer",
"language": "light_french"
}
},
….
"analyzer": {
"french": {
"tokenizer": "standard",
"filter": [
"french_elision",
"lowercase",
"french_stop",
"french_keywords",
"french_stemmer"
]
}
}
}
}
}

Default analyzer - Elasticsearch
Indexing
1. the analyzer defined in the field
mapping, else
2. the analyzer defined in the _analyzer
field of the document, else
3. the default analyzer for the type,
which defaults to
4. the analyzer named default in the
index settings, which defaults to
5. the analyzer named default at node
level, which defaults to
6. the standard analyzer
Query
1. the analyzer defined in the query
itself, else
2. the analyzer defined in the field
mapping, else
3. the default analyzer for the type,
which defaults to
4. the analyzer named default in the
index settings, which defaults to
5. the analyzer named default at node
level, which defaults to
6. the standard analyzer
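Both lists above are first-match fallback chains that end at the standard analyzer. A minimal sketch of that lookup order, with hypothetical function and parameter names (this is not the Elasticsearch API, only the precedence described on the slide):

```python
def first_defined(*candidates, fallback="standard"):
    """Return the first analyzer name that is set, else the fallback."""
    for name in candidates:
        if name is not None:
            return name
    return fallback

def index_time_analyzer(field_mapping=None, document_analyzer=None,
                        type_default=None, index_default=None,
                        node_default=None):
    # Order follows the "Indexing" list: field mapping, _analyzer field of
    # the document, type default, index default, node default, standard.
    return first_defined(field_mapping, document_analyzer, type_default,
                         index_default, node_default)

def query_time_analyzer(query=None, field_mapping=None, type_default=None,
                        index_default=None, node_default=None):
    # Order follows the "Query" list: the query itself comes first.
    return first_defined(query, field_mapping, type_default,
                         index_default, node_default)

print(index_time_analyzer())
print(index_time_analyzer(index_default="french"))
print(query_time_analyzer(query="whitespace", field_mapping="french"))
```

The practical consequence is that index-time and query-time analysis can diverge: a query-level analyzer overrides the field mapping at search time, while the document-level _analyzer field only applies when indexing.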

Index many documents – Elasticsearch
POST /test3/entries/_bulk
{ "index": {"_id": "1" } }
{"msg": "Hello", "names": ["Jack", "Jill"]}
{ "index": {"_id": "2" } }
{"msg": "Goodbye", "names": "Jason"}
{ "delete" : {"_id" : "3" } }
NOTE: Rivers (similar to DIH) MAY be deprecated.
Use Logstash instead (180 MB on disk, including 2 JRuby runtimes!)
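The _bulk body shown above is newline-delimited JSON: one action line, optionally followed by one source line, with a trailing newline at the end. A minimal sketch of building such a body from Python data follows; the helper name bulk_body and the tuple layout are assumptions made for this example, and the code only builds the string, it does not send it.

```python
import json

def bulk_body(actions):
    """Build a newline-delimited _bulk body.

    actions: iterable of (action, metadata, source) tuples, where source
    is None for actions that carry no document (such as delete).
    """
    lines = []
    for action, metadata, source in actions:
        lines.append(json.dumps({action: metadata}))
        if source is not None:
            lines.append(json.dumps(source))
    # The bulk format expects every line, including the last, to end with \n.
    return "\n".join(lines) + "\n"

actions = [
    ("index", {"_id": "1"}, {"msg": "Hello", "names": ["Jack", "Jill"]}),
    ("index", {"_id": "2"}, {"msg": "Goodbye", "names": "Jason"}),
    ("delete", {"_id": "3"}, None),
]
body = bulk_body(actions)
print(body, end="")
```

Each JSON object has to stay on a single line, which is why json.dumps is called without an indent argument.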

Index many documents - Solr
JSON - simple
[
{
"_id": "1",
"msg": "Hello",
"names": ["Jack", "Jill"]
},
{
"_id": "2",
"msg": "Goodbye",
"names": "Jason"
}
]
JSON – with commands
{
"add": { "doc": {
"_id": "1",
"msg": "Hello",
"names": ["Jack", "Jill"]
} },
"add": { "doc": {
"_id": "2",
"msg": "Goodbye",
"names": "Jason"
} },
"delete": { "_id":3 }
}
Also:
• CSV
• XML
• XML+XSLT
• JSON+transform (4.10)
• DataImportHandler
• Map-Reduce
External tools
• Logstash (owned by ES)

Comparing search - Search
• Same but different
• Same: vast majority of the features
come from Lucene
• Different: representation of search
parameters
• Solr: URL query with many – cryptic –
parameters
• Elasticsearch:
• Search lite: URL query with a
limited set of parameters (basic
Lucene query)
• Query DSL: JSON with multi-leveled
structure
[Diagram labels: Lucene Impl, ES only, Solr only]

Search compared – Simple searches
{
"when": "2014-11-01T10:09:08"
}
{
"msg": "Happy New Year",
"names": ["Jack", "Jill"],
"when": "2015-01-01T00:00:01"
}
{
"msg": "Goodbye",
"names": ["Jack", "Jason"],
"when": "2015-06-01T00:00:00"
}
Elasticsearch (Marvel Sense GET):
• /test1/hello/_search – all
• /test1/hello/_search?q=happy birthday Alex – 2
• /test1/hello/_search?q=names:Alex – 1
Solr (GET http://localhost:8983/solr/…):
• /collection1/select – all
• /collection1/select?q=happy birthday Alex – 2
• /collection1/select?q=names:Alex – 1

Search Compared – Query DSL
Elasticsearch
{
"query": {
"query_string": {
"fields": ["msg^5", "names"],
"query": "happy birthday Alex",
"minimum_should_match": "100%"
}
}
}
Solr
…/collection1/select
?q=happy birthday Alex
&defType=dismax
&qf=msg^5 names
&mm=100%
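The two requests above carry the same three pieces of information (query text, boosted fields, minimum match) in different shapes. The sketch below builds both from one set of inputs. The helper names are hypothetical, and the Solr side is URL-encoded because the spaces, ^ and % shown on the slide cannot be sent raw.

```python
import json
from urllib.parse import urlencode, parse_qs

def boosted(fields):
    """Render (name, boost) pairs as 'name^boost', or plain 'name' if boost is None."""
    return [name if boost is None else f"{name}^{boost}" for name, boost in fields]

def es_body(text, fields, mm):
    """Elasticsearch Query DSL body, as on the slide."""
    return {"query": {"query_string": {
        "fields": boosted(fields),
        "query": text,
        "minimum_should_match": mm,
    }}}

def solr_query_string(text, fields, mm):
    """URL-encoded Solr parameters for the dismax parser, as on the slide."""
    return urlencode({
        "q": text,
        "defType": "dismax",
        "qf": " ".join(boosted(fields)),
        "mm": mm,
    })

fields = [("msg", 5), ("names", None)]
es = es_body("happy birthday Alex", fields, "100%")
solr = solr_query_string("happy birthday Alex", fields, "100%")

print(json.dumps(es, indent=2))
print(solr)
print(parse_qs(solr))  # decoded again, to compare with the slide
```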

Search Compared – Query DSL - combo
Search for future entries about Jack. Return only the best one.
Elasticsearch
{
"size" : 1,
"query": {
"filtered": {
"query": {
"query_string": {
"query": "jack"
}},
"filter": {
"range": {
"when": {
"gte": "now"
}}}}}}
Solr
…/collection1/select
?q=jack
&fq=when:[NOW TO *]
&rows=1

Parent/Child structures
Elasticsearch
Inner objects
• Mapping: Object
• Dynamic mapping (default)
• NOT separate Lucene docs
• Map to flattened multivalued fields
• Search matches against value from ANY of inner objects
{
"followers.age": [19, 26],
"followers.name": ["alex", "lisa"]
}
Nested objects
• Mapping: nested
• Explicit mapping
• Lucene block storage
• Inner documents are hidden
• Cannot return inner docs only
• Can do nested & inner
Parent and Child
• Mapping: _parent
• Explicit references
• Separate documents
• In-memory join
• SLOW
Solr
Nested objects
• Lucene block storage
• All documents are visible
• Child JSON is less natural
Cloud deployment – quick take
1. General concepts are similar:
• Node discovery
• Sharding
• Replication
• Routing
2. Implementations are very, very different (layer above Lucene)
3. Solr uses Apache Zookeeper
4. Elasticsearch has its own algorithms
5. No time to discuss
6. Let's focus on the critical path: Node discovery/cloud-state management
7. Use a 3rd party analysis: Kyle Kingsbury's Jepsen tests

Jepsen test of Zookeeper
Use Zookeeper. It’s mature, well-designed, and battle-tested.

Jepsen test of Elasticsearch
If you are an Elasticsearch user (as I am): good luck.

Innovator’s dilemma
• Solr's usual attitude
• An amazingly useful product for many different uses
• And wants everybody to know it
• …Right in the collection1 example
• “You will need all this eventually, might as well learn it first”
• Elasticsearch is small and shiny (“trust us, the magic exists”)
• Elasticsearch + Logstash + Kibana => power-punch triple combo
• Especially when comparing to Solr (and not another commercial solution)
• Feature release process
• Elasticsearch: kimchy: “LGTM” (Looks good to me)
• Solr: full Apache process around it
• Solr – needs to buckle down and focus on onboarding experience
• Solr is getting better (e.g. listen to SolrCluster podcast of October 24, 2014)

Solr vs. Elasticsearch
Case by Case
Alexandre Rafalovitch
www.solr-start.com
@arafalov
@SolrStart
