Scaling Up Drupal 8 with NoSQL Databases

© 2019 Frédéric G. MARAND - licensed under a Creative Commons Attribution 4.0 International License.
Scaling up and accelerating Drupal 8 with NoSQL
Frédéric G. MARAND
drupal.org: fgm - irc/twitter: @osinet
<MongoDB module maintainer />

Topic ?
Simple idea: “No SQL”
● Alternate storage engines: KV, Structures, Document, Graph, Columnar…
● No standard, often no fixed schema, no joins, no FKs
● → Engine-specific application design
● Drupal architecture ?
Evolved idea: Not Only SQL
● For engines, add equivalent features to SQL
● For Drupal, combine SQL and NoSQL solutions
● Start from the default SQL-based architecture
● Offload services to non-SQL implementations
○ front-end caches, search engines, queue servers
○ specialized storage: cache, KV, lock, sessions…
● Often involves NoSQL as cache for SQL

NOSQL: do you need it ?
● Start by observing the current state
○ Database queries → devel + webprofiler
○ Cache → heisencache (D7), webprofiler (D8)
○ Build cacheability → renderviz
● Observe behaviour
○ Core observability built-in: DBTNG logging, cache decorators, QueryInterface for KV, config, content…
○ Monitoring module (400 sites) by Karan Poddar (Google SoC) and MD Systems
○ Add your choice of time-series store (e.g. Prometheus, InfluxDB) and UI (e.g. Grafana)
○ ⇨ Use it !
● You want to see this when it happens ⟶
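The "cache decorators" idea mentioned above can be sketched in a few lines. This is a language-neutral illustration in Python, not Drupal code: `DictCache` and `CountingCache` are hypothetical names, and a real site would feed the counters to its time-series store instead of printing them.

```python
class DictCache:
    """Stand-in cache backend (hypothetical); a real site would use Redis, Memcached, etc."""

    def __init__(self):
        self._data = {}

    def get(self, cid):
        return self._data.get(cid)

    def set(self, cid, value):
        self._data[cid] = value


class CountingCache:
    """Decorator: same get/set interface as the wrapped backend, plus hit/miss counters."""

    def __init__(self, inner):
        self.inner = inner
        self.hits = 0
        self.misses = 0

    def get(self, cid):
        value = self.inner.get(cid)
        if value is None:
            self.misses += 1
        else:
            self.hits += 1
        return value

    def set(self, cid, value):
        self.inner.set(cid, value)

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


cache = CountingCache(DictCache())
cache.get("node:1")             # miss: nothing stored yet
cache.set("node:1", "<html>")
cache.get("node:1")             # hit
cache.get("node:1")             # hit
print(cache.hits, cache.misses, round(cache.hit_ratio(), 2))
```

Because the decorator keeps the backend interface, it can be added or removed without touching the calling code.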

“If you can’t measure it, you can’t improve it.”
Peter Drucker

Fixing an identified problem is cheaper than “trying things”
Fix from acquired information
● It /MAY/ involve taking queries off the main DB to a NoSQL solution
● But poorly configured NoSQL may make it worse.

“Just do it” ?
● Drupal is built on SQL:
○ Views depends on it by default
○ Most sites rely on Views data model awareness
○ → Contrib often assumes SQL, injects @database
○ NoSQL support doable, rarely done
● Contrib support level is limited
○ Most NoSQL contrib not ported from D7 to D8
○ Drupal shop knowledge is limited, except at the biggest or specialized shops
○ Products may die… e.g. RethinkDB
● Pro support from publishers = costs. Availability.
● Extra support needed = costs
NoSQL == added build costs
→ balance gains vs costs
Example case: RethinkDB
At DevDays Milan 2016, after lots of work, Gizra’s @RoySegall
demoed a Drupal 8 ORM/ODM for RethinkDB.
Then, this happened...

http://www.commitstrip.com/en/2012/04/10/what-do-you-mean-its-oversized
Do you really need it ?

Caching ahead of real work
Default situation with SQL
● Browser caching, limited
● Internal / dynamic page cache in main SQL DB
● Need DB connection, a few SELECT queries
● Fetch cache from DB
● All data from main storage
● ⇨ Serve cached pages in about 20 msec
All this work makes DoS-ing comparatively cheap.
NoSQL improvements
● Add caching ahead of site itself
○ Browser
■ Optimized browser caching (Cache-Control)
■ PWA: use browser local storage
○ CDN
■ CDN module (2k sites)
■ Akamai module (600 sites)
■ ⇨ Serve cached pages in about 15 msec (TTFB)
■ Web-scale
○ Varnish and other reverse proxies
■ ⇨ Serve cached pages in about 10 msec (TTFB)
■ Core support
■ Varnish Purger (3k sites)
● ⇨ Most requests will mean 0 SQL queries
○ DoS-ing more costly, especially with CDN
● Move page caches off main DB: next section
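As a rough illustration of the "optimized browser caching" and reverse proxy points above, the sketch below builds a `Cache-Control` value with separate lifetimes for browsers (`max-age`) and shared caches such as Varnish or a CDN (`s-maxage`). The function name and the TTL values are assumptions chosen for the example, not settings from the talk.

```python
def cache_control(browser_ttl, proxy_ttl, private=False):
    """Build a Cache-Control header value; both TTLs are in seconds."""
    if private:
        # Per-user responses must never be stored by shared caches.
        return f"private, max-age={browser_ttl}"
    parts = ["public", f"max-age={browser_ttl}"]
    if proxy_ttl != browser_ttl:
        # s-maxage overrides max-age for shared caches only.
        parts.append(f"s-maxage={proxy_ttl}")
    return ", ".join(parts)


# Anonymous page: browsers keep it 5 minutes, the proxy or CDN keeps it a day.
print(cache_control(300, 86400))
# Authenticated page: only the user's own browser may cache it.
print(cache_control(0, 0, private=True))
```

A long shared-cache lifetime is only safe when the site can purge the proxy or CDN on content changes, which is what the purger modules listed above are for.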

Storage: the “Big 3”
The most active NoSQL suites for Drupal 8.x
Redis
● Type: Key-value (structure server)
● Module
○ redis
● DB-Engines ranking:
○ #1 Key-value store
● Usage
○ Drupal 7: 10k sites
○ Drupal 8: 10k sites
● Supported by
○ Drupal 7: Makina Corpus
○ Drupal 8: MD Systems
Memcached
● Type: Key-value
● Module
○ memcache
● DB-Engines ranking:
○ #3 Key-value store
○ #5 Key-value store (Hazelcast)
● Usage (memcache_storage)
○ Drupal 7: 32k (2k) sites
○ Drupal 8: 15k (800) sites
● Supported by:
○ Acquia
○ Tag1 Consulting
MongoDB / CosmosDB
● Type: Document store
● Module
○ mongodb
● DB-Engines ranking:
○ #1 Document store (MongoDB)
○ #4 Document store (CosmosDB)
● Usage
○ Drupal 7: 300 sites
○ Drupal 8: 50 sites
● Supported by
○ OSInet

Redis
https://www.drupal.org/project/redis
● Driver support
○ phpredis and predis both supported
● Supported Services
○ Driver adapter for custom code
○ Cache, including invalidations
○ Flood
○ Lock
○ Lock.Persistent
○ Queue
● CLI support
○ Not included
● Other modules
○ Redis Watchdog: logger + UI
Recent events (from @Berdir)
● Deadlock/race condition on node_list invalidations (#2966607) finally fixed in core 8.8.x with latest release
● php-redis 5.0 broke module, fixed in latest 8.x and 7.x releases
● Module users: please test and report !

Performance / scalability
Redis
https://www.drupal.org/project/redis
● Performance, single-server
○ Memory-only implementation
■ Usually among the fastest
■ Often the fastest
■ Even with concurrent access
○ Persistent
■ A bit slower even with just RDB
■ Slower with AOF
● Persistence, single instance
○ RDB:
■ compact snapshots, shippable off-site
■ data loss: since latest snapshot
○ AOF
■ up to last-second fsync’ed journal
■ less compact
● Fault-tolerance: Sentinel 2
○ master/slave supervision
○ automatic failover possible
○ observability support
● Scaling
○ Cluster-based sharding
○ Master → Slaves → Slaves
○ No strong consistency
○ Recommended config: 6 servers
● Cloud-native:
○ Redis Enterprise Cloud
○ AWS Elasticache, Azure, Google Memorystore
○ many others

Memcached
https://www.drupal.org/project/memcache
● Driver support
○ memcache extension (limited availability)
○ memcached extension
○ PHP ≥ 5.6
● Supported services
○ Cache, including invalidations
○ Lock
○ Lock.Persistent removed in #2995907
○ Sessions ported, then removed in 7.x
○ Monitoring UI
● CLI support
○ Not included: core commands
● Other module: memcache_storage
○ Cache with core SQL invalidations
○ No lock
○ Monitoring UI
Recent events (from @Berdir)
● Deadlock/race condition on node_list invalidations (#2966607) finally fixed in core 8.8.x with latest release, based on Redis fix.

● Performance, single-server
○ Memory-only implementation
■ Usually among the fastest
■ Slower than in-memory Redis
■ A bit faster than MySQL / MongoDB K/V
○ Persistence: extstore NVRAM support
■ No significant slowdown
■ Usually a bad idea (expectations)
■ https://memcached.org/blog/persistent-memory/
● Fault-tolerance
○ Module support for sharded clusters
○ Consistent hashing: avoids the thundering herd problem
○ Replication: with Hazelcast
Memcached
https://www.drupal.org/project/memcache
● Scaling
○ Cluster-based sharding
○ Consistent hashing allows elastic scaling
○ Recommended config: 2 instances per cluster, 1 cluster per bin, with some exceptions: usually 10-20 instances per D8 site
○ Some bins must stay in core (form, update)
● Monitoring
○ Instant: module-provided memcache_admin
○ Evolved: phpmemcacheadmin
● Cloud-native
○ AWS Elasticache
○ Azure Memcached Cloud
○ Google AppEngine Memcache
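Why consistent hashing allows elastic scaling can be shown with a small ring. This is a minimal sketch under stated assumptions: the `HashRing` class, the server names `mc1` to `mc4` and the 100 virtual points per server are invented for the illustration and are not how the memcache module or the PHP extensions implement it.

```python
import bisect
import hashlib


def _hash(value):
    """Map a string to a large integer position on the ring."""
    return int(hashlib.sha256(value.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, replicas=100):
        self.replicas = replicas   # virtual points per server, to even out load
        self._ring = []            # sorted list of (point, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self.replicas):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def node_for(self, key):
        # First ring point at or after the key's position, wrapping around.
        idx = bisect.bisect(self._ring, (_hash(key), "")) % len(self._ring)
        return self._ring[idx][1]


keys = [f"cache_render:{i}" for i in range(10000)]
ring = HashRing(["mc1", "mc2", "mc3"])
before = {k: ring.node_for(k) for k in keys}
ring.add("mc4")                    # scale out from 3 to 4 servers
after = {k: ring.node_for(k) for k in keys}

moved = sum(1 for k in keys if before[k] != after[k])
naive_moved = sum(1 for k in keys if _hash(k) % 3 != _hash(k) % 4)
print(f"consistent hashing remapped {moved / len(keys):.0%} of keys")
print(f"modulo hashing would remap {naive_moved / len(keys):.0%} of keys")
```

With the ring, only keys taken over by the new server lose their cached value; with plain modulo hashing most keys change server at once, which is what triggers a thundering herd on the database.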

Mainstream packages
MongoDB
https://www.drupal.org/project/mongodb
Drupal 7 features
● Driver support:
○ mongo extension for PHP 5.x
○ mongodb extension for PHP 7.x
○ MongoDB 2.x, 3.x
● Supported services
○ Block
○ Cache
○ Path
○ Queue
● Unsupported services
○ Field storage
○ Lock
○ (Session)
○ Watchdog = logger + UI
● Other modules
○ Views driver: EFQ Views
Drupal 8.x-2.x features
● Driver support
○ mongodb extension for PHP ≥ 7.1
○ mongodb/mongodb php driver
○ MongoDB 3.x, 4.x
● Supported services
○ Key-value (e.g. State)
○ Key-value expirable (e.g. *tempstore*, form_cache)
○ Watchdog = logger + UI
● CLI support
○ Drupal Console 1.9.x
○ Drush 9.x
● Other services
○ Entity/field storage
● Other modules
○ MongoDB Indexer

Exotic packages
MongoDB
Drupal 8.x-1.x
● Driver support:
○ mongo extension for PHP 5.x
○ MongoDB 3.x
● Supported services
○ Complete NoSQL distribution
○ @database implementation
○ No SQL DBMS needed
○ Unpatched Drupal core
● Status
○ Sponsored by MongoDB, led by chx
○ Development halted before Drupal 8.0.0
● Performance:
○ About 4x faster than equivalent Drupal core
Drumongous
● Driver support
○ mongo extension for PHP ≥ 5.6
○ MongoDB ≥ 3.6
● Supported services
○ Complete NoSQL distribution
○ @database implementation
● Source: patched Drupal core + module
○ https://gitlab.com/daffie/drumongous/
○ https://gitlab.com/daffie/mongodb
● CLI support
○ Drupal Console 1.x
○ Drush 9.x
● Status
○ Off-drupal.org
○ No issue queue
○ Active, led by daffie

Engine features
● Fault-tolerance
○ Built-in replication
○ Recommended config: 2+1 servers
● Scaling
○ Read-only replicas
○ Data-center awareness
○ Sharding
● Both supported by existing module
Monitoring / Ops
● In-module: logs
● Cloud: MongoDB Atlas, free monitoring, OpsManager
Cloud native
● Azure: CosmosDB
● MongoDB: Atlas
● mLab (née MongoLab)
MongoDB
Production example
Custom social network (2M users), migrated from MySQL:
MySQL slow queries: -85%, uncached content build time: -98%

Other NoSQL support modules
NoSQL Product    | Module                  | Wrapper | Features                                | 7.x | 8.x | Supported ?
Neo4J            | neo4j                   | Y       | -                                       | Y   | Y   | N
RethinkDB        | rethinkdb               | Y       | ORM                                     | N   | Y   | ?
CouchDB          | couchdb                 | Y       | Node export                             | Y   | N   | N
Couchbase        | couchbase               | Y       | Logger + UI                             | Y   | N   | ?
ElasticSearch    | elasticsearch_connector | Y       | Logger + improved UI, Statistics, Views | Y   | N   | Y
                 |                         |         | SearchAPI                               | Y   | Y   |
AWS DynamoDB     | dynamodb                | N       | Cache                                   | Y   | N   | ?
AWS SimpleDB     | awssdk, creeper         | Y       | -                                       | Y   | N   | ?
Riak             | riak_field_storage      | Y       | Field storage, map-reduce               | Y   | N   | unsupported
Apache Cassandra | cassandra               | Y       | Example app                             | 6.x | N   | unsupported
Tokyo Tyrant     | node/844354             | N       | Logger + UI                             | 6.x | N   | unapproved

NoSQL Sessions ?
● Why the weak/removed session support, especially for memcache ?
○ Memcache session support is built into the PHP memcached extension
○ It was popular in the Drupal 6.x era
○ It is popular in Symfony, even documented on symfony.com
○ So ?
● Experience
○ Session data
○ Instance restart → all sessions data on instance lost
○ Bigger session data saturating bin → evictions
○ LRU means vulnerability to DoS-ing and blocking admins via evictions
○ DB load is bigger in Drupal than most frameworks
■ Session DB load is a smaller part of load for us

Logs in core
The “SQL” problem
● All sites really need some sort of logging feature
● Smaller sites only have a database
○ ⇨ Database Logging default-enabled
● Code is not perfect, throws notices, errors
● Modules are verbose, log debug info
● “Drupal is too slow, please help, agency is stuck”
○ ⇨ Audit : 1500 inserts/min in watchdog table
○ ⇨ Other audits: watchdog > 99% of site size
● DBlog inserts compete with content work
● Owner disables logging
○ ⇨ now misses essential info
● Does not disable logging
○ ⇨ now can’t find essential info buried in noise
The core NoSQL module
● Core has been bundling a syslog client since 6.0
● Decouple logs from DB load
○ ⇨ No more SQL logs workload
● But where do they go ?
○ ⇨ Needs OS-level configuration
● How are logs cleaned ?
○ ⇨ Needs OS-level configuration
● Where is the UI ?
○ ⇨ Needs extra tools
● Solutions ?
○ D7 has logging hook
○ D8 has PSR-3 standard logging
○ ⇨ Contributions

NoSQL on-site logs
(mongodb|redis)_watchdog
● mongodb_watchdog
○ Logger service
■ Standard Drupal PSR-3 logs backend
■ Pre-storage filtering
■ Uses capped collections: auto-rotation, no ops
■ Dedicated database: zero contention
■ Per-request event tracing
○ Improved logs UI
■ Based on core UI
■ Groups recurring events on single line
■ Details page for occurrences
■ Per-HTTP-request log page
○ Most common reason to deploy MongoDB on D8
● redis_watchdog
○ Logger service
○ Logs UI based on core UI
○ Usage: 1 site
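Three of the mongodb_watchdog ideas above (pre-storage filtering, capped storage with automatic rotation, grouping recurring events) can be mimicked in memory. This is a sketch of the concepts, not the module's code: `CappedLog` is a hypothetical class, and a `deque` with `maxlen` stands in for a MongoDB capped collection.

```python
from collections import Counter, deque

# RFC 5424 severity levels: a lower number is more severe.
DEBUG, INFO, WARNING, ERROR = 7, 6, 4, 3


class CappedLog:
    def __init__(self, max_events, max_level=WARNING):
        # A bounded deque drops its oldest entry when full, like a capped collection.
        self.events = deque(maxlen=max_events)
        self.max_level = max_level

    def log(self, level, template, context=None):
        if level > self.max_level:
            return   # pre-storage filtering: noise is never written
        self.events.append((level, template, context or {}))

    def grouped(self):
        """Count occurrences per message template, one line per recurring event."""
        return Counter(template for _, template, _ in self.events)


log = CappedLog(max_events=3)
log.log(DEBUG, "Cache miss for @cid", {"@cid": "x"})      # filtered out
for nid in (1, 2, 3, 4):
    log.log(ERROR, "Node @nid failed to load", {"@nid": nid})

print(len(log.events))   # 3: the oldest of the four errors was rotated out
print(log.grouped())
```

Grouping by the unsubstituted template is what lets a UI show one line per recurring event with a details page for the individual occurrences.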

Off-site logs: BELK stack
BELK stack
● Beats (typically FileBeat)
● Elastic Search
● Logstash
● Kibana
Operation
● Drupal syslog → local syslog server → local logs
● DON’T log straight from Drupal
● Filebeat pulls logs, sends to Logstash
● Logstash massages logs, sends to ES
● ES provides storage, indexing
● Kibana provides UI
Deployment
● Hosted with site
● SaaS: Loggly, Logz.io, ...

Off-site logs: Graylog
Graylog
● Dual server: ES (logs, search) + MongoDB (meta, conf)
● Includes GROK log handling
● Accept syslog or GELF input
● Design inspired by Splunk
Operation
● Drupal syslog → local syslog server → local logs
● DON’T log straight from Drupal via monolog_gelf
● Local syslog forwards to Graylog2
● Graylog2 massages logs, sends to ES
● ES provides storage, indexing
● Graylog2 provides UI
Deployment
● Hosted with site
● SaaS: StackHero

(source: Graylog)
Off-site logs: BELK vs Graylog design

Non-SQL Logs: do I need them ?
● Small site, little traffic, single webmaster: just use dblog
● Any other site: upgrade to something else
○ Hosting company provides a logs dashboard (e.g. Splunk): use it
■ syslog into their stack, via local syslog then pull
○ Have an internal ops team ?
■ syslog into internal BELK or Graylog
○ No ops expertise ? No time to learn Kibana/Graylog ? Hosting company doesn’t provide real-time logs access ?
■ Want to minimize costs and/or have logs in-site ?
● use mongodb_watchdog
■ Otherwise, use SaaS logs vendor
● Datadog, Scalyr, Loggly or Papertrail (SolarWinds), Logz.io...

Queue API services
● Core: mostly for Batch API
● General D8 use: proxy invalidation
○ Invalidation queues
● Commerce sites
○ ERP links
○ Third-party catalog/inventory
● Media sites
○ Real time news feeds ingestion
○ Deferred derived media generation

Queue modules
SQL and NoSQL
SQL
● Core bundled: queue.database service
○ used by all Drupal sites
● advanced_queue project
○ created for Drupal Commerce projects
○ used by Commerce 2.x
NoSQL: storage-based
● Core bundled: queue.memory service
● Redis:
○ 7.x: redis_queue project
○ 8.x: redis project
● MongoDB
○ 7.x: mongodb project
NoSQL: message servers
● Beanstalkd
○ 6.x/7.x: popular, used by drupal.org itself
○ 8.x complete port, but no users (?)
● RabbitMQ
○ 7.x: little used, 8.x: most popular
○ Users include public TV, major French e-tailer
○ Hardened by production at these levels
● AWS SQS
○ 7.x: some use, but no 8.x port
● Apache Kafka
○ 8.x only
○ Created for largest French retail chain
● Other queue services
○ Less used: Gearman, IronMQ, 0MQ
○ No 8.x versions
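All these backends plug into the same contract: create an item, claim it for a lease time, then delete it on success or release it on failure. The sketch below mirrors the method names of Drupal's queue interface (createItem, claimItem, deleteItem, releaseItem, numberOfItems) in Python naming; `MemoryQueue` itself is a hypothetical in-memory stand-in, not a port of any module.

```python
import itertools
import time


class MemoryQueue:
    """In-memory queue with leases; a crashed worker's item becomes claimable again."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._ids = itertools.count(1)
        self._items = {}   # item_id -> [data, lease_expiry]

    def create_item(self, data):
        item_id = next(self._ids)
        self._items[item_id] = [data, 0.0]   # expiry 0.0 means "not claimed"
        return item_id

    def claim_item(self, lease_time=3600):
        now = self._clock()
        for item_id, entry in self._items.items():
            if entry[1] <= now:              # unclaimed, or lease expired
                entry[1] = now + lease_time
                return item_id, entry[0]
        return None

    def delete_item(self, item_id):
        self._items.pop(item_id, None)       # work done: remove for good

    def release_item(self, item_id):
        if item_id in self._items:
            self._items[item_id][1] = 0.0    # work failed: make it claimable

    def number_of_items(self):
        return len(self._items)


queue = MemoryQueue()
queue.create_item({"purge": "node:1"})
queue.create_item({"purge": "node:2"})

claimed = queue.claim_item(lease_time=30)
second = queue.claim_item(lease_time=30)   # skips the item already leased
queue.delete_item(claimed[0])              # first job succeeded
queue.release_item(second[0])              # second job failed, retry later
retry = queue.claim_item(lease_time=30)
print(queue.number_of_items(), retry)
```

Since callers only see this contract, moving from the database queue to Redis, RabbitMQ or Kafka is a service swap and not a rewrite of the workers.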

Queue API modules by usage D7/D8

NoSQL Queue: do I need it ?
● Mainstream Drupal site without Varnish / CDN
○ probably not, advancedqueue is still a nice improvement though
● Content site with a lot of generated content, Varnish and/or CDN
○ consider using Redis (D8), MongoDB (D7), RabbitMQ (D8)
○ or use Kafka (D8) if you need to (e.g. corporate mandate)
● Drupal Commerce standalone
○ advancedqueue is normally enough
● Site generating lots of dynamic media (image, video, sound) ...or ingesting fast feeds (> 1 item/sec)
○ need a dedicated message server

NoSQL Queue: which should I use ?
● The one your ops team supports best
○ Content management has a low event rate (< 1 event/sec)
● Kafka-class is for high-throughput queues
○ Think LinkedIn, Twitter, Netflix, Spotify, Airbnb, Paypal…
● RabbitMQ is solid
○ usually well known and monitored
○ D8 driver used for years on Cyber Monday, Black Friday, Olympic games...
● Beanstalkd is simple
○ It “just works”
○ Good first queue when upgrading from DB

SQL-based search
● Search has long been the weakest core feature in Drupal
○ In spite of improvements with each version
● Relevant issues
○ Good recall, but bad precision
○ Multilingual support, but no language awareness
○ Low awareness of language inflections → preprocessing API
○ Limited ability to handle Asian (CJK) languages
○ Slow updates, cron-based pull mode
○ Indexing costs impacting site users
○ Indexed search for content only → search plugins
○ Other entity types limited to unindexed search by default
○ No support for restricted content search
● Useful complements: porterstemmer, snowball_stemmer
● SQL Alternative: Search API database search. Similar.
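"Good recall, but bad precision" can be made concrete with a small worked example. The numbers below are invented for the illustration: 4 relevant nodes exist, and the search returns 16 results that include all 4.

```python
def precision_recall(returned, relevant):
    """Precision: share of results that are relevant. Recall: share of relevant items found."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


relevant_nids = {1, 2, 3, 4}        # what the user actually wanted
returned_nids = set(range(1, 17))   # what the search returned: nids 1 to 16

precision, recall = precision_recall(returned_nids, relevant_nids)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Nothing relevant is missed (recall 1.00), but three results out of four are noise (precision 0.25), so the user still has to sift through the list.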

NoSQL search solutions
Cloud-based / SaaS
● SaaS offerings:
○ Algolia
○ Google CSE
● Drupal Hosting offerings (alphabetic order):
○ Acquia Search SOLR
○ Amazee.io SOLR
○ Pantheon SOLR
○ Platform.sh ElasticSearch / SOLR
On-site / near-site
● Core support: Search API (14% of D7, 16% of D8 sites)
● Standard solution:
○ Local SOLR
○ Multilingual search supported
● Alternatives:
○ Elastic Search → heart of BELK suite
○ Xunsearch: Xapian for Chinese
○ Xapian (8.x dev)
● D7 backends not on D8:
○ Elastic Search via Elastica
○ Google Search Appliance: killed by Google
○ MongoDB via MongoDB module
○ Sphinx
● Proprietary search engine publishers have custom, unpublished, non-GPL (!) Drupal modules

SQL and NoSQL search solutions by usage in D8

Non-core search: which should I use ?
● Any content deserves search
● SQL
○ Core for small content quantities
○ Search API DB backend used by drupal.org
● SaaS
○ For entry level: Algolia/Google = 0 recurring cost, near 0 set-up cost
○ Both perform better than core, but non-free
● Drupal PaaS have managed ES/SOLR
● Others: cost equilibrium
○ ES/SOLR have set-up costs and recurring costs of ownership (server load)
○ SaaS has lower set-up costs, but recurring fees
○ Core search has an opportunity cost

Best current practice: NoSQL in general
Drupal 8 core tries hard to be SQL-agnostic
● Every use of the DB goes through @database
○ So anything able to pass for a SQL engine may be used
○ The mongodb_dbtng, mongodb 8.x-1.x, and Drumongous projects do just that
● Even Views has a query plugin. Project efq_views (7.x, 8.x) supports NoSQL engines that way
● No service except “storage” services should receive databases
○ Write a storage service for your data, defining its interface
○ Write a SQL provider implementing it, receiving @database
○ Tag the service as “backend_overridable”
○ Core mostly does it, custom code should always do it.
● References:
○ https://www.drupal.org/project/drupal/issues/2302617
○ https://www.drupal.org/node/2306083
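The storage-service pattern above, reduced to its shape: business code depends on an interface, and the SQL provider is only one implementation of it. The sketch is in Python for brevity and every name in it (`BookmarkStorage`, the `bookmark` table, `recent_bookmarks`) is hypothetical. In Drupal the swap happens in the service container, not in a loop as here.

```python
import sqlite3
from abc import ABC, abstractmethod


class BookmarkStorage(ABC):
    """The storage interface: the only thing business code may depend on."""

    @abstractmethod
    def save(self, uid, url): ...

    @abstractmethod
    def load(self, uid): ...


class SqlBookmarkStorage(BookmarkStorage):
    """SQL provider: the only class that receives the database connection."""

    def __init__(self, connection):
        self.db = connection
        self.db.execute("CREATE TABLE IF NOT EXISTS bookmark (uid INTEGER, url TEXT)")

    def save(self, uid, url):
        self.db.execute("INSERT INTO bookmark (uid, url) VALUES (?, ?)", (uid, url))

    def load(self, uid):
        rows = self.db.execute(
            "SELECT url FROM bookmark WHERE uid = ? ORDER BY rowid", (uid,))
        return [row[0] for row in rows]


class DocumentBookmarkStorage(BookmarkStorage):
    """Non-SQL provider: one document per user, here a plain dict."""

    def __init__(self):
        self.docs = {}

    def save(self, uid, url):
        self.docs.setdefault(uid, []).append(url)

    def load(self, uid):
        return list(self.docs.get(uid, []))


def recent_bookmarks(storage, uid):
    """Business logic: knows the interface, never the engine."""
    return storage.load(uid)[-2:]


results = {}
providers = (("sql", SqlBookmarkStorage(sqlite3.connect(":memory:"))),
             ("document", DocumentBookmarkStorage()))
for name, storage in providers:
    for url in ("https://example.org/a", "https://example.org/b", "https://example.org/c"):
        storage.save(7, url)
    results[name] = recent_bookmarks(storage, 7)

print(results["sql"] == results["document"])
```

Code written this way never sees a connection object, so replacing the provider does not ripple into controllers, forms or plugins.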

Best current practice: MongoDB
● Connecting to MongoDB with 8.x-2.x
○ Using multiple databases ? Use @mongodb.client_factory
■ The client you get is a standard mongodb/mongodb Client instance
■ You have to handle topology
○ Using single database ? Use @mongodb.database_factory
■ The database you get is a standard mongodb/mongodb Database instance
■ Your DB topology is now configurable in settings
○ You probably don’t want to use Doctrine ODM, especially when interacting with Drupal data
● Designing a custom schema
○ Start from the queries, not from some canonicalization
○ For large scale data sets, consider:
■ Splitting live and archive data for sharding
■ Having a write DB and a read DB, and a CLI-based service between them - read about CQRS
○ Never use a monotonically increasing key for sharding
○ In most cases, joined data in lists don’t need to be as up-to-date as primary views
■ Embed “light” versions of dependent objects for lists, only use $lookup and DBRef joins on full datum view
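The shard key advice above can be checked with a simulation. This is a deliberately simplified model: 4 shards with fixed ranges of 25,000 ids each, where a real MongoDB cluster splits and migrates chunks dynamically. The effect on new writes is the same though, since the newest ids always land in the last range.

```python
import hashlib
from collections import Counter

SHARDS = 4
TOTAL_DOCS = 100_000


def range_shard(doc_id, chunk_size=25_000):
    """Ranged sharding on an auto-increment id: neighbours share a shard."""
    return min((doc_id - 1) // chunk_size, SHARDS - 1)


def hashed_shard(doc_id):
    """Hashed sharding: neighbouring ids are scattered across shards."""
    digest = hashlib.sha256(str(doc_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % SHARDS


# The 1000 most recent inserts, i.e. the current write load.
newest = range(TOTAL_DOCS - 999, TOTAL_DOCS + 1)

range_load = Counter(range_shard(i) for i in newest)
hashed_load = Counter(hashed_shard(i) for i in newest)
print("ranged on monotonic key:", dict(range_load))
print("hashed key:             ", dict(sorted(hashed_load.items())))
```

With the monotonic key one shard takes every write while the others sit idle, so adding shards adds storage but no write capacity.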

“Contribution is its own reward”
There, I said it !

Join us for contribution opportunities
Thursday, October 31, 2019
● Mentored Contribution: 9:00-18:00, Room: Europe Foyer 2
● First Time Contributor Workshop: 9:00-14:00, Room: Diamond Lounge
● General Contribution: 9:00-18:00, Room: Europe Foyer 2
#DrupalContributions

What did you think?
Locate this session at the DrupalCon Amsterdam website:
https://drupal.kuoni-congress.info/2019/program/
Take the Survey!
https://www.surveymonkey.com/r/DrupalConAmsterdam
