3. About me
• Software engineer / team lead
• Worked for outsourcing companies and product companies, and tried my hand at startups/freelancing
• 7+ years of production Java experience
• Fan of Agile methodologies, CSM
4. What?
• This presentation:
• covers the basics of caching and popular cache types
• explains the evolution from simple caches to distributed caches, and from distributed caches to IMDGs
• does not describe the use of NoSQL solutions for caching
• is not intended as a product comparison or as promotion of Hazelcast as the best solution
5. Why?
• to expand horizons regarding modern distributed architectures and solutions
• to share experience from my current project, where Infinispan was replaced with Hazelcast as the in-memory distributed cache solution
7. Agenda
2nd part:
• Hazelcast in a nutshell
• Hazelcast configuration
• Live demo sessions
• in-memory distributed cache
• write-through cache with Postgres as storage
• search in distributed cache
• parallel processing using the executor service and entry processor
• Infinispan vs. Hazelcast
• Best practices and personal recommendations
9. Why Software Caching?
• application performance:
• many concurrent users
• time and cost overhead of accessing application data stored in an RDBMS or file system
• database-access bottlenecks caused by too many simultaneous requests
10. So Software Caches
• improve response times by reducing data-access latency
• offload persistent storage by reducing the number of trips to data sources
• avoid the cost of repeatedly creating objects
• share objects between threads
• primarily benefit I/O-bound applications
12. But
• memory size
• is limited
• can become unacceptably large
• synchronization complexity
• keeping the cached data consistent with the original data in the data source
• durability
• correct cache invalidation
• scalability
13. Common Cache Attributes
• maximum size, e.g. quantity of entries
• cache algorithm used for invalidation/eviction, e.g.:
• least recently used (LRU)
• least frequently used (LFU)
• first in, first out (FIFO)
• eviction percentage
• expiration, e.g.:
• time-to-live (TTL)
• absolute/relative time-based expiration
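The LRU eviction attribute above can be sketched in plain Java with a `LinkedHashMap` in access order (the class name and size limit are illustrative, not from the talk):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache sketch: a LinkedHashMap kept in access order evicts
// the least recently used entry once the maximum size is exceeded.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxSize; // "maximum size" cache attribute

    public LruCache(int maxSize) {
        super(16, 0.75f, true); // accessOrder = true -> LRU iteration order
        this.maxSize = maxSize;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // called after each put; returning true evicts the LRU entry
        return size() > maxSize;
    }
}
```

A real cache would add TTL bookkeeping and thread safety on top; this only shows the eviction policy itself.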
15. Cache Aside Pattern
• application is responsible for reading and writing
from the storage and the cache doesn't interact
with the storage at all
• the cache is “kept aside” as a faster and more
scalable in-memory data store
Client
Cache
Storage
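A minimal cache-aside sketch in plain Java (class and method names are illustrative; the `storage` function stands in for an RDBMS read):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Cache-aside sketch: the application checks the cache first and, on a
// miss, loads from storage and populates the cache itself; the cache
// never talks to the storage.
public class CacheAside<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> storage; // stands in for the data source read

    public CacheAside(Function<K, V> storage) {
        this.storage = storage;
    }

    public V read(K key) {
        V value = cache.get(key);
        if (value == null) {            // cache miss
            value = storage.apply(key); // application reads the storage directly
            cache.put(key, value);      // ...and keeps the cache up to date itself
        }
        return value;
    }

    public void invalidate(K key) {
        // on a write, the application updates the storage and then
        // invalidates (or refreshes) the cached entry
        cache.remove(key);
    }
}
```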
16. Read-Through/Write-Through
• the application treats the cache as the main data store and reads/writes data from/to it
• the cache is responsible for reading this data from and writing it to the database
(diagram: client ↔ cache ↔ storage)
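A read-through/write-through sketch in plain Java, assuming the loader and writer hooks stand in for the database access (names are illustrative):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiConsumer;
import java.util.function.Function;

// Read-through/write-through sketch: the application talks only to the
// cache; the cache loads misses from the storage (read-through) and
// writes every update to the storage synchronously (write-through).
public class ReadWriteThroughCache<K, V> {
    private final Map<K, V> entries = new ConcurrentHashMap<>();
    private final Function<K, V> loader;   // read-through hook
    private final BiConsumer<K, V> writer; // write-through hook

    public ReadWriteThroughCache(Function<K, V> loader, BiConsumer<K, V> writer) {
        this.loader = loader;
        this.writer = writer;
    }

    public V get(K key) {
        // on a miss, the cache itself loads the value from the storage
        return entries.computeIfAbsent(key, loader);
    }

    public void put(K key, V value) {
        writer.accept(key, value); // storage updated inside the cache call
        entries.put(key, value);
    }
}
```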
17. Write-Behind Pattern
• modified cache entries are asynchronously
written to the storage after a configurable delay
(diagram: client ↔ cache ↔ storage, with delayed writes)
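A write-behind sketch in plain Java: the cache is updated synchronously and the storage write is scheduled after a configurable delay (names and the delay value are illustrative; real write-behind implementations also coalesce multiple updates to the same key, which this sketch omits):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BiConsumer;

// Write-behind sketch: a put updates the cache immediately and schedules
// the storage write after a configurable delay, off the caller's thread.
public class WriteBehindCache<K, V> {
    private final Map<K, V> entries = new ConcurrentHashMap<>();
    private final ScheduledExecutorService flusher = Executors.newSingleThreadScheduledExecutor();
    private final BiConsumer<K, V> writer; // eventual storage write
    private final long delayMillis;        // the "write-delay" knob

    public WriteBehindCache(BiConsumer<K, V> writer, long delayMillis) {
        this.writer = writer;
        this.delayMillis = delayMillis;
    }

    public void put(K key, V value) {
        entries.put(key, value); // cache sees the value immediately
        flusher.schedule(() -> writer.accept(key, value), delayMillis, TimeUnit.MILLISECONDS);
    }

    public V get(K key) {
        return entries.get(key);
    }

    public void close() {
        flusher.shutdown(); // pending delayed writes still run before termination
        try {
            flusher.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```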
18. Refresh-Ahead Pattern
• automatically and asynchronously reloads (refreshes) any recently accessed cache entry from the cache loader prior to its expiration
(diagram: client ↔ cache ↔ storage)
19. Cache Strategy Selection
RT/WT vs. cache-aside:
• RT/WT simplifies application code
• cache-aside may have blocking behavior
• cache-aside may be preferable when there are
multiple cache updates triggered to the same
storage from different cache servers
20. Cache Strategy Selection
Write-through vs. write-behind:
• write-behind caching may deliver considerably higher throughput and lower latency than write-through caching
• an implication of write-behind caching is that database updates occur outside of the cache transaction
• a write-behind transaction can conflict with an external update
32. Get in Distributed Cache
A get often must go over the network to another cluster node:
33. Put in Distributed Cache
Resolving a known limitation of the replicated cache:
34. Put in Distributed Cache
• the data is sent to a primary cluster node and, if the backup count is 1, to a backup cluster node
• modifications to the cache are not considered complete until all backups have acknowledged receipt of the modification, i.e. a slight performance penalty
• this overhead guarantees that data consistency is maintained and no data is lost
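In Hazelcast 3.x this trade-off is configured per map; a minimal XML fragment sketching the knob (the map name is illustrative):

```xml
<map name="customers">
  <!-- one synchronous backup: a put completes only after the
       backup node acknowledges receipt of the modification -->
  <backup-count>1</backup-count>
  <!-- asynchronous backups skip the acknowledgement, trading
       the consistency guarantee for lower put latency -->
  <async-backup-count>0</async-backup-count>
</map>
```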
35. Failover in Distributed Cache
Failover involves promoting backup data to be
primary storage:
36. Local Storage in Distributed Cache
Certain cluster nodes can be configured to store data, and others not to store data:
37. Distributed Cache
Pros:
• linear performance scalability for reads and
writes
• fault-tolerant
Cons:
• increased latency of reads (due to network
round-trip and serialization/deserialization
expenses)
38. Distributed Cache Summary
Distributed in-memory key/value stores support a simple set of “put” and “get” operations, and optionally read-through and write-through behavior for reading values from and writing them to an underlying disk-based storage such as an RDBMS
39. Distributed Cache Summary
Depending on the product, additional features such as:
• ACID transactions
• eviction policies
• replication vs. partitioning
• active backups
also became available as the products matured
41. Remote Cache
a cache that is located remotely and is accessed by one or more clients
42. Remote Cache
The majority of existing distributed/replicated cache solutions support two modes:
• embedded mode
• the cache instance is started within the same JVM as your application
• client-server mode
• a remote cache instance is started and clients connect to it using a variety of different protocols
44. Near Cache
a hybrid cache;
it typically fronts a distributed cache or a
remote cache with a local cache
45. Get in Near Cache
When an object is fetched from a remote node, it is put into the local cache, so subsequent requests are handled by the local node, retrieving from the local cache:
46. Near Cache
Pros:
• avoids network round-trips, so it is best used for read-only data
Cons:
• increases memory usage, since the near-cache items need to be stored in the memory of the member
• reduces consistency
49. In-memory Data Grid
In-memory distributed cache plus:
• the ability to co-locate computations with data in a distributed context and move computation to the data
• distributed MPP processing based on standard SQL and/or MapReduce, which allows effective computation over data stored in-memory across the cluster
50. IMDC vs. IMDG
• in-memory distributed caches were developed in response to a growing need for high availability of data
• in-memory data grids were developed in response to the growing complexities of data processing
51. IMDG in a nutshell
Adding distributed SQL and/or MapReduce-style processing required a complete re-thinking of distributed caches, as the focus shifted from pure data management to hybrid data and compute management
54. Hazelcast
The leading open source in-memory data grid;
a free alternative to proprietary solutions such as Oracle Coherence, VMware/Pivotal GemFire and Software AG Terracotta
55. Hazelcast Use-Cases
• scale your application
• share data across cluster
• partition your data
• balance the load
• send/receive messages
• process in parallel on many JVMs, i.e. MPP
56. Hazelcast Features
• dynamic clustering, backup, discovery,
fail-over
• distributed map, queue, set, list, lock,
semaphore, topic, executor service, etc.
• transaction support
• map/reduce API
• Java client for accessing the cluster
remotely
57. Hazelcast Configuration
• programmatic configuration
• XML configuration
• Spring configuration
Nuance:
It is very important that the configuration on all members of the cluster is exactly the same, whether you use the XML-based or the programmatic configuration.
60. Sample Application
Technologies:
• Spring Boot 1.0.1
• Hazelcast 3.2
• Postgres 9.3
Application:
• RESTful web service to get/put data from/to the cache
• RESTful web service to execute tasks in the cluster
• one instance of Hazelcast per application
* Some samples are not optimal and were created just to demonstrate usage of the existing Hazelcast API
61. Global Hazelcast Configuration
The global Hazelcast configuration is defined in a separate config in the common module. It contains a skeleton for the future Hazelcast instance as well as global configuration settings:
• instance configuration skeleton
• common properties
• group name and password
• TCP based network configuration
• join config
• multicast and TCP/IP config
• default distributed map configuration skeleton
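One way the skeleton above might look in Hazelcast 3.x XML — group name, password, addresses and values are placeholders, not the project's actual settings:

```xml
<hazelcast>
  <group>
    <!-- isolates this cluster from others on the same network -->
    <name>app-cluster</name>
    <password>app-pass</password>
  </group>
  <network>
    <port auto-increment="true">5701</port>
    <join>
      <!-- multicast disabled; TCP/IP join with well-known members -->
      <multicast enabled="false"/>
      <tcp-ip enabled="true">
        <member>10.0.0.1</member>
        <member>10.0.0.2</member>
      </tcp-ip>
    </join>
  </network>
  <!-- default distributed map configuration skeleton -->
  <map name="default">
    <backup-count>1</backup-count>
  </map>
</hazelcast>
```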
62. Hazelcast Instance
Each module that uses Hazelcast for distributed
cache should have its own separate Hazelcast
instance.
The “Hazelcast Instance” is a factory for creating
individual cache objects.
Each cache has a name and potentially distinct
configuration settings (expiration, eviction,
replication, and more).
Multiple instances can live within the same JVM.
63. Hazelcast Cluster Group
Groups are used in order to have multiple isolated
clusters on the same network instead of a single
cluster.
A JVM can host multiple Hazelcast instances (nodes). Each node can participate in only one group; it joins only its own group and does not interfere with others.
This is achieved with the group name and group password configuration properties.
64. Hazelcast Network Config
In our environment the multicast mechanism for joining the cluster is not supported, so only the TCP/IP cluster approach will be used.
In this case there should be one or more well-known members to connect to.
66. Hazelcast Map Store
• useful for reading and writing map entries from
and to an external data source
• one instance per map per node will be created
• word of caution: the map store should NOT call
distributed map operations, otherwise you
might run into deadlocks
67. Hazelcast Map Store
• map pre-population via the loadAllKeys method, which returns the set of all “hot” keys that need to be loaded for the partitions owned by the member
• write-through vs. write-behind via the “write-delay-seconds” configuration (0 means write-through; a larger value means write-behind with that delay)
• MapLoaderLifecycleSupport to be notified of lifecycle events, i.e. init and destroy
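The write-delay knob might be wired up like this in Hazelcast 3.x XML (the map name and the MapStore implementation class are hypothetical):

```xml
<map name="customers">
  <map-store enabled="true">
    <!-- hypothetical MapStore implementation on the classpath -->
    <class-name>com.example.CustomerMapStore</class-name>
    <!-- 0 = write-through (synchronous store call);
         >0 = write-behind, flushed after this many seconds -->
    <write-delay-seconds>5</write-delay-seconds>
  </map-store>
</map>
```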
69. Hazelcast Executor Service
• extends java.util.concurrent.ExecutorService, but is designed to be used in a distributed environment
• scaling up via thread pool size
• scaling out is automatic via the addition of new Hazelcast instances
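Because Hazelcast's IExecutorService keeps the java.util.concurrent contract, its API shape can be shown with a plain local pool (class and method names here are illustrative); with Hazelcast, `hz.getExecutorService("name")` returns the same interface but fans the tasks out across the cluster:

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;

// Local sketch of the ExecutorService contract: submit one Callable per
// input, then aggregate the Futures. "Scaling up" is the pool size.
public class ExecutorDemo {
    public static int sumSquares(List<Integer> inputs) {
        ExecutorService pool = Executors.newFixedThreadPool(4); // scale up via pool size
        try {
            List<Future<Integer>> results = pool.invokeAll(
                inputs.stream()
                      .map(n -> (Callable<Integer>) () -> n * n) // one task per input
                      .collect(Collectors.toList()));
            int sum = 0;
            for (Future<Integer> f : results) {
                sum += f.get(); // blocks until each task completes
            }
            return sum;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```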
70. Hazelcast Executor Service
• provides different ways to route tasks:
• any member
• specific member
• the member hosting a specific key
• all or subset of members
• supports execution callback
71. Hazelcast Executor Service
Drawbacks:
• the work queue has no high availability:
• each member creates local ThreadPoolExecutors with ordinary work queues that do the real work but are not backed up by Hazelcast
• the work queue is not partitioned:
• one member may have a lot of unprocessed work while another is idle
• no customizable load balancing
72. Hazelcast Features
More useful features:
• entry listener
• transactions support, e.g. local, distributed
• map reduce API out-of-the-box
• custom serialization/deserialization mechanism
• distributed topic
• clients
73. Hazelcast Missing Features
Missing useful features:
• updating the configuration in a running cluster
• load balancing for the executor service
75. Infinispan vs. Hazelcast
Pros:
• Infinispan:
• backed by a relatively large company (JBoss) for use in large distributed environments
• has been in active use for several years
• well-written documentation
• a lot of examples of different configurations as well as solutions to common problems
• Hazelcast:
• easy setup
• more performant than Infinispan
• simple node/cluster discovery mechanism
• relies on only 1 jar on the classpath
• brief documentation completed with simple code samples
76. Infinispan vs. Hazelcast
Cons:
• Infinispan:
• relies on JGroups, which has proven to be buggy, especially under high load
• configuration can be overly complex
• ~9 jars are needed in order to get Infinispan up and running
• code appears very complex and hard to debug/trace
• Hazelcast:
• backed by a startup based in Palo Alto and Turkey that just received Series A funding of $2.5M from Bain Capital Ventures
• customization points are fairly limited
• some exceptions can be difficult to diagnose due to poorly written exception messages
• still quite buggy
78. Best Practices
• each specific Hazelcast instance should have its own unique instance name
• each specific Hazelcast instance should have its own unique group name and password
• each specific Hazelcast instance should start on a separate port, according to predefined ranges
79. Personal Recommendations
• use the XML configuration in production, but don't use the spring:hz schema. Our Spring-based “lego bricks” approach for building the resulting Hazelcast instance is quite decent.
• don't use Hazelcast for local caches, as it was never designed for that purpose and always performs serialization/deserialization
• don't use library-specific classes; use common collections, e.g. ConcurrentMap, and you will be able to replace the underlying cache solution easily
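The last recommendation can be sketched like this: Hazelcast's IMap implements ConcurrentMap, so code written against the interface runs unchanged on a local map or on a distributed one (the class and keys below are illustrative):

```java
import java.util.concurrent.ConcurrentMap;

// Depends only on the ConcurrentMap interface, never on the provider:
// the same class works with a ConcurrentHashMap in tests and with a
// Hazelcast IMap (which implements ConcurrentMap) in production.
public class SessionRegistry {
    private final ConcurrentMap<String, String> sessions; // interface, not implementation

    public SessionRegistry(ConcurrentMap<String, String> sessions) {
        this.sessions = sessions;
    }

    public String register(String sessionId, String user) {
        // putIfAbsent keeps first-writer-wins semantics on any ConcurrentMap
        return sessions.putIfAbsent(sessionId, user);
    }

    public String find(String sessionId) {
        return sessions.get(sessionId);
    }
}
```

Swapping the cache solution then only changes the wiring that supplies the map, not the business code.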
80. Hazelcast Drawbacks
• still quite buggy
• poor documentation for more complex
cases
• enterprise edition costs money, but
includes:
• elastic memory
• JAAS security
• .NET and C++ clients