3. About me
• Software engineer / team lead
• Worked for outsourcing companies and product companies, and tried my hand at startups/freelancing
• 7+ years of production Java experience
• Fan of Agile methodologies, CSM
4. What?
• This presentation:
• covers the basics of caching and popular cache types
• explains the evolution from simple caches to distributed caches, and from distributed caches to IMDGs
• does not describe the use of NoSQL solutions for caching
• is not intended as a product comparison or as promotion of Hazelcast as the best solution
5. Why?
• to expand horizons regarding modern distributed architectures and solutions
• to share experience from my current project, where Infinispan was replaced with Hazelcast as the in-memory distributed cache solution
7. Agenda
2nd part:
• Hazelcast in a nutshell
• Hazelcast configuration
• Live demo sessions
• in-memory distributed cache
• write-through cache with Postgres as storage
• search in distributed cache
• parallel processing using the executor service and entry processor
• Infinispan vs. Hazelcast
• Best practices and personal recommendations
9. Why Software Caching?
• application performance:
• many concurrent users
• time and cost overhead of accessing application data stored in an RDBMS or file system
• database-access bottlenecks caused by too many simultaneous requests
10. So Software Caches
• improve response times by reducing data-access latency
• offload persistent storage by reducing the number of trips to data sources
• avoid the cost of repeatedly creating objects
• share objects between threads
• primarily benefit I/O-bound applications
12. But
• memory size
• is limited
• can become unacceptably large
• synchronization complexity
• keeping the cached data consistent with the original data in the data source
• durability
• correct cache invalidation
• scalability
13. Common Cache Attributes
• maximum size, e.g. quantity of entries
• cache algorithm used for invalidation/eviction, e.g.:
• least recently used (LRU)
• least frequently used (LFU)
• first in, first out (FIFO)
• eviction percentage
• expiration, e.g.:
• time-to-live (TTL)
• absolute/relative time-based expiration
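The LRU eviction attribute above can be sketched in plain Java with a `LinkedHashMap` in access order (the class name and size limit are illustrative, not from the talk):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache sketch: a LinkedHashMap kept in access order evicts
// the least recently used entry once the maximum size is exceeded.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxSize; // "maximum size" cache attribute

    public LruCache(int maxSize) {
        super(16, 0.75f, true); // accessOrder = true -> LRU iteration order
        this.maxSize = maxSize;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // called after each put; returning true evicts the LRU entry
        return size() > maxSize;
    }
}
```

A real cache would add TTL bookkeeping and thread safety on top; this only shows the eviction policy itself.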
15. Cache Aside Pattern
• application is responsible for reading and writing
from the storage and the cache doesn't interact
with the storage at all
• the cache is “kept aside” as a faster and more
scalable in-memory data store
Client
Cache
Storage
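A minimal cache-aside sketch in plain Java (class and method names are illustrative; the `storage` function stands in for an RDBMS read):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Cache-aside sketch: the application checks the cache first and, on a
// miss, loads from storage and populates the cache itself; the cache
// never talks to the storage.
public class CacheAside<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> storage; // stands in for the data source read

    public CacheAside(Function<K, V> storage) {
        this.storage = storage;
    }

    public V read(K key) {
        V value = cache.get(key);
        if (value == null) {            // cache miss
            value = storage.apply(key); // application reads the storage directly
            cache.put(key, value);      // ...and keeps the cache up to date itself
        }
        return value;
    }

    public void invalidate(K key) {
        // on a write, the application updates the storage and then
        // invalidates (or refreshes) the cached entry
        cache.remove(key);
    }
}
```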
16. Read-Through/Write-Through
• the application treats the cache as the main data store and reads/writes data from/to it
• the cache is responsible for reading this data from and writing it to the database
(diagram: client ↔ cache ↔ storage)
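A read-through/write-through sketch in plain Java, assuming the loader and writer hooks stand in for the database access (names are illustrative):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiConsumer;
import java.util.function.Function;

// Read-through/write-through sketch: the application talks only to the
// cache; the cache loads misses from the storage (read-through) and
// writes every update to the storage synchronously (write-through).
public class ReadWriteThroughCache<K, V> {
    private final Map<K, V> entries = new ConcurrentHashMap<>();
    private final Function<K, V> loader;   // read-through hook
    private final BiConsumer<K, V> writer; // write-through hook

    public ReadWriteThroughCache(Function<K, V> loader, BiConsumer<K, V> writer) {
        this.loader = loader;
        this.writer = writer;
    }

    public V get(K key) {
        // on a miss, the cache itself loads the value from the storage
        return entries.computeIfAbsent(key, loader);
    }

    public void put(K key, V value) {
        writer.accept(key, value); // storage updated inside the cache call
        entries.put(key, value);
    }
}
```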
17. Write-Behind Pattern
• modified cache entries are asynchronously
written to the storage after a configurable delay
(diagram: client ↔ cache ↔ storage, with delayed writes)
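A write-behind sketch in plain Java: the cache is updated synchronously and the storage write is scheduled after a configurable delay (names and the delay value are illustrative; real write-behind implementations also coalesce multiple updates to the same key, which this sketch omits):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BiConsumer;

// Write-behind sketch: a put updates the cache immediately and schedules
// the storage write after a configurable delay, off the caller's thread.
public class WriteBehindCache<K, V> {
    private final Map<K, V> entries = new ConcurrentHashMap<>();
    private final ScheduledExecutorService flusher = Executors.newSingleThreadScheduledExecutor();
    private final BiConsumer<K, V> writer; // eventual storage write
    private final long delayMillis;        // the "write-delay" knob

    public WriteBehindCache(BiConsumer<K, V> writer, long delayMillis) {
        this.writer = writer;
        this.delayMillis = delayMillis;
    }

    public void put(K key, V value) {
        entries.put(key, value); // cache sees the value immediately
        flusher.schedule(() -> writer.accept(key, value), delayMillis, TimeUnit.MILLISECONDS);
    }

    public V get(K key) {
        return entries.get(key);
    }

    public void close() {
        flusher.shutdown(); // pending delayed writes still run before termination
        try {
            flusher.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```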
18. Refresh-Ahead Pattern
• automatically and asynchronously reloads (refreshes) any recently accessed cache entry from the cache loader prior to its expiration
(diagram: client ↔ cache ↔ storage)
19. Cache Strategy Selection
RT/WT vs. cache-aside:
• RT/WT simplifies application code
• cache-aside may have blocking behavior
• cache-aside may be preferable when there are
multiple cache updates triggered to the same
storage from different cache servers
20. Cache Strategy Selection
Write-through vs. write-behind:
• write-behind caching may deliver considerably higher throughput and lower latency than write-through caching
• an implication of write-behind caching is that database updates occur outside of the cache transaction
• a write-behind transaction can conflict with an external update
32. Get in Distributed Cache
A get often must go over the network to another cluster node:
33. Put in Distributed Cache
Resolving a known limitation of the replicated cache:
34. Put in Distributed Cache
• the data is sent to a primary cluster node and, if the backup count is 1, to a backup cluster node
• modifications to the cache are not considered complete until all backups have acknowledged receipt of the modification, i.e. a slight performance penalty
• this overhead guarantees that data consistency is maintained and no data is lost
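In Hazelcast 3.x this trade-off is configured per map; a minimal XML fragment sketching the knob (the map name is illustrative):

```xml
<map name="customers">
  <!-- one synchronous backup: a put completes only after the
       backup node acknowledges receipt of the modification -->
  <backup-count>1</backup-count>
  <!-- asynchronous backups skip the acknowledgement, trading
       the consistency guarantee for lower put latency -->
  <async-backup-count>0</async-backup-count>
</map>
```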
35. Failover in Distributed Cache
Failover involves promoting backup data to be
primary storage:
36. Local Storage in Distributed Cache
Certain cluster nodes can be configured to store data, and others not to store data:
37. Distributed Cache
Pros:
• linear performance scalability for reads and
writes
• fault-tolerant
Cons:
• increased latency of reads (due to network
round-trip and serialization/deserialization
expenses)
38. Distributed Cache Summary
Distributed in-memory key/value stores support a simple set of “put” and “get” operations, and optionally read-through and write-through behavior for reading values from and writing them to an underlying disk-based storage such as an RDBMS
39. Distributed Cache Summary
Depending on the product, additional features such as:
• ACID transactions
• eviction policies
• replication vs. partitioning
• active backups
also became available as the products matured
41. Remote Cache
a cache that is located remotely and is accessed by one or more clients
42. Remote Cache
The majority of existing distributed/replicated cache solutions support two modes:
• embedded mode
• the cache instance is started within the same JVM as your application
• client-server mode
• a remote cache instance is started and clients connect to it using a variety of different protocols
44. Near Cache
a hybrid cache;
it typically fronts a distributed cache or a
remote cache with a local cache
45. Get in Near Cache
When an object is fetched from a remote node, it is put into the local cache, so subsequent requests are handled by the local node, retrieving from the local cache:
46. Near Cache
Pros:
• avoids network round-trips, so it is best used for read-only data
Cons:
• increases memory usage, since the near-cache items need to be stored in the memory of the member
• reduces consistency
49. In-memory Data Grid
In-memory distributed cache plus:
• the ability to co-locate computations with data in a distributed context and move computation to the data
• distributed MPP processing based on standard SQL and/or MapReduce, which allows effective computation over data stored in-memory across the cluster
50. IMDC vs. IMDG
• in-memory distributed caches were developed in response to a growing need for high availability of data
• in-memory data grids were developed in response to the growing complexities of data processing
51. IMDG in a nutshell
Adding distributed SQL and/or MapReduce-style processing required a complete re-thinking of distributed caches, as the focus shifted from pure data management to hybrid data and compute management
54. Hazelcast
The leading open source in-memory data grid;
a free alternative to proprietary solutions such as Oracle Coherence, VMware/Pivotal GemFire and Software AG Terracotta
55. Hazelcast Use-Cases
• scale your application
• share data across cluster
• partition your data
• balance the load
• send/receive messages
• process in parallel on many JVMs, i.e. MPP
56. Hazelcast Features
• dynamic clustering, backup, discovery,
fail-over
• distributed map, queue, set, list, lock,
semaphore, topic, executor service, etc.
• transaction support
• map/reduce API
• Java client for accessing the cluster
remotely
57. Hazelcast Configuration
• programmatic configuration
• XML configuration
• Spring configuration
Nuance:
It is very important that the configuration on all members of the cluster is exactly the same, whether you use the XML-based or the programmatic configuration.
60. Sample Application
Technologies:
• Spring Boot 1.0.1
• Hazelcast 3.2
• Postgres 9.3
Application:
• RESTful web service to get/put data from/to the cache
• RESTful web service to execute tasks in the cluster
• one instance of Hazelcast per application
* Some samples are not optimal and were created just to demonstrate usage of the existing Hazelcast API
61. Global Hazelcast Configuration
The global Hazelcast configuration is defined in a separate config in the common module. It contains a skeleton for the future Hazelcast instance as well as global configuration settings:
• instance configuration skeleton
• common properties
• group name and password
• TCP based network configuration
• join config
• multicast and TCP/IP config
• default distributed map configuration skeleton
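One way the skeleton above might look in Hazelcast 3.x XML — group name, password, addresses and values are placeholders, not the project's actual settings:

```xml
<hazelcast>
  <group>
    <!-- isolates this cluster from others on the same network -->
    <name>app-cluster</name>
    <password>app-pass</password>
  </group>
  <network>
    <port auto-increment="true">5701</port>
    <join>
      <!-- multicast disabled; TCP/IP join with well-known members -->
      <multicast enabled="false"/>
      <tcp-ip enabled="true">
        <member>10.0.0.1</member>
        <member>10.0.0.2</member>
      </tcp-ip>
    </join>
  </network>
  <!-- default distributed map configuration skeleton -->
  <map name="default">
    <backup-count>1</backup-count>
  </map>
</hazelcast>
```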
62. Hazelcast Instance
Each module that uses Hazelcast for distributed
cache should have its own separate Hazelcast
instance.
The “Hazelcast Instance” is a factory for creating
individual cache objects.
Each cache has a name and potentially distinct
configuration settings (expiration, eviction,
replication, and more).
Multiple instances can live within the same JVM.
63. Hazelcast Cluster Group
Groups are used in order to have multiple isolated
clusters on the same network instead of a single
cluster.
A JVM can host multiple Hazelcast instances (nodes). Each node can participate in only one group; it joins only its own group and does not interfere with others.
This is achieved with the group name and group password configuration properties.
64. Hazelcast Network Config
In our environment the multicast mechanism for joining the cluster is not supported, so only the TCP/IP cluster approach will be used.
In this case there should be one or more well-known members to connect to.
66. Hazelcast Map Store
• useful for reading and writing map entries from
and to an external data source
• one instance per map per node will be created
• word of caution: the map store should NOT call
distributed map operations, otherwise you
might run into deadlocks
67. Hazelcast Map Store
• map pre-population via the loadAllKeys method, which returns the set of all “hot” keys that need to be loaded for the partitions owned by the member
• write-through vs. write-behind via the “write-delay-seconds” configuration (0 means write-through; a larger value means write-behind with that delay)
• MapLoaderLifecycleSupport to be notified of lifecycle events, i.e. init and destroy
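The write-delay knob might be wired up like this in Hazelcast 3.x XML (the map name and the MapStore implementation class are hypothetical):

```xml
<map name="customers">
  <map-store enabled="true">
    <!-- hypothetical MapStore implementation on the classpath -->
    <class-name>com.example.CustomerMapStore</class-name>
    <!-- 0 = write-through (synchronous store call);
         >0 = write-behind, flushed after this many seconds -->
    <write-delay-seconds>5</write-delay-seconds>
  </map-store>
</map>
```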
69. Hazelcast Executor Service
• extends java.util.concurrent.ExecutorService, but is designed to be used in a distributed environment
• scaling up via thread pool size
• scaling out is automatic via the addition of new Hazelcast instances
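Because Hazelcast's IExecutorService keeps the java.util.concurrent contract, its API shape can be shown with a plain local pool (class and method names here are illustrative); with Hazelcast, `hz.getExecutorService("name")` returns the same interface but fans the tasks out across the cluster:

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;

// Local sketch of the ExecutorService contract: submit one Callable per
// input, then aggregate the Futures. "Scaling up" is the pool size.
public class ExecutorDemo {
    public static int sumSquares(List<Integer> inputs) {
        ExecutorService pool = Executors.newFixedThreadPool(4); // scale up via pool size
        try {
            List<Future<Integer>> results = pool.invokeAll(
                inputs.stream()
                      .map(n -> (Callable<Integer>) () -> n * n) // one task per input
                      .collect(Collectors.toList()));
            int sum = 0;
            for (Future<Integer> f : results) {
                sum += f.get(); // blocks until each task completes
            }
            return sum;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```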
70. Hazelcast Executor Service
• provides different ways to route tasks:
• any member
• specific member
• the member hosting a specific key
• all or subset of members
• supports execution callback
71. Hazelcast Executor Service
Drawbacks:
• the work queue has no high availability:
• each member creates local ThreadPoolExecutors with ordinary work queues that do the real work but are not backed up by Hazelcast
• the work queue is not partitioned:
• one member may have a lot of unprocessed work while another is idle
• no customizable load balancing
72. Hazelcast Features
More useful features:
• entry listener
• transactions support, e.g. local, distributed
• map reduce API out-of-the-box
• custom serialization/deserialization mechanism
• distributed topic
• clients
73. Hazelcast Missing Features
Missing useful features:
• updating the configuration in a running cluster
• load balancing for the executor service
75. Infinispan vs. Hazelcast
Pros:
• Infinispan:
• backed by a relatively large company (JBoss) for use in large distributed environments
• has been in active use for several years
• well-written documentation
• a lot of examples of different configurations as well as solutions to common problems
• Hazelcast:
• easy setup
• more performant than Infinispan
• simple node/cluster discovery mechanism
• relies on only 1 jar on the classpath
• brief documentation completed with simple code samples
76. Infinispan vs. Hazelcast
Cons:
• Infinispan:
• relies on JGroups, which has proven to be buggy, especially under high load
• configuration can be overly complex
• ~9 jars are needed in order to get Infinispan up and running
• code appears very complex and hard to debug/trace
• Hazelcast:
• backed by a startup based in Palo Alto and Turkey that just received Series A funding of $2.5M from Bain Capital Ventures
• customization points are fairly limited
• some exceptions can be difficult to diagnose due to poorly written exception messages
• still quite buggy
78. Best Practices
• each specific Hazelcast instance should have its own unique instance name
• each specific Hazelcast instance should have its own unique group name and password
• each specific Hazelcast instance should start on a separate port, according to predefined ranges
79. Personal Recommendations
• use the XML configuration in production, but don't use the spring:hz schema. Our Spring-based “lego bricks” approach for building the resulting Hazelcast instance is quite decent.
• don't use Hazelcast for local caches, as it was never designed for that purpose and always performs serialization/deserialization
• don't use library-specific classes; use common collections, e.g. ConcurrentMap, and you will be able to replace the underlying cache solution easily
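The last recommendation can be sketched like this: Hazelcast's IMap implements ConcurrentMap, so code written against the interface runs unchanged on a local map or on a distributed one (the class and keys below are illustrative):

```java
import java.util.concurrent.ConcurrentMap;

// Depends only on the ConcurrentMap interface, never on the provider:
// the same class works with a ConcurrentHashMap in tests and with a
// Hazelcast IMap (which implements ConcurrentMap) in production.
public class SessionRegistry {
    private final ConcurrentMap<String, String> sessions; // interface, not implementation

    public SessionRegistry(ConcurrentMap<String, String> sessions) {
        this.sessions = sessions;
    }

    public String register(String sessionId, String user) {
        // putIfAbsent keeps first-writer-wins semantics on any ConcurrentMap
        return sessions.putIfAbsent(sessionId, user);
    }

    public String find(String sessionId) {
        return sessions.get(sessionId);
    }
}
```

Swapping the cache solution then only changes the wiring that supplies the map, not the business code.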
80. Hazelcast Drawbacks
• still quite buggy
• poor documentation for more complex
cases
• enterprise edition costs money, but
includes:
• elastic memory
• JAAS security
• .NET and C++ clients