NoSQL initiative and its influences on social and
                        semantic Web

                               Stefan Prutianu, Stefan Ceriu
             Faculty of Computer Science, „Al. I. Cuza“ University, Iasi, Romania
                          { stefan.prutianu, stefan.ceriu}@info.uaic.ro



       Abstract. In this paper we describe NoSQL, a family of non-relational database
       technologies and products developed to address the problems that relational
       database management systems (RDMS) currently face: lack of true scalability,
       poor performance on high data volumes and low availability. Some of these
       products are already used in production and perform very well: Amazon’s
       Dynamo, Google’s Bigtable, Cassandra, etc. We also discuss how these
       systems influence application development in the social and semantic Web
       sphere.

       Keywords: NoSQL, distributed computing, distributed non-relational database,
       semantic Web, social Web, scalability




1 Introduction


   Modern relational database technologies have serious problems when it comes to
managing the huge volumes of data found today (eBay: 2 PB of data overall [2]).
These problems are scalability, performance and rigid schema design. [1]
   Vertical scaling (increasing the computational power of a single node) is just a
temporary solution until the data grows again beyond the storage limit.
   Horizontal scaling in a traditional relational database management system
(partitioning, sharding) means dividing the data into multiple databases according to
some application-specific boundaries. Splitting the data across multiple servers,
however, breaks the relationships stored within the database, which are the most
valuable property of a relational database, and it is not transparent to the
application’s business logic.
   Read slaves are a form of horizontal scaling used in RDMS (Relational Database
Management System): a read-only slave database replicates the master database,
every write is directed to the master and every read to one of the read slave
replicas. This is still not true scaling, because the master remains a single point
of failure.
   Large relational databases (multiple terabytes or petabytes in size) usually perform
slowly on complex queries because of the amount of data they have to scan and
because the design of these systems is disk-oriented and disk operations are time
consuming. [3]
   RDMS require that the database schema (tables, columns, relationships) be designed
before the data is used. In most cases such a schema will later need changes
(adding new features, adjusting or fine-tuning existing ones), but changing the
database schema is very hard in such systems: updating rows may lock them and is a
very time-consuming operation. [12]
   NoSQL is the common name for a set of new technologies, design practices and
open-source projects which address the problems that large-scale distributed
applications and platforms are facing: scalability, availability, performance and
fault tolerance. The NoSQL trend is not intended to replace the relational
database model; instead it proposes new solutions to problems that the traditional
database model cannot solve.
   This paper is structured as follows. Section 2 describes the NoSQL trend in detail
with its proposed solutions and results, Section 3 presents how NoSQL has influenced
application development in the social and semantic Web sphere and Section 4 concludes
our survey.




2 NoSQL


2.1 Overview

NoSQL proponents became more visible in early 2009, when they proposed distributed
database solutions for systems where the relational features of RDMS are not
needed. The inspiration for these came from the closed-source distributed databases
already in use at some large corporations, such as Dynamo from Amazon and Bigtable
from Google.
   These solutions, along with the open-source projects (Cassandra, Hypertable,
HBase, Redis), share a number of characteristics: they use key-value storage, they
run on a large number of machines, and their data is partitioned and shared among
these machines.
   Another common characteristic is that, in order to reach the desired level of
scalability, availability, performance and fault (partition) tolerance, the data
consistency requirement is relaxed. The reason is Eric Brewer’s CAP Theorem, which
states that in a distributed environment you cannot get Consistency, Availability
and Partition Tolerance at the same time [6], so most of these systems achieve a
particular form of weak consistency named eventual consistency.
   Consistency means that a system operates fully or not at all; in a distributed
environment, if an update is made on some node, all its replicas are updated before
any read from those replicas is performed. Consistency can be achieved by using
relational databases because they focus on ACID (Atomicity, Consistency, Isolation,
Durability) properties.
   Availability means that a system is always available to perform requested tasks.
   Partition Tolerance is the ability of a distributed system to keep working even
when a partition forms, that is, when one or more nodes are isolated from the others
due to network/communication failures.
   Eventual consistency is a specific form of weak consistency: if no new updates are
made to an object, eventually all reads will return the last updated value. [7] DNS
(Domain Name System) is a system that implements eventual consistency.

   Dynamo. Amazon’s Dynamo is a highly available key-value structured storage
system [4]. It was developed to meet Amazon’s needs for reliability and scaling.
Access to data is provided through a primary-key interface: get(key) and
put(key, context, object). Scalability and availability are achieved through a
combination of techniques: consistent hashing is used for data partitioning and
replication, consistency is facilitated by object versioning, consistency among
replicas during updates is maintained by a quorum technique and a decentralized
replica synchronization protocol, and a gossip-based protocol is used for failure
detection and membership updates.
Amazon’s engineers motivated their design by several observations: most of the
services their platform exposes store and retrieve data by primary key and thus do
not require the complex querying and management functionality of an RDMS; an RDMS
is costly to maintain; and with traditional storage models availability would be
sacrificed in favor of consistency. The Dynamo components for request coordination,
membership and failure detection, and the local persistence engine are all
implemented in Java. The local persistence component has a pluggable design and
uses engines such as BDB (Berkeley Database) Transactional Data Store, BDB Java
Edition, MySQL and an in-memory buffer with persistent backing storage. [4]

   Bigtable. Google’s Bigtable is a distributed storage system for managing
structured data, designed to be highly scalable. The system has proven its efficiency
in important Google applications: Personalized Search, Google Analytics,
Google Earth, Google Finance. Bigtable does not support a full relational data model;
instead it provides clients with a simple data model indexed by row, column and
timestamp. From the data model point of view Bigtable is a sparse, distributed,
persistent, multi-dimensional sorted map where each value in the map is an
uninterpreted array of bytes. Row keys are arbitrary strings, data is maintained in
lexicographic order by row key, and every read or write under a single row key is
an atomic operation regardless of the number of columns involved.
Columns are grouped in sets called column families, which usually contain
information of the same type. Timestamps are introduced because each cell can
contain multiple versions of the same data. The Bigtable API provides functions for
creating and deleting tables and column families, for reads and updates under a
particular key, and for other operations involving cluster management. A master
model is used to manage load balancing and fault tolerance. For internal persistence
Bigtable uses the SSTable file format (an immutable sorted file of key-value pairs)
in conjunction with GFS (Google File System). [5]
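
The data model described above (a sparse, sorted map indexed by row, column and
timestamp, with several versions per cell) can be illustrated with a minimal
in-memory Python sketch. It is not Bigtable's API: the class SparseSortedTable, its
method names and the sample row keys are hypothetical, and distribution, persistence
and row-level atomicity are left out.

```python
from bisect import insort


class SparseSortedTable:
    """Toy map: (row key, "family:qualifier", timestamp) -> bytes."""

    def __init__(self):
        self._rows = {}      # row key -> {column -> [(timestamp, value), ...]}
        self._row_keys = []  # row keys kept in lexicographic order

    def put(self, row, column, timestamp, value):
        if row not in self._rows:
            self._rows[row] = {}
            insort(self._row_keys, row)
        versions = self._rows[row].setdefault(column, [])
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)  # newest version first

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp` (newest overall if None)."""
        for ts, value in self._rows.get(row, {}).get(column, []):
            if timestamp is None or ts <= timestamp:
                return value
        return None

    def scan(self, start_row, end_row):
        """Yield (row, columns) for start_row <= row < end_row in sorted order."""
        for row in self._row_keys:
            if start_row <= row < end_row:
                yield row, self._rows[row]


table = SparseSortedTable()
table.put("com.example/index", "contents:html", 1, b"<html>v1</html>")
table.put("com.example/index", "contents:html", 2, b"<html>v2</html>")
table.put("com.example/about", "anchor:home", 1, b"About")
```

Only cells that were written occupy space, which is what makes the map sparse, and
keeping row keys sorted is what makes range scans over neighbouring keys cheap.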


2.2 Design patterns [8], [11]

API Model. Because the underlying data model can be considered a large
distributed hashtable (DHT), the basic API (Application Programming Interface) could
be:
- get(key) – extracts the value stored at the given key.
- put(key, value) – updates the value stored at the given key.
- delete(key) – removes the key and its associated value.
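
The three operations listed above can be sketched in Python as a thin client over a
few in-process dictionaries standing in for storage nodes. The class name TinyDHT,
the node count and the modulo-based placement are assumptions made for illustration
only; real systems replace the modulo step with the partitioning schemes described
later in this section.

```python
import hashlib


class TinyDHT:
    """Toy distributed hashtable: each 'node' is just a local dict."""

    def __init__(self, node_count=4):
        self._nodes = [dict() for _ in range(node_count)]

    def _node_for(self, key):
        # Hash the key and pick a node; modulo placement is a simplification.
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return self._nodes[int.from_bytes(digest, "big") % len(self._nodes)]

    def get(self, key):
        return self._node_for(key).get(key)

    def put(self, key, value):
        self._node_for(key)[key] = value

    def delete(self, key):
        self._node_for(key).pop(key, None)


dht = TinyDHT()
dht.put("user:42", {"name": "Ana"})
```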

Machine Infrastructure. The infrastructure for this kind of system is composed of
a large number of machines with commodity hardware connected through a network.
Each machine (physical node) has the same software configuration, but the
hardware characteristics may differ. Within each physical node a number of virtual
nodes are running.

Partition Schemes. Most large-scale distributed systems use a consistent hashing
technique because of its flexibility when the number of virtual nodes changes. When
nodes are added or removed, keys and data need to be redistributed, and consistent
hashing minimizes the amount of data that moves. In consistent hashing the key
space is finite: the output range of a hash function is treated as a fixed ring.
Both virtual node ids and data item keys take values in this circular space, and
the owner of a key is the first virtual node encountered when walking the ring
clockwise from that key. When a virtual node crashes, all the keys it owned are
adopted by its clockwise neighbor, so the rest of the virtual nodes on the ring are
not affected.
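
The ring lookup described above can be sketched in a few lines of Python. This is a
minimal illustration under stated assumptions: MD5 is used as the hash function, and
the names HashRing, ring_position and the vnode ids are hypothetical rather than
taken from any of the systems discussed.

```python
import bisect
import hashlib


def ring_position(label):
    """Map a node id or a data key to a point on the ring (the MD5 output range)."""
    return int.from_bytes(hashlib.md5(label.encode("utf-8")).digest(), "big")


class HashRing:
    def __init__(self):
        self._positions = []  # sorted ring positions of the virtual nodes
        self._owners = {}     # ring position -> virtual node id

    def add_node(self, vnode_id):
        pos = ring_position(vnode_id)
        bisect.insort(self._positions, pos)
        self._owners[pos] = vnode_id

    def remove_node(self, vnode_id):
        pos = ring_position(vnode_id)
        self._positions.remove(pos)
        del self._owners[pos]

    def owner(self, key):
        # First virtual node at or after the key's position, wrapping around the ring.
        idx = bisect.bisect_left(self._positions, ring_position(key))
        return self._owners[self._positions[idx % len(self._positions)]]


ring = HashRing()
for vnode in ("vnode-A", "vnode-B", "vnode-C"):
    ring.add_node(vnode)

keys = [f"item-{i}" for i in range(1000)]
before = {k: ring.owner(k) for k in keys}
ring.remove_node("vnode-B")  # simulate a crash of one virtual node
after = {k: ring.owner(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

After the removal only the keys previously owned by vnode-B change owner, and they
all move to the same clockwise neighbor; every other key keeps its placement.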

Data Replication. In order to achieve high availability and performance the same
data needs to be available on multiple nodes (replicas). In Dynamo [4] the list of
nodes responsible for storing a particular key is called the preference list, and
the size of this list is set by a configuration parameter. While reads can be served
by any replica, updates can lead to consistency issues because they need to be
propagated to all the replicas.
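
One way to derive such a preference list is to keep walking the hash ring clockwise
from the key and collect the first N distinct physical nodes. The Python sketch below
illustrates this under assumptions that are not in the text above: MD5 as the hash
function, eight virtual nodes per machine, and the hypothetical helper names
build_ring and preference_list.

```python
import bisect
import hashlib


def ring_position(label):
    return int.from_bytes(hashlib.md5(label.encode("utf-8")).digest(), "big")


def build_ring(physical_nodes, vnodes_per_node=8):
    """Return a sorted list of (position, physical node) pairs, one per virtual node."""
    ring = []
    for node in physical_nodes:
        for i in range(vnodes_per_node):
            ring.append((ring_position(f"{node}#{i}"), node))
    ring.sort()
    return ring


def preference_list(ring, key, n_replicas):
    """Walk clockwise from the key and collect up to N distinct physical nodes."""
    start = bisect.bisect_left(ring, (ring_position(key), ""))
    chosen = []
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)][1]
        if node not in chosen:
            chosen.append(node)
        if len(chosen) == n_replicas:
            break
    return chosen


ring = build_ring(["n1", "n2", "n3", "n4"])
replicas = preference_list(ring, "user:42", n_replicas=3)
```

Skipping virtual nodes that live on an already chosen machine keeps the replicas on
different hardware, so the loss of one machine does not remove several copies at once.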

Data Models. The basic data access method is to use a key in order to retrieve or
update a value. The value can be a blob (binary large object) [4], a document, a
column family (rows and columns, where each row can have as few or as many columns
as desired) [5], a graph or a collection.

Storage Models. The most common strategy is to design this component in a pluggable
fashion, where the storage mechanism can be MySQL, Berkeley DB, the filesystem
(SSTables) or in-memory storage (memtables).

Consistency Management. The same data is available on multiple nodes at a given
time, and the problem that arises is how to synchronize these replicas in order to
preserve a consistent view of the data from the client's perspective. In systems
where availability and partition tolerance are important requirements, strict
consistency cannot be achieved together with these two properties (CAP Theorem), so
a form of weak consistency, eventual consistency, is implemented instead. There are
various mechanisms that guarantee such systems will eventually become
consistent after a period of time (the inconsistency window) during which
synchronization is performed.
Timestamps. The history of operations performed on a row of data can be used to
decide which value the row will eventually converge to. The drawbacks of this
method are that it requires synchronized clocks on all nodes, it does not capture
causality, and a decision is hard to make when write operations happen simultaneously.
Vector clocks. A vector clock is a tuple {t1, t2, …, tn} of clock values, one from
each node. When a write operation is performed on node i it sets ti to its clock
value. Given two vector clocks v1 and v2, v1 < v2 (that is, v1[k] ≤ v2[k] for all k,
with at least one strict inequality) implies the global time ordering of the
corresponding events. Replicas follow these rules when updating their vector clocks:
     - when an internal operation happens at replica i, it advances its own
          entry vi[i]
     - when replica i sends a message to replica j, it attaches its vector clock
          vi to the message
     - when replica j receives a message from replica i, it advances its own
          entry vj[j] and then merges its clock with the one received in the
          message: for every k, vj[k] = max(vj[k], vi[k])
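
The update rules and the ordering test for vector clocks can be sketched in Python
as follows. The sketch assumes integer logical counters, stores each clock as a
dictionary from node id to counter (a missing entry counts as 0), and uses
hypothetical function names.

```python
def increment(clock, node):
    """Rule for a local event: advance this node's own entry."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def merge(local, received):
    """Entry-wise maximum of two clocks."""
    nodes = set(local) | set(received)
    return {n: max(local.get(n, 0), received.get(n, 0)) for n in nodes}


def receive(local, received, node):
    """Rule for receiving a message: advance the own entry, then merge."""
    return merge(increment(local, node), received)


def happened_before(a, b):
    """True if a <= b in every entry and a < b in at least one."""
    nodes = set(a) | set(b)
    return (all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
            and any(a.get(n, 0) < b.get(n, 0) for n in nodes))


def concurrent(a, b):
    """Neither clock precedes the other, so the writes conflict."""
    nodes = set(a) | set(b)
    differ = any(a.get(n, 0) != b.get(n, 0) for n in nodes)
    return differ and not happened_before(a, b) and not happened_before(b, a)


a1 = increment({}, "A")      # write on replica A
b1 = receive({}, a1, "B")    # replica B receives A's message
a2 = increment(a1, "A")      # A writes again without having seen B
```

Here a1 precedes b1, while b1 and a2 are concurrent: neither replica has seen the
other's latest write, so the conflict must be resolved at a higher level.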
Single Master Model. In this model each data partition has a master node and multiple
slave nodes. Updates are directed to the master node and then propagate
asynchronously to the slave nodes. With this model a system can become unavailable
if the master fails before any of the replicas has been updated.
Multi-Master Model. Intensive update requests in certain key ranges can leave the
Single Master Model unable to spread the workload evenly. The Multi-Master Model
allows updates to be performed at any replica.
Quorum Based 2PC. Assume there are N replicas of some data and a coordinator node.
When an update is requested, the coordinator sends the request to all N replicas
but waits for only W (W < N) successful answers. Reads work the same way: the
coordinator sends the request to the N replicas but waits for only R (R < N)
successful responses, and from the answering nodes the value with the highest
timestamp is selected. The protocol is flexible because different levels of
consistency can be achieved by configuring W and R: with W + R > N every read
quorum overlaps every write quorum, which gives strict consistency; with
W + R ≤ N the consistency model is relaxed to a weaker one.
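
The effect of the W and R parameters can be seen in a small Python simulation. It is
a sketch under simplifying assumptions: replicas are in-process objects, an
unreachable replica is modelled by an `up` flag, there is no separate prepare/commit
phase, and the names Replica, quorum_write and quorum_read are hypothetical.

```python
class Replica:
    def __init__(self):
        self.records = {}  # key -> (timestamp, value)
        self.up = True

    def write(self, key, timestamp, value):
        if not self.up:
            return False  # no acknowledgement from an unreachable replica
        current = self.records.get(key)
        if current is None or timestamp > current[0]:
            self.records[key] = (timestamp, value)
        return True

    def read(self, key):
        return self.records.get(key)


def quorum_write(replicas, key, timestamp, value, w):
    """Send the update to every replica; succeed once at least W acknowledge."""
    acks = sum(1 for replica in replicas if replica.write(key, timestamp, value))
    return acks >= w


def quorum_read(replicas, key, r):
    """Use the first R reachable replicas and return the newest value among them."""
    answers = [replica.read(key) for replica in replicas if replica.up][:r]
    if len(answers) < r:
        raise RuntimeError("read quorum not reached")
    versions = [record for record in answers if record is not None]
    if not versions:
        return None
    return max(versions, key=lambda record: record[0])[1]


replicas = [Replica() for _ in range(3)]   # N = 3
replicas[2].up = False                     # one replica misses the write
ok = quorum_write(replicas, "cart:7", 1, "book", w=2)
replicas[2].up = True                      # the stale replica comes back
replicas[0].up = False                     # and an up-to-date one goes down
value = quorum_read(replicas, "cart:7", r=2)
```

With N = 3, W = 2 and R = 2 the read still returns "book", because any two replicas
include at least one that acknowledged the write. With R = 1 the read could be served
by the stale replica alone and miss the update.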

Membership Management. Since nodes in a cluster may fail or recover, a technique is
needed that allows nodes to know about each other.
Omniscient Master. When nodes leave or join a cluster they communicate with a
master node that holds the authoritative view of the cluster. This method is simple and
provides a consistent view of the cluster status, but there is still a single point
of failure and the model is not partition tolerant.
Gossip. This is a method for propagating the cluster status to all members. At a
preset interval each node selects another node and communicates its view of the
cluster to it. Every node maintains a timestamp for the information about itself
and about the rest of the cluster. This method is scalable and failure tolerant but
provides only eventual consistency of the cluster status.
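
A gossip exchange of this kind can be sketched in Python as below. The sketch makes
several assumptions that are not in the text: a per-member version counter plays the
role of the timestamp, the two peers exchange views in both directions, and the
names GossipNode and gossip_round are hypothetical.

```python
import random


class GossipNode:
    def __init__(self, name):
        self.name = name
        # member name -> (version, status); the version acts as the timestamp
        self.view = {name: (0, "alive")}

    def heartbeat(self):
        version, _ = self.view[self.name]
        self.view[self.name] = (version + 1, "alive")

    def merge(self, other_view):
        """Keep, for every member, the entry with the highest version."""
        for member, (version, status) in other_view.items():
            if member not in self.view or version > self.view[member][0]:
                self.view[member] = (version, status)


def gossip_round(nodes, rng):
    """Each node refreshes its own entry and exchanges views with one random peer."""
    for node in nodes:
        peer = rng.choice([n for n in nodes if n is not node])
        node.heartbeat()
        peer.merge(node.view)
        node.merge(peer.view)


a, b, c = GossipNode("a"), GossipNode("b"), GossipNode("c")
a.heartbeat()
b.merge(a.view)  # a tells b about itself
c.merge(b.view)  # b passes the information on to c

pair = [GossipNode("n0"), GossipNode("n1")]
gossip_round(pair, random.Random(0))
```

Node c learns about a without ever contacting it directly, which is how the cluster
status spreads; because information arrives with a delay, each node's view is only
eventually consistent.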


2.3 Open-Source Projects



Dynamo [4] and Bigtable [5] constituted a great starting point for developing
open-source, non-relational, distributed and horizontally scalable databases. The
NoSQL movement began in early 2009 and has grown rapidly into a substantial list of
free and competitive products providing most of the properties needed in distributed
systems: schema-free design, replication support, an easy API, eventual consistency
and performance. Below is a non-exhaustive list of current databases, classified
along three important characteristics: scalability, data and query model, and
internal persistence model.


                Scalability                           Data and Query Model               Persistence Model

                Add new machines   Support for        Data model      Query API
                transparently to   multiple
                applications       datacenters

Cassandra                                             Column family   Thrift             Memtable/SSTable
HBase                                                 Column family   Thrift, REST       Memtable/SSTable on HDFS
Riak                                                  Document        Nested hashes      ?
Scalaris                                              Key/value       get/put            in-memory only
Voldemort                          under development  Key/value       get/put            BDB, MySQL
CouchDB                                               Document        map/reduce views   append-only B-Tree
MongoDB                                               Document        Cursor             B-Tree
Neo4j                                                 Graph           Graph              on-disk linked lists
Redis                                                 Collection      Collection         in-memory
Tokyo Cabinet                                         Key/value       get/put            hash or B-Tree
Chordless                                             Key/value       Java, simple RPC   ?
InfoGrid                                              Graph           Java, http/REST    ?
Sones                                                 Graph           .Net               ?




 Table 1. Classification by scalability, data and query model, and persistence model
[1], [13]

This table summarizes the most important characteristics of a subset of the
non-relational database systems currently available. The rest of this section
focuses on describing some of these databases.

   Cassandra. Development of this system started at Facebook, and one of its
designers was a co-author of Dynamo. At the moment the project is open source and
still under “heavy development” at The Apache Software Foundation. Its authors
define it as a “structured storage system over a P2P network”. [11] The system
combines the distributed architecture of Dynamo with the column family model of
Bigtable. From the data model point of view Cassandra is a multi-dimensional map
indexed by a key, where each application creates its own key space. Besides column
families, a new concept of super columns is introduced, which represent lists of
columns. Data is sorted at write time, and within a row columns are sorted by name.
The partitioning subsystem is similar to Dynamo's approach: consistent hashing is
used. The same concepts of coordinator node and preference list as in Dynamo are
used for data replication. Cluster management uses a variant of the Gossip
technique, Scuttlebutt anti-entropy Gossip. Internal persistence relies on the local
file system, and the storage structure is similar to the one in Bigtable: SSTable,
memtable, commit logs, compaction and Bloom filters. The system is written in Java
and high-level libraries are available for Ruby, Perl, Python and Scala. Facebook,
Digg and Rackspace use this system in production. [11], [12]
   Voldemort. A key-value store developed by LinkedIn engineers that implements
most of the features available in Dynamo: partitioning and replication (consistent
hashing, preference list), object versioning (vector clocks) and a pluggable storage
component (BDB, in-memory, MySQL). Voldemort also comes with a series of new
features: serialization, support for read-only nodes and compression. LinkedIn uses
this system as its underlying storage system. [11], [12]
   Riak. A key-value store that uses documents as values and follows the same
architecture and algorithms as Dynamo. It is implemented in Erlang and various
client libraries are available: Jiak Client (Erlang (JSON)), Riak (Erlang (raw)),
Python, PHP, Ruby, Java, JavaScript. There are no known examples of usage in
production. [12]
   Redis. A key-value store where values can have multiple types: strings, lists,
sets, ordered sets. Replication is achieved via a Master – Slave model, and client
libraries (available in PHP, Ruby, Scala) are responsible for partitioning. It uses
a memory-driven approach with asynchronous snapshots to disk for local persistence.
Other supported operations depend on the data type of the value: increments,
decrements, atomic multi-set (Strings); push, pop, range get (Lists); intersection,
union, difference (Sets); sorting. It is written in ANSI C and is used in production
at Github, Engine Yard and VideoWiki. [12]
   Neo4j. This is a disk-based (data is stored in a custom binary format), fully
transactional Java persistence engine that stores data structured as graphs. Some of
its most important features are: a graph-oriented model for data representation (it
stores nodes, relationships and properties), high scalability (both on a single
machine and across multiple machines), a simple object-oriented Java API, and
optional layers to expose itself as an RDF store, express meta model semantics using
OWL, and query the graph using SPARQL. [14]


3 NoSQL in the social and semantic Web context

Semantic Web is an initiative of the World Wide Web Consortium (W3C) which
involves transforming the Web so that the data available today can be understood and
reused by machines. On a less abstract level this means attaching metadata to the
resources on the Web and specifying relationships between these resources. The core
of the Semantic Web is a set of design principles and standards: standards already
widely used on the Web (XML, XML Schema), a formal language for expressing data
models (Resource Description Framework, RDF), a vocabulary for describing properties
of RDF-based models (Resource Description Framework Schema, RDFS), a vocabulary for
creating ontologies (Web Ontology Language, OWL), a data query service (SPARQL) and
other standards under development (Rule Interchange Format (RIF), Unifying Logic and
Proof layers).
   Social Web is the term used to describe how people socialize and interact with
each other through the WWW. Classic examples of distributed web applications that
favored the development of large social networks are Facebook, MySpace, LinkedIn,
Flickr, Twitter, Del.icio.us, etc.
   Regarding the influence of NoSQL on the Semantic Web, the long list of database
systems developed, each exposing new techniques for managing data, contains some
examples that may address problems such as managing RDF stores, managing ontologies
or creating SPARQL endpoints.
   Neo4j is probably the most obvious example of such a store. Its graph-oriented
data model makes it well suited to storing RDF triples or complex ontologies.
Although databases using a graph-oriented data view can manage a much smaller volume
of information than the other types of non-relational data stores (key-value, column
family, document), this volume is still large: billions of nodes and relationships.
Neo4j's developers state that the traversal component of the system is
high-performance, and its more than 6 years of use in production raise the degree of
confidence in the system. [14]
   HBase (The Hadoop Database) is a scalable, distributed, column-oriented database
with a dynamic schema for structured data, modeled after Google Bigtable and under
development at the ASF (Apache Software Foundation). The HBase data model can be
viewed as a multi-dimensional map where values are indexed by 4 keys (TableName,
RowKey, ColumnKey, Timestamp). Values are binary data, rows are sorted in
lexicographic order and columns are grouped in column families. The database
schema is flexible and can be modified at run-time. Such a dynamic schema allows
this system to store Semantic Web data. An example of such modeling can be found
in [17].
   Applications in the Social Web sphere have a longer history than Semantic Web
applications, so scalability, performance, availability and the handling of huge
volumes of data have become vital issues for them. Cassandra, one of the most
important non-relational distributed stores, is already used in production in large
social applications: Facebook, Digg. A comparison with MySQL on 50 GB of data shows
that Cassandra performs better. [11]

                              Read                           Write
MySQL                         ~350 ms                        ~300 ms
Cassandra                     0.12 ms                        15 ms



          Table 2. Performance comparison between MySQL and Cassandra on 50 GB
of data



4 Conclusion

   RDMS have served large information systems for over 30 years, but the amount of
data that currently needs to be managed causes multiple problems for these
systems. In order to address problems such as scalability, performance and
availability, a new set of technologies and non-relational databases has been
developed, collectively known under the term NoSQL.
   This paper presents the techniques and design practices that lie behind these new
database products, most of which are inspired by existing and reliable
systems like Amazon’s Dynamo and Google’s Bigtable. A few ideas on how these
systems already influence application development for the semantic and social
Web are also expressed.
   The NoSQL trend began to grow rapidly in early 2009, and within a relatively short
period of time a large number of non-relational database solutions appeared; some of
them have already become components of various large-scale applications. As future
research we intend to study in even greater detail the current techniques used in
designing such systems, and possibly to eliminate the vulnerabilities that may cause
some of them to fail in certain scenarios.


References

   1.    Ellis, Jonathan: NoSQL Ecosystem,
         http://www.rackspacecloud.com/blog/2009/11/09/nosql-ecosystem/ (2009)
   2.    Shoup, Randy: eBay Marketplace Architecture: Architectural Strategies, Patterns, and
         Focuses (2007)
   3.    Bloor, Robin: 6 Reason Why Relational Database Will Be Superseded (2008)
   4.    DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A.,
         Sivasubramanian, S., Vosshall, P., Vogels, W.: Dynamo: Amazon’s Highly Available
         Key-value Store (2007)
   5.    Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A.,
         Burrows, M., Chandra, T., Fikes, A., Gruber, R. E.: Bigtable: A Distributed Storage
         System for Structured Data (2006)
   6.    Brewer, Eric A.: Towards Robust Distributed Systems, Principles Of Distributed
         Computing (2000)
   7.    Vogels, W: Eventually Consistent,
         http://www.allthingsdistributed.com/2008/12/eventually_consistent.html (2008)
   8.    Ho, Ricky: Pragmatic Programming Techniques,
         http://horicky.blogspot.com/2009/11/nosql-patterns.html (2009)
   9.    Wiggins, Adam: SQL Databases Don’t Scale,
         http://adam.blog.heroku.com/past/2009/7/6/sql_databases_dont_scale/ (2009)
   10.   Browne, Julian: Brewer’s CAP Theorem,
         http://www.julianbrowne.com/article/viewer/brewers-cap-theorem (2009)
   11.   NOSQL debrief, http://blog.oskarsson.nu/2009/06/nosql-debrief.html (2009)
   12.   Gupta, Vineet: NoSQL Databases – Part 1- Landscape,
         http://www.vineetgupta.com/2010/01/nosql-databases-part-1-landscape.html (2010)
   13.   NoSQL – Your Ultimate Guide to Non – Relational Universe,
         http://nosql-databases.org/
   14.   Neo4j – the graph database, http://neo4j.org/
   15.   Semantic Web, http://en.wikipedia.org/wiki/Semantic_Web
   16.   Social Web, http://en.wikipedia.org/wiki/Social_web
   17.   Mateescu, Gabriel: Finding the way through the semantic Web with HBase,
         http://www.ibm.com/developerworks/opensource/library/os-hbase/index.html?ca=dgr-twtrHBasedth-OS&S_TACT=105AGY83&S_CMP=TWDW
         (2009)

Contenu connexe

Tendances

2017 IEEE Projects 2017 For Cse ( Trichy, Chennai )
2017 IEEE Projects 2017 For Cse ( Trichy, Chennai )2017 IEEE Projects 2017 For Cse ( Trichy, Chennai )
2017 IEEE Projects 2017 For Cse ( Trichy, Chennai )SBGC
 
TRANSFORMATION RULES FOR BUILDING OWL ONTOLOGIES FROM RELATIONAL DATABASES
TRANSFORMATION RULES FOR BUILDING OWL ONTOLOGIES FROM RELATIONAL DATABASESTRANSFORMATION RULES FOR BUILDING OWL ONTOLOGIES FROM RELATIONAL DATABASES
TRANSFORMATION RULES FOR BUILDING OWL ONTOLOGIES FROM RELATIONAL DATABASEScsandit
 
SOURCE CODE RETRIEVAL USING SEQUENCE BASED SIMILARITY
SOURCE CODE RETRIEVAL USING SEQUENCE BASED SIMILARITYSOURCE CODE RETRIEVAL USING SEQUENCE BASED SIMILARITY
SOURCE CODE RETRIEVAL USING SEQUENCE BASED SIMILARITYIJDKP
 
EFFICIENT SCHEMA BASED KEYWORD SEARCH IN RELATIONAL DATABASES
EFFICIENT SCHEMA BASED KEYWORD SEARCH IN RELATIONAL DATABASESEFFICIENT SCHEMA BASED KEYWORD SEARCH IN RELATIONAL DATABASES
EFFICIENT SCHEMA BASED KEYWORD SEARCH IN RELATIONAL DATABASESIJCSEIT Journal
 
Enhancing the labelling technique of
Enhancing the labelling technique ofEnhancing the labelling technique of
Enhancing the labelling technique ofIJDKP
 
A unified approach for spatial data query
A unified approach for spatial data queryA unified approach for spatial data query
A unified approach for spatial data queryIJDKP
 
Granularity analysis of classification and estimation for complex datasets wi...
Granularity analysis of classification and estimation for complex datasets wi...Granularity analysis of classification and estimation for complex datasets wi...
Granularity analysis of classification and estimation for complex datasets wi...IJECEIAES
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
IRJET- Semantics based Document Clustering
IRJET- Semantics based Document ClusteringIRJET- Semantics based Document Clustering
IRJET- Semantics based Document ClusteringIRJET Journal
 
Front End Data Cleaning And Transformation In Standard Printed Form Using Neu...
Front End Data Cleaning And Transformation In Standard Printed Form Using Neu...Front End Data Cleaning And Transformation In Standard Printed Form Using Neu...
Front End Data Cleaning And Transformation In Standard Printed Form Using Neu...ijcsa
 
Semantic Annotation of Documents
Semantic Annotation of DocumentsSemantic Annotation of Documents
Semantic Annotation of Documentssubash chandra
 
Cross Domain Data Fusion
Cross Domain Data FusionCross Domain Data Fusion
Cross Domain Data FusionIRJET Journal
 
RELATIONAL STORAGE FOR XML RULES
RELATIONAL STORAGE FOR XML RULESRELATIONAL STORAGE FOR XML RULES
RELATIONAL STORAGE FOR XML RULESijwscjournal
 
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...acijjournal
 
New proximity estimate for incremental update of non uniformly distributed cl...
New proximity estimate for incremental update of non uniformly distributed cl...New proximity estimate for incremental update of non uniformly distributed cl...
New proximity estimate for incremental update of non uniformly distributed cl...IJDKP
 

No Sql On Social And Sematic Web

  • 1. NoSQL initiative and its influences on social and semantic Web

Stefan Prutianu, Stefan Ceriu
Faculty of Computer Science, „Al. I. Cuza“ University, Iasi, Romania
{stefan.prutianu, stefan.ceriu}@info.uaic.ro

Abstract. In this paper we describe NoSQL, a family of non-relational database technologies and products developed to address the problems that RDBMS systems currently face: lack of true scalability, poor performance on high data volumes and low availability. Some of these products are already used in production and perform very well: Amazon’s Dynamo, Google’s Bigtable, Cassandra, etc. We also provide a view on how these systems influence application development in the social and semantic Web sphere.

Keywords: NoSQL, distributed computing, distributed non-relational database, semantic Web, social Web, scalability

1 Introduction

Modern relational database technologies tend to have serious problems when managing today's huge volumes of data (eBay: 2 PB of data overall [2]). These problems are scalability, performance and rigid schema design. [1]

Vertical scaling (increasing the computational power of a single node) is only a temporary solution until the data again grows beyond the storage limit.

Horizontal scaling in a traditional relational database management system (partitioning, sharding) means dividing the data into multiple databases according to some application-specific boundaries. Splitting the data across multiple servers, however, breaks the relationships stored within the database, which are the most valuable property of a relational database, and it is not transparent to the application’s business logic.
Read slaves are a form of horizontal scaling used in an RDBMS (Relational Database Management System): a read-only slave database replicates the master database, every write is directed to the master and every read to one of the read-slave replicas. This is still not true scaling, because the master remains a single point of failure.

Large relational databases (multiple terabytes or petabytes in size) usually perform slowly on complex queries because of the amount of data they have to scan and because these systems have a disk-oriented design and disk operations are time consuming. [3]

An RDBMS requires that the database schema (tables, columns, relationships) be designed before the data is used, and in most cases such a schema will require changes
  • 2. (adding new features, adjusting or fine-tuning existing ones), but changing the database schema is very hard in such systems (updating rows may lock them and is a very time-consuming operation). [12]

NoSQL is the common name for a set of new technologies, design practices and open-source projects that address the problems large-scale distributed applications and platforms are facing: scalability, availability, performance, fault tolerance. The NoSQL trend is not intended to replace the relational database model; instead it proposes new solutions to problems that the traditional database model cannot solve.

This paper is structured as follows. Section 2 describes the NoSQL trend in detail with its proposed solutions and results, Section 3 presents how NoSQL has influenced application development in the Social and Semantic Web sphere, and Section 4 concludes our survey.

2 NoSQL

2.1 Overview

NoSQL proponents became more visible in early 2009, when they proposed distributed database solutions for systems where the relational features of an RDBMS are not needed. They were inspired by the closed-source distributed databases already in use at some large corporations, such as Dynamo from Amazon and Bigtable from Google. These solutions, along with the open-source projects (Cassandra, Hypertable, HBase, Redis), share a number of characteristics: key-value storage, running on a large number of machines, and data that is partitioned and shared among these machines.
Another common characteristic is that, in order to reach the desired level of scalability, availability, performance and fault (partition) tolerance, the data consistency requirement is relaxed. This follows from Eric Brewer’s CAP Theorem, which states that a distributed system cannot provide Consistency, Availability and Partition Tolerance at the same time [6], so most of these systems achieve a particular form of weak consistency named eventual consistency.

Consistency means that a system operates fully or not at all; in a distributed environment, if an update is made on some node, all its replicas are updated before any read from those replicas is performed. Consistency can be achieved by using relational databases because they focus on the ACID (Atomicity, Consistency, Isolation, Durability) properties.

Availability means that a system is always available to perform requested tasks.

Partition Tolerance is the ability of a distributed system to keep working when a partition forms, that is, when one or more nodes are isolated from the others due to network or communication failures.
  • 3. Eventual consistency is a specific form of weak consistency: if no new updates are made to an object, eventually all reads will return the last updated value. [7] DNS (Domain Name System) is an example of a system that implements eventual consistency.

Dynamo. Amazon’s Dynamo is a highly available key-value structured storage system [4]. It was developed to meet Amazon’s needs for reliability and scaling. Access to data is provided through a primary-key interface (get(key), put(key) and overloads of these operations). Scalability and availability are achieved through a combination of techniques: consistent hashing is used for data partitioning and replication; data consistency is facilitated by object versioning; consistency among replicas during updates is maintained by a quorum technique and a decentralized replica synchronization protocol; and a gossip-based protocol is used for failure detection and membership updates.

Amazon’s engineers motivated their choice to implement this system by the fact that most of the services their platform exposes store and retrieve data by primary key and thus do not require the complex querying and management functionality of an RDBMS, by the cost of maintaining an RDBMS, and by the fact that traditional storage models would sacrifice availability in favor of consistency.

Dynamo's components for request coordination, for membership and failure detection, and for local persistence are all implemented in Java. The local persistence component has a pluggable design and uses engines such as BDB (Berkeley Database) Transactional Data Store, BDB Java Edition, MySQL and an in-memory buffer with persistent backing storage. [4]

Bigtable. Google’s Bigtable is a distributed storage system for managing structured data, designed to be highly scalable. This system has proven its efficiency in important Google applications: Personalized Search, Google Analytics, Google Earth, Google Finance.
Bigtable does not support a full relational data model; instead it provides clients with a simple data model indexed by row, column and timestamp. From the data model point of view, Bigtable is a sparse, distributed, persistent, multi-dimensional sorted map in which each value is an uninterpreted array of bytes. Row keys are arbitrary strings; data is maintained in lexicographic order by row key, and every read or write under a single row key is an atomic operation regardless of the number of columns involved. Columns are grouped into sets called column families, which usually contain information of the same type. Timestamps are introduced because each cell can contain multiple versions of the same data.

The Bigtable API provides functions for creating and deleting tables and column families, for reads and updates under a particular key, and for other operations involving cluster management. A master model is used to manage load balancing and fault tolerance. For internal persistence Bigtable uses the SSTable file format (an immutable sorted file of key-value pairs) in conjunction with GFS (Google File System). [5]
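The data model described above can be pictured with a small in-memory sketch. This is only an illustration of the "sorted map indexed by row, column and timestamp" idea, not Bigtable's API: the class name `TinyTable`, its method names and the example row keys are all hypothetical, and nothing here is distributed or persistent.

```python
from bisect import bisect_left, insort


class TinyTable:
    """Toy in-memory model of a sorted (row, column, timestamp) -> bytes map."""

    def __init__(self):
        self._cells = {}      # row key -> {column -> {timestamp -> value}}
        self._row_keys = []   # row keys, kept in lexicographic order

    def put(self, row, column, timestamp, value):
        """Store one version of a cell; column is written as 'family:qualifier'."""
        if row not in self._cells:
            self._cells[row] = {}
            insort(self._row_keys, row)
        self._cells[row].setdefault(column, {})[timestamp] = value

    def get(self, row, column, at=None):
        """Return the newest version of a cell, or the newest one not later than `at`."""
        versions = self._cells.get(row, {}).get(column, {})
        eligible = [ts for ts in versions if at is None or ts <= at]
        return versions[max(eligible)] if eligible else None

    def scan(self, start, stop):
        """Return the row keys in [start, stop), in lexicographic order."""
        lo = bisect_left(self._row_keys, start)
        hi = bisect_left(self._row_keys, stop)
        return self._row_keys[lo:hi]


# Hypothetical example data: two versions of one cell, plus a second row.
table = TinyTable()
table.put("com.example.www", "contents:html", 1, b"<html>v1</html>")
table.put("com.example.www", "contents:html", 2, b"<html>v2</html>")
table.put("com.example.blog", "anchor:home", 1, b"Blog")
```

Keeping the row keys sorted is what makes range scans over neighbouring keys cheap, and storing several timestamps per cell is what lets a reader ask for an older version of the same data.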
  • 4. 2.2 Design patterns [8], [11]

API Model. Because the underlying data model can be considered a large distributed hash table (DHT), the basic API (Application Programming Interface) could be:
- get(key): extracts the value at the given key.
- put(key): updates the value at the given key.
- delete(key): removes the key and its associated value.

Machine Infrastructure. The infrastructure for this kind of system is composed of a large number of machines with commodity hardware, connected through a network. Each machine (physical node) has the same software configuration, but the hardware characteristics may differ. Within each physical node a number of virtual nodes are running.

Partition Schemes. Most large-scale distributed systems use a consistent hashing technique because of its flexibility when the number of virtual nodes changes. When nodes are added or removed, keys and data need to be redistributed, and consistent hashing minimizes the amount of these changes. In consistent hashing the key space is finite; the output range of a hash function is treated as a fixed ring. Both virtual node ids and data item keys take values in this circular space, and the owner of a key is the first virtual node encountered when walking the ring clockwise from that key. If a virtual node crashes, all the keys owned by the failing node are adopted by its clockwise neighbor, so the rest of the virtual nodes on the ring are not affected.

Data Replication. In order to achieve high availability and performance, the same data needs to be available on multiple nodes (replicas). In Dynamo [4] the list of nodes responsible for storing a particular key is called the preference list, and the size of this list is configured by a preset parameter.
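The ring walk and the preference list described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation of any particular system: MD5 is used only as a convenient hash function, the `node#i` labelling of virtual nodes is invented for the example, and the names `HashRing`, `ring_position` and `preference_list` are hypothetical.

```python
import hashlib
from bisect import bisect_right


def ring_position(label):
    """Map a string onto the ring (assumption: MD5 output is the ring's range)."""
    return int(hashlib.md5(label.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent-hashing ring with several virtual nodes per physical node."""

    def __init__(self, physical_nodes, vnodes_per_node=8):
        # Each virtual node gets its own position on the ring.
        self._ring = sorted(
            (ring_position(f"{node}#{i}"), node)
            for node in physical_nodes
            for i in range(vnodes_per_node)
        )
        self._positions = [position for position, _ in self._ring]

    def preference_list(self, key, n=3):
        """Walk clockwise from the key and collect the first n distinct physical nodes."""
        start = bisect_right(self._positions, ring_position(key))
        owners = []
        for step in range(len(self._ring)):
            _, node = self._ring[(start + step) % len(self._ring)]
            if node not in owners:
                owners.append(node)
            if len(owners) == n:
                break
        return owners


ring = HashRing(["a", "b", "c", "d"])
smaller_ring = HashRing(["a", "b", "c"])   # the same cluster after node "d" leaves
```

Comparing `ring` and `smaller_ring` shows the property the text relies on: only keys whose first owner was `d` get a new owner, while every other key keeps the owner it had before.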
While read actions can be performed on any replica, update actions can lead to consistency issues because the updates need to be propagated to all the replicas.

Data Models. The basic data access method is to use a key to retrieve or update a value. The value can be a blob (binary large object) [4], a document, a column family (rows and columns, where a row can have as few or as many columns as desired) [5], a graph or a collection.

Storage Models. The most common strategy is to design this component in a pluggable fashion, where the storage mechanism can be MySQL, Berkeley DB, the filesystem (SSTables) or in-memory storage (memtables).

Consistency Management. The same data is available on multiple nodes at a given time, and the problem that arises is how to synchronize these replicas in order to preserve a consistent view of the data from the client's perspective. In systems where availability and partition tolerance are important requirements, strict consistency cannot be achieved together with these two properties (CAP Theorem), thus a form of
  • 5. weak consistency, namely eventual consistency, is implemented in these systems. There are various mechanisms that guarantee such systems will eventually become consistent after a period of time (the inconsistency window) during which synchronization is performed.

Timestamps. The history of operations performed on a row of data can be used to decide which value the row will eventually converge to. The drawbacks of this method are that it requires synchronized clocks on the nodes, it does not capture causality, and a decision is hard to make when write operations happen simultaneously.

Vector clocks. A vector clock is a tuple {t1, t2, …, tn} of clock values, one from each node. When a write operation is performed on node i, it sets ti to its clock value. Given two vector clocks v1 and v2, v1 < v2 (that is, v1[k] ≤ v2[k] for all k, with strict inequality for at least one k) implies that v1 precedes v2 in the global time ordering of events. Replicas follow certain rules when updating their vector clocks:
- when an internal operation happens at replica i, it advances its own entry vi[i];
- when replica i sends a message to replica j, it attaches its vector clock to the message;
- when replica j receives a message from replica i, it advances its own entry vj[j] and then merges its clock with the one received in the message: vj[k] = max(vj[k], vi[k]) for every k.

Single Master Model. In this model each data partition has a master node and multiple slave nodes. Updates are directed to the master node and then propagate asynchronously to the slave nodes. With this model, data can become unavailable if the master fails before any of the replicas has received the latest update.

Multi-Master Model. For certain key ranges, intensive update requests will leave the Single Master Model unable to spread the workload evenly. The Multi-Master Model allows updates to be performed at any replica.

Quorum Based 2PC. 
Assuming that there are N replicas of some data and a coordinator node, when an update is requested the coordinator sends the request to all N replicas but has to wait for only W (W < N) successful answers. The same happens for read actions: the coordinator sends the request to the N replicas but has to wait for only R (R < N) successful responses, and from all the answering nodes the value with the highest timestamp is selected. This protocol is flexible because different levels of consistency can be achieved by configuring W and R: W + R > N gives strict consistency, while W + R ≤ N relaxes the model to a weaker form of consistency.

Membership Management. Since nodes in a cluster may fail or recover, a technique is needed that allows nodes to learn about each other.

Omniscient Master. When nodes leave or join a cluster, they communicate with a master node that holds the authoritative view of the cluster. This method is simple and provides a consistent view of the cluster status, but the master is still a single point of failure and the model is not partition tolerant.

Gossip. This is a method for propagating cluster status to all the members. At a preset interval, each node selects another node and shares its view of the cluster with it. Every node maintains a timestamp of the information about itself and the rest of the
  • 6. cluster. This method is scalable and failure tolerant, but provides only eventual consistency about cluster status.

2.3 Open-Source Projects

Dynamo [4] and Bigtable [5] constituted a great starting point for developing open-source, non-relational, distributed and horizontally scalable databases. The NoSQL movement began in early 2009 and has grown rapidly into a substantial list of free and competitive products providing most of the properties needed in distributed systems: schema-free design, replication support, easy API, eventual consistency, performance. Below is a non-exhaustive list of current databases, classified along three important characteristics: scalability, data and query model, and internal persistence model. The Scalability group has two sub-columns: "Add new machines transparently to applications" and "Support for multiple datacenters".

Database        Scalability          Data Model      Query API           Persistence Model
Cassandra                            Column family   Thrift              Memtable/SSTable
HBase                                Column family   Thrift, REST        Memtable/SSTable on HDFS
Riak                                 Document        Nested hashes       ?
Scalaris                             Key/value       get/put             in-memory only
Voldemort       under development    Key/value       get/put             BDB, MySQL
CouchDB                              Document        map/reduce views    append-only B-Tree
MongoDB                              Document        Cursor              B-Tree
Neo4j                                Graph           Graph               on-disk linked lists
Redis                                Collection      Collection          in-memory
Tokyo Cabinet                        Key/value       get/put             hash or B-Tree
Chordless                            Key/value       Java, simple RPC    ?
  • 7. Add new Support Data Query API Persistence machines for multiple Model Model transparently datacenters to applications InfoGrid Graph Java, http/REST ? Sones Graph .Net ? Table. 1. Classification by scalability, data and query model and persistence model [1], [13] This table summarizes the most important characteristics of a subset from non- relational database systems currently available. The rest of this section will focus on describing some of these databases. Cassandra. This system development started at Facebook and one of its designers was a co-author of Dynamo. At the moment the project is open source and still under “heavy development” at The Apache Software Foundation. Their authors define it as a “structured storage system over a P2P network”. [11] This system combines the distributed architecture of Dynamo and the column family model from Bigtable. From the data model point of view Cassandra it is a multi-dimensional map indexed by a key where each application creates its own key space. Besides column family a new concept of super columns is introduced which represents lists of columns. Data is sorted at write operations and also within a row columns are sorted by their name. Partitioning subsystem is similar to Dynamo approach - consistent hashing is used. The same concepts of coordinator node and preference list as in Dynamo are used for data replication. Cluster management uses a variant of Gossip technique – Scuttlebutt anti-entropy Gossip. Internal persistence relies on the local file system and storage structure is similar to the one in Bigtable: SSTable, memtable, commit logs, compaction and Bloom filters. The system is written in Java and high level libraries are available for: Ruby, Perl, Python, Scala. Facebook, Digg and Rackspace use this system in production. [11], [12] Voldemort. 
A key-value store system developed by LinkedIn engineers that implements most of the features available in Dynamo: partitioning and replication (consistent hashing, preference list), object versioning (vector clocks) and a pluggable storage component (BDB, in-memory, MySQL). Voldemort also adds a series of new features: serialization, support for read-only nodes and compression. LinkedIn uses this system as its underlying storage system. [11], [12]

Riak. A key-value store system that uses documents as values and follows the same architecture and algorithms as Dynamo. It is implemented in Erlang, and various client libraries are available: Jiak Client (Erlang (JSON)), Riak (Erlang (raw)), Python, PHP, Ruby, Java and JavaScript. There are no known examples of production use. [12]

Redis. A key-value store where values can have multiple types: strings, lists, sets and ordered sets. Replication is achieved via a master-slave model; client libraries
(available in PHP, Ruby, Scala) are responsible for partitioning. It uses a memory-driven approach with asynchronous snapshots to disk for local persistence. The other supported operations depend on the data type of the value: increments, decrements and atomic multi-set (strings); push, pop and range get (lists); intersection, union and difference (sets); and sorting. It is written in ANSI C and is used in production at Github, Engine Yard and VideoWiki. [12]

Neo4j. This is a disk-based (data is stored in a custom binary format), fully transactional Java persistence engine that stores data structured in graphs. Some of its most important features are: a graph-oriented model for data representation (it stores nodes, relationships and properties), high scalability (on a single machine as well as across multiple machines), a simple object-oriented Java API, and optional layers to expose itself as an RDF store, express meta model semantics using OWL and query the graph using SPARQL. [14]

3 NoSQL in the social and semantic Web context

The Semantic Web is an initiative of the World Wide Web Consortium (W3C) that involves transforming the Web so that the data available today can be understood and reused by machines. On a less abstract level, this means attaching metadata to the resources on the Web and specifying relationships between these resources. The core of the Semantic Web is a set of design principles and standards: standards already widely used on the Web (XML, XML Schema); a formal language for expressing data models, the Resource Description Framework (RDF); a vocabulary for describing properties of RDF-based models, Resource Description Framework Schema (RDFS); a vocabulary for creating ontologies, the Web Ontology Language (OWL); a data query service, SPARQL; and other standards still under development, such as the Rule Interchange Format (RIF) and the Unifying Logic and Proof layers.

Social Web is the term used to describe how people socialize and interact with each other across the WWW.
Classic examples of distributed web applications that favored the development of large social networks are Facebook, MySpace, LinkedIn, Flickr, Twitter, Del.icio.us, etc.

Regarding the influence of NoSQL on the Semantic Web, the long list of database systems developed so far, each exposing new techniques for managing data, contains some that may address problems such as managing RDF stores, managing ontologies or creating SPARQL endpoints.

Neo4j is probably the most obvious example of such a store. Its graph-oriented data model makes it well suited to storing RDF triples or complex ontologies. Although databases with this graph-oriented data view can manage a much smaller volume of information than the other types of non-relational data stores (key-value, column family, document), this volume is still large: billions of nodes and relationships. The Neo4j developers state that the traversal component of the system is a high-performance one, and its more than six years of use in production raises confidence in the system. [14]

HBase (The Hadoop Database) is a scalable, distributed, column-oriented, dynamic-schema database for structured data, modeled after Google Bigtable and under
development at the ASF (Apache Software Foundation). The HBase data model can be viewed as a multi-dimensional map where values are indexed by four keys (TableName, RowKey, ColumnKey, Timestamp). Values are binary data, rows are sorted in lexicographic order and columns are grouped in column families. The database schema is flexible and can be modified at run time. Such a dynamic schema allows this system to store Semantic Web data. An example of such a modeling can be found in [17].

Applications in the Social Web sphere have a longer history than Semantic Web applications, so scalability, performance, availability and huge volumes of data have become vital issues for them. Cassandra, one of the most important non-relational distributed stores, is already used in production in large social applications: Facebook and Digg. A comparison with MySQL on 50 GB of data shows that Cassandra performs better. [11]

             Read       Write
MySQL        ~350 ms    ~300 ms
Cassandra    15 ms      0.12 ms

Table 2. Performance comparison between MySQL and Cassandra on 50 GB of data

4 Conclusion

RDMS have served large information systems for over 30 years, but the amount of data that currently needs to be managed causes multiple problems for these systems. To address problems such as scalability, performance and availability, a new set of technologies and non-relational databases has been developed, collectively known under the term NoSQL. This paper presents the techniques and design practices that underlie these new database products, most of which are inspired by existing and reliable systems like Amazon’s Dynamo and Google’s Bigtable. It also presents a few ideas on how these systems already influence application development for the semantic and social Web.
The NoSQL trend began to grow rapidly in early 2009, and within a relatively short period of time a large number of non-relational database solutions appeared, some of which have already become components of various large-scale applications. As future research, we plan to study in greater detail the techniques currently used in designing such systems and, possibly, to eliminate the vulnerabilities that may cause some of them to fail in certain scenarios.
References

1. Ellis, Jonathan: NoSQL Ecosystem, http://www.rackspacecloud.com/blog/2009/11/09/nosql-ecosystem/ (2009)
2. Shoup, Randy: eBay Marketplace Architecture: Architectural Strategies, Patterns, and Focuses (2007)
3. Bloor, Robin: 6 Reason Why Relational Database Will Be Superseded (2008)
4. DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., Vogels, W.: Dynamo: Amazon’s Highly Available Key-value Store (2007)
5. Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., Gruber, R. E.: Bigtable: A Distributed Storage System for Structured Data (2006)
6. Brewer, Eric A.: Towards Robust Distributed Systems, Principles Of Distributed Computing (2000)
7. Vogels, W.: Eventually Consistent, http://www.allthingsdistributed.com/2008/12/eventually_consistent.html (2008)
8. Ho, Ricky: Pragmatic Programming Techniques, http://horicky.blogspot.com/2009/11/nosql-patterns.html (2009)
9. Wiggins, Adam: SQL Databases Don’t Scale, http://adam.blog.heroku.com/past/2009/7/6/sql_databases_dont_scale/ (2009)
10. Browne, Julian: Brewer’s CAP Theorem, http://www.julianbrowne.com/article/viewer/brewers-cap-theorem (2009)
11. NOSQL debrief, http://blog.oskarsson.nu/2009/06/nosql-debrief.html (2009)
12. Gupta, Vineet: NoSQL Databases – Part 1 – Landscape, http://www.vineetgupta.com/2010/01/nosql-databases-part-1-landscape.html (2010)
13. NoSQL – Your Ultimate Guide to Non – Relational Universe, http://nosql-databases.org/
14. Neo4j – the graph database, http://neo4j.org/
15. Semantic Web, http://en.wikipedia.org/wiki/Semantic_Web
16. Social Web, http://en.wikipedia.org/wiki/Social_web
17. Mateescu, Gabriel: Finding the way through the semantic Web with HBase, http://www.ibm.com/developerworks/opensource/library/os-hbase/index.html?ca=dgr-twtrHBasedth-OS&S_TACT=105AGY83&S_CMP=TWDW (2009)