Posted by : Sushanth
Tuesday, 22 December 2015
NoSQL:
In this article, we will discuss some of the shortcomings of RDBMSs and learn about the data models underlying NoSQL databases.
Shortcomings of RDBMS:
Relational databases are widely used, and SQL has become the de facto standard for talking to them. A problem many application developers run into is that the assembled object structures held in memory have to be broken into individual pieces before they can be stored in the database. This is called the impedance mismatch problem.
Application developers have long been frustrated by this mismatch between the relational model and in-memory data structures.
Need for NoSQL:
Today, data is becoming easier to access and capture through third parties such as Facebook, Google+ and others. Personal user information, social graphs, geolocation data, user-generated content and machine logging data are just a few examples where the volume of data has been growing exponentially. Providing such services well requires processing huge amounts of data, which SQL databases were never designed for. NoSQL databases evolved to handle data at this scale.
What is NoSQL:
NoSQL refers to non-relational database management systems, which differ from traditional relational database management systems in some significant ways. They are designed for distributed data stores with very large-scale storage needs (for example Google or Facebook, which collect terabits of data every day for their users). This kind of data storage may not require a fixed schema, avoids join operations and typically scales horizontally.
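To make "scaling horizontally" a little more concrete, here is a minimal sketch of hash-based partitioning, where each key is assigned to one of several servers. The shards are plain Python dicts standing in for servers, and the names `shard_for`, `put` and `get` are made up for this illustration. Real systems use more elaborate schemes, so treat this only as a rough picture of the idea.

```python
import hashlib

NUM_SHARDS = 3
# Each dict stands in for one server in the cluster (an assumption of this sketch).
shards = [dict() for _ in range(NUM_SHARDS)]


def shard_for(key, num_shards=NUM_SHARDS):
    """Pick a shard for a key using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def put(key, value):
    """Store the value on the shard that owns the key."""
    shards[shard_for(key)][key] = value


def get(key):
    """Read the value back from the shard that owns the key."""
    return shards[shard_for(key)].get(key)


for user_id in range(10):
    put(f"user:{user_id}", {"id": user_id})
```

Because the key alone decides which shard to contact, a lookup never needs a join across servers. One known weakness of this simple modulo scheme is that changing the number of shards moves most keys to a different shard.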
Characteristics of NoSQL:
- Stands for Not Only SQL
- No declarative query language
- No predefined schema
- Key-Value pair storage, Column Store, Document Store, Graph databases
- Eventual consistency rather than ACID properties
- Unstructured and unpredictable data
- CAP Theorem
- Prioritizes high performance, high availability and scalability
- Open source
- Came out of 21st century web culture
Even though a NoSQL database does not enforce a schema, the data still has an implicit schema. If your code says order[price], it expects every order to have a price field.
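The implicit schema can be sketched in a few lines of Python. The `order` record and the helper names below are hypothetical. The point is only that the store accepts any shape of record, while the application code quietly depends on certain fields being there.

```python
# A hypothetical order record; nothing in the store enforces its shape.
order = {"id": 1001, "customer": "alice", "price": 25.0}


def order_total_with_tax(record, tax_rate=0.1):
    """The lookup record["price"] is the implicit schema: it fails if the field is missing."""
    return record["price"] * (1 + tax_rate)


def has_required_fields(record, required_fields=("price",)):
    """Check up front that a record carries the fields the code relies on."""
    return all(field in record for field in required_fields)
```

Calling `order_total_with_tax` on a record without a `price` key raises a `KeyError`, which is how a broken implicit schema typically shows up at runtime.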
DataModels:
NoSQL databases use different data models.
Companies have developed different data models to meet their own requirements. For example, Google developed Bigtable and Amazon developed Dynamo as their own data storage systems.
There are four data models:
1. Key/value
2. Document based
3. Column family based
4. Graph based
An aggregate is a collection of data that we interact with as a unit. Aggregates form the boundaries for ACID operations with the database. The first three, i.e. key-value, document and column-family databases, can all be seen as forms of aggregate-oriented database. Aggregates make it easier for the database to manage data storage over clusters.
Instead of splattering data across a number of tables, these data models let it be stored in one go, and the database knows what your aggregate boundaries are.
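As a rough illustration of an aggregate, the sketch below keeps an order, its customer and its line items together and saves them under a single key. The field names and the `save_aggregate` / `load_aggregate` helpers are assumptions made for this example, and a plain dict stands in for the database.

```python
import json

# One aggregate: everything about the order travels together as a unit.
order_aggregate = {
    "order_id": 99,
    "customer": {"id": 1, "name": "Martin"},
    "line_items": [
        {"product": "NoSQL book", "price": 30, "quantity": 1},
        {"product": "Bookmark", "price": 5, "quantity": 2},
    ],
    "shipping_address": {"city": "Chicago"},
}

store = {}  # stand-in for an aggregate-oriented database


def save_aggregate(db, key, aggregate):
    """Write the whole aggregate in one go, serialized as JSON."""
    db[key] = json.dumps(aggregate)


def load_aggregate(db, key):
    """Read the whole aggregate back with a single lookup."""
    return json.loads(db[key])


def order_total(aggregate):
    """Sum price * quantity over the line items inside the aggregate."""
    return sum(item["price"] * item["quantity"] for item in aggregate["line_items"])


save_aggregate(store, "order:99", order_aggregate)
```

In a relational design the same data would usually be spread over separate order, customer, line-item and address tables and joined back together on read.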
NoSQL databases from Different Companies:
1. Key/value databases:
The simplest model is the key/value pair. The database doesn't know what the value is; the value can be an image or a document. The model is reduced to a simple hash table consisting of key/value pairs, and it is often easy to distribute across multiple servers. The best-known products in this group include Redis, Dynamo and Riak.
Example application: Consider forum software where you have a home profile page that gives the user's statistics (messages posted, etc.) and the last ten messages by them. The page reads from a key that is based on the user's id and retrieves a string of JSON that represents all the relevant information. A background process recalculates the information every 15 minutes and writes to the store independently.
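The forum example can be sketched roughly as follows. `KeyValueStore` here is a toy in-memory class written for this illustration, not the API of Redis, Riak or any other product, and the key format `profile:<id>` is likewise an assumption.

```python
import json


class KeyValueStore:
    """A toy key/value store: it never looks inside the values it holds."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)


def profile_key(user_id):
    """Build the key for a user's profile page (hypothetical naming scheme)."""
    return f"profile:{user_id}"


def recalculate_profile(store, user_id, messages):
    """What the background job would do: summarize and write one JSON string."""
    summary = {
        "user_id": user_id,
        "messages_posted": len(messages),
        "last_messages": messages[-10:],
    }
    store.put(profile_key(user_id), json.dumps(summary))


def load_profile(store, user_id):
    """What the profile page would do: one read by key, then decode the JSON."""
    raw = store.get(profile_key(user_id))
    return json.loads(raw) if raw is not None else None


store = KeyValueStore()
recalculate_profile(store, 42, [f"message {i}" for i in range(25)])
profile = load_profile(store, 42)
```

The store only ever sees a key and a string. All knowledge of what is inside the JSON lives in the application.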
2. Document databases:
The second common data model is the document data model. It consists of document collections in which a single document can have multiple fields, without necessarily defining a schema. Each document can be a complex data structure, usually represented as JSON, although XML can also be used.
Portions of a document can be retrieved by querying on its fields. The difference from key/value stores is that a key/value store's values are opaque to the database, whereas a document's structure is transparent.
Example application: Consider software that creates profiles of refugee children with the aim of reuniting them with their families. The details you need to record for each child vary tremendously with the circumstances of the event, and they are built up piecemeal. For example, a young child may know their first name and you can take a picture of them, but they may not know their parents' first names. Later a local may claim to recognise the child and provide you with additional information that you definitely want to record, but until you can verify the information you have to treat it sceptically.
The best known and most widely used are MongoDB, CouchDB and ArangoDB.
Sample document:
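A hypothetical sample, sketched in Python and based on the child-profile scenario above. Every field name and value here is an illustrative assumption, and the `collection` dict and helper functions only imitate what a document database would do.

```python
import json

# A hypothetical document: fields are added piecemeal as information arrives.
child_profile = {
    "_id": "child-0001",
    "first_name": "Sam",
    "photo_file": "photos/child-0001.jpg",
    "unverified_claims": [
        {"source": "local resident", "claim": "family lives nearby", "verified": False},
    ],
}

# A collection maps document ids to documents.
collection = {child_profile["_id"]: child_profile}


def find_by_id(docs, doc_id):
    """Look up a whole document by its id."""
    return docs.get(doc_id)


def project(document, fields):
    """Return only the requested fields, skipping any the document lacks."""
    return {name: document[name] for name in fields if name in document}


print(json.dumps(find_by_id(collection, "child-0001"), indent=2))
```

Because the structure is visible, a portion of the document can be pulled out by field name, for example `project(child_profile, ["first_name", "parents"])`, which simply omits `parents` since that field has not been recorded yet.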
In document databases, we generally query using an id.
3. Column family:
The third data model is the column-family model. Each row has a single key, called the row key, and within that row there are multiple column families, where each column family is a group of columns that fit together. Data is stored in sections of columns, which offers more flexibility and easy aggregation. Cassandra (originally developed at Facebook), Google's Bigtable and Amazon's SimpleDB are examples that belong to this group. It is a slightly more complex data model, but it makes it easy to pull out individual columns.
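One way to picture the column-family model is as a two-level map: row key, then column family, then column. The sketch below uses nested Python dicts, with made-up row keys, family names and helper functions. It says nothing about how Cassandra or Bigtable lay data out on disk.

```python
# table[row_key][column_family][column] -> value (all names are hypothetical)
table = {
    "user:1001": {
        "profile": {"name": "Martin", "city": "Chicago"},
        "orders": {"order:99": "shipped", "order:100": "pending"},
    },
    "user:1002": {
        "profile": {"name": "Pramod"},  # rows need not share the same columns
        "orders": {},
    },
}


def get_family(tbl, row_key, family):
    """Fetch every column in one column family of one row."""
    return tbl.get(row_key, {}).get(family, {})


def get_column(tbl, row_key, family, column):
    """Fetch a single column value, or None if it is absent."""
    return get_family(tbl, row_key, family).get(column)
```

For example, `get_column(table, "user:1001", "profile", "city")` pulls one column without touching the order data in the same row.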
4. Graph:
Graph databases are not aggregate-oriented at all. The model is basically one of nodes and edges. It makes relationships easy to handle and is optimized for doing so.
This domain model consists of vertices interconnected by edges, which creates a rich graph structure.
Examples of this group include OrientDB, Neo4j, Sones and InfiniteGraph.
Example application: Any application that requires social networking is best suited to a graph database. These same principles can be extended to any application where you need to understand what people are doing, buying or enjoying so that you can recommend further things for them to do, buy or like. Any time you need to answer the question along the lines of "What restaurants do the sisters of people who are over-40, enjoy skiing and have visited Kenya dislike?" a graph database will usually help.
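The restaurant question can be sketched with a tiny in-memory graph. The `Graph` class, the edge labels and all the people and restaurants below are invented for this illustration. A real graph database would express the traversal in its own query language rather than in hand-written loops.

```python
from collections import defaultdict


class Graph:
    """A toy property graph: nodes carry properties, edges carry a label."""

    def __init__(self):
        self.properties = {}
        self.edges = defaultdict(list)  # source -> [(label, target), ...]

    def add_node(self, name, **props):
        self.properties[name] = props

    def add_edge(self, source, label, target):
        self.edges[source].append((label, target))

    def neighbors(self, node, label):
        """Follow all outgoing edges with the given label."""
        return [target for (edge_label, target) in self.edges[node] if edge_label == label]


def restaurants_disliked_by_sisters(graph):
    """Restaurants disliked by sisters of over-40 skiers who have visited Kenya."""
    result = set()
    for person, props in graph.properties.items():
        if props.get("age", 0) <= 40:
            continue
        if "skiing" not in graph.neighbors(person, "ENJOYS"):
            continue
        if "Kenya" not in graph.neighbors(person, "VISITED"):
            continue
        for sister in graph.neighbors(person, "SISTER"):
            result.update(graph.neighbors(sister, "DISLIKES"))
    return result


g = Graph()
g.add_node("Anna", age=45)
g.add_node("Beth", age=38)
g.add_node("Carl", age=30)
g.add_node("Dana", age=28)
g.add_edge("Anna", "ENJOYS", "skiing")
g.add_edge("Anna", "VISITED", "Kenya")
g.add_edge("Anna", "SISTER", "Beth")
g.add_edge("Beth", "DISLIKES", "Pizza Palace")
g.add_edge("Carl", "ENJOYS", "skiing")
g.add_edge("Carl", "VISITED", "Kenya")
g.add_edge("Carl", "SISTER", "Dana")
g.add_edge("Dana", "DISLIKES", "Burger Barn")
```

Here only Anna is over 40, so the traversal follows her sister edge to Beth and then Beth's DISLIKES edge. Each hop is a direct edge lookup instead of a join between tables.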
Aggregate-oriented databases take a lot of scattered data and put it into bigger lumps, whereas graph databases break data into smaller units and work with those smaller units carefully.
CAP Theorem (Brewer’s Theorem)
CAP theorem states that there are three basic requirements which exist in a special relation when designing applications for a distributed architecture.
Consistency - This means that the data in the database remains consistent after the execution of an operation. For example after an update operation all clients see the same data.
Availability - This means that the system is always on (guaranteed service availability), with no downtime.
Partition Tolerance - This means that the system continues to function even when the communication among the servers is unreliable, i.e. the servers may be partitioned into multiple groups that cannot communicate with one another.
In theory, it is impossible to fulfil all three requirements at once. CAP says that a distributed system can satisfy only two of the three requirements. Current NoSQL databases therefore follow different combinations of C, A and P from the CAP theorem. Here is a brief description of the three combinations CA, CP and AP:
CA - Single site cluster, therefore all nodes are always in contact. When a partition occurs, the system blocks.
CP - Some data may not be accessible, but the rest is still consistent/accurate.
AP - System is still available under partitioning, but some of the data returned may be inaccurate.
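The difference between CP and AP behaviour can be sketched with a toy two-replica cluster. The `Cluster` class and its `mode` flag are inventions for this example and simplify heavily: here a CP cluster refuses writes during a partition, while an AP cluster accepts them on whichever replica the client reaches. No real database is configured this way.

```python
class Cluster:
    """A toy cluster of two replicas that either stays consistent (CP) or available (AP)."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.replicas = [{}, {}]  # each dict is one replica's data
        self.partitioned = False

    def write(self, replica_index, key, value):
        if not self.partitioned:
            # Healthy network: every replica receives the write.
            for replica in self.replicas:
                replica[key] = value
            return
        if self.mode == "CP":
            # Refuse the write rather than let the replicas disagree.
            raise RuntimeError("write rejected: cluster is partitioned")
        # AP: accept the write locally; the other replica is now stale.
        self.replicas[replica_index][key] = value

    def read(self, replica_index, key):
        return self.replicas[replica_index].get(key)


ap = Cluster("AP")
ap.write(0, "x", 1)
ap.partitioned = True
ap.write(0, "x", 2)  # accepted, but replica 1 still holds the old value

cp = Cluster("CP")
cp.write(0, "x", 1)
cp.partitioned = True  # from here on, cp.write(...) raises RuntimeError
```

After the partition, `ap.read(0, "x")` and `ap.read(1, "x")` disagree, which is the "some of the data returned may be inaccurate" case. The CP cluster keeps both replicas identical at the price of rejecting the write.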
NoSQL pros/cons
Advantages
- High scalability
- Distributed Computing
- Lower cost
- Schema flexibility, semi-structured data
- No complicated Relationships
Disadvantages
- No standardization
- Limited query capabilities (so far)
- Eventual consistency is not intuitive to program for
Summary:
| Category | Description | Name of the database |
| --- | --- | --- |
| Document Oriented | Data is stored as documents. An example format may be like - FirstName="Arun", Address="St. Xavier's Road", Spouse=[{Name:"Kiran"}], Children=[{Name:"Rihit", Age:8}] | CouchDB, Jackrabbit, MongoDB, OrientDB, SimpleDB, Terrastore etc. |
| XML database | Data is stored in XML format | BaseX, eXist, MarkLogic Server etc. |
| Graph databases | Data is stored as a collection of nodes, where nodes are analogous to objects in a programming language. Nodes are connected using edges. | AllegroGraph, DEX, Neo4j, FlockDB, Sones GraphDB etc. |
| Key-value store | In the key-value store category of NoSQL database, a user can store data in a schema-less way. Values, which may be strings, hashes, lists, sets or sorted sets, are stored against keys. | Cassandra, Riak, Redis, memcached, BigTable etc. |