Wednesday, October 10, 2012

Neo4j - A Graph Database

Introduction

This blog looks at Neo4j, a graph database. Here is a short glossary of terms you need to know as you read on:
Sharding (or horizontal partitioning) is a database design principle whereby the contents of a database table are split across physical locations by rows rather than by columns (splitting by columns relies on referential integrity to tie the pieces back together). Each partition forms part of a shard. Together, the shards provide the complete data set, but the data is split logically across them to ensure faster reads and writes.
Database replication is the process of sharing data between the primary and one or more redundant databases to improve reliability, fault tolerance, or accessibility. Typically, data is copied to a backup location immediately upon write so that it is available for recovery and/or read-only access.
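
As a rough illustration of the sharding definition above, the sketch below routes each row of one logical table to a shard by hashing its key. The class name ShardRouterSketch, the method shardFor, and the hash-modulo routing are assumptions made purely for illustration; real systems often use range-based or consistent-hash schemes instead.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: rows of one logical table spread across shards by key.
final class ShardRouterSketch {
    private final List<Map<String, String>> shards = new ArrayList<>();

    ShardRouterSketch(int shardCount) {
        if (shardCount <= 0) throw new IllegalArgumentException("shardCount must be positive");
        for (int i = 0; i < shardCount; i++) shards.add(new HashMap<>());
    }

    // Pick a shard from the row key; floorMod keeps the result non-negative.
    int shardFor(String rowKey) {
        return Math.floorMod(rowKey.hashCode(), shards.size());
    }

    void put(String rowKey, String row) { shards.get(shardFor(rowKey)).put(rowKey, row); }

    String get(String rowKey) { return shards.get(shardFor(rowKey)).get(rowKey); }

    // All shards together hold the complete data set.
    int totalRows() {
        int n = 0;
        for (Map<String, String> s : shards) n += s.size();
        return n;
    }

    public static void main(String[] args) {
        ShardRouterSketch table = new ShardRouterSketch(3);
        table.put("customer-1", "Alice");
        table.put("customer-2", "Bob");
        System.out.println(table.get("customer-2") + " lives on shard " + table.shardFor("customer-2"));
    }
}
```

Each read or write touches exactly one shard, which is where the speed-up comes from. Any query that needs rows from several shards has to visit each of them.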

Neo4j

Neo4j is a graph-oriented NoSQL database. This is distinct from the other three categories: Key-Value, Big Table (column-family), and Document databases. These three categories are all described as aggregate databases. What this means is that these databases define their basic unit of persistence (or the domain) as a chunk of data that holds meaning when read or written as a whole. According to Martin Fowler (http://martinfowler.com/bliki/AggregateOrientedDatabase.html), an aggregate is “a fundamental unit of storage which is a rich structure of closely related data: for key-value stores it's the value, for document stores it's the document, and for column-family stores it's the column family. In DDD terms, this group of data is an aggregate.” These types of databases are very effective when you query them with the aggregates as your dimension data and the attributes as your facts. But when you change your query mode so that one or more of your attributes act as dimensions, you run into severe performance issues, and hence you fall back to writing Map-Reduce functions.
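
To make that trade-off concrete, here is a minimal plain-Java sketch of an aggregate store. Fetching an order by its key is a single lookup, while asking which orders contain a given product means scanning every aggregate, which is the kind of work a Map-Reduce job does at scale. The class AggregateStoreSketch and the Order fields are hypothetical and not taken from any particular product.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of an aggregate-oriented store: one key, one chunk of related data.
final class AggregateStoreSketch {
    // The aggregate: an order stored together with its line items.
    static final class Order {
        final String customer;
        final List<String> products;

        Order(String customer, String... products) {
            this.customer = customer;
            this.products = Arrays.asList(products);
        }
    }

    private final Map<String, Order> ordersById = new LinkedHashMap<>();

    void save(String orderId, Order order) { ordersById.put(orderId, order); }

    // Querying along the aggregate key: a single lookup.
    Order byId(String orderId) { return ordersById.get(orderId); }

    // Querying along an attribute inside the aggregate: every aggregate is scanned.
    List<String> orderIdsContaining(String product) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Order> e : ordersById.entrySet()) {
            if (e.getValue().products.contains(product)) hits.add(e.getKey());
        }
        return hits;
    }

    public static void main(String[] args) {
        AggregateStoreSketch store = new AggregateStoreSketch();
        store.save("o1", new Order("Ann", "book", "pen"));
        store.save("o2", new Order("Raj", "pen"));
        System.out.println(store.orderIdsContaining("pen"));
    }
}
```
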
Enter Graph Databases. Conceptually, graph databases are built using Nodes and Relationships. Each of those may contain properties. Thus, there isn’t an aggregate of your domain objects but a set of Nodes connected via Relationships. Neo4j and InfiniteGraph are two examples of Graph Databases.

The Data Model

The moment one hears about Nodes and Relationships, one tends to think of RDBMS concepts by instinct. This is a mistake. You have to shift your thinking to Nodes and Relationships rather than entities and their relationships. What you get with a graph database is what you produce as your domain model, and you can translate it directly into a graph. There is no further normalization of your model into tables, relationship tables, etc.
Neo4j, as an example of a graph database, is a schema-less database. It allows you to do bottom-up data model design and is fully ACID compliant. It is written in Java and is available as a standalone database server or as an embedded database.

The circular objects represent Nodes and the arrows represent Relationships. As you can see, there are multiple relationships between Nodes and this artifact came out of a design session for the application. This translates into your Neo4j database design without further work to create tables, columns, keys etc.
Nodes have properties that are basically key-value pairs. By writing your code in Java, you can take advantage of strongly typed attributes. Every Relationship must have a Start Node and an End Node. The Start and End Nodes can be the same. Relationships can also have properties, just as Nodes do.
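
The rules in the paragraph above can be modeled in a few lines of plain Java. This is only an in-memory illustration of the structure, not the Neo4j API; the names PropertyGraphSketch, createNode, and relate, and the RELATED_TO type, are made up for the sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of nodes and relationships; not the Neo4j API.
final class PropertyGraphSketch {
    static final class Node {
        final Map<String, Object> properties = new HashMap<>();
    }

    static final class Relationship {
        final Node start;
        final Node end;
        final String type;
        final Map<String, Object> properties = new HashMap<>();

        Relationship(Node start, Node end, String type) {
            // Both ends are mandatory; they may be the same node.
            if (start == null || end == null) {
                throw new IllegalArgumentException("start and end are required");
            }
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    final List<Node> nodes = new ArrayList<>();
    final List<Relationship> relationships = new ArrayList<>();

    Node createNode() {
        Node n = new Node();
        nodes.add(n);
        return n;
    }

    Relationship relate(Node start, Node end, String type) {
        Relationship r = new Relationship(start, end, type);
        relationships.add(r);
        return r;
    }

    public static void main(String[] args) {
        PropertyGraphSketch g = new PropertyGraphSketch();
        Node person = g.createNode();
        person.properties.put("name", "Person A");
        Relationship loop = g.relate(person, person, "RELATED_TO"); // start == end is allowed
        loop.properties.put("since", 2012);
        System.out.println(g.nodes.size() + " node, " + g.relationships.size() + " relationship");
    }
}
```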

Creating Nodes and Relationships:

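 import org.neo4j.graphdb.DynamicRelationshipType;
 import org.neo4j.graphdb.GraphDatabaseService;
 import org.neo4j.graphdb.Node;
 import org.neo4j.graphdb.Transaction;
 import org.neo4j.graphdb.factory.GraphDatabaseFactory;
 
 // Open (or create) an embedded database at the given path.
 GraphDatabaseService db = new GraphDatabaseFactory().newEmbeddedDatabase("/sample/mydb");
 // All writes must happen inside a transaction.
 Transaction tx = db.beginTx();
 try {
  Node personA = db.createNode();
  personA.setProperty("name", "Person A");
  Node projectCRM = db.createNode();
  projectCRM.setProperty("name", "CRM Project");
  // Relate the two nodes: Person A LEADS the CRM Project.
  personA.createRelationshipTo(projectCRM, DynamicRelationshipType.withName("LEADS"));
  tx.success(); // mark the transaction for commit
 } finally {
  tx.finish(); // commits if success() was called, otherwise rolls back
 }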

Querying and Indexes

Querying for data in Neo4j is typically programmatic. You can use the Java API that ships with the distribution to query for data. Neo4j, being a graph, is in effect indexed by design, and hence creating explicit indexes is usually limited to certain app-specific functionality. You typically create an index to focus on a set of nodes, say the most frequently queried ones or a logical grouping of nodes. So, as a best practice, index only what you need and not every node or relationship. (http://docs.neo4j.org/chunked/milestone/indexing.html)
Lucene is the default indexing implementation for Neo4j. This code snippet extends the previous one with code that adds indexes.

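 import org.neo4j.graphdb.DynamicRelationshipType;
 import org.neo4j.graphdb.GraphDatabaseService;
 import org.neo4j.graphdb.Node;
 import org.neo4j.graphdb.Relationship;
 import org.neo4j.graphdb.Transaction;
 import org.neo4j.graphdb.factory.GraphDatabaseFactory;
 import org.neo4j.graphdb.index.Index;
 
 GraphDatabaseService db = new GraphDatabaseFactory().newEmbeddedDatabase("/sample/mydb");
 Transaction tx = db.beginTx();
 try {
  Node personA = db.createNode();
  personA.setProperty("name", "Person A");
  Node projectCRM = db.createNode();
  projectCRM.setProperty("name", "CRM Project");
  personA.createRelationshipTo(projectCRM, DynamicRelationshipType.withName("LEADS"));
  // Get (or create on first use) a named node index and a named relationship index.
  Index<Node> projects = db.index().forNodes("projects");
  Index<Relationship> projectTeamMembers = db.index().forRelationships("projectTeamMembers");
  // Index the project node under its "name" property so it can be looked up later.
  projects.add(projectCRM, "name", projectCRM.getProperty("name"));
  tx.success();
 } finally {
  tx.finish();
 }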


Indexes are usually used to retrieve a reference to the set of nodes you are interested in. The real work begins after you have the set of Nodes (or Relationships) you were looking for.
For querying data, in addition to the Query API, you have the option to use the Traversal API, of which there are two kinds: the Simple Traversal API and the enhanced Traversal API. Both are under active development, though the Simple version has been out longer and is more proven. You can also use the REST interface to query data.
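
The "index first, then traverse" pattern can be sketched without any Neo4j classes. Below, a map stands in for the index and finds the start node, and a breadth-first walk then follows outgoing relationships up to a maximum depth. Everything here (TraversalSketch, reachable, the node names) is hypothetical and only meant to show the shape of the idea, not how the Traversal API is implemented.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: an index finds the start node, a traversal does the rest.
final class TraversalSketch {
    private final Map<String, List<String>> outgoing = new HashMap<>(); // adjacency by node name
    private final Map<String, String> nameIndex = new HashMap<>();      // indexed key -> node name

    void relate(String from, String to) {
        outgoing.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
    }

    void index(String key, String node) { nameIndex.put(key, node); }

    // Breadth-first walk from the indexed node, at most maxDepth hops away.
    List<String> reachable(String indexKey, int maxDepth) {
        String start = nameIndex.get(indexKey);
        if (start == null) return Collections.emptyList();
        List<String> found = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        seen.add(start);
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(start);
        for (int depth = 0; depth < maxDepth && !frontier.isEmpty(); depth++) {
            Deque<String> next = new ArrayDeque<>();
            for (String node : frontier) {
                for (String neighbour : outgoing.getOrDefault(node, Collections.<String>emptyList())) {
                    // seen.add returns false for nodes already visited, which also breaks cycles.
                    if (seen.add(neighbour)) {
                        found.add(neighbour);
                        next.add(neighbour);
                    }
                }
            }
            frontier = next;
        }
        return found;
    }

    public static void main(String[] args) {
        TraversalSketch g = new TraversalSketch();
        g.relate("Person A", "CRM Project");
        g.relate("CRM Project", "Billing Module");
        g.index("lead", "Person A");
        System.out.println(g.reachable("lead", 2));
    }
}
```
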
Another popular alternative is the Cypher Query language (http://docs.neo4j.org/chunked/stable/cypher-query-lang.html). It is a declarative language that can be compared to SQL in its structure. It is a pattern matching language that walks through the nodes using the indexes already created. Cypher is a powerful tool in your arsenal when working with Neo4j.
There are other tools available at http://tinkerpop.com/ that you will find very useful.

Database Replication and High Availability

Neo4j is a graph database and, by concept and design, is not built for sharding. A typical production configuration has a Master-Slave design with more than one Slave server. Data is replicated across the servers using log shipping for eventual consistency. Writes to the Master are fast and immediate, and the slaves eventually become consistent based on a configurable polling schedule. On the other hand, writing to a slave means that the data has to be synchronously written to the Master before the commit, and the other slaves catch up per design. For disaster recovery, you can configure some slaves to be write-only, so that all they do is hold the latest data until a disaster strikes the Master! Then one of those permanent slaves can be elected the master and your apps continue working without missing a beat.
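
The write paths described above can be sketched in plain Java. In this toy model the master appends every write to a log, a slave applies unseen log entries whenever it polls, and a write sent to a slave is committed on the master first. It is a simplification under assumed names (ReplicationSketch, Master, Slave, pull) and says nothing about how Neo4j HA is actually implemented.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of master-slave replication by log shipping; not Neo4j HA itself.
final class ReplicationSketch {
    static final class Master {
        final List<String[]> log = new ArrayList<>();   // ordered {key, value} entries
        final Map<String, String> data = new HashMap<>();

        void write(String key, String value) {
            data.put(key, value);
            log.add(new String[] {key, value});
        }
    }

    static final class Slave {
        final Master master;
        final Map<String, String> data = new HashMap<>();
        int applied = 0; // number of master log entries already applied here

        Slave(Master master) { this.master = master; }

        // Called on the polling schedule: apply whatever the master logged since last time.
        void pull() {
            while (applied < master.log.size()) {
                String[] entry = master.log.get(applied++);
                data.put(entry[0], entry[1]);
            }
        }

        // A write through a slave commits on the master first, then catches up locally.
        void write(String key, String value) {
            master.write(key, value);
            pull();
        }
    }

    public static void main(String[] args) {
        Master master = new Master();
        Slave slave = new Slave(master);
        master.write("name", "Person A");
        System.out.println("before poll: " + slave.data.get("name"));
        slave.pull();
        System.out.println("after poll: " + slave.data.get("name"));
    }
}
```

Until a slave polls, reads from it can return stale data. That window is the "eventual" part of eventual consistency.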

Scaling

By design, graphs are hard to scale. Nodes can have relationships that span multiple servers in your database server farm. Load-balancing can leave many relationships crossing instances. These are very expensive to traverse, because network hops are many orders of magnitude slower than in-memory traversals. To scale as far as possible, Neo4j uses Java NIO and in-memory caching, and expects a large amount of RAM that it can use to cache data. The cache itself can be sharded, so that each instance's main memory keeps its cache warm with the most frequently used data; when used with sticky sessions, this ensures a user gets maximum performance. You can also shard your domain data with application-specific algorithms.
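
One way to reason about the cost described above is to count how many relationships end up with their two nodes on different instances for a given placement. The sketch below does just that; the class PartitionCostSketch, the method name, and the sample graph are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: count relationships whose two ends live on different instances.
final class PartitionCostSketch {
    // edges: each element is {startNode, endNode}; placement: node -> instance id.
    static int crossInstanceRelationships(String[][] edges, Map<String, Integer> placement) {
        int crossing = 0;
        for (String[] edge : edges) {
            Integer a = placement.get(edge[0]);
            Integer b = placement.get(edge[1]);
            if (a == null || b == null) {
                throw new IllegalArgumentException("unplaced node in " + edge[0] + "->" + edge[1]);
            }
            if (!a.equals(b)) crossing++;
        }
        return crossing;
    }

    public static void main(String[] args) {
        String[][] edges = { {"A", "B"}, {"B", "C"}, {"C", "A"}, {"C", "D"} };
        Map<String, Integer> placement = new HashMap<>();
        placement.put("A", 0);
        placement.put("B", 0);
        placement.put("C", 0);
        placement.put("D", 1);
        System.out.println("crossing: " + crossInstanceRelationships(edges, placement));
    }
}
```

A placement that keeps tightly connected nodes together yields a low count. Spreading the same nodes evenly for load-balancing can push the count up sharply, and every crossing becomes a network hop during traversal.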

Summary

This is only a brief introduction to graph databases with a focus on Neo4j. The topic is vast and there is much to learn about this unique NoSQL technology. Neo4j is a wonderful alternative not only to RDBMS but also to other NoSQL databases due to its intuitive design.
However, it is important to know that a NoSQL database may not be the right option for every use case and, further, even if you have a strong case for a NoSQL database, Neo4j may still not be the right choice for you.
Consider the following as you determine a fit for your usage scenarios:
  1. Neo4j offers a powerful data model, especially for connected data.
  2. By its nature, data traversal is very fast compared to RDBMS.
  3. Graph databases are great for generating recommendations, ratings, web analytics, and social apps.
  4. Compared to other NoSQL databases, it supports ACID transactions.
  5. It is under active development with a company behind it - http://www.neotechnology.com
  6. A community edition is available and is free.
  7. However, HA relies on a Master-Slave configuration, which limits certain capabilities: sharding is not available out of the box. You have to be creative in design and deployment to achieve desired (or close to desired) results.
  8. Scaling is the same: you have to be creative in design and deployment to achieve desired (or close to desired) results.
  9. Write scaling is harder than read scaling, and even though your domain may fit the connected data model, it still may not fit graph database technology due to the non-functional requirements of your applications.
For documentation, instructions on installing and tutorials, visit www.neo4j.org.
