Glossary of a few terms you need to know as you read on:
Sharding (or Horizontal partitioning) is a database design principle whereby the content of a database table are split across physical locations by rows instead of by columns (using referential integrity). Each partition forms part of a shard. Multiple shards together provide a complete data set but the partitioned shards are split logically to ensure faster reads and writes.
Database Replication is the process of sharing data between the primary and one or more redundant databases, to improve reliability, fault-tolerance, or accessibility. Typically, data is immediately copied over to backup location upon write so as to be available for recovery and/or read-only resources.
Mongo DB
MongoDB is a document-oriented database management system. It adheres to the NoSQL core concepts in being distributed, schema-free and highly performant for fast reads and writes. As a database built from the ground up, it is specifically designed for high read and write throughput, storing data as documents in a distributed and sharded environment all the while running on commodity servers.
MongoDB is especially suited for Web applications that require a rich domain model. Since a document can intuitively represent rich hierarchical model, one can do away with complicated data structures for managing relationships (a.k.a multi-table joins). A document can contain all the data related to a domain object without having to join to other data structures, resulting in faster reads, writes and more intuitive data management. MongoDB supports database replication with automatic failover support. It is designed for horizontal scaling, which I explain later in the post.
The Data Model
Since MongoDB is a document-oriented database, data stored by MongoDB is in the form of a document that is hierarchical and has attributes that can be easily ready by human and machine alike. The best way to illustrate this is by an example. Listing A shows a sample document that represents an employee in an organization.
Listing A
{ _id: ObjectID('5ddfaf7c7649b7e7hd638d65'), firstName: 'Mike', lastName: 'Smith, department: 'Finance', skills: ['SAP Finance','CFP','CPA'], salary:100000, image: {url:'companyserver/location/msmith.jpg'} }
The “employee” document is simply a JSON document, which means it is a set of key/value pairs of data. Keys represent the name of the fields and the values are either a specific value or could in turn represent a nested document, an array of data, or an array of documents. If you had to represent this in a relational model, you will end up with multiple tables to represent the entities, multiple join tables and referential integrity constraints. To extract data, you will find yourself writing a complex SQL query that joins these flat structured one-dimensional tables. But, a document would represent the person as a whole with all data self-contained resulting in being more intuitive to use.
Because you always operate on a document as whole, you get atomic operations on the document.
Querying and Indexes
The code to query data in these documents also ends up being more intuitive to read. For instance, to find all employees and to find a particular emloyee, the code would look like this:
Listing B:
db.employees.find(); db.employees.find({‘firstName’:’Mike’, ‘lastName’:’Smith’});
Now, if you spend a few more minutes to look at this document, you would be thinking of various other ways your clients would need this data – all employees by department, employees making more than $90,000, employees with particular skills etc. You would also rightfully wonder if these queries would be fast. But what is critical for you to know is that these queries are also very efficient because MongoDB allows you to define secondary indexes on the document. You would add the following indexes and the query would run quite efficiently.
Listing C:
db.employees.ensureIndex({lastName: 1}); db.employees.ensureIndex({skills: 1}); db.employees.ensureIndex({salary: 1}); db.employees.find({'skills': 'CFP'}); db.employees.find({'salary': {'$gt': 90000}});
Database Replication
One of the biggest advantages of NoSQL databases and MongoDB in particular is its database replication features. MongoDB’s offering is called a replica set, which is a set of MongoDB database servers that operate seamlessly as an unit offering one primary database server for writes and (potentially reads) and at least one (but typically more than one) redundant failover servers. The failover servers are referred to as secondary servers and can be used for scaling data reads. The data in the secondary servers is kept in sync by the replication process in real-time and in case of a failure in the primary server, one of the secondary servers kicks in automatically as primary and the remaining servers continue to be secondary. The choice of which server becomes primary can be decided by a configurable arbitration process. Eventually, when the former primary comes back online and rejoins the cluster, it does so as a secondary server. This is what is known as a replicated set.
Scaling
If you have been involved in building and running any enterprise application or large scale web application, it is very likely that you have been faced with scenarios where performance has degraded for your end-users over time. With existing architectures involving application servers and a large database server, you may have taken the approach of beefing up the hardware on your application servers and database servers, also known as vertical scaling. The drawback of such a solution is that there is a finite amount of opportunity available with your application servers and database servers after which, you cannot benefit from upgrades while being pragmatic about costs. Also, you will notice a glaring weakness – your database server is a single point of failure. If the server fails, so does your application ecosystem. Granted you may have backup, but to load the backup data and bring the server back online maynot be a trivial effort.
Horizontal scaling (Sharding) involves scaling out – instead of concentrating on improving one server, you focus on distributing data across multiple smaller servers. You continue with your existing hardware profile, perhaps smaller commodity servers, add more of these servers into your ecosystem and distribute your data across this farm of servers. In this architecture, the entire set of servers together isyour database but each server in the farm contains only a slice of data. In such a scenario, it is easier to imagine that a failure of one server means you lose a small chunk of data while retaining most of it and having it available to your applications. To extend this further and couple with what you learned about replication in the previous section, if each server is a replicated set, you are almost guaranteed to have the smallest down time in case of a catastrophic failure in a particular server.
Mongo DB was designed to operate as a distributed database with support for auto-sharding, which automatically manages the distribution of data across nodes. Adding a shard is easy and failover is handled by MongoDB’s sharding system. If you ensure that each shard operates as a replica set as discussed earlier, you would eliminate single-points of failure. MongoDB’s sharded cluster management also ensures that developers need not worry about the location of data and only need to address the sharded cluster as if it were a single MongoDB server.
Each shard in MongoDB consists of one or more servers running a mongod process. A collection is sharded based on a shard key which can be defined to use a certain pattern similar to defining an index. MongoDB's sharding is order-preserving; adjacent data by shard key tend to be on the same server. This page provides more detail on MongoDB's sharding policy.
Data Integrity
MongoDB offers an option called “Journaling”, a feature that addresses data integrity by recording every database write operation as a running incremental log in addition to writing to the database. This journal would then be used in case of a catastrophic failure to replay the write operations to bring the database back to its original state prior to the crash. An alternative to journaling would be run each server as a replicated set, but journaling is always recommended.
What do you give up?
RDBMS have long rules the database world for everything from simple persistence to complex analytical database engines. It is inevitable that NoSQL databases are compared to relative to RDBMS features. Here are a few things of note:
By using MongoDB, you give up the notion of atomic transactions (ACID) across domains. While updating a single document is an atomic transaction in MongoDB, writing to two separate documents (tables in RDBMS) is not a single transaction. One could succeed while the other could fail. You have to design for this by a separate strategy.
There is no concept equivalent to table joins and foreign key constraints in MongoDB.
By nature of a document-oriented model without joins and key constraints, you will end up with a more de-normalized model with MongoDB than you would expect with a similar RDBMS model.
You have to have a certain tolerance for some (but rare) loss of data because of in-memory manipulation of data in MongoDB as opposed to physical disk persistence in RDBMS databases.
By using MongoDB, you will give up some of the powerful SQL Querying constructs supported by RDBMS databases.
Summary
I have presented an overview of MongoDB, one of the most popular NoSQL databases around today. MongoDB is a strong and viable candidate for any company to consider as an alternative to relational databases or to other NoSQL databases. It is designed to be scalable, distributed, easy to administer and to run on commodity servers.
However, it is important to know that a NoSQL database may not be the right option for every use case and further, even if you have a strong case for a NoSQL database, MongoDB may still not be the right choice for you.
Consider the following as you determine a fit your usage scenarios:
- MongoDB is document oriented and within each document, maintains a key/value notation to store data. This format supports rich domains usually found in web applications, analytics, logging.
- The MongoDB document can act as a primary storage for your domain. The rich data model allows you to define secondary indexes for faster queries.
- MongoDB is schema-less. This allows you to model a domain whose attributes are expected to vary over time.
- Auto-sharding lets you operate your database as a distributed database and if your domain has naturally shardable attributes, then MongoDB is the best choice.
- MongoDB’s scaling features allows you to support a large database and still meet speed and availability specifications typically required of web applications.
- Not having to worry much about DBA activities, developers control the data format which leads to seamless development and enhancements to existing features all resulting in faster delivery.
- Community and Commercial support for MongoDB is significant. 10gen is the company behind MongoDB and seems committed to its success. There is a significant developer community excited about the product and its future.
- MongoDB has been deployed in many production environments around the world and has been proven to be a success. Craigslist (Craigslist uses MongoDB to archive billions of records), SAP (SAP uses MongoDB as a core component of SAP’s platform-as-a-service (PaaS) offering), Foursquare (to store venues and user check-ins) are some of the companies using MongoDB in production.