November 2011

Volume 26 Number 11

Data Points - What the Heck Are Document Databases?

By Julie Lerman | November 2011

There’s a good chance that you’ve at least heard of the term NoSQL by now. Articles have even been written about it here in MSDN Magazine. A lot of people whom I highly respect are quite excited about it, and having grown up on relational databases, I wanted to have a better understanding of the space. I’ve done quite a bit of research and pestering of friends to wrap my head around it, and here I’ll share what I’ve learned about a subset of NoSQL databases called “document databases.” Another subset is key-value pair databases. Azure Table Storage, which I wrote about in my July 2010 Data Points column (msdn.microsoft.com/magazine/ff796231), is an example of a key-value pair NoSQL store.

I should first address the definition of NoSQL. It’s become a bit of a ubiquitous and possibly overused term. The term is used to encompass data storage mechanisms that aren’t relational and therefore don’t require using SQL for accessing their data. In his blog post, “Addressing the NoSQL Criticism” (bit.ly/rkphh0), CouchDB expert and author Bradley Holt says that he’s heard people “redefining NoSQL as ‘not only SQL.’” His point is that this isn’t an anti-SQL movement by any means. I like this perspective, because I’m a big believer in using the right tool for the job.

Most databases that fall under the nonrelational umbrella share common goals of speed and scalability. By breaking away from the relational storage model and leaving schemas behind, these databases are free of the limitations put upon them by a tightly bound schema and your application’s need to join data across tables.

Of the many document databases available, I’ll focus on two of the most popular—MongoDB (mongodb.org) and CouchDB (couchdb.apache.org)—as well as RavenDB (ravendb.net), which was written for the Microsoft .NET Framework and is growing in popularity (see the article, “Embedding RavenDB into an ASP.NET MVC 3 Application,” in this issue). This will remain high-level, though you can learn many more details about the individual databases and what makes them unique from one another by visiting their Web sites.

With the exception of a few twists (which I’ll point out in this article), these databases provide their data most commonly through HTTP, store their data as JavaScript Object Notation (JSON) documents and provide APIs in multiple languages. The overall concerns are simplicity, speed and scalability. Equally important is that all three are open source projects.

In my research, I’ve heard a MongoDB expert say that the product’s primary concern is performance. A CouchDB expert pointed to simplicity and reliability (“we want to be the Honda Accord of databases”). And Ayende Rahien, creator of RavenDB, said RavenDB aims for “fast writes, fast reads and world peace.” Each of these document databases has even more to offer than what these sound bites suggest.

An Alternative, Not a Replacement, for Relational Databases

The NoSQL and document databases provide an alternative to relational databases, not a replacement. Each has its place, and they simply provide you with more options from which to choose. But how to choose? An important gauge is the Consistency, Availability and Partition Tolerance (CAP) theorem. It says that when working in distributed systems, you can only have two of the three guarantees (the C, the A or the P), so you have to pick what’s important. If Consistency is the most critical, then you need to go with a relational database.

A common example of where Consistency would be the most important guarantee is in a banking application or perhaps one that runs a nuclear facility. In these scenarios, it’s critical that every single piece of data is accounted for at every moment. If someone makes a withdrawal, you really need to know about it when you’re looking at his account balance. Therefore, you’ll probably want a relational database with a high level of control over its transactions. A term you’ll hear frequently is “eventual consistency,” or as expressed on the RavenDB site: “better stale than offline.” In other domains, eventual consistency is sufficient. It’s OK if data you’re retrieving isn’t up-to-the-millisecond accurate.

Perhaps, then, it’s more important that some version of the data is available, rather than waiting for all of the transactions to catch up. This is related to the A (Availability) in CAP, which is focused on server uptime. Knowing that you’ll always have access to the database takes precedence and is a huge benefit to database performance (that is, document databases are fast!). You’ll find that the P, Partition Tolerance, is also important to the document databases, especially when scaling horizontally.
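To make “better stale than offline” concrete, here is a toy sketch of eventual consistency. It is not how any of the three databases actually replicates. The Primary and Replica classes and the replicate function are hypothetical names invented for this illustration, and everything runs in a single process.

```python
from collections import deque


class Primary:
    """Accepts writes immediately and queues them for later replication."""

    def __init__(self):
        self.data = {}
        self.pending = deque()  # writes not yet shipped to the replica

    def write(self, key, value):
        self.data[key] = value
        self.pending.append((key, value))


class Replica:
    """Always answers reads, even if its copy lags behind the primary."""

    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)


def replicate(primary, replica):
    """Ship every queued write to the replica (the 'eventual' part)."""
    while primary.pending:
        key, value = primary.pending.popleft()
        replica.data[key] = value


primary, replica = Primary(), Replica()
primary.write("balance:jim", 100)
replicate(primary, replica)

primary.write("balance:jim", 40)           # a withdrawal hits the primary
stale_read = replica.read("balance:jim")   # replica still holds the old value
replicate(primary, replica)
fresh_read = replica.read("balance:jim")   # replica has caught up
```

The replica never refuses a read. The price is that a read taken between the write and the next replication pass can return the older balance, which is exactly the trade-off a banking application can’t accept.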

RESTful HTTP API—Mostly

Many of the NoSQL databases are accessible in a RESTful way, so you make your database connection through a URI, and the queries and commands are HTTP calls. MongoDB is an exception. Its default is to use TCP for database interactions, although there’s at least one HTTP API available, as well. CouchDB and MongoDB provide language-specific APIs that let you write and execute queries and updates without having to worry about writing the HTTP calls directly. RavenDB has a .NET client API that simplifies interacting with the database.
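As a rough sketch of what “queries and commands are HTTP calls” means, the following builds (but deliberately doesn’t send) a PUT request in the style CouchDB uses to store a document at /database/document-id. The host and port, the database name students and the document ID jim are assumptions made for the example, and build_put_request is a hypothetical helper, not part of any client library.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:5984"  # assumption: a local CouchDB-style server


def build_put_request(database, doc_id, document):
    """Build (but do not send) an HTTP PUT that would store one JSON document."""
    body = json.dumps(document).encode("utf-8")
    return Request(
        url=f"{BASE_URL}/{database}/{doc_id}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_put_request(
    "students", "jim", {"name": "Jim", "scores": [75, 99, 87.2]}
)
```

Passing the request to urllib.request.urlopen would actually send it. That step is left out here so the sketch runs without a server.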

A lot of people incorrectly presume that nonrelational databases are flat files. The documents stored in a document database are capable of containing shaped data: trees with nodes. Each record in the database is a document and can be an autonomous set of data. It’s self-describing—including its possibly unique schema—and isn’t necessarily dependent on any other document.

Following is a typical example of what a record might look like in a document database (I’ll steal a sample from the MongoDB tutorial that represents a student):

{

  "name" : "Jim",

  "scores" : [ 75, 99, 87.2 ]

}

And here’s one from the CouchDB introductory article, which describes a book:

{

  "Subject": "I like Plankton",

  "Author": "Rusty",

  "PostedDate": "5/23/2006",

  "Tags": ["plankton", "baseball", "decisions"],

  "Body": "I decided today that I don't like baseball. I like plankton."

}

These are simple structures with string data, numbers and arrays. You can also embed objects within objects for a more complex document structure, such as this blog post example: 

{

  "BlogPostTitle": "LINQ Queries and RavenDB",

  "Date": "\/Date(1266953391687+0200)\/",

  "Content": "Querying RavenDB is very familiar for .NET developers who are already using LINQ for other purposes",

  "Comments": [

    {

      "CommentorName": "Julie",

      "Date": "\/Date(1266952919510+0200)\/",

      "Text": "Thanks for using something I already know how to work with!",

      "UserId": "users/203907"

    }

  ]

}
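To show that such a document really is a tree and not a flat record, here is a small sketch that parses a trimmed-down version of the blog post and lists every leaf with its path. The walk function and the path notation are made up for this illustration.

```python
import json

raw = """
{
  "BlogPostTitle": "LINQ Queries and RavenDB",
  "Comments": [
    {"CommentorName": "Julie", "UserId": "users/203907"}
  ]
}
"""

post = json.loads(raw)


def walk(node, path=""):
    """Yield (path, value) for every leaf in a nested JSON structure."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from walk(value, f"{path}/{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from walk(value, f"{path}[{index}]")
    else:
        yield path, node


leaves = dict(walk(post))
```

The comment’s author ends up at /Comments[0]/CommentorName, two levels below the root of the document.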

Unique Keys

All of the databases require a key. If you don’t provide one, they’ll create one internally for you. Keys are critical to the databases’ ability to index, but your own domain may require that you have known keys. In the previous blog post example, notice that there’s a reference to “users/203907.” This is how RavenDB leverages key values and allows you to define relationships between documents.
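The two ideas in this section, generated keys and keys used as references, can be sketched with a toy in-memory store. DocumentStore is a hypothetical class written for this example and says nothing about how RavenDB or the others generate or index keys internally.

```python
import uuid


class DocumentStore:
    """Toy in-memory store: every document lives under a unique string key."""

    def __init__(self):
        self._docs = {}

    def put(self, document, key=None):
        # Generate a key when the caller does not supply one.
        if key is None:
            key = uuid.uuid4().hex
        self._docs[key] = document
        return key

    def get(self, key):
        return self._docs[key]


store = DocumentStore()

# A known, domain-defined key.
store.put({"Name": "Julie"}, key="users/203907")

# No key supplied, so the store generates one.
comment_key = store.put({"Text": "Thanks!", "UserId": "users/203907"})

# Follow the reference stored in the comment to load the related user document.
comment = store.get(comment_key)
author = store.get(comment["UserId"])
```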

Storage in JSON Format

What these sample records all have in common is that they’re using JSON to store their data. CouchDB and RavenDB (and many others) do in fact store their data in JSON. MongoDB uses a twist on JSON called Binary JSON (BSON), a binary-encoded serialization of JSON-like documents. BSON is the internal representation of the data, so from a programming perspective, you shouldn’t notice any difference.

The simplicity of JSON makes it easy to transpose object structures of almost any language into JSON. Therefore, you can define your objects in your application and store them directly in the database. This relieves developers of the need to use an object-relational mapper (ORM) to constantly translate between the database schema and the class/object schema.
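Here is a minimal sketch of that round trip using nothing but the Python standard library. The BlogPost and Comment classes are invented for the example. A real client API for any of these databases would handle the serialization for you.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Comment:
    commentor_name: str
    text: str


@dataclass
class BlogPost:
    title: str
    content: str
    comments: List[Comment] = field(default_factory=list)


post = BlogPost(
    title="LINQ Queries and RavenDB",
    content="Querying RavenDB is very familiar for .NET developers.",
    comments=[Comment("Julie", "Thanks!")],
)

# asdict() recurses into nested dataclasses, so the whole object graph
# becomes one JSON document with no mapping layer in between.
document = json.dumps(asdict(post))

# Reading it back: rebuild the objects from the parsed JSON.
data = json.loads(document)
restored = BlogPost(
    title=data["title"],
    content=data["content"],
    comments=[Comment(**c) for c in data["comments"]],
)
```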

Full-text searching engines—such as Lucene (lucene.apache.org), which is what RavenDB relies on—provide high-performance searching on this text-based data.

Notice the date in the blog post example. JSON doesn’t have a date type, but each of the databases provides a way to interpret date types from whichever language you’re coding in. If you check out the Data Types and Conventions list for the MongoDB BSON API (bit.ly/o87Gnx), you’ll see that a date type is added, along with a few others, to flesh out what’s available in JSON.
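The date strings in the blog post example use the "/Date(milliseconds+offset)/" convention that some .NET JSON serializers emit. The sketch below shows one way a client could turn such a string into a real date value. It assumes the number counts milliseconds since January 1, 1970 UTC and that the suffix is a display offset. parse_json_date is a hypothetical helper, not an API from any of the databases.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches strings such as "/Date(1266953391687+0200)/".
DATE_PATTERN = re.compile(r"^/Date\((-?\d+)([+-])(\d{2})(\d{2})\)/$")


def parse_json_date(value):
    """Convert a "/Date(ms+hhmm)/" string into a timezone-aware datetime."""
    match = DATE_PATTERN.match(value)
    if match is None:
        raise ValueError(f"not a /Date(...)/ string: {value!r}")
    millis, sign, hours, minutes = match.groups()
    offset = timedelta(hours=int(hours), minutes=int(minutes))
    if sign == "-":
        offset = -offset
    # Assumption: the number is milliseconds since 1970-01-01 UTC and the
    # suffix only says which local offset to display the instant in.
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    instant = epoch + timedelta(milliseconds=int(millis))
    return instant.astimezone(timezone(offset))


# After JSON parsing, the escaped "\/" in the document becomes a plain "/".
posted = parse_json_date("/Date(1266953391687+0200)/")
```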

Storing and retrieving related data in a single unit can have huge performance and scalability benefits. Databases don’t have to go trolling around to find data that’s commonly related, because it’s all together.

Collections of Types

When interacting with the database, how does your application know that one item is a student, another is a book and another is a blog post? The databases use a concept of collections. Any document, regardless of its schema, that’s associated with a particular collection—for example, a student collection—can be retrieved when requesting data from that collection. It’s also not uncommon to use a field to indicate type. This just makes searches a lot easier, but it’s up to your application to enforce what should and shouldn’t go into a collection.

Schema-Less Database

The “student” described earlier contains its own schema. Each record is responsible for its own schema, even those contained in a single database or collection. And one student record doesn’t necessarily need to match another student record. Of course, your software will need to accommodate any differences. You could simply leverage this flexibility for efficiency. For example, why store null values? You could do the following when a property, such as “most_repeated_class,” has no value:

{

  "name" : "Jim",

  "scores" : [ 75, 99, 87.2 ]

}

{

  "name" : "Julie",

  "scores" : [ 50, 40, 65 ],

  "most_repeated_class" : "Time Management 101"

}

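On the application side, accommodating those differences can be as simple as treating a missing field as “no value.” The following is a minimal sketch using plain dictionaries in place of documents fetched from a database. The describe function is invented for the example.

```python
students = [
    {"name": "Jim", "scores": [75, 99, 87.2]},
    {"name": "Julie", "scores": [50, 40, 65],
     "most_repeated_class": "Time Management 101"},
]


def describe(student):
    """Summarize a student document, tolerating fields that are absent."""
    average = sum(student["scores"]) / len(student["scores"])
    # .get() returns None when the document simply has no such field.
    repeated = student.get("most_repeated_class")
    summary = f"{student['name']}: average {average:.1f}"
    if repeated is not None:
        summary += f", most repeated class: {repeated}"
    return summary


summaries = [describe(s) for s in students]
```
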
Yes, Virginia, We Do Support Transactions

Each of the databases provides some level of transaction support—some more than others—but none are as rich as what can be achieved in a relational database. I’ll defer to their documentation and let you follow up with your own additional research.

Document Databases and Domain-Driven Development

One of the core concepts of domain-driven development relates to modeling your domain using aggregate roots. When planning your domain classes (which may become the documents in your database), you can look for data that’s most often self-contained (for example, an order with its line items) and focus on that as an individual data structure. In an ordering system, you’ll probably also have customers and products. But an order might be accessed without needing its customer information and a product might be used without needing access to the orders in which it’s used. This means that although you’ll find many opportunities to have self-contained data structures (such as the order with its line items), that doesn’t rule out the need or capability to join data through foreign keys in certain scenarios.
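As an illustrative sketch of that split, the order below embeds its line items (they belong to the aggregate) but refers to its customer and products only by key. All of the keys, field names and helper functions are assumptions made up for the example.

```python
customers = {"customers/42": {"name": "Jim"}}  # separate documents, keyed by id

order = {
    "id": "orders/1001",
    "customer_id": "customers/42",   # reference by key, like a foreign key
    "lines": [                       # line items live inside the aggregate
        {"product_id": "products/7", "quantity": 2, "unit_price": 9.50},
        {"product_id": "products/9", "quantity": 1, "unit_price": 20.00},
    ],
}


def order_total(order_doc):
    """Total an order using only data stored inside the order document."""
    return sum(line["quantity"] * line["unit_price"] for line in order_doc["lines"])


def customer_name(order_doc, customer_docs):
    """Resolve the customer reference only when the caller needs it."""
    return customer_docs[order_doc["customer_id"]]["name"]


total = order_total(order)
name = customer_name(order, customers)
```

Totaling the order touches one document. Only the customer lookup needs a second read, and only when the caller asks for it.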

Each of the databases provides guidance on various patterns that are available and which ones their users are having the most success with. For example, MongoDB documentation talks about a pattern called Array of Ancestors, which speeds up access to related data when joining documents.

Concerns about navigating relationships are bound to the fact that in a relational database, repeating data is a sin. Databases are normalized to ensure this. When working in NoSQL databases, especially those that are distributed, denormalizing your data is useful and acceptable.

Querying and Updating

Each database comes with APIs for querying and updating. While they may not be part of the core API, a variety of language APIs are supplied through add-ons. As a .NET Framework entry into the document database world, RavenDB uses LINQ for querying—a nice benefit for .NET developers.

Other queries depend on predefined views and a pattern called map/reduce. The map part of this process uses the views, and the responsibility of the map differs between databases. The map also enables the database to distribute the query processing across multiple processors. Reduce takes the result of the map query (or queries, if it has been distributed) and aggregates the data into the results to be returned to the client.
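The pattern itself is easy to sketch in a few lines, with the caveat that this is a single-process toy and not how any of these databases implements its views. The sample documents and the map_tags and reduce_counts functions are made up for the illustration, which counts how often each tag appears.

```python
from collections import defaultdict
from functools import reduce

posts = [
    {"Author": "Rusty", "Tags": ["plankton", "baseball", "decisions"]},
    {"Author": "Julie", "Tags": ["plankton", "databases"]},
]


def map_tags(document):
    """Map step: emit one (tag, 1) pair per tag in a single document."""
    return [(tag, 1) for tag in document["Tags"]]


def reduce_counts(pairs):
    """Reduce step: group emitted pairs by key and sum each group."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(lambda a, b: a + b, values) for key, values in grouped.items()}


# Each document is mapped independently, which is what lets a real database
# spread the map work out; here it all runs in one process.
emitted = [pair for doc in posts for pair in map_tags(doc)]
tag_counts = reduce_counts(emitted)
```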

Map/reduce is a pattern, and the various databases have their own implementations. Rob Ashton provides an interesting comparison of how RavenDB and CouchDB perform map/reduce at bit.ly/94OCME.

While CouchDB requires that you query through predefined map/reduce views, MongoDB (also using views and map/reduce) additionally provides the ability to do ad hoc querying. RavenDB allows predefined indexes for querying, but also supports ad hoc queries, and will create indexes automatically for you based on your actual runtime queries. For the most part, however, when moving away from the known schemas and relational nature of the SQL databases, the ability to perform ad hoc querying is one of the features you lose. By having tight control over the querying, the document databases are able to promise their fast performance.
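The difference between the two query styles can be sketched with plain dictionaries. This is a simplification under stated assumptions, not any database’s real index format: the documents, ad_hoc_query and build_index are all hypothetical.

```python
from collections import defaultdict

documents = {
    "posts/1": {"Author": "Rusty", "Subject": "I like Plankton"},
    "posts/2": {"Author": "Julie", "Subject": "LINQ Queries and RavenDB"},
    "posts/3": {"Author": "Rusty", "Subject": "Decisions"},
}


def ad_hoc_query(docs, predicate):
    """Ad hoc: test every document against an arbitrary predicate (full scan)."""
    return sorted(key for key, doc in docs.items() if predicate(doc))


def build_index(docs, field):
    """Predefined: precompute field value -> document keys once, ahead of time."""
    index = defaultdict(list)
    for key, doc in docs.items():
        index[doc[field]].append(key)
    return index


by_author = build_index(documents, "Author")

scan_result = ad_hoc_query(documents, lambda d: d["Author"] == "Rusty")
index_result = sorted(by_author["Rusty"])
```

Both approaches return the same keys. The predefined index answers with a single lookup but only for the question it was built for, while the ad hoc scan can answer anything at the cost of visiting every document.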

A Database Revolution

There are so many nonrelational databases out there under the NoSQL umbrella. And now that the door has opened, it’s inspiring more to come as folks look at what’s available and dream of how they might improve on it. I think RavenDB is a great example of this, and you can watch how Rahien is evolving the database as he continues to dream about how to make it better or becomes inspired by users.

I believe the intrigue about these databases is infectious. I definitely look forward to digging further and learning more. But even the three I’ve looked at are so interesting that it’s hard for this Libra to choose among them, because at present, I’m solving a curiosity problem and not a real business problem, and relational databases happen to be the right fit for my current projects.


Julie Lerman is a Microsoft MVP, .NET mentor and consultant who lives in the hills of Vermont. You can find her presenting on data access and other Microsoft .NET topics at user groups and conferences around the world. She blogs at thedatafarm.com/blog and is the author of the highly acclaimed book, “Programming Entity Framework” (O’Reilly Media, 2010). Follow her on Twitter at twitter.com/julielerman.

Thanks to the following technical experts for reviewing this article: Ted Neward and Savas Parastatidis