June 2015

Volume 30 Number 6


Data Points - An Overview of Microsoft Azure DocumentDB

By Julie Lerman

In November 2011, I wrote a column called “What the Heck Are Document Databases?” (msdn.microsoft.com/magazine/hh547103) in which I discussed some of the most well-known document databases: MongoDB, CouchDB and RavenDB. All three of these NoSQL databases are still going strong. At that time, Microsoft didn’t have a document database on the market, though it did have Microsoft Azure Table Storage, a NoSQL database based on key-value pairs. However, in August 2014, Microsoft Azure DocumentDB was announced, which, as its name implies, is a NoSQL document database service available on Azure.

In this column, I’ll provide an overview of Azure DocumentDB that I hope will intrigue you enough to investigate it further on your own. The service is available on Azure and was already in use even before its April 8 elevation from a preview to a generally available resource. For example, there’s a great customer story about a company called SGS that implemented DocumentDB as part of a solution at bit.ly/1GMnBd9. One of the developers on that project sent me a tweet about this and said the customer is really happy with it so far.

What Is a Document Database?

My earlier column focused on answering this question, but I’ll discuss it briefly here and recommend you read that other column. A document database stores data as documents, mostly as individual JSON documents. (MongoDB has a small twist because it squishes its JSON documents into a binary format called BSON.) This storage provides much faster performance when working with huge amounts of data because it doesn’t have to jump all over the database to pull together related data. Related data can be combined in a single JSON document. Another critical feature of a document database and other NoSQL databases is that they’re schema-less. Unlike a relational database whose tables need predefined schemas in order to store and retrieve data, a document database allows each document to define its own schema. So the database is made up of collections of documents. Figure 1 shows a simple example of what an individual JSON document might look like. Notice that it specifies a property name along with the value and that it contains related data.

Figure 1 A Simple JSON Document

{
  "RecipeName": "Insane Ganache",
  "DerivedFrom": "Café Pasqual’s Cookbook",
  "Comments":"Insanely rich. Estimate min 20 servings",
  "Ingredients":[
    {
      "Name":"Semi-Sweet Chocolate",
      "Amount":"1.5 lbs",
      "Note":"Use a bar, not bits. Ghiradelli FTW"
    },
    {
      "Name":"Heavy cream",
      "Amount":"2 cups"
    },
    {
      "Name":"Unsalted butter",
      "Amount":"2 tbs"
    }
  ],
  "Directions": "Combine chocolate, cream and butter in the top ..."
}

Not only is this data all in JSON format and self-describing, but it contains related data (ingredients). Another feature common to many of these document databases is that they’re accessible via HTTP calls. You’ll see more about this later, and again, the earlier column goes into detail about these and other features that are common to these databases.
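To make the schema-less idea concrete, here is a minimal sketch in plain JavaScript with no database involved. The in-memory collection array, the helper names and the second document are made up for illustration; only the ganache document follows Figure 1.

```javascript
// Illustrative only: an in-memory array stands in for a collection of documents.
const collection = [
  {
    RecipeName: 'Insane Ganache',
    Ingredients: [
      { Name: 'Semi-Sweet Chocolate', Amount: '1.5 lbs' },
      { Name: 'Heavy cream', Amount: '2 cups' },
      { Name: 'Unsalted butter', Amount: '2 tbs' }
    ]
  },
  // Hypothetical second document with a different shape: no Ingredients, extra Rating.
  { RecipeName: 'Plain Toast', Rating: 3 }
];

// Related data travels with its document, so no join is needed to list ingredients.
function ingredientNames(doc) {
  return (doc.Ingredients || []).map(i => i.Name);
}

// Find every recipe that uses a given ingredient, tolerating documents that lack the array.
function recipesUsing(docs, ingredient) {
  return docs
    .filter(d => ingredientNames(d).includes(ingredient))
    .map(d => d.RecipeName);
}

console.log(recipesUsing(collection, 'Heavy cream'));
```

Both documents sit in the same collection even though their shapes differ, so any code that reads them has to tolerate missing properties.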

Structure of Azure DocumentDB

Figure 1 shows what a typical document stored in a document database looks like. Azure DocumentDB is made up of more than just these documents, however. Documents are considered a resource in Azure DocumentDB, and may be grouped into collections, which are also resources in DocumentDB. You can create, update, delete and query collections using HTTP calls, just as you can with documents. In fact, DocumentDB is entirely made up of different types of resources. Collections are grouped into a single DocumentDB database. You can have multiple databases in a DocumentDB account and you can have multiple accounts.

All of these resources are first-class citizens in the ecosphere. In addition, there’s another set of resources that accompany your documents. These are named in a way that’s familiar to relational database users: Stored procedures, User-defined functions (UDFs), Indexes and Triggers.

One last resource related to documents is an attachment, which is any kind of binary that’s attached to the JSON document. The binary of the attachment lives in Azure Blob Storage but the metadata is stored in DocumentDB, ensuring you can query for attachments on a variety of properties.

Additionally, DocumentDB has built-in security features and, within that scope, Users and Permissions are also resources you can interact with using the same means as you would documents.

Interacting with DocumentDB

There are a number of ways to work with resources in Azure DocumentDB, including: SQL, REST API and various client APIs including the .NET API, which lets you use LINQ to query the database. You can learn much more about the details of querying at bit.ly/1aLm4bC.

The Azure Portal is where you create and manage a DocumentDB account. (See the documentation at bit.ly/1Cq8zE7.) You can also manage your DocumentDB in the portal, as well as use Document Explorer and Query Explorer to view and query your documents. In Query Explorer, you can use the SQL syntax, as I’ve done for the simple query in Figure 2.

Figure 2 Using SQL Syntax to Query Documents in the Azure Portal Query Explorer

You can also use this SQL in your apps. For example, here’s some code from “Build a Node.js Web Application Using DocumentDB” (bit.ly/1E7j5Wg), where the query is expressed in SQL syntax:

getOrCreateDatabase: function (client, databaseId, callback) {
  var querySpec = {
    query: 'SELECT * FROM root r WHERE r.id=@id',
    parameters: [{
      name: '@id',
      value: databaseId
    }]
  };

In these early days of DocumentDB, you might find this SQL syntax limited, but keep in mind that you can supplement the existing SQL with UDFs. For example, you can write your own CONTAINS function for building predicates that evaluate strings, such as CONTAINS(r.name, "Chocolate").
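As a rough idea of what such a function could look like, the sketch below is ordinary JavaScript that can be run outside the database. The name contains and its handling of missing values are assumptions; registering it as a UDF and the exact calling syntax inside a query are not shown here.

```javascript
// Hypothetical body for a CONTAINS-style UDF: true when `text` includes `fragment`.
function contains(text, fragment) {
  // Documents are schema-less, so the property may be missing or not a string.
  if (typeof text !== 'string' || typeof fragment !== 'string') {
    return false;
  }
  return text.indexOf(fragment) !== -1;
}

// Local check against sample values, standing in for r.name in a query predicate.
const sampleNames = ['Semi-Sweet Chocolate', 'Heavy cream', undefined];
console.log(sampleNames.map(n => contains(n, 'Chocolate')));
```

Returning false for a missing property, rather than throwing, is a design choice that keeps one oddly shaped document from failing the whole predicate.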

Like many other Azure resources, Azure DocumentDB has a native REST API and can be queried and updated using HTTP. Every resource has a unique URI. Here’s an example of an HTTP request for a particular DocumentDB permission:

GET https://contosomarketing.documents.azure.com/dbs/ruJjAA==/users/ruJjAFjqQAA=/permissions/ruJjAFjqQABUp3QAAAAAAA== HTTP/1.1
x-ms-date: Sun, 17 Aug 2014 03:02:32 GMT
authorization: type%3dmaster%26ver%3d1.0%26sig%3dGfrwRDuhd18ZmKCJHW4OCeNt5Av065QYFJxLaW8qLmg%3d
x-ms-version: 2014-08-21
Accept: application/json
Host: contosomarketing.documents.azure.com
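Two details of that request are easier to see when pulled apart in code. The sketch below uses only standard JavaScript functions; the helper names parseResourcePath and decodeAuthHeader are made up for illustration. It splits the URI path into resource-type and id pairs, and URL-decodes the authorization header to reveal its type, version and signature fields.

```javascript
// Values copied from the sample request above.
const path = '/dbs/ruJjAA==/users/ruJjAFjqQAA=/permissions/ruJjAFjqQABUp3QAAAAAAA==';
const authorization =
  'type%3dmaster%26ver%3d1.0%26sig%3dGfrwRDuhd18ZmKCJHW4OCeNt5Av065QYFJxLaW8qLmg%3d';

// Turn "/dbs/a/users/b" into [{ type: 'dbs', id: 'a' }, { type: 'users', id: 'b' }].
function parseResourcePath(resourcePath) {
  const parts = resourcePath.split('/').filter(p => p.length > 0);
  const pairs = [];
  for (let i = 0; i + 1 < parts.length; i += 2) {
    pairs.push({ type: parts[i], id: parts[i + 1] });
  }
  return pairs;
}

// URL-decode the header, then split each "key=value" field on its first "=" only
// (the signature itself ends in "=", so a plain split('=') would truncate it).
function decodeAuthHeader(header) {
  const fields = {};
  for (const field of decodeURIComponent(header).split('&')) {
    const eq = field.indexOf('=');
    fields[field.slice(0, eq)] = field.slice(eq + 1);
  }
  return fields;
}

console.log(parseResourcePath(path).map(p => p.type).join(' > '));
console.log(decodeAuthHeader(authorization));
```

The path reads as a hierarchy (database, then user, then permission), which is why every resource can have its own address.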

Go to bit.ly/1NUIUd9 for details about working with the REST API directly. But working with any REST API can be pretty cumbersome. There are a number of client APIs already available for interacting with Azure DocumentDB: .NET, Node.js, JavaScript, Java and Python. Download the SDKs and read the documentation at bit.ly/1Cq9iVJ.

.NET developers will appreciate that the .NET library lets you query using LINQ. While the LINQ method support will definitely grow over time, the currently supported LINQ expressions are: Queryable.Where, Queryable.Select and Queryable.SelectMany.
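For readers less familiar with LINQ, the three operators correspond roughly to familiar array operations. The sketch below is plain JavaScript over an in-memory array, not the DocumentDB client, and the sample data is invented. Here filter plays the role of Where, map of Select and flatMap of SelectMany.

```javascript
// Invented sample documents held in memory.
const recipes = [
  { name: 'Ganache', servings: 20, ingredients: ['chocolate', 'cream', 'butter'] },
  { name: 'Toast', servings: 1, ingredients: ['bread', 'butter'] }
];

// Where: keep only documents matching a predicate.
const large = recipes.filter(r => r.servings > 10);

// Select: project each document to a new shape.
const recipeNames = recipes.map(r => r.name);

// SelectMany: flatten a nested array across all documents.
const allIngredients = recipes.flatMap(r => r.ingredients);

console.log(large.length, recipeNames, allIngredients);
```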

Before you can perform any interaction with a DocumentDB, you need to specify an account, the database and the collection in which you want to work. The following, for example, defines a Microsoft.Azure.Documents.Client.DocumentClient using the .NET API:

string endpoint = ConfigurationManager.AppSettings["endpoint"];
string authKey = ConfigurationManager.AppSettings["authKey"];
Uri endpointUri = new Uri(endpoint);
client = new DocumentClient(endpointUri, authKey);

This sample code comes from an ASP.NET MVC and DocumentDB walk-through I followed on the Azure Documentation page (bit.ly/1HS6OEe). The walk-through is quite thorough, beginning with steps to create a DocumentDB account on the Azure Portal. I highly recommend it or, alternatively, one of the walk-throughs that demonstrate DocumentDB with other languages, such as the Node.js article I mentioned earlier. The sample application has a single type, an Item class, shown in Figure 3.

Figure 3 The Item Class

public class Item
  {
    [JsonProperty(PropertyName = "id")]
    public string Id { get; set; }
    [JsonProperty(PropertyName = "name")]
    public string Name { get; set; }
    [JsonProperty(PropertyName = "descrip")]
    public string Description { get; set; }
    [JsonProperty(PropertyName = "isComplete")]
    public bool Completed { get; set; }
  }

Notice that each property of the Item class specifies a JsonProperty PropertyName. This isn’t required, but it allows the .NET client to map between the stored JSON data and my Item type, and lets me name my class properties however I want, regardless of how they’re named in the database. Using the defined client, you can then express a LINQ query that returns an instance of a Microsoft.Azure.Documents.Database given a known database Id:

var db = Client.CreateDatabaseQuery()
               .Where(d => d.Id == myDatabaseId)
               .AsEnumerable()
               .FirstOrDefault();

From there you can define a collection within the database and finally query the collection with a LINQ expression like the following, which returns a single JSON document:

return Client.CreateDocumentQuery(Collection.DocumentsLink)
             .Where(d => d.Id == id)
             .AsEnumerable()
             .FirstOrDefault();

The various objects within the .NET API also enable operations to insert, update and delete documents with the CreateDocumentAsync, ReplaceDocumentAsync and DeleteDocumentAsync (CUD) methods, which wrap the HTTP calls in the REST API. Like the queries, there are relevant CUD methods for other resource types, such as stored procedures and attachments.

A New Twist on CAP

One of the more interesting aspects of DocumentDB that sets it apart from other document databases is that it lets you tune the consistency. My earlier article on document databases talked about the CAP Theorem, which says that of the three guarantees of consistency, availability and partition tolerance (CAP) in a distributed system, only two can be achieved. Relational databases ensure consistency at the cost of availability (for example, waiting for a transaction to complete). NoSQL databases, on the other hand, are more tolerant of eventual consistency, where the data might not be 100 percent current, in order to favor availability.

Azure DocumentDB provides a new way to address the CAP Theorem by letting you tune the level of consistency, thereby offering a chance to also benefit from both availability and partition tolerance at the same time. You can choose between four levels of consistency—strong, bounded staleness, session and eventual—which can be defined per operation, not just on the database. Rather than all or nothing consistency, you can tune the level of consistency to suit your needs throughout your solutions. Read more about this on the Azure DocumentDB Documentation page at bit.ly/1Cq9p3v.
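The difference between the four levels can be pictured with a toy model. The sketch below is plain JavaScript with one primary log and one lagging replica; it is an illustration of the general idea only and makes no claim about how DocumentDB implements consistency. The class ToyStore, the maxLag option and the sessionToken option are all invented for this example.

```javascript
// Toy model only: a primary log plus a counter for how far a replica has caught up.
class ToyStore {
  constructor() {
    this.log = [];        // every value written, in order
    this.replicated = 0;  // number of log entries the replica has applied
  }
  write(value) {
    this.log.push(value);
    return this.log.length; // version number, reusable as a session token
  }
  replicate(count = 1) {
    this.replicated = Math.min(this.log.length, this.replicated + count);
  }
  // The level is chosen per read, echoing the per-operation tuning described above.
  read(level, { maxLag = 0, sessionToken = 0 } = {}) {
    const latest = this.log.length;
    let version;
    switch (level) {
      case 'strong':   version = latest; break;                                  // always current
      case 'bounded':  version = Math.max(this.replicated, latest - maxLag); break; // at most maxLag behind
      case 'session':  version = Math.max(this.replicated, sessionToken); break; // sees own writes
      case 'eventual': version = this.replicated; break;                         // whatever the replica has
      default: throw new Error('unknown level: ' + level);
    }
    return version === 0 ? undefined : this.log[version - 1];
  }
}

const store = new ToyStore();
store.write('v1');
store.replicate();               // replica now holds v1
store.write('v2');
const token = store.write('v3'); // replica has not caught up yet

console.log(store.read('strong'));
console.log(store.read('bounded', { maxLag: 1 }));
console.log(store.read('session', { sessionToken: token }));
console.log(store.read('eventual'));
```

In this model a session read from the writer sees its own latest write, while a different session without that token falls back to whatever the replica holds.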

Server-Side JavaScript

Many of you are probably familiar with stored procedures and UDFs in relational databases and, unlike other document databases, Azure DocumentDB includes these concepts, although they’re written in JavaScript. JavaScript can natively interact with JSON, so this is extremely efficient for interacting with the JSON documents and other resources. No transformations, translations or mappings are needed. Another benefit of having server-side JavaScript in the form of stored procedures and triggers is that you get atomic transactions across multiple documents: everything in the scope of the transaction will be rolled back if one process fails. Defining stored procedures and UDFs is quite different from what you might be used to in a relational database like SQL Server. The portal doesn’t yet provide that capability. Instead, you define your server-side code in your client-side code. I recommend looking at the Server-Side Script section of the Azure DocumentDB .NET Code Samples at bit.ly/1FiNK4y.

Now I’ll show you how to create and store a stored procedure, then how to execute it. Figure 4 shows a simple example that uses .NET API code to insert a stored procedure into a DocumentDB.

Figure 4 Inserting a Stored Procedure into a DocumentDB

public static async Task<StoredProcedure> InsertStoredProcedure() {
  var sproc = new StoredProcedure
              {
                Id = "Hello",
                Body = @"
                  function() {
                    var context = getContext();
                    var response = context.getResponse();
                    response.setBody('Stored Procedure says: Hello World');
                  };"
              };
  sproc = await Client.CreateStoredProcedureAsync(setup.Collection.SelfLink, sproc);
  return sproc;
}

I’ve encapsulated all of the logic in a single method for simplicity. My StoredProcedure object consists of an ID and a Body. The Body is the server-side JavaScript. You might prefer to create JavaScript files for each procedure and read their contents when creating the StoredProcedure object. The code presumes that the StoredProcedure doesn’t yet exist in the database. In the download example, you’ll see that I call out to a custom method that queries the database to ensure the procedure doesn’t already exist before inserting it. Finally, I use the SetupDocDb<T>.Client property (which provides the DocumentClient instance) to create the stored procedure, similar to how I queried for a document earlier.
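If the body lives in its own JavaScript file, one side benefit is that it can be exercised locally before it’s uploaded. The sketch below is a stand-in built on assumptions: fakeContext and runLocally are invented helpers that mimic only the getContext().getResponse().setBody() calls used in Figure 4, not the real server-side environment.

```javascript
// The same body as Figure 4, written as a named function so it can be called directly.
function hello() {
  var context = getContext();
  var response = context.getResponse();
  response.setBody('Stored Procedure says: Hello World');
}

// Invented stand-in for the server-side context; it only records the response body.
function fakeContext() {
  const state = { body: undefined };
  return {
    state,
    getResponse: () => ({ setBody: value => { state.body = value; } })
  };
}

// Run a procedure with a temporary global getContext, then remove it again.
function runLocally(procedure) {
  const context = fakeContext();
  globalThis.getContext = () => context;
  try {
    procedure();
  } finally {
    delete globalThis.getContext;
  }
  return context.state.body;
}

console.log(runLocally(hello));
```

A harness like this checks only the script’s own logic; anything that depends on real collections or transactions still needs to be tested against the service.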

Now that the stored procedure exists in the database, I can use it. This was a little difficult for me to wrap my head around because I’m used to the way SQL Server works and this is different. Even though I know the procedure’s Id is “Hello,” with the current API that’s not enough to identify it when calling ExecuteStoredProcedureAsync. Every resource has a SelfLink created by the DocumentDB. A SelfLink is an immutable key that supports the REST capabilities of DocumentDB. It ensures every resource has an immutable HTTP address. I need that SelfLink to tell the database which stored procedure to execute. That means I must first query the database to find the stored procedure using the familiar Id (“Hello”) so I can find its SelfLink value. This workflow is causing friction for developers and the DocumentDB team is changing how it works to eliminate any need for SelfLinks. That change may even be made by the time this article has gone to press. But for now, I’ll query for the procedure as I would for any DocumentDB resource: I’ll use the CreateStoredProcedureQuery method. Then, with the SelfLink, I can execute the procedure and get its results:

public static async Task<string> GetHello() {
  StoredProcedure sproc = Client.CreateStoredProcedureQuery(Collection.SelfLink)
    .Where(s => s.Id == "Hello")
    .AsEnumerable()
    .FirstOrDefault();
  var response =
    (await Client.ExecuteStoredProcedureAsync<dynamic>(sproc.SelfLink)).Response;
  return response.ToString();
}

Creating UDFs is similar. You define the UDF as JavaScript in a UserDefinedFunction object and insert it into the DocumentDB. Once it exists in the database, you can use that function in your queries. Initially, that was possible only using the SQL syntax as a parameter of the CreateDocumentQuery method, although the LINQ support was added just prior to the official release of DocumentDB in early April 2015. Here’s an example of a SQL query using a custom UDF:

select r.name,udf.HelloUDF() AS descrip from root r where r.isComplete=false

The UDF simply spits out some text so it takes no parameters.

Notice that I’m using the JsonProperty names in the query because it will be processed on the server against the JSON data. With LINQ queries I’d use the property names of the Item type instead.

You’ll find a similar query, using the same HelloUDF function, in the sample download.
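To show what that query computes, the sketch below replays it over an in-memory array in plain JavaScript. The returned text and the sample documents are invented; only the function name, the property names (name, descrip, isComplete) and the filter come from the query above.

```javascript
// Hypothetical UDF body: takes no parameters and returns fixed text.
function HelloUDF() {
  return 'Hello from a UDF';
}

// Stored documents use the JSON property names, not the .NET property names.
const root = [
  { id: '1', name: 'Buy chocolate', descrip: 'for ganache', isComplete: false },
  { id: '2', name: 'Melt butter', descrip: 'low heat', isComplete: true }
];

// Equivalent of:
//   select r.name, udf.HelloUDF() AS descrip from root r where r.isComplete=false
const results = root
  .filter(r => r.isComplete === false)
  .map(r => ({ name: r.name, descrip: HelloUDF() }));

console.log(results);
```

Note that the projected descrip comes from the function, not from the stored descrip property, because the query aliases the UDF result to that name.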

Performance and Scalability

There are so many factors that come into play when talking about performance and scalability. Even the design of your data models and partitions can impact both of these critical facets of any data store. I highly recommend reading the excellent guidance on modeling data in DocumentDB at bit.ly/1Chrjqa. That article addresses the pros and cons of graph design and relationships and how they affect the performance and scalability of DocumentDB. The author, Ryan CrawCour, who is the senior program manager of the DocumentDB team, explains which patterns benefit read performance and which benefit write performance. In fact, I found the guidance to be useful for model design in general, not just for Azure DocumentDB.

How you choose to partition your database should also be determined by your read and write needs. The article on partitioning data in DocumentDB at bit.ly/1y5T4FG gives more guidance on using DocumentDB collections to define partitions and how to define collections depending on how you’ll need to access the data.

As another benefit of partitioning, you can create (or remove) more collections or databases as needed. DocumentDB scales elastically; that is, it will automatically comprehend the full collection of resources.

Indexes are another important factor affecting performance and DocumentDB allows you to set up indexing policies across collections. Without indexing, you’d only be able to use the SelfLinks and Ids of resources to perform querying, as I did earlier. The default indexing policy tries to find a balance between query performance and storage efficiency, but you can override it to get the balance you want. Indexes are also consistent, which means searches that leverage the indexing will have immediate access to new data. Read more details about indexing at bit.ly/1GMplDm.
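As a conceptual illustration of why an index matters, the sketch below builds a tiny in-memory index in plain JavaScript. It is a toy and says nothing about how DocumentDB stores or configures its indexes; buildIndex and the sample documents are invented.

```javascript
// Invented sample documents.
const docs = [
  { id: 'a', name: 'Ganache', isComplete: false },
  { id: 'b', name: 'Toast', isComplete: true },
  { id: 'c', name: 'Ganache', isComplete: true }
];

// Build a lookup from each value of `property` to the ids of documents holding it.
function buildIndex(documents, property) {
  const index = new Map();
  for (const doc of documents) {
    const key = doc[property];
    if (!index.has(key)) index.set(key, []);
    index.get(key).push(doc.id);
  }
  return index;
}

// Without an index, every document must be inspected.
const scanned = docs.filter(d => d.name === 'Ganache').map(d => d.id);

// With an index, the matching ids are found in a single lookup.
const nameIndex = buildIndex(docs, 'name');
const looked = nameIndex.get('Ganache') || [];

console.log(scanned, looked);
```

The trade-off mentioned above shows up even here: the index answers lookups quickly but takes extra memory and has to be updated on every write.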

Not Free, but Cost Effective

Managing performance and scalability affects more than the accessibility of your data; it also affects the cost of providing that data. As part of its Azure offerings, DocumentDB does come at a price. There are three price points determined by which of three performance-level units you choose. Because Microsoft is constantly tweaking the cost of its services, it’s safest to point you directly to the DocumentDB pricing details page (bit.ly/1IKUUMo). Like any NoSQL database, DocumentDB is aimed at providing data storage for huge amounts of data and, therefore, can be dramatically more cost-effective than working with relational data in the relevant scenarios.


Julie Lerman is a Microsoft MVP, .NET mentor and consultant who lives in the hills of Vermont. You can find her presenting on data access and other .NET topics at user groups and conferences around the world. She blogs at thedatafarm.com and is the author of “Programming Entity Framework” (2010), as well as a Code First edition (2011) and a DbContext edition (2012), all from O’Reilly Media. Follow her on Twitter at twitter.com/julielerman and see her Pluralsight courses at juliel.me/PS-Videos.

Thanks to the following Microsoft technical expert for reviewing this article: Ryan CrawCour
Ryan CrawCour is a 20-year database veteran who started out many years ago writing his first stored procedure for SQL Server 4.2. Many cursors, joins and stored procedures later he began exploring the exciting free world of NoSQL solutions. Ryan is now working with the DocumentDB product team in Redmond as a Program Manager helping shape the future of this all-new NoSQL Database-as-a-Service.