The “NoSQL” approach: struggling to see the benefits

Document-oriented data modeling is still young. The fact is, many more applications will need to be built on the document model before we can say anything definitive about best practices.
MongoDB Data Modeling and Rails

[Screenshot: ISMB 2009 feed, entries by date]

This quote from the MongoDB website sums up, for me, the key problem in moving to a document-oriented, schema-free database: design. It’s easy to end up with a solution that resembles a relational database so closely that you begin to wonder whether you should not just use a relational database. I’d like to illustrate by example.

I’m currently working on a Rails application to archive entries from a FriendFeed feed in a MongoDB database and display some statistics about the feed, using Highcharts. I’ve made some progress; see the screenshot above.

Fetching a feed is pretty easy using the API; here’s how I’d grab the first 30 entries for my own feed:

require 'rubygems'
require 'json/pure'
require 'open-uri'

# fetch the feed "neilfws"; the API returns the first 30 entries by default
feed = JSON.parse(open("http://friendfeed-api.com/v2/feed/neilfws").read)

That generates a hash, feed, with the following structure:

# feed
  sup_id        =>      String
  name          =>      String
  description   =>      String
  type          =>      String
  private       =>      String
  commands      =>      Array
  entries       =>      Array

The entries array contains each entry, along with any comments and likes:

# entry
  url          =>   String
  date         =>   String
  body         =>   String
  from         =>   Hash
  to           =>   Array
  thumbnails   =>   Array  # thumbnails[]{} - url, link, width, height, player
  files        =>   Array  # files[]{} - url, type, name, icon, size
  via          =>   Hash   # via{} - name, url
  geo          =>   Hash   # geo{} - lat, long
  commands     =>   Array
  comments     =>   Array
  likes        =>   Array

My next question: how best to save that to a MongoDB database?

Feed as document? Entry as document? Both.
Since JSON maps to a hash and the hash maps to MongoDB’s document structure, the temptation is always simply to save the hash straight to a collection, i.e. one feed = one document. Of course, that would be a mistake: the maximum document size is 4 MB, which would easily be exceeded by a feed with more than a hundred or so entries. So, we need to break up the feed.

Feeds have entries, entries have comments and likes. An obvious solution then, is to put information about the feed in one collection and the entries in a second collection, with comments and likes embedded inside their entry.

I like the Mongoid ODM. It plays well with the native Ruby mongo driver; I used the latter to save documents quickly and easily, then built the models on top using Mongoid. It can also use existing IDs, as strings. This is discouraged at the MongoDB website, but in this case it makes sense to me that a feed document’s ID should match its FriendFeed feed ID, generating helpful URLs such as /the-life-scientists/entries as opposed to, e.g., /4c5a2e89daa3644ffc000001/entries.
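
Saving with the raw driver then looks something like this. This is a sketch only: the database and collection names are my own choices, and I’m assuming the parsed hash carries the FriendFeed feed ID under the key “id”:

require 'rubygems'
require 'mongo'

db      = Mongo::Connection.new.db("friendfeed")
entries = feed.delete("entries")          # split the entries out of the feed hash

feed["_id"] = feed["id"]                  # use the FriendFeed feed ID as the document ID
db.collection("feeds").save(feed)

entries.each do |entry|
  entry["feed_id"] = feed["_id"]          # manual "foreign key" back to the feed
  db.collection("entries").save(entry)    # comments and likes remain embedded
end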

Using Mongoid, the models look like this:

# feed.rb
class Feed
  include Mongoid::Document
  include Mongoid::Timestamps
# entries
  has_many_related :entries
end

# entry.rb
class Entry
  include Mongoid::Document
  include Mongoid::Timestamps
# embedded
  embeds_many :comments
  embeds_many :likes
# feed
  belongs_to_related :feed
end

# comment.rb
class Comment
  include Mongoid::Document
  embedded_in :entry, :inverse_of => :comments
end

# like.rb
class Like
  include Mongoid::Document
  embedded_in :entry, :inverse_of => :likes
end
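
With these models in place, navigating a feed might look something like this (a sketch, assuming feeds were saved with their FriendFeed IDs as string _ids):

feed  = Feed.find("the-life-scientists")  # string ID, as discussed above
feed.entries.count                        # related documents in the entries collection
entry = feed.entries.first
entry.comments.count                      # embedded documents within the entry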

Already, then, I’ve been forced away from the preferred embedded design and into relational associations.

Defining the fields
The models outlined above require some extra work. Using Mongoid, the default type for a field is String – anything else has to be defined explicitly. The from field in an entry, comment or like, for example, is a Hash.

When experimenting with a database, it’s easy to create “rogue” documents which gain or lose some methods depending on what fields were defined at document creation time. So once we start to define some fields and field types, it’s best to be consistent and define all of them. An entry, for example, now looks like this:

class Entry
  include Mongoid::Document
  include Mongoid::Timestamps

  field :url
  field :date
  field :body
  field :from,           :type => Hash
  field :to,             :type => Array
  field :thumbnails,     :type => Array  # thumbnails[]{} - url, link, width, height, player
  field :files,          :type => Array  # files[]{} - url, type, name, icon, size
  field :via,            :type => Hash   # via{} - name, url
  field :geo,            :type => Hash   # geo{} - lat, long
  field :commands,       :type => Array
# embedded
  embeds_many :comments
  embeds_many :likes
# feed
  belongs_to_related :feed
end
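
A comment, as a sketch (field names assumed from the FriendFeed API comment object), becomes:

# comment.rb
class Comment
  include Mongoid::Document

  field :date
  field :body
  field :from, :type => Hash

  embedded_in :entry, :inverse_of => :comments
end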

Similarly for feed and like. There goes schema-free. Next.

Fetching comments/likes per feed, not per entry
The aim of this application is to display some useful statistical information about feeds and their entries. Feeds have entries and entries have comments/likes, but we are not especially interested in, say, comments for a specific entry. We’d like to know about the activity of the feed: for example, total comments across all entries in the feed, over time.

To access this information, a method like this would be useful:

  @feed.comments.count

And indeed, that would be possible if feeds were related to comments through entries:

# feed
  has_many :comments, :through => :entries

Unfortunately, we cannot relate documents in one collection (feeds) to documents embedded in a second collection (comments within entries). I tried the following approach to fetch comments for a feed:

@feed = Feed.find(params[:feed_id])
@entries = @feed.entries.all
# count comments for feed
@comments_total = @entries.inject(0) do |sum, entry|
  sum + entry.comments.count
end

It’s slow, taking around 3-4 seconds for a few hundred entries and between 20 and 40 seconds for a feed with ~10,000 entries.

One solution: separate collections for entries, comments and likes, with comments and likes linked to both their entry and their feed using foreign keys. In other words: back to the relational approach.
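
As a sketch, that would mean something like this (likes would follow the same pattern):

# comment.rb
class Comment
  include Mongoid::Document

  field :date
  field :body
  field :from, :type => Hash
# related, not embedded
  belongs_to_related :entry
  belongs_to_related :feed
end

with has_many_related :comments added to Feed, so that @feed.comments.count becomes a single query against the comments collection.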

Recap
Let’s review what’s happened so far:

  • I decided to use a non-relational approach, but was forced by the data structure into relating feeds with their entries
  • I’m using a schema-free database, but ended up explicitly defining collection fields
  • The lack of a good has_many :through solution pushed me further towards a relational approach
  • I’m still forced to do a good deal of “hash re-writing” before saving data to collections

At this point you may be asking: what, in this case, is the benefit of using MongoDB or another “NoSQL” solution over, say, MySQL? Frankly, apart from the answer “fast writes”, so am I.

It may be that there are good use cases for “NoSQL”, of which this is not one. It may be that I need to completely rethink my approach: for example, storing all fields in a single collection of entries and using map-reduce to do fast queries. It may be – no, it’s certain – that my Ruby/Rails coding skills need to improve.
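
For example, a map/reduce version of the comment count might look like this. Again, a sketch only, using the raw Ruby driver, and assuming an entries collection in which each entry carries a feed_id and its embedded comments:

require 'rubygems'
require 'mongo'

# map emits (feed_id, number of comments) for each entry;
# reduce sums the counts per feed on the server
map    = "function() { emit(this.feed_id, this.comments.length); }"
reduce = "function(key, values) { return Array.sum(values); }"

db      = Mongo::Connection.new.db("friendfeed")
results = db.collection("entries").map_reduce(map, reduce)
results.find.each { |doc| puts "#{doc['_id']}: #{doc['value']}" }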

That said, I find, repeatedly, that more complex data structures are difficult to “shoe-horn” into MongoDB. In my day job as a bioinformatician, data are not just “big”; they are always complex. The JSON string of a Twitter entry is not complex. A blog, with posts and embedded comments, is not complex. These are the kinds of example applications beloved of trendy web developers and, to be honest, I need less trivial demonstrations to convince me that new technology, exciting though it may be, is useful to me.

12 thoughts on “The “NoSQL” approach: struggling to see the benefits”

  1. Interesting post, although I think the usability of a particular NoSQL database depends on the problem you are trying to solve. As a newbie in Ruby and Rails, I picked neo4j and Redis to handle my particular problems and they worked very well. I was interested in implementing a graph-type data model in NoSQL, as MySQL was definitely not good at that.

  2. For me personally, the problem lies less with NoSQL approaches and more with the fact that I have been working with traditional databases and SQL for so long that I instinctively build the data model in a certain way, because it appears natural to me. I often find myself in a position where it takes more effort to think of an alternative than to use a relational scheme.

    However, I’d be interested in the data model that a complete beginner who has no experience with relational databases would choose to store the data.

    • That’s a good point; there is a real “mental switch” from SQL to NoSQL modelling. I think the “best” NoSQL model is often a single collection of documents containing all fields from what would have been separate SQL tables, but it’s difficult to get there, sometimes.

  3. “NoSql” is not a thing; you can’t just use “Mongo”, realise it was wrong for your purposes and say that “NoSql” is the wrong choice.

    The whole point of the NoSql classification is that there are certain types of storage engine geared towards SPECIFIC purposes, whereas RDBMSs are abused and over-used as a one-size-fits-all technology.

    It’s not just about “fast writes” or whatever.

    Mongo: Fast writes, slow reads, traditional querying but over complex documents, good for certain OLTP situations
    Couchdb: Slow writes (technically), blindingly fast reads because all reads are done via pre-created map/reduce functions
    RavenDB: Fast writes, fast reads, but data is potentially out of date. All reads done via pre-created map/reduce functions

    Etc etc. List goes on.

    • Yes, I dislike the term “NoSQL” myself, I’m aware that it is not “one thing” and I have tried some of the alternatives. I figured that using it in the title would get attention, though – and I was right :-)

  4. That said, I don’t like the term NoSql at all; I’ve been using Solr in an application for ages now, and that’s technically NoSql, and years and years old.

    Right tool, right job, nothing more, nothing less. NoSql is just a fad that is just waiting for its first proper backlash.

  5. I have been looking for bioinformatics-related applications of key/value stores and document databases for a little while now. I’m curious whether you have tried using Mongo for a problem that is poorly suited to the relational model?

    Array data has always been a tortured mess in relational form, because you end up with tall, skinny tables and are forced to use hacks to do queries over multiple probes/SNPs. A cursory look at Mongo made me think it might be good for this, but I don’t know anyone who has tried it.

    • I should say that “despite” this post, I’m a big fan of MongoDB. It’s a great project, I want it to be useful in my work and I’m sure that it will be so. Your use case sounds just right.

      There are some good blog posts around from bioinformaticians using MongoDB: Jan Aerts, Pierre Lindenbaum, Brad Chapman – search for those names + bioinformatics + mongodb and you’ll find what others are doing.

  6. I’ve been playing with MongoDB for a little personal project, with an eye to seeing whether we might move some of our internal apps at work to it. I’m still at the “Oh, this is cool” and “Oh, this is a bit strange” stage, but my overriding sense so far is “This is Lotus Notes all over again”.

    Which is great, because there are some problems that the Notes data store is/was just great for, and implementing them in Rails using Active Record is OK but nowhere near as easy as it was in Notes. For example, we’ve got one app where trying to run with Active Record is tying us up in knots, and we’ll probably try MongoDB and see how it fits.

    I suspect it will often be obvious which data store to use where; but for some apps it might not be so clear-cut. The “Best Practices” the Mongo guys refer to are probably as much about which data store to use for which problems as about how to use your chosen store.

    Picking up on your comment about ending up with a schema: both in Mongo and, in the old days, in Notes, I found myself being explicit about which fields my app expected the objects to have, if only for my own sanity. The “schema-less” part of Mongo is interesting more because I don’t have to do migrations whenever my data structure evolves – although that’s not really a problem in Rails these days.

    What’s more interesting is being able to store fairly complex objects without having to do lots of mapping onto the database. But Mongo’s limitations are a bit in-your-face as you mention. And I was shocked I can’t store a DateTime!

  7. One thing I don’t like about the SQL development scheme is that you first define your schema in the database and create classes in your app to match it, or vice versa: first create classes and generate a schema in your DB through migrations or otherwise. You always end up doing the work twice and have to make sure both are in sync. Sometimes, you even end up with business logic spread between your app and the DB.
    Schema-less storage seems to avoid that kind of problem.
    However, I think the main reason those solutions were created is that they allow better distribution on a very large scale.

  8. Mike: If you have piles of array data which you want to query on, you’d be best off using something like Couch and writing a map/reduce index to do that job.

  9. Pingback: Episode 01 – Neil Saunders | Nodalpoint Conversations
