Document-oriented data modeling is still young. The fact is, many more applications will need to be built on the document model before we can say anything definitive about best practices.
– MongoDB Data Modeling and Rails
This quote from the MongoDB website sums up, for me, the key problem in moving to a document-oriented, schema-free database: design. It’s easy to end up with a solution that resembles a relational database so closely that you begin to wonder whether you should just use a relational database. I’d like to illustrate by example.
I’m currently working on a Rails application to archive entries from a FriendFeed feed in a MongoDB database and display some statistics about the feed, using Highcharts. I’ve made some progress – see the screenshot, right.
Fetching a feed is pretty easy using the API; here’s how I’d grab the first 30 entries for my own feed:
require 'rubygems'
require 'json/pure'
require 'open-uri'
# URI.open is needed on Ruby 3.0+, where a bare open() no longer fetches URLs
feed = JSON.parse(URI.open("http://friendfeed-api.com/v2/feed/neilfws").read)
That generates a hash, feed, with the following structure:
# feed
sup_id => String
name => String
description => String
type => String
private => String
commands => Array
entries => Array
The array entries contains each entry with, if any, comments and likes:
# entry
url => String
date => String
body => String
from => Hash
to => Array
thumbnails => Array # thumbnails[]{} - url, link, width, height, player
files => Array # files[]{} - url, type, name, icon, size
via => Hash # via{} - name, url
geo => Hash # geo{} - lat, long
commands => Array
comments => Array
likes => Array
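To make the structure concrete, here is a minimal sketch that walks a parsed feed and counts comments and likes per entry. The JSON literal is a hypothetical, heavily trimmed stand-in for a real API response (the values are invented), and it assumes, as described above, that the comments and likes keys are simply absent when an entry has none.

```ruby
require 'json'

# Hypothetical, trimmed sample; a real response carries many more keys.
sample = <<~JSON
  {
    "name": "Neil",
    "type": "user",
    "entries": [
      { "body": "First entry", "date": "2010-08-01T10:00:00Z",
        "comments": [ { "body": "Nice" }, { "body": "Agreed" } ],
        "likes": [ { "date": "2010-08-01T11:00:00Z" } ] },
      { "body": "Second entry", "date": "2010-08-02T10:00:00Z" }
    ]
  }
JSON

feed = JSON.parse(sample)

# comments and likes may be missing, so fall back to an empty array
summary = feed['entries'].map do |entry|
  {
    'body'     => entry['body'],
    'comments' => (entry['comments'] || []).size,
    'likes'    => (entry['likes'] || []).size
  }
end

summary.each { |row| puts row.inspect }
```

The `|| []` guard is the important part: without it, calling a method on the missing comments array of the second entry would raise a NoMethodError.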
My next question: how best to save that to a MongoDB database?
Feed as document? Entry as document? Both.
Since JSON maps to a hash and the hash maps to MongoDB document structure, the temptation is always simply to save the hash straight to a collection, i.e. one feed = one document. That would be a mistake: the maximum document size is 4 MB, which a feed with more than a hundred or so entries would easily exceed. So, we need to break up the feed.
Feeds have entries, entries have comments and likes. An obvious solution then, is to put information about the feed in one collection and the entries in a second collection, with comments and likes embedded inside their entry.
I like the Mongoid ODM. It plays well with the native Ruby mongo driver; I used the latter to save documents quickly and easily, then build the models on top using Mongoid. It can also use existing IDs as strings. This is discouraged at the MongoDB website, but in this case it makes sense to me that a document feed ID should match its FriendFeed feed ID and generate helpful URLs, such as /the-life-scientists/entries – as opposed to, e.g. /4c5a2e89daa3644ffc000001/entries.
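The split described above, with the FriendFeed ID reused as the document ID, might be sketched in plain Ruby as follows. This is only an illustration of the hash re-writing involved, not the code from my application: split_feed is a hypothetical helper, 'feed_id' is a field name I made up for the reference back to the feed, and the sample data is invented.

```ruby
# Split a parsed feed hash into one feed document and many entry documents.
# Assumptions: feed_id is supplied by the caller, and 'feed_id' is a
# hypothetical foreign-key field name, not something the API returns.
def split_feed(feed, feed_id)
  # Everything except the entries array describes the feed itself
  feed_doc = feed.reject { |key, _value| key == 'entries' }
  feed_doc['_id'] = feed_id # reuse the FriendFeed ID as the document ID

  # Each entry keeps its embedded comments and likes and gains a
  # reference to its feed
  entry_docs = (feed['entries'] || []).map do |entry|
    entry.merge('feed_id' => feed_id)
  end

  [feed_doc, entry_docs]
end

# Made-up sample data
feed = {
  'name'    => 'The Life Scientists',
  'type'    => 'group',
  'entries' => [
    { 'body' => 'First entry', 'comments' => [{ 'body' => 'Nice' }] },
    { 'body' => 'Second entry' }
  ]
}

feed_doc, entry_docs = split_feed(feed, 'the-life-scientists')
puts feed_doc.inspect
puts entry_docs.length
```

The feed document and the entry documents can then be saved to their own collections. Because reject and merge both return new hashes, the original parsed feed is left untouched.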
Using Mongoid, the models look like this:
# feed.rb
class Feed
  include Mongoid::Document
  include Mongoid::Timestamps
  # entries
  has_many_related :entries
end

# entry.rb
class Entry
  include Mongoid::Document
  include Mongoid::Timestamps
  # embedded
  embeds_many :comments
  embeds_many :likes
  # feed
  belongs_to_related :feed
end

# comment.rb
class Comment
  include Mongoid::Document
  embedded_in :entry, :inverse_of => :comments
end

# like.rb
class Like
  include Mongoid::Document
  embedded_in :entry, :inverse_of => :likes
end
Already then, I’ve been forced away from the preferred, embedded design and into relational associations.
Defining the fields
The models outlined above require some extra work. Using Mongoid, the default type for a field is String – anything else has to be defined explicitly. The from field in an entry, comment or like, for example, is a Hash.
When experimenting with a database, it’s easy to create “rogue” documents which gain or lose some methods depending on what fields were defined at document creation time. So once we start to define some fields and field types, it’s best to be consistent and define all of them. An entry, for example, now looks like this:
class Entry
  include Mongoid::Document
  include Mongoid::Timestamps
  field :url
  field :date
  field :body
  field :from, :type => Hash
  field :to, :type => Array
  field :thumbnails, :type => Array # thumbnails[]{} - url, link, width, height, player
  field :files, :type => Array # files[]{} - url, type, name, icon, size
  field :via, :type => Hash # via{} - name, url
  field :geo, :type => Hash # geo{} - lat, long
  field :commands, :type => Array
  # embedded
  embeds_many :comments
  embeds_many :likes
  # feed
  belongs_to_related :feed
end
Similarly for feed, comment and like. There goes schema-free. Next.
Fetching comments/likes per feed, not per entry
The aim of this application is to display some useful statistical information about feeds and their entries. Feeds have entries and entries have comments/likes, but we are not especially interested in, say, comments for a specific entry. We’d like to know about the activity of the feed: for example, total comments across all entries in the feed, over time.
To access this information, a method like this would be useful:
@feed.comments.count
And indeed, that would be possible if feeds were related to comments through entries:
# feed
has_many :comments, :through => :entries
Unfortunately, we cannot relate documents in one collection, feeds, to embedded documents (e.g. comments) in a second collection, entries. I’ve tried the following approach to fetch comments for a feed:
@feed = Feed.find(params[:feed_id])
@entries = @feed.entries.all
# count comments for feed
@comments_total = @entries.inject(0) do |sum, entry|
  sum + entry.comments.count
end
It’s slow, taking around 3 to 4 seconds for a feed with a few hundred entries and 20 to 40 seconds for a feed with ~10 000 entries.
One solution: separate collections for entries, comments and likes, with comments/likes linked to both entry and feed using foreign keys. In other words – back to the relational approach.
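As a rough sketch of what that would mean for the data, the plain-Ruby snippet below pulls embedded comments out of their entries and stamps each one with two foreign keys. flatten_comments is a hypothetical helper, 'entry_id' and 'feed_id' are field names invented for this example, and the sample entries are made up; likes would be handled in the same way.

```ruby
# Turn embedded comments into standalone documents that point back to
# both their entry and their feed. 'entry_id' and 'feed_id' are
# hypothetical field names, and the sample entries below are made up.
def flatten_comments(entries)
  entries.flat_map do |entry|
    (entry['comments'] || []).map do |comment|
      comment.merge('entry_id' => entry['_id'], 'feed_id' => entry['feed_id'])
    end
  end
end

entries = [
  { '_id' => 'e1', 'feed_id' => 'neilfws',
    'comments' => [{ 'body' => 'Nice' }, { 'body' => 'Agreed' }] },
  { '_id' => 'e2', 'feed_id' => 'neilfws' },
  { '_id' => 'e3', 'feed_id' => 'the-life-scientists',
    'comments' => [{ 'body' => 'Interesting' }] }
]

comments = flatten_comments(entries)

# With a feed key on every comment, a per-feed count is a single filter
neilfws_total = comments.count { |c| c['feed_id'] == 'neilfws' }
puts neilfws_total
```

Once every comment carries its own feed key, counting comments for a feed no longer requires loading each entry, which is exactly what a foreign key buys you in a relational schema.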
Recap
Let’s review what’s happened so far:
- I decided to use a non-relational approach, but was forced by the data structure into relating feeds with their entries
- I’m using a schema-free database, but ended up explicitly defining collection fields
- The lack of a good has_many :through solution pushed me further towards a relational approach
- I’m still forced to do a good deal of “hash re-writing” before saving data to collections
At this point you may be asking: what, in this case, is the benefit of using MongoDB or another “NoSQL” solution over, say, MySQL? Frankly, apart from the answer “fast writes”, so am I.
It may be that there are good use cases for “NoSQL”, of which this is not one. It may be that I need to completely rethink my approach: for example, storing all fields in a single collection of entries and using map-reduce to do fast queries. It may be – no, it’s certain – that my Ruby/Rails coding skills need to improve.
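To show what the map-reduce idea amounts to, here is the same per-feed comment count written as a map step and a reduce step in plain Ruby. This is only an illustration of the pattern, not MongoDB’s API: MongoDB’s own map-reduce takes JavaScript functions and runs them on the server. The helper names, the 'feed_id' field and the sample entries are all hypothetical.

```ruby
# Plain-Ruby illustration of the map-reduce pattern; not MongoDB's API.
# The sample entries and the 'feed_id' field are hypothetical.
entries = [
  { 'feed_id' => 'neilfws',
    'comments' => [{ 'body' => 'Nice' }, { 'body' => 'Agreed' }] },
  { 'feed_id' => 'neilfws' },
  { 'feed_id' => 'the-life-scientists',
    'comments' => [{ 'body' => 'Interesting' }] }
]

# Map step: emit one [key, value] pair per entry
def map_entry(entry)
  [entry['feed_id'], (entry['comments'] || []).size]
end

# Reduce step: combine all values emitted for one key
def reduce_counts(values)
  values.inject(0) { |sum, n| sum + n }
end

emitted = entries.map { |entry| map_entry(entry) }
grouped = emitted.group_by { |key, _count| key }

totals = {}
grouped.each do |key, pairs|
  totals[key] = reduce_counts(pairs.map { |_key, count| count })
end

puts totals.inspect
```

The point of the pattern is that the grouping and summing happen in one pass over a single collection of entries, rather than through a chain of associations.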
That said I find, repeatedly, that more complex data structures are difficult to “shoe-horn” into MongoDB. In my day job as a bioinformatician, data are not just “big” – they are always complex. The JSON string of a Twitter entry is not complex. A blog, with posts and embedded comments, is not complex. These are the kinds of example applications beloved of trendy web developers and to be honest, I need less trivial demonstrations to convince me that new technology, exciting though it may be, is useful to me.



