MongoDB, Mongoid and Map/Reduce

I’ve been experimenting with MongoDB’s map-reduce, called from Ruby, as a replacement for Ruby’s Enumerable methods (map/collect, inject). It’s faster. Much faster.

Next, the details. But first, the disclaimers:

  • My database is not optimised (no indices, for example)
  • My non-map/reduce code is not optimised; I’m sure there are better ways to perform database queries and use Enumerable than those in this post
  • My map/reduce code is not optimised – these are my first attempts

In short, nothing is optimised, my code is probably awful and I’m making it up as I go along. Here we go then!

The Mongoid models
A couple of posts ago, I outlined a Rails application to store feeds and entries from FriendFeed. Here are the models, with fields and methods removed to show just the relations:

# feed.rb
class Feed
  include Mongoid::Document
  include Mongoid::Timestamps
# entries
  has_many_related :entries
end

# entry.rb
class Entry
  include Mongoid::Document
# embedded
  embeds_many :comments
  embeds_many :likes
# feed
  belongs_to_related :feed
end

# comment.rb
class Comment
  include Mongoid::Document
  embedded_in :entry, :inverse_of => :comments
end

# like.rb
class Like
  include Mongoid::Document
  embedded_in :entry, :inverse_of => :likes
end
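
One point worth spelling out about these relations: comments and likes are embedded inside each entry document, while entries live in their own collection and carry a reference back to their feed. As a rough sketch, a single entry document might look like this (the values, and any field names other than date, feed_id, comments and likes, are invented for illustration):

# Rough, illustrative shape of one document in the entries collection;
# the values and the extra field names are made up
entry_document = {
  "_id"      => "e9a1...",                      # document ID (placeholder)
  "feed_id"  => "the-life-scientists",          # reference back to the parent Feed
  "date"     => "2009-06-23T04:26:47Z",         # RFC3339 date string
  "comments" => [                               # embedded Comment documents
    { "date" => "2009-06-23T05:01:12Z", "body" => "Nice paper!" }
  ],
  "likes"    => [                               # embedded Like documents
    { "date" => "2009-06-23T06:15:30Z", "from" => "someuser" }
  ]
}

Fetching an entry therefore brings its comments and likes along with it, whereas feeds and entries are joined via the feed_id reference.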

Fetching entries by date for a feed
These models make it easy to fetch entries per feed in the “usual” Rails way:

@feed = Feed.find(params[:feed_id])
@entries = @feed.entries.all

Each entry has a date field, which is an RFC3339-formatted string (e.g. “2009-06-23T04:26:47Z”). We might count up entries by date using something like inject():

def entries_by_date_ruby(feed)
  @feed    = Feed.criteria.id(feed)
  @entries = @feed.first.entries
  @dates   = @entries.inject(Hash.new(0)) { |h, entry|
    h[entry.date.to_date] += 1;
    h
  }
end

This generates a hash where the keys are dates and the values are the total entries for each date. It works, but it is likely to become slow when a feed has many entries. Without further ado, the map-reduce version:

def entries_by_date_mapreduce(feed)
  map     = "function() { emit(this.date.replace(/T.+/, ''), 1); }"
  reduce  = "function(k, vals) { var sum = 0; for(var i in vals) sum += vals[i]; return sum; }"
  @dates  = Entry.collection.map_reduce(map, reduce, { :query => { :feed_id => feed }, :out => "#{feed}.entries_by_date"})
  @dates.find.inject(Hash.new(0)) {|h, i| h[i.values[0].to_date] = i.values[1]; h}
end

Here’s how it works. First, the map function strips the time from the RFC3339 date and emits each date with a value of 1. The reduce function then sums those values for each date, which equals the total entries on that date. The call to Entry.collection.map_reduce runs these functions over the Entry collection, with the :query option restricting them to entries from the specified feed. The results are saved in an output collection named after the feed with “.entries_by_date” appended. Each document in that collection is a key/value pair of the form { "_id" => date, "value" => count }. Finally, we fetch the documents from the output collection and build a hash as before: key = date, value = total entries on that date.
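
To make that final step concrete, here is a small sketch (with invented documents and counts) of how the output collection is converted into the final hash:

# Map-reduce output documents have the form { "_id" => key, "value" => reduced value };
# the documents and counts below are invented for illustration
docs = [
  { "_id" => "2009-06-23", "value" => 3.0 },   # JavaScript numbers come back as floats
  { "_id" => "2009-06-24", "value" => 7.0 }
]

# The same inject() used in entries_by_date_mapreduce;
# String#to_date is provided by ActiveSupport in the Rails environment
dates = docs.inject(Hash.new(0)) { |h, doc| h[doc["_id"].to_date] = doc["value"]; h }
# dates now maps each Date to its entry count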

Fetching comments/likes by date for a feed
Getting to comments (or likes) per feed is a little more difficult. Let’s say we want to count up all of the comments by date for a feed. We might do something like this:

def comments_by_date_ruby(feed)
  @feed = Feed.criteria.id(feed)
  @entries = @feed.first.entries
  @dates = @entries.inject(Hash.new(0)) { |h, entry|
    entry.comments.map { |comment|
      h[comment.date.to_date] += 1
    };
    h
  }
end

That returns a hash where the keys are dates and the values are the total comments on that date.

All that fetching of records and looping through entries, then through their comments, will be even slower than looping through entries alone, especially since feeds have many more comments than entries. Enter the map-reduce version:

def comments_by_date_mapreduce(feed)
  map    = "function() { if(this.comments) for(var c in this.comments) { emit(this.comments[c].date.replace(/T.+/, ''), 1); }}"
  reduce = "function(k, vals) { var sum = 0; for(var i in vals) sum += vals[i]; return sum; }"
  @dates  = Entry.collection.map_reduce(map, reduce, { :query => { :feed_id => feed }, :out => "#{feed}.comments_by_date"})
  @dates.find.inject(Hash.new(0)) {|h, i| h[i.values[0].to_date] = i.values[1]; h}
end

You can see that this is very similar to the previous example. The only difference is that we have to check whether an entry has comments (not all do), then loop through the comments of each entry and pull out the comment date. The code for retrieving “likes” would be very similar in both the Ruby and map-reduce examples; just substitute “like(s)” for “comment(s)” throughout.
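
As an illustration of that substitution, the map-reduce version for likes would look something like the sketch below (it follows the same pattern but is not benchmarked here, and assumes each embedded Like has a date field just as comments do):

def likes_by_date_mapreduce(feed)
  # same pattern as comments_by_date_mapreduce, with "likes" substituted for "comments"
  map    = "function() { if(this.likes) for(var l in this.likes) { emit(this.likes[l].date.replace(/T.+/, ''), 1); }}"
  reduce = "function(k, vals) { var sum = 0; for(var i in vals) sum += vals[i]; return sum; }"
  @dates = Entry.collection.map_reduce(map, reduce, { :query => { :feed_id => feed }, :out => "#{feed}.likes_by_date"})
  @dates.find.inject(Hash.new(0)) {|h, i| h[i.values[0].to_date] = i.values[1]; h}
end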

Benchmarking
So, here’s the moment of truth. Since we’re working in the Rails environment, we can write a quick and dirty benchmarking rake task and save it in lib/tasks/benchmark.rake:

require 'benchmark'

namespace :test do
  task :mapreduce => :environment do
    feed = ENV['feed']
    puts "#{Feed.criteria.id(feed).first.entries.count} entries"
    puts "#{Feed.criteria.id(feed).first.entries.map { |e| e.comments.count}.sum } comments"
    Benchmark.bmbm do |bm|
      bm.report("#{feed} entries_by_date_ruby") do
        entries_by_date_ruby(feed)
      end
      bm.report("#{feed} entries_by_date_mapreduce") do
        entries_by_date_mapreduce(feed)
      end
      bm.report("#{feed} comments_by_date_ruby") do
        comments_by_date_ruby(feed)
      end
      bm.report("#{feed} comments_by_date_mapreduce") do
        comments_by_date_mapreduce(feed)
      end
    end
  end
end

def entries_by_date_ruby(feed)
  # see code earlier in post
end

def entries_by_date_mapreduce(feed)
  # see code earlier in post
end

def comments_by_date_ruby(feed)
  # see code earlier in post
end

def comments_by_date_mapreduce(feed)
  # see code earlier in post
end

Run that from the Rails project root using “rake test:mapreduce feed=feedID”, where feedID is the ID of a feed in your MongoDB database (e.g. the-life-scientists).

Drum roll… the results for three feeds. First, a small feed, ISMB 2008, with 127 entries and 1204 comments:

127 entries
1204 comments
Rehearsal ------------------------------------------------------------------------
ismb-2008 entries_by_date_ruby         0.370000   0.020000   0.390000 (  0.396332)
ismb-2008 entries_by_date_mapreduce    0.010000   0.000000   0.010000 (  0.055770)
ismb-2008 comments_by_date_ruby        1.140000   0.090000   1.230000 (  1.238332)
ismb-2008 comments_by_date_mapreduce   0.020000   0.000000   0.020000 (  0.118020)
--------------------------------------------------------------- total: 1.650000sec

                                           user     system      total        real
ismb-2008 entries_by_date_ruby         0.290000   0.020000   0.310000 (  0.308145)
ismb-2008 entries_by_date_mapreduce    0.020000   0.000000   0.020000 (  0.035127)
ismb-2008 comments_by_date_ruby        1.150000   0.080000   1.230000 (  1.244103)
ismb-2008 comments_by_date_mapreduce   0.040000   0.000000   0.040000 (  0.134919)

Not bad: map-reduce was about 9x faster when summing both entries and comments by date.

Next, a slightly larger feed – ISMB/ECCB 2009, with 330 entries and 3770 comments:

330 entries
3770 comments
Rehearsal ---------------------------------------------------------------------------
ismbeccb2009 entries_by_date_ruby         1.000000   0.070000   1.070000 (  1.113808)
ismbeccb2009 entries_by_date_mapreduce    0.010000   0.000000   0.010000 (  0.067773)
ismbeccb2009 comments_by_date_ruby        3.590000   0.290000   3.880000 (  3.995940)
ismbeccb2009 comments_by_date_mapreduce   0.010000   0.000000   0.010000 (  0.321649)
------------------------------------------------------------------ total: 4.970000sec

                                              user     system      total        real
ismbeccb2009 entries_by_date_ruby         0.880000   0.050000   0.930000 (  0.948717)
ismbeccb2009 entries_by_date_mapreduce    0.020000   0.000000   0.020000 (  0.047461)
ismbeccb2009 comments_by_date_ruby        3.400000   0.450000   3.850000 (  3.925007)
ismbeccb2009 comments_by_date_mapreduce   0.040000   0.000000   0.040000 (  0.376988)

Even better: map-reduce is 10-20x faster.

Finally, a large feed: The Life Scientists, which currently contains 4031 entries and 13996 comments:

4031 entries
13996 comments
Rehearsal ----------------------------------------------------------------------------------
the-life-scientists entries_by_date_ruby         7.960000   0.460000   8.420000 (  8.760340)
the-life-scientists entries_by_date_mapreduce    0.440000   0.020000   0.460000 (  0.913584)
the-life-scientists comments_by_date_ruby       17.830000   1.590000  19.420000 ( 19.924083)
the-life-scientists comments_by_date_mapreduce   0.280000   0.030000   0.310000 (  1.545204)
------------------------------------------------------------------------ total: 28.610000sec

                                                     user     system      total        real
the-life-scientists entries_by_date_ruby         8.000000   0.490000   8.490000 (  8.623212)
the-life-scientists entries_by_date_mapreduce    0.330000   0.010000   0.340000 (  1.055941)
the-life-scientists comments_by_date_ruby       17.590000   1.550000  19.140000 ( 19.405148)
the-life-scientists comments_by_date_mapreduce   0.320000   0.030000   0.350000 (  1.549654)

Using Ruby and Mongoid methods to fetch and process documents is getting very slow by this stage. The map-reduce improvement is beginning to plateau, but still decreases page rendering time from unacceptable (many seconds) to acceptable (a couple of seconds).
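
Those rendering times come from calling methods like these while building a page. As a minimal sketch of how that might be wired up (the controller, action and instance variable names here are hypothetical, and the method would need to be reachable from the controller, e.g. as a private method or an included module):

# Hypothetical controller action; FeedsController and @entries_by_date are
# illustrative names, not taken from the real application
class FeedsController < ApplicationController
  def show
    @feed            = Feed.criteria.id(params[:id]).first
    # hash of Date => entry count, handed to the view for charting
    @entries_by_date = entries_by_date_mapreduce(params[:id])
  end
end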

In summary: MongoDB’s map-reduce is terrific, even on a single machine, and I can see many applications for it in future projects.

8 thoughts on “MongoDB, Mongoid and Map/Reduce”

  1. Hey Neil,

    Nice write-up!

    As you know, I’ve been using MongoDB quite a bit as well from ruby, including Mongoid. Have also noticed that using a pure ruby way to query the data (e.g. @feed.entries.all) can slow things down immensely when you’re working with millions of documents. It does help to use the pure ruby API instead of something like Mongoid, but mapreduce is definitely the way to go.

    However, what I don’t like about using mapreduce in mongodb is that you have to write the map and reduce functions in javascript at the moment. That is not really an issue when the functions are simple, but can become a significant hurdle when they become more complicated. It also breaks the flow of the code.

    I suppose you noticed that the latest version of MongoDB has auto-sharding? Nice feature :-)

    jan.

    • Yes, I like to use the pure ruby driver wherever possible; it’s much faster than the ODMs. I’d like to know whether it’s possible to build a model-free Rails app with just the mongo driver. I’m sure it is; it would just mean writing more code.

      Know what you mean about the ugly javascript strings, although it’s possible to format them in a nicer way than I have here. One issue is that Ruby’s javascript helpers need to be very good. I’ve had issues with, for example, helpers for the Highcharts javascript library which made them unusable; in the end it was better just to write the pure javascript.

  2. Yep, very nice write-up indeed.

    I still have not encountered a project where it would make sense to use it myself, apart from artificial, small try-out sessions.

    Anyway, I was wondering if you have put your FriendFeed project in a public Github repository or something similar?

  3. Nice post. A very naive Rails question: I guess you have stored the FriendFeed data in MongoDB and the Rails application then uses that. What happens if I want to do real-time analysis?

    • There are quite a few blog posts around on real-time analysis; I can’t think of one offhand right now, but Google the appropriate terms or search Slideshare for “mongodb”; there are some nice presentations there.

      It’s not really applicable to FriendFeed data (unless you’re interested in only recent entries), since it takes quite some time to fetch all entries for a large feed. I guess you could set up some kind of background job, but I’m using a Rake task at the moment to fetch and save entries.

  4. Any benchmarks on how map/reduce fares compared to a SQL “group by”? That is my preferred method for doing similar tasks because relational databases tend to be really good at that (much better than looping over collections in whatever language I’m using).
