I’ve been experimenting with MongoDB’s map-reduce, called from Ruby, as a replacement for Ruby’s Enumerable methods (map/collect, inject). It’s faster. Much faster.
Next, the details but first – the disclaimers:
- My database is not optimised (using, e.g. indices)
- My non-map/reduce code is not optimised; I’m sure there are better ways to perform database queries and use Enumerable than those in this post
- My map/reduce code is not optimised – these are my first attempts
In short nothing is optimised, my code is probably awful and I’m making it up as I go along. Here we go then!
The Mongoid models
A couple of posts ago, I outlined a Rails application to store feeds and entries from FriendFeed. Here are the models, with fields and methods removed to show just the relations:
# feed.rb class Feed include Mongoid::Document include Mongoid::Timestamps # entries has_many_related :entries end # entry.rb class Entry include Mongoid::Document # embedded embeds_many :comments embeds_many :likes # feed belongs_to_related :feed end # comment.rb class Comment include Mongoid::Document embedded_in :entry, :inverse_of => :comments end # like.rb class Like include Mongoid::Document embedded_in :entry, :inverse_of => :likes end
Fetching entries by date for a feed
These models make it easy to fetch entries per feed in the “usual” Rails way:
@feed = Feed.find(params[:feed_id]) @entries = @feed.entries.all
Each entry has a date field, which is a RFC3339 formatted-string (e.g. “2009-06-23T04:26:47Z”). We might count up entries by date using something like inject():
def entries_by_date_ruby(feed)
@feed = Feed.criteria.id(feed)
@entries = @feed.first.entries
@dates = @entries.inject(Hash.new(0)) { |h, entry|
h[entry.date.to_date] += 1;
h
}
end
This generates a hash where keys are dates and values are total entries for that date. OK – but likely to become slow when there are many entries. Without further ado, the map-reduce version:
def entries_by_date_mapreduce(feed)
map = "function() { emit(this.date.replace(/T.+/, ''), 1); }"
reduce = "function(k, vals) { var sum = 0; for(var i in vals) sum += vals[i]; return sum; }"
@dates = Entry.collection.map_reduce(map, reduce, { :query => { :feed_id => feed }, :out => "#{feed}.entries_by_date"})
@dates.find.inject(Hash.new(0)) {|h, i| h[i.values[0].to_date] = i.values[1]; h}
end
Here’s how it works. First, the map function removes the time from the RFC3339 date and emits each date value. Then, the reduce function sums up the number of times a date is emitted (which equals total entries for that date). Line 4 generates a collection object for class Entry, on which the map-reduce function is called, for entries from the specified feed. The results are saved in a collection named after the feed and appended with “.entries_by date”. Each document in this collection is a key/value pair of the form “{:_id => date, :_value => count}“. Finally, we fetch the documents from that collection and store them in a hash as before, key = date and value = total entries on that date.
Fetching comments/likes by date for a feed
Getting to comments (or likes) per feed is a little more difficult. Let’s say we want to count up all of the comments by date for a feed. We might do something like this:
def comments_by_date_ruby(feed)
@feed = Feed.criteria.id(feed)
@entries = @feed.first.entries
@dates = @entries.inject(Hash.new(0)) { |h, entry|
entry.comments.map { |comment|
h[comment.date.to_date] += 1
};
h
}
end
That returns a hash where the keys are dates and the values total comments on that date.
All that fetching records and looping through entries, then comments, will be even slower than looping through entries, especially since feeds have many more comments than entries. Enter the map-reduce version:
def comments_by_date_mapreduce(feed)
map = "function() { if(this.comments) for(var c in this.comments) { emit(this.comments[c].date.replace(/T.+/, ''), 1); }}"
reduce = "function(k, vals) { var sum = 0; for(var i in vals) sum += vals[i]; return sum; }"
@dates = Entry.collection.map_reduce(map, reduce, { :query => { :feed_id => feed }, :out => "#{feed}.comments_by_date"})
@dates.find.inject(Hash.new(0)) {|h, i| h[i.values[0].to_date] = i.values[1]; h}
end
You can see that this is very similar to the previous example. The only difference is that we have to check whether an entry has comments (not all do), then loop through comments per entry and pull out the comment date. The code for retrieving “likes” would be very similar in both Ruby and map-reduce examples; just substitute “like(s)” for “comment(s)”, throughout.
Benchmarking
So, here’s the moment of truth. Since we’re working in the Rails environment, we can write a quick and dirty benchmarking rake task and save it in lib/tasks/benchmark.rake:
namespace :test do
task :mapreduce => :environment do
feed = ENV['feed']
puts "#{Feed.criteria.id(feed).first.entries.count} entries"
puts "#{Feed.criteria.id(feed).first.entries.map { |e| e.comments.count}.sum } comments"
Benchmark.bmbm do |bm|
bm.report("#{feed} entries_by_date_ruby") do
entries_by_date_ruby(feed)
end
bm.report("#{feed} entries_by_date_mapreduce") do
entries_by_date_mapreduce(feed)
end
bm.report("#{feed} comments_by_date_ruby") do
comments_by_date_ruby(feed)
end
bm.report("#{feed} comments_by_date_mapreduce") do
comments_by_date_mapreduce(feed)
end
end
end
end
def entries_by_date_ruby(feed)
# see code earlier in post
end
def entries_by_date_mapreduce(feed)
# see code earlier in post
end
def comments_by_date_ruby(feed)
# see code earlier in post
end
def comments_by_date_mapreduce(feed)
# see code earlier in post
end
Run that from the Rails project root using “rake test:mapreduce feed=feedID“, where feedID is the ID of a feed in your MongoDB database (e.g. the-life-scientists).
Drum roll…the results for 3 feeds. First, a small feed, ISMB 2008, with 127 entries and 1204 comments:
127 entries
1204 comments
Rehearsal ------------------------------------------------------------------------
ismb-2008 entries_by_date_ruby 0.370000 0.020000 0.390000 ( 0.396332)
ismb-2008 entries_by_date_mapreduce 0.010000 0.000000 0.010000 ( 0.055770)
ismb-2008 comments_by_date_ruby 1.140000 0.090000 1.230000 ( 1.238332)
ismb-2008 comments_by_date_mapreduce 0.020000 0.000000 0.020000 ( 0.118020)
--------------------------------------------------------------- total: 1.650000sec
user system total real
ismb-2008 entries_by_date_ruby 0.290000 0.020000 0.310000 ( 0.308145)
ismb-2008 entries_by_date_mapreduce 0.020000 0.000000 0.020000 ( 0.035127)
ismb-2008 comments_by_date_ruby 1.150000 0.080000 1.230000 ( 1.244103)
ismb-2008 comments_by_date_mapreduce 0.040000 0.000000 0.040000 ( 0.134919)
Not bad: map-reduce was about 9x faster, when summing both entries and comments by date.
Next, a slightly larger feed – ISMB/ECCB 2009, with 330 entries and 3770 comments:
330 entries
3770 comments
Rehearsal ---------------------------------------------------------------------------
ismbeccb2009 entries_by_date_ruby 1.000000 0.070000 1.070000 ( 1.113808)
ismbeccb2009 entries_by_date_mapreduce 0.010000 0.000000 0.010000 ( 0.067773)
ismbeccb2009 comments_by_date_ruby 3.590000 0.290000 3.880000 ( 3.995940)
ismbeccb2009 comments_by_date_mapreduce 0.010000 0.000000 0.010000 ( 0.321649)
------------------------------------------------------------------ total: 4.970000sec
user system total real
ismbeccb2009 entries_by_date_ruby 0.880000 0.050000 0.930000 ( 0.948717)
ismbeccb2009 entries_by_date_mapreduce 0.020000 0.000000 0.020000 ( 0.047461)
ismbeccb2009 comments_by_date_ruby 3.400000 0.450000 3.850000 ( 3.925007)
ismbeccb2009 comments_by_date_mapreduce 0.040000 0.000000 0.040000 ( 0.376988)
Even better: map-reduce is 10 – 20x faster.
Finally, a large feed: The Life Scientists, which currently contains 4031 entries and 13996 comments:
4031 entries
13996 comments
Rehearsal ----------------------------------------------------------------------------------
the-life-scientists entries_by_date_ruby 7.960000 0.460000 8.420000 ( 8.760340)
the-life-scientists entries_by_date_mapreduce 0.440000 0.020000 0.460000 ( 0.913584)
the-life-scientists comments_by_date_ruby 17.830000 1.590000 19.420000 ( 19.924083)
the-life-scientists comments_by_date_mapreduce 0.280000 0.030000 0.310000 ( 1.545204)
------------------------------------------------------------------------ total: 28.610000sec
user system total real
the-life-scientists entries_by_date_ruby 8.000000 0.490000 8.490000 ( 8.623212)
the-life-scientists entries_by_date_mapreduce 0.330000 0.010000 0.340000 ( 1.055941)
the-life-scientists comments_by_date_ruby 17.590000 1.550000 19.140000 ( 19.405148)
the-life-scientists comments_by_date_mapreduce 0.320000 0.030000 0.350000 ( 1.549654)
Using Ruby and Mongoid methods to fetch and process documents is getting very slow by this stage. The map-reduce improvement is beginning to plateau, but still decreases page rendering time from unacceptable (many seconds) to acceptable (a couple of seconds).
In summary: MongoDB’s map-reduce is terrific, even on single machines and I see many applications of it to future projects.


