Getting “stuff” into MongoDB

One of the aspects I like most about MongoDB is the “store first, ask questions later” approach. There’s no need to worry about table design, column types or constant migrations as your schema changes. Provided that your data are in some kind of hash-like structure, you can just drop them in.

Ruby is particularly useful for this task, since it has many gems that can parse common formats into a hash. Here are three quick examples relevant to bioinformatics.

1. JSON
JSON is a good fit for MongoDB: when you view a document (represented internally as BSON), the structure looks just the same as the original JSON. I use json/pure in this example, which grabs expression data for a gene from the Gene Expression Atlas API:

require 'open-uri'
require 'json/pure'
require 'mongo'

db   = Mongo::Connection.new.db('test')
col  = db.collection('scratch')

# fetch expression data for gene ENSG00000026025 (VIM) from the Gene Expression Atlas
data = open("http://www.ebi.ac.uk:80/gxa/api?geneIs=ENSG00000026025&species=homo+sapiens&updownIn=EFO_0000365").read
data = JSON.parse(data)

# note: the hash we save is the first element of the 'results' array
col.save(data['results'].first)
col.find_one
# result
{"_id"=>BSON::ObjectId4d58be98daa36430e9000001, "gene"=>{"name"=>"VIM", "organism"=>"Homo sapiens", "omimIds"=>["193060"], "goTerms"=>["axon", "cellular component movement", "cytoplasm", "cytosol", "intermediate filament", "intermediate filament-based process", "interspecies interaction between organisms", "protein C-terminus binding", "protein binding", "protein kinase binding", "structural constituent of cytoskeleton", "structural molecule activity"], "refseqIds"=>["NM_003380"], "diseases"=>["Cataract, pulverulent, autosomal dominant"], "ensemblGeneId"=>"ENSG00000026025", "enstranscripts"=>["ENST00000224237", "ENST00000421459"], "interProTerms"=>["Filament", "Intermediate filament protein", "Intermediate filament protein, conserved site", "Intermediate filament, DNA-binding region", "Intermediate_filament_CS", "Prefoldin", "Tropomyosin", "Intermed_filament_DNA-bd", "F", "Keratin_I", "Keratin_II"], "synonyms"=>["RP11-124N14.1-008", "VIM"], "id"=>"ENSG00000026025", "uniprotIds"=>["P08670", "Q53HU8", "Q5JVS8", "B3KRK8", "B0YJC4", "B0YJC5", "D3DRU4"], "orthologs"=>[], "interProIds"=>["IPR000533", "IPR001664", "IPR002957", "IPR003054", "IPR006821", "IPR009053", "IPR016044", "IPR018039"], "goIds"=>["GO:0005198", "GO:0005200", "GO:0005515", "GO:0005737", "GO:0005829", "GO:0005882", "GO:0006928", "GO:0008022", "GO:0019901", "GO:0030424", "GO:0044419", "GO:0045103"], "unigeneIds"=>["Hs.628678", "Hs.714268"], "emblIds"=>["AF328728", "AK056766", "AK091813", "AK097336", "AK222482", "AK222507", "AK222602", "AK290643", "AL133415", "BC000163", "BC030573", "BC066956", "CH471072", "CR407690", "EF445046", "M14144", "M18888", "M18889", "M18890", "M18891", "M18892", "M18893", "M18894", "M18895", "M25246", "X16478", "X56134", "Z19554"], "ensemblProteinIds"=>["ENSP00000224237", "ENSP00000391842"]}, "expressions"=>[{"downExperiments"=>2, "efoId"=>"EFO_0000365", "experiments"=>[{"accession"=>"E-MTAB-37", "pvalue"=>0.0, "expression"=>"DOWN"}, {"accession"=>"E-MTAB-62", "pvalue"=>0.0, "expression"=>"DOWN"}], "downPvalue"=>0.0, "efoTerm"=>"colorectal adenocarcinoma", "upExperiments"=>0, "upPvalue"=>0.0, "nonDEExperiments"=>11}]}

2. CSV
A CSV file with a header row is easily transformed into a key-value structure using fastercsv. For example, say you have a CSV file, myfile.csv, where the header and first row look like this:

"ID_REF","MAS_VALUE","MAS_ABS_CALL","VALUE","ABS_CALL","RMA_VALUE"
"AFFX-BioB-5_at",26.5,"P",84.9798,"P",5.166499734
...

And you want this:

{"MAS_ABS_CALL"=>"P", "RMA_VALUE"=>"5.166499734", "ABS_CALL"=>"P", "MAS_VALUE"=>"26.5", "ID_REF"=>"AFFX-BioB-5_at", "VALUE"=>"84.9798"}
...

Here’s one approach:

require 'fastercsv'
require 'mongo'

db   = Mongo::Connection.new.db('test')
col  = db.collection('scratch')

# :headers => true means each row can be converted to a hash keyed by column name
data = FasterCSV.read("myfile.csv", :headers => true)

# save one document per row
data.each do |row|
  h = row.to_hash
  col.save(h)
end

3. XML
I know what you’re thinking: why not just store XML documents as files and use one of the many XML libraries, or even an XML database, to process them? A good point, but let’s assume for now that you have a good reason to convert from XML to a MongoDB document.
For this, I like to use John Nunemaker’s very simple crack gem (which will also parse JSON). Here’s an example using the PDB API, which returns XML:

require 'crack'
require 'open-uri'
require 'mongo'

db   = Mongo::Connection.new.db('test')
col  = db.collection('scratch')

# fetch the description of PDB structure 4HHB as XML
data = open("http://www.pdb.org/pdb/rest/describePDB?structureId=4hhb").read
data = Crack::XML.parse(data)

# note: the hash we save is nested under the first two keys
col.save(data['PDBdescription']['PDB'])
col.find_one
# result
{"organism"=>"Homo sapiens, Homo sapiens", "replaces"=>"1HHB", "revision_date"=>"1984-07-17", "title"=>"THE CRYSTAL STRUCTURE OF HUMAN DEOXYHAEMOGLOBIN AT 1.74 ANGSTROMS RESOLUTION", "nr_residues"=>"574", "structureId"=>"4HHB", "citation_authors"=>"Fermi, G., Perutz, M.F., Shaanan, B., Fourme, R.", "expMethod"=>"X-RAY DIFFRACTION", "publish_date"=>"1984-03-07", "nr_atoms"=>"4779", "status"=>"CURRENT", "pubmedId"=>"6726807", "keywords"=>"OXYGEN TRANSPORT", "nr_entities"=>"4"}

“Gotchas”: issues to look out for
IDs
Unless your hash contains the key “_id”, MongoDB will create an ID for you. This may or may not be what you want. Sometimes, the data contain a key which you can use as an ID. If you use the pure Ruby mongo driver, strings are acceptable as IDs. However, some of the other MongoDB gems require that the ID be a particular type, such as BSON::ObjectId (the default in the absence of “_id”).
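For example, you can reuse a unique field from the data as the ID; a minimal sketch, assuming ID_REF values are unique and col is the “scratch” collection from the examples above:

# use an existing unique field, rather than a generated ObjectId, as the ID
record = { "ID_REF" => "AFFX-BioB-5_at", "VALUE" => "84.9798" }
record["_id"] = record["ID_REF"]
col.save(record)   # a second save with the same _id updates, rather than duplicates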
Keys
Occasionally, your hash will contain a key that MongoDB does not like – for example, one containing a period (“.”) or beginning with a dollar sign (“$”) – and the document will not be saved. In that case, you’ll have to use something like gsub to substitute a valid character (such as an underscore). This can be problematic if you want to relate your keys back to the original data source.
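Here’s a rough sketch of one way to do that, recursing through nested hashes and arrays; the underscore replacement is my choice, not a requirement, and col is again the collection from the examples above:

# replace periods in keys with underscores, at any level of nesting
def sanitize_keys(obj)
  case obj
  when Hash
    obj.inject({}) do |clean, (key, value)|
      clean[key.to_s.gsub(".", "_")] = sanitize_keys(value)
      clean
    end
  when Array
    obj.map { |element| sanitize_keys(element) }
  else
    obj
  end
end

col.save(sanitize_keys({ "GO.terms" => ["axon", "cytoplasm"] }))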
Document size
Don’t forget that MongoDB has a document size limit of 4 MB. If your hash exceeds this, you’ll have to come up with creative ways of splitting the hash into documents that fit under the limit – or use a different document store without this limitation. In general, MongoDB performs better with small, simple documents containing a few key-value pairs than with larger, more complex nested document structures.
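One rough approach, sketched below, is to chunk an oversized array field into several smaller documents that share a common key; huge_array and the field names here are invented for illustration:

# split a document containing a huge array into linked chunk documents
doc = { "gene" => "ENSG00000026025", "experiments" => huge_array }

chunk_number = 0
doc["experiments"].each_slice(1000) do |chunk|
  col.save("gene" => doc["gene"], "chunk" => chunk_number, "experiments" => chunk)
  chunk_number += 1
end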

3 thoughts on “Getting “stuff” into MongoDB”

  1. Jonathan Badger

    Interesting,

    1) fastercsv? Are you still using ruby 1.8? 1.9’s CSV basically is fastercsv (and actually complains when you require fastercsv; somewhat annoyingly for scripts that you want to work on both 1.8 and 1.9)

    2) Are you planning to give examples of what you actually do with MongoDB once you have the data in? Is it just a hash that is saved to disk for permanent use, or can you actually have complex queries? How would that work without a fixed table structure?

    1. nsaunders Post author

      Yes, I’m still (mostly) in 1.8 land. I guess I should make the switch.

      Examples: I did say “store first, ask questions later” :-) but yes, I’ll try to write a post on queries soon. I find that MongoDB allows for some very useful and complex queries; their basic introduction is here. One approach that I like to use a lot is col.find.map {} for pulling all values of a specific key into an array.
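
      For example, something like this pulls every gene name from documents like the one above into an array (a quick sketch; the nil check skips documents without a “gene” key):

      # map over every document in the collection, extracting one nested value
      names = col.find.map { |doc| doc["gene"]["name"] if doc["gene"] }.compact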

  2. Gaurav Kumar

    Just a minor correction:
    when you do not define the key “_id”, it is not MongoDB that creates the value for it. It is the client driver which creates this value.
