Converting a spreadsheet of SMILES: my first OSM contribution

I’ve long admired the work of the Open Source Malaria Project. Unfortunately, time and “day job” constraints prevent me from being as involved as I’d like.

So: I was happy to make a small contribution recently in response to this request for help:

Note – this all works fine under Linux; there seem to be some issues with Open Babel library files under OSX.

First step: make that data usable by rescuing it from the spreadsheet ;) We’ll clean up a column name too.

library(XLConnect)
mmv <- readWorksheetFromFile("TP compounds with solid amounts 14_3_14.xlsx", sheet = "Sheet1")
colnames(mmv)[5] <- "EC50"
head(mmv)


  COMPOUND_ID                                                      Smiles     MW
1   MMV668822 c1[n+](cc2n(c1OCCc1cc(c(cc1)F)F)c(nn2)c1ccc(cc1)OC(F)F)[O-] 434.35                     0.0
2   MMV668823      c1nc(c2n(c1OCCc1cc(c(cc1)F)F)c(nn2)c1ccc(cc1)OC(F)F)Cl 452.79                     0.0
3   MMV668824                        c1ncc2n(c1CCO)c(nn2)c1ccc(cc1)OC(F)F 306.27                    29.6
4   MMV668955                        C1NCc2n(C1CCO)c(nn2)c1ccc(cc1)OC(F)F 310.30                    18.5
5   MMV668956    C1(CN(C1)c1cc(c(cc1)F)F)Oc1cncc2n1c(nn2)c1ccc(cc1)OC(F)F 445.38                   124.2
6   MMV668957          c1ncc2n(c1N1CCC(C1)c1ccccc1)c(nn2)c1ccc(cc1)OC(F)F 407.42                    68.5
   EC50 New.quantity.remaining
1  4.01                      0
2  0.16                      0
3 10.00                     29
4  8.37                     18
5  0.43                    124
6  2.00                     62
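Renaming by position, as above, works for this one file but breaks silently if the sheet's columns are reordered. A minimal sketch of renaming by pattern instead, using a toy data frame; the long header name here is hypothetical, since the original header is not shown in this post:

```r
# Toy stand-in for the imported sheet. The EC50 header below is hypothetical.
toy <- data.frame(
  COMPOUND_ID = c("A1", "A2"),
  Smiles = c("CCO", "c1ccccc1"),
  MW = c(46.07, 78.11),
  Quantity = c(0, 29.6),
  Some.long.EC50.header = c(4.01, 0.16),
  stringsAsFactors = FALSE
)

# Find the column by pattern rather than by position, and stop
# if the match is missing or ambiguous.
idx <- grep("EC50", colnames(toy), ignore.case = TRUE)
stopifnot(length(idx) == 1)
colnames(toy)[idx] <- "EC50"
```

The stopifnot() call means a changed spreadsheet layout produces an error rather than a mislabelled column.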

What OSM would like: an output file in Chemical Markup Language, containing the Compound ID and properties (MW and EC50).

The ChemmineR package makes conversion of SMILES strings to other formats pretty straightforward; we start by converting to Structure Data Format (SDF):


library(ChemmineR)
mmv.sdf <- smiles2sdf(mmv$Smiles)

That will throw a warning, because every molecule in the resulting SDF object has the same CID: an empty string. We set the CIDs from the compound IDs, then use datablock() to attach the properties:

cid(mmv.sdf) <- mmv$COMPOUND_ID
datablock(mmv.sdf) <- data.frame(MW = mmv$MW, EC50 = mmv$EC50)
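Since the CIDs are what distinguish the molecules, it may be worth checking the ID column before assigning it. A small base R sketch; check_ids is a hypothetical helper, not part of ChemmineR:

```r
# Hypothetical helper: validate identifiers before using them as CIDs.
check_ids <- function(ids) {
  ids <- as.character(ids)
  if (anyNA(ids) || any(!nzchar(ids))) {
    stop("missing or empty compound IDs")
  }
  dups <- unique(ids[duplicated(ids)])
  if (length(dups) > 0) {
    stop("duplicated compound IDs: ", paste(dups, collapse = ", "))
  }
  invisible(ids)
}

# The first three IDs from the table above pass the check.
check_ids(c("MMV668822", "MMV668823", "MMV668824"))
```

In the workflow here, this would be called as check_ids(mmv$COMPOUND_ID) before the cid() assignment.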

Now we can write out to an SDF file. We could also use a loop or an apply function to write individual files per molecule.

write.SDF(mmv.sdf, "mmv-all.sdf", cid = TRUE)
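The per-molecule alternative could look something like the sketch below. To keep it self-contained, it uses the sdfsample data bundled with ChemmineR in place of mmv.sdf, and it writes to a temporary directory; both are assumptions for illustration only.

```r
library(ChemmineR)

# Bundled example data stands in for mmv.sdf here.
data(sdfsample)
sdfset <- sdfsample[1:3]
cid(sdfset) <- sdfid(sdfset)  # use the IDs from the SDF headers as CIDs

# Output directory is an arbitrary choice for this sketch.
out_dir <- file.path(tempdir(), "per-molecule")
dir.create(out_dir, showWarnings = FALSE)

# Write one file per molecule, named after its CID.
for (i in seq_len(length(sdfset))) {
  out_file <- file.path(out_dir, paste0(cid(sdfset)[i], ".sdf"))
  write.SDF(sdfset[i], file = out_file, cid = TRUE)
}
```

With the real data, sdfset would simply be mmv.sdf, whose CIDs were already set from the compound IDs.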

It would be nice to do the CML conversion in the same R script too, but for now I just run Open Babel from the command line. Note that the -xp flag is required to include the properties in the CML output:

babel -xp mmv-all.sdf mmv-all.cml
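One way to keep everything in a single R script would be to call the same command through system2(). This is only a sketch: sdf_to_cml is a hypothetical wrapper name, and it assumes the babel executable is on the PATH.

```r
# Hypothetical wrapper around the Open Babel command shown above.
sdf_to_cml <- function(infile, outfile, exe = "babel") {
  # Fail early if the executable cannot be found.
  if (!nzchar(Sys.which(exe))) {
    stop("Open Babel executable not found on PATH: ", exe)
  }
  # Same arguments as the command-line call, including -xp for properties.
  status <- system2(exe, args = c("-xp", shQuote(infile), shQuote(outfile)))
  if (status != 0) {
    stop("Open Babel exited with status ", status)
  }
  invisible(outfile)
}

# Usage, with the file names from this post:
# sdf_to_cml("mmv-all.sdf", "mmv-all.cml")
```

Checking the exit status means a failed conversion stops the script instead of leaving a missing or partial CML file unnoticed.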

That’s it. Here’s my OSMinformatics GitHub repository, and here’s the output.

3 thoughts on “Converting a spreadsheet of SMILES: my first OSM contribution”

  1. billyarberry

    Thoroughly enjoyed your blog of 7/1/14. I’m nearing retirement but have a continuing interest in data science/R. Do you know of other non-profit organizations that need work of this sort? How did you get the attention of Ms Williamson? Thanks

    1. nsaunders Post author

      I know many people in need of statistics, data analysis and visualization, but Mat’s group is the only one I know that openly invites anyone to participate in their science in this way. I’d check out the OSM website as a starting point; when they need help they generally post on Twitter, Facebook and one of their blogs, or at their Github account.

  2. Matthew Todd (@MatToddChem)

    Sensational Neil, thank you. Rescuing the data and making it more understandable/processable has led to us getting the wiki finished:
    Now what we want to do is to try to make sure we’re updating the project’s SD file automatically, but the one you worked on is a huge part of that since it’s all the data we inherited at the start of Series 4. Awesome, thanks again.

