My day out at #osddmalaria

Finally, I get around to telling you that…
…on Friday 24th February, I took a day out from my regular job to attend a meeting on Open Source Drug Discovery for Malaria. I should state straight away that whilst drug discovery and chem(o)informatics are topics that I find very interesting, I have no professional experience or connections in either area. However, it was an opportunity to learn more, listen to some great speakers, think about what bioinformaticians might be able to bring to the table and of course, finally meet Mat Todd in person. Mat, if you don’t know, is one of the few people on the planet who really does science online, as opposed to talking about science online.

Here’s what I learned – with just a little analysis using R later in the post, hence the statistics/R category.

First, there is a great community of people working on drug discovery for neglected diseases. What’s more, they’re doing it open-source style and anyone can contribute. Start here for a summary and some links.

There were several excellent talks during the day. The morning session revolved around general issues (open data, patents, how best to do open science drug discovery). The afternoon session focused in on the chemistry and informatics of some promising drug candidates. I especially enjoyed Richard Jefferson from CAMBIA demonstrating a new version of Patent Lens, with lots of new “Web 2.0” (do we say that anymore?) features.

Mat tells me that the main informatics challenge is this: having screened chemical compounds and identified some that show activity against malaria in whole-cell assays, how can we predict potential biological targets? My (rather weak) response at the time was “that’s difficult, I’d have to think about it.” Having thought about it, I wondered whether one approach would be to search chemical databases for similar compounds and see whether targets are known for any of those. Indeed there is such an approach, since performed brilliantly by Iain Wallace from ChEMBL. You can read a summary here, view more details in the online lab book and look in detail at predictions for compounds from the Malaria Box.
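
As a rough illustration of that idea (and not a description of Iain's actual method), here is a minimal sketch in base R. It assumes compounds have already been encoded as binary fingerprints, compares a query compound with reference compounds of known target using the Tanimoto coefficient, and treats the targets of the closest matches as candidate predictions. The fingerprints, compound names, target labels and the 0.5 cutoff are all invented for the example.

```r
# Tanimoto coefficient between two binary fingerprints:
# bits "on" in both, divided by bits "on" in either.
tanimoto <- function(a, b) {
  both   <- sum(a & b)
  either <- sum(a | b)
  if (either == 0) return(0)
  both / either
}

# Hypothetical reference compounds with known targets (toy data).
reference <- rbind(
  ref_A = c(1, 1, 0, 1, 0, 0, 1, 0),
  ref_B = c(1, 1, 0, 1, 0, 1, 1, 0),
  ref_C = c(0, 0, 1, 0, 1, 1, 0, 1)
)
known_target <- c(ref_A = "target_X", ref_B = "target_X", ref_C = "target_Y")

# Hypothetical query: active in a whole-cell assay, target unknown.
query <- c(1, 1, 0, 1, 0, 0, 0, 0)

# Score the query against every reference compound (one row each).
scores <- apply(reference, 1, tanimoto, b = query)

# Rank reference compounds by similarity and attach their known targets.
hits <- data.frame(
  compound   = names(scores),
  similarity = unname(scores),
  target     = unname(known_target[names(scores)]),
  stringsAsFactors = FALSE
)
hits <- hits[order(-hits$similarity), ]

# Keep only reasonably similar compounds; 0.5 is an arbitrary cutoff.
candidates <- hits[hits$similarity >= 0.5, ]
print(candidates)
```

In practice the fingerprints would come from a cheminformatics toolkit and the reference set from a database such as ChEMBL; the ranking step would look much the same.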

The Malaria Box: 400 compounds, selected from a larger set of ~20 000, exhibiting antimalarial activity and freely available to anyone on request. Let’s use R to examine the larger public data set, made available as an Excel file:

library(xlsx)
library(ggplot2)
# read malaria box xlsx file
mbox <- read.xlsx("data/Dataset.xlsx", 1)
names(mbox)
# [1] "Index"              "ChEMBL_NTD_ID"      "Source"            
# [4] "Activity..EC50.uM." "Canonical_Smiles"   "BATCH_NO"

# compounds by source
png("mbox_sources.png", width = 640, height = 480)
print(ggplot(mbox) + geom_bar(aes(x = Source, fill = Source)) + ggtitle("Compounds by Source"))
dev.off()

# activity distribution
png("mbox_activities.png", width = 640, height = 480)
print(ggplot(mbox) + geom_density(aes(x = Activity..EC50.uM., color = Source)) + ggtitle("Distribution of Activity by Source"))
dev.off()
The plots are shown below.
The density plot of activities is rather interesting, with many compounds lying under two closely-spaced peaks at low EC50 values. Can we assume that these compounds are in some way “similar”, structurally and/or chemically speaking?
Figure: Malaria Box Compounds By Source (mbox_sources.png)

Figure: Malaria Box Compounds Activity Distribution (mbox_activities.png)

Another interesting and new (for me) discovery is the ChemmineR package from R/Bioconductor. Here, for example, is how we might use the SMILES data from the larger dataset to search PubChem for similar compounds:

library(ChemmineR)
mbox$Canonical_Smiles[1]
# [1] CCOc1ccc(CN2CCN(CC(O)COc3ccc(OCC(O)CN4CCN(Cc5ccc(OCC)cc5)CC4)cc3)CC2)cc1
# search using first compound
pc <- searchString(as.character(mbox$Canonical_Smiles[1]))
summary(pc)
# Length  Class   Mode 
#      3 SDFset     S4

# molecular weights of the 3 hits
MW(pc)
#     CMP1     CMP2     CMP3 
# 662.8586 574.7534 602.8066

# names
datablocktag(pc, tag = "PUBCHEM_IUPAC_NAME")
# [1] "1-[4-[(4-ethoxyphenyl)methyl]piperazin-1-yl]-3-[4-[3-[4-[(4-ethoxyphenyl)methyl]piperazin-1-yl]-2-hydroxypropoxy]phenoxy]propan-2-ol"
# [2] "1-(4-benzylpiperazin-1-yl)-3-[4-[3-(4-benzylpiperazin-1-yl)-2-hydroxypropoxy]phenoxy]propan-2-ol"                                    
# [3] "1-[4-[2-hydroxy-3-[4-[(4-methylphenyl)methyl]piperazin-1-yl]propoxy]phenoxy]-3-[4-[(4-methylphenyl)methyl]piperazin-1-yl]propan-2-ol"

ChemmineR does a whole lot more including analyses (such as clustering), visualization and submission to online tools. See the comprehensive and clear manual for details.
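
To give a feel for what clustering by structural similarity involves, here is a package-free sketch in base R; it does not use ChemmineR's own clustering functions. It assumes compounds have already been turned into binary fingerprints, and the six fingerprints below are invented. R's "binary" distance is one minus the Tanimoto coefficient, so it can be passed straight to hierarchical clustering. The cut height of 0.7 is an arbitrary choice.

```r
# Toy binary fingerprints, one row per (hypothetical) compound.
fp <- rbind(
  cmp1 = c(1, 1, 1, 0, 0, 0, 0, 0),
  cmp2 = c(1, 1, 1, 1, 0, 0, 0, 0),
  cmp3 = c(1, 1, 0, 1, 0, 0, 0, 0),
  cmp4 = c(0, 0, 0, 0, 1, 1, 1, 0),
  cmp5 = c(0, 0, 0, 0, 1, 1, 1, 1),
  cmp6 = c(0, 0, 0, 0, 0, 1, 1, 1)
)

# "binary" distance = 1 - Tanimoto similarity for 0/1 vectors.
d <- dist(fp, method = "binary")

# Average-linkage hierarchical clustering on those distances.
hc <- hclust(d, method = "average")

# Cut the tree: compounds joined below a distance of 0.7 share a cluster.
clusters <- cutree(hc, h = 0.7)

# List the members of each cluster.
print(split(names(clusters), clusters))
```

With real data, the interesting question is whether the compounds under the two low-EC50 peaks fall into the same few clusters.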

Assuming that online molecular databases are updated quite frequently, I can see a need for automation and data integration. We might, for example, want to: (1) run similarity searches against public databases using the Malaria Box compounds at regular intervals; (2) cluster the results; (3) parse, summarise and store the results in a database; (4) add a simple web front-end to query and visualize the current status. It might also be interesting to monitor and integrate data from other public sources. Transcriptomics? NCBI GEO currently returns 553 datasets for the simple query “Plasmodium”. How many of those look at the effect of drug candidates? If any do: (1) on what biological pathways do the candidates operate; (2) how easy is it to cross-reference compounds from GEO experiments to other compounds of interest?
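
The search-and-store part of that wish list might be sketched as follows, in base R only. Everything here is a stand-in: fake_search is a hypothetical function that returns made-up hit identifiers where a real database query would go, a CSV file in a temporary directory plays the role of the database, and the two SMILES strings are toy inputs, not compounds from the dataset. Scheduling (cron or similar) and the web front-end are left out.

```r
# Hypothetical stand-in for a real similarity search.
# Returns a fixed, made-up set of hit identifiers for a SMILES string.
fake_search <- function(smiles) {
  n_hits <- nchar(smiles) %% 3 + 1
  paste0("HIT_", substr(smiles, 1, 3), "_", seq_len(n_hits))
}

# Run the search for every compound and record the run date.
run_searches <- function(smiles, search_fun, run_date = Sys.Date()) {
  results <- lapply(smiles, function(s) {
    data.frame(query = s, hit = search_fun(s),
               run_date = as.character(run_date),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, results)
}

# Append only hits not seen in earlier runs, so the store
# records when each query/hit pair first appeared.
update_store <- function(new_results, store_file) {
  if (file.exists(store_file)) {
    old  <- read.csv(store_file, stringsAsFactors = FALSE)
    seen <- paste(old$query, old$hit)
    is_new <- !(paste(new_results$query, new_results$hit) %in% seen)
    combined <- rbind(old, new_results[is_new, ])
  } else {
    combined <- new_results
  }
  write.csv(combined, store_file, row.names = FALSE)
  combined
}

store_file <- file.path(tempdir(), "mbox_hits.csv")
unlink(store_file)                  # start the demo with an empty store
compounds <- c("CCO", "c1ccccc1")   # toy SMILES strings

# Two consecutive runs: the second finds nothing new to add.
first  <- update_store(run_searches(compounds, fake_search), store_file)
second <- update_store(run_searches(compounds, fake_search), store_file)

# Hits per query compound: the kind of summary a front-end could show.
print(table(second$query))
```

Swapping the CSV for a proper database and fake_search for a real query would not change the overall shape: search, compare with what is stored, append what is new.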

Summary
This is interesting, fun stuff with a lot of potential for involvement by absolutely anyone with an interest in the project and skills to offer. To what extent time and the conditions of my employment enable me to contribute remains to be seen, but I’ll certainly be following along with interest.