Variants + Spark = VariantSpark

Just a short note to alert you to a publication with my name on it. Great work by lead author and former colleague Aidan; I just did “the Gephi stuff”. If you’re interested in bioinformatics applications of Apache Spark, take a look at:

VariantSpark: population scale clustering of genotype information

Happy to report it is open access.

2 thoughts on “Variants + Spark = VariantSpark”

  1. Very nice work Neil. I’m very interested in in-memory computing right now. Nice to see a real application of it and glad to see some excellent results.

    I was wondering if you had experimented with VM instances that have a higher ratio of memory to CPU? At the ratios in the paper, was it CPU availability or memory availability that was the bottleneck?

    • Good questions. So far as I know, the VM spec in the paper was the only one used. And I’d guess that CPU was more of a bottleneck, as I don’t believe that the data used (chromosome 22 variants) is huge. But you should definitely check with the lead or corresponding authors to be sure.
