Just a short note to alert you to a publication with my name on it. Great work by lead author and former colleague Aidan; I just did “the Gephi stuff”. If you’re interested in bioinformatics applications of Apache Spark, take a look at:
VariantSpark: population scale clustering of genotype information
Happy to report it is open access.
Very nice work, Neil. I’m very interested in in-memory computing right now. Nice to see a real application of it and glad to see some excellent results.
I was wondering if you had experimented with VM instances that have a higher ratio of memory to CPU? At the ratios in the paper, was it CPU availability or memory availability that was the bottleneck?
Good questions. So far as I know, the VM spec in the paper was the only one used. And I’d guess that CPU was more of a bottleneck, as I don’t believe that the data used (chromosome 22 variants) is huge. But you should definitely check with the lead or corresponding authors to be sure.