Join me on a failed bioinformatics experiment. It’s a bit of fun.
First, we need to grab ourselves some Neanderthal DNA sequences. We’ll get these via the NCBI Entrez database, starting in the Taxonomy Browser. A new feature here is a page devoted to extinct organisms. Look down the page until you find the link for Neanderthal man.
If you follow that link and look over to the top right of the page, you’ll see links to other Entrez records – PubMed, proteins and what we want – nucleotide sequence. Click the number (currently 1335) to get to the nucleotide records. Now choose “Fasta” in the “Display” drop-down list, wait for the page to load and choose “File” in the “Send to” drop-down list. You can now download those 1335 records with the default name “sequences.fasta”.
Alright. Now we need the MCPH1 gene. This time, start from the nucleotide page and try “MCPH1” as a query. Scroll down past all those irrelevant breast cancer genes until you see NM_024596 – Homo sapiens microcephaly, primary autosomal recessive 1 (MCPH1), mRNA. Use the same procedure as before to save that record as a fasta file, called “mcph1.fasta”.
OK. Looking through the Neanderthal DNA, we see a lot of very short sequence and 1335 sequences really isn’t a lot of data. We probably want to do a local alignment here. I used the EMBOSS program water. It’s as simple as typing “water” at the prompt, entering “mcph1.fasta” for sequence 1, “sequences.fasta” for sequence 2, accepting the defaults and waiting a while. The default output file is named “nm_024596.water”.
We could spend some time writing a parser to extract the best hits from this file, but if you’re in a hurry, you can get quick answers with nothing more than shell commands. First, try this:
grep Identity nm_24596.water > id
That pulls out all of the lines that look like “# Identity: 199/543 (36.6%)” and dumps them in a file named “id”. You can now open “id” in a simple editor such as nano and batch-remove “(” and “%)”.
Next we’d like to sort the hits by percent identity, highest to lowest:
sort -r -n +3 id | less
Up at the top, we see a couple of 100 percent identities, spanning 11 bp. Back to the water output file to see to which sequences these correspond:
grep -A 9 11/11 nm_024596.water # Identity: 11/11 (100.0%) # Similarity: 11/11 (100.0%) # Gaps: 0/11 ( 0.0%) # Score: 55.0 # # #======================================= NM_024596.1 2097 CATCGTCATCG 2107 ||||||||||| DX935970 10 CATCGTCATCG 20 -- # Identity: 11/11 (100.0%) # Similarity: 11/11 (100.0%) # Gaps: 0/11 ( 0.0%) # Score: 55.0 # # #======================================= NM_024596.1 1061 AGAAGGAGAGA 1071 ||||||||||| DX935508 14 AGAAGGAGAGA 24
Back in “sequences.fasta”, we see that these sequences are rather short (though longer than our local alignment) and uninformative. It’s looking unlikely that we’ll find a fragment of MCPH1 in there.
Ah well, it was worth a try. Now we wait for more Neanderthal DNA sequence.