<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>What You're Doing Is Rather Desperate</title>
	<atom:link href="http://nsaunders.wordpress.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://nsaunders.wordpress.com</link>
	<description>Notes from the life of a bioinformatics researcher</description>
	<lastBuildDate>Fri, 27 Jan 2012 22:17:02 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='nsaunders.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>What You're Doing Is Rather Desperate</title>
		<link>http://nsaunders.wordpress.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://nsaunders.wordpress.com/osd.xml" title="What You&#039;re Doing Is Rather Desperate" />
	<atom:link rel='hub' href='http://nsaunders.wordpress.com/?pushpress=hub'/>
		<item>
		<title>Reproducible research: three links that made me think</title>
		<link>http://nsaunders.wordpress.com/2012/01/27/reproducible-research-three-links-that-made-me-think/</link>
		<comments>http://nsaunders.wordpress.com/2012/01/27/reproducible-research-three-links-that-made-me-think/#comments</comments>
		<pubDate>Fri, 27 Jan 2012 01:20:45 +0000</pubDate>
		<dc:creator>nsaunders</dc:creator>
				<category><![CDATA[bioinformatics]]></category>
		<category><![CDATA[publications]]></category>
		<category><![CDATA[statistics]]></category>
		<category><![CDATA[reproducibility]]></category>
		<category><![CDATA[retraction]]></category>

		<guid isPermaLink="false">http://nsaunders.wordpress.com/?p=2983</guid>
		<description><![CDATA[I&#8217;m constantly amazed, bemused and troubled by how little published scientific research is genuinely reproducible, in that you or I (or even the original authors) could go back and check the results. Three examples from around the Web converged in my mind this week. Software availability A BioStar user asks: where is the software for [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m constantly amazed, bemused and troubled by how little published scientific research is genuinely reproducible, in that you or I (or even the original authors) could go back and check the results. Three examples from around the Web converged in my mind this week.<br />
<span id="more-2983"></span></p>
<p><strong>Software availability</strong><br />
A BioStar user <a href="http://biostar.stackexchange.com/questions/16693/composite-of-multiple-signals-where-have-you-gone" target="_blank">asks</a>: where is the software for the method described in a <a href="http://www.sciencemag.org/content/327/5967/883.abstract" target="_blank">Science article</a>, &#8220;A Composite of Multiple Signals Distinguishes Causal Variants in Regions of Positive Selection.&#8221;</p>
<p>No-one can find it on the Web; the best we can do is a press release from 2010 stating that the software &#8220;should soon be available.&#8221;  Why do the so-called flagship journals and their reviewers have so little interest in the methods used to generate data shown in papers? It&#8217;s beyond my comprehension.</p>
<p><strong>Retractions on retractions</strong><br />
I was reviewing PubMed retraction data for 2011 (the &#8220;year of retractions&#8221;) using my <a href="http://pmretract.heroku.com/" target="_blank">PMRetract</a> application, when I noticed:</p>
<p><a href="http://www.plosone.org/annotation/info%3Adoi%2F10.1371%2Fannotation%2F8f94e479-4161-43a0-a28c-4c0460bb89a4" target="_blank">Retraction: An Integrated Approach to the Prediction of Chemotherapeutic Response in Patients with Breast Cancer</a></p>
<p>It caught my eye for a couple of reasons. First, the authors decided to retract because they based their analyses on methods described in &#8211; <a href="http://www.nature.com/nm/journal/v12/n11/full/nm1491.html" target="_blank">another retracted article</a>. Second, although the authors of that article don&#8217;t mention it in the <a href="http://www.nature.com/nm/journal/v17/n1/full/nm0111-135.html" target="_blank">retraction notice</a>, their retraction is the result of a reproducibility study in one of my all-time <a href="http://projecteuclid.org/DPubS?service=UI&amp;version=1.0&amp;verb=Display&amp;handle=euclid.aoas/1267453942" target="_blank">favourite articles</a>: &#8220;Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology.&#8221;</p>
<p>I note that the <em>PLoS ONE</em> retraction occurred about 8 months after the <em>Nat. Med.</em> retraction, which in turn occurred over 4 years after the original publication and just over a year after the <em>Ann. Appl. Stat.</em> re-analysis. This suggests to me that the <em>PLoS ONE</em> authors were alerted by the <em>Nat. Med.</em> retraction, not the excellent Baggerly and Coombes article. If only the latter article had been published in <em>Nat. Med.</em>, as opposed to a less widely-read applied statistics journal.</p>
<p>Regardless, credit to the PLoS ONE authors for a courageous retraction based on the flawed work of others.</p>
<p><strong>Org-mode published</strong><br />
I was somewhat surprised (and pleased) to see that <a href="http://orgmode.org/" target="_blank">org-mode</a>, a tool for Emacs aimed at generating reproducible code, data analysis and documentation, has been written up and <a href="http://www.jstatsoft.org/v46/i03/paper" target="_blank">published</a> in the <em>Journal of Statistical Software</em>.</p>
<p>And then cynicism set in.  Is software that life scientists don&#8217;t use, published in a journal that they don&#8217;t read, really going to address the problem?</p>
<p>Perhaps one day, all journals will demand appropriate standards for reproducible research. Until then I&#8217;ll just have to continue feeling amazed, bemused and troubled.</p>
<br />Filed under: <a href='http://nsaunders.wordpress.com/category/bioinformatics/'>bioinformatics</a>, <a href='http://nsaunders.wordpress.com/category/publications/'>publications</a>, <a href='http://nsaunders.wordpress.com/category/statistics/'>statistics</a> Tagged: <a href='http://nsaunders.wordpress.com/tag/reproducibility/'>reproducibility</a>, <a href='http://nsaunders.wordpress.com/tag/retraction/'>retraction</a>]]></content:encoded>
			<wfw:commentRss>http://nsaunders.wordpress.com/2012/01/27/reproducible-research-three-links-that-made-me-think/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/e41743fa8aee7f5c7d1cd7ebfa77da85?s=96&#38;d=http%3A%2F%2F0.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96" medium="image">
			<media:title type="html">nsaunders</media:title>
		</media:content>
	</item>
		<item>
		<title>2011 blog stats courtesy of WordPress.com</title>
		<link>http://nsaunders.wordpress.com/2012/01/01/2011-blog-stats-courtesy-of-wordpress-com/</link>
		<comments>http://nsaunders.wordpress.com/2012/01/01/2011-blog-stats-courtesy-of-wordpress-com/#comments</comments>
		<pubDate>Sun, 01 Jan 2012 00:14:16 +0000</pubDate>
		<dc:creator>nsaunders</dc:creator>
				<category><![CDATA[this blog]]></category>
		<category><![CDATA[2011]]></category>
		<category><![CDATA[summary]]></category>
		<category><![CDATA[wordpress.com]]></category>

		<guid isPermaLink="false">http://nsaunders.wordpress.com/?p=2977</guid>
		<description><![CDATA[The kind people at WordPress.com have prepared a 2011 annual report for this blog. Click here to see the complete report. Filed under: this blog Tagged: 2011, summary, wordpress.com]]></description>
			<content:encoded><![CDATA[<p>
The kind people at WordPress.com have prepared a 2011 annual report for this blog.
</p>
<p>
  <a href="/2011/annual-report/"><br />
    Click here to see the complete report.<br />
  </a></p>
<br />Filed under: <a href='http://nsaunders.wordpress.com/category/this-blog/'>this blog</a> Tagged: <a href='http://nsaunders.wordpress.com/tag/2011/'>2011</a>, <a href='http://nsaunders.wordpress.com/tag/summary/'>summary</a>, <a href='http://nsaunders.wordpress.com/tag/wordpresscom/'>wordpress.com</a>]]></content:encoded>
			<wfw:commentRss>http://nsaunders.wordpress.com/2012/01/01/2011-blog-stats-courtesy-of-wordpress-com/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/e41743fa8aee7f5c7d1cd7ebfa77da85?s=96&#38;d=http%3A%2F%2F0.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96" medium="image">
			<media:title type="html">nsaunders</media:title>
		</media:content>
	</item>
		<item>
		<title>Sequencing for relics from the Sanger era part 1: getting the raw data</title>
		<link>http://nsaunders.wordpress.com/2011/12/22/sequencing-for-relics-from-the-sanger-era-part-1-getting-the-raw-data/</link>
		<comments>http://nsaunders.wordpress.com/2011/12/22/sequencing-for-relics-from-the-sanger-era-part-1-getting-the-raw-data/#comments</comments>
		<pubDate>Thu, 22 Dec 2011 04:28:44 +0000</pubDate>
		<dc:creator>nsaunders</dc:creator>
				<category><![CDATA[bioinformatics]]></category>
		<category><![CDATA[research diary]]></category>
		<category><![CDATA[how to]]></category>
		<category><![CDATA[next-generation]]></category>
		<category><![CDATA[ngs]]></category>
		<category><![CDATA[sequencing]]></category>
		<category><![CDATA[tutorial]]></category>

		<guid isPermaLink="false">http://nsaunders.wordpress.com/?p=2942</guid>
		<description><![CDATA[In another life, way back in the mists of time, I did a Ph.D. Part of my project was to sequence a gene from a bacterium, which encoded an enzyme involved in nitrate metabolism. It took the best part of a year to obtain ~ 2 000 bp of DNA sequence: partly because I was [...]]]></description>
			<content:encoded><![CDATA[<p><div id="attachment_2943" class="wp-caption alignleft" style="width: 299px"><a href="http://nsaunders.files.wordpress.com/2011/12/20122011432.jpg"><img src="http://nsaunders.files.wordpress.com/2011/12/20122011432.jpg?w=289&#038;h=300" alt="" title="20122011432" width="289" height="300" class="size-medium wp-image-2943" /></a><p class="wp-caption-text">Sequencing in the good old days</p></div>In another life, way back in the mists of time, I <a href="http://www.citeulike.org/user/neils/article/6616450" target="_blank">did a Ph.D.</a>  Part of my project was to sequence a gene from a bacterium, which encoded an enzyme involved in nitrate metabolism.  It took the best part of a year to obtain ~ 2 000 bp of DNA sequence:  partly because I was rubbish at sequencing, but also because of the technology at the time.  It was an elegant biochemical technique called the <a href="http://en.wikipedia.org/wiki/DNA_sequencing#Chain-termination_methods" target="_blank">dideoxy chain termination method</a>, or &#8220;Sanger sequencing&#8221; after its inventor.  Sequence was visualized by exposing radioactively-labelled DNA to X-ray film, resulting in images like the one at left, from my thesis.  Yes, that photograph is <em>glued</em> in place.  The sequence was read manually, by placing the developed film on a light box, moving a ruler and writing down the bases.</p>
<p>By the time I started my first postdoc, technology had moved on a little.  We still did Sanger sequencing but the radioactive label had been replaced with coloured dyes and a laser scanner, which allowed automated reading of the sequence.  During my second postdoc, this same technology was being applied to the shotgun sequencing of complete bacterial genomes.  Assembling the sequence reads into contigs was pretty straightforward: there were a few software packages around, but most people used <a href="http://www.phrap.org/phredphrapconsed.html" target="_blank">a pipeline</a> of Phred (to call base qualities), Phrap (to assemble the reads) and Consed (for manual editing and gap-filling strategy).</p>
<p>The last time I worked directly on a project with sequencing data was around 2005.  Jump forward 5 years to the <a href="http://biostar.stackexchange.com/" target="_blank">BioStar bioinformatics Q&amp;A forum</a> and you&#8217;ll find <a href="http://biostar.stackexchange.com/questions/tagged/next-gen-sequencing" target="_blank">many questions</a> related to sequencing.  But not sequencing as I knew it.  No, this is so-called next-generation sequencing, or NGS.  Suddenly, I realised that I am no longer a sequencing expert. In fact:</p>
<blockquote><p>I am a relic from the Sanger era</p></blockquote>
<p>I resolved to improve this state of affairs. There is plenty of publicly-available NGS data, some of it relevant to my current work, and my organisation is predicting a move away from microarrays and towards NGS in 2012.  So I figured: what better way to educate myself than to blog about it as I go along?</p>
<p>This is part 1 of a 4-part series and in this installment, we&#8217;ll look at how to get hold of public NGS data.<br />
<span id="more-2942"></span><br />
For these blog posts, I thought we&#8217;d stay with the theme of <a href="https://twitter.com/#!/search/%23arseniclife" target="_blank">#arseniclife</a>.  A draft genome sequence for the organism in question, <em>Halomonas</em> sp. GFAJ-1, was recently assembled from NGS data and made publicly available.</p>
<p>My starting point is <a href="http://www.ncbi.nlm.nih.gov/sra/SRX109792" target="_blank">this page</a> at the NCBI Sequence Read Archive (SRA), a database saved from closure this year when the curators secured extra funding.  There&#8217;s a lot of useful information on this page, some of which is hidden under &#8220;more&#8221; links.</p>
<p>First, details of the DNA library:<br />
<pre class="brush: plain;">
Strategy:  WGS
Source:    GENOMIC
Selection: RANDOM
Layout:    PAIRED, Orientation: , Nominal length: 150, Nominal Std Dev: 30
</pre></p>
<p>Second, details of the sequencing platform:</p>
<p><pre class="brush: plain;">
Instrument model: Illumina HiSeq 2000
Spot descriptor:  -&gt; 1  forward 102  reverse &lt;-
Total:            1 run, 65.1M spots, 13.2G bases
    Download reads for this experiment in sra(7.0G) or sra-lite(7.0G) formats 
#       Run          # of Spots    # of Bases
1.      SRR385952    65,107,412    13.2G
</pre></p>
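<p>As a rough consistency check on those totals, here is a small sketch of the arithmetic. It assumes 202 bases per spot: the spot descriptor puts the forward read at position 1 and the reverse read at position 102, and I&#8217;m assuming that the reverse read is the same length as the forward one.</p>
<p><pre class="brush: bash;">
# Assumption: 202 bases per spot (2 reads of 101 bases each)
spots=65107412
bases_per_spot=202
total_bases=$((spots * bases_per_spot))
echo $total_bases
</pre></p>
<p>That gives 13151697224, or about 13.2G bases, in line with the total reported on the SRA page.</p>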
<p>Data are available in 2 formats: sra and sra-lite.  There isn&#8217;t much to choose between them, except that sra contains some extra information such as intensity scores.  Note that both are quite large files: after processing they&#8217;ll become even larger.  So the first rule of NGS work: you need powerful hardware with plenty of storage and memory.  I&#8217;m currently working on a shared server with access to 2 TB of storage, 96 GB RAM and 24 CPUs &#8211; and it&#8217;s just about enough.</p>
<p>There are software packages which handle SRA format, but most people convert to a widely-used NGS format called FASTQ.  To do this we use the <a href="http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=show&amp;f=software&amp;m=software&amp;s=software" target="_blank">SRA toolkit</a> provided by NCBI.  There&#8217;s no pre-compiled 64-bit version, but it was easy enough to compile on my machine, just by following the instructions in README-build:</p>
<p><pre class="brush: bash;">
cd
wget http://trace.ncbi.nlm.nih.gov/Traces/sra/static/sra_sdk-2.1.8.tar.gz
tar zxvf sra_sdk-2.1.8.tar.gz
cd sra_sdk-2.1.8
OUTDIR=/home/neil/sra
make OUTDIR=&quot;$OUTDIR&quot; out
make static
make GCC debug
make
</pre></p>
<p>This provides the program ~/sra/bin64/fastq-dump.  We get the SRA file from the NCBI FTP site:</p>
<p><pre class="brush: bash;">
mkdir -p ~/projects/gfaj1
cd ~/projects/gfaj1
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX109/SRX109792/SRR385952/SRR385952.sra
</pre></p>
<p>Now the fun begins.  First, <strong><em>don&#8217;t do this</em></strong>:</p>
<p><pre class="brush: bash;">
~/sra/bin64/fastq-dump SRR385952.sra
</pre></p>
<p>If you do and then examine the resulting file SRR385952.fastq, you&#8217;ll see sequences with length = 202.  However, you&#8217;ll see from the sequencing platform details earlier that there should be forward and reverse sequences, each with length = 101.  These are joined in the SRA file and need to be split, which you can do in one of two ways:</p>
<p><pre class="brush: bash;">
# 1 file
~/sra/bin64/fastq-dump --split-spot SRR385952.sra
# 2 files
~/sra/bin64/fastq-dump --split-files SRR385952.sra
</pre></p>
<p>Note: the documentation for --split-files is confusing. It reads &#8220;Dump each read into separate file.&#8221; What that actually means is: dump into forward and reverse files.  Note also: very poor documentation is a hallmark of NGS software; we&#8217;ll see plenty more of it later.</p>
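<p>Either way, it&#8217;s worth checking read lengths after the dump. A minimal sketch using <em>awk</em>; the function name fastq_lengths is my own, and it assumes that every record occupies exactly 4 lines:</p>
<p><pre class="brush: bash;">
# Tabulate sequence lengths from FASTQ on standard input
# Assumes 4 lines per record, so sequences sit on lines 2, 6, 10...
fastq_lengths() {
  awk 'NR % 4 == 2 { n[length($0)]++ } END { for (len in n) print len, n[len] }'
}

# Demo on a tiny made-up record; for real data, something like:
#   head -4000 SRR385952.fastq | fastq_lengths
printf '@r1\nACGT\n+\nIIII\n' | fastq_lengths
</pre></p>
<p>The demo prints one line per distinct length, followed by the number of reads of that length (here: 4 1).  On an unsplit dump you would expect to see 202; after splitting, 101.</p>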
<p>I ran fastq-dump with the --split-spot option, which generates a single file with the reverse read of each pair below the forward read for that pair.  Whichever way you do it, there&#8217;s a problem:</p>
<p><pre class="brush: bash;">
head -8 SRR385952.fastq
</pre></p>
<p><pre class="brush: plain;">
@SRR385952.1 HWI-ST484_0123:8:1101:1154:2066 length=101
ATNCGTCCTTTATTCTGCCAGGGAATTGGCCCGTTCCGCTGGGCAGCGCTTTCCGGCGACCCGGAAGATATTTACAAAACCGACCAGAAGGTCAAAGAGCT
+SRR385952.1 HWI-ST484_0123:8:1101:1154:2066 length=101
BC#4ADDFHHHHHJHIJJJJJJJIJJIJJJJJJJJHGGHIJJJJIJJJIJJHHHHFEDDDDDDDDDDDDDDEECDCCCBDBD&gt;DDDDDDDDCCCCCDDDDA
@SRR385952.1 HWI-ST484_0123:8:1101:1154:2066 length=101
GGTCGTCGGGGATCAGCTCTTTGACCTTCTGGTCGGTTTTGTAAATATCTTCCGGGTCGCCGGAAAGCGCTGCCCAGCGGAACGGGCCAATTCCCTGGCAG
+SRR385952.1 HWI-ST484_0123:8:1101:1154:2066 length=101
CCBFFFFFHHHHHJJJJJJJJJJJJIIJJJJIEHIJHIJJIHHIJJIIGIIIHHHFDDDDDDDDDDDDDDBDDDDDDDDDBDDDDDDDDBDDCDDDDBDDD
</pre></p>
<p>Each sequence in a <a href="http://en.wikipedia.org/wiki/FASTQ_format" target="_blank">FASTQ file</a> is described using 4 lines: a header, the sequence, a header for quality scores and the scores themselves.  So by using <em>head -8</em> we see the first pair of forward + reverse reads.  Unfortunately, they have the same ID in the header line.</p>
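<p>A quick way to confirm that, sketched with <em>awk</em> and <em>uniq</em> (the function name fastq_ids is hypothetical, and 4-line records are assumed):</p>
<p><pre class="brush: bash;">
# Print the ID (first field) of every header line in FASTQ on standard input
fastq_ids() {
  awk 'NR % 4 == 1 { print $1 }'
}

# Demo on two made-up records sharing one ID; for real data, something like:
#   head -8 SRR385952.fastq | fastq_ids | uniq -c
printf '@x.1 a\nAC\n+x.1 a\nII\n@x.1 a\nGT\n+x.1 a\nII\n' | fastq_ids | uniq -c
</pre></p>
<p>A count greater than 1 next to an ID means that the forward and reverse reads can&#8217;t be told apart by ID alone.</p>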
<p>One convention for identifying paired reads is to suffix the ID using either /1 or /2.  We can do that using <em>sed</em>:</p>
<p><pre class="brush: bash;">
sed -i '1~8s/^@SRR385952\.[0-9]*/&amp;\/1/' SRR385952.fastq
sed -i '5~8s/^@SRR385952\.[0-9]*/&amp;\/2/' SRR385952.fastq
</pre></p>
<p>To explain that: headers for forward reads are on lines 1, 9, 17&#8230; and those for reverse on lines 5, 13, 21&#8230;  So using <em>sed</em>, we append &#8220;/1&#8221; to the ID on lines 1, 9, 17&#8230; and &#8220;/2&#8221; to the ID on lines 5, 13, 21&#8230;  Result:</p>
<p><pre class="brush: bash;">
head -8 SRR385952.fastq
</pre></p>
<p><pre class="brush: plain;">
@SRR385952.1/1 HWI-ST484_0123:8:1101:1154:2066 length=101
ATNCGTCCTTTATTCTGCCAGGGAATTGGCCCGTTCCGCTGGGCAGCGCTTTCCGGCGACCCGGAAGATATTTACAAAACCGACCAGAAGGTCAAAGAGCT
+SRR385952.1 HWI-ST484_0123:8:1101:1154:2066 length=101
BC#4ADDFHHHHHJHIJJJJJJJIJJIJJJJJJJJHGGHIJJJJIJJJIJJHHHHFEDDDDDDDDDDDDDDEECDCCCBDBD&gt;DDDDDDDDCCCCCDDDDA
@SRR385952.1/2 HWI-ST484_0123:8:1101:1154:2066 length=101
GGTCGTCGGGGATCAGCTCTTTGACCTTCTGGTCGGTTTTGTAAATATCTTCCGGGTCGCCGGAAAGCGCTGCCCAGCGGAACGGGCCAATTCCCTGGCAG
+SRR385952.1 HWI-ST484_0123:8:1101:1154:2066 length=101
CCBFFFFFHHHHHJJJJJJJJJJJJIIJJJJIEHIJHIJJIHHIJJIIGIIIHHHFDDDDDDDDDDDDDDBDDDDDDDDDBDDDDDDDDBDDCDDDDBDDD
</pre></p>
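<p>To check the whole file rather than just the first pair, here&#8217;s a sketch in <em>awk</em>.  The function name check_pairs is my own invention; it assumes the interleaved layout shown above, with 8 lines per pair.</p>
<p><pre class="brush: bash;">
# Count pairs whose headers are not ID/1 followed by the same ID/2
# Reads interleaved FASTQ on standard input: 8 lines per pair
check_pairs() {
  awk '
    NR % 8 == 1 { fwd = $1 }
    NR % 8 == 5 {
      rev = $1
      f = fwd; sub(/\/1$/, "", f)   # strip the /1 suffix, if present
      r = rev; sub(/\/2$/, "", r)   # strip the /2 suffix, if present
      # bad if a suffix is missing or the two IDs differ
      if (f == fwd || r == rev || f != r) bad++
    }
    END { print bad + 0 }
  '
}

# Demo on one made-up pair; for real data, something like:
#   cat SRR385952.fastq | check_pairs
printf '@x.1/1 a\nAC\n+\nII\n@x.1/2 a\nGT\n+\nII\n' | check_pairs
</pre></p>
<p>The function prints the number of badly-labelled pairs, so 0 is the answer you want.</p>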
<p>There&#8217;s one more thing to note about fastq-dump.  By default, the quality score offset is 33.  What does that mean?  Let&#8217;s look at the first five characters in the quality line for the first sequence:</p>
<p><pre class="brush: plain;">
character    ASCII code (decimal)    ASCII code - 33
B            66                      33
C            67                      34
#            35                       2
4            52                      19
A            65                      32
</pre></p>
<table>
<tr>
<td valign="top">
<div id="attachment_2958" class="wp-caption alignleft" style="width: 310px"><a href="http://nsaunders.files.wordpress.com/2011/12/sra-qual.png"><img src="http://nsaunders.files.wordpress.com/2011/12/sra-qual.png?w=300&#038;h=194" alt="" title="sra-qual" width="300" height="194" class="size-medium wp-image-2958" /></a><p class="wp-caption-text">SRA sequence view with quality scores</p></div>
</td>
<td valign="top">
So to get the quality score, just subtract 33 from the <a href="http://en.wikipedia.org/wiki/ASCII#ASCII_printable_characters" target="_blank">decimal ASCII code</a> of the character.  We can check our results by looking at the read back at the <a href="http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=viewer&amp;m=data&amp;s=viewer&amp;run=SRR385952" target="_blank">SRA website</a>; see image, left.</p>
<p>Schemes for FASTQ encoding of quality scores are, apparently, a bit of a mess. We&#8217;ll see in later posts that some software tools expect certain encodings (and even have completely undocumented features to deal with other encoding schemes).  For now though, it&#8217;s enough to know that our FASTQ file is in agreement with the SRA data and move on.
</td>
</tr>
</table>
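<p>The same subtraction can be scripted.  A minimal sketch using <em>od</em> and <em>awk</em>; the function name phred33 is hypothetical and the offset of 33 is hard-coded:</p>
<p><pre class="brush: bash;">
# Convert a string of quality characters to scores, assuming offset 33
phred33() {
  # od prints the decimal byte values; tr puts one value per line
  printf '%s' "$1" | od -An -v -tu1 | tr -s ' ' '\n' | awk 'NF { print $1 - 33 }'
}

# First five quality characters of the first read
phred33 'BC#4A'
</pre></p>
<p>This prints 33, 34, 2, 19 and 32, one per line, matching the table above.</p>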
<p>So what have we achieved?</p>
<p><pre class="brush: bash;">
ls -l SRR385952.fastq
</pre></p>
<p><pre class="brush: plain;">
-rw-r--r-- 1 neil neil 43719072816 2011-12-22 14:56 SRR385952.fastq
</pre></p>
<p>We&#8217;ve generated a large (~ 44 GB) FASTQ file, containing paired reads with unique header IDs.  It was quite a bit of work simply to fetch and format the raw data.</p>
<p>In the next installment: manipulation of FASTQ files and sequence quality assessment.</p>
<br />Filed under: <a href='http://nsaunders.wordpress.com/category/bioinformatics/'>bioinformatics</a>, <a href='http://nsaunders.wordpress.com/category/research-diary/'>research diary</a> Tagged: <a href='http://nsaunders.wordpress.com/tag/how-to/'>how to</a>, <a href='http://nsaunders.wordpress.com/tag/next-generation/'>next-generation</a>, <a href='http://nsaunders.wordpress.com/tag/ngs/'>ngs</a>, <a href='http://nsaunders.wordpress.com/tag/sequencing/'>sequencing</a>, <a href='http://nsaunders.wordpress.com/tag/tutorial/'>tutorial</a>]]></content:encoded>
			<wfw:commentRss>http://nsaunders.wordpress.com/2011/12/22/sequencing-for-relics-from-the-sanger-era-part-1-getting-the-raw-data/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/e41743fa8aee7f5c7d1cd7ebfa77da85?s=96&#38;d=http%3A%2F%2F0.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96" medium="image">
			<media:title type="html">nsaunders</media:title>
		</media:content>

		<media:content url="http://nsaunders.files.wordpress.com/2011/12/20122011432.jpg?w=289" medium="image">
			<media:title type="html">20122011432</media:title>
		</media:content>

		<media:content url="http://nsaunders.files.wordpress.com/2011/12/sra-qual.png?w=300" medium="image">
			<media:title type="html">sra-qual</media:title>
		</media:content>
	</item>
		<item>
		<title>#arseniclife: the genome</title>
		<link>http://nsaunders.wordpress.com/2011/12/14/arseniclife-the-genome/</link>
		<comments>http://nsaunders.wordpress.com/2011/12/14/arseniclife-the-genome/#comments</comments>
		<pubDate>Wed, 14 Dec 2011 04:35:32 +0000</pubDate>
		<dc:creator>nsaunders</dc:creator>
				<category><![CDATA[bioinformatics]]></category>
		<category><![CDATA[genomics]]></category>
		<category><![CDATA[arseniclife]]></category>
		<category><![CDATA[extremophiles]]></category>
		<category><![CDATA[halomonas]]></category>
		<category><![CDATA[microbiology]]></category>

		<guid isPermaLink="false">http://nsaunders.wordpress.com/?p=2899</guid>
		<description><![CDATA[It&#8217;s about one year since the science story dubbed #arseniclife hit the headlines. November 30th saw the release of a draft genome sequence for Halomonas sp. GFAJ-1, the bacterium behind the furore. As Iddo pointed out on Twitter, sequencing the DNA from GFAJ-1 is itself strong evidence against arsenate in the DNA backbone, since the [...]]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s about one year since the science story dubbed <a href="https://twitter.com/search/%23arseniclife" target="_blank">#arseniclife</a> hit the headlines.  November 30th saw the release of <a href="http://www.ncbi.nlm.nih.gov/nuccore/AHBC00000000.1" target="_blank">a draft genome</a> sequence for <em>Halomonas</em> sp. GFAJ-1, the bacterium behind the furore.</p>
<p>As Iddo pointed out <a href="https://twitter.com/#!/iddux/status/143468587296886784" target="_blank">on Twitter</a>, sequencing the DNA from GFAJ-1 is itself strong evidence against arsenate in the DNA backbone, since the sequencing chemistry would be highly unlikely to work in that case. However, if like me you think that a new microbial genome provides the most fun to be had in bioinformatics [*], you&#8217;ll be excited by the availability of the data.</p>
<p>In this post then: where to get it, some very preliminary analysis and some things that you might like to do with it. Projects for your students, perhaps.</p>
<p><em>[*] note to self: why, then, am I working on colorectal cancer?</em><br />
<span id="more-2899"></span><br />
<strong>1. The data</strong><br />
Here&#8217;s your starting point: the NCBI Whole Genome Shotgun Sequencing Project page <a href="http://www.ncbi.nlm.nih.gov/Traces/wgs/?val=AHBC01" target="_blank">for GFAJ-1</a>. It provides a nice summary of the data including: the sequencing technology used (Illumina GAIIx), coverage (170x), assembly software (<a href="http://sourceforge.net/apps/mediawiki/mira-assembler/index.php?title=Main_Page" target="_blank">Mira</a>) and number of contigs (103).</p>
<p>Files for download: annotated contigs in Genbank format (<a href="ftp://ftp.ncbi.nlm.nih.gov//genbank/wgs/wgs.AHBC.1.gbff.gz" target="_blank">wgs.AHBC.1.gbff.gz</a>), the WGS &#8220;master&#8221; Genbank file (<a href="ftp://ftp.ncbi.nlm.nih.gov//genbank/wgs/wgs.AHBC.mstr.gbff.gz" target="_blank">wgs.AHBC.mstr.gbff.gz</a> &#8211; not really necessary) and contig sequences in fasta format (<a href="ftp://ftp.ncbi.nlm.nih.gov//genbank/wgs/wgs.AHBC.1.fsa_nt.gz" target="_blank">wgs.AHBC.1.fsa_nt.gz</a>).</p>
<p>You can also fetch raw sequencing data from the Sequence Read Archive (SRA), but more about that a little later.</p>
<p>Be aware that this is a draft genome, annotated using an automated computational pipeline. This means that there will be errors in the sequence, the genes and the annotation. It also means plenty of potential improvements can be made by smart, enthusiastic bioinformaticians.</p>
<p><strong>2. Arsenic metabolism</strong><br />
Of course your first instinct is to see whether any of the putative protein products have been annotated as having an &#8220;arsenic-related&#8221; function. Since this might involve arsenic, arsenate or arsenite, the simplest approach is:</p>
<p><pre class="brush: bash;">
grep arsen wgs.AHBC.1.gbff
</pre></p>
<p>And the result:<br />
<pre class="brush: plain;">
                     /product=&quot;arsenite-activated ATPase ArsA&quot;
                     /note=&quot;COG1055 Na+/H+ antiporter NhaD and related arsenite
                     /product=&quot;arsenical-resistance protein&quot;
</pre></p>
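<p>If you would rather see the whole feature around each hit than a bare matching line, a little awk can collect each feature block and print the ones that mention the search term. This is only a sketch: <em>features_matching</em> is a made-up helper name, and it assumes the standard GenBank flat-file layout, with feature keys indented by five spaces and their qualifiers indented further.</p>

```shell
#!/bin/sh
# features_matching PATTERN FILE
# Hypothetical helper: print every feature block (key line plus qualifiers)
# of a GenBank flat file that contains PATTERN, ignoring case.
features_matching() {
    awk -v pat="$1" '
        # Print the block collected so far if it mentions the pattern.
        function emit() {
            if (block != "" && index(tolower(block), tolower(pat)) > 0) print block
            block = ""
        }
        # Measure the leading indentation of the current line.
        { indent = match($0, /^ +/) ? RLENGTH : 0 }
        # Five spaces: a new feature key (CDS, gene, rRNA, ...).
        indent == 5 { emit(); block = $0; next }
        # Deeper indentation: a qualifier line of the open feature, if any.
        indent > 5 { if (block != "") block = block "\n" $0; next }
        # Anything else (LOCUS, ORIGIN, //) closes the open feature.
        { emit() }
        END { emit() }
    ' "$2"
}

# Only runs when the unpacked contig file is in the working directory.
if [ -f wgs.AHBC.1.gbff ]; then
    features_matching arsen wgs.AHBC.1.gbff
fi
```

<p>Matching is on a fixed string rather than a regular expression, so a term that happens to be wrapped across two lines of a long qualifier will be missed.</p>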
<p>Digging around in the file a little more reveals more information about these two putative products:<br />
<pre class="brush: plain;">
CDS             90659..91666
                     /locus_tag=&quot;MOY_00685&quot;
                     /note=&quot;COG0003 Oxyanion-translocating ATPase&quot;
                     /codon_start=1
                     /transl_table=11
                     /product=&quot;arsenite-activated ATPase ArsA&quot;
                     /protein_id=&quot;EHK62524.1&quot;
                     /db_xref=&quot;GI:359298308&quot;
                     /translation=&quot;MQAVLNRRLLWVGGKGGVGKTTVAASLAVLAARRGKRVLVVSTD
                     PAHSLGDVFDRALSDIPRRLLPNLDAMEIDPDIEVEAHLARVVKQMRRYAAPEMMQEL
                     ERQMRLTRQSPGTQEAALLERLARLMVDDSAPYDLIIFDTAPTGHTLRLLTLPEAMAA
                     WTDGLLAHNRKSAELGKVLEHLTPKRGRDVATPFDDPTVDPLDDLDERTRDVAKTLID
                     RRRLFHQARRRIEDSKACSFLFVMTPERLPILETDRAVKALEEVHIPVAGVLINRLIP
                     IEADGDFLQARREQEATYLTRIDELFERLPRPTLPWLPTDVQGIEVLEMLAQKLEQQG
                     F&quot;
</pre></p>
<p><pre class="brush: plain;">
CDS             8835..9863
                     /locus_tag=&quot;MOY_16088&quot;
                     /note=&quot;COG0798 Arsenite efflux pump ACR3 and related
                     permeases&quot;
                     /codon_start=1
                     /transl_table=11
                     /product=&quot;arsenical-resistance protein&quot;
                     /protein_id=&quot;EHK59503.1&quot;
                     /db_xref=&quot;GI:359295220&quot;
                     /translation=&quot;MGLFERYLSIWVAIAIVAGIALGQFAPAVPEVLSRFEYAQVSIP
                     IAVLIWAMIFPMMAQIDFSAIAGVRRQPKGLTITTTVNWLIKPFTMFALAWLFFMVIF
                     KPFIPEELASQYLAGAILLGAAPCTAMVFVWSYLTRGDAAYTLVQVALNDLIMLFAFA
                     PLVVFLLGISNIQVPWDTVILSVVLYIVIPLAAGYFTRKTLIAKYGTEWYDNVFMKRV
                     GPITPIGLIITLVLLFAFQGEVILNNPLHIVLIAIPLIIQTFLIFFIAYGWAKAWRVP
                     HNIAAPGAMIGASNFFELAVAAAIALFGLQSGAALATVVGVLVEVPLMLALVRIANKT
                     RQHFPENT&quot;
</pre></p>
<p>So it seems we may have a system for pumping arsenite out of the cell. Not unexpected, for an organism known to have high arsenic tolerance. Presumably, if the form of arsenic in the environment is <em>arsenate</em>, there&#8217;s a mechanism for reducing it to arsenite (if these annotations are correct). Some interesting bioenergetic possibilities there.</p>
<p>By the way, these findings seem to be at odds with the statement in <a href="http://www.usatoday.com/tech/science/columnist/vergano/story/2011-12-04/arseniclife-bacteria-dna/51593468/1" target="_blank">this article from USA Today</a> that &#8220;none of them [the genes] look like identified arsenic resistance genes found in other bugs.&#8221;</p>
<p>If we were to postulate that this organism incorporated arsenate into the DNA backbone (I do not believe it does), how might we use the genome sequence data?  I guess there are two main possibilities:</p>
<ol>
<li>The genes for nucleotide metabolism encode modified proteins which can catalyse reactions using arsenate-containing compounds instead of, or as well as, phosphates</li>
<li>The genome contains novel genes for nucleotide metabolism that we might not even recognise as such</li>
</ol>
<p>Exploring idea (1) a little further &#8211; nucleotide biosynthesis is a complicated business. See, for example, <a href="http://www.kegg.com/kegg-bin/show_pathway?map00230" target="_blank">this KEGG pathway map</a> for purine metabolism. I suppose in principle, it&#8217;s possible to retrieve the sequences for some of the key enzymes, search for GFAJ-1 orthologs using BLAST, then further investigate the GFAJ-1 protein sequences (domains? homology modelling?) to see if any of them are &#8220;unusual&#8221;. However, this entails all manner of unfounded assumptions about a hypothetical arsenic-based biochemistry: for example, the existence of analogous mono-, di- and tri-arsenate compounds. I wouldn&#8217;t go there myself.</p>
<p><strong>3. Phylogeny and genome alignment</strong><br />
How &#8220;Halomonas-like&#8221; is <em>Halomonas</em> GFAJ-1 anyway? Let&#8217;s start by looking for 16S rRNA genes:<br />
<pre class="brush: bash;">
grep -E &quot;^LOCUS|ribosomal RNA&quot; wgs.AHBC.1.gbff
</pre></p>
<p>Like many bacteria, GFAJ-1 appears to have multiple rRNA operons (unless the assembly has failed to merge them). One of these is on contig AHBC01000086:<br />
<pre class="brush: plain;">
LOCUS       AHBC01000086            5885 bp    DNA     linear   BCT 02-DEC-2011
                     /product=&quot;16S ribosomal RNA&quot;
                     /product=&quot;23S ribosomal RNA&quot;
                     /product=&quot;5S ribosomal RNA&quot;
</pre></p>
<p>It&#8217;s easy enough to copy/paste the 16S rDNA sequence from the file, or you can use the handy <em>bp_extract_feature_seq</em> script which comes with Bioperl to grab all rDNA sequences. The 16S sequence can be identified by its length (not the shortest which is 5S, or the longest which is 23S):</p>
<p><pre class="brush: bash;">
bp_extract_feature_seq -i wgs.AHBC.1.gbff --format genbank --feature=rRNA -o rrna.fa
</pre></p>
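<p>To pick out the 16S sequence by length without opening the file, you can tabulate the length of every record in <em>rrna.fa</em>. A rough sketch, where <em>fasta_lengths</em> is a hypothetical helper rather than anything shipped with Bioperl:</p>

```shell
#!/bin/sh
# fasta_lengths FILE
# Hypothetical helper: print "length<TAB>header" for each FASTA record,
# shortest first. Sequence may be wrapped over several lines.
fasta_lengths() {
    awk '
        /^>/ {
            if (name != "") print len "\t" name   # finish the previous record
            name = substr($0, 2)                  # header without the ">"
            len = 0
            next
        }
        { len += length($0) }                     # add up sequence lines
        END { if (name != "") print len "\t" name }
    ' "$1" | sort -n
}

# Only runs when the extracted rRNA sequences are present.
if [ -f rrna.fa ]; then
    fasta_lengths rrna.fa
fi
```

<p>A bacterial 16S rRNA gene is typically around 1 500 bp, so it should sit between the short 5S records and the long 23S records in the sorted output.</p>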
<p><div id="attachment_2916" class="wp-caption alignright" style="width: 310px"><a href="http://nsaunders.files.wordpress.com/2011/12/gfaj-1_tree.png"><img src="http://nsaunders.files.wordpress.com/2011/12/gfaj-1_tree.png?w=300&#038;h=236" alt="gfaj-1_tree" title="gfaj-1_tree" width="300" height="236" class="size-medium wp-image-2916" /></a><p class="wp-caption-text">NCBI BLAST GFAJ-1 vs. Bacteria</p></div>What follows is emphatically <strong>not</strong> the correct approach to phylogenetic analysis but let&#8217;s run with it anyway: a quick BLAST versus bacteria at NCBI and the results as a tree view. Click image at right for larger version.</p>
<p><em>Halomonas</em> appears to be a rather diverse genus but indeed, GFAJ-1 seems most similar to other <em>Halomonas</em>. Is there genomic information for any of the related organisms?  <a href="http://www.ncbi.nlm.nih.gov/genome/?term=halomonas" target="_blank">A quick search</a> suggests not: of the 7 organisms listed, only 2 have an available genome sequence, and neither features in the top 100 BLAST hits for GFAJ-1 16S rDNA.</p>
<p>Regardless, it might be fun to try aligning the GFAJ-1 contigs with a reference genome, in this case from <em>Halomonas elongata</em>. We can save <a href="http://www.ncbi.nlm.nih.gov/nuccore/307543589" target="_blank">the chromosome sequence</a> in Fasta format, install <a href="http://mummer.sourceforge.net/" target="_blank">MUMmer</a> and run:<br />
<pre class="brush: bash;">
nucmer -maxmatch -p nucmer helongata.fa wgs.AHBC.1.fsa_nt
show-coords -r -c -l nucmer.delta &gt; nucmer.coords
mapview -n 1 -p mapview nucmer.coords
xfig mapview_0.fig
</pre></p>
<table>
<tr>
<td valign="top">
And sure, there are a few nicely-aligned segments (click image, right), but nothing spectacular in terms of a good reference genome which we could use to order large numbers of contigs.
</td>
<td valign="top">
<div id="attachment_2919" class="wp-caption alignright" style="width: 310px"><a href="http://nsaunders.files.wordpress.com/2011/12/gfaj-1_mummer.png"><img src="http://nsaunders.files.wordpress.com/2011/12/gfaj-1_mummer.png?w=300&#038;h=210" alt="gfaj-1_mummer" title="gfaj-1_mummer" width="300" height="210" class="size-medium wp-image-2919" /></a><p class="wp-caption-text">Part of MUMmer alignment: GFAJ-1 contigs to H. elongata chromosome</p></div>
</td>
</tr>
</table>
<p><strong>4. Assembly quality and raw data</strong><br />
First, a quick look at the contig statistics, using <em>grep</em> and a little R:</p>
<table>
<tr>
<td valign="top">
<pre class="brush: bash;">
grep ^LOCUS wgs.AHBC.1.gbff &gt; locus.txt
</pre></p>
<p><pre class="brush: r;">
library(ggplot2)
loc &lt;- read.table(&quot;locus.txt&quot;, header = FALSE, stringsAsFactors = FALSE)
# V3 holds the contig length in bp
fivenum(loc$V3)
# [1]    545   1385   9163  48993 233806
ggplot(loc) + geom_density(aes(V3), fill = &quot;cornsilk&quot;) + theme_bw() +
labs(title = &quot;Density plot GFAJ-1 contig lengths&quot;)
</pre>
</td>
<td valign="top">
<div id="attachment_2924" class="wp-caption alignright" style="width: 310px"><a href="http://nsaunders.files.wordpress.com/2011/12/contigs1.png"><img src="http://nsaunders.files.wordpress.com/2011/12/contigs1.png?w=300&#038;h=300" alt="contigs" title="contigs" width="300" height="300" class="size-medium wp-image-2924" /></a><p class="wp-caption-text">Density plot of GFAJ-1 contig length</p></div>
</td>
</tr>
</table>
<p>This tells us that many of the contigs are rather short: half of them are 9 163 bp or less (the median). At the other end of the scale, 12 contigs are 100 000 bp or more in length and together they add up to around 1/3 of the total genome size, which seems like a good result.</p>
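<p>The same numbers can be checked from the shell. The sketch below reads <em>locus.txt</em>, counts the contigs at or above a length cutoff and reports their share of the assembly. The name <em>long_contig_share</em> and the default cutoff of 100 000 bp are illustrative choices, and it assumes that the third whitespace-separated field of each LOCUS line is the contig length.</p>

```shell
#!/bin/sh
# long_contig_share FILE [MIN_LENGTH]
# Hypothetical helper: summarise LOCUS lines, where field 3 is the contig
# length in bp. MIN_LENGTH defaults to 100000.
long_contig_share() {
    awk -v min="${2:-100000}" '
        { total += $3; n++ }                        # every contig
        $3 >= min { sum_long += $3; n_long++ }      # contigs at or above the cutoff
        END {
            if (total > 0)
                printf "%d of %d contigs >= %d bp; %.1f%% of %d bp total\n", n_long, n, min, 100 * sum_long / total, total
        }
    ' "$1"
}

# Only runs when the grep output from the step above exists.
if [ -f locus.txt ]; then
    long_contig_share locus.txt
fi
```

<p>Passing a second argument, for example <em>long_contig_share locus.txt 50000</em>, shows how the picture changes with a different cutoff.</p>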
<p>Good, but might other assembly software perform better? Happily, you&#8217;re in a position to find out because the raw sequencing data <a href="http://www.ncbi.nlm.nih.gov/sra/SRX109792" target="_blank">can be found</a> in the Sequence Read Archive (SRA).  You can grab them in SRA format (danger! 7.5 GB!) like so:<br />
<pre class="brush: bash;">
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR385/SRR385952/SRR385952.sra
</pre></p>
<p>You&#8217;ll then need to grab the <a href="http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=show&amp;f=software&amp;m=software&amp;s=software" target="_blank">SRA toolkit</a> and dump the SRA to FASTQ format. Things become complex here because the toolkit documentation is not great and we need to figure out (a) what kind of reads we have and (b) how to restore them from the SRA file.</p>
<p>To cut a long story short, this line tells us that we have a paired-end library:<br />
<pre class="brush: plain;">
Layout: PAIRED, Orientation: , Nominal length: 150, Nominal Std Dev: 30
</pre></p>
<p>and so we need to dump that into 2 fastq files:<br />
<pre class="brush: bash;">
fastq-dump --split-files SRR385952.sra
</pre></p>
<p>but this has the unfortunate effect of generating identical fastq headers in both files:<br />
<pre class="brush: bash;">
head -4 SRR385952_*.fastq
</pre></p>
<p><pre class="brush: plain;">
==&gt; SRR385952_1.fastq &lt;==
@SRR385952.1 HWI-ST484_0123:8:1101:1154:2066 length=101
ATNCGTCCTTTATTCTGCCAGGGAATTGGCCCGTTCCGCTGGGCAGCGCTTTCCGGCGACCCGGAAGATATTTACAAAACCGACCAGAAGGTCAAAGAGCT
+SRR385952.1 HWI-ST484_0123:8:1101:1154:2066 length=101
BC#4ADDFHHHHHJHIJJJJJJJIJJIJJJJJJJJHGGHIJJJJIJJJIJJHHHHFEDDDDDDDDDDDDDDEECDCCCBDBD&gt;DDDDDDDDCCCCCDDDDA

==&gt; SRR385952_2.fastq &lt;==
@SRR385952.1 HWI-ST484_0123:8:1101:1154:2066 length=101
GGTCGTCGGGGATCAGCTCTTTGACCTTCTGGTCGGTTTTGTAAATATCTTCCGGGTCGCCGGAAAGCGCTGCCCAGCGGAACGGGCCAATTCCCTGGCAG
+SRR385952.1 HWI-ST484_0123:8:1101:1154:2066 length=101
CCBFFFFFHHHHHJJJJJJJJJJJJIIJJJJIEHIJHIJJIHHIJJIIGIIIHHHFDDDDDDDDDDDDDDBDDDDDDDDDBDDDDDDDDBDDCDDDDBDDD
</pre></p>
<p>and so, some <em>sed</em> is required to generate the more standard &#8220;/1 or /2&#8221; fastq header suffix:<br />
<pre class="brush: bash;">
sed -i -e 's/^@SRR385952\.[0-9]*/&amp;\/1/' SRR385952_1.fastq
sed -i -e 's/^@SRR385952\.[0-9]*/&amp;\/2/' SRR385952_2.fastq
</pre></p>
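<p>The <em>sed</em> commands edit the files in place and rely on the accession prefix appearing only at the start of header lines. A more defensive alternative, sketched here with a hypothetical <em>add_mate_suffix</em> helper, touches only the first line of each four-line record and writes to standard output. It assumes the plain four-lines-per-read FASTQ layout shown above, and the <em>.renamed.fastq</em> output names are arbitrary.</p>

```shell
#!/bin/sh
# add_mate_suffix MATE FILE
# Hypothetical helper: append "/MATE" to the read identifier on the header
# line of every FASTQ record (lines 1, 5, 9, ...), leaving sequence, "+"
# and quality lines untouched, even when a quality string starts with "@".
add_mate_suffix() {
    awk -v mate="$1" '
        NR % 4 == 1 { sub(/^@[^ ]+/, "&/" mate) }   # "&" stands for the matched identifier
        { print }
    ' "$2"
}

# Only runs when both dumped files are present.
if [ -f SRR385952_1.fastq ] && [ -f SRR385952_2.fastq ]; then
    add_mate_suffix 1 SRR385952_1.fastq > SRR385952_1.renamed.fastq
    add_mate_suffix 2 SRR385952_2.fastq > SRR385952_2.renamed.fastq
fi
```

<p>Writing to new files costs disk space, but it leaves the original dump intact should anything go wrong.</p>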
<p>After which, you can experiment with the <em>de novo</em> assembler <a href="http://gage.cbcb.umd.edu/recipes/index.html" target="_blank">of your choice</a>.  I&#8217;ll leave the appalling quality and documentation of NGS software in general for another post.</p>
<p>Enjoy &#8211; but don&#8217;t spend too much time on it. You won&#8217;t find the key to &#8220;arsenic life&#8221; in this genome.</p>
<br />Filed under: <a href='http://nsaunders.wordpress.com/category/bioinformatics/'>bioinformatics</a>, <a href='http://nsaunders.wordpress.com/category/genomics/'>genomics</a> Tagged: <a href='http://nsaunders.wordpress.com/tag/arseniclife/'>arseniclife</a>, <a href='http://nsaunders.wordpress.com/tag/extremophiles/'>extremophiles</a>, <a href='http://nsaunders.wordpress.com/tag/halomonas/'>halomonas</a>, <a href='http://nsaunders.wordpress.com/tag/microbiology/'>microbiology</a>]]></content:encoded>
			<wfw:commentRss>http://nsaunders.wordpress.com/2011/12/14/arseniclife-the-genome/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/e41743fa8aee7f5c7d1cd7ebfa77da85?s=96&#38;d=http%3A%2F%2F0.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96" medium="image">
			<media:title type="html">nsaunders</media:title>
		</media:content>

		<media:content url="http://nsaunders.files.wordpress.com/2011/12/gfaj-1_tree.png?w=300" medium="image">
			<media:title type="html">gfaj-1_tree</media:title>
		</media:content>

		<media:content url="http://nsaunders.files.wordpress.com/2011/12/gfaj-1_mummer.png?w=300" medium="image">
			<media:title type="html">gfaj-1_mummer</media:title>
		</media:content>

		<media:content url="http://nsaunders.files.wordpress.com/2011/12/contigs1.png?w=300" medium="image">
			<media:title type="html">contigs</media:title>
		</media:content>
	</item>
		<item>
		<title>A Friday round-up</title>
		<link>http://nsaunders.wordpress.com/2011/12/02/a-friday-round-up/</link>
		<comments>http://nsaunders.wordpress.com/2011/12/02/a-friday-round-up/#comments</comments>
		<pubDate>Thu, 01 Dec 2011 23:36:06 +0000</pubDate>
		<dc:creator>nsaunders</dc:creator>
				<category><![CDATA[bioinformatics]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[research diary]]></category>
		<category><![CDATA[statistics]]></category>
		<category><![CDATA[web resources]]></category>

		<guid isPermaLink="false">http://nsaunders.wordpress.com/?p=2891</guid>
		<description><![CDATA[Just a brief selection of items that caught my eye this week. Note that this is a Friday as opposed to Friday, lest you mistake this for a new, regular feature. 1. R/statistics ggbio A new Bioconductor package which builds on the excellent ggplot graphics library, for the visualization of biological data. R development master [...]]]></description>
			<content:encoded><![CDATA[<p>Just a brief selection of items that caught my eye this week. Note that this is <em>a Friday</em> as opposed to <em>Friday</em>, lest you mistake this for a new, regular feature.</p>
<p><strong>1. R/statistics</strong></p>
<ul>
<li><a href="http://www.bioconductor.org/packages/release/bioc/html/ggbio.html" target="_blank">ggbio</a><br />
A new Bioconductor package which builds on the excellent <a href="http://had.co.nz/ggplot/" target="_blank">ggplot</a> graphics library, for the visualization of biological data.</li>
<li><a href="http://courses.had.co.nz/11-csiro/" target="_blank">R development master class</a><br />
Hadley Wickham recently presented this course on R package development for my organisation. I was on parental leave at the time, otherwise I would have attended for sure.</li>
</ul>
<p><strong>2. Bioinformatics in the media</strong><br />
<a href="http://www.nytimes.com/2011/12/01/business/dna-sequencing-caught-in-deluge-of-data.html?_r=1&amp;pagewanted=all" target="_blank">DNA Sequencing Caught in Deluge of Data</a></p>
<p>I described this NYT article as a &#8220;<a href="http://twitter.com/#!/neilfws/status/142105572530065408" target="_blank">surprisingly-good intro article</a>&#8221;.  Michael Eisen described it as &#8220;<a href="http://www.michaeleisen.org/blog/?p=770" target="_blank">kind of silly</a>&#8221;.</p>
<p>I think we&#8217;re both right. Michael&#8217;s perspective is that of an expert in high-throughput sequencing data; I&#8217;m just pleased to see an introduction to bioinformatics for non-specialists in a mainstream newspaper. And I note that they have corrected the figure caption which offended Michael.</p>
<p>As to the &#8220;deluge&#8221;: yes, there are other sciences that generate more data and yes, we probably don&#8217;t need to archive/analyse a lot of the raw data. However, I&#8217;d contend that the basic premise of the article is correct: we are sequencing faster than we can analyse. The solution, obviously, is more bioinformaticians.</p>
<br />Filed under: <a href='http://nsaunders.wordpress.com/category/bioinformatics/'>bioinformatics</a>, <a href='http://nsaunders.wordpress.com/category/statistics/r/'>R</a>, <a href='http://nsaunders.wordpress.com/category/research-diary/'>research diary</a>, <a href='http://nsaunders.wordpress.com/category/statistics/'>statistics</a>, <a href='http://nsaunders.wordpress.com/category/web-resources/'>web resources</a>]]></content:encoded>
			<wfw:commentRss>http://nsaunders.wordpress.com/2011/12/02/a-friday-round-up/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/e41743fa8aee7f5c7d1cd7ebfa77da85?s=96&#38;d=http%3A%2F%2F0.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96" medium="image">
			<media:title type="html">nsaunders</media:title>
		</media:content>
	</item>
		<item>
		<title>Boring, monotonous day-to-day tasks? That&#8217;s synonymous with bioinformatics.</title>
		<link>http://nsaunders.wordpress.com/2011/11/24/boring-monotonous-day-to-day-tasks-thats-synonymous-with-bioinformatics/</link>
		<comments>http://nsaunders.wordpress.com/2011/11/24/boring-monotonous-day-to-day-tasks-thats-synonymous-with-bioinformatics/#comments</comments>
		<pubDate>Thu, 24 Nov 2011 05:18:38 +0000</pubDate>
		<dc:creator>nsaunders</dc:creator>
				<category><![CDATA[bioinformatics]]></category>
		<category><![CDATA[productivity]]></category>
		<category><![CDATA[biostar]]></category>
		<category><![CDATA[licklider]]></category>

		<guid isPermaLink="false">http://nsaunders.wordpress.com/?p=2886</guid>
		<description><![CDATA[In response to this question, I can only point out that J.C.R. Licklider figured it out over 50 years ago: Despite the fact that there is a voluminous literature on thinking and problem solving, including intensive case-history studies of the process of invention, I could find nothing comparable to a time-and-motion-study analysis of the mental [...]]]></description>
			<content:encoded><![CDATA[<p>In response to <a href="http://biostar.stackexchange.com/questions/14693/blue-collar-bioinformatics-what-are-the-boring-monotonous-day-to-day" target="_blank">this question</a>, I can only point out that <a href="http://en.wikipedia.org/wiki/Licklider" target="_blank">J.C.R. Licklider</a> <a href="http://groups.csail.mit.edu/medg/people/psz/Licklider.html" target="_blank">figured it out</a> over 50 years ago:</p>
<blockquote><p>Despite the fact that there is a voluminous literature on thinking and problem solving, including intensive case-history studies of the process of invention, I could find nothing comparable to a time-and-motion-study analysis of the mental work of a person engaged in a scientific or technical enterprise. In the spring and summer of 1957, therefore, I tried to keep track of what one moderately technical person actually did during the hours he regarded as devoted to work. Although I was aware of the inadequacy of the sampling, I served as my own subject.</p>
<p>It soon became apparent that the main thing I did was to keep records, and the project would have become an infinite regress if the keeping of records had been carried through in the detail envisaged in the initial plan. It was not. Nevertheless, I obtained a picture of my activities that gave me pause. Perhaps my spectrum is not typical&#8211;I hope it is not, but I fear it is.</p>
<p>About 85 per cent of my &#8220;thinking&#8221; time was spent getting into a position to think, to make a decision, to learn something I needed to know. Much more time went into finding or obtaining information than into digesting it. Hours went into the plotting of graphs, and other hours into instructing an assistant how to plot. When the graphs were finished, the relations were obvious at once, but the plotting had to be done in order to make them so. At one point, it was necessary to compare six experimental determinations of a function relating speech-intelligibility to speech-to-noise ratio. No two experimenters had used the same definition or measure of speech-to-noise ratio. Several hours of calculating were required to get the data into comparable form. When they were in comparable form, it took only a few seconds to determine what I needed to know.</p>
<p>Throughout the period I examined, in short, my &#8220;thinking&#8221; time was devoted mainly to activities that were essentially clerical or mechanical: searching, calculating, plotting, transforming, determining the logical or dynamic consequences of a set of assumptions or hypotheses, preparing the way for a decision or an insight. Moreover, my choices of what to attempt and what not to attempt were determined to an embarrassingly great extent by considerations of clerical feasibility, not intellectual capability.</p></blockquote>
<br />Filed under: <a href='http://nsaunders.wordpress.com/category/bioinformatics/'>bioinformatics</a>, <a href='http://nsaunders.wordpress.com/category/productivity/'>productivity</a> Tagged: <a href='http://nsaunders.wordpress.com/tag/biostar/'>biostar</a>, <a href='http://nsaunders.wordpress.com/tag/licklider/'>licklider</a>]]></content:encoded>
			<wfw:commentRss>http://nsaunders.wordpress.com/2011/11/24/boring-monotonous-day-to-day-tasks-thats-synonymous-with-bioinformatics/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/e41743fa8aee7f5c7d1cd7ebfa77da85?s=96&#38;d=http%3A%2F%2F0.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96" medium="image">
			<media:title type="html">nsaunders</media:title>
		</media:content>
	</item>
		<item>
		<title>Interacting with bioinformatics webservers using R</title>
		<link>http://nsaunders.wordpress.com/2011/09/08/interacting-with-bioinformatics-webservers-using-r/</link>
		<comments>http://nsaunders.wordpress.com/2011/09/08/interacting-with-bioinformatics-webservers-using-r/#comments</comments>
		<pubDate>Thu, 08 Sep 2011 12:49:42 +0000</pubDate>
		<dc:creator>nsaunders</dc:creator>
				<category><![CDATA[bioinformatics]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[statistics]]></category>
		<category><![CDATA[how to]]></category>
		<category><![CDATA[protein structure]]></category>
		<category><![CDATA[what if]]></category>

		<guid isPermaLink="false">http://nsaunders.wordpress.com/?p=2858</guid>
		<description><![CDATA[In an ideal world, all bioinformatics tools would be made available via the Web as a web service with an API, as well as a standalone package to download for local use. This is rarely the case and sometimes, even where one or the other is available, factors such as cost come into play. So [...]]]></description>
			<content:encoded><![CDATA[<p>In an ideal world, all bioinformatics tools would be made available via the Web as a web service with an API, as well as a standalone package to download for local use.  This is rarely the case and sometimes, even where one or the other is available, factors such as cost come into play.  So we resort to <em>web scraping</em>; writing code to interact with the code that lies behind a web server so as to submit queries, retrieve and parse results.</p>
<p>Normally, I&#8217;d use something like Ruby&#8217;s <a href="http://mechanize.rubyforge.org/" title="Mechanize" target="_blank">Mechanize library</a> for this purpose. However, where the purpose is to retrieve delimited data for analysis using R, I figured it was time to try and achieve the entire process within R.  So here&#8217;s how I used the <em>RCurl</em> and <em>XML</em> packages to interact with the <a href="http://swift.cmbi.ru.nl/servers/html/index.html" title="whatif" target="_blank">WHAT IF server</a>, which provides tools for the analysis of protein structure.<br />
<span id="more-2858"></span></p>
<p><strong>1. A note of caution</strong><br />
The administrators of some servers don&#8217;t like robots; particularly those which fire off thousands of queries per second to a server designed to cope only with a small number of requests. So use common sense; check whether the administrators have made their policy public and respect the limits.</p>
<p><strong>2. Know your HTML</strong><br />
Step 1 in automating form submission is understanding how the web form works. So: head to the WHAT IF server, click &#8220;Atomic contacts&#8221; in the left menu, then right-click &#8220;Salt bridges&#8221; and open <a href="http://swift.cmbi.ru.nl/servers/html/shosbr.html" title="salt bridges" target="_blank">that link</a> in a new tab. Examination of the HTML shows this:</p>
<p><pre class="brush: xml;">
&lt;FORM METHOD=&quot;POST&quot; ACTION=&quot;/wiw-cgi/GenericCGI.py&quot; ENCTYPE=&quot;multipart/form-data&quot;&gt; 
&lt;INPUT TYPE=&quot;hidden&quot; NAME=&quot;request&quot; VALUE=&quot;shosbr&quot; &gt; 
&lt;TABLE border=0 cellpadding=4 cellspacing=1 width=&quot;100%&quot;&gt; 
&lt;TR&gt;&lt;TD Align=left &gt;Either Choose a pdb-file,&lt;/TD&gt; 
&lt;TD Align=left &gt;&lt;INPUT TYPE=&quot;TEXT&quot; NAME=&quot;&amp;PDB1&quot; SIZE=8 MAXLENGTH=4 &gt; 
&lt;/TD&gt; 
&lt;/TR&gt; 
&lt;TR&gt;&lt;TD Align=left &gt;Or Choose your own file&lt;/TD&gt; 
&lt;TD Align=left &gt;&lt;INPUT TYPE=&quot;file&quot; NAME=&quot;&amp;FIL1&quot; &gt; 
&lt;/TD&gt; 
&lt;/TR&gt; 
&lt;/TABLE&gt;&lt;P&gt; 
&lt;INPUT TYPE=&quot;submit&quot; NAME=&quot;SubmitButton&quot; VALUE=&quot;Send&quot; &gt; 
&lt;INPUT TYPE=&quot;reset&quot; NAME=&quot;Reset&quot; VALUE=&quot;Clear Form&quot; &gt; 
&lt;/FORM&gt;
</pre></p>
<p>What this tells you is that the server runs a Python script, named <em>GenericCGI.py</em>, to which you can submit a set of <em>name=value</em> parameters: <em>request=shosbr</em>, <em>&amp;PDB1=NNNN</em> (where NNNN is a PDB identifier) or <em>&amp;FIL1=FILE</em>, where FILE is a PDB file on your local machine.</p>
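<p>If you would rather not read the form by eye, the same information can be pulled out with the <em>XML</em> package. This is only a sketch: <em>form_html</em> is a trimmed, hand-copied version of the form above rather than a live download, and the code assumes that the HTML parser reports tag and attribute names in lower case.</p>
<p><pre class="brush: r;">
library(XML)

# Trimmed copy of the form shown above (layout table and reset button removed)
form_html &lt;- '&lt;FORM METHOD=&quot;POST&quot; ACTION=&quot;/wiw-cgi/GenericCGI.py&quot; ENCTYPE=&quot;multipart/form-data&quot;&gt;
&lt;INPUT TYPE=&quot;hidden&quot; NAME=&quot;request&quot; VALUE=&quot;shosbr&quot;&gt;
&lt;INPUT TYPE=&quot;TEXT&quot; NAME=&quot;&amp;PDB1&quot; SIZE=8 MAXLENGTH=4&gt;
&lt;INPUT TYPE=&quot;file&quot; NAME=&quot;&amp;FIL1&quot;&gt;
&lt;INPUT TYPE=&quot;submit&quot; NAME=&quot;SubmitButton&quot; VALUE=&quot;Send&quot;&gt;
&lt;/FORM&gt;'

doc &lt;- htmlParse(form_html, asText = TRUE)
# where the form is sent
form_action &lt;- xpathSApply(doc, &quot;//form&quot;, xmlGetAttr, &quot;action&quot;)
# the parameter names that the script expects
input_names &lt;- xpathSApply(doc, &quot;//input&quot;, xmlGetAttr, &quot;name&quot;)
</pre></p>
<p>Printing <em>form_action</em> and <em>input_names</em> gives the script path and the parameter names, which is all that is needed to build a query.</p>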
<p><strong>3. Submission of query</strong><br />
Let&#8217;s choose a small protein &#8211; <a href="http://www.pdb.org/pdb/explore.do?structureId=3N2J" title="3n2j" target="_blank">3N2J</a>, a bacterial azurin &#8211; and submit it to WHAT IF using R:</p>
<p><pre class="brush: r;">
library(RCurl)
salt &lt;- postForm(&quot;http://swift.cmbi.ru.nl/wiw-cgi/GenericCGI.py&quot;, &quot;&amp;PDB1&quot; = &quot;3N2J&quot;, &quot;request&quot; = &quot;shosbr&quot;, &quot;SubmitButton&quot; = &quot;Send&quot;)
</pre></p>
<p>That takes a little while to run and when finished, the variable <em>salt</em> contains a bunch of HTML which looks like this:</p>
<p><pre class="brush: xml;">
&lt;!DOCTYPE HTML PUBLIC \&quot;-//W3C//DTD HTML 3.2//EN\&quot;&gt;\n&lt;HTML&gt;\n\n&lt;!-- This file generated using Python HTMLgen module. --&gt;\n&lt;HEAD&gt;\n  &lt;META NAME=\&quot;GENERATOR\&quot; CONTENT=\&quot;HTMLgen 2.0.6\&quot;&gt;\n        &lt;TITLE&gt;Salt bridges.&lt;/TITLE&gt;\n &lt;/HEAD&gt;\n&lt;BODY BGCOLOR=\&quot;WHITE\&quot;&gt;\n&lt;H1&gt;Salt bridges.&lt;/H1&gt;\n\n\n            Your job has been passed to WHAT IF and is processed.\n            Your request might take a while, please wait.\n            \n\n&lt;/BODY&gt; &lt;/HTML&gt;\n&lt;!DOCTYPE HTML PUBLIC \&quot;-//W3C//DTD HTML 3.2//EN\&quot;&gt;\n&lt;HTML&gt;\n\n&lt;!-- This file generated using Python HTMLgen module. --&gt;\n&lt;HEAD&gt;\n  &lt;META NAME=\&quot;GENERATOR\&quot; CONTENT=\&quot;HTMLgen 2.0.6\&quot;&gt;\n        &lt;TITLE&gt;Salt bridges.&lt;/TITLE&gt;\n &lt;/HEAD&gt;\n&lt;BODY BGCOLOR=\&quot;WHITE\&quot;&gt;\n\n&lt;P&gt;\n\nAccess your result.\n\n&lt;FORM METHOD=\&quot;POST\&quot; ACTION=\&quot;/wiw-cgi//GenericCGI.py\&quot;&gt;\n&lt;INPUT TYPE=\&quot;hidden\&quot; NAME=\&quot;ID\&quot; VALUE=\&quot;/tmprM7rDq/\&quot; &gt;\n&lt;INPUT TYPE=\&quot;hidden\&quot; NAME=\&quot;refresh\&quot; VALUE=\&quot;shosbr\&quot; &gt;\n&lt;INPUT TYPE=\&quot;submit\&quot; NAME=\&quot;submit\&quot; VALUE=\&quot;Results\&quot; &gt;\n\n&lt;/FORM&gt;\n\n    &lt;HR&gt; \n    If you have detected any error, or have any question or suggestion,\n    please send an Email to Gert Vriend.&lt;BR&gt;\n    Roland Krause, Jens Erik Nielsen, \n    &lt;a href=\&quot;mailto: Vriend@CMBI.ru.nl\&quot;&gt;Gert Vriend&lt;/a&gt;.&lt;P&gt;\n    &lt;HR&gt;\n    Last modified Thu Sep  8 12:07:53 2011\n    \n\n&lt;/BODY&gt; &lt;/HTML&gt;\n
</pre></p>
<p><strong>4. Retrieval of results</strong><br />
The WHAT IF server is a bit of a tease. Examination of the HTML in variable <em>salt</em>, above, reveals that instead of returning results, we have an intermediate page containing another form, which we have to submit to retrieve the final data.  The key form parameters are: <em>ID=/tmprQzd2o/</em>, <em>submit=Results</em> and <em>refresh=shosbr</em>. The parameter ID refers to a temporary location on the server which holds the results. It changes with each submitted query, which is why the value used here differs from the <em>/tmprM7rDq/</em> in the output shown above.</p>
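<p>Since the ID changes with every run, copying it by hand soon gets tedious. Here is a sketch of reading the hidden fields with the <em>XML</em> package instead. Note the assumptions: <em>page</em> is a hand-trimmed stand-in for the HTML held in <em>salt</em>, and the real response contains two concatenated documents, so treat this as a starting point rather than a tested recipe.</p>
<p><pre class="brush: r;">
library(XML)

# Stand-in for the relevant part of the intermediate page
page &lt;- '&lt;HTML&gt;&lt;BODY&gt;
&lt;FORM METHOD=&quot;POST&quot; ACTION=&quot;/wiw-cgi//GenericCGI.py&quot;&gt;
&lt;INPUT TYPE=&quot;hidden&quot; NAME=&quot;ID&quot; VALUE=&quot;/tmprM7rDq/&quot;&gt;
&lt;INPUT TYPE=&quot;hidden&quot; NAME=&quot;refresh&quot; VALUE=&quot;shosbr&quot;&gt;
&lt;INPUT TYPE=&quot;submit&quot; NAME=&quot;submit&quot; VALUE=&quot;Results&quot;&gt;
&lt;/FORM&gt;&lt;/BODY&gt;&lt;/HTML&gt;'

doc    &lt;- htmlParse(page, asText = TRUE)
inputs &lt;- getNodeSet(doc, &quot;//form//input&quot;)
# named character vector: one value per form field
params &lt;- sapply(inputs, xmlGetAttr, &quot;value&quot;)
names(params) &lt;- sapply(inputs, xmlGetAttr, &quot;name&quot;)
job_id &lt;- params[[&quot;ID&quot;]]
</pre></p>
<p>The value in <em>job_id</em> can then stand in for a hard-coded ID when the form is submitted.</p>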
<p>We can submit this form in the same way:</p>
<p><pre class="brush: r;">
library(RCurl)
salt &lt;- postForm(&quot;http://swift.cmbi.ru.nl/wiw-cgi/GenericCGI.py&quot;, &quot;ID&quot; = &quot;/tmprQzd2o/&quot;, &quot;refresh&quot; = &quot;shosbr&quot;, &quot;submit&quot; = &quot;Results&quot;)
</pre></p>
<p>That returns a bunch more HTML &#8211; just looking at the first part:</p>
<p><pre class="brush: xml;">
&lt;!DOCTYPE HTML PUBLIC \&quot;-//W3C//DTD HTML 3.2//EN\&quot;&gt;\n&lt;HTML&gt;\n\n&lt;!-- This file generated using Python HTMLgen module. --&gt;\n&lt;HEAD&gt;\n  &lt;META NAME=\&quot;GENERATOR\&quot; CONTENT=\&quot;HTMLgen 2.0.6\&quot;&gt;\n        &lt;TITLE&gt;Salt bridges.&lt;/TITLE&gt;\n &lt;/HEAD&gt;\n&lt;BODY BGCOLOR=\&quot;white\&quot;&gt;\n&lt;H1&gt;Salt bridges.&lt;/H1&gt;\n\nThe list of salt bridges:\n&lt;PRE&gt;Date= 2011-09-08 12:07:52\nYou are prompted for the first range\nYou are prompted for the second range\n   1   6 ASP  (   6 ) A      OD1 -    1 ALA  (   1 ) A      N     6.78\n   2   6 ASP  (   6 ) A      OD2 -    1 ALA  (   1 ) A      N     5.98\n   3  11 ASP  (  11 ) A      OD1 -   35 HIS  (  35 ) A      ND1   4.35\n   4  11 ASP  (  11 ) A      OD1 -   35 HIS  (  35 ) A      NE2   4.23\n
...
</pre></p>
<p>Success. We have results and if you squint, you might be able to see that they are wrapped up in <em>PRE</em> tags.  Now, how to get them out?</p>
<p><strong>5. Parsing of results</strong><br />
Here, we bring the XML package into play.  First, parse the returned HTML and pull out the text between the PRE tags:</p>
<p><pre class="brush: r;">
library(XML)
sb  &lt;- htmlTreeParse(salt, useInternalNodes = TRUE)
sb1 &lt;- xpathApply(sb, &quot;//pre&quot;, xmlValue)
class(sb1)
# &quot;list&quot;
</pre></p>
<p>That gives us a list with a single element: a character vector holding the data as preformatted text.</p>
<p><strong>6. Conversion to a data frame</strong><br />
Now for the tricky part. It looks as though each line of the data is space-delimited. So, you might think that we could read it into a data frame by treating the text as a connection. Question is: how consistent is the data output? Do all lines contain the same number of fields, for example? What about fields that themselves contain spaces?</p>
<p>Well, all we can do is give it a try. First, a quick look at the start and end of the data:</p>
<p><pre class="brush: r;">
head(readLines(textConnection(sb1[[1]])))
# [1] &quot;Date= 2011-09-08 12:07:52&quot;                                             
# [2] &quot;You are prompted for the first range&quot;                                  
# [3] &quot;You are prompted for the second range&quot;                                 
# [4] &quot;   1   6 ASP  (   6 ) A      OD1 -    1 ALA  (   1 ) A      N     6.78&quot;
# [5] &quot;   2   6 ASP  (   6 ) A      OD2 -    1 ALA  (   1 ) A      N     5.98&quot;
# [6] &quot;   3  11 ASP  (  11 ) A      OD1 -   35 HIS  (  35 ) A      ND1   4.35&quot;

tail(readLines(textConnection(sb1[[1]])))
# [1] &quot; 5871506 ASP  (  98 ) L      OD2 - 1509 LYS  ( 101 ) L      NZ    6.02&quot;
# [2] &quot; 5881512 GLU  ( 104 ) L      OE1 - 1432 LYS  (  24 ) L      NZ    2.84&quot;
# [3] &quot; 5891512 GLU  ( 104 ) L      OE2 - 1432 LYS  (  24 ) L      NZ    4.42&quot;
# [4] &quot; 5901514 GLU  ( 106 ) L      OE2 - 1511 LYS  ( 103 ) L      NZ    5.43&quot;
# [5] &quot; 5911572 LYS  ( 128 ) L      O'' - 1432 LYS  (  24 ) L      NZ    6.43&quot;
# [6] &quot;&quot;
</pre></p>
<p>OK, looks like we can lose the first 3 lines and perhaps split on whitespace?</p>
<p><pre class="brush: r;">
sbl &lt;- readLines(textConnection(sb1[[1]]))
sbl &lt;- sbl[4:length(sbl)]
sbt &lt;- read.table(textConnection(sbl), fill = TRUE)

head(sbt)
#   V1 V2  V3 V4 V5 V6 V7  V8 V9 V10 V11 V12 V13 V14 V15 V16  V17
# 1  1  6 ASP  (  6  )  A OD1  -   1 ALA   (   1   )   A   N 6.78
# 2  2  6 ASP  (  6  )  A OD2  -   1 ALA   (   1   )   A   N 5.98
# 3  3 11 ASP  ( 11  )  A OD1  -  35 HIS   (  35   )   A ND1 4.35
# 4  4 11 ASP  ( 11  )  A OD1  -  35 HIS   (  35   )   A NE2 4.23
# 5  5 11 ASP  ( 11  )  A OD2  -  35 HIS   (  35   )   A ND1 6.01
# 6  6 11 ASP  ( 11  )  A OD2  -  35 HIS   (  35   )   A NE2 6.33

tail(sbt)
#          V1  V2 V3  V4 V5 V6  V7 V8   V9 V10 V11 V12 V13 V14 V15  V16 V17
# 586 5861506 ASP  (  98  )  L OD2  - 1435 LYS   (  27   )   L  NZ 5.09  NA
# 587 5871506 ASP  (  98  )  L OD2  - 1509 LYS   ( 101   )   L  NZ 6.02  NA
# 588 5881512 GLU  ( 104  )  L OE1  - 1432 LYS   (  24   )   L  NZ 2.84  NA
# 589 5891512 GLU  ( 104  )  L OE2  - 1432 LYS   (  24   )   L  NZ 4.42  NA
# 590 5901514 GLU  ( 106  )  L OE2  - 1511 LYS   ( 103   )   L  NZ 5.43  NA
# 591 5911572 LYS  ( 128  )  L O''  - 1432 LYS   (  24   )   L  NZ 6.43  NA
</pre></p>
<p>That almost worked &#8211; except&#8230;</p>
<p><strong>7. &#8230;we have a glitch</strong></p>
<p>Rows in the data frame <em>sbt</em> which look like this:</p>
<p><pre class="brush: plain;">
5911572 LYS  ( 128  )  L O''  - 1432 LYS   (  24   )   L  NZ 6.43  NA
</pre></p>
<p>should actually look like this:</p>
<p><pre class="brush: plain;">
591 1572 LYS  ( 128  )  L O''  - 1432 LYS   (  24   )   L  NZ 6.43
</pre></p>
<p>The problem lies with the original output from WHAT IF. There needs to be a space between column 1 (contact number) and column 2 (residue number of first interacting molecule).  In fact, I noticed this problem for the first time today, in the course of this R investigation.</p>
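<p>One possible workaround, sketched here on two lines copied from the output above, is to treat the first field as fixed-width and insert the missing space before parsing. The assumption that the contact number always occupies exactly the first four characters is a guess based on these examples, not something documented by WHAT IF, and <em>sbl_demo</em> is just a name chosen for this illustration.</p>
<p><pre class="brush: r;">
# Two lines in the format returned by the server: one fine, one fused
sbl_demo &lt;- c(
  &quot;   1   6 ASP  (   6 ) A      OD1 -    1 ALA  (   1 ) A      N     6.78&quot;,
  &quot; 5911572 LYS  ( 128 ) L      O'' - 1432 LYS  (  24 ) L      NZ    6.43&quot;
)

# Assumption: column 1 is always 4 characters wide, so add a space after it
fixed &lt;- sub(&quot;^(.{4})&quot;, &quot;\\1 &quot;, sbl_demo)

# quote = &quot;&quot; stops atom names such as O'' being read as quoted strings
sbt_demo &lt;- read.table(text = fixed, quote = &quot;&quot;, stringsAsFactors = FALSE)
</pre></p>
<p>With the space restored, both rows should split into the same 17 fields and the trailing NA column disappears.</p>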
<p>Well, you win some, you lose some. At least I learned a thing or two about RCurl and XML.</p>
<br />Filed under: <a href='http://nsaunders.wordpress.com/category/bioinformatics/'>bioinformatics</a>, <a href='http://nsaunders.wordpress.com/category/programming/'>programming</a>, <a href='http://nsaunders.wordpress.com/category/statistics/r/'>R</a>, <a href='http://nsaunders.wordpress.com/category/statistics/'>statistics</a> Tagged: <a href='http://nsaunders.wordpress.com/tag/how-to/'>how to</a>, <a href='http://nsaunders.wordpress.com/tag/protein-structure/'>protein structure</a>, <a href='http://nsaunders.wordpress.com/tag/what-if/'>what if</a>]]></content:encoded>
			<wfw:commentRss>http://nsaunders.wordpress.com/2011/09/08/interacting-with-bioinformatics-webservers-using-r/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/e41743fa8aee7f5c7d1cd7ebfa77da85?s=96&#38;d=http%3A%2F%2F0.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96" medium="image">
			<media:title type="html">nsaunders</media:title>
		</media:content>
	</item>
		<item>
		<title>Popular topics at the BioStar Q&amp;A site</title>
		<link>http://nsaunders.wordpress.com/2011/08/23/popular-topics-at-the-biostar-qa-site/</link>
		<comments>http://nsaunders.wordpress.com/2011/08/23/popular-topics-at-the-biostar-qa-site/#comments</comments>
		<pubDate>Tue, 23 Aug 2011 06:54:14 +0000</pubDate>
		<dc:creator>nsaunders</dc:creator>
				<category><![CDATA[bioinformatics]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[statistics]]></category>
		<category><![CDATA[web resources]]></category>
		<category><![CDATA[biostar]]></category>
		<category><![CDATA[stackexchange]]></category>

		<guid isPermaLink="false">http://nsaunders.wordpress.com/?p=2844</guid>
		<description><![CDATA[Which topics are the most popular at the BioStar bioinformatics Q&#38;A site? One source of data is the tags used for questions. Tags are somewhat arbitrary of course, but fortunately BioStar has quite an active community, so &#8220;bad&#8221; tags are usually edited to improve them. Hint: if your question is &#8220;How to find SNPs&#8221;, then [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=nsaunders.wordpress.com&amp;blog=334198&amp;post=2844&amp;subd=nsaunders&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>Which topics are the most popular at the <a href="http://biostar.stackexchange.com/">BioStar</a> bioinformatics Q&amp;A site?</p>
<p>One source of data is the tags used for questions. Tags are somewhat arbitrary of course, but fortunately BioStar has quite an active community, so &#8220;bad&#8221; tags are usually edited to improve them.  Hint: if your question is &#8220;How to find SNPs&#8221;, then tagging it with &#8220;how, to, find, snps&#8221; won&#8217;t win you any admirers.</p>
<p>OK: we&#8217;re going to grab the tags then use a bunch of R packages (<em>XML, wordcloud and ggplot2</em>) to take a quick look.</p>
<p><span id="more-2844"></span><br />
<strong>1. Fetch the tags</strong><br />
Fortunately, I enjoy sufficient privileges at BioStar to obtain a dump of the database. It contains a file named &#8220;Tags.xml&#8221;, with this simple structure:<br />
<pre class="brush: xml;">
&lt;Tags&gt;
  &lt;row&gt;
    &lt;Id&gt;3&lt;/Id&gt;
    &lt;Name&gt;bed&lt;/Name&gt;
    &lt;Count&gt;20&lt;/Count&gt;
    &lt;UserId&gt;2&lt;/UserId&gt;
    &lt;CreationDate&gt;2009-09-30T14:55:00.167&lt;/CreationDate&gt;
  &lt;/row&gt;
  ...
&lt;/Tags&gt;
</pre></p>
<p>A hint for people who write XML parsing documentation.  <strong><em>Most of us just want to get the values from between the tags</em></strong>.  Just tell us how to do that. OK?</p>
<p>Thanks to <a href="http://stackoverflow.com/questions/1960119/importing-data-from-an-xml-file-into-r">this StackOverflow thread</a>, I discovered the incredibly useful <em>xmlToDataFrame()</em> function in the R XML package:<br />
<pre class="brush: r;">
library(XML)
tags &lt;- xmlToDataFrame(&quot;Tags.xml&quot;)
head(tags)
#   Id       Name Count UserId            CreationDate
# 1  3        bed    20      2 2009-09-30T14:55:00.167
# 2  4        gff    12      2 2009-09-30T14:55:00.167
# 3  5     galaxy    11      2 2009-09-30T15:09:43.417
# 4  6      yeast     5      3 2009-09-30T16:09:06.723
# 5  7      motif    19      3  2009-09-30T16:09:06.74
# 6  8 microarray    96      2 2009-09-30T16:44:22.677
</pre></p>
<p>Too easy.  However, <em>class(tags$Count)</em> = &#8220;character&#8221;, which is not quite what we want.  So let&#8217;s change that to numeric, then sort the data frame on Count, decreasing:<br />
<pre class="brush: r;">
tags$Count &lt;- as.numeric(tags$Count)
tags &lt;- tags[sort.list(tags$Count, decreasing = TRUE),]
</pre></p>
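<p>One caveat, in case the column arrives as a factor rather than as character (which can happen when strings are converted to factors on import): <em>as.numeric()</em> on a factor returns the internal level codes, not the values. A small sketch on a toy data frame, using three of the rows shown above; the name <em>toy</em> and the factor column are assumptions made for illustration.</p>
<p><pre class="brush: r;">
# Toy stand-in for `tags`, with Count deliberately stored as a factor
toy &lt;- data.frame(Name  = c(&quot;bed&quot;, &quot;gff&quot;, &quot;microarray&quot;),
                  Count = factor(c(&quot;20&quot;, &quot;12&quot;, &quot;96&quot;)),
                  stringsAsFactors = FALSE)

# Convert via character first; this is also harmless for character columns
toy$Count &lt;- as.numeric(as.character(toy$Count))

# order() is an alternative to sort.list() for sorting the rows
toy &lt;- toy[order(toy$Count, decreasing = TRUE), ]
</pre></p>
<p>After this, the rows run microarray, bed, gff, and the counts are true numbers rather than level codes.</p>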
<table>
<tr>
<td valign="top">
<strong>2. For those who like a &#8220;top N&#8221; plot</strong><br />
Next, we&#8217;ll grab the top 20 tags by Count.  To plot them in decreasing order, we need to reorder the tag Name by Count. With thanks again to <a href="http://stackoverflow.com/questions/3744178/ggplot2-sorting-a-plot">a StackOverflow thread</a>.<br />
<pre class="brush: r;">
library(ggplot2)
tags.20 &lt;- head(tags, 20)
tags.20 &lt;- transform(tags.20, Name = reorder(Name, Count))
ggplot(tags.20) + geom_bar(aes(Name, Count), stat = &quot;identity&quot;, fill = &quot;coral&quot;) + coord_flip() + theme_bw() + ggtitle(&quot;Top 20 BioStar Tags&quot;)
</pre></p>
<p>Click image, right, for full-size version.
</td>
<td valign="top">
<div id="attachment_2849" class="wp-caption alignright" style="width: 310px"><a href="http://nsaunders.files.wordpress.com/2011/08/tags20.png"><img src="http://nsaunders.files.wordpress.com/2011/08/tags20.png?w=300&#038;h=225" alt="tags20" title="tags20" width="300" height="225" class="size-medium wp-image-2849" /></a><p class="wp-caption-text">Top 20 Biostar Tags</p></div>
</td>
</tr>
</table>
<p><strong>3. For those who like word/tag clouds</strong></p>
<table>
<tr>
<td valign="top">
Here, we look at tags which occur 10 or more times and display a maximum of 1000 tags in the cloud.  Again, click image for the full-size version.<br />
<pre class="brush: r;">
library(wordcloud)
library(RColorBrewer)

png(file = &quot;tags.png&quot;, width = 1024, height = 1024)
wordcloud(tags$Name, tags$Count, scale = c(8,.2), min.freq = 10, max.words = 1000, random.order = FALSE, rot.per = .15, colors = brewer.pal(8, &quot;Dark2&quot;))
dev.off()
</pre>
</td>
<td valign="top">
<div id="attachment_2854" class="wp-caption alignright" style="width: 310px"><a href="http://nsaunders.files.wordpress.com/2011/08/tags.png"><img src="http://nsaunders.files.wordpress.com/2011/08/tags.png?w=300&#038;h=300" alt="tags" title="tags" width="300" height="300" class="size-medium wp-image-2854" /></a><p class="wp-caption-text">BioStar tag cloud</p></div>
</td>
</tr>
</table>
<p>Conclusions?  <em>XML</em>, <em>ggplot2</em> and <em>wordcloud</em> are all great packages.  And whilst so-called &#8220;next-generation-sequencing&#8221; might be all the rage, it&#8217;s good to see the old stalwarts of bioinformatics hanging in there: BLAST, alignment, phylogenetics, Python and Perl.  It will be interesting to see how tags change over time.</p>
<br />Filed under: <a href='http://nsaunders.wordpress.com/category/bioinformatics/'>bioinformatics</a>, <a href='http://nsaunders.wordpress.com/category/statistics/r/'>R</a>, <a href='http://nsaunders.wordpress.com/category/statistics/'>statistics</a>, <a href='http://nsaunders.wordpress.com/category/web-resources/'>web resources</a> Tagged: <a href='http://nsaunders.wordpress.com/tag/biostar/'>biostar</a>, <a href='http://nsaunders.wordpress.com/tag/stackexchange/'>stackexchange</a>]]></content:encoded>
			<wfw:commentRss>http://nsaunders.wordpress.com/2011/08/23/popular-topics-at-the-biostar-qa-site/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/e41743fa8aee7f5c7d1cd7ebfa77da85?s=96&#38;d=http%3A%2F%2F0.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96" medium="image">
			<media:title type="html">nsaunders</media:title>
		</media:content>

		<media:content url="http://nsaunders.files.wordpress.com/2011/08/tags20.png?w=300" medium="image">
			<media:title type="html">tags20</media:title>
		</media:content>

		<media:content url="http://nsaunders.files.wordpress.com/2011/08/tags.png?w=300" medium="image">
			<media:title type="html">tags</media:title>
		</media:content>
	</item>
		<item>
		<title>Monitoring PubMed retractions: updates</title>
		<link>http://nsaunders.wordpress.com/2011/08/16/monitoring-pubmed-retractions-updates/</link>
		<comments>http://nsaunders.wordpress.com/2011/08/16/monitoring-pubmed-retractions-updates/#comments</comments>
		<pubDate>Tue, 16 Aug 2011 05:51:03 +0000</pubDate>
		<dc:creator>nsaunders</dc:creator>
				<category><![CDATA[programming]]></category>
		<category><![CDATA[publications]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[statistics]]></category>
		<category><![CDATA[heroku]]></category>
		<category><![CDATA[pubmed]]></category>
		<category><![CDATA[retraction]]></category>
		<category><![CDATA[sinatra]]></category>

		<guid isPermaLink="false">http://nsaunders.wordpress.com/?p=2834</guid>
		<description><![CDATA[There&#8217;s been a recent flurry of interest in retractions. See for example: Scientific Retractions: A Growth Industry?; summarised also by GenomeWeb in Take That Back; articles in the WSJ and the Pharmalot blog; and academic articles in the Journal of Medical Ethics and Infection &#38; Immunity. Several of these sources cite data from my humble [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=nsaunders.wordpress.com&amp;blog=334198&amp;post=2834&amp;subd=nsaunders&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p><div id="attachment_2840" class="wp-caption alignright" style="width: 310px"><a href="http://nsaunders.files.wordpress.com/2011/08/chart.png"><img src="http://nsaunders.files.wordpress.com/2011/08/chart.png?w=300&#038;h=225" alt="chart" title="chart" width="300" height="225" class="size-medium wp-image-2840" /></a><p class="wp-caption-text">PubMed cumulative retractions 1977-present</p></div>There&#8217;s been a recent flurry of interest in retractions. See for example: <a href="http://pipeline.corante.com/archives/2011/08/11/scientific_retractions_a_growth_industry.php">Scientific Retractions: A Growth Industry?</a>; summarised also by GenomeWeb in <a href="http://www.genomeweb.com/blog/take-back">Take That Back</a>; articles in <a href="http://online.wsj.com/article/SB10001424052702303627104576411850666582080.html">the WSJ</a> and <a href="http://www.pharmalot.com/2011/08/retractions-of-scientific-studies-are-surging/">the Pharmalot blog</a>; and academic articles in the <a href="http://jme.bmj.com/content/37/4/249.abstract?sid=905cb3bc-a961-4710-8544-5f0509a6b599"><em>Journal of Medical Ethics</em></a> and <a href="http://iai.asm.org/cgi/content/abstract/IAI.05661-11v1"><em>Infection &amp; Immunity</em></a>.</p>
<p>Several of these sources cite data from my humble web application, <a href="http://pmretract.heroku.com/">PMRetract</a>.  So now seems like a good time to mention that:</p>
<ul>
<li>The application is still going strong and is updated regularly</li>
<li>I&#8217;ve added a few enhancements to the UI; you can follow development <a href="https://github.com/neilfws/PubMed">at GitHub</a></li>
<li>I&#8217;ve also added a long-overdue <a href="http://pmretract.heroku.com/about">about page</a> with some extra information, including the fact that I wrote it :)</li>
</ul>
<p>Now I just need to fix up my Git repositories. Currently there&#8217;s one which pushes to GitHub and a second, with a copy of the Sinatra code for pushing to Heroku, which isn&#8217;t too smart.</p>
<br />Filed under: <a href='http://nsaunders.wordpress.com/category/programming/'>programming</a>, <a href='http://nsaunders.wordpress.com/category/publications/'>publications</a>, <a href='http://nsaunders.wordpress.com/category/ruby/'>ruby</a>, <a href='http://nsaunders.wordpress.com/category/statistics/'>statistics</a> Tagged: <a href='http://nsaunders.wordpress.com/tag/heroku/'>heroku</a>, <a href='http://nsaunders.wordpress.com/tag/pubmed/'>pubmed</a>, <a href='http://nsaunders.wordpress.com/tag/retraction/'>retraction</a>, <a href='http://nsaunders.wordpress.com/tag/sinatra/'>sinatra</a>]]></content:encoded>
			<wfw:commentRss>http://nsaunders.wordpress.com/2011/08/16/monitoring-pubmed-retractions-updates/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/e41743fa8aee7f5c7d1cd7ebfa77da85?s=96&#38;d=http%3A%2F%2F0.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96" medium="image">
			<media:title type="html">nsaunders</media:title>
		</media:content>

		<media:content url="http://nsaunders.files.wordpress.com/2011/08/chart.png?w=300" medium="image">
			<media:title type="html">chart</media:title>
		</media:content>
	</item>
		<item>
		<title>BioRuby development: feedback on using Git</title>
		<link>http://nsaunders.wordpress.com/2011/08/05/bioruby-development-feedback-on-using-git/</link>
		<comments>http://nsaunders.wordpress.com/2011/08/05/bioruby-development-feedback-on-using-git/#comments</comments>
		<pubDate>Fri, 05 Aug 2011 03:15:57 +0000</pubDate>
		<dc:creator>nsaunders</dc:creator>
				<category><![CDATA[bioinformatics]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[bioruby]]></category>
		<category><![CDATA[discussion]]></category>
		<category><![CDATA[feedback]]></category>

		<guid isPermaLink="false">http://nsaunders.wordpress.com/?p=2831</guid>
		<description><![CDATA[Everyone likes constructive feedback. I received a couple of great comments on my previous post, which warrant a brief discussion. @vlandham points out that when the main BioRuby repository updates, you&#8217;ll want to update your local repository. Using git, you do that by adding a remote which points to the original repository, from which you [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=nsaunders.wordpress.com&amp;blog=334198&amp;post=2831&amp;subd=nsaunders&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>Everyone likes constructive feedback. I received a couple of great comments on my previous post, which warrant a brief discussion.</p>
<p><a href="http://twitter.com/vlandham">@vlandham</a> points out that when the main BioRuby repository updates, you&#8217;ll want to update your local repository.  Using git, you do that by adding a <em>remote</em> which points to the original repository, from which you can fetch updates and merge with your local version:</p>
<p><pre class="brush: bash;">
git remote add upstream https://github.com/bioruby/bioruby.git
# fetch/merge only when main repo updates
git fetch upstream
git merge upstream/master
</pre></p>
<p>This is described at the GitHub help page <a href="http://help.github.com/fork-a-repo/">Fork A Repo</a>.</p>
<p>Michael points to an article titled <a href="http://nvie.com/posts/a-successful-git-branching-model/">A successful Git branching model</a>. It suggests that when developing new features you create a <em>feature branch</em> (also called <em>topic branch</em>). This can help with the management of new features and creates a more complete commit history if/when the new feature is merged back into your development repository.  The article also suggests a main branch for development named <em>develop</em>, rather than the default <em>master</em>.</p>
<p>I haven&#8217;t quite got my head around all the ins-and-outs of the article yet, but it&#8217;s well worth a read.</p>
<br />Filed under: <a href='http://nsaunders.wordpress.com/category/bioinformatics/'>bioinformatics</a>, <a href='http://nsaunders.wordpress.com/category/programming/'>programming</a>, <a href='http://nsaunders.wordpress.com/category/ruby/'>ruby</a> Tagged: <a href='http://nsaunders.wordpress.com/tag/bioruby/'>bioruby</a>, <a href='http://nsaunders.wordpress.com/tag/discussion/'>discussion</a>, <a href='http://nsaunders.wordpress.com/tag/feedback/'>feedback</a>]]></content:encoded>
			<wfw:commentRss>http://nsaunders.wordpress.com/2011/08/05/bioruby-development-feedback-on-using-git/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/e41743fa8aee7f5c7d1cd7ebfa77da85?s=96&#38;d=http%3A%2F%2F0.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96" medium="image">
			<media:title type="html">nsaunders</media:title>
		</media:content>
	</item>
	</channel>
</rss>
