This is why code written by scientists gets ugly

There’s a lot of discussion around why code written by self-taught “scientist programmers” rarely follows what a trained computer scientist would consider “best practice”. Here’s a recent post on the topic.

One answer: we begin with exploratory data analysis and never get around to cleaning it up.

Update: those of you commenting “should have used Python instead” have completely missed the point. Your comments are off-topic and will not be published. Doubly so when you get snarky about it.

An example. For some reason, a researcher (let’s call him “Bob”) becomes interested in a particular dataset in the GEO database. So Bob opens the R console and uses the GEOquery package to grab the data:

library(GEOquery)
gse <- getGEO("GSE22255")

Bob is interested in the covariates and metadata associated with the experiment, which he can access using pData().

pd <- pData(gse$GSE22255_series_matrix.txt.gz)
names(pd)
#  [1] "title"                   "geo_accession"          
#  [3] "status"                  "submission_date"        
#  [5] "last_update_date"        "type"                   
#  [7] "channel_count"           "source_name_ch1"        
#  [9] "organism_ch1"            "characteristics_ch1"    
# [11] "characteristics_ch1.1"   "characteristics_ch1.2"  
# [13] "characteristics_ch1.3"   "characteristics_ch1.4"  
# [15] "characteristics_ch1.5"   "characteristics_ch1.6"  
# [17] "treatment_protocol_ch1"  "molecule_ch1"           
# [19] "extract_protocol_ch1"    "label_ch1"              
# [21] "label_protocol_ch1"      "taxid_ch1"              
# [23] "hyb_protocol"            "scan_protocol"          
# [25] "description"             "data_processing"        
# [27] "platform_id"             "contact_name"           
# [29] "contact_email"           "contact_phone"          
# [31] "contact_fax"             "contact_laboratory"     
# [33] "contact_department"      "contact_institute"      
# [35] "contact_address"         "contact_city"           
# [37] "contact_state"           "contact_zip/postal_code"
# [39] "contact_country"         "supplementary_file"     
# [41] "data_row_count"  

Bob discovers that pd$characteristics_ch1.2 is “age at examination”, but it’s stored as a factor. He’d like to use it as a numeric variable. So he sets about figuring out how to do the conversion.
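The detour through character is needed because calling as.numeric() directly on a factor returns the internal integer level codes, not the labels. A small illustration, using a made-up toy factor (the values are assumptions for the example, not taken from GSE22255):

```r
# Toy factor with made-up ages (not taken from the dataset)
ages <- factor(c("68", "73", "65"))

# Direct conversion returns the integer level codes, not the ages
as.numeric(ages)
# [1] 2 3 1

# Going through character first recovers the actual values
as.numeric(as.character(ages))
# [1] 68 73 65
```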

# first you need factor to character
as.character(pd$characteristics_ch1.2[1:5])
# [1] "age-at-examination: 68" "age-at-examination: 68" "age-at-examination: 73"
# [4] "age-at-examination: 65" "age-at-examination: 73"

# now you want to get rid of everything but the age value
# so you wrap it in a gsub()
gsub("age-at-examination: ", "", as.character(pd$characteristics_ch1.2[1:5]))
# [1] "68" "68" "73" "65" "73"

# and the last step is character to numeric
as.numeric(gsub("age-at-examination: ", "", as.character(pd$characteristics_ch1.2[1:5])))
# [1] 68 68 73 65 73

Three levels of nested function calls. Ugly. However, it works, so Bob moves on to the next task (using those ages to do some science).

He thinks: “at some point, I should rewrite that as a function to convert ages.”

But he never does. We stop when it works.
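For what it's worth, the function Bob never wrote could be quite short. The following is only a sketch: the name convert_age and its prefix argument are hypothetical, and it assumes every value carries the same "age-at-examination: " prefix seen in the output above.

```r
# Hypothetical helper: strip a fixed label prefix and convert to numeric.
# The default prefix matches the strings shown in the console output above.
convert_age <- function(x, prefix = "age-at-examination: ") {
  # factor -> character, drop the prefix (literal match), then -> numeric
  values <- sub(prefix, "", as.character(x), fixed = TRUE)
  as.numeric(values)
}

# The first three values from the output above, rebuilt as a factor
example <- factor(c("age-at-examination: 68",
                    "age-at-examination: 68",
                    "age-at-examination: 73"))
convert_age(example)
# [1] 68 68 73
```

With that in place, the nested one-liner would become something like convert_age(pd$characteristics_ch1.2). Values that do not match the assumed format would come back as NA with a warning, which is arguably what you want from messy metadata.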

22 thoughts on “This is why code written by scientists gets ugly”

  1. I am still a novice in R, but I tend to do this as well. Mostly because the data I get is pretty messy and I “fix” the problems in the order that I discover them (obscure column names, empty rows at line 24667235 and 34545949, etc. and my favorite so far: repeating header lines because someone found out that “cat” is a VERY useful program). So I totally agree with your example because I am constantly “hacking” my data in order to get something that makes sense. In my defence though… I comment (almost) every line or small block of code to keep my sanity.

  2. Spot on. This is a matter of incentives. A computer scientist’s job is to write good code, so he does. A scientist’s is to find stuff out; if he spends time tidying code, he isn’t spending time finding stuff out. It’s the same in the (my) commercial world. The code has to be right, but nobody is paying me to make it pretty.

  3. I think writing scripts is fundamentally different from writing software you expect other people to read or use. I guess someone may read the scripts but I doubt it for most analysis – still, as John pointed out, it is good to make some effort in scripting just in case someone (you?) needs to reread the script later. I once had a script for some analyses that I expected to be a couple hundred lines evolve into a 1000+ line script, at which point I started adding more functions and so on. It is still far from perfect and would probably fail according to the guidelines in the book ‘Clean Code’, but every bit helps.

  4. Agreed. I don’t see a problem as long as “Bob” adds sufficient comments that six months later when he needs to repeat or extend the analysis he can recall why he did that. I would also suggest that using version control for most tasks is well worth the bit of effort it takes. It has saved my bacon many times. I think the key test is that “Bob” can easily reproduce his results and has appropriate test cases for known data to show that his code is “right.”

  5. And therein lies the ongoing conflict within R: is the language a stat command language (in which case Bob is doing the right thing), or is the language a (general purpose?) programming language (in which case the syntax is an abomination, and Bob should be banished to London Excel spreadsheets). Two semantics (and mindsets), but one syntax. Kind of like that scene from “Chinatown”.

  6. I found on a recent, large data cleaning exercise that it is only worth writing nice functions if (a) my code is becoming needlessly repetitive, and (b) if I conceivably would need to come back and change certain structural decisions. Otherwise, writing excessively elegant functions actually hampers going back and finding where certain decisions about the data were made.

  7. There was a time before I used version control in writing scripts … I would never go back (the horrors). Agreed on testing frameworks as well, for any software that is more than a few lines (and maybe even for that?).

  8. I’m also a beginner in R and programming in general. In my case I try not to stack a lot of code in one line, not because it is good practice, but because it doesn’t fit on my wide screen!
    I always wanted to use version control, but I think it is quite complicated and I have never tried it.

  9. It’s not that the code gets ugly, it’s that it never becomes tidy. Because indeed, there is no time for that, and nobody complains much about bracket soup.

  10. As a food chemist who uses R as an ad-hoc toolbox, that summarises me pretty well.

  11. I think that you have to look at the context in which the work, the programming, is done. Are we writing code for someone else to use or just to get an answer today? Does it make a difference if the code runs for 0.1 seconds or 10 seconds if it’s only for a single use? In these cases it’s all about managing time efficiently and getting a reliable result, not for it to be pretty.

  12. Just follow the Unix philosophy: Do one thing and do it well
    Write small functions that do simple operations and then put them together to solve the larger problem. Makes coding easier for both testing and writing the code.
    Don’t worry about efficiency until it becomes a problem or you could be making something efficient that doesn’t really ever get called.
    If you start coding before you truly understand the problem you are just going to create a mess. Doesn’t matter if you are a data scientist or a software engineer, this applies to both.

  13. With all due respect, “speak for yourself.” Any code I write (or ever wrote in any language), if I see the faintest chance I’ll use it in the future, gets cleaned up, commented, error-checking added in, etc. I suspect those folks who leave their code to go crufty are the same ones with 37 stacks of articles in their (tenured) office that they are planning to get around to reading someday.

  14. Use Python (and libraries) instead of R. Do the things that need an OO approach using the OO features, and the data-crunching things with more functional programming, and your code will need clean-ups less and less often.

    • Yes, switching languages, particularly away from the one with the statistical libraries, is a surefire solution to this issue ;)
