Literate programming for collaborative and reproducible research

 

Jeff Johnston¹ and Julia Zeitlinger¹

¹Stowers Institute for Medical Research, Kansas City, MO

 

Here we advocate the use of literate programming, a technology that integrates the writing of prose and the execution of programming code within the same document. We find that literate programming is a very effective tool in scientific research as it improves documentation, collaborative interactions, teaching and publications. The widespread use of literate programming has the potential to significantly improve the transparency and reproducibility of bioinformatics research.

Bioinformatics analyses have become an increasingly important component of many research projects, especially those involving genome-wide data sets. As most science is collaborative, these results need to be accessible to other researchers, including many who do not have bioinformatics expertise. In addition to being clearly communicated, analyses also need to be properly described and reproducible. However, the ad hoc manner in which bioinformatics analyses are often performed in scientific research projects makes it challenging to meet these requirements.

Data analysis is often an iterative process in which similar analyses are performed with different data sets, parameters and other variations over the course of a project. Documenting this often unpredictable process and keeping track of important information can therefore be challenging, especially when multiple collaborators are involved. For example, individual scripts in a programming language such as Python or R may perform specific analyses that generate outputs such as figures and tables. Communicating these results might involve sending them as attachments in an email along with a short description, or collecting them in an annotated presentation or word processing document. As a project progresses, these individual scripts and their outputs accumulate as more analyses are performed and existing analyses are modified. When it comes time to prepare a manuscript for publication, these scripts must be examined in order to accurately describe the analyses performed, and care must be taken to ensure the correct versions of figures and tables are included in the manuscript.

This approach splits the bioinformatics analysis into three components that must be independently tracked: the software code that performs the analysis, the results of the analysis, and the description of the analysis. Since this requires frequent switching between programming and writing in separate applications, documenting the analysis process often becomes disruptive, tedious, and time-consuming.

Over the years, it became clear to us that this separation creates a large source of errors and inconsistencies and that a more effective approach was needed. To that end, we have adopted a technique known as literate programming, which combines these three components into one document in a flexible manner and thus allows a more organized way to document analyses without imposing too many unnecessary constraints. In our experience, literate programming dramatically improves the clarity, reproducibility and efficiency of bioinformatics analyses.

Literate programming was introduced by computer scientist Donald Knuth in 1984 as a way to efficiently build software with clearly defined logic that is well documented and easily understandable by others (Knuth 1984). Knuth called his technique literate programming because he envisioned, perhaps presumptuously, that once adopted, nobody would want to go back to “illiterate” programming.

Knuth’s fundamental idea was to combine two previously separate technologies: a language for formatting documents (nowadays, for example, LaTeX or HTML) and a computer programming language (for example, Python or R). This idea has been implemented in a number of modern literate programming tools (Table 1). While the programming language provides instructions for the computer, the language for formatting documents encapsulates the programming code and allows one to build complex documents that can include headings, tables, mathematical equations, figures, citations and other elements. By integrating these two languages into one document, the programmer not only writes code for the computer but can also articulate the purpose and structure of the program, and can visualize and summarize the output of the program as a report. Thus, literate programming allows an analyst to describe the entire analysis process in a single document, similar to an experimentalist writing in a lab notebook.

Using literate programming, the analyst describes the research question, outlines the approach to address the question, narrates the important steps in the analysis while writing the programming code, and discusses the results at the end. Such literate programming documents can be constructed dynamically, by writing portions of the narrative and software code, viewing the results in the report, and continuing the analysis with additional narration and code, or summarizing the results. A completed literate programming document contains all the critical components of a reproducible analysis: the description of the analysis, the programming code used to perform it, the results, and a discussion. The document and its report, typically a PDF or HTML file, can be easily shared with other researchers.
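The workflow described above can be illustrated with a minimal sketch in Python. It is not how any of the tools in Table 1 are implemented; it only assumes a plain-text document in which code chunks are delimited by two hypothetical markers (`<<code>>` and `<<end>>`, invented for this example), runs each chunk in order, and builds a report in which the narrative, the code and its printed output appear together. The sample counts are made up.

```python
import contextlib
import io

# Hypothetical chunk markers chosen for this sketch; real tools such as
# RMarkdown or Jupyter use their own formats.
CHUNK_START = "<<code>>"
CHUNK_END = "<<end>>"

# A tiny literate document: narrative lines interleaved with code chunks.
DOCUMENT = """\
Question: what is the mean read count across three samples?
<<code>>
counts = [120, 95, 143]
mean_count = sum(counts) / len(counts)
print(f"mean read count: {mean_count:.1f}")
<<end>>
Discussion: the next chunk reuses the variables defined above.
<<code>>
print(f"samples above the mean: {sum(c > mean_count for c in counts)}")
<<end>>
"""


def weave(document: str) -> str:
    """Run each code chunk and return a report with prose, code and output."""
    namespace = {}   # shared by all chunks, so later chunks see earlier variables
    report = []
    chunk = None     # None while reading prose, a list while inside a chunk
    for line in document.splitlines():
        if line == CHUNK_START:
            chunk = []
        elif line == CHUNK_END:
            # Execute the finished chunk and capture whatever it prints.
            buffer = io.StringIO()
            with contextlib.redirect_stdout(buffer):
                exec("\n".join(chunk), namespace)
            report.extend("    " + code_line for code_line in chunk)
            report.extend("    #> " + out for out in buffer.getvalue().splitlines())
            chunk = None
        elif chunk is not None:
            chunk.append(line)
        else:
            report.append(line)
    return "\n".join(report)


if __name__ == "__main__":
    print(weave(DOCUMENT))
```

Running the script prints the narrative lines unchanged, with each chunk's code indented beneath them and its output on lines starting with `#>`. Because the output is regenerated from the code every time the report is built, the description, the code and the results cannot drift apart.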

While literate programming did not immediately gain widespread use for developing software programs, it has more recently become a popular method of performing data analysis tasks in a wide variety of fields. These data analysis tasks are often question-driven and involve importing and manipulating large datasets, such as those generated from Internet advertising campaigns or health insurance metrics. Literate programming is a good fit for these kinds of analyses, as the document-based approach allows for the production of a single report that both describes the methodology of the analysis and includes the results.

The popularity of literate programming techniques in fields that involve large datasets suggests that it could become more widespread in life science research as well. Indeed, we find that modern implementations of Knuth’s idea of literate programming are of particular benefit to present-day research projects for many reasons.

Documentation as a routine. The ease with which documentation can be written during programming makes it much easier to do routinely. The analyst still has to choose the details and extent of the documentation, but making these decisions as part of the analysis flow simplifies the process and saves time.

Increased communication in collaborative projects. Another advantage of literate programming is that it aids communication between analysts and their collaborators. It is not uncommon for bioinformaticians to analyze data that they did not generate themselves. Doing this effectively requires an understanding of the biological question being studied and the design of the experiment itself. When the bioinformatician describes such details in a literate programming document, it becomes easier for the experimentalist to verify that the analysis is based on an appropriate understanding of the biological question and that any assumptions made are valid.

A resource for teaching. We have found literate programming documents to be very valuable teaching tools. The narrative component allows us to explain each step of the analysis as it is performed in the context of a research question, and the structured nature of literate programming enforces good habits in students new to bioinformatics. By introducing analysis using literate programming, students build a strong foundation that encourages them to approach analysis in the same way they would approach any other scientific process.

Reproducibility. Last but not least, performing bioinformatics analyses using literate programming considerably improves the reproducibility of published research. Since the analysis code and results are linked, there is no confusion about which version of a particular analysis generated an output. This also simplifies manuscript preparation, as the methods section can be written based on the literate programming documents that generated each result. The literate programming documents themselves can also be provided as downloadable supplemental materials to aid others, either in reanalyzing the published data or utilizing the same analysis techniques in their own research efforts. These benefits will only become more valuable as the bioinformatics field matures, especially as more journals adopt stricter requirements regarding the release of analysis code as a condition of publication.
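As a rough illustration of how code and results stay linked, the sketch below reads a notebook-style file in which every code cell is stored together with the output it produced. The field names follow the JSON layout used by Jupyter notebooks, but the notebook here is a simplified, hand-written stand-in (not validated against the full format), the peak names are made up, and the helper `code_with_outputs` is hypothetical.

```python
import json

# Simplified, hand-written stand-in for a notebook file. The field names follow
# the Jupyter notebook JSON layout, but this is not a complete, validated notebook.
NOTEBOOK_JSON = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["Count the peaks in the (made-up) list below."]},
        {"cell_type": "code", "execution_count": 1, "metadata": {},
         "source": ["peaks = ['chr2L:100', 'chr3R:250']\n", "print(len(peaks))"],
         "outputs": [{"output_type": "stream", "name": "stdout", "text": ["2\n"]}]},
    ],
})


def code_with_outputs(notebook_text):
    """Return (source, printed output) pairs for every code cell."""
    notebook = json.loads(notebook_text)
    pairs = []
    for cell in notebook["cells"]:
        if cell["cell_type"] != "code":
            continue  # skip narrative cells
        source = "".join(cell["source"])
        # Collect only text written to the output streams, ignoring other output types.
        printed = "".join(
            "".join(output["text"])
            for output in cell["outputs"]
            if output["output_type"] == "stream"
        )
        pairs.append((source, printed))
    return pairs


if __name__ == "__main__":
    for source, printed in code_with_outputs(NOTEBOOK_JSON):
        print(source)
        print("->", printed.strip())
```

Since each result travels in the same file as the code that generated it, a reader of the shared document can see exactly which code produced which output.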

In summary, we believe the regular use of literate programming in biological research makes bioinformatics analyses more transparent, easier to share, better documented and more reproducible.

 

Table 1

Literate Programming Tool    Programming Language        Website
RMarkdown                    R (supports others)         rmarkdown.rstudio.com
Jupyter                      Python (supports others)    jupyter.org
EMACS Org mode               Most languages              orgmode.org

 

References
Knuth, D. E. (1984). Literate programming. The Computer Journal, 27(2), 97-111.