When we want to perform an analysis, record and explain our work, and make it reproducible, we can take advantage of a markdown variant that works well in the R environment: RMarkdown.
One of the main ideas behind using markdown in this way is to combine analysis, code, results, and interpretation into one reproducible document containing everything required for a given analysis.
R has a package for this called knitr, which you can use from the command line or via a button in RStudio.
For example, this document is written in RMarkdown. The original file is a text file with a .Rmd extension. The extension is not required, but helps us identify it as an RMarkdown document. If this were an analysis of some data, this section might be the background or introduction of the problem.
We can insert chunks of R code to do things or draw figures. Here is an example of inserting a code chunk and looking at the results.
This is a code chunk:
```{r get_random}
# generate 100 random numbers
x <- sample(1:100, 100)
# look at a few
head(x)
```
If we insert this into our RMarkdown document, we would see the following:
# generate 100 random numbers
x <- sample(1:100, 100)
# look at a few
head(x)
## [1] 92 74 35 40 28 90
We see the code that was run, and the result that it generated.
To put a code chunk into our document and have R execute it, we simply type the code with some markup around it. Specifically, you start and end a code chunk with three backticks. In the example above, the opening set of backticks is followed by curly brackets in which you can specify options for the code chunk. The first item is “r”, which specifies what kind of code will be evaluated. The example above also contains a name for the chunk (get_random). A name is not required, but it makes it easier to see where errors occur and which figures were produced by which chunk. There are many other chunk options; the chunks later in this document use echo, fig.width, fig.height, and comment.
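As a hedged sketch of how chunk options can also be set once for a whole document, knitr provides `opts_chunk$set()`, which is typically called in a setup chunk near the top. The particular values below are illustrative assumptions, not settings taken from this document:

```r
# Set default options for every chunk that follows (values are illustrative).
library(knitr)
opts_chunk$set(
  echo = TRUE,      # show the code alongside its output
  fig.width = 5,    # default figure width in inches
  fig.height = 5,   # default figure height in inches
  comment = "##"    # prefix for printed output lines
)

# Inspect one of the current defaults
opts_chunk$get("fig.width")
```

Options given in an individual chunk header override these defaults for that chunk.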
If a chunk contains code that generates a figure, the figure appears automatically in the document.
```{r build_and_display_scatterplot, fig.width=5, fig.height=5}
x <- rnorm(100)
y <- rnorm(100)
plot(x, y)
```
As we explore our data, we can use “inline” expressions to make statements about it. For instance, if I want to know the correlation of x and y, I can call the correlation function in the middle of a sentence using backticks like this: The correlation of x and y is `r cor(x,y)`.
The correlation of x and y is 0.0439267.
We could probably round off that number to make it more readable.
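A minimal sketch of that rounding, assuming two decimal places are enough for the reader. The `set.seed()` call is an addition so the example is repeatable; it is not part of the original analysis:

```r
set.seed(42)               # assumption: fixed seed for repeatability
x <- rnorm(100)
y <- rnorm(100)

r <- round(cor(x, y), 2)   # round the correlation to two decimal places
r
```

The same call works inline: writing `r round(cor(x, y), 2)` inside a sentence prints the rounded value in the rendered document.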
Are these distributions equal? Let’s examine a boxplot.
```{r boxplot}
boxplot(list(x,y))
```
They look pretty similar.
We could use the summary() function, set echo=FALSE, and see only the results. Maybe put them in a table.
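One way this might look, sketched with `knitr::kable()` to format the table. The fixed seed and the choice of two digits are assumptions for illustration; in an RMarkdown document this code would sit in a chunk with `echo=FALSE` so that only the table is shown:

```r
library(knitr)

set.seed(42)   # assumption: fixed seed so the example is repeatable
x <- rnorm(100)
y <- rnorm(100)

# One row per variable, one column per summary statistic
stats <- t(sapply(list(x = x, y = y), summary))

# Format as a markdown table, rounded to two digits
kable(stats, digits = 2)
```

Building the matrix first keeps the numbers available for further use, while `kable()` only handles presentation.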
Finally, does a random set of numbers look interesting as a heat map? Let’s use the pheatmap package.
```{r heatmap, echo=FALSE}
library(pheatmap)
# create a matrix
foo <- matrix(x, nrow=10, ncol=10)
# name the rows and columns
rownames(foo) <- paste0("g", 1:nrow(foo))
colnames(foo) <- paste0("e", 1:ncol(foo))
# draw a heatmap
pheatmap(foo)
```
If we place a few lines of YAML at the top of our document, the information will be rendered into the top of the document when we process it. YAML originally stood for Yet Another Markup Language (officially it is now “YAML Ain’t Markup Language”) and is meant to be a simple, human-readable way to structure information. With such a header, the author and purpose of our document are also easy to identify. In the .Rmd file, the header is enclosed between two lines that contain only three dashes (---). Here’s what the contents of the YAML header would look like:
title: "An example RMarkdown document"
author: "Chris Seidel"
date: "23 Sep 2016"
output: html_document
You can see that the information there was used to title this document and to specify other things, such as the output format upon processing.
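To see how such a header is read, here is a small sketch that writes a throwaway .Rmd file and parses its header with `rmarkdown::yaml_front_matter()`. The temporary file and the body text are assumptions made for illustration; note the two `---` lines that delimit the header in the file itself:

```r
library(rmarkdown)

# Write a throwaway document whose header matches the one shown above
rmd <- tempfile(fileext = ".Rmd")
writeLines(c(
  "---",
  'title: "An example RMarkdown document"',
  'author: "Chris Seidel"',
  'date: "23 Sep 2016"',
  "output: html_document",
  "---",
  "",
  "Body text goes here."
), rmd)

# Parse the header into a named list
meta <- yaml_front_matter(rmd)
meta$title
meta$output
```

Each key in the header becomes an element of the list, which is how the title and output format reach the rendering step.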
When you’re ready to produce a copy of your document in HTML or PDF format, you can go to the command line and type:
R -e "rmarkdown::render('yourDoc.Rmd')"
Or you can push the little button in RStudio.
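The same rendering step can also be run from inside an R session. The sketch below assumes the rmarkdown package and pandoc are installed; the file contents and the temporary path are made up for illustration:

```r
library(rmarkdown)

# Build the chunk fence from single backticks so this example nests cleanly
fence <- strrep("`", 3)

# Create a tiny throwaway document (contents are illustrative)
rmd <- tempfile(fileext = ".Rmd")
writeLines(c(
  "---",
  'title: "Render test"',
  "output: html_document",
  "---",
  "",
  paste0(fence, "{r}"),
  "summary(rnorm(100))",
  fence
), rmd)

# Render only if pandoc can be found; render() returns the output file path
out <- NULL
if (pandoc_available()) {
  out <- render(rmd, output_format = "html_document", quiet = TRUE)
}
out
```

Passing `output_format = "pdf_document"` requests a PDF instead, which additionally requires a LaTeX installation.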
There are several interesting resources available to learn more about RMarkdown:
Here is a simple RMarkdown template to get started with:
/n/projects/CompGenomics/Data/template.Rmd
It’s a good idea to capture and display the session information that was in effect at the time of the analysis. The R sessionInfo() function does this. You can display the session information with a little code, which gives the results below:
```{r session_info, echo=FALSE, comment=NA}
sessionInfo()
```
This analysis was performed with the following R/Bioconductor session parameters:
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.8 (Final)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] pheatmap_1.0.8 GenomicRanges_1.22.4 GenomeInfoDb_1.6.0
[4] IRanges_2.4.8 S4Vectors_0.8.11 BiocGenerics_0.16.1
loaded via a namespace (and not attached):
[1] Rcpp_0.12.5 knitr_1.11 XVector_0.10.0
[4] magrittr_1.5 zlibbioc_1.16.0 munsell_0.4.3
[7] colorspace_1.2-6 stringr_1.0.0 plyr_1.8.4
[10] tools_3.2.2 grid_3.2.2 gtable_0.2.0
[13] htmltools_0.3 yaml_2.1.13 digest_0.6.9
[16] RColorBrewer_1.1-2 formatR_1.2.1 evaluate_0.8
[19] rmarkdown_0.9.5 stringi_1.1.1 scales_0.4.0