Introduction

The goals of this MD (markdown) are to:

Draw a metagene plot of RNA Polymerase at TSS near oct4 summits
Figure out how to split a data set into quartiles

Messages and warnings may be suppressed for cleaner code.

# Packages called are listed below:
# Standard packages
source("https://bioconductor.org/biocLite.R") #Link Bioconductor packages to your system
library(rtracklayer) 
library(ggplot2)
library(reshape2)

# Local source code and functions are listed below:
source("/n/projects/CompGenomics/Tutorial/jjj_shared/granges1.1.r")

# Working directory and other formatting is listed below: 
setwd("/home/cws/oct4/")

Think about the questions you want to ask

Load oct4 summits

summits.f <- '/n/projects/CompGenomics/Data/ChIP-Seq/MACS/oct4_10M_2_summits.bed'
oct4.summits <- import.bed(summits.f)

# let's get rid of the non-cannonical chromosomes
# we can use logical grep to find names with "_", and invert it.
#oct4.summits <- oct4.summits[!grepl("_", seqnames(oct4.summits))]
# There's a better way!
length(oct4.summits)

## [1] 25204

#oct4.summits <- keepStandardChromosomes(oct4.summits, pruning.mode="coarse")
length(oct4.summits)

## [1] 25204

Map OCT4 to gene TSS

read in gene annotations

library('GenomicFeatures')
library('rtracklayer')

library(GenomicFeatures)
txdbfile <- '/n/projects/CompGenomics/Data/M.musculus.Ens_91.mm10.db'
txdb <- loadDb(txdbfile)

# get tss
tss.gr <- promoters(txdb,0,1,columns=c("tx_id", "tx_name","gene_id"))

# keep only the stuff on cannonical chromosomes
#tss.gr <- keepStandardChromosomes(tss.gr, pruning.mode="coarse")

Map oct4 summit to nearest TSS

# find nearest TSS
ntss.idx <- nearest(oct4.summits,tss.gr)

# keep summits that have a distance (not NA)
tss_anchored_summits.gr <- oct4.summits[!is.na(ntss.idx)]

# get a copy of the near TSS (those that map to a summit)
# in the order in which they map (see help on na.omit)
ntss.gr <- tss.gr[na.omit(ntss.idx)]

# transfer the strand of the nearest gene to the summit
strand(tss_anchored_summits.gr) <- strand(ntss.gr)

# make sure order of seqlevels is the same in both of these things
all(seqlevels(ntss.gr) == seqlevels(tss_anchored_summits.gr))

## [1] FALSE

# seqlevels have to be in same order for the operation below
# so put them in order
seqlevels(ntss.gr) <- sort(seqlevels(ntss.gr))
seqlevels(tss_anchored_summits.gr) <- sort(seqlevels(tss_anchored_summits.gr))

# find the upstream or downstream distance
tss_anchored_summits.gr$tss_nearest_distance <-
  distance(tss_anchored_summits.gr,ntss.gr)  *
  ifelse(tss_anchored_summits.gr < ntss.gr,-1,1) *
  ifelse('-' == strand(tss_anchored_summits.gr),-1,1)

# write data to the meta data setction of our summits
mcols(tss_anchored_summits.gr) <- cbind(mcols(tss_anchored_summits.gr),mcols(ntss.gr))

What are the distances between oct4 binding sites and nearest TSSs?

summary(tss_anchored_summits.gr$tss_nearest_distance)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -925038   -8496      -7    1042   10127  887272

ggplot(data=data.frame(dist=tss_anchored_summits.gr$tss_nearest_distance),
  aes(dist)) + geom_histogram() + xlim(-10000,10000)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 12160 rows containing non-finite values (stat_bin).

What is the distribution of peak scores?

## Examine histogram of oct4 peak scores
hist(oct4.summits$score, breaks=50)
abline(v=-10 * log10(0.01), col="red")

Let’s filter our data to reduce the number of peaks

## Filtering only summits within 10kb
is.good <- tss_anchored_summits.gr$tss_nearest_distance < abs(10^3)
oct4.topSet <- tss_anchored_summits.gr[is.good]

# how many were returned?
sum(is.good)

## [1] 14401

There are still quite a few. Let’s partition the data into 4 quartiles based on score.

Spliting datasets into quantiles

How would we know what score represents the 90th percentile? We can use the quantiles() function in R to find this:

# first let's examine a summary
summary(oct4.topSet$score)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.009   7.370  12.182  25.576  26.371 540.763

# then let's grab the value at the 90th percentile
quantile(oct4.topSet$score, 0.9)

##      90% 
## 59.16802

See the help for the quantile function. We hand quantile() our data set, and a list of probabilities:

quantile(oct4.topSet$score, c(0.25,0.5,0.75))

##      25%      50%      75% 
##  7.37034 12.18200 26.37071

Now we can use the cut() function to create a factor we can use to divide our data into groups by quantile.

scoresByQ <- cut(oct4.topSet$score,
                     breaks = quantile (oct4.topSet$score, c (0, .25, .5, .75, 1)),
                     include.lowest = TRUE)

This creates a “factor” that places all of our scores into groups.

Using Factors in R, an Example

What is a factor? It’s simply a description of group memberships for our data.

If our data is categorized, or labelled in groups, R is good at utilizing this. Here’s a simple example of getting the average height of jockeys or ball players from a set of height measurements in inches.

# heights in inches of Jockeys or Basketball players
heights <- c(66,64,82,62,78,80,67)
playerType <- c("jock","jock","bask","jock","bask","bask","jock")

We can create a factor for the playerType variable, and use it to automatically group the data using an R function. In this case we can get the average height of each group using tapply()

# create the factor
playerType.f <- factor(playerType)
# apply the mean function
tapply(heights, playerType.f, mean)

##  bask  jock 
## 80.00 64.75

Splitting summits into quartiles by score

Back to our data set, once we have a factor representing a grouping label for each of our data points, we can split the data into those groups.

oct4ByQ <- split(oct4.topSet, scoresByQ)

The result of this is a list of GRanges objects (a GrangesList), and this is complicated. So rather than split our data this way, let’s simply split the index positions of our original object - that way we can use those to grab out pieces of our object at will.

# split the index of our object into groups
oct4ByQ.iv <- split(1:length(oct4.topSet), scoresByQ)

Take a look at the result, and see if this makes any more sense. The result is still a list, but the values in each section represent the elements of the oct4.topSet GRanges in each category. Let’s examine it.

# what are the names of the list elements?
names(oct4ByQ.iv)

## [1] "[2.01,7.37]" "(7.37,12.2]" "(12.2,26.4]" "(26.4,541]"

We see the score values that make up the boundaries of each category.

How many elements are in each category?
Examine a few elements of each.

# because it's a list, we can use lapply on it.
# How long is each group?
lapply(oct4ByQ.iv, length)

## $`[2.01,7.37]`
## [1] 3613
## 
## $`(7.37,12.2]`
## [1] 3594
## 
## $`(12.2,26.4]`
## [1] 3594
## 
## $`(26.4,541]`
## [1] 3600

# Examine the first few elements of each group
lapply(oct4ByQ.iv, head)

## $`[2.01,7.37]`
## [1]  5  9 15 18 22 23
## 
## $`(7.37,12.2]`
## [1]  2  6  7  8 11 12
## 
## $`(12.2,26.4]`
## [1] 19 27 28 29 30 32
## 
## $`(26.4,541]`
## [1]  1  3  4 10 13 14

Let’s extract 500 randomly chosen elements from each quantile. We reference the 1st list element, which contain 3613 items and choose 500 at random.

Summarize the scores for the first quantile:

# take a random sample of 500 ids
q1.ids <- sample(oct4ByQ.iv[[1]], 500)

summary(oct4.topSet$score[q1.ids])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.311   4.775   5.694   5.635   6.616   7.370

Pick random sets for each quantile.

q1.idx <- sample(oct4ByQ.iv[[1]], 500)
q2.idx <- sample(oct4ByQ.iv[[2]], 500)
q3.idx <- sample(oct4ByQ.iv[[3]], 500)
q4.idx <- sample(oct4ByQ.iv[[4]], 500)

boxplot(list(q1=oct4.topSet$score[q1.idx],
             q2=oct4.topSet$score[q2.idx],
             q3=oct4.topSet$score[q3.idx],
             q4=oct4.topSet$score[q4.idx]), main="Peak Scores by Quartile")

Do peak widths differ by quantile?

uh, they’re summits - they’re all the same width. We have to load in the actual peaks to test this. Skip this chunk.

boxplot(list(q1=width(oct4.topSet[q1.idx]),
             q2=width(oct4.topSet[q2.idx]),
             q3=width(oct4.topSet[q3.idx]),
             q4=width(oct4.topSet[q4.idx])), main="Peak Width by Quartile")

Get the genes associated with each quantile

# get the gene ids
goi.q1 <- unique(as.character(oct4.topSet$gene_id[q1.idx]))
goi.q2 <- unique(as.character(oct4.topSet$gene_id[q2.idx]))
goi.q3 <- unique(as.character(oct4.topSet$gene_id[q3.idx]))
goi.q4 <- unique(as.character(oct4.topSet$gene_id[q4.idx]))

# grab the coordinates for each gene TSS
q1.tss.gr <- tss.gr[match(goi.q1, as.character(tss.gr$gene_id))]
q2.tss.gr <- tss.gr[match(goi.q2, as.character(tss.gr$gene_id))]
q3.tss.gr <- tss.gr[match(goi.q3, as.character(tss.gr$gene_id))]
q4.tss.gr <- tss.gr[match(goi.q4, as.character(tss.gr$gene_id))]

# resize them 
q1.tss.gr <- resize(q1.tss.gr, width=1000, fix="center")
q2.tss.gr <- resize(q2.tss.gr, width=1000, fix="center")
q3.tss.gr <- resize(q3.tss.gr, width=1000, fix="center")
q4.tss.gr <- resize(q4.tss.gr, width=1000, fix="center")

# make sure they didn't extend past boundaries
q1.tss.gr <- trim(q1.tss.gr)
q2.tss.gr <- trim(q2.tss.gr)
q3.tss.gr <- trim(q3.tss.gr)
q4.tss.gr <- trim(q4.tss.gr)

Now we can extrcat out the RNA Polymerase II values and make metaplots for each quartile.

# set the location of our RNA polII bigwWig file
bwFile <- "/n/projects/CompGenomics/Data/GSE115436/R1_WT_ES_RPB1.bw"

polII.q1 <- grangesMatrix(q1.tss.gr, bwFile)
polII.q2 <- grangesMatrix(q2.tss.gr, bwFile)
polII.q3 <- grangesMatrix(q3.tss.gr, bwFile)
polII.q4 <- grangesMatrix(q4.tss.gr, bwFile)

Let’s make a plot and see what it looks like!

plot(colMeans(polII.q4), type="l", main="RNA Polymerase II by OCT4 quartile", ylab="mean RPM", xaxt="n")
lines(colMeans(polII.q3), col="blue")
lines(colMeans(polII.q2), col="green")
lines(colMeans(polII.q1), col="red")
# add and axis
axis(1, at=c(0,500,1000), labels=c("-500","0","+500"))
# add a vertical line at TSS
abline(v=500, col="red", lty="dashed", lwd=2)

legend("topright", c("q1", "q2","q3","q4"), col=c("red","green","blue","black"), lwd=2)

Session information

For reproducibility, this analysis was performed with the following R/Bioconductor session:

R version 3.5.0 (2018-04-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS: /n/apps/CentOS7/install/r-3.5.0/lib64/R/lib/libRblas.so
LAPACK: /n/apps/CentOS7/install/r-3.5.0/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] GenomicFeatures_1.32.0 AnnotationDbi_1.42.1   Biobase_2.40.0        
 [4] reshape2_1.4.3         ggplot2_3.0.0          rtracklayer_1.40.3    
 [7] GenomicRanges_1.32.6   GenomeInfoDb_1.16.0    IRanges_2.14.11       
[10] S4Vectors_0.18.3       BiocGenerics_0.26.0    BiocInstaller_1.30.0  
[13] RColorBrewer_1.1-2    

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.18                lattice_0.20-35            
 [3] prettyunits_1.0.2           Rsamtools_1.32.2           
 [5] Biostrings_2.48.0           assertthat_0.2.0           
 [7] digest_0.6.15               R6_2.2.2                   
 [9] plyr_1.8.4                  RSQLite_2.1.1              
[11] evaluate_0.11               httr_1.3.1                 
[13] pillar_1.3.0                progress_1.2.0             
[15] zlibbioc_1.26.0             rlang_0.2.1                
[17] lazyeval_0.2.1              blob_1.1.1                 
[19] Matrix_1.2-14               rmarkdown_1.10.9           
[21] labeling_0.3                BiocParallel_1.14.2        
[23] stringr_1.3.1               biomaRt_2.36.1             
[25] RCurl_1.95-4.11             bit_1.1-14                 
[27] munsell_0.5.0               DelayedArray_0.6.6         
[29] compiler_3.5.0              xfun_0.3                   
[31] pkgconfig_2.0.1             htmltools_0.3.6            
[33] tidyselect_0.2.4            SummarizedExperiment_1.10.1
[35] tibble_1.4.2                GenomeInfoDbData_1.1.0     
[37] matrixStats_0.54.0          XML_3.98-1.12              
[39] crayon_1.3.4                dplyr_0.7.6                
[41] withr_2.1.2                 GenomicAlignments_1.16.0   
[43] bitops_1.0-6                grid_3.5.0                 
[45] gtable_0.2.0                DBI_1.0.0                  
[47] magrittr_1.5                scales_0.5.0               
[49] stringi_1.2.4               XVector_0.20.0             
[51] bindrcpp_0.2.2              htmldeps_0.1.1             
[53] tools_3.5.0                 bit64_0.9-7                
[55] glue_1.3.0                  purrr_0.2.5                
[57] hms_0.4.2                   yaml_2.2.0                 
[59] colorspace_1.3-2            memoise_1.1.0              
[61] knitr_1.20.8                bindr_0.1.1

Splitting a Data Set for Metagene Analysis

Chris Seidel

October 19, 2018