The goals of this MD (markdown) are to:
Messages and warnings may be suppressed for cleaner code.
# Packages called are listed below:
# Standard packages
source("https://bioconductor.org/biocLite.R") #Link Bioconductor packages to your system
library(rtracklayer)
library(ggplot2)
library(reshape2)
# Local source code and functions are listed below:
source("/n/projects/CompGenomics/Tutorial/jjj_shared/granges1.1.r")
# Working directory and other formatting is listed below:
setwd("/home/cws/oct4/")
Load oct4 summits
summits.f <- '/n/projects/CompGenomics/Data/ChIP-Seq/MACS/oct4_10M_2_summits.bed'
oct4.summits <- import.bed(summits.f)
# let's get rid of the non-cannonical chromosomes
# we can use logical grep to find names with "_", and invert it.
#oct4.summits <- oct4.summits[!grepl("_", seqnames(oct4.summits))]
# There's a better way!
length(oct4.summits)
## [1] 25204
#oct4.summits <- keepStandardChromosomes(oct4.summits, pruning.mode="coarse")
length(oct4.summits)
## [1] 25204
library('GenomicFeatures')
library('rtracklayer')
library(GenomicFeatures)
txdbfile <- '/n/projects/CompGenomics/Data/M.musculus.Ens_91.mm10.db'
txdb <- loadDb(txdbfile)
# get tss
tss.gr <- promoters(txdb,0,1,columns=c("tx_id", "tx_name","gene_id"))
# keep only the stuff on cannonical chromosomes
#tss.gr <- keepStandardChromosomes(tss.gr, pruning.mode="coarse")
# find nearest TSS
ntss.idx <- nearest(oct4.summits,tss.gr)
# keep summits that have a distance (not NA)
tss_anchored_summits.gr <- oct4.summits[!is.na(ntss.idx)]
# get a copy of the near TSS (those that map to a summit)
# in the order in which they map (see help on na.omit)
ntss.gr <- tss.gr[na.omit(ntss.idx)]
# transfer the strand of the nearest gene to the summit
strand(tss_anchored_summits.gr) <- strand(ntss.gr)
# make sure order of seqlevels is the same in both of these things
all(seqlevels(ntss.gr) == seqlevels(tss_anchored_summits.gr))
## [1] FALSE
# seqlevels have to be in same order for the operation below
# so put them in order
seqlevels(ntss.gr) <- sort(seqlevels(ntss.gr))
seqlevels(tss_anchored_summits.gr) <- sort(seqlevels(tss_anchored_summits.gr))
# find the upstream or downstream distance
tss_anchored_summits.gr$tss_nearest_distance <-
distance(tss_anchored_summits.gr,ntss.gr) *
ifelse(tss_anchored_summits.gr < ntss.gr,-1,1) *
ifelse('-' == strand(tss_anchored_summits.gr),-1,1)
# write data to the meta data setction of our summits
mcols(tss_anchored_summits.gr) <- cbind(mcols(tss_anchored_summits.gr),mcols(ntss.gr))
What are the distances between oct4 binding sites and nearest TSSs?
summary(tss_anchored_summits.gr$tss_nearest_distance)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -925038 -8496 -7 1042 10127 887272
ggplot(data=data.frame(dist=tss_anchored_summits.gr$tss_nearest_distance),
aes(dist)) + geom_histogram() + xlim(-10000,10000)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 12160 rows containing non-finite values (stat_bin).
What is the distribution of peak scores?
## Examine histogram of oct4 peak scores
hist(oct4.summits$score, breaks=50)
abline(v=-10 * log10(0.01), col="red")
Let’s filter our data to reduce the number of peaks
## Filtering only summits within 10kb
is.good <- tss_anchored_summits.gr$tss_nearest_distance < abs(10^3)
oct4.topSet <- tss_anchored_summits.gr[is.good]
# how many were returned?
sum(is.good)
## [1] 14401
There are still quite a few. Let’s partition the data into 4 quartiles based on score.
How would we know what score represents the 90th percentile? We can use the quantiles() function in R to find this:
# first let's examine a summary
summary(oct4.topSet$score)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.009 7.370 12.182 25.576 26.371 540.763
# then let's grab the value at the 90th percentile
quantile(oct4.topSet$score, 0.9)
## 90%
## 59.16802
See the help for the quantile function. We hand quantile() our data set, and a list of probabilities:
quantile(oct4.topSet$score, c(0.25,0.5,0.75))
## 25% 50% 75%
## 7.37034 12.18200 26.37071
Now we can use the cut() function to create a factor we can use to divide our data into groups by quantile.
scoresByQ <- cut(oct4.topSet$score,
breaks = quantile (oct4.topSet$score, c (0, .25, .5, .75, 1)),
include.lowest = TRUE)
This creates a “factor” that places all of our scores into groups.
What is a factor? It’s simply a description of group memberships for our data.
If our data is categorized, or labelled in groups, R is good at utilizing this. Here’s a simple example of getting the average height of jockeys or ball players from a set of height measurements in inches.
# heights in inches of Jockeys or Basketball players
heights <- c(66,64,82,62,78,80,67)
playerType <- c("jock","jock","bask","jock","bask","bask","jock")
We can create a factor for the playerType variable, and use it to automatically group the data using an R function. In this case we can get the average height of each group using tapply()
# create the factor
playerType.f <- factor(playerType)
# apply the mean function
tapply(heights, playerType.f, mean)
## bask jock
## 80.00 64.75
Back to our data set, once we have a factor representing a grouping label for each of our data points, we can split the data into those groups.
oct4ByQ <- split(oct4.topSet, scoresByQ)
The result of this is a list of GRanges objects (a GrangesList), and this is complicated. So rather than split our data this way, let’s simply split the index positions of our original object - that way we can use those to grab out pieces of our object at will.
# split the index of our object into groups
oct4ByQ.iv <- split(1:length(oct4.topSet), scoresByQ)
Take a look at the result, and see if this makes any more sense. The result is still a list, but the values in each section represent the elements of the oct4.topSet GRanges in each category. Let’s examine it.
# what are the names of the list elements?
names(oct4ByQ.iv)
## [1] "[2.01,7.37]" "(7.37,12.2]" "(12.2,26.4]" "(26.4,541]"
We see the score values that make up the boundaries of each category.
How many elements are in each category?
Examine a few elements of each.
# because it's a list, we can use lapply on it.
# How long is each group?
lapply(oct4ByQ.iv, length)
## $`[2.01,7.37]`
## [1] 3613
##
## $`(7.37,12.2]`
## [1] 3594
##
## $`(12.2,26.4]`
## [1] 3594
##
## $`(26.4,541]`
## [1] 3600
# Examine the first few elements of each group
lapply(oct4ByQ.iv, head)
## $`[2.01,7.37]`
## [1] 5 9 15 18 22 23
##
## $`(7.37,12.2]`
## [1] 2 6 7 8 11 12
##
## $`(12.2,26.4]`
## [1] 19 27 28 29 30 32
##
## $`(26.4,541]`
## [1] 1 3 4 10 13 14
Let’s extract 500 randomly chosen elements from each quantile. We reference the 1st list element, which contain 3613 items and choose 500 at random.
Summarize the scores for the first quantile:
# take a random sample of 500 ids
q1.ids <- sample(oct4ByQ.iv[[1]], 500)
summary(oct4.topSet$score[q1.ids])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.311 4.775 5.694 5.635 6.616 7.370
Pick random sets for each quantile.
q1.idx <- sample(oct4ByQ.iv[[1]], 500)
q2.idx <- sample(oct4ByQ.iv[[2]], 500)
q3.idx <- sample(oct4ByQ.iv[[3]], 500)
q4.idx <- sample(oct4ByQ.iv[[4]], 500)
boxplot(list(q1=oct4.topSet$score[q1.idx],
q2=oct4.topSet$score[q2.idx],
q3=oct4.topSet$score[q3.idx],
q4=oct4.topSet$score[q4.idx]), main="Peak Scores by Quartile")
Do peak widths differ by quantile?
uh, they’re summits - they’re all the same width. We have to load in the actual peaks to test this. Skip this chunk.
boxplot(list(q1=width(oct4.topSet[q1.idx]),
q2=width(oct4.topSet[q2.idx]),
q3=width(oct4.topSet[q3.idx]),
q4=width(oct4.topSet[q4.idx])), main="Peak Width by Quartile")
Get the genes associated with each quantile
# get the gene ids
goi.q1 <- unique(as.character(oct4.topSet$gene_id[q1.idx]))
goi.q2 <- unique(as.character(oct4.topSet$gene_id[q2.idx]))
goi.q3 <- unique(as.character(oct4.topSet$gene_id[q3.idx]))
goi.q4 <- unique(as.character(oct4.topSet$gene_id[q4.idx]))
# grab the coordinates for each gene TSS
q1.tss.gr <- tss.gr[match(goi.q1, as.character(tss.gr$gene_id))]
q2.tss.gr <- tss.gr[match(goi.q2, as.character(tss.gr$gene_id))]
q3.tss.gr <- tss.gr[match(goi.q3, as.character(tss.gr$gene_id))]
q4.tss.gr <- tss.gr[match(goi.q4, as.character(tss.gr$gene_id))]
# resize them
q1.tss.gr <- resize(q1.tss.gr, width=1000, fix="center")
q2.tss.gr <- resize(q2.tss.gr, width=1000, fix="center")
q3.tss.gr <- resize(q3.tss.gr, width=1000, fix="center")
q4.tss.gr <- resize(q4.tss.gr, width=1000, fix="center")
# make sure they didn't extend past boundaries
q1.tss.gr <- trim(q1.tss.gr)
q2.tss.gr <- trim(q2.tss.gr)
q3.tss.gr <- trim(q3.tss.gr)
q4.tss.gr <- trim(q4.tss.gr)
Now we can extrcat out the RNA Polymerase II values and make metaplots for each quartile.
# set the location of our RNA polII bigwWig file
bwFile <- "/n/projects/CompGenomics/Data/GSE115436/R1_WT_ES_RPB1.bw"
polII.q1 <- grangesMatrix(q1.tss.gr, bwFile)
polII.q2 <- grangesMatrix(q2.tss.gr, bwFile)
polII.q3 <- grangesMatrix(q3.tss.gr, bwFile)
polII.q4 <- grangesMatrix(q4.tss.gr, bwFile)
Let’s make a plot and see what it looks like!
plot(colMeans(polII.q4), type="l", main="RNA Polymerase II by OCT4 quartile", ylab="mean RPM", xaxt="n")
lines(colMeans(polII.q3), col="blue")
lines(colMeans(polII.q2), col="green")
lines(colMeans(polII.q1), col="red")
# add and axis
axis(1, at=c(0,500,1000), labels=c("-500","0","+500"))
# add a vertical line at TSS
abline(v=500, col="red", lty="dashed", lwd=2)
legend("topright", c("q1", "q2","q3","q4"), col=c("red","green","blue","black"), lwd=2)
For reproducibility, this analysis was performed with the following R/Bioconductor session:
R version 3.5.0 (2018-04-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS: /n/apps/CentOS7/install/r-3.5.0/lib64/R/lib/libRblas.so
LAPACK: /n/apps/CentOS7/install/r-3.5.0/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] GenomicFeatures_1.32.0 AnnotationDbi_1.42.1 Biobase_2.40.0
[4] reshape2_1.4.3 ggplot2_3.0.0 rtracklayer_1.40.3
[7] GenomicRanges_1.32.6 GenomeInfoDb_1.16.0 IRanges_2.14.11
[10] S4Vectors_0.18.3 BiocGenerics_0.26.0 BiocInstaller_1.30.0
[13] RColorBrewer_1.1-2
loaded via a namespace (and not attached):
[1] Rcpp_0.12.18 lattice_0.20-35
[3] prettyunits_1.0.2 Rsamtools_1.32.2
[5] Biostrings_2.48.0 assertthat_0.2.0
[7] digest_0.6.15 R6_2.2.2
[9] plyr_1.8.4 RSQLite_2.1.1
[11] evaluate_0.11 httr_1.3.1
[13] pillar_1.3.0 progress_1.2.0
[15] zlibbioc_1.26.0 rlang_0.2.1
[17] lazyeval_0.2.1 blob_1.1.1
[19] Matrix_1.2-14 rmarkdown_1.10.9
[21] labeling_0.3 BiocParallel_1.14.2
[23] stringr_1.3.1 biomaRt_2.36.1
[25] RCurl_1.95-4.11 bit_1.1-14
[27] munsell_0.5.0 DelayedArray_0.6.6
[29] compiler_3.5.0 xfun_0.3
[31] pkgconfig_2.0.1 htmltools_0.3.6
[33] tidyselect_0.2.4 SummarizedExperiment_1.10.1
[35] tibble_1.4.2 GenomeInfoDbData_1.1.0
[37] matrixStats_0.54.0 XML_3.98-1.12
[39] crayon_1.3.4 dplyr_0.7.6
[41] withr_2.1.2 GenomicAlignments_1.16.0
[43] bitops_1.0-6 grid_3.5.0
[45] gtable_0.2.0 DBI_1.0.0
[47] magrittr_1.5 scales_0.5.0
[49] stringi_1.2.4 XVector_0.20.0
[51] bindrcpp_0.2.2 htmldeps_0.1.1
[53] tools_3.5.0 bit64_0.9-7
[55] glue_1.3.0 purrr_0.2.5
[57] hms_0.4.2 yaml_2.2.0
[59] colorspace_1.3-2 memoise_1.1.0
[61] knitr_1.20.8 bindr_0.1.1