MetaGene Analysis

Objective
Reading bam files in R
Getting the read coverage at each nucleotide
Let's find the gene with the closest promoter
Defining our Region Of Interest (ROI)
Getting the coverages around our TSS of interest
Retreiving the coverages around our TSS of interests
Dealing with gene strandness
Could we have done all this a simpler way?
RPM Normalization
- Now we can compute the average coverage in RPM
  - Average gene plot with confidence interval
An alternate approach
Homework (optional)

Objective

In this tutorial we learn how to load in data from BAM files, generate read coverage across the genome, and then extract read coverage for various features.

Because of Jeff's excellent tutoral on GRanges, some of this is review, and some of it can be done using his "helper functions".

The data for this section is RNA polymerase II as measured in the toll10b background.

Reading bam files in R

##################################################
### Reading a bam file with Rsamtools package
##################################################
library(GenomicAlignments)

## Specifying the bam file to read
BF <- BamFile("/home/cws/dorsal/BAM/toll10b_4h8_2to4h_1.bam")

## Read the alignment file
aln <- readGAlignments(BF)

# examine the result
aln

# in order to proceed below, with functions that take GRanges objects,
# we would need to recast this as a GRanges object
aln <- as(aln,'GRanges')

We could also do this in one step as follows, this time using rtracklayer:

library(rtracklayer)
# read in the BAM file and convert to GRanges in one step
aln <- granges(import(BF))

# set the seqlengths
#library(BSgenome.Dmelanogaster.UCSC.dm6)
#seqlengths(aln) <- seqlengths(Dmelanogaster)[seqlevels(aln)]

In the above code we use a trick to set the seqlengths of aln using only the seqlengths in Mmusculus with matching names from aln. [Note: we no longer have to do this, as the BAM file has the sequence names and lengths in it's header and recent updates to the import function use this information to set the seqlengths. So that section of code is commented out]

Getting the read coverage at each nucleotide

library(GenomicRanges)
## Let's resize the reads to the approximate ChIP Seq
## fragment length.
aln <- resize(aln, width=150)

## We might get an error that some have extended beyond
## the chromosome ends. We can trim them.
aln <- trim(aln)

# Now get the read coverage at each nucleotide
cov <- coverage(aln)

# examine the results
cov

Let's find the gene with the closest promoter

Loading the data

##################################################
### Loading a location of the summits for twist
##################################################
summits.f <- '/home/cws/dorsal/MACS/twist/twist_summits.bed'
twist.summits <- import.bed(summits.f)

# let's get rid of the non-canonical chromosomes
# they're the ones with an underscore in the name
# we can use logical grep, and invert it.
twist.summits <- twist.summits[!grepl("_", seqnames(twist.summits))]

Getting the location of all transcripts for all the genes

Do you remember where your transcriptDB is?

## Load the transcript database
library(GenomicFeatures)
txdbfile <- '/home/cws/dorsal/analysis/D.melanogaster.Ens_98.dm6.db'

txdb <- loadDb(txdbfile)

## Get all the transcripts for each gene
allTx <- transcriptsBy(txdb, by='gene')

Finding the genes with a TSS closest to the summits

## Recast our GRangesList of genes as a GRanges with all transcripts
allTx.gr <- unlist(allTx)

## Let's rename the rows with the transcript name not the gene name...
# names(allTx.gr) <- allTx.gr$tx_name
# already done in latest version of R, not necessary.

## Get all TSSs
TSS <- resize(allTx.gr,width=1,fix='start')

## Examine histogram of twist peak scores
hist(twist.summits$score)

## make it look nicer? breaks=50, rug()

## the scores represent transformed p-values
## so draw a line at a decent cutoff
abline(v=-10 * log10(0.01), col="red")

## Filtering only summits with good pvalue
is.good <- twist.summits$score > -10 * log10(0.01)

# How many?
sum(is.good)

## Find the nearest TSS for each summit
TSS.iv <- nearest(twist.summits[is.good],TSS)

# let'sget rid of those summits with no genes

nearest.TSS <- TSS[TSS.iv]

Defining our Region Of Interest (ROI)

So, what we want is the coverages of a 500 nt window around the TSS of our genes of interest. Let's resize our GRanges containing our TSS

## We can resize each TSS, and then trim those that extend
## past chromosome boundaries, and discard them
ROI <- resize(nearest.TSS,width=1000,fix='center')

# count how many there are
sum(width(ROI) == 1000)

# trim those that extend past boundaries
ROI <- trim(ROI)

# count them again
sum(width(ROI) == 1000)

# remove those less than 1kb
ROI <- ROI[width(ROI) == 1000]

## Finally, let's make sure we are not double counting
ROI <- unique(ROI)

Getting the coverages around our TSS of interest

But first, we need to recast our ROI into a different object

##################################################
### Extracting coverages using Views
##################################################

##################################################
### Views accross genomes work with RleList and RangesList
### Recasting our GRanges and RangesList
##################################################
ROI.RL <- as(ROI,'IntegerRangesList')

# view the result
ROI.RL

Now, extracting the coverages

##################################################
### ATTENTION!!!! The chromosome in ROI.RL are not
### in the same order as in cov!! R don't care!
##################################################
all(names(ROI.RL) == names(cov))

## Let's reorder our chromosome
## but first, let's get the set of chromosome in our data
chr.names <- unique(c(names(ROI.RL),names(cov)))

## Now, the reordeing
ROI.RL <- ROI.RL[chr.names]
cov <- cov[chr.names]

##################################################
### Extracting the coverages around our TSS of interest
##################################################
### Using a loop structure called mapply (for multiple apply)
tss.cov <- mapply(Views,cov,ROI.RL)
### Recasting the results as an RleViewsList
tss.cov <- RleViewsList(tss.cov)

# examine the result
tss.cov

Ok… but is this really what I wanted?? This is way too fast!!

Let's look at the first chromosome in our RleViewsList

names(tss.cov)

names(tss.cov[[1]])
##################################################
### The Views on chr2L
##################################################
tss.cov[[1]]

Now let's look at the first ROI on the first chromosome

##################################################
### The coverages around the first TSS of chr2L
##################################################

tss.cov[[1]][[1]]

Great!

Retreiving the coverages around our TSS of interests

Getting the coverage data back

The first thing we need is to get the covrage back in a matrix-like type data format

##################################################
### from Views to vector of coverages
##################################################

## We need to remove the chromosome with no coverages
is.empty <- sapply(tss.cov,length) == 0

tss.cov <- tss.cov[!is.empty]

  ## So we know how to get a single Rle from a chromosome
  tss.cov[[1]][[2]]

  ## To get a vector, let's recast that object!
  as.vector(tss.cov[[1]][[2]])

  ## Cool!!!!

  ## Now, for the full genome, let's use viewApply
  tss.cov.LM <- viewApply(tss.cov,as.vector)

  ## What are the classes of the return items
  sapply(tss.cov.LM,class)
  ## How many rows
  sapply(tss.cov.LM,nrow)
  ## How many columns
  sapply(tss.cov.LM,ncol)


  ## Ok!!! one finale stiching step
  tss.cov.mat <- do.call(cbind,as.list(tss.cov.LM))

  ## Let's just flip the matrix so that
  tss.cov.mat <- t(tss.cov.mat)

  dim(tss.cov.mat)

Dealing with gene strandness

##################################################
### Ajusting the orientation of genes on the negative strand
##################################################

## But first...
## From what transcript is the first TSS in tss.cov.mat from?
t1 <- tss.cov.mat[1,]

### Our second TSS on chr2L is now the second row of our matrix
t2 <- as.vector(tss.cov[[1]][[1]])

### Are they the same??
identical(t1,t2)

## Ok, we have establish this at least...
## What name as tss.cov[[1]][[1]]??
first.tss <- names(tss.cov[[1]])[1]

## Ok! now we know what transcript this TSS comes from
## What is the strand of this TSS?
ROI[first.tss]

## On the negative strand... We need to reverse the coverages...
### Take a look at what appen with the funtion rev()
rev(tss.cov.mat[1,])

##################################################
## Reversing of coverages for ROI on the negative strand
##################################################

## first, let's get the name of the TSS in our tss.cov
tss.names <- unlist(lapply(tss.cov,names))


## Cool! so now, we can find all the TSS on the negative strand
## And reverse their order!
## Which rows of the matrix are from ROI on the reverse strand?
is.negative <- as.vector(strand(ROI[tss.names]) == '-')

## Now, using a loop over the rows that are on the negative strand
## run the rev function
tss.cov.mat[is.negative,] <- t(apply(tss.cov.mat[is.negative,],1,rev))

dim(tss.cov.mat)

Could we have done all this a simpler way?

Yes! Jeff gave us a helper function that does all the above: grangesMatrix()

A much much simpler way - Jeff's function does all the above for you. Using your regions of interest (ROI) and your coverage object, you can return a strand corrected matrix of coverages around features.

# get access to those wonderful functions
source("http://research.stowers.org/cws/CompGenomics/Resources/granges1.1.r")

polII_m <- grangesMatrix(ROI, cov)

This single lines of code does everything that large chunk of code does above, and is much simpler to write and understand.

In addition, those helper function can operate directly on bigwig files, so we don't have to create coverage objects and keep them in memory for processing.

# Let's export our coverage as a bigWig file
# so we can view it in a Genome Browser OR use it
# for computation with the helper functions.
export(cov, "tracks/polII_tol10b_150.bw")

With our coverage saved as a bigWig, we can call it directly into our helper function.

# Reference our data in a bigwig file
polII_m <- grangesMatrix(ROI, "/home/cws/dorsal/analysis/tracks/polII_tol10b_150.bw")

Now that we have a matrix of coverages, we can make a metagene plot. By averaging the columns of our matrix, we will see what the average profile looks like.

# generate a plot of the column means
plot(colMeans(polII_m))

# we can make it look prettier
mean.cov <- colMeans(polII_m)
plot(mean.cov, type='l', xaxt='n', main="average read coverage\ntwist bound TSS",
     ylab="avg read coverage", xlab="Position around TSS (nt)")

# add and axis
axis(1, at=c(0,500,1000), labels=c("-500","0","+500"))

# add a vertical line at TSS
abline(v=500, col="red", lty="dashed", lwd=2)

RPM Normalization

How might we "normalize" the data to reads per million? (RPM)

# get the total number of reads
totalReads <- length(aln)

# What is the class of our coverage object?
class(cov)

## We can normalize to reads per million by multiplying each 
## nucleotide position by 10^6, then dividing by the total number of reads.
## Since our coverage object is a list, let's process it like a list.
rpm <- lapply(cov, function(x) signif(10^6 * x)/totalReads)

# convert to SimpleRleList
rpm <- as(rpm, "SimpleRleList")

# we might want to save this too
export(rpm, "tracks/polII_tol10b_rpm.bw")

Now we can compute the average coverage in RPM

avg.cov <- grangesMatrix(ROI, rpm)

Average gene plot with confidence interval

We can make a metagene plot with a 95% confidence interval using gglot2.

library(ggplot2)

# get average
p2.avg <- colMeans(avg.cov)
# get sd
p2.sd <- apply(avg.cov,2,sd)

# get the confidence interval
ci <- qnorm(0.975)*p2.sd/sqrt(nrow(avg.cov))

# create a data frame
p2.df <- data.frame(x=seq_along(p2.avg), avg=p2.avg,l=p2.avg-ci,u=p2.avg+ci)

ggplot(p2.df, aes(x)) + geom_line(aes(y=avg), colour="blue") + 
  geom_ribbon(aes(ymin=l, ymax=u), alpha=0.2, colour="lightblue")

An alternate approach

# read in the summits
twist.summits <- import.bed(summits.f)
# remove the non-canonical chromosomes
twist.summits <- keepStandardChromosomes(twist.summits, pruning.mode="coarse")

# Declare a function to map peaks to genes and list distance
peak2gene <- function(peaks,genes){
  # shrink features and genes to a point
  # return nearest and signed distance
  tss <- resize(genes,width=1,fix='start')
  gr.p <- resize(peaks,width=1,fix='center')

  gr.p$nearest.gene <- rep(NA,length(gr.p))
  gr.p$dist.tss <- rep(NA,length(gr.p))

  n.iv <- nearest(gr.p,tss)

  # create index vectors
  notNA <- !is.na(n.iv)
  nearest.gene <- tss[n.iv[notNA]]

  gr.p$nearest.gene[notNA] <- names(nearest.gene)
  # re-instate original gr coordinates
  start(gr.p) <- start(peaks)
  end(gr.p) <- end(peaks)
  gr.p$dist.tss[notNA] <- (start(gr.p[notNA]) - start(nearest.gene)) * ifelse(strand(nearest.gene) == "-", -1,1)
  return(gr.p)
}

Get all gene descriptions from our txdb object

genes.gr <- genes(txdb)

Map genes to our peaks with distances

twist.summits <- peak2gene(twist.summits, genes.gr)

Get the ones close to a TSS

goi <- twist.summits$nearest.gene[abs(twist.summits$dist.tss) < 1000]

length(goi)
goi <- unique(goi)
length(goi)

Grab those genes. The genes.gr are named, so we can grab them by name:

goi.gr <- genes.gr[goi]

Now we resize them into uniform regions, and extract signal:

goi.gr <- resize(goi.gr, 1, fix="start")
goi.gr <- resize(goi.gr, 1000, fix="center")

goi.cov <- grangesMatrix(goi.gr, cov)

Now we can plot something

plot(colMeans(goi.cov))

Maybe we can view an image:

# take a random sample of rows
r.iv <- sample(1:nrow(goi.cov), 500)
image(t(goi.cov[r.iv,]))

Hmmm, not anything there?

Let's try scaling the rows:

foo <- goi.cov[r.iv,]
foo.s <- t(apply(foo, 1, scale))
image(t(foo.s), axes=FALSE)

There's something there, but we have to process it further to effectively view it.

We're probably have a mixture of profiles covering a vast range of values. Let's examine how our values are distributed by looking at the quartiles:

# examine quantiles of the data
quantile(foo, seq(0,1, by=0.25))

We see values spanning several orders of magnitude. We could use log scale for the plot.

# add 1 to all positions of foo to avoid log2(0) errors
image(t(log2(foo+1)))

There is data there. Let's sort the rows by signal in the center region.

# order the rows
iv <- order(rowSums(foo[,400:800]), decreasing=FALSE)
# examine the result
image(t(log2(foo[iv,]+1)))

We can view our data on a common linear scale, if we simply max out those values at the top. We can try setting all values greater than sum upper value to a ceiling. Let's try 2x the third quartile.

# set max to 2x the 3rd quantile
thirdQ <- quantile(foo, seq(0,1, by=0.25))[4]
# set the highest tier values of foo to a common ceiling
foo[foo > 2*thirdQ] <- 2*thirdQ

Make the final plot

# create a color pallete
cols <- colorRampPalette(c("white","black"))(256)

image(t(foo[iv,]), col=cols, axes=FALSE)

## To add a box around the graph
box()

## Now, making the lower axis
axis(1,at=c(0,0.5,1),labels=c('-500','TSS','+500'))

## Printing a TSS Marker
abline(v=0.5,col='steelblue',lty=2,lwd=2)

Homework (optional)

Reproduce figures for one of the other ips