Tidbits on R Objects and Functions
Table of Contents
Create RPM track file from BAM file
If you want to view or work with normalized coverage, you have to normalize the data.
In the code below, you load some libraries, load your bam files and convert them to coverage and normalize them one at a time.
For each bam file we open it, convert it to GRanges, extend the reads so they resemble the original DNA fragment length, then we calculate coverage across the genome, and then transform the coverage based on the total number of read counts. We then simply export the result as a bigwig file that we can load into a browser.
library(Rsamtools)
library(rtracklayer)
library(GenomicRanges)
library(GenomicAlignments)
library(BSgenome.Dmelanogaster.UCSC.dm6)
bamfile <- "twist_ip.bam"
# read in the BAM file and convert to GRanges in one step
aln <- granges(import(bamfile))
# get total number of reads
totalReads <- length(aln)
# set the chromosome sequence lengths for the object
#seqlengths(aln) <- seqlengths(Dmelanogaster)[seqlevels(aln)]
# resize the reads to the fragment length
aln <- resize(aln, width=150)
aln <- trim(aln)
# calculate coverage
cov <- coverage(aln)
# normalize the data to RPM
rpm <- lapply(cov, function(x) signif(10^6 * x/totalReads,3))
rpm <- as(rpm,"SimpleRleList")
export.bw(rpm, "twist_ip_rpm.bw")
Perform many steps in one using a function
Since there are several steps involved, the simplest way to do this is to write a function and then apply it to a set of bam files.
### create a function which takes a BAM file name and returns RPM normalized coverage
getRPMcov <- function(bf){
aln <- granges(import(bf))
totalReads <- length(aln)
#seqlengths(aln) <- seqlengths(Dmelanogaster)[seqlevels(aln)]
aln <- resize(aln, width=150)
aln <- trim(aln)
cov <- coverage(aln)
rpm <- lapply(cov, function(x) signif(10^6 * x/totalReads,3))
rpm <- as(rpm,"SimpleRleList")
return(rpm)
}
# now we can process a data set into coverage objects
bamfiles <- c("twist_ip_1.bam", "twist_ip_2.bam", "snail_ip_1.bam", "snail_ip_2.bam", "control_tc.bam")
# with one line we do many things
twi_ip1 <- getRPMcov(bamfiles[1])
twi_ip2 <- getRPMcov(bamfiles[2])
sna_ip1 <- getRPMcov(bamfiles[3])
sna_ip2 <- getRPMcov(bamfiles[4])
input <- getRPMcov(bamfiles[5])
A better way is to probably convert the BAM files to RPM bigWig files. That way we can see them in the genome browser, and use them for calculations with our GRanges helper functions.
RLE objects are act like regular vectors
If we want to average two data sets, we cans simply add them and divide by two:
# average the two data streams
twist_avg <- (twi_ip1 + twi_ip2)/2
snail_avg <- (sna_ip1 + sna_ip2)/2
Ratio of IP and input
# create a moderated ratio
twist_rat <- log2((twist_avg + 1)/(input + 1))
Notice that we add 1 to each data set prior to averaging to avoid division by zero errors. This produces "moderated" ratios - i.e. the ratios are slightly squelched, and our signal will have to be higher than 1 for use to see a ratio. You can choose the moderation factor to suit your needs.
Smooth the data
There are many ways to smooth data. An easy method is to simply take the running median value
# smooth the data
twist_rat.s <- as(lapply(twist_rat, function(x) runmed(x,50)), "SimpleRleList")
# examine the difference
export.bw(twist_rat, "twist_avg_rat.bw")
export.bw(twist_rat.s, "twist_avg_rat_s.bw")
—