Convert BAM to Normalized bigWig
Objective
Convert your BAM files to a depth-normalized bigWig track for viewing in a genome browser. There are at least two ways to do this: one uses a command-line tool set called bedtools, the other uses R.
Method 1: RPM track file from BAM file with bedtools
We can use the bedtools suite to create a normalized coverage track. Its genomecov subcommand reports coverage in bedGraph format.
# this command creates a coverage track in bedgraph format
# change "filename" to the name of your file
bedtools genomecov -ibam filename.bam -bg > filename.bedgraph
# Usually we would like to scale our coverage to reads per million (RPM)
# instead of the command above, use the -scale option, which multiplies
# each coverage value by the factor you supply. For RPM the factor is
# 1,000,000 divided by the total number of mapped reads (equivalently,
# 1 divided by the number of mapped reads in millions)
# replace the 1 in "-scale 1" with that factor
bedtools genomecov -ibam filename.bam -bg -scale 1 > filename.bedgraph
Bedgraph format is not that useful by itself; normally we convert it to bigWig format for display in a genome browser. This can be accomplished with the bedGraphToBigWig program, which also needs a file of chromosome sizes.
# we need a file of chromosome sizes for the bedGraphToBigWig program
dm6=/home/cws/CompGenomics/dm6/dm6.chrom.sizes
bedGraphToBigWig filename.bedgraph $dm6 filename.bw
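Two practical notes: bedGraphToBigWig expects its input to be sorted by chromosome and start position, and if you do not already have a chromosome sizes file you can build one from an indexed genome FASTA. A minimal sketch, assuming your genome FASTA is called dm6.fa (adjust the names and paths to your own setup):
# sort the bedgraph first if bedGraphToBigWig complains that it is not sorted
sort -k1,1 -k2,2n filename.bedgraph > filename.sorted.bedgraph
# build a chrom.sizes file from an indexed FASTA (dm6.fa is an assumed path)
samtools faidx dm6.fa
cut -f1,2 dm6.fa.fai > dm6.chrom.sizes
bedGraphToBigWig filename.sorted.bedgraph dm6.chrom.sizes filename.bw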
Getting the number of mapped reads
In order to scale the coverage in terms of reads per million, you have to know how many mapped reads are in your BAM file. If you don't know the number, one way to find it is to have samtools report the index stats (from the indexed BAM file) and then use the command-line utility awk to sum them up. For example, with an indexed BAM file, you can get the total number of mapped reads as follows:
samtools idxstats filename.bam | awk '{sum+=$3}END{print sum}'
With awk, the first block of braces is executed for each line of input, but the second block, marked END, is only executed after all the input has been read. So the one-liner above takes the 3rd field of the index report (the number of reads mapped to each chromosome) and adds it to a running sum; once all the lines are done, the sum is printed. To get the bedtools scaling factor, divide 1,000,000 by this number (equivalently, 1 divided by the number of mapped reads in millions).
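Putting the pieces together, the scale factor can be computed and passed straight to bedtools. This is just a sketch using the placeholder filename.bam; it assumes samtools, awk, and bedtools are all on your PATH:
# total mapped reads from the BAM index
reads=$(samtools idxstats filename.bam | awk '{sum+=$3}END{print sum}')
# RPM scale factor = 1,000,000 / mapped reads
scale=$(awk -v n="$reads" 'BEGIN{print 1000000/n}')
# scaled coverage in bedgraph format
bedtools genomecov -ibam filename.bam -bg -scale $scale > filename.bedgraph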
Method 2: RPM track file from BAM file with R
In the code below, you load some libraries, then load your BAM files and convert them to bigWig one at a time.
Since there are several steps involved, the simplest way to do this is to write a function and then apply it to a set of BAM files.
For each BAM file, we open it, convert it to GRanges, extend the reads so they resemble the original DNA fragment length, calculate coverage across the genome, and scale the coverage by the total number of reads. We then export the result as a bigWig file that we can load into a browser.
library(Rsamtools)
library(rtracklayer)
library(GenomicRanges)
library(GenomicAlignments)
library(BSgenome.Dmelanogaster.UCSC.dm6)
# create a function to read in data
# extend reads, get coverage, normalize to RPM
# and export to bw
bam2bw <- function(bamfile){
  # assumed fragment length used to extend reads
  extLen <- 150
  cat("opening:", bamfile, sep="\n")
  bd <- readGAlignments(bamfile)
  cat("convert to GRanges\n")
  mygr <- as(bd, "GRanges")
  cat("extending reads\n")
  mygr <- resize(mygr, extLen)
  mygr <- trim(mygr)
  totalReads <- length(mygr)
  cat("getting coverage\n")
  # get coverage
  cov <- coverage(mygr)
  # scale coverage to reads per million (RPM)
  rpm <- lapply(cov, function(x) signif(10^6 * x/totalReads, 3))
  rpm <- as(rpm, "SimpleRleList")
  # export rpm to bigWig
  outfile <- sub("\\.bam$", "", bamfile)
  outfile <- paste(outfile, ".bw", sep="")
  cat("exporting to bigwig:", outfile, "\n")
  export.bw(rpm, outfile)
}
# read the bam file names into a vector
bamfiles <- dir(pattern="\\.bam$")
# for each element of our vector, call the bam2bw function
sapply(bamfiles, bam2bw)
If your working directory in R is your BAM directory, you can process all your BAM files by simply running the code in R.
There is also a second way to run the code: from the Linux command line, without ever opening R interactively. If you copy the code above to a file, say bam2bw.r, and place the file in the directory with your BAM files, you can execute it by moving to that directory and running the following command:
cat bam2bw.r | R --vanilla
This is a way to send the code to R directly from the command line.
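If Rscript is installed alongside R (it usually is), an equivalent alternative is to run the script with Rscript; treat this as an optional variant, not a required step:
Rscript bam2bw.r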
View the Results in IGV
We can view the results by dropping the files into an IGV window. You might have to download IGV from the Broad Institute. You can also use the https://webfs/ mechanism to load the files by URL.
View the results at UCSC
To view the results on the UCSC genome browser we have to copy the files to a web server, and then create a track line for UCSC that specifies where the data is and how it should be displayed.
Covid caveat: this section only applies if we have access to the Stowers intranet. Since we do not have access, we'll have to use a more general mechanism, as explained in the Covid Track Server Countermeasure section below.
We have a web server for serving track data called the "tracks server", and we use the Unix secure copy command (scp) to copy our files there. To use this server, make a ServiceNow request for an account on the tracks server. Your login will be the same as your Linux login, so you can ssh to tracks and create a directory to hold your track files. Directories are created under /r. For instance, to create a directory called myDirectory on tracks, we would do the following:
ssh username@tracks
cd /r
mkdir myDirectory
exit
You can replace "myDirectory" with whatever you like; a logical choice is a directory named after your username. Once you have a track file, a bigWig file for example, you can copy it to the tracks server as follows:
scp myFile.bw compgenomics@tracks:/r/myDirectory
Create a bigWig header line
Once your file is in place, you can display it in the UCSC genome browser by creating a "track line" that references the file and describes how it should be displayed. You paste the track line into the custom tracks box on the browser page.
track type=bigWig name=snail_ip description="Snail IP" graphType=bar visibility=full color=0,0,150 \
bigDataUrl=http://tracks.stowers.org/r/myDirectory/myFile.bw
You create one of these lines for each track file you want to display. Make sure your track files have the right permissions, e.g. chmod 644 *.bw.
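To check that UCSC will actually be able to fetch the file, you can request it over HTTP yourself first. This is just a quick sanity check, assuming the URL matches the bigDataUrl in your track line:
# fetch only the HTTP headers; a 200 response means the file is reachable
curl -I http://tracks.stowers.org/r/myDirectory/myFile.bw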
Covid Track Server Countermeasure
Instead of using our specialized track server to serve tracks to UCSC as explained above, we can serve them using a more traditional mechanism. Our server franklin.stowers.org is also a general web server, which means you have a website on this machine and can serve content to the world with a URL. In your home directory is a subdirectory called public_html; anything you place in this directory is accessible via the web using a URL of the form http://franklin.stowers.org/~username, where username is your login id.
So if I had a bigWig file in my project directory (twist_ip_rpm.bw), and I wanted to make it available to UCSC or IGV, I could do something like the following:
### assume I'm working in /home/cws/dl_project and I have a file called twist_ip_rpm.bw
# create a directory for data tracks in my website
mkdir ../public_html/tracks
# change permissions to make file world readable
chmod o+r twist_ip_rpm.bw
# mv file to web accessible location
mv twist_ip_rpm.bw ../public_html/tracks
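If the file still is not reachable by URL, the directories along the path may also need to be world-accessible. This is a general Unix web-serving consideration rather than something specific to this setup, so check with IT if you are unsure:
# directories must be world-executable for the web server to traverse them
chmod o+x ../public_html ../public_html/tracks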
Now that the file is web accessible, I can create a track line for it and access it via URL:
track type=bigWig name=twist_ip description="Twist IP" graphType=bar visibility=full color=0,0,150 \
bigDataUrl=http://franklin.stowers.org/~cws/tracks/twist_ip_rpm.bw