Project: Genomics Module 2020
Author: Chris Seidel & Jeff Johnston
Generated: Sun Oct 04 2020, 02:26 PM
We use GRanges
extensively in our analyses. Defined by the Bioconductor GenomicRanges
library, GRanges
provide a way to store and manipulate sets of genomic regions. Examples of such regions are peaks obtained from a peak calling program such as MACS, motif locations and even genes.
Let’s create a simple GRanges
object using the GRanges()
constructor:
library(GenomicRanges)
gr <- GRanges(ranges=IRanges(start=c(100, 200), end=c(199, 299)),
seqnames=c("chr2L", "chr3R"),
strand=c("+", "-"))
gr
GRanges object with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr2L 100-199 +
[2] chr3R 200-299 -
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
At a minimum, you need to specify both the ranges
and seqnames
arguments. If you omit strand
, it will default to *
(unstranded). You can access the properties of a GRanges
object using various accessors:
start(gr)
[1] 100 200
end(gr)
[1] 199 299
width(gr)
[1] 100 100
seqnames(gr)
factor-Rle of length 2 with 2 runs
Lengths: 1 1
Values : chr2L chr3R
Levels(2): chr2L chr3R
strand(gr)
factor-Rle of length 2 with 2 runs
Lengths: 1 1
Values : + -
Levels(3): + - *
GRanges
objects can also optionally hold information about the lengths of the sequences (normally chromosomes) associated with the coordinates it contains. It is generally a good idea to set these lengths, as doing so will allow you to see warnings when the contained ranges extend outside the bounds of each chromosome. You can specify these lengths when you create the GRanges
object or later using the seqlengths()
method. The BSgenome
reference genome packages are the easiest way to obtain the chromosome lengths for most model organisms.
library(BSgenome.Dmelanogaster.UCSC.dm6)
gr <- GRanges(ranges=IRanges(start=c(100, 200), end=c(199, 299)),
seqnames=c("chr2L", "chr3R"),
strand=c("+", "-"),
seqlengths=seqlengths(Dmelanogaster))
head(seqlengths(gr))
chr2L chr2R chr3L chr3R chr4 chrX
23513712 25286936 28110227 32079331 1348131 23542271
Ranges inside a GRanges
object can contain additional information using metadata columns stored alongside the range data. You can access this information in the form of a DataFrame
using the mcols()
accessor. Alternatively, you can also use $
to access, add and remove columns as you would a normal data.frame
:
gr$name <- "My range"
gr$score <- 5
gr
GRanges object with 2 ranges and 2 metadata columns:
seqnames ranges strand | name score
<Rle> <IRanges> <Rle> | <character> <numeric>
[1] chr2L 100-199 + | My range 5
[2] chr3R 200-299 - | My range 5
-------
seqinfo: 1870 sequences from an unspecified genome
mcols(gr)
DataFrame with 2 rows and 2 columns
name score
<character> <numeric>
1 My range 5
2 My range 5
Adding metadata columns to a GRanges
is very useful. For example, if you have a set of peaks and want to calculate the ChIP-seq enrichment of a certain factor for each peak, you can store the enrichment values as metadata alongside the peaks.
The rtracklayer
package allows us to import a variety of file formats into GRanges
(or GRanges
-like) objects. Let’s import a common genomic file format (BED) to see how it works. We’ll use a BED file generated by the MACS peak calling tool.
library(rtracklayer)
bed_file <- "data/twist_summits.bed"
bed.gr <- import(bed_file)
bed.gr
GRanges object with 13218 ranges and 2 metadata columns:
seqnames ranges strand | name score
<Rle> <IRanges> <Rle> | <character> <numeric>
[1] chr2L 7773 * | twist_peak_1 3.3996
[2] chr2L 8854 * | twist_peak_2 18.3901
[3] chr2L 26602 * | twist_peak_3 15.864
[4] chr2L 27285 * | twist_peak_4 119.641
[5] chr2L 34176 * | twist_peak_5 14.9637
... ... ... ... . ... ...
[13214] chrY_CP007114v1_random 26329 * | twist_peak_13214 4.58682
[13215] chrY_DS485113v1_random 447 * | twist_peak_13215 12.8345
[13216] chrY_DS485143v1_random 316 * | twist_peak_13216 6.37728
[13217] chrY_DS485732v1_random 190 * | twist_peak_13217 7.88986
[13218] chrY_DS485865v1_random 174 * | twist_peak_13218 5.52593
-------
seqinfo: 125 sequences from an unspecified genome; no seqlengths
We can see that both the name and score columns from the MACS BED file were imported as metadata columns.
If you want to view your GRanges
objects in a genome browser such as IGV, you’ll first need to export them. Most of the time you will export a GRanges
object to a BED file. If you GRanges
object has a name and score metadata column, those will be included in the BED file.
export(gr, "guide/simple_granges_export_example.bed")
Alternatively, you can save your GRanges
objects to a text file to use in a spreadsheet application such as Microsoft Excel:
write.table(as.data.frame(gr), file="guide/simple_granges_export_example.xls",
quote=FALSE, row.names=FALSE, col.names=TRUE, sep="\t", na="")
A number of useful functions are available to manipulate GRanges
objects. For example, notice that each MACS peak in the BED file we imported above is of width 1 (these are the detected peak summits). If we want to expand these regions, we can use resize()
:
resize(bed.gr[1:5], width=100)
GRanges object with 5 ranges and 2 metadata columns:
seqnames ranges strand | name score
<Rle> <IRanges> <Rle> | <character> <numeric>
[1] chr2L 7773-7872 * | twist_peak_1 3.3996
[2] chr2L 8854-8953 * | twist_peak_2 18.3901
[3] chr2L 26602-26701 * | twist_peak_3 15.864
[4] chr2L 27285-27384 * | twist_peak_4 119.641
[5] chr2L 34176-34275 * | twist_peak_5 14.9637
-------
seqinfo: 125 sequences from an unspecified genome; no seqlengths
Notice that resize()
kept the start coordinate of each range the same, but extended the end coordinates so that each range was now 100bp in width. That’s because resize()
is strand-aware, and by default keeps the start coordinate fixed for ranges on the +
or *
strand, and keeps the end coordinate fixed for ranges on the -
strand. We can see an example of this using our simple GRanges
object from earlier:
resize(gr, width=50)
GRanges object with 2 ranges and 2 metadata columns:
seqnames ranges strand | name score
<Rle> <IRanges> <Rle> | <character> <numeric>
[1] chr2L 100-149 + | My range 5
[2] chr3R 250-299 - | My range 5
-------
seqinfo: 1870 sequences from an unspecified genome
What we really wanted was to extend the peak summits equally in both directions, so that the original coordinate of the peak summit remained in the center of each range. To do that, we use the fixed="center"
argument. We should also specify an odd number for width
so that an equal number of bases are to the right and left of the summit:
resize(bed.gr[1:5], width=101, fix="center")
GRanges object with 5 ranges and 2 metadata columns:
seqnames ranges strand | name score
<Rle> <IRanges> <Rle> | <character> <numeric>
[1] chr2L 7723-7823 * | twist_peak_1 3.3996
[2] chr2L 8804-8904 * | twist_peak_2 18.3901
[3] chr2L 26552-26652 * | twist_peak_3 15.864
[4] chr2L 27235-27335 * | twist_peak_4 119.641
[5] chr2L 34126-34226 * | twist_peak_5 14.9637
-------
seqinfo: 125 sequences from an unspecified genome; no seqlengths
In addition to resize()
, there are a number of other functions for manipulating GRanges
objects. A few useful ones are flank()
, promoters()
and shift()
:
flank(gr, width=10)
GRanges object with 2 ranges and 2 metadata columns:
seqnames ranges strand | name score
<Rle> <IRanges> <Rle> | <character> <numeric>
[1] chr2L 90-99 + | My range 5
[2] chr3R 300-309 - | My range 5
-------
seqinfo: 1870 sequences from an unspecified genome
Notice that flank()
is also strand-aware and creates ranges that are upstream by default. Downstream ranges can be obtained using the start=FALSE
argument.
Similar to flank()
is the promoters()
function, which creates regions around the start of ranges (the start coordinate for ranges on the +
strand and the end coordinate for ranges on the -
strand):
promoters(gr, upstream=10, downstream=5)
GRanges object with 2 ranges and 2 metadata columns:
seqnames ranges strand | name score
<Rle> <IRanges> <Rle> | <character> <numeric>
[1] chr2L 90-104 + | My range 5
[2] chr3R 295-309 - | My range 5
-------
seqinfo: 1870 sequences from an unspecified genome
GRanges
objects are an efficient means of representing the results of a high-throughput DNA sequencing experiment such as ChIP-seq. After alignment to a reference genome, the coordinates of all aligned reads can be easily imported from a BAM file as a GRanges
object. But what then? We typically express the results of a ChIP-seq experiment as a genome-wide coverage track in which each base of the reference genome is assigned a score. These tracks can not only be viewed in a genome browser but are also very useful for performing analysis.
We’ll first need to import the aligned reads from our BAM file. As the BAM file contains a lot of additional information (such as the read sequence and read quality scores), we can use the granges()
function to extract only the read coordinates:
library(GenomicAlignments)
bam <- granges(import("data/twist_ip.bam"))
bam
GRanges object with 6588203 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr2L 58-75 -
[2] chr2L 112-129 -
[3] chr2L 155-172 +
[4] chr2L 192-209 +
[5] chr2L 251-268 +
... ... ... ...
[6588199] chrY_DS486003v1_random 871-888 -
[6588200] chrY_DS486003v1_random 874-891 -
[6588201] chrY_DS486003v1_random 879-896 -
[6588202] chrY_DS486003v1_random 899-916 -
[6588203] chrY_DS486003v1_random 967-984 -
-------
seqinfo: 1870 sequences from an unspecified genome
Next, we’ll extend each each so that it represents the (estimated) full size of the DNA fragment and not just the sequenced portion. For ChIP-seq, this is typically about 150 bp.
bam <- resize(bam, width=150)
From the warning we can see that some of the aligned reads, once extended, go beyond the boundaries of their chromosome. We can fix that using the trim()
function:
bam <- trim(bam)
Using the coverage()
function on a GRanges
object will assign each base a score based on how many ranges it overlaps. The result is a SimpleRleList with values for each base in the genome, although we often refer to it as a coverage object:
cov <- coverage(bam)
cov
RleList of length 1870
$chr2L
integer-Rle of length 23513712 with 1749615 runs
Lengths: 75 54 7 5 13 37 59 11 9 16 5 12 1 19 ... 15 20 10 7 13 8 10 10 5 3 3 4 12 1
Values : 2 1 0 1 2 3 4 5 6 7 6 5 6 5 ... 15 16 15 16 17 16 15 14 13 12 11 10 9 8
$chr2R
integer-Rle of length 25286936 with 1920456 runs
Lengths: 20 8 7 32 1 5 16 29 7 8 8 ... 16 7 15 140 150 436 63 87 63 119
Values : 4 5 6 7 6 7 6 7 8 7 8 ... 3 2 1 0 1 0 1 3 2 0
$chr3L
integer-Rle of length 28110227 with 1961328 runs
Lengths: 30 82 64 4 5 2 21 18 36 51 13 ... 1 1 3 2 1 2 3 1 6 2
Values : 1 2 3 4 3 4 5 6 5 4 5 ... 28 27 26 24 23 21 18 17 16 14
$chr3R
integer-Rle of length 32079331 with 2438966 runs
Lengths: 8 9 1 17 10 3 12 5 2 4 10 1 6 4 ... 28 10 8 11 82 49 8 60 33 96 21 6 3 30
Values : 6 7 8 9 10 11 12 11 12 11 12 13 14 15 ... 2 1 2 3 2 3 2 1 2 1 2 1 2 3
$chr4
integer-Rle of length 1348131 with 49391 runs
Lengths: 391 3 8 33 106 3 8 33 19 115 19 ... 9 1 19 6 21 3 38 5 19 19
Values : 0 1 2 3 6 5 4 3 0 2 3 ... 8 7 6 5 4 3 2 3 2 1
...
<1865 more elements>
Coverage objects encode the genome-wide data in a list of run-length encoded values (hence the name RleList). You can extract specific regions (chromosomes) using the $
operator and an index. We can coerce the run-length encoded values to a standard numeric vector using as.numeric()
:
as.numeric(cov$chr2L[84055:84065])
[1] 16 16 16 16 16 16 15 15 15 15 15
This tells us that the first 6 bases of our coordinate range (chr2L: 84055 - 84065) had 16 overlapping GRanges
, while the remaining bases had 15 overlapping ranges. As you can see, this method of extraction is somewhat tedious. Instead, we use a set of functions that perform this extraction en masse on all the ranges contained in a GRanges
object.
The following helper functions are defined in /home/cws/public_html/Resources/granges1.1.r
:
Let’s take a look at each of these functions using our OCT4 ChIP-seq data. We have the coverage object for the ChIP-seq experiment (in the variable called cov
), and we have the results from the peak-calling program MACS in the form of a BED file (in the variable called bed.gr
):
length(bed.gr)
[1] 13218
head(elementNROWS(cov))
chr2L chr2R chr3L chr3R chr4 chrM
23513712 25286936 28110227 32079331 1348131 19524
As the MACS peak coordinates are 1-bp ranges representing the summits of each detected peak, we’ll first resize them to 201-bp centered at the summit:
bed.gr <- resize(bed.gr, width=201, fix="center")
We can now use grangesSums()
to calculate the ChIP-seq signal in each 201-bp peak region:
source("http://research.stowers.org/cws/CompGenomics/Resources/granges1.1.r")
bed.gr$twist_signal <- grangesSums(bed.gr, cov)
bed.gr[1:5]
GRanges object with 5 ranges and 3 metadata columns:
seqnames ranges strand | name score twist_signal
<Rle> <IRanges> <Rle> | <character> <numeric> <integer>
[1] chr2L 7673-7873 * | twist_peak_1 3.3996 5959
[2] chr2L 8754-8954 * | twist_peak_2 18.3901 8347
[3] chr2L 26502-26702 * | twist_peak_3 15.864 10155
[4] chr2L 27185-27385 * | twist_peak_4 119.641 36274
[5] chr2L 34076-34276 * | twist_peak_5 14.9637 9312
-------
seqinfo: 125 sequences from an unspecified genome; no seqlengths
Note the first range in our list of peaks: chr2L from 7673 to 7873. If we manually extract and sum the ChIP-seq signal at that location, it should match the value provided by grangesSums()
:
sum(cov$chr2L[7673:7873])
[1] 5959
The next three functions, grangesMeans()
, grangesMaxs()
and grangesMins()
, behavior similarly by returning the means (sum divided by width), maximum per-base values, and minimum per-base values, respectively.
bed.gr$twist_mean_signal <- grangesMeans(bed.gr, cov)
bed.gr$twist_max_signal <- grangesMaxs(bed.gr, cov)
bed.gr$twist_min_signal <- grangesMins(bed.gr, cov)
bed.gr[1:5]
GRanges object with 5 ranges and 6 metadata columns:
seqnames ranges strand | name score twist_signal twist_mean_signal
<Rle> <IRanges> <Rle> | <character> <numeric> <integer> <numeric>
[1] chr2L 7673-7873 * | twist_peak_1 3.3996 5959 29.6467661691542
[2] chr2L 8754-8954 * | twist_peak_2 18.3901 8347 41.5273631840796
[3] chr2L 26502-26702 * | twist_peak_3 15.864 10155 50.5223880597015
[4] chr2L 27185-27385 * | twist_peak_4 119.641 36274 180.467661691542
[5] chr2L 34076-34276 * | twist_peak_5 14.9637 9312 46.3283582089552
twist_max_signal twist_min_signal
<integer> <integer>
[1] 37 24
[2] 60 20
[3] 71 15
[4] 251 128
[5] 65 20
-------
seqinfo: 125 sequences from an unspecified genome; no seqlengths
Again, we can manually verify these values for the first peak:
mean(cov$chr2L[7673:7873])
[1] 29.64677
max(cov$chr2L[7673:7873])
[1] 37
min(cov$chr2L[7673:7873])
[1] 24
The next two functions, grangesWhichMaxs()
and grangesWhichMins()
, are a bit different. They both return a coordinate for each range where the signal is at a maximum or minimum. If the maximum or minimum value is found at multiple coordinates, the first (lowest coordinate) is returned.
bed.gr$twist_max_coord <- grangesWhichMaxs(bed.gr, cov)
bed.gr$twist_min_coord <- grangesWhichMins(bed.gr, cov)
bed.gr[1:5]
GRanges object with 5 ranges and 8 metadata columns:
seqnames ranges strand | name score twist_signal twist_mean_signal
<Rle> <IRanges> <Rle> | <character> <numeric> <integer> <numeric>
[1] chr2L 7673-7873 * | twist_peak_1 3.3996 5959 29.6467661691542
[2] chr2L 8754-8954 * | twist_peak_2 18.3901 8347 41.5273631840796
[3] chr2L 26502-26702 * | twist_peak_3 15.864 10155 50.5223880597015
[4] chr2L 27185-27385 * | twist_peak_4 119.641 36274 180.467661691542
[5] chr2L 34076-34276 * | twist_peak_5 14.9637 9312 46.3283582089552
twist_max_signal twist_min_signal twist_max_coord twist_min_coord
<integer> <integer> <integer> <integer>
[1] 37 24 7780 7673
[2] 60 20 8855 8754
[3] 71 15 26609 26502
[4] 251 128 27210 27315
[5] 65 20 34175 34098
-------
seqinfo: 125 sequences from an unspecified genome; no seqlengths
None of the above functions are affected by the strand of each range. The next function, grangesMatrix()
, does take the strand of each range into account. It returns a matrix
of values when provided with a GRanges
object containing ranges of equal length and a coverage object. Each row of the matrix comes from the corresponding range in the input GRanges
and each column contains values relative to the start of each strand-specific range.
small_peaks.gr <- resize(bed.gr[1:10], width=12)
small_peaks.m <- grangesMatrix(small_peaks.gr, cov)
small_peaks.m
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,] 24 24 24 24 24 24 24 24 25 25 25 25
[2,] 20 20 20 20 21 21 21 21 21 21 21 21
[3,] 15 15 15 16 16 16 16 16 16 16 16 18
[4,] 246 247 246 250 250 248 245 247 247 249 249 248
[5,] 24 24 24 24 25 23 23 22 22 23 23 24
[6,] 71 73 74 75 76 76 76 78 79 81 82 83
[7,] 43 43 43 43 44 45 45 45 45 46 47 48
[8,] 11 13 16 16 16 16 17 18 18 19 20 21
[9,] 188 194 202 208 210 213 219 223 227 239 243 247
[10,] 182 182 183 184 186 187 189 194 194 194 195 195
Each row contains the reads for the corresponding range in small_peaks.gr
. For example, the 5th row of values should match the 5th range:
small_peaks.gr[5]
GRanges object with 1 range and 8 metadata columns:
seqnames ranges strand | name score twist_signal twist_mean_signal
<Rle> <IRanges> <Rle> | <character> <numeric> <integer> <numeric>
[1] chr2L 34076-34087 * | twist_peak_5 14.9637 9312 46.3283582089552
twist_max_signal twist_min_signal twist_max_coord twist_min_coord
<integer> <integer> <integer> <integer>
[1] 65 20 34175 34098
-------
seqinfo: 125 sequences from an unspecified genome; no seqlengths
values_for_5 <- as.numeric(cov[["chr2L"]][34076:34087])
values_for_5
[1] 24 24 24 24 25 23 23 22 22 23 23 24
all.equal(values_for_5, small_peaks.m[5, ])
[1] TRUE
Our input small_peaks.gr
ranges were all unstranded (a strand of *
) and so were treated the same as if they were on the positive strand. If we change the strand of the last three ranges to -
, we should see that the last three rows of our matrix are reversed:
strand(small_peaks.gr)[8:10] <- "-"
small_peaks.m <- grangesMatrix(small_peaks.gr, cov)
small_peaks.m
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,] 24 24 24 24 24 24 24 24 25 25 25 25
[2,] 20 20 20 20 21 21 21 21 21 21 21 21
[3,] 15 15 15 16 16 16 16 16 16 16 16 18
[4,] 246 247 246 250 250 248 245 247 247 249 249 248
[5,] 24 24 24 24 25 23 23 22 22 23 23 24
[6,] 71 73 74 75 76 76 76 78 79 81 82 83
[7,] 43 43 43 43 44 45 45 45 45 46 47 48
[8,] 21 20 19 18 18 17 16 16 16 16 13 11
[9,] 247 243 239 227 223 219 213 210 208 202 194 188
[10,] 195 195 194 194 194 189 187 186 184 183 182 182
Comparing this matrix with the original one, we can see that the last three rows are indeed reversed.
The final GRanges
helper function is grangesApply()
. It allows you to provide a custom function that will be applied to values of each range in your input GRanges
. In fact, the grangesMatrix()
function uses grangesApply()
to extract the values for the resulting matrix.
As an example, let’s say we wanted to compute the median value for each range instead of the mean provided by grangesMeans()
. We can do that with grangesApply()
:
some_peaks.gr <- bed.gr[1:10]
some_peaks.gr$twist_median_signal <- unlist(grangesApply(some_peaks.gr, cov, median))
some_peaks.gr
GRanges object with 10 ranges and 9 metadata columns:
seqnames ranges strand | name score twist_signal twist_mean_signal
<Rle> <IRanges> <Rle> | <character> <numeric> <integer> <numeric>
[1] chr2L 7673-7873 * | twist_peak_1 3.3996 5959 29.6467661691542
[2] chr2L 8754-8954 * | twist_peak_2 18.3901 8347 41.5273631840796
[3] chr2L 26502-26702 * | twist_peak_3 15.864 10155 50.5223880597015
[4] chr2L 27185-27385 * | twist_peak_4 119.641 36274 180.467661691542
[5] chr2L 34076-34276 * | twist_peak_5 14.9637 9312 46.3283582089552
[6] chr2L 66712-66912 * | twist_peak_6 36.6593 22437 111.626865671642
[7] chr2L 83749-83949 * | twist_peak_7 25.7948 12243 60.910447761194
[8] chr2L 87314-87514 * | twist_peak_8 9.49031 8236 40.9751243781095
[9] chr2L 90175-90375 * | twist_peak_9 156.123 59894 297.980099502488
[10] chr2L 91605-91805 * | twist_peak_10 153.301 48815 242.860696517413
twist_max_signal twist_min_signal twist_max_coord twist_min_coord twist_median_signal
<integer> <integer> <integer> <integer> <integer>
[1] 37 24 7780 7673 29
[2] 60 20 8855 8754 45
[3] 71 15 26609 26502 55
[4] 251 128 27210 27315 171
[5] 65 20 34175 34098 53
[6] 144 63 66812 66911 120
[7] 78 39 83819 83941 61
[8] 59 11 87413 87314 43
[9] 374 153 90267 90375 316
[10] 286 182 91749 91605 243
-------
seqinfo: 125 sequences from an unspecified genome; no seqlengths
grangesApply()
always returns a list, as the custom function you provide can return any kind of value. If your function only returns a single value per range, as with median()
, you can use unlist()
to convert the values into a vector as shown above.
Since coverage objects store values for every base in the genome, they can take up a lot of memory even though they store these values in a compressed form. Using these GRanges
helper functions on multiple coverage objects can quickly consume enough memory to make many analyses impractical to perform on a standard workstation.
To address this issue, we need to avoid storing the entire set of genome-wide values in memory and instead only load the values for the regions we are interested in. Fortunately, the UCSC BigWig file format allows us to do just that. The format was created so that the UCSC genome browser could quickly extract genome-wide data for specific regions in the genome (for example, each time you navigate to a new region in the genome browser while viewing a ChIP-seq track).
The rtracklayer
package allows you to easily create BigWig files from coverage objects:
export(cov, "guide/twist_ip.bw")
The GRanges
helper functions are already equipped to handle BigWig files in addition to coverage objects. All you need to do is specify a path to a BigWig file instead of a coverage object when you use the functions:
twist_bigwig <- "guide/twist_ip.bw"
grangesSums(small_peaks.gr, cov)
chr2L chr2L chr2L chr2L chr2L chr2L chr2L chr2L chr2L chr2L
292 248 191 2972 281 924 537 201 2613 2265
grangesSums(small_peaks.gr, twist_bigwig)
chr2L chr2L chr2L chr2L chr2L chr2L chr2L chr2L chr2L chr2L
292 248 191 2972 281 924 537 201 2613 2265
If you are curious how this works, examine the shared/granges.r
file. The check_coverage_argument()
is what converts the path to a BigWig file into a coverage object. The resulting coverage object only contains values for the specific ranges required by the GRanges
helper function; all other values in the genome are set to zero. This reduces memory usage and improves performance.
Let’s see how we can use these tools to perform a simple analysis. Using the MACS peaks for Twist, we’ll compare the amount of Twist ChIP signal to the amount of Snail ChIP signal at each peak and plot the results as a scatterplot.
First, we need to create a Sox2 BigWig file:
snail.gr <- granges(import("data/snail_ip.bam"))
snail.gr <- trim(resize(snail.gr, width=150))
snail.cov <- coverage(snail.gr)
export(snail.cov, "guide/snail_ip.bw")
Now we have the Twist peaks, the Twist BigWig and the Snail BigWig:
twist_bigwig <- "guide/twist_ip.bw"
snail_bigwig <- "guide/snail_ip.bw"
twist_peaks <- import("data/twist_summits.bed")
We’ll expand each of the Twist peak summits to 201 bp and add the ChIP signal for Twist and Snail as metadata columns:
twist_peaks <- resize(twist_peaks, width=201, fix="center")
twist_peaks$twist_signal <- grangesSums(twist_peaks, twist_bigwig)
twist_peaks$snail_signal <- grangesSums(twist_peaks, snail_bigwig)
twist_peaks
GRanges object with 13218 ranges and 4 metadata columns:
seqnames ranges strand | name score twist_signal
<Rle> <IRanges> <Rle> | <character> <numeric> <numeric>
[1] chr2L 7673-7873 * | twist_peak_1 3.3996 5959
[2] chr2L 8754-8954 * | twist_peak_2 18.3901 8347
[3] chr2L 26502-26702 * | twist_peak_3 15.864 10155
[4] chr2L 27185-27385 * | twist_peak_4 119.641 36274
[5] chr2L 34076-34276 * | twist_peak_5 14.9637 9312
... ... ... ... . ... ... ...
[13214] chrY_CP007114v1_random 26229-26429 * | twist_peak_13214 4.58682 4802
[13215] chrY_DS485113v1_random 347-547 * | twist_peak_13215 12.8345 7735
[13216] chrY_DS485143v1_random 216-416 * | twist_peak_13216 6.37728 11362
[13217] chrY_DS485732v1_random 90-290 * | twist_peak_13217 7.88986 6369
[13218] chrY_DS485865v1_random 74-274 * | twist_peak_13218 5.52593 5399
snail_signal
<numeric>
[1] 5697
[2] 4699
[3] 1486
[4] 5886
[5] 2745
... ...
[13214] 3994
[13215] 4693
[13216] 8548
[13217] 2993
[13218] 3208
-------
seqinfo: 125 sequences from an unspecified genome; no seqlengths
Now we have a GRanges
object with the coordinates of 13,218 Twist peaks along with the Twist and Snail ChIP-seq signal around those peaks. We can easily coerce the GRanges
into a data.frame
and plot the results with ggplot
:
library(ggplot2)
twist_peaks_df <- as.data.frame(twist_peaks)
g <- ggplot(twist_peaks_df, aes(x=twist_signal, y=snail_signal)) +
geom_point(size=1.1, alpha=0.5) +
scale_x_continuous(trans=scales::log2_trans()) +
scale_y_continuous(trans=scales::log2_trans()) +
theme_bw() +
labs(x="Twist ChIP signal",
y="Snail ChIP signal",
title="Twist and Snail ChIP signal at Twist peaks")
g
Warning: Transformation introduced infinite values in continuous x-axis
Warning: Transformation introduced infinite values in continuous y-axis
We also have available a whole cell extract control sample for these experiments. Can you express Twist and Snail signal as enrichment over the whole cell extract? How might you normalize these values to account for differences in the number of sequenced reads?
How might you determine whether, within these Twist peaks, the summit of the Snail ChIP signal is at the same location as Twist’s?
Can you think of how to calculate and plot the average Twist ChIP-seq signal surrounding a set of peaks? Hint: use grangesMatrix()
.
For reproducibility, this analysis was performed with the following R/Bioconductor session:
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS: /n/apps/CentOS7/install/r-3.6.1/lib64/R/lib/libRblas.so
LAPACK: /n/apps/CentOS7/install/r-3.6.1/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] digest_0.6.25 pander_0.6.3
[3] ggplot2_3.3.2 BSgenome.Dmelanogaster.UCSC.dm6_1.4.1
[5] BSgenome_1.54.0 rtracklayer_1.46.0
[7] GenomicAlignments_1.22.1 Rsamtools_2.2.3
[9] Biostrings_2.54.0 XVector_0.26.0
[11] SummarizedExperiment_1.16.1 DelayedArray_0.12.3
[13] BiocParallel_1.20.1 matrixStats_0.56.0
[15] Biobase_2.46.0 GenomicRanges_1.38.0
[17] GenomeInfoDb_1.22.1 IRanges_2.20.2
[19] S4Vectors_0.24.4 BiocGenerics_0.32.0
[21] RColorBrewer_1.1-2
loaded via a namespace (and not attached):
[1] tidyselect_1.1.0 xfun_0.16 purrr_0.3.4 lattice_0.20-41
[5] colorspace_1.4-1 vctrs_0.3.2 generics_0.0.2 htmltools_0.5.0
[9] yaml_2.2.1 XML_3.99-0.3 rlang_0.4.7 pillar_1.4.6
[13] glue_1.4.1 withr_2.2.0 GenomeInfoDbData_1.2.2 lifecycle_0.2.0
[17] stringr_1.4.0 zlibbioc_1.32.0 munsell_0.5.0 gtable_0.3.0
[21] evaluate_0.14 knitr_1.29 Rcpp_1.0.5 scales_1.1.1
[25] farver_2.0.3 stringi_1.4.6 dplyr_1.0.2 grid_3.6.1
[29] tools_3.6.1 bitops_1.0-6 magrittr_1.5 RCurl_1.98-1.2
[33] tibble_3.0.3 crayon_1.3.4 pkgconfig_2.0.3 ellipsis_0.3.1
[37] Matrix_1.2-18 rmarkdown_2.3 R6_2.4.1 compiler_3.6.1