Introduction to GRanges for genomics analysis

Project: Genomics Module 2020

Author: Chris Seidel & Jeff Johnston

Generated: Sun Oct 04 2020, 02:26 PM

GenomicRanges

We use GRanges extensively in our analyses. Defined by the Bioconductor GenomicRanges library, GRanges provide a way to store and manipulate sets of genomic regions. Examples of such regions are peaks obtained from a peak calling program such as MACS, motif locations and even genes.

Defining a GRanges object

Let’s create a simple GRanges object using the GRanges() constructor:

library(GenomicRanges)

gr <- GRanges(ranges=IRanges(start=c(100, 200), end=c(199, 299)), 
              seqnames=c("chr2L", "chr3R"),
              strand=c("+", "-"))
gr

GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]    chr2L   100-199      +
  [2]    chr3R   200-299      -
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

At a minimum, you need to specify both the ranges and seqnames arguments. If you omit strand, it will default to * (unstranded). You can access the properties of a GRanges object using various accessors:

start(gr)

[1] 100 200

end(gr)

[1] 199 299

width(gr)

[1] 100 100

seqnames(gr)

factor-Rle of length 2 with 2 runs
  Lengths:     1     1
  Values : chr2L chr3R
Levels(2): chr2L chr3R

strand(gr)

factor-Rle of length 2 with 2 runs
  Lengths: 1 1
  Values : + -
Levels(3): + - *

GRanges objects can also optionally hold information about the lengths of the sequences (normally chromosomes) associated with the coordinates it contains. It is generally a good idea to set these lengths, as doing so will allow you to see warnings when the contained ranges extend outside the bounds of each chromosome. You can specify these lengths when you create the GRanges object or later using the seqlengths() method. The BSgenome reference genome packages are the easiest way to obtain the chromosome lengths for most model organisms.

library(BSgenome.Dmelanogaster.UCSC.dm6)

gr <- GRanges(ranges=IRanges(start=c(100, 200), end=c(199, 299)), 
              seqnames=c("chr2L", "chr3R"),
              strand=c("+", "-"),
              seqlengths=seqlengths(Dmelanogaster))
head(seqlengths(gr))

   chr2L    chr2R    chr3L    chr3R     chr4     chrX 
23513712 25286936 28110227 32079331  1348131 23542271

Annotating GRanges objects with metadata

Ranges inside a GRanges object can contain additional information using metadata columns stored alongside the range data. You can access this information in the form of a DataFrame using the mcols() accessor. Alternatively, you can also use $ to access, add and remove columns as you would a normal data.frame:

gr$name <- "My range"
gr$score <- 5
gr

GRanges object with 2 ranges and 2 metadata columns:
      seqnames    ranges strand |        name     score
         <Rle> <IRanges>  <Rle> | <character> <numeric>
  [1]    chr2L   100-199      + |    My range         5
  [2]    chr3R   200-299      - |    My range         5
  -------
  seqinfo: 1870 sequences from an unspecified genome

mcols(gr)

DataFrame with 2 rows and 2 columns
         name     score
  <character> <numeric>
1    My range         5
2    My range         5

Adding metadata columns to a GRanges is very useful. For example, if you have a set of peaks and want to calculate the ChIP-seq enrichment of a certain factor for each peak, you can store the enrichment values as metadata alongside the peaks.

Importing data as GRanges objects

The rtracklayer package allows us to import a variety of file formats into GRanges (or GRanges-like) objects. Let’s import a common genomic file format (BED) to see how it works. We’ll use a BED file generated by the MACS peak calling tool.

library(rtracklayer)

bed_file <- "data/twist_summits.bed"
bed.gr <- import(bed_file)
bed.gr

GRanges object with 13218 ranges and 2 metadata columns:
                        seqnames    ranges strand |             name     score
                           <Rle> <IRanges>  <Rle> |      <character> <numeric>
      [1]                  chr2L      7773      * |     twist_peak_1    3.3996
      [2]                  chr2L      8854      * |     twist_peak_2   18.3901
      [3]                  chr2L     26602      * |     twist_peak_3    15.864
      [4]                  chr2L     27285      * |     twist_peak_4   119.641
      [5]                  chr2L     34176      * |     twist_peak_5   14.9637
      ...                    ...       ...    ... .              ...       ...
  [13214] chrY_CP007114v1_random     26329      * | twist_peak_13214   4.58682
  [13215] chrY_DS485113v1_random       447      * | twist_peak_13215   12.8345
  [13216] chrY_DS485143v1_random       316      * | twist_peak_13216   6.37728
  [13217] chrY_DS485732v1_random       190      * | twist_peak_13217   7.88986
  [13218] chrY_DS485865v1_random       174      * | twist_peak_13218   5.52593
  -------
  seqinfo: 125 sequences from an unspecified genome; no seqlengths

We can see that both the name and score columns from the MACS BED file were imported as metadata columns.

Exporting GRanges objects

If you want to view your GRanges objects in a genome browser such as IGV, you’ll first need to export them. Most of the time you will export a GRanges object to a BED file. If you GRanges object has a name and score metadata column, those will be included in the BED file.

export(gr, "guide/simple_granges_export_example.bed")

Alternatively, you can save your GRanges objects to a text file to use in a spreadsheet application such as Microsoft Excel:

write.table(as.data.frame(gr), file="guide/simple_granges_export_example.xls", 
            quote=FALSE, row.names=FALSE, col.names=TRUE, sep="\t", na="")

Manipulating GRanges objects

A number of useful functions are available to manipulate GRanges objects. For example, notice that each MACS peak in the BED file we imported above is of width 1 (these are the detected peak summits). If we want to expand these regions, we can use resize():

resize(bed.gr[1:5], width=100)

GRanges object with 5 ranges and 2 metadata columns:
      seqnames      ranges strand |         name     score
         <Rle>   <IRanges>  <Rle> |  <character> <numeric>
  [1]    chr2L   7773-7872      * | twist_peak_1    3.3996
  [2]    chr2L   8854-8953      * | twist_peak_2   18.3901
  [3]    chr2L 26602-26701      * | twist_peak_3    15.864
  [4]    chr2L 27285-27384      * | twist_peak_4   119.641
  [5]    chr2L 34176-34275      * | twist_peak_5   14.9637
  -------
  seqinfo: 125 sequences from an unspecified genome; no seqlengths

Notice that resize() kept the start coordinate of each range the same, but extended the end coordinates so that each range was now 100bp in width. That’s because resize() is strand-aware, and by default keeps the start coordinate fixed for ranges on the + or * strand, and keeps the end coordinate fixed for ranges on the - strand. We can see an example of this using our simple GRanges object from earlier:

resize(gr, width=50)

GRanges object with 2 ranges and 2 metadata columns:
      seqnames    ranges strand |        name     score
         <Rle> <IRanges>  <Rle> | <character> <numeric>
  [1]    chr2L   100-149      + |    My range         5
  [2]    chr3R   250-299      - |    My range         5
  -------
  seqinfo: 1870 sequences from an unspecified genome

What we really wanted was to extend the peak summits equally in both directions, so that the original coordinate of the peak summit remained in the center of each range. To do that, we use the fixed="center" argument. We should also specify an odd number for width so that an equal number of bases are to the right and left of the summit:

resize(bed.gr[1:5], width=101, fix="center")

GRanges object with 5 ranges and 2 metadata columns:
      seqnames      ranges strand |         name     score
         <Rle>   <IRanges>  <Rle> |  <character> <numeric>
  [1]    chr2L   7723-7823      * | twist_peak_1    3.3996
  [2]    chr2L   8804-8904      * | twist_peak_2   18.3901
  [3]    chr2L 26552-26652      * | twist_peak_3    15.864
  [4]    chr2L 27235-27335      * | twist_peak_4   119.641
  [5]    chr2L 34126-34226      * | twist_peak_5   14.9637
  -------
  seqinfo: 125 sequences from an unspecified genome; no seqlengths

In addition to resize(), there are a number of other functions for manipulating GRanges objects. A few useful ones are flank(), promoters() and shift():

flank(gr, width=10)

GRanges object with 2 ranges and 2 metadata columns:
      seqnames    ranges strand |        name     score
         <Rle> <IRanges>  <Rle> | <character> <numeric>
  [1]    chr2L     90-99      + |    My range         5
  [2]    chr3R   300-309      - |    My range         5
  -------
  seqinfo: 1870 sequences from an unspecified genome

Notice that flank() is also strand-aware and creates ranges that are upstream by default. Downstream ranges can be obtained using the start=FALSE argument.

Similar to flank() is the promoters() function, which creates regions around the start of ranges (the start coordinate for ranges on the + strand and the end coordinate for ranges on the - strand):

promoters(gr, upstream=10, downstream=5)

GRanges object with 2 ranges and 2 metadata columns:
      seqnames    ranges strand |        name     score
         <Rle> <IRanges>  <Rle> | <character> <numeric>
  [1]    chr2L    90-104      + |    My range         5
  [2]    chr3R   295-309      - |    My range         5
  -------
  seqinfo: 1870 sequences from an unspecified genome

Alignments

GRanges objects are an efficient means of representing the results of a high-throughput DNA sequencing experiment such as ChIP-seq. After alignment to a reference genome, the coordinates of all aligned reads can be easily imported from a BAM file as a GRanges object. But what then? We typically express the results of a ChIP-seq experiment as a genome-wide coverage track in which each base of the reference genome is assigned a score. These tracks can not only be viewed in a genome browser but are also very useful for performing analysis.

Importing BAM alignments

We’ll first need to import the aligned reads from our BAM file. As the BAM file contains a lot of additional information (such as the read sequence and read quality scores), we can use the granges() function to extract only the read coordinates:

library(GenomicAlignments)

bam <- granges(import("data/twist_ip.bam"))
bam

GRanges object with 6588203 ranges and 0 metadata columns:
                          seqnames    ranges strand
                             <Rle> <IRanges>  <Rle>
        [1]                  chr2L     58-75      -
        [2]                  chr2L   112-129      -
        [3]                  chr2L   155-172      +
        [4]                  chr2L   192-209      +
        [5]                  chr2L   251-268      +
        ...                    ...       ...    ...
  [6588199] chrY_DS486003v1_random   871-888      -
  [6588200] chrY_DS486003v1_random   874-891      -
  [6588201] chrY_DS486003v1_random   879-896      -
  [6588202] chrY_DS486003v1_random   899-916      -
  [6588203] chrY_DS486003v1_random   967-984      -
  -------
  seqinfo: 1870 sequences from an unspecified genome

Extending aligned reads

Next, we’ll extend each each so that it represents the (estimated) full size of the DNA fragment and not just the sequenced portion. For ChIP-seq, this is typically about 150 bp.

bam <- resize(bam, width=150)

From the warning we can see that some of the aligned reads, once extended, go beyond the boundaries of their chromosome. We can fix that using the trim() function:

bam <- trim(bam)

Creating a “coverage” object

Using the coverage() function on a GRanges object will assign each base a score based on how many ranges it overlaps. The result is a SimpleRleList with values for each base in the genome, although we often refer to it as a coverage object:

cov <- coverage(bam)
cov

RleList of length 1870
$chr2L
integer-Rle of length 23513712 with 1749615 runs
  Lengths: 75 54  7  5 13 37 59 11  9 16  5 12  1 19 ... 15 20 10  7 13  8 10 10  5  3  3  4 12  1
  Values :  2  1  0  1  2  3  4  5  6  7  6  5  6  5 ... 15 16 15 16 17 16 15 14 13 12 11 10  9  8

$chr2R
integer-Rle of length 25286936 with 1920456 runs
  Lengths:  20   8   7  32   1   5  16  29   7   8   8 ...  16   7  15 140 150 436  63  87  63 119
  Values :   4   5   6   7   6   7   6   7   8   7   8 ...   3   2   1   0   1   0   1   3   2   0

$chr3L
integer-Rle of length 28110227 with 1961328 runs
  Lengths:  30  82  64   4   5   2  21  18  36  51  13 ...   1   1   3   2   1   2   3   1   6   2
  Values :   1   2   3   4   3   4   5   6   5   4   5 ...  28  27  26  24  23  21  18  17  16  14

$chr3R
integer-Rle of length 32079331 with 2438966 runs
  Lengths:  8  9  1 17 10  3 12  5  2  4 10  1  6  4 ... 28 10  8 11 82 49  8 60 33 96 21  6  3 30
  Values :  6  7  8  9 10 11 12 11 12 11 12 13 14 15 ...  2  1  2  3  2  3  2  1  2  1  2  1  2  3

$chr4
integer-Rle of length 1348131 with 49391 runs
  Lengths: 391   3   8  33 106   3   8  33  19 115  19 ...   9   1  19   6  21   3  38   5  19  19
  Values :   0   1   2   3   6   5   4   3   0   2   3 ...   8   7   6   5   4   3   2   3   2   1

...
<1865 more elements>

Coverage objects encode the genome-wide data in a list of run-length encoded values (hence the name RleList). You can extract specific regions (chromosomes) using the $ operator and an index. We can coerce the run-length encoded values to a standard numeric vector using as.numeric():

as.numeric(cov$chr2L[84055:84065])

 [1] 16 16 16 16 16 16 15 15 15 15 15

This tells us that the first 6 bases of our coordinate range (chr2L: 84055 - 84065) had 16 overlapping GRanges, while the remaining bases had 15 overlapping ranges. As you can see, this method of extraction is somewhat tedious. Instead, we use a set of functions that perform this extraction en masse on all the ranges contained in a GRanges object.

GRanges helper functions

The following helper functions are defined in /home/cws/public_html/Resources/granges1.1.r:

grangesSums()
grangesMeans()
grangesMaxs()
grangesMins()
grangesWhichMaxs()
grangesWhichMins()
grangesMatrix()
grangesApply()

Let’s take a look at each of these functions using our OCT4 ChIP-seq data. We have the coverage object for the ChIP-seq experiment (in the variable called cov), and we have the results from the peak-calling program MACS in the form of a BED file (in the variable called bed.gr):

length(bed.gr)

[1] 13218

head(elementNROWS(cov))

   chr2L    chr2R    chr3L    chr3R     chr4     chrM 
23513712 25286936 28110227 32079331  1348131    19524

As the MACS peak coordinates are 1-bp ranges representing the summits of each detected peak, we’ll first resize them to 201-bp centered at the summit:

bed.gr <- resize(bed.gr, width=201, fix="center")

We can now use grangesSums() to calculate the ChIP-seq signal in each 201-bp peak region:

source("http://research.stowers.org/cws/CompGenomics/Resources/granges1.1.r")

bed.gr$twist_signal <- grangesSums(bed.gr, cov)
bed.gr[1:5]

GRanges object with 5 ranges and 3 metadata columns:
      seqnames      ranges strand |         name     score twist_signal
         <Rle>   <IRanges>  <Rle> |  <character> <numeric>    <integer>
  [1]    chr2L   7673-7873      * | twist_peak_1    3.3996         5959
  [2]    chr2L   8754-8954      * | twist_peak_2   18.3901         8347
  [3]    chr2L 26502-26702      * | twist_peak_3    15.864        10155
  [4]    chr2L 27185-27385      * | twist_peak_4   119.641        36274
  [5]    chr2L 34076-34276      * | twist_peak_5   14.9637         9312
  -------
  seqinfo: 125 sequences from an unspecified genome; no seqlengths

Note the first range in our list of peaks: chr2L from 7673 to 7873. If we manually extract and sum the ChIP-seq signal at that location, it should match the value provided by grangesSums():

sum(cov$chr2L[7673:7873])

[1] 5959

The next three functions, grangesMeans(), grangesMaxs() and grangesMins(), behavior similarly by returning the means (sum divided by width), maximum per-base values, and minimum per-base values, respectively.

bed.gr$twist_mean_signal <- grangesMeans(bed.gr, cov)
bed.gr$twist_max_signal <- grangesMaxs(bed.gr, cov)
bed.gr$twist_min_signal <- grangesMins(bed.gr, cov)
bed.gr[1:5]

GRanges object with 5 ranges and 6 metadata columns:
      seqnames      ranges strand |         name     score twist_signal twist_mean_signal
         <Rle>   <IRanges>  <Rle> |  <character> <numeric>    <integer>         <numeric>
  [1]    chr2L   7673-7873      * | twist_peak_1    3.3996         5959  29.6467661691542
  [2]    chr2L   8754-8954      * | twist_peak_2   18.3901         8347  41.5273631840796
  [3]    chr2L 26502-26702      * | twist_peak_3    15.864        10155  50.5223880597015
  [4]    chr2L 27185-27385      * | twist_peak_4   119.641        36274  180.467661691542
  [5]    chr2L 34076-34276      * | twist_peak_5   14.9637         9312  46.3283582089552
      twist_max_signal twist_min_signal
             <integer>        <integer>
  [1]               37               24
  [2]               60               20
  [3]               71               15
  [4]              251              128
  [5]               65               20
  -------
  seqinfo: 125 sequences from an unspecified genome; no seqlengths

Again, we can manually verify these values for the first peak:

mean(cov$chr2L[7673:7873])

[1] 29.64677

max(cov$chr2L[7673:7873])

[1] 37

min(cov$chr2L[7673:7873])

[1] 24

The next two functions, grangesWhichMaxs() and grangesWhichMins(), are a bit different. They both return a coordinate for each range where the signal is at a maximum or minimum. If the maximum or minimum value is found at multiple coordinates, the first (lowest coordinate) is returned.

bed.gr$twist_max_coord <- grangesWhichMaxs(bed.gr, cov)
bed.gr$twist_min_coord <- grangesWhichMins(bed.gr, cov)
bed.gr[1:5]

GRanges object with 5 ranges and 8 metadata columns:
      seqnames      ranges strand |         name     score twist_signal twist_mean_signal
         <Rle>   <IRanges>  <Rle> |  <character> <numeric>    <integer>         <numeric>
  [1]    chr2L   7673-7873      * | twist_peak_1    3.3996         5959  29.6467661691542
  [2]    chr2L   8754-8954      * | twist_peak_2   18.3901         8347  41.5273631840796
  [3]    chr2L 26502-26702      * | twist_peak_3    15.864        10155  50.5223880597015
  [4]    chr2L 27185-27385      * | twist_peak_4   119.641        36274  180.467661691542
  [5]    chr2L 34076-34276      * | twist_peak_5   14.9637         9312  46.3283582089552
      twist_max_signal twist_min_signal twist_max_coord twist_min_coord
             <integer>        <integer>       <integer>       <integer>
  [1]               37               24            7780            7673
  [2]               60               20            8855            8754
  [3]               71               15           26609           26502
  [4]              251              128           27210           27315
  [5]               65               20           34175           34098
  -------
  seqinfo: 125 sequences from an unspecified genome; no seqlengths

None of the above functions are affected by the strand of each range. The next function, grangesMatrix(), does take the strand of each range into account. It returns a matrix of values when provided with a GRanges object containing ranges of equal length and a coverage object. Each row of the matrix comes from the corresponding range in the input GRanges and each column contains values relative to the start of each strand-specific range.

small_peaks.gr <- resize(bed.gr[1:10], width=12)
small_peaks.m <- grangesMatrix(small_peaks.gr, cov)
small_peaks.m

      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
 [1,]   24   24   24   24   24   24   24   24   25    25    25    25
 [2,]   20   20   20   20   21   21   21   21   21    21    21    21
 [3,]   15   15   15   16   16   16   16   16   16    16    16    18
 [4,]  246  247  246  250  250  248  245  247  247   249   249   248
 [5,]   24   24   24   24   25   23   23   22   22    23    23    24
 [6,]   71   73   74   75   76   76   76   78   79    81    82    83
 [7,]   43   43   43   43   44   45   45   45   45    46    47    48
 [8,]   11   13   16   16   16   16   17   18   18    19    20    21
 [9,]  188  194  202  208  210  213  219  223  227   239   243   247
[10,]  182  182  183  184  186  187  189  194  194   194   195   195

Each row contains the reads for the corresponding range in small_peaks.gr. For example, the 5th row of values should match the 5th range:

small_peaks.gr[5]

GRanges object with 1 range and 8 metadata columns:
      seqnames      ranges strand |         name     score twist_signal twist_mean_signal
         <Rle>   <IRanges>  <Rle> |  <character> <numeric>    <integer>         <numeric>
  [1]    chr2L 34076-34087      * | twist_peak_5   14.9637         9312  46.3283582089552
      twist_max_signal twist_min_signal twist_max_coord twist_min_coord
             <integer>        <integer>       <integer>       <integer>
  [1]               65               20           34175           34098
  -------
  seqinfo: 125 sequences from an unspecified genome; no seqlengths

values_for_5 <- as.numeric(cov[["chr2L"]][34076:34087])
values_for_5

 [1] 24 24 24 24 25 23 23 22 22 23 23 24

all.equal(values_for_5, small_peaks.m[5, ])

[1] TRUE

Our input small_peaks.gr ranges were all unstranded (a strand of *) and so were treated the same as if they were on the positive strand. If we change the strand of the last three ranges to -, we should see that the last three rows of our matrix are reversed:

strand(small_peaks.gr)[8:10] <- "-"
small_peaks.m <- grangesMatrix(small_peaks.gr, cov)
small_peaks.m

      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
 [1,]   24   24   24   24   24   24   24   24   25    25    25    25
 [2,]   20   20   20   20   21   21   21   21   21    21    21    21
 [3,]   15   15   15   16   16   16   16   16   16    16    16    18
 [4,]  246  247  246  250  250  248  245  247  247   249   249   248
 [5,]   24   24   24   24   25   23   23   22   22    23    23    24
 [6,]   71   73   74   75   76   76   76   78   79    81    82    83
 [7,]   43   43   43   43   44   45   45   45   45    46    47    48
 [8,]   21   20   19   18   18   17   16   16   16    16    13    11
 [9,]  247  243  239  227  223  219  213  210  208   202   194   188
[10,]  195  195  194  194  194  189  187  186  184   183   182   182

Comparing this matrix with the original one, we can see that the last three rows are indeed reversed.

The final GRanges helper function is grangesApply(). It allows you to provide a custom function that will be applied to values of each range in your input GRanges. In fact, the grangesMatrix() function uses grangesApply() to extract the values for the resulting matrix.

As an example, let’s say we wanted to compute the median value for each range instead of the mean provided by grangesMeans(). We can do that with grangesApply():

some_peaks.gr <- bed.gr[1:10]
some_peaks.gr$twist_median_signal <- unlist(grangesApply(some_peaks.gr, cov, median))
some_peaks.gr

GRanges object with 10 ranges and 9 metadata columns:
       seqnames      ranges strand |          name     score twist_signal twist_mean_signal
          <Rle>   <IRanges>  <Rle> |   <character> <numeric>    <integer>         <numeric>
   [1]    chr2L   7673-7873      * |  twist_peak_1    3.3996         5959  29.6467661691542
   [2]    chr2L   8754-8954      * |  twist_peak_2   18.3901         8347  41.5273631840796
   [3]    chr2L 26502-26702      * |  twist_peak_3    15.864        10155  50.5223880597015
   [4]    chr2L 27185-27385      * |  twist_peak_4   119.641        36274  180.467661691542
   [5]    chr2L 34076-34276      * |  twist_peak_5   14.9637         9312  46.3283582089552
   [6]    chr2L 66712-66912      * |  twist_peak_6   36.6593        22437  111.626865671642
   [7]    chr2L 83749-83949      * |  twist_peak_7   25.7948        12243   60.910447761194
   [8]    chr2L 87314-87514      * |  twist_peak_8   9.49031         8236  40.9751243781095
   [9]    chr2L 90175-90375      * |  twist_peak_9   156.123        59894  297.980099502488
  [10]    chr2L 91605-91805      * | twist_peak_10   153.301        48815  242.860696517413
       twist_max_signal twist_min_signal twist_max_coord twist_min_coord twist_median_signal
              <integer>        <integer>       <integer>       <integer>           <integer>
   [1]               37               24            7780            7673                  29
   [2]               60               20            8855            8754                  45
   [3]               71               15           26609           26502                  55
   [4]              251              128           27210           27315                 171
   [5]               65               20           34175           34098                  53
   [6]              144               63           66812           66911                 120
   [7]               78               39           83819           83941                  61
   [8]               59               11           87413           87314                  43
   [9]              374              153           90267           90375                 316
  [10]              286              182           91749           91605                 243
  -------
  seqinfo: 125 sequences from an unspecified genome; no seqlengths

grangesApply() always returns a list, as the custom function you provide can return any kind of value. If your function only returns a single value per range, as with median(), you can use unlist() to convert the values into a vector as shown above.

Using BigWig files instead of coverage objects

Since coverage objects store values for every base in the genome, they can take up a lot of memory even though they store these values in a compressed form. Using these GRanges helper functions on multiple coverage objects can quickly consume enough memory to make many analyses impractical to perform on a standard workstation.

To address this issue, we need to avoid storing the entire set of genome-wide values in memory and instead only load the values for the regions we are interested in. Fortunately, the UCSC BigWig file format allows us to do just that. The format was created so that the UCSC genome browser could quickly extract genome-wide data for specific regions in the genome (for example, each time you navigate to a new region in the genome browser while viewing a ChIP-seq track).

The rtracklayer package allows you to easily create BigWig files from coverage objects:

export(cov, "guide/twist_ip.bw")

The GRanges helper functions are already equipped to handle BigWig files in addition to coverage objects. All you need to do is specify a path to a BigWig file instead of a coverage object when you use the functions:

twist_bigwig <- "guide/twist_ip.bw"
grangesSums(small_peaks.gr, cov)

chr2L chr2L chr2L chr2L chr2L chr2L chr2L chr2L chr2L chr2L 
  292   248   191  2972   281   924   537   201  2613  2265

grangesSums(small_peaks.gr, twist_bigwig)

chr2L chr2L chr2L chr2L chr2L chr2L chr2L chr2L chr2L chr2L 
  292   248   191  2972   281   924   537   201  2613  2265

If you are curious how this works, examine the shared/granges.r file. The check_coverage_argument() is what converts the path to a BigWig file into a coverage object. The resulting coverage object only contains values for the specific ranges required by the GRanges helper function; all other values in the genome are set to zero. This reduces memory usage and improves performance.

Example analysis: comparing Twist and Snail signal at Twist peaks

Let’s see how we can use these tools to perform a simple analysis. Using the MACS peaks for Twist, we’ll compare the amount of Twist ChIP signal to the amount of Snail ChIP signal at each peak and plot the results as a scatterplot.

First, we need to create a Sox2 BigWig file:

snail.gr <- granges(import("data/snail_ip.bam"))
snail.gr <- trim(resize(snail.gr, width=150))
snail.cov <- coverage(snail.gr)
export(snail.cov, "guide/snail_ip.bw")

Now we have the Twist peaks, the Twist BigWig and the Snail BigWig:

twist_bigwig <- "guide/twist_ip.bw"
snail_bigwig <- "guide/snail_ip.bw"
twist_peaks <- import("data/twist_summits.bed")

We’ll expand each of the Twist peak summits to 201 bp and add the ChIP signal for Twist and Snail as metadata columns:

twist_peaks <- resize(twist_peaks, width=201, fix="center")
twist_peaks$twist_signal <- grangesSums(twist_peaks, twist_bigwig)
twist_peaks$snail_signal <- grangesSums(twist_peaks, snail_bigwig)
twist_peaks

GRanges object with 13218 ranges and 4 metadata columns:
                        seqnames      ranges strand |             name     score twist_signal
                           <Rle>   <IRanges>  <Rle> |      <character> <numeric>    <numeric>
      [1]                  chr2L   7673-7873      * |     twist_peak_1    3.3996         5959
      [2]                  chr2L   8754-8954      * |     twist_peak_2   18.3901         8347
      [3]                  chr2L 26502-26702      * |     twist_peak_3    15.864        10155
      [4]                  chr2L 27185-27385      * |     twist_peak_4   119.641        36274
      [5]                  chr2L 34076-34276      * |     twist_peak_5   14.9637         9312
      ...                    ...         ...    ... .              ...       ...          ...
  [13214] chrY_CP007114v1_random 26229-26429      * | twist_peak_13214   4.58682         4802
  [13215] chrY_DS485113v1_random     347-547      * | twist_peak_13215   12.8345         7735
  [13216] chrY_DS485143v1_random     216-416      * | twist_peak_13216   6.37728        11362
  [13217] chrY_DS485732v1_random      90-290      * | twist_peak_13217   7.88986         6369
  [13218] chrY_DS485865v1_random      74-274      * | twist_peak_13218   5.52593         5399
          snail_signal
             <numeric>
      [1]         5697
      [2]         4699
      [3]         1486
      [4]         5886
      [5]         2745
      ...          ...
  [13214]         3994
  [13215]         4693
  [13216]         8548
  [13217]         2993
  [13218]         3208
  -------
  seqinfo: 125 sequences from an unspecified genome; no seqlengths

Now we have a GRanges object with the coordinates of 13,218 Twist peaks along with the Twist and Snail ChIP-seq signal around those peaks. We can easily coerce the GRanges into a data.frame and plot the results with ggplot:

library(ggplot2)

twist_peaks_df <- as.data.frame(twist_peaks)

g <- ggplot(twist_peaks_df, aes(x=twist_signal, y=snail_signal)) +
     geom_point(size=1.1, alpha=0.5) +
     scale_x_continuous(trans=scales::log2_trans()) +
     scale_y_continuous(trans=scales::log2_trans()) +
     theme_bw() +
     labs(x="Twist ChIP signal", 
          y="Snail ChIP signal",
          title="Twist and Snail ChIP signal at Twist peaks") 

g

Warning: Transformation introduced infinite values in continuous x-axis

Warning: Transformation introduced infinite values in continuous y-axis

Additional analysis ideas

We also have available a whole cell extract control sample for these experiments. Can you express Twist and Snail signal as enrichment over the whole cell extract? How might you normalize these values to account for differences in the number of sequenced reads?

How might you determine whether, within these Twist peaks, the summit of the Snail ChIP signal is at the same location as Twist’s?

Can you think of how to calculate and plot the average Twist ChIP-seq signal surrounding a set of peaks? Hint: use grangesMatrix().

Session information

For reproducibility, this analysis was performed with the following R/Bioconductor session:

R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /n/apps/CentOS7/install/r-3.6.1/lib64/R/lib/libRblas.so
LAPACK: /n/apps/CentOS7/install/r-3.6.1/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] digest_0.6.25                         pander_0.6.3                         
 [3] ggplot2_3.3.2                         BSgenome.Dmelanogaster.UCSC.dm6_1.4.1
 [5] BSgenome_1.54.0                       rtracklayer_1.46.0                   
 [7] GenomicAlignments_1.22.1              Rsamtools_2.2.3                      
 [9] Biostrings_2.54.0                     XVector_0.26.0                       
[11] SummarizedExperiment_1.16.1           DelayedArray_0.12.3                  
[13] BiocParallel_1.20.1                   matrixStats_0.56.0                   
[15] Biobase_2.46.0                        GenomicRanges_1.38.0                 
[17] GenomeInfoDb_1.22.1                   IRanges_2.20.2                       
[19] S4Vectors_0.24.4                      BiocGenerics_0.32.0                  
[21] RColorBrewer_1.1-2                   

loaded via a namespace (and not attached):
 [1] tidyselect_1.1.0       xfun_0.16              purrr_0.3.4            lattice_0.20-41       
 [5] colorspace_1.4-1       vctrs_0.3.2            generics_0.0.2         htmltools_0.5.0       
 [9] yaml_2.2.1             XML_3.99-0.3           rlang_0.4.7            pillar_1.4.6          
[13] glue_1.4.1             withr_2.2.0            GenomeInfoDbData_1.2.2 lifecycle_0.2.0       
[17] stringr_1.4.0          zlibbioc_1.32.0        munsell_0.5.0          gtable_0.3.0          
[21] evaluate_0.14          knitr_1.29             Rcpp_1.0.5             scales_1.1.1          
[25] farver_2.0.3           stringi_1.4.6          dplyr_1.0.2            grid_3.6.1            
[29] tools_3.6.1            bitops_1.0-6           magrittr_1.5           RCurl_1.98-1.2        
[33] tibble_3.0.3           crayon_1.3.4           pkgconfig_2.0.3        ellipsis_0.3.1        
[37] Matrix_1.2-18          rmarkdown_2.3          R6_2.4.1               compiler_3.6.1