Assigning peaks to genes

Setting up your Analysis
The data
Getting Gene Features
- Downloading it for the first time
- Do we need to download it everytime?
Getting the location of all the genes in the genome
- Side note on lists
- Our GRangesList is just a list of GRanges objects
Reduce the transcripts to the gene
How many peaks are within genes?
How many peaks or summits within intergenic regions?
Let's find the nearest gene
Let's find the gene with the closest promoter
Exercise
- Can you plot the distribution of the distances between peaks and nearest gene?
Extras
Other tools that operate on ranges

Setting up your Analysis

First, go to your terminal and create a directory peakAssignment within your chipSeq directory
Then, set your working directory from RStudio Server
- where is your working directory?
Finally, let me introduce you to your new best friend, BioConductor!

http://bioconductor.org

The data

MACS produces BED files, corresponding to the genomic location of DNA enriched in our IP experiments. Let's load them. In my working directory, I have the result for the twist IP

##################################################
## CODE BLOCK #1
##################################################

# Loading a library
library(rtracklayer)

# assigning bed file location to variable
summits_file <- '/home/cws/dl_project/MACS/twist/twist_summits.bed'

# import the BED data
twist.summits <- import.bed(summits_file)

# what kind of object is this?
class(twist.summits)

Yeah, but what is a GRanges

# take a look at it
twist.summits

GRanges object with 13218 ranges and 2 metadata columns:
                        seqnames    ranges strand |             name     score
                           <Rle> <IRanges>  <Rle> |      <character> <numeric>
      [1]                  chr2L      7773      * |     twist_peak_1    3.3996
      [2]                  chr2L      8854      * |     twist_peak_2   18.3901
      [3]                  chr2L     26602      * |     twist_peak_3    15.864
      ...                    ...       ...    ... .              ...       ...
  [13216] chrY_DS485143v1_random       316      * | twist_peak_13216   6.37728
  [13217] chrY_DS485732v1_random       190      * | twist_peak_13217   7.88986
  [13218] chrY_DS485865v1_random       174      * | twist_peak_13218   5.52593
  -------
  seqinfo: 125 sequences from an unspecified genome; no seqlengths

We want to map these peaks to genes. So we need a description of genes in some format.

Getting Gene Features

R provides us a way to access and manipulate features on the genome. We can retrieve gene features from a database (such as UCSC), or from a file.

Downloading it for the first time

##################################################
## CODE BLOCK #2
##################################################

# Load the library
library(GenomicFeatures)

##################################################
### Get gene features
##################################################

# We can get them from UCSC
# txdb <- makeTxDbFromUCSC(genome="mm10", tablename="ensGene")

# We've done our analysis thus far with a particular version of ensembl
# in the form of a GTF file, so let's use that.
gtf_file <- "/home/cws/CompGenomics/dm6/dm6.Ens_98.gtf"
# It could be this easy
txdb <- makeTxDbFromGFF(file=gtf_file)

# What is this thing?
class(txdb)

# examine it
txdb

It's a transcript database object. Like many of the objects we will work with for genomic analysis (see GRanges above), this object has an ability of be aware of the chromosomes, or sequences space, on which the features exist. We can see this at the end of the output for the GRanges object. We can query this information more carefully using an accessor function called seqinfo(). For instance:

seqinfo(txdb)

Seqinfo object with 25 sequences (1 circular) from an unspecified genome; no seqlengths:
  seqnames         seqlengths isCircular genome
  chr2L                  <NA>       <NA>   <NA>
  chr2R                  <NA>       <NA>   <NA>
  chr3L                  <NA>       <NA>   <NA>
  chr3R                  <NA>       <NA>   <NA>
  chr4                   <NA>       <NA>   <NA>
  ...                     ...        ...    ...
  chrUn_CP007120v1       <NA>       <NA>   <NA>
  chrUn_DS483646v1       <NA>       <NA>   <NA>
  chrUn_DS483910v1       <NA>       <NA>   <NA>
  chrUn_DS484581v1       <NA>       <NA>   <NA>
  chrUn_DS484898v1       <NA>       <NA>   <NA>

The object knows something about our sequence set: the chromosomes. But it doesn't know how long the chromosomes should be. We would get the same result if we examined our GRanges object of twist summits.

Chromosomes usually have bounds (unless circular), and it's helpful if the object knows the edges of the chromosomes, in case we want to perform some operation that would go off the ends. Currently the lengths of each chromosome are not set. We should fix that.

The easiest way to find and set the lengths of the chromosomes is using the bioconductor BSGenome package. We're working with Drosophila, so we can load that genome, and use a function to extract out the sequence lengths and set them in our object.

library(BSgenome.Dmelanogaster.UCSC.dm6)

# this gives us access to a genome object:
Dmelanogaster

# Examine seqinfo
seqinfo(Dmelanogaster)

# Examine the help for makeTxDbFromGFF
?makeTxDbFromGFF

We can see from the help file that we can create our object, and hand it parameters that will help us in the future.

# make our transcript db object
txdb <- makeTxDbFromGFF(file=gtf_file, chrominfo=seqinfo(Dmelanogaster), organism="Drosophila melanogaster")

# check the result
seqinfo(txdb)

# Let's save a copy (using a special function)
saveDb(txdb, "D.melanogaster.Ens_98.dm6.db")

Do we need to download it everytime?

# Get gene features from UCSC, unless we already have them
library(GenomicFeatures)
txdbfile <-'./D.melanogaster.Ens_98.dm6.db'

if(!(file.exists(txdbfile))) {
    ## If we already create the txdb object, no need to download it again
    if(!exists('txdb')){
        txdb <- makeTxDbFromGFF(file=gtf_file, chrominfo=seqinfo(Dmelanogaster), organism="Drosophila melanogaster")
    }
    saveDb(txdb, txdbfile)
} else {
  txdb <- loadDb(txdbfile)
}

Getting the location of all the genes in the genome

Our database tells us about transcripts and their parental gene. So, let's retreive a structure of all trancripts grouped by gene.

### Get all the transcripts for each gene
allTx <- transcriptsBy(txdb, by='gene')

# examine the result
allTx

GRangesList object of length 17753:
$FBgn0000003
GRanges object with 1 range and 2 metadata columns:
      seqnames          ranges strand |     tx_id     tx_name
         <Rle>       <IRanges>  <Rle> | <integer> <character>
  [1]    chr3R 6822498-6822796      + |     20656 FBtr0081624
  -------
  seqinfo: 1870 sequences (1 circular) from dm6 genome

$FBgn0000008
GRanges object with 4 ranges and 2 metadata columns:
      seqnames            ranges strand |     tx_id     tx_name
         <Rle>         <IRanges>  <Rle> | <integer> <character>
  [1]    chr2R 22136968-22172834      + |      9418 FBtr0071763
  [2]    chr2R 22136968-22172834      + |      9419 FBtr0100521
  [3]    chr2R 22136968-22172834      + |      9420 FBtr0342981
  [4]    chr2R 22137433-22172834      + |      9421 FBtr0071764
  -------
  seqinfo: 1870 sequences (1 circular) from dm6 genome

Observe two things:

The result is a GRangesList. Thus we can access the elements like a list.
The object knows something about our sequence set, both the sequence lengths AND the version of the genome. This can be very helpful.

Side note on lists

List are collections of other items.

# let's make a list
foo <- list(a=c(1:5), b=LETTERS)

# examine the result
foo

# first position of the list
foo[1]

# first list element
foo[[1]]

### What's the difference?

# examine what gets returned
class(foo[1])

class(foo[[1]])

# access the 14th position of the second thing on the list
foo[[2]][14]

### What happens when we unlist something?
unlist(foo)

Our GRangesList is just a list of GRanges objects

# examine just the 1st element
allTx[1]

# examine just the second element
allTx[2]

Reduce the transcripts to the gene

The gene is essentially the foot print of all the transcripts on the locus. So let's collapse (reduce) all the trancripts into one foot print.

Take a look at the latest paper on GenomicRanges and other genome utilities
The GenomicRanges page at Bioconductor has good documentation.

##################################################
## CODE BLOCK #3
##################################################

### Getting the foot print of all transcripts

# collapse all transcripts into a single range
genic.tx <- reduce(allTx)

# examine the result
genic.tx

# Then, let's get rid of the list format
genes <- unlist(genic.tx)

# examine the result
genes

GRanges object with 17753 ranges and 0 metadata columns:
              seqnames            ranges strand
                 <Rle>         <IRanges>  <Rle>
  FBgn0000003    chr3R   6822498-6822796      +
  FBgn0000008    chr2R 22136968-22172834      +
  FBgn0000014    chr3R 16807214-16830049      -
  FBgn0000015    chr3R 16927212-16972236      -
  FBgn0000017    chr3L 16615866-16647882      -
          ...      ...               ...    ...
  FBgn0286199    chr3R 24279572-24281576      +
  FBgn0286203    chr2R   5413744-5456095      +
  FBgn0286204    chr3R   8950246-8963037      -
  FBgn0286213    chr3L 13023352-13024762      +
  FBgn0286222     chrX   6678424-6681845      +
  -------
  seqinfo: 1870 sequences (1 circular) from dm6 genome

How many peaks are within genes?

We have a description of peaks, we have a description of genes. We can ask our first question: how many genes overlap with a peak?
There's a function for that, find the help: ?countOverlaps

### Count how many peaks overlap each gene
overlap.genes <- countOverlaps(genes,twist.summits)

### What do you get here?
head(overlap.genes)

length(overlap.genes)
length(genes)

sum(overlap.genes)

# Is there an easy way to summarize or *tabulate* the results?

We gave the function a query and a subject, so we get an answer for each element of the query.

How many peaks or summits within intergenic regions?

Since we're studying a transcription factor, we would really like to know also how peaks map to intergenic regions.

Question: How do we define intergenic regions?

##################################################
## CODE BLOCK #4
##################################################

### Define Intergenic Regions

# make a copy of gene locations
genic <- genes

# remove strand information
strand(genic) <- '*'

# reduce gene overlaps to one feature
genic <- reduce(genic)

# Now that we have a definition of all "genic" feature space
# we can take the inverse of this by subtracting it from the
# chromosome definitions.

### Get the chromosome info from the db and make a GRanges out of it
chroms <- as(seqinfo(txdb),'GRanges')

#########################################################
### Take the set difference between chromosomes and genes
#########################################################
intergenic <- setdiff(chroms,genic)

# examine the results
intergenic

# Did it really work?
# Make a bed file so we can check it in IGV
export(intergenic,'intergenic.bed')

Counting peaks in intergenic regions

# Now we can count peaks in intergenic regions
overlap.intergenic <- countOverlaps(intergenic,twist.summits)

# What do we get?
head(overlap.intergenic,25)

sum(overlap.intergenic)

# Let's tabulate the results
table(overlap.intergenic)

Let's find the nearest gene

The GenomicRanges package offers a lots of very nice utilities to find the neareast feature from one list in another. Let's look at the documentation!
Look up the help for the nearest() function: ?nearest
let's use nearest()

##################################################
## CODE BLOCK #5
##################################################

# Let's use a function straight out of the box
nearestGene <- nearest(twist.summits,genes)

# examine the results
head(nearestGene)

The answer comes back as an index vector.

### Now, let's get the genes
genes[nearestGene]

# any problematic values?
sum(is.na(nearestGene))

# in case NA values are returned we can avoid them
genes[nearestGene[!is.na(nearestGene)]]

# let's fix the index vector
nearestGene <- nearestGene[!is.na(nearestGene)]

Let's now get the gene ids

### Get the set of gene IDs near peaks
geneNearPeaks <- unique(names(genes[nearestGene]))

head(geneNearPeaks,15)

Cool! Now, you can ask all the questions you want!

Mine is: Is that really what you want to find???
Why not?

We want peaks near genes, but since we're studying a transcription factor we don't want the closest peak to just any part of the gene. We want peaks close to the promoter. What's the best way to define that?

Let's find the gene with the closest promoter

How can we find the promoters???

The promoter is define as something close to the transcription start site. Let's get all the start sites for all transcripts for all genes! First, let's get all transcripts as GRanges, i.e. not grouped by genes.

##################################################
## CODE BLOCK #6
##################################################

### get all transcript definitions
allTx.gr <- unlist(allTx)

### Let's define the TSS, a 1 nt window at the 5' end
### i.e. at the START of the transcript.
### Fortunately, resize() knows what the start of a transcript is.
TSS <- resize(allTx.gr,width=1,fix='start')

### Is this really all tss??? Let's look at them!
TSS

export(TSS,"TSS.bed")

Ok… Let's stop for a moment and reflect about all that good mojo!

Did I really just pulled out the TSSs? What about the strand of the transcript?

Let's examine then another way.

##################################################
## CODE BLOCK #7
##################################################

### We can use the strand() accessor function
### to create a boolean vector to divide transcripts by strand

# get all Watson
allTx.gr[strand(allTx.gr) == '+']
# get all Crick
allTx.gr[strand(allTx.gr) == '-']

# actually let's just get one of each
t1 <- allTx.gr[strand(allTx.gr) == '+'][1]
t2 <- allTx.gr[strand(allTx.gr) == '-'][1]

## Examine them
## Take notice to their start and end
t1
t2 

## What happens when we resize them?
resize(t1,width=1,fix='start')
resize(t2,width=1,fix='start')

Now we have a reference point for our promoter. Let's find the TSS nearest to all of our twist ChIP-Seq summits.

##################################################
## CODE BLOCK #8
##################################################

### Get a GRanges object of the nearest TSSs
nearest.iv <- nearest(twist.summits,TSS)

# test for NA values
sum(is.na(nearest.iv))

# some peaks have no genes nearby. Which ones?
twist.summits[is.na(nearest.iv)]

# Why don't these peaks have any associated genes?
seqnames(twist.summits[is.na(nearest.iv)]) %in% seqnames(TSS) 

# Let's get rid of those.
twist.summits <- twist.summits[!is.na(nearest.iv)]

# now we should be able to do the following
nearest.tss <- TSS[nearest(twist.summits,TSS)]

## Examine the result.
## Fortunately, the names of the TSS are the name of the genes...
nearest.tss

## Sometimes we sample things randomly, or not.
set.seed(666)
sample(unique(names(nearest.tss)),15)

Exercise

Can you plot the distribution of the distances between peaks and nearest gene?

hint: look up the help for nearest()

Extras

the ID of the gene whose promoter the peak summit is nearest
the strand of that gene
the (signed) distance from the summit to that gene’s tss (an upstream summit would have a positive distance)

export(twist.summits, "twist_summits_anno.bed")

Other tools that operate on ranges

There is a command-line suite of tools for operating on BED files called: bedtools. Some of the man pages have nice figures illustrating complex intersections or how to query for the closest feature.

Michael Lawrence, the author of GenomicRanges, wrote a library/tutorial for showing how to do many bedtools operations using the genomic ranges utilities in R. It is called: HelloRanges.