GO analysis

Objectives
BioMart
- Using BioMart to retreive info from Ensembl
- Using Ensembl
The other source of information in R: the AnnotationDbi
Using GOstats
On our data
Doing the real GO analysis
Homework
Extras

Objectives

Learn how to use R and BioMart to retrieve gene annotations and ontology information automatically, and how to perform a GO enrichment analysis on a gene set.

BioMart

In the first R tutorial, we had to manually assign the gene name to a gene ID. Is there a more systematic way of doing this? Is there other information we can retrieve?

Of course! And here I present you biomaRt.

Using BioMart to retreive info from Ensembl

Ensembl (http://www.ensembl.org) is a rich source of genomic information you should get familiar with. It is a project between the European Bioinformatics Institute (EBI) and the Wellcome Trust Sanger Institute devoted to genomic information of model organisms. Data from it can be accessed from R using the biomaRt package.

Like all bioconductor packages, there is a good vignette on biomaRt. However, you can also access the pdf vignettes from within R: browseVignettes(package = "biomaRt")

Let's try grabbing some gene information from within R.

##################################################
### Exploring biomaRt
##################################################

# load the library for biomart
library("biomaRt")
# list the available marts
listMarts()

The default marts are from Ensembl.

Using Ensembl

# We want to use the database from Ensembl
# We create a Mart object, and tell it to use ensembl
ensembl <- useMart("ensembl")
# what is this thing?
class(ensembl)

# but this database has lot of data sets, which do we want?
listDatasets(ensembl)

There are a lot. How would we find one for Mouse? See any patterns? What do we do with patterns?

grep("dmel", listDatasets(ensembl)$dataset)

How would you use this to find the data set of interest?

Set up our mart in one step

If we know what we're looking for we can set it up in one line.

# create the Mart object for the ensembl database, choose the mmus dataset
ensembl <- useMart("ensembl", dataset="dmelanogaster_gene_ensembl")

We have a mart pointing at the right place, but we still have exploring to do. We've interacted with our mart using a function listDatasets(), now we use another function to find out what's in the database: listAttributes()

### What information can we access?
atts <- listAttributes(ensembl)

# examine the result
head(atts)

# are there a lot?
dim(atts)

Let's take a look at what we can get.

I know there are some flybase specific attributes in there. Let's see if we can find them.

grep("fly", atts[,1], ignore.case=TRUE)

TODO: use the expression above to view the matching rows of the attribute matrix.

Can we get GO terms?

atts[grep('GO',atts$description),]

The main query function of biomaRt is getBM()

Let's take a look at some different kinds of ids.

# entrezgene_trans_name
id2gene <- getBM(mart=ensembl, attributes=c('flybase_gene_id','flybasename_gene', 'ensembl_gene_id', 'external_gene_name', 'description'))
# examine the result  
head(id2gene)

# could we have found our genes of interest?

With getBM() we can return attributes, and we can set up queries for more specific information by using filters and specific values.
What if we just want information for the genes on chr4?

# let's see what we can filter on
listFilters(ensembl)

# Get our attributes just for the genes on chr4
getBM(mart=ensembl,
      attributes=c('flybase_gene_id', 'flybasename_gene', 'external_gene_name'),
      filters="chromosome_name",
      values="4")

# or more to the point, get information for our genes of interest
getBM(mart=ensembl,
      attributes=c('flybase_gene_id', 'external_gene_name', 'chromosome_name', 'start_position','end_position'),
      filters="external_gene_name",
      values=c("Tl","gd"))

# Can we get GO terms associated with gene?
getBM(mart=ensembl,
      attributes=c('flybase_gene_id','external_gene_name', 'go_id', 'name_1006'),
      filters="external_gene_name",
      values="Tl")
# notice some of the attributes we have to explicitly look up
# grep('GO',atts$description),]

# can we grab homologs from Cave Fish?
getBM(mart=ensembl,
      attributes=c('ensembl_gene_id', 'external_gene_name', 'amexicanus_homolog_ensembl_gene'),
      filters="chromosome_name",
      values="2L")

Build an annotation table for all our genes

Usually we might build a large annotation data frame matching our gene expression results so that the same index vectors can be used for both, and we can export the results to a spreadsheet.

anno <- getBM(mart=ensembl, 
                 attributes=c("ensembl_gene_id", "external_gene_name", "transcript_biotype", 
                                "chromosome_name", "start_position", "end_position", "strand", "description"))

# Examine the result
head(anno)

Once we load our data, we can use this to annotate it.

The other source of information in R: the AnnotationDbi

These are annotation objects from which we can extract data. They are created by bioconductor for many model organisms.

##################################################
### Loading the M. musculus database
##################################################
library(org.Dm.eg.db)

### What is this?
org.Dm.eg.db

Once we load this library, we have a new object in our environment (even though we can't see it, like LETTERS). The object is called: org.Dm.eg.db. We interact with it using accessor functions. The things to remember are: columns, keys, keytypes, and select.

We can think of it as a database, and databases have columns.

# what kind of data are available?
columns(org.Dm.eg.db)

Let's extract some data
We use keys to extract data

# examine some keys of type ENSEMBL
head(keys(org.Dm.eg.db,keytype='ENSEMBL'))

You can use the select() function to build large data frames of gene information based on your criteria.

# Return all keys and grab 10 of them randomly using sample
k <- sample(keys(org.Dm.eg.db,keytype='ENSEMBL'), 10)

# Call select to grab information for our keys
select(org.Dm.eg.db, keys=k, columns=c("SYMBOL","GENENAME"), keytype="ENSEMBL")

Now, it's time to start our GO anlysis

Using GOstats

There are many packages available for performing GO term analysis. GOstats was written by Robert Gentleman, one of the cofounders of R and bionductor, and is relatively simple to use. For some packages you have to extract all the GO terms from a database, but GOstats can use the GO terms found in the bioconductor annotation objects directly.

Let's play a little analysis trick! Let's take a random sample of 500 genes and see if we can come with something interesting! At the same time, learning how to perform a simple GO analysis

library(GOstats)
library(org.Dm.eg.db)

set.seed(666)
universeIDs <- keys(org.Dm.eg.db,'ENTREZID')
selectedIDs <-  sample(universeIDs,500)

goParams <- new("GOHyperGParams",
              geneIds=selectedIDs,
              universeGeneIds=universeIDs,
              annotation="org.Dm.eg.db",
              ontology="BP",
              pvalueCutoff=0.001,
              conditional=FALSE,
              testDirection="over")

goResults <- hyperGTest(goParams)

summary(goResults,categorySize=10)

On our data

Filtering the data

Locate your RNA Seq DE analysis files to run this on your data. The one below is the DE analysis from edgeR that I

saved as an object in my project directory. You should find yours.

##################################################
### Let's load the data
##################################################
library(edgeR)
load("edgeR.RData")

# adjust p-values
et$table$p.adj <- p.adjust(et$table$PValue, method="BH")

### Let's create some DE gene lists
# up
tol10b.enr <- et$table$logFC > 1 & et$table$p.adj < 0.05

# down
gd7.enr <- et$table$logFC < -1 & et$table$p.adj < 0.05

# count the results
sum(tol10b.enr)
sum(gd7.enr)

We now have two boolean vectors to target up or down gene sets in our universe of expressed genes.

Getting gene information from the Annotation DB

GOstats, like many analysis packages, takes only ENTREZ gene ids.
We have only Flybase ids for our lists, what if we want to annotate them with more information. like other ids, chromosomal locations, etc.?
We can grab information for any given list of genes using the select() function and the AnnotationDBi.

# Load the annotation database
library(org.Dm.eg.db)

##################################################
### The AnnotationDBi is based on NCBI. So first off
### are there gene ids in our result files
### that are not in NCBI database?
##################################################
sum(! rownames(et$table) %in% keys(org.Dm.eg.db,'ENSEMBL'))

# we can remove these from the analysis later

##################################################
### Let's find the gene symbols to the corresponding
### gene id and print some results
##################################################

# use select, and our top genes as the keys
head( select(org.Dm.eg.db,
             keys=rownames(et$table[tol10b.enr,])[1:25],
             columns=c("SYMBOL","GENENAME","ENSEMBL"),
             keytype="ENSEMBL"),25)

Looking at the up genes

# map flybase ids to gene symbols 
up.info <- select(org.Dm.eg.db,
            keys=rownames(et$table[tol10b.enr,])[1:25],
            columns=c("SYMBOL","GENENAME","ENSEMBL"),
            keytype="ENSEMBL")

# create a submatrix of our up genes
# we can take our selection, and
up.genes <- (et$table[tol10b.enr,])[1:25,]

# combine (column bind) our annotation and expression results
cbind(up.info,up.genes)

Anyone of these genes you make you feel happy?

Doing the real GO analysis

First, creating our universe and affected genes, converting them to Entrez IDs

##################################################
### Defining our universe and a select gene set of interest
##################################################
universeIDs <- select(org.Dm.eg.db,
                      keys=rownames(et$table),
                      columns='ENTREZID',
                      keytype="ENSEMBL")$ENTREZID

# check nonmapped and unique
length(universeIDs)
sum(is.na(universeIDs))
unique(universeIDs)

universeIDs <- unique(universeIDs)

selectedIDs <- select(org.Dm.eg.db,
                      keys=rownames(et$table[tol10b.enr,]),
                      columns='ENTREZID',
                      keytype="ENSEMBL")$ENTREZID

## Removing non-mapped IDs (ie NAs)
universeIDs <- universeIDs[!is.na(universeIDs)]
selectedIDs <- selectedIDs[!is.na(selectedIDs)]

Then, running the GO analysis on Biological Processes

#################################################
## running the GO analysis on the Biological Processes
#################################################
# Create an instance of the analysis object
goParams <- new("GOHyperGParams",
                 geneIds=selectedIDs,
                 universeGeneIds=universeIDs,
                 annotation="org.Dm.eg.db",
                 ontology="BP",
                 pvalueCutoff=0.001,
                 conditional=FALSE,
                 testDirection="over")

# Run the tests
goResults <- hyperGTest(goParams)

# summarize the results
summary(goResults,categorySize=10)

Then, on the Molecular Functions

##################################################
### running the GO analysis on the Biological Processes
##################################################
goParams <- new("GOHyperGParams",
                geneIds=selectedIDs,
                universeGeneIds=universeIDs,
                annotation="org.Dm.eg.db",
                ontology="MF",
                pvalueCutoff=0.001,
                conditional=FALSE,
                testDirection="over")

goResults <- hyperGTest(goParams)

summary(goResults,categorySize=10)

We could continue in this fashion for Cellular Component, and the remaining gene lists. However, it is also possible to shorten the process and use less code, by simply making a copy of the analysis object, and modifying the key parameters that need to be changed. The only thing we are changing between each analysis is either the ontology (BP,MF,CC), or the input gene list.

#################################################
### Examine the BP ontology for the down genes
#################################################
# get the IDs for the down genes
selectedIDs <- select(org.Dm.eg.db,
                      keys=rownames(et$table[gd7.enr,]),
                      columns='ENTREZID',
                      keytype="ENSEMBL")$ENTREZID

# Removing non-mapped IDs (ie NAs)
selectedIDs <- selectedIDs[!is.na(selectedIDs)]

### Now do the GO analysis
# Copy our existing GOHyperGParams object
goP.dn.bp <- goParams

# assign the new gene set to it
geneIds(goP.dn.bp) <- selectedIDs

# assign the ontology
ontology(goP.dn.bp) <- "BP"

# run the analysis
goP.dn.bp.res <- hyperGTest(goP.dn.bp)

# summarize the results
summary(goP.dn.bp.res,categorySize=10)

Homework

Save the different GO analyses to disk as txt or csv files
Come up with a way to graphically represent GO enrichment

Extras

Look up the help on the hyperGTest

Get the genes behind a GO term using biomart or the AnnotationDbi object