Identify peaks using MACS

1. Objective
2. Choose a directory
3. Run MACS
4. Examine the results
5. Load the peak.bed file into a browser
6. Use R to characterize your peak data
7. Loading the .narrowPeaks file into R

1 Objective

MACS is a program for finding peaks in ChIP Seq data. Your goal is to find peaks in various chIP Seq data sets.

2 Choose a directory

As with the other steps of your project, choose a directory where you can run MACS and store the results. You may want to create a separate directory for each comparison made by MACS, as several result files are generated for each run.

3 Run MACS

MACS will run directly on BAM files. There are two versions of the program: macs, and macs2. We will use macs2, but the manuals for both are instructive, and you should examine them.

Assuming you've created a directory to run macs, and moved to the that directory, a command to run macs on 1 ip and 1 wce file would look like the following:

macs2 callpeak -t ../BAM/twist_ip.bam -c ../BAM/snail_twist_wce.bam -f BAM -g dm --name twist -q 0.01 --nomodel

MACS generates diagnostic output as it is running. It can be useful to capture this output in case macs fails, or if you happen to be runnning macs on several files and you'd like to have a record of the results. To capture the ouput we can re-direct it to a file. To do this, you can simply add the following to the end of your command: 2>stderr.txt. The number 2 at the very end of a command is a way to capture the standard error stream from a program, and we simpy re-direct it to the file stderror.txt. Of course we can choose any file name we want. The standard error is a place for programs to have output that is different from the standard out output that we normally see, or re-direct to files or other programs.

Using this trick on our command gives us the following:

macs2 callpeak -t ../BAM/twist_ip.bam -c ../BAM/snail_twist_wce.bam -f BAM -g dm --name twist -q 0.01 --nomodel 2>macs2_msgs.txt

4 Examine the results

Once MACS is finished running, you should have several output files. Generally the first thing to do is list them and look at their size. In particular, you want to see that the BED file of peaks looks like it has something in it. Try running word count on it to see how many peaks you have.

In macs version2 the BED file with peaks has a suffix: narrowPeak. Change the name of this file so that it has the .bed extension. In my case, my peaks file for twist is called: twist_peaks.narrowPeak, so I wuold change is as follows:

mv twist_ip_peaks.narrowPeak twist_ip_peaks.bed

5 Load the peak.bed file into a browser

If you have IGV open, and you have access to your peaks file, you can drop it onto your IGV window and visualize the peaks.

6 Use R to characterize your peak data

You can easily use R to examine some of your peak attributes. One way to do this is using RStudio on the server: http://franklin.stowers.org

The main result file has a .xls extension but is really just a text file. You can read it into R as a table of data. It has a 25 line header containing information about the MACS run before the MACS results actually start. You can tell R to skip this header and read the results into a table.

The code below assumes your project is in a directory in your home dir, and that the MACS results are also in their own sub-directory. Change the directory paths to fit your results.

# set the working directory
setwd("/home/username/projectDir")
# read in the peak data
pdata <- read.table(file="MACS/twist_ip_peaks.xls", skip=22, header=T, sep="\t", as.is=T)

# check the column names, see if things imported correctly
# and see what kind of data is returned
head(pdata)

# fix the column names for p-value, fold enrichment, and q-value
colnames(pdata)[7:9] <- c("pval","fe","qval")

# examine the distribution of enrichment values using a histogram
hist(pdata$fe, main="Fold Enrichment")

Try examining a histogram of the p-values, and peak lengths.

7 Loading the .narrowPeaks file into R

When MACS2 produces the bed file of peaks, they are in a special BED format called BED6+4. To read them into R with the rtracklayer library, you need to give some extra information to the import() function. You can read the details on the Genomics Blog, but the short version is here:

# define extra BED column values
extraCols_narrowPeak <- c(FoldChange="numeric", pVal="numeric",
			  qVal="numeric", summit="integer")

# import our peak data
peaks <- import.bed("my_peaks.narrowPeak", extraCols=extraCols_narrowPeak)