Align ChIP Seq data with bowtie2

1. Objective
2. Locate FASTQ files
3. Find a Server
4. Get comfortable with bowtie
5. Start building the bowtie2 command
6. Setting up a full command with bowtie2 and samtools
7. creating a script to loop through a set of files
8. extra data sets

1 Objective

Learn how to use the bowtie2 program to align ChIP-Seq data to the fly genome.
Learn how to convert bowtie2 SAM output to BAM format, sort and index the results.
Learn how to write a script to repeat this process for many files.

2 Locate FASTQ files

The data is located in the class directory in the folder called Data:

/home/cws/CompGenomics/Data/ChIP-Seq/

3 Find a Server

You know how to do this.

4 Get comfortable with bowtie

Bowtie2 is simply an alignment program, so try aligning a few sequence reads with it, and see what the output looks like. It can be helpful to look at the bowtie2 manual.

To run bowtie2, you need an alignment index. We can find a bowtie2 index where the other indexes are. We specify it using the path and the root file name.

/home/cws/CompGenomics/dm6/bowtie2/dm6

syntax for the command is:

bowtie2 -x index_file fastq_sequence_file

However, bowtie can take sequence in a number of ways. You can try giving it a sequence on the command line. Specifying -c accomplishes this. Let's try an example using a drosophila index for quicker results:

bowtie2 -x /home/cws/CompGenomics/dm6/bowtie2/dm6 -c CGGCGGAGTCGACC

We can suppres the header lines if we want:

bowtie2 -x /home/cws/CompGenomics/dm6/bowtie2/dm6 --no-hd -c CGGCGGAGTCGACC

Try altering the sequence to see what happens.

This is good for exploring alignment and reporting options.

5 Start building the bowtie2 command

Let's give it a few reads from our experiment in a file, to work out our real command. Grab a few from one of the fastq files. Each sequence record in a fastq file consists of 4 lines. So grab some number of records, and put them into a file called foo.fq. Remember that the fastq files are compressed, so to grab a few lines you have to decompress on the fly, and use pipes:

# grab 2 sequence reads
gunzip -c /home/cws/CompGenomics/Data/ChIP-Seq/Dme_Dl_Tl10b_2to4h_1_ip.fastq.gz | head -8 > foo.fq

# examine the file
cat foo.fq

Run this command using your foo.fq file:

bowtie2 -x /home/cws/CompGenomics/dm6/bowtie2/dm6 foo.fq

What kind of output do you get?

Now we know we can run bowtie2 on a file and get results. So let's flesh it out.

Bowtie2 can read compressed data or uncompressed data, so we can hand it our files directly without decompressing them. The output is SAM output, and we would like BAM output. BAM files are a standard part of many genomic pipelines and analysis routines, yet we can't see them using normal programs. Rather we use a suite of tools for viewing and manipulating them. This suite is called samtools.

Samtools can take a SAM file as input, and covert it into BAM output. The input can be a simple SAM file, or the input can be read on stdin. This means we can give samtools input using a the pipe character. This will allows us to convert the SAM results to BAM results as they emerge from bowtie2.

6 Setting up a full command with bowtie2 and samtools

As before, we can set a variable to hold our alignment index:

# declare a variable for our index
INDEX=/home/cws/CompGenomics/dm6/bowtie2/dm6

Although the bowtie2 output is SAM format, there are a few extra lines coming out. They may have useful information, so we may want to save them, but they are not part of our SAM output. Instead they are on a different output stream called STDERR. We seem them, because our terminal shows us data from two streams: STDOUT, and STDERR. We can use a trick to redirect one of these streams to a file.

In the example below we run bowtie2 on an input fastq file, capture the STDERR out to a log file, and then pipe the other results to samtools, using the samtools view function to convert SAM input to BAM output, and we tell samtools to get it's input from STDIN (the pipe character) using a dash at the end of the command.

Given what we have so far, the structure of a full alignment command for bowtie would resemble the following:

bowtie2 -x $INDEX -p 2 -U input.fastq 2>>bowtie_log.txt | samtools view -S -b -o output.bam -

Two things to pay attention to are the input and output files. The names used above are simply examples. The input file you will be using is located in the class data directory, and has a different name. To use the input file you have to reference it by it's full address, unless it is in your current directory. We can type out the full pathway to the file, or we can use a variable to help us, so that if we have to run the same command on many files we can more easily manage the locations and do less typing. The location of the data files is defined at the top of this page. We can store that location in a variable, and then reference it. Thus for the first data file we can do the following:

# set a variable to hold the location of the data directory
DATA_DIR=/home/cws/CompGenomics/Data/ChIP-Seq
# align oct4
bowtie2 -x $INDEX -p 2 -U $DATA_DIR/Dme_Dl_Tl10b_2to4h_1_ip.fastq.gz 2>>bowtie_log.txt | samtools view -S -b -o Dme_Dl_Tl10b_2to4h_1_ip.bam -

I also changed the name of the output file to match that of the input file, by simply using the suffix of the input file as the base name of the output file.

If you're going to run this command, and it's going to take a while to complete you should consider using screen, or nohup as discussed in the RNA-Seq tutorial. An example using nohup is shown below. Make sure you're in the directory where you want the results to go, and then put nohup in front of your command, and an ampersand (&) at the end.

nohup bowtie2 -x $INDEX -p 2 -U $DATA_DIR/Dme_Dl_Tl10b_2to4h_1_ip.fastq.gz 2>>bowtie_log.txt | samtools view -S -b -o Dme_Dl_Tl10b_2to4h_1_ip.bam - &

The ampersand at the end sends the process into the background, giving you your command line back. And nohup will tell the computer to keep the process alive even after you log out.

You might ask yourself, is there a way to do all this programmatically? If you had a dozen alignments (or several dozen), would you want to type all that stuff out and make sure everything matches - and hope you didn't make a mistake? Why not use a loop and soem variable to do all the work for you?

7 creating a script to loop through a set of files

There are many ways to script this activity. Listed below is one example. The script below could be made more general by reading in all files ending in a pattern, such as fastq.gz.

#!/bin/bash

# set variables
INDEX=/home/cws/CompGenomics/dm6/bowtie2/dm6
DATA_DIR=/home/cws/CompGenomics/Data/ChIP-Seq

FILES="snail_ip.fastq.gz
snail_twist_wce.fastq.gz
toll10b_4h8_2to4h_1.fastq.gz
twist_ip.fastq.gz
Dme_Dl_Tl10b_2to4h_1_ip.fastq.gz
Dme_Dl_Tl10b_2to4h_2_ip.fastq.gz
Dme_Tl10b_2to4h_1_wce.fastq.gz"

for f in $FILES
 do
   base=$(basename $f .fastq.gz)
   echo $base
   echo $f
   echo $f >> bowtie_log.txt
   bowtie2 -x $INDEX -p 2 -U $DATA_DIR/$f 2>>bowtie_log.txt | samtools view -S -b -o $base.bam -
   echo "sorting $base.bam"
   samtools sort $base.bam -T $base -o $base.bam
   echo "indexing"
   samtools index $base.bam
done

If you create this script, you have to make sure that the resulting file is executable. You can give the file execute permission using the chmod command. For instance, if your file was called align.sh, you could do the following:

# make the file executable
chmod o+x align.sh
# run the alignment command
./align.sh

8 extra data sets

There is some extra data in the data directory. You can always grab data from GEO. It can be a little non-trivial but simply ask for help.