Alignment of RNA Seq data with STAR

1. Objectives
2. Prepare for an alignment
3. Run STAR
- 3.1. Run STAR on your data files
  - 3.1.1. Learn how to use shell variables
- 3.2. The Alignment Index
  - 3.2.1. Run STAR with all our parameters
4. You can script all of this
5. Check the Results and assemble a complete data set
6. Extras
- 6.1. Learn how to run programs while you're away
  - 6.1.1. screen
  - 6.1.2. nohup

1 Objectives

The goal of this exercise is to align FASTQ files to the mouse genome using STAR. The summary for this project describes the background information.

2 Prepare for an alignment

2.1 Find a Server

In order to align your RNA Seq data, you have to find an available server. Try logging into various servers and see if they look busy. You can use the top command to check server usage. You can also use ganglia to see the status of servers via web browser.

Here is a table of servers at the institute:

Server	cores	Memory
cedar	32	450 GB
hickory	64	512 GB
maple	24	128 GB
cypress	32	1 TB

Addenda: this is the server we'll be using:

Server	cores	Memory
franklin.stowers.org	16	64 GB

2.2 Choose a working directory

Think about how to organize your work up front. Each project you work on generates many files. Create a directory for your project to organize the work and results.

2.3 Choose a system to record your work

You will be executing commands, and taking notes on the results. Just like in the lab, you need a method for recording your work on the computer. A simple text file is one way of recording commands, file pathways, notes about problems or run times, etc. Every project directory should have some kind of descriptive file with information about how the results in the directory were generated. There may be more than one. A simple README or notes.txt file can record basic information, and can exist alongside more polished files that describe or summarize results.

3 Run STAR

The STAR manual has a description of how to run the program and all the options you can supply to STAR. It's a good idea to look this over.

To run STAR on your FASTQ files you generally need three things:

the location of your FASTQ files
a STAR index of the genome
a GTF file of transcript descriptions

The FASTQ files are located in the class Data directory:

/home/cws/CompGenomics/Data/RNA-Seq/

What are these files? How can we view them?

At the Stowers, we have many genome indexes for various organisms that have already been created for you. But since we're not on the internal network, our STAR alignment index is in:

/home/cws/CompGenomics/dm6/

You can see that this directory points to resources around the dm6 version of the fly genome, and these indices have been constructed using version 98 of the Ensembl annotations for mouse genes. There is a GTF file of transcript descriptions there. Can you find it?

The general structure of the STAR command to align some data looks as follows:

STAR NumberOfProcessors IndexDirectory FASTQ_files

We will be specifying these options, and a few others to the program using command-line arguments that begin with a double-dash, as described in the following table:

Argument	Description	Example Value
–runThreadN	Number of processors to use	2
–genomeDir	Path to index directory	/home/cws/CompGenomics/dm6/Ens98_STAR_51
–readFilesIn	path to input fastq file(s)	/home/cws/CompGenomics/Data/RNA-Seq
–sjdbGTFfile	GTF file of Transcripts	/home/cws/CompGenomics/dm6/dm6.Ens_98.gtf
–outSAMtype	output file type	BAM SortedByCoordinate
–quantMode	Count reads per gene	GeneCounts
–readFilesCommand	Command for data decompression	zcat
–outFileNamePrefix	Tag output files with custom prefix	sampleID

When working out a complex command, it's a good idea to test it on some sample data. Let's use our unix skills to grab some data and make a toy data file to play with as input.

There's a data file in the class data directory, we can non-destructively uncompress it and take the top 1000 lines, and then re-direct that to a local file in our current directory:

# grab 1000 lines of fastq data from a file
gunzip -c /home/cws/CompGenomics/Data/RNA-Seq/gd7_1.fastq.gz | head -1000 > sample.fastq

For now let's just begin with the simplest example possible: align our toy data using STAR

STAR --runThreadN 2 --genomeDir /home/cws/CompGenomics/dm6/Ens_98/STAR_51bp --readFilesIn sample.fastq

The command should complete within a minute or two. When it's done, explore your directory: ls -al
What do you see?

Once you explore the files, remember to remove them. They are simply a test result. You don't really need to keep them, unless you want to compare them to some other results generated with different parameters.

# clean up
rm *.log Aligned.out.sam SJ.out.tab

3.1 Run STAR on your data files

If the steps above are working, we can move on and use STAR to align all of our data. We will be using all of the options in the table above because we need to accomplish several things:

We will use more threads (processors). Change --runThreadN to 3.
Our input data is compressed in gz format, so we will use --readFilesCommand to supply STAR with the decompression method: zcat or gunzip -c.
We want to supply a set of transcript descriptions, using --sjdbGTFfile and the location of our GTF file.
We want to quantify read counts on genes, so we supply the --quantMode argument.
We need a BAM file as output. We use --outSAMtype to specify this.
We use --outFileNamePrefix to track the output files for each sample separately.

3.1.1 Learn how to use shell variables

Using all the variable above will make for some long unwieldy commands to type. Some of the things we want to include in our command have long path names. We can make things easier by assigning some of our resources to variables. This makes them easy to re-use, and more amenable to putting into a script, in case we want to run somethign on lots of files (as is common in genomics).

For instance, the location of the gtf file is: /home/cws/CompGenomics/dm6/dm6.Ens_98.gtf Instead of typing that out each time we want to use it, we can assign it to a variable, and then we can use the variable in all our commands. The variable is only defined in your current shell session, and will go away when you log out.

(FWIW, the pathways above are not that long, but if we were on our regular servers, some of the resources are in places with very long pathways)

# we can assign things to a variable using the equals sign
gtf=/home/cws/CompGenomics/dm6/dm6.Ens_98.gtf

Now we can use that file more easily. We use a dollar sign in front of the variable name to retreive the value. Let's try some commands to check to make sure our variable is properly defined.

# check that we set a value
echo $gtf

# check the file
ls -ah $gtf

# how many lines are in the file
wc $gtf

3.2 The Alignment Index

When you run STAR, you need an aligment index. Where do they come from?

The alignment index is created by STAR using a copy of the target genome. It can be created with or without gene annotations. If you create it with gene annotations, then you don't have to supply them when you align your fastq data. We created the index for you, with gene annotations, as follows. You do not have to run this step.

gtf=/home/cws/CompGenomics/dm6/dm6.Ens_98.gtf
genome=/home/cws/CompGenomics/dm6/dm6.fa
STAR --runThreadN 4 --runMode genomeGenerate --genomeDir /home/cws/dm6/Ens98_STAR_51 --genomeFastaFiles $genome \
 --sjdbGTFfile $gtf --sjdbOverhang 51

You can use this existing index as is. Because it was created with the GTF file, you do not have to supply gene annotations when you run STAR on your fastq files.

3.2.1 Run STAR with all our parameters

# set up some variables
fastq_base=/home/cws/CompGenomics/Data/RNA-Seq
gd=/home/cws/CompGenomics/dm6/Ens98_STAR_51

# run STAR
STAR --runThreadN 3 --genomeDir $gd --readFilesIn $fastq_base/gd7_1.fastq.gz  --outSAMtype BAM SortedByCoordinate \
  --quantMode GeneCounts --readFilesCommand zcat --outFileNamePrefix gd7_1_

In the command above we define two variables. One of them is used as the path to our data, and we use it to help us access our individual data files $fastq_base/data.fastq.gz. The second new thing is that while this command is meant to be typed as one long line, I have used a backslash \ to make the command fit onto the page. Using the backslash is a way to tell the shell interpreter that your command extends on the next line.

4 You can script all of this

The command above is good for aligning one file. But what if you have many files? If you feel ambitious, you can write your command in a script so that it gets executed over and over again and is customize to run on each input file.

Writing a script can be as simple as putting your commands into a file, one after the other. You can then run the script like a program, and your commands will all be executed in order.

However, putting things into a script opens up a new world of possibilities. Here is an example of a script that runs STAR for all the gd7 samples:

#!/bin/bash

# define variables
index=/home/cws/CompGenomics/dm6/Ens98_STAR_51
# get our data files
FILES=/home/cws/CompGenomics/Data/RNA-Seq/gd7_*.fastq.gz

for f in $FILES
do
    echo $f
    base=$(basename $f .fastq.gz)
    echo $base
    STAR --runThreadN 3 --genomeDir $index --readFilesIn $f --outSAMtype BAM SortedByCoordinate \
         --quantMode GeneCounts --readFilesCommand zcat --outFileNamePrefix $base"_"
done

echo "done!"

Some things to notice:

The first line is special. It tells the computer what program will understand and execute your code. the bash program is the same program that gives you your command line.
We can use a for loop to do something over and over again, in this case, do something for each input data file.
We use a program called basename to extract pieces of our input file name, and assign them to a new variable.
We use echo to check that our variables contain the values we think they do.
You can build this command up, line by line. You can run it by commenting out the STAR part to make sure everything makes sense before running that part.
You can also place echo in front of the STAR command, which will print out the command, rather than run it. This way you can check to see if it looks like it's doing the right thing.

5 Check the Results and assemble a complete data set

At the end of the process above, you should have 2 sets of files for the gd7 mutants. You will have sorted BAM files that can be used for further analysis or visualization, and you will have files of gene counts that you can use for differential expression analysis.

To complete your data set, you need the toll10b data as well. The toll10b files have already been generated for you. Copy them from the class directory to your project directory:

cp /home/cws/CompGenomics/Data/RNA-Seq/STAR/toll10b_*ReadsPerGene.out.tab ./

With a full data set, now you're ready to take the next step of consolidating the data to a single table, and running EdgeR.

Running EdgeR

6 Extras

6.1 Learn how to run programs while you're away

Many programs take a long time to run (aligning a large data set can take several days). If you start a program and then log off the computer before your program is finished, it will usually be terminated by the system. There are two ways to avoid this and keep programs running after you have logged off.

6.1.1 screen

The screen command is a way to create a virtual console that you can detach from and reconnect to at will. For instance, typing screen will start a shell-like session where you can run commands. When you want to hide the session, you detach from it by typing control-A d, and it will disappear giving you your original terminal back. You can logout and your "virtual" session will still exist. You can reconnect to it, by typing screen -r and your "virtual" session will become your current terminal. You can exit your screen session by typing exit. It is possible to run many different screens at once, and they will all be identified by a unique number.

6.1.2 nohup

Placing the word nohup in front of your command will cause your command to continue running even after you have logged off the machine. This command causes your process to ignore the HUP (hangup) signal. Any output from your command that would have gone to the terminal is placed in the file: nohup.out. An example is shown below. Placing the ampersand at the end puts your process in the background, and gives you the command-line back.

nohup STAR myindex myfastq &

Note: make sure you know where you are in the system when start your command, as nohup may need to be able to write the nohup.out file to that location. If you are in a place that you don't have permission to write to you will get an error. (you should only be running commands in places you can write to anyway).

Since nohup creates this log file of any command output that would otherwise be going to the screen, if your command is reporting progress you can always check it by looking at the nohup.out log file:

tail nohup.out

# note: use tail -f for continuous reading of the tail of the file