Alignment of RNA Seq data with STAR
Table of Contents
1 Objectives
The goal of this exercise is to align FASTQ files to the mouse genome using STAR. The summary for this project describes the background information.
2 Prepare for an alignment
2.1 Find a Server
In order to align your RNA Seq data, you have to find an available server. Try logging into various servers and see if they look busy. You can use the top command to check server usage. You can also use ganglia to see the status of servers via web browser.
Here is a table of servers at the institute:
Server | cores | Memory |
---|---|---|
cedar | 32 | 450 GB |
hickory | 64 | 512 GB |
maple | 24 | 128 GB |
cypress | 32 | 1 TB |
Addenda: this is the server we'll be using:
Server | cores | Memory |
---|---|---|
franklin.stowers.org | 16 | 64 GB |
2.2 Choose a working directory
Think about how to organize your work up front. Each project you work on generates many files. Create a directory for your project to organize the work and results.
2.3 Choose a system to record your work
You will be executing commands, and taking notes on the results. Just like in the lab, you need a method for recording your work on the computer. A simple text file is one way of recording commands, file pathways, notes about problems or run times, etc. Every project directory should have some kind of descriptive file with information about how the results in the directory were generated. There may be more than one. A simple README or notes.txt file can record basic information, and can exist alongside more polished files that describe or summarize results.
3 Run STAR
The STAR manual has a description of how to run the program and all the options you can supply to STAR. It's a good idea to look this over.
To run STAR on your FASTQ files you generally need three things:
- the location of your FASTQ files
- a STAR index of the genome
- a GTF file of transcript descriptions
The FASTQ files are located in the class Data directory:
/home/cws/CompGenomics/Data/RNA-Seq/
What are these files? How can we view them?
At the Stowers, we have many genome indexes for various organisms that have already been created for you. But since we're not on the internal network, our STAR alignment index is in:
/home/cws/CompGenomics/dm6/
You can see that this directory points to resources around the dm6 version of the fly genome, and these indices have been constructed using version 98 of the Ensembl annotations for mouse genes. There is a GTF file of transcript descriptions there. Can you find it?
The general structure of the STAR command to align some data looks as follows:
STAR NumberOfProcessors IndexDirectory FASTQ_files
We will be specifying these options, and a few others to the program using command-line arguments that begin with a double-dash, as described in the following table:
Argument | Description | Example Value |
---|---|---|
–runThreadN | Number of processors to use | 2 |
–genomeDir | Path to index directory | /home/cws/CompGenomics/dm6/Ens98_STAR_51 |
–readFilesIn | path to input fastq file(s) | /home/cws/CompGenomics/Data/RNA-Seq |
–sjdbGTFfile | GTF file of Transcripts | /home/cws/CompGenomics/dm6/dm6.Ens_98.gtf |
–outSAMtype | output file type | BAM SortedByCoordinate |
–quantMode | Count reads per gene | GeneCounts |
–readFilesCommand | Command for data decompression | zcat |
–outFileNamePrefix | Tag output files with custom prefix | sampleID |
When working out a complex command, it's a good idea to test it on some sample data. Let's use our unix skills to grab some data and make a toy data file to play with as input.
There's a data file in the class data directory, we can non-destructively uncompress it and take the top 1000 lines, and then re-direct that to a local file in our current directory:
# grab 1000 lines of fastq data from a file
gunzip -c /home/cws/CompGenomics/Data/RNA-Seq/gd7_1.fastq.gz | head -1000 > sample.fastq
For now let's just begin with the simplest example possible: align our toy data using STAR
STAR --runThreadN 2 --genomeDir /home/cws/CompGenomics/dm6/Ens_98/STAR_51bp --readFilesIn sample.fastq
The command should complete within a minute or two. When it's done, explore your directory: ls -al
What do you see?
Once you explore the files, remember to remove them. They are simply a test result. You don't really need to keep them, unless you want to compare them to some other results generated with different parameters.
# clean up
rm *.log Aligned.out.sam SJ.out.tab
3.1 Run STAR on your data files
If the steps above are working, we can move on and use STAR to align all of our data. We will be using all of the options in the table above because we need to accomplish several things:
- We will use more threads (processors). Change
--runThreadN
to 3. - Our input data is compressed in gz format, so we will use
--readFilesCommand
to supply STAR with the decompression method:zcat
orgunzip -c
. - We want to supply a set of transcript descriptions, using
--sjdbGTFfile
and the location of our GTF file. - We want to quantify read counts on genes, so we supply the
--quantMode
argument. - We need a BAM file as output. We use
--outSAMtype
to specify this. - We use
--outFileNamePrefix
to track the output files for each sample separately.
3.1.1 Learn how to use shell variables
Using all the variable above will make for some long unwieldy commands to type. Some of the things we want to include in our command have long path names. We can make things easier by assigning some of our resources to variables. This makes them easy to re-use, and more amenable to putting into a script, in case we want to run somethign on lots of files (as is common in genomics).
For instance, the location of the gtf file is: /home/cws/CompGenomics/dm6/dm6.Ens_98.gtf Instead of typing that out each time we want to use it, we can assign it to a variable, and then we can use the variable in all our commands. The variable is only defined in your current shell session, and will go away when you log out.
(FWIW, the pathways above are not that long, but if we were on our regular servers, some of the resources are in places with very long pathways)
# we can assign things to a variable using the equals sign
gtf=/home/cws/CompGenomics/dm6/dm6.Ens_98.gtf
Now we can use that file more easily. We use a dollar sign in front of the variable name to retreive the value. Let's try some commands to check to make sure our variable is properly defined.
# check that we set a value
echo $gtf
# check the file
ls -ah $gtf
# how many lines are in the file
wc $gtf
3.2 The Alignment Index
When you run STAR, you need an aligment index. Where do they come from?
The alignment index is created by STAR using a copy of the target genome. It can be created with or without gene annotations. If you create it with gene annotations, then you don't have to supply them when you align your fastq data. We created the index for you, with gene annotations, as follows. You do not have to run this step.
gtf=/home/cws/CompGenomics/dm6/dm6.Ens_98.gtf
genome=/home/cws/CompGenomics/dm6/dm6.fa
STAR --runThreadN 4 --runMode genomeGenerate --genomeDir /home/cws/dm6/Ens98_STAR_51 --genomeFastaFiles $genome \
--sjdbGTFfile $gtf --sjdbOverhang 51
You can use this existing index as is. Because it was created with the GTF file, you do not have to supply gene annotations when you run STAR on your fastq files.
3.2.1 Run STAR with all our parameters
# set up some variables
fastq_base=/home/cws/CompGenomics/Data/RNA-Seq
gd=/home/cws/CompGenomics/dm6/Ens98_STAR_51
# run STAR
STAR --runThreadN 3 --genomeDir $gd --readFilesIn $fastq_base/gd7_1.fastq.gz --outSAMtype BAM SortedByCoordinate \
--quantMode GeneCounts --readFilesCommand zcat --outFileNamePrefix gd7_1_
In the command above we define two variables. One of them is used as the path to our data, and we use it to help us
access our individual data files $fastq_base/data.fastq.gz
. The second new thing is that while this command is meant to be
typed as one long line, I have used a backslash \
to make the command fit onto the page. Using the backslash is a
way to tell the shell interpreter that your command extends on the next line.
4 You can script all of this
The command above is good for aligning one file. But what if you have many files? If you feel ambitious, you can write your command in a script so that it gets executed over and over again and is customize to run on each input file.
Writing a script can be as simple as putting your commands into a file, one after the other. You can then run the script like a program, and your commands will all be executed in order.
However, putting things into a script opens up a new world of possibilities. Here is an example of a script that runs STAR for all the gd7 samples:
#!/bin/bash
# define variables
index=/home/cws/CompGenomics/dm6/Ens98_STAR_51
# get our data files
FILES=/home/cws/CompGenomics/Data/RNA-Seq/gd7_*.fastq.gz
for f in $FILES
do
echo $f
base=$(basename $f .fastq.gz)
echo $base
STAR --runThreadN 3 --genomeDir $index --readFilesIn $f --outSAMtype BAM SortedByCoordinate \
--quantMode GeneCounts --readFilesCommand zcat --outFileNamePrefix $base"_"
done
echo "done!"
Some things to notice:
- The first line is special. It tells the computer what program will understand and execute your code. the bash program is the same program that gives you your command line.
- We can use a
for loop
to do something over and over again, in this case, do something for each input data file. - We use a program called
basename
to extract pieces of our input file name, and assign them to a new variable. - We use
echo
to check that our variables contain the values we think they do. - You can build this command up, line by line. You can run it by commenting out the STAR part to make sure everything makes sense before running that part.
- You can also place
echo
in front of the STAR command, which will print out the command, rather than run it. This way you can check to see if it looks like it's doing the right thing.
5 Check the Results and assemble a complete data set
At the end of the process above, you should have 2 sets of files for the gd7 mutants. You will have sorted BAM files that can be used for further analysis or visualization, and you will have files of gene counts that you can use for differential expression analysis.
To complete your data set, you need the toll10b data as well. The toll10b files have already been generated for you. Copy them from the class directory to your project directory:
cp /home/cws/CompGenomics/Data/RNA-Seq/STAR/toll10b_*ReadsPerGene.out.tab ./
With a full data set, now you're ready to take the next step of consolidating the data to a single table, and running EdgeR.
6 Extras
6.1 Learn how to run programs while you're away
Many programs take a long time to run (aligning a large data set can take several days). If you start a program and then log off the computer before your program is finished, it will usually be terminated by the system. There are two ways to avoid this and keep programs running after you have logged off.
6.1.1 screen
The screen command is a way to create a virtual console that you can detach from and reconnect to at will. For instance, typing screen will start a shell-like session where you can run commands. When you want to hide the session, you detach from it by typing control-A d, and it will disappear giving you your original terminal back. You can logout and your "virtual" session will still exist. You can reconnect to it, by typing screen -r and your "virtual" session will become your current terminal. You can exit your screen session by typing exit. It is possible to run many different screens at once, and they will all be identified by a unique number.
6.1.2 nohup
Placing the word nohup in front of your command will cause your command to continue running even after you have logged off the machine. This command causes your process to ignore the HUP (hangup) signal. Any output from your command that would have gone to the terminal is placed in the file: nohup.out. An example is shown below. Placing the ampersand at the end puts your process in the background, and gives you the command-line back.
nohup STAR myindex myfastq &
Note: make sure you know where you are in the system when start your command, as nohup may need to be able to write the nohup.out file to that location. If you are in a place that you don't have permission to write to you will get an error. (you should only be running commands in places you can write to anyway).
Since nohup creates this log file of any command output that would otherwise be going to the screen, if your command is reporting progress you can always check it by looking at the nohup.out log file:
tail nohup.out
# note: use tail -f for continuous reading of the tail of the file