STOWERS INSTITUTE TAG TABLE PIPELINE - USER'S GUIDE
VERSION: BETA

#####################################################################################################################


Sections:

1. Setup
2. Running the Tag Table Generator (Step 1)
3. Running the Alignment Interpreter (Step 2)


#####################################################################################################################


###	SECTION 1:
###	Setup


********************


"The Pipeline" is actually two Perl scripts. Each script is as self-contained and transparent as possible, and well 
annotated.

The pipeline also uses an organism parameters file (Ensembl_Chromosome_Coord_Systems.par).  Users will need to edit 
this file if their organisms aren't already there.

The fourth file included is an auxiliary Excel spreadsheet (code_summary_autoformatter.xls).  It provides nice 
screen formatting for the code summary report produced in Step 2.


********************


The pipeline requires 5 things:

1. Linux (or Unix).
	The pipeline was built and tested on CentOS 5, GFS filesystem.

2. Perl 5.8.8 or higher.
	Required Modules: Cwd, DBI, Data::Dumper.
	Ensembl Required Modules: If using an Ensembl DB, you will need the current Perl API/bioperl modules 
	  installed.  See the Perl API section at www.ensembl.org.

3. A powerful computer.  
	At home, the pipeline is run on a 64-bit, 8x1GHz CPU GFS server with 64GB RAM.  
	The largest genomes (human, mouse) can take up to 45% of its memory, but still use only 1 CPU.

4. An Ensembl 'core' or 'otherfeatures' database on a mysql host.
	'Core' is the usual install and uses established genes and transcripts (see selection at ftp.ensembl.org).
	'Otherfeatures' is analogous to core but uses only EST loci and transcripts.
	Perl DBI access must be allowed; you must also have table creation, modification, and update permissions on 
	  the database.
	If you log into another computer to run the pipeline, the mysql host must be accessible from that computer.

5. A folder somewhere to call home.
	Preferably a dedicated folder with no global write permissions, as you don't want anyone tinkering with your
	  file tables.
	If you log into another computer to run the pipeline, your folder must be accessible from that computer.


********************


The Plan:

1. Create your folder somewhere.  This will be the pipeline 'root'; the scripts must be run from this folder.

2. Copy the scripts to this location.

3. Download and install the pertinent Ensembl database, if you haven't done so already.

4. Edit 01_GEX_STEP_1.pl and 02_GEX_STEP_2.pl and change the following (all in the first few lines):
	A. The shebang (the first line, which starts with #!).
	B. The Bioperl and Ensembl API pointers.
	C. The mysql host name.
	D. Your mysql user name.
	E. Your mysql password.
	F. Comment the $verbose variable off or on (that is: do you want play-by-play reporting to your screen?).
	G. To set a nonstandard tag length at runtime, turn on the $choose_tag_length variable.

5. Check Ensembl_Chromosome_Coord_Systems.par: are your organisms of interest there?  If not, you'll need to add 
   them (see "Editing Ensembl_Chromosome_Coord_Systems.par" below).

6. You should now be ready to run the pipeline.


********************


Some pointers about the directory structure:

1. Directory structures for pipeline steps 1 and 2 overlap.  They are designed to use each other's directory trees 
   to keep outputs well organized.

2. Tag table generation makes a directory with the database name and organizes all its output inside this directory.  
   Each set of tag tables generated from that database is stored in a subfolder with the table-set systematic name,
   so multiple sets of tag tables (with different parameters) can be generated without interfering with each other.  
   Should you make two sets of tables with the same parameters, you will be warned and asked to add something to the
   systematic name, in order to keep things separate.

3. Alignment interpretation will use the same database directory to store all experiments you create which use that 
   database.  Within each experiment folder, interpretation runs are stored in timestamped subfolders, so outputs 
   will never be overwritten.


********************


Editing Ensembl_Chromosome_Coord_Systems.par:

This step is only required for keeping the file up-to-date with your Ensembl or Ensembl-ized databases.  This file 
is used to identify the database coordinate system associated with the standard chromosomes, so that appropriate 
sequence regions can be identified, and scaffolds / nonstandard chromosomes can easily be identified, even if they
share the same coordinate system.  It also identifies the organelle chromosomes so that their genes can be handled
separately.

1. You will need:

	1. The genus and species from the database name.
	2. The 'core' database coord_system_id that corresponds to chromosomes (see below).
	3. Optionally, the 'otherfeatures' database coord_system_id that corresponds to chromosomes.
	4. The list of chromosome names from the database (comma-delimited) (see below on finding the names).
	5. The mitochondrial chromosome name.
	6. The chloroplast chromosome name, if applicable.

2. Open Ensembl_Chromosome_Coord_Systems.par and add these six fields to a new line, in order, separated by tabs:

	A. genus_species
	B. coord_system_id for 'core' database
	C. coord_system_id for 'otherfeatures' database (or 0 if there is none)
	D. the mitochondrial chromosome name (or leave blank)
	E. the chloroplast chromosome name (or leave blank)
	F. The complete comma-delimited list of standard chromosomes (NO scaffolds) for this organism, including 
	   mito and chloroplast.  Remember to use the database's names!

3. Save and exit.  If you don't know those coord_system_id numbers, find them as follows:

	1. Log into the mysql host and switch to the database for the organism of interest.
	2. Type "select * from coord_system;" and hit Enter.
	3. Look for "chromosome" in the "name" column.  Note the corresponding coord_system_id number.  
		If more than one "chromosome", go with the one that says "default version".
		If no "chromosome", go with "scaffold" -- and prepare for a really long runtime.
		If no scaffold either, this genome build may be too preliminary to be useful!  You can try 
		  "supercontig" or something.
	4. Type "select distinct * from seq_region where coord_system_id = your_number;" and hit Enter.
		A list of chromosomes (and maybe some scaffolds) should appear, if this is the right coord_system_id.
	5. Here are all the chromosome names, as Ensembl records them.
		Take note of the real chromosomes versus any scaffolds.
		Are any names different from what you normally use?  You will have to use Ensembl's names when 
		  working with this database.
		Note that Ensembl uses "MT" for the mitochondrial chromosome, not "M".
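
As an illustration of the six-field line described in step 2 above, here is a minimal Python sketch that assembles one
tab-separated entry and checks that the organelle names also appear in the chromosome list.  It is not part of the 
pipeline: the function name format_par_line is hypothetical, and the coord_system_id values in the example (1 and 0) 
are placeholders, not values taken from any real Ensembl database.

```python
def format_par_line(genus_species, core_id, otherfeatures_id=0,
                    mito="", chloro="", chromosomes=()):
    """Build one tab-separated .par line with fields A-F, in order."""
    chroms = [str(c) for c in chromosomes]
    # Field F must list the organelle chromosomes too, using the database's names.
    for organelle in (mito, chloro):
        if organelle and organelle not in chroms:
            raise ValueError(f"{organelle!r} is missing from the chromosome list")
    fields = [genus_species, str(core_id), str(otherfeatures_id),
              mito, chloro, ",".join(chroms)]
    # A stray tab inside a field would shift every later column.
    if any("\t" in field for field in fields):
        raise ValueError("fields must not contain tab characters")
    return "\t".join(fields)


# Placeholder values for illustration only; look up the real IDs in coord_system.
chromosome_names = [str(n) for n in range(1, 20)] + ["X", "Y", "MT"]
line = format_par_line("mus_musculus", 1, 0, "MT", "", chromosome_names)
print(line)
```

The blank chloroplast field is kept as an empty column, so the line always contains exactly five tabs.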


#####################################################################################################################


###	SECTION 2:
###	Running the Tag Table Generator ("Step 1")


********************


Running GEX_STEP_1.pl:

1. Execute the script from a competent machine -- requires 10-16GB RAM per gigabase of genome.
	A 3 gigabase genome can absorb over 33GB of RAM during processing.  A 1 gigabase genome will take ~16GB.

2. Runtime parameters:

	A. Choose the Ensembl DB.  (The host is queried for a list of Ensembl 'core' and 'otherfeatures' DBs)
	B. Choose whether to use chromosomes only, scaffolds only, or both.
	C. Choose the exclusion level.
	D. Choose the enzyme.  
	
	Incidentals:
	E. If Compara databases are present, you will be asked to choose one (to use for repeat rescue)
	F. If $choose_tag_length was switched on, then you will also be asked for the tag length.
	G. If another set of tables with that extension exists in the folder or the chosen DB, you will be asked for
	   an extension to add.

3. As the script begins, it will pause a moment and allow you to send it to background (shell: CTRL-Z, then "bg").
	If you ignore the pause, it will continue in the foreground (best for troubleshooting).
	Runtimes are determined by processor power, mysql server load, and the number of reads being processed.  You
	  may want the process safely tucked away, as it will be running for a while.
	The script will continue reporting to the screen, even from background, unless the window is closed.
	The script will initially create some directories and some blank files; don't open the files while the script
	  is running.

4. The very last file to be created is fasta_README.txt in the 'fastas' folder.  The script will report to screen 
   when it has completed, along with a timestamp.
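
For planning purposes, the rule of thumb in step 1 above (10-16GB RAM per gigabase of genome) can be turned into a 
quick estimate.  This is only a rough Python sketch of that arithmetic; the function name step1_ram_range_gb is 
hypothetical, and actual memory use will depend on the genome and the parameters chosen.

```python
def step1_ram_range_gb(genome_gigabases):
    """Return (low, high) RAM estimates in GB from the 10-16GB per gigabase rule."""
    low_per_gigabase, high_per_gigabase = 10, 16  # figures quoted in this guide
    return genome_gigabases * low_per_gigabase, genome_gigabases * high_per_gigabase


print(step1_ram_range_gb(3))  # -> (30, 48)
```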


#####################################################################################################################


###	SECTION 3:
###	Running the Alignment Interpreter ("Step 2")


********************


Running GEX_STEP_2.pl:

1. Execute the script from a competent machine -- may require up to ~15GB RAM.
	Highest observed memory use is ~13.5GB for a 51-million read run, but a run half that size still takes 11GB.

2. Runtime parameters:

	A. Enter the full path to the Gerald output directory (or wherever you've put the Eland results)

	B. Do all 7 lanes belong to one experiment?
	   1. If yes, enter the experiment name.
	   2. If no, you will be asked to divide the lanes by experiment.  Enter separate experiments in blocks like
	      this: "ExpName:Lane1,Lane2,Lane3"
	      a. For instance, say the flowcell has two experiments on it.  One is lanes 1-4 (mouse), the other is 
		 lanes 5-7 (human).
	      b. You would enter the following:  "Mouse:1,2,3,4 Human:5,6,7" (without the quotes)

	C. For each experiment you defined, choose the following parameters:
	   1. Choose the Ensembl DB.  (The host is queried for a list of Ensembl 'core' and 'otherfeatures' DBs)
	   2. Choose the tag table index within that database.  (Multiple versions can coexist)
	   3. Filter Eland results?  (This restricts analysis to reads found in the s_*_results.txt files)
	   4. Crop reads to tag length?  (Alignments may have been a few bases longer than necessary)
	   5. Ignore N-containing reads?  (Eliminate reads with ambiguous basecalls)
	   6. Output raw counts or TPM?  (Should the output tables report raw read counts, or tags-per-million?  TPM 
	      values are calculated on a lane-by-lane basis, not scaled to the entire flowcell).
	   7. Generate "complete" track files?  (If yes, all nonstandard chromosomes and scaffolds will get tracked.
	      If no, only standard chromosomes will be tracked. Usually, "no" is the desired answer).

3. As the script begins, it will pause a moment and allow you to send it to background (shell: CTRL-Z, then "bg").
	If you ignore the pause, it will continue in the foreground (best for troubleshooting).
	Runtimes are determined by processor power, mysql server load, and the number of reads being processed.  You
	  may want the process safely tucked away, as it will be running for a while.
	The script will continue reporting to the screen, even from background, unless the window is closed.
	The script will initially create some directories and some blank files; don't open the files while the script
	  is running.

4. The very last file to be created is tables_README.txt in the timestamped folder.  The script will report to screen 
   when it has completed, along with a timestamp.
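
Two of the runtime answers above lend themselves to a short illustration: the lane-assignment string from step 2B and 
the lane-by-lane TPM scaling from step 2C.6.  The Python sketch below is not the pipeline's own code.  The function 
names are hypothetical, and the TPM denominator (the sum of the counts passed in for one lane) is an assumption; this 
guide does not say which total the script actually divides by.

```python
def parse_lane_assignment(text, lanes=range(1, 8)):
    """Parse 'ExpName:Lane1,Lane2 ...' blocks into {experiment: [lane numbers]}."""
    valid = set(lanes)
    seen = set()
    experiments = {}
    for block in text.split():
        name, sep, lane_list = block.partition(":")
        if not sep or not name or not lane_list:
            raise ValueError(f"malformed block: {block!r}")
        lane_numbers = [int(item) for item in lane_list.split(",")]
        for lane in lane_numbers:
            if lane not in valid:
                raise ValueError(f"lane {lane} is not on this flowcell")
            if lane in seen:
                raise ValueError(f"lane {lane} is assigned more than once")
            seen.add(lane)
        experiments.setdefault(name, []).extend(lane_numbers)
    return experiments


def lane_tpm(counts):
    """Scale one lane's raw counts to tags-per-million using that lane's own total."""
    total = sum(counts.values())  # assumption: denominator is this lane's total count
    if total == 0:
        return {tag: 0.0 for tag in counts}
    return {tag: n * 1_000_000 / total for tag, n in counts.items()}


print(parse_lane_assignment("Mouse:1,2,3,4 Human:5,6,7"))
print(lane_tpm({"geneA": 1, "geneB": 3}))
```

Because each lane is scaled by its own total, TPM values from different lanes are comparable as proportions, but 
they are not normalized against the flowcell as a whole.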


********************


Acceptable Inputs:

s_*_sequence.txt files in fastA, fastQ, or SCARF formats
s_*_eland_results.txt files -- NOT s_*_eland_match.txt or s_*_sorted.txt, although they will be supported later. 


********************


A note on the fields found in annotations_table.txt:

Everything should be obvious, except for the three fields "M_Tags", "O_Tags", and "P_Tags".  They are defined as follows:

"P_Tags" = "Possible Tags", the total number of tags assigned to that gene, according to the tag tables.

"O_Tags" = "Observed Tags", the subset of P_Tags which actually received mappings.

"M_Tags" = "Mapped Tags", the number of real tags that were mapped to O_Tags.  Due to U1 and U2 reads, this number is
often far greater than O_Tags. Just for error checking, the value of M_Tags cannot exceed 9*nCr(L,2)+3L+1, where L 
is the tag length in bp, and nCr(L,2) is the value of "n-Choose-r" with n=L and r=2. For 17bp tags, nCr(17,2) = 136,
so M_Tags <= 9*136 + 51 + 1 = 1276.
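
The M_Tags ceiling can be checked with a few lines of Python.  This is a sketch of the arithmetic only, with a 
hypothetical function name (max_m_tags).  The three terms of the formula match the number of L-base sequences over a
4-letter alphabet that differ from a given tag at exactly two positions (9*nCr(L,2)), exactly one position (3L), and
no positions (1).  Evaluating the formula as written for L = 17 gives 9*136 + 51 + 1 = 1276.

```python
from math import comb  # available in Python 3.8 and later


def max_m_tags(tag_length):
    """Evaluate 9*nCr(L,2) + 3L + 1 for a tag of length L."""
    two_mismatches = 9 * comb(tag_length, 2)  # 2 positions, 3 alternative bases each
    one_mismatch = 3 * tag_length             # 1 position, 3 alternative bases
    exact_match = 1
    return two_mismatches + one_mismatch + exact_match


print(max_m_tags(17))  # -> 1276
```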


********************


Using the Code Summary Autoformatter:

1. Copy code_summary.txt over to a Windows drive or some place accessible from Windows.

2. Open it and hit CTRL-A, CTRL-C. (select all, copy)

3. Open code_summary_autoformatter.xls and select the "RawData" tab.

4. Select cell A1 and hit CTRL-V. (paste)

5. The contents of code_summary.txt should appear in this page.  

6. The results are now visible in the "Formatted" tab, and look a lot nicer.  The sheet should be ready to print on 
   2 pages (readwise + tagwise).


#####################################################################################################################