Introduction to the Unix operating system

Goals

Learn how to work in Unix
Accomplish a few exercises
Learn how to use a text editor
Learn some markdown to create documentation
Characterize two gene lists

Resources:

Unix tutorial written by Michael Stonebank.
Handy cheat sheet with many basic commands. Print it out and use it!
Text editors. Pick one, learn it, use it:
- nano - a very simple text editor.
- emacs - a very powerful text editor with many features.
- vim - another powerful text editor.
Julia's twist ChIP-Seq data can be found here: ~cws/CompGenomics/Data

Unix Introduction

bash and ssh

When you log into a system, your commands are interpreted and executed by a program called the shell. Most Linux systems use the bash shell. The name stands for Bourne Again SHell, a reference to the highly influential Bourne shell, written by Stephen Bourne and released with Version 7 Unix in 1979. You reach a shell on a remote system through a terminal program such as Terminal.app (Mac) or putty.exe (PC), which connects using ssh (secure shell), a protocol for communicating back and forth with the remote shell. It is called secure because the information is encrypted as it passes over the network.

On a Mac, the Terminal application (Terminal.app) is in the Utilities folder inside Applications. It's a good idea to add it to your Dock.

Terminology

host

The host is the name of the computer you want to connect to or work on.

To log into a system you need a username and a password.
On a Linux system, you can log in to another host with the ssh command: ssh username@host

Directories

Directories are folders. When you're logged into a system, you are always in some directory (akin to entering a building, you are always in some room). The system is laid out and organized by directories and how they are connected to each other. Directories can contain files or other directories (aka subdirectories). You can refer to any directory using a combination of names and symbols.

A dot (.) means "the current directory" - which is the directory you are currently in.
Two dots (..) means the directory above you (the parent directory).
If there is a directory in your current directory (aka a subdirectory) you refer to it by name, as you do most other directories.
When you log into the system, you are in your home directory: /home/username
- A shorthand for your home dir is the tilde symbol (~).
- A shorthand for someone else's home directory is tilde followed by their username (thus ~cws is how you would specify my home dir).
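Here is a small sketch of the dot, double-dot, and tilde shorthands in action. It builds a throwaway directory tree first so nothing in your real home directory is touched; the names sandbox, Twist, and peaks are made up for the illustration.

```shell
# Scratch tree for practice; sandbox, Twist, and peaks are hypothetical names.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/Twist/peaks"

cd "$sandbox/Twist/peaks"
pwd                      # path ends in /Twist/peaks

cd ..                    # ".." moves up one level
pwd                      # path now ends in /Twist
here=$(pwd)

ls .                     # "." is the current directory: lists peaks
cd ./peaks               # same as typing: cd peaks
cd ../..                 # up two levels, back to the top of the sandbox
top=$(pwd)

echo ~                   # prints the path that the tilde expands to
```

If you ever lose track of where these moves have taken you, pwd will tell you.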

Forward slash (/)

The forward slash is used to separate directory names and build a "path" to any particular directory.

If you have a folder in your home directory called: Twist, you could refer to it in a number of ways:
- if you are in your home directory: ./Twist
- or: /home/username/Twist (where username = your user name)
- or: ~/Twist

Can you tell how each of these is different? Would they all work equally well to get to your Twist data for any user on the system?
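One way to explore that question is to try a relative path from two different places. The sketch below uses a scratch directory as a stand-in for a home directory (fakehome and Other are made-up names), so it is an illustration of the idea rather than your actual layout.

```shell
# A scratch directory stands in for a home directory (hypothetical layout).
fakehome=$(mktemp -d)
mkdir -p "$fakehome/Twist" "$fakehome/Other"

# From inside the "home" directory, the relative path resolves.
cd "$fakehome"
ls -d ./Twist && from_home="found"

# From somewhere else, the same relative path points at nothing...
cd "$fakehome/Other"
ls -d ./Twist 2>/dev/null || from_other="not found"

# ...but the full path works no matter where you are.
ls -d "$fakehome/Twist" && absolute="found"

echo "from home: $from_home, from Other: $from_other, full path: $absolute"
```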

Directory and File names

A rule of thumb for how to name files and directories in a UNIX system is to use any name that you can type into a web browser address bar. The rules for internet domain names allow:

the characters of the alphabet (a-z, A-Z)
the numbers 0-9,
the dash "-" character.

No spaces and no other odd characters are allowed. The reason is that spaces have meaning in UNIX: they separate a command from its arguments, and the arguments from each other. Thus, if you have a directory called "Twist Data" and you try to specify its name on the command line like this:

~/Twist Data

The system will think you are naming two things: one is something called Twist in your home dir, and the other is something called Data in your current dir (wherever you happen to be while typing the above). If you do happen to have a folder with a space in the name you can still reference it, but you'd have to do it like this:

~/Twist\ Data

This is called escaping the space. By putting a backslash (\) in front of the space you are telling the system not to interpret it as a separator. Many other characters have meaning in UNIX as well. Thus, to be safe, it is wise to name your files and directories using domain-name-like rules, with the addition of the underscore "_", and to take advantage of the fact that UNIX is case sensitive. This means you can use something called CamelCase to create compound words or phrases without a space, simply by altering capitalization.

Do not use spaces or odd characters in file names. It's not necessary, and you will regret it.
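To see the difference for yourself, here is a short sketch you can run in a scratch directory. All the directory names are just examples.

```shell
# Work in a scratch directory so the experiment is easy to throw away.
scratch=$(mktemp -d)
cd "$scratch"

mkdir Twist\ Data          # backslash escapes the space: ONE directory
mkdir "More Twist Data"    # quotes do the same job
mkdir Twist Data           # no escaping: TWO directories, Twist and Data
mkdir TwistData            # CamelCase: no space, nothing to escape

ls -1                      # one name per line
count=$(ls -1 | wc -l)     # how many directories did we end up with?
echo "$count"
```

The unescaped mkdir quietly makes two directories instead of one, which is exactly the kind of surprise that space-free names avoid.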

The * symbol

The * symbol is called a wildcard because, in pattern matching, it stands for any sequence of characters (including none). It is especially useful when you are working with files. If you wanted to reference all files whose names start with "twist", you could type twist* and the shell would match every file name that begins with twist.
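A quick way to get a feel for the wildcard is to make a few empty files and see which ones each pattern picks up. The file names below are invented for the example.

```shell
# Scratch directory with a few empty, made-up files.
scratch=$(mktemp -d)
cd "$scratch"
touch twist_rep1.bed twist_rep2.bed twist_notes.txt snail_rep1.bed

ls twist*          # every name that starts with "twist"
ls *.bed           # every name that ends in ".bed"
ls twist*.bed      # names that start with "twist" AND end in ".bed"

# echo shows what a pattern expands to without running anything on the files
echo twist*

starts=$(ls twist* | wc -l)
both=$(ls twist*.bed | wc -l)
```

Previewing a pattern with echo or ls before handing it to a command like rm is a cheap way to avoid deleting more than you meant to.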

Command structure and options

Commands are executed by typing or specifying their name. Sometimes you also need to specify parameters that control how the command runs. The parameters, or options, are often specified by a single dash followed by other characters. For instance typing date will return the local time, but if you specify other options:

date -u

this alters the command behavior, by telling it to return coordinated universal time.
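As a small sketch of how options and arguments change what a command does, here are three ways to run date. The format string in the last two lines is just one example of what you can ask for.

```shell
date                  # local time, default format
date -u               # the same moment as coordinated universal time
date -u +%Y-%m-%d     # an argument starting with + sets the output format

# Save the formatted date in a variable, e.g. for naming a file later
stamp=$(date -u +%Y-%m-%d)
echo "today (UTC) is $stamp"
```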

Using man and info to learn more

To find out more about any command you can use the info utility. For example, to get information on the date command:

info date

Type q to quit the help viewer. Or use return or the space bar to scroll through the information.

An older and more commonly available method of getting help is to look up the manual page for a command using man. It works the same way as info, but can be a little terse.

Common Commands

ls

List files. This is the most common command you will use. The only way to "look around" is to list things.

ls                   # list files
ls -a                # list all files, even "hidden" files (names start with a dot)
ls -al               # list all files, use long format (details)
ls -altrh            # what does this mean? How would you find out?

cd

Change directory. cd destination where destination is the location of some other directory.

cd ..                # Go to the directory above me.
cd /home/cws/CompGenomics  # go to the genomes directory. (use ls to see what's there)
cd                   # With no other arguments this command takes you to your home dir.

Use the tab key for autocompletion while specifying locations.

pwd: Print working directory. If you lose track of where you are, use this to remind yourself of your current location.
cp: Copy one or more files. The cp command has the syntax: cp original new
mv: Move a file from one location to another, or rename a file: mv source destination
rm: Remove one or more files. Be careful with this one: there is no trash can to recover them from.

Examples

rm filename       # remove the file
rm *.bed          # remove all files with .bed suffix
rm -i filename    # ask me if I'm sure before removing a file
rm -f filename    # don't ask me, just do it (force)

On some systems, administrators have set rm to be rm -i by default, to keep people from accidentally removing files.

mkdir: Create a directory: mkdir dirname
wget: grab a file or web page from the internet: wget url
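Here is a short practice run that strings mkdir, cp, mv, and rm together. It works in a scratch directory, and the file and directory names are made up for the example.

```shell
# Scratch directory for practice.
scratch=$(mktemp -d)
cd "$scratch"

mkdir project                      # create a directory
echo "chr2L" > notes.txt           # create a small file to play with

cp notes.txt notes_backup.txt      # copy: now there are two files
mv notes.txt project/              # move the original into the directory
mv notes_backup.txt renamed.txt    # mv also renames
ls -R                              # list everything, including inside project

rm -f renamed.txt                  # remove the renamed copy without asking
ls -R
```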

Alias

Try typing alias and see what you get.

You can create your own aliases using the alias command.

# create an alias to examine file attributes
alias fa='ls -alh'

An alias will go away when you log out. To create aliases so they are always there when you log in, you can add them to your .bashrc file.
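The sketch below defines the alias, checks it, and shows the append step for making it permanent. To keep the example harmless, it appends to a temporary stand-in file (rcfile, a made-up name) instead of your real ~/.bashrc; on your own account you would use ~/.bashrc in its place.

```shell
# Define the alias for this session and print its definition.
alias fa='ls -alh'
alias fa
defn=$(alias fa)

# rcfile is a temporary stand-in for ~/.bashrc (an assumption for this demo).
rcfile=$(mktemp)
echo "alias fa='ls -alh'" >> "$rcfile"    # >> appends, so existing lines are kept
grep alias "$rcfile"                      # confirm the line is there

unalias fa                                # remove the alias from this session
```

Lines added to .bashrc take effect the next time you log in or start a new shell.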

Handy Commands

These commands are often used to work with and interrogate data in files and streams.

command	description
head	show the first 10 lines of a file (customize the count: head -n 20 filename)
tail	show the last 10 lines of a file (customize the same way: tail -n 20 filename)
wc	word count: reports the number of lines, words, and bytes in a file
cut	cut a column of data from a tab-delimited file: cut -f2 filename
sort	sort a file or a stream of data
grep	search lines of a file for a pattern
comm	compare two sorted files or streams for what is common or unique to each
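To try these out without needing any real data, the sketch below makes a tiny tab-separated table and two short gene lists. All of the contents are invented for the example.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Made-up table: gene id, gene name, chromosome (tab-separated)
printf 'g1\ttwi\tchr2R\ng2\tsna\tchr2L\ng3\teve\tchr2R\ng4\tbcd\tchr3R\n' > genes.txt

head -n 2 genes.txt        # first two lines
tail -n 1 genes.txt        # last line
wc -l genes.txt            # number of lines only
cut -f3 genes.txt          # third column: the chromosomes
cut -f3 genes.txt | sort   # same column, sorted
grep chr2R genes.txt       # lines that mention chr2R

# comm expects both inputs to be sorted
printf 'g1\ng2\ng3\n' > listA.txt
printf 'g2\ng3\ng4\n' > listB.txt
comm -12 listA.txt listB.txt    # -12 hides columns 1 and 2: only shared lines
shared=$(comm -12 listA.txt listB.txt | wc -l)
```

With no options, comm prints three columns: lines only in the first file, lines only in the second, and lines in both.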

Redirect Output

The > and | symbols are for redirection of output.

These symbols allow you to link commands together and control where the output of a command goes. By default, the standard output of commands goes to the screen. If you're looking at too much data to fit on your screen, or if you want to do something else with the data besides look at it, this can be a problem. The answer is to redirect where the data goes.

For example, the programs installed on the system are in: /usr/bin. List them:

ls /usr/bin

Did you catch all that? If not, try sending the output to the paging program less using the pipe symbol.

ls /usr/bin | less

Now you can page through them with the space bar, or type q to quit the viewer.

operator	description
|	pipe symbol, connect output of one program to input of another
>	redirect the output of a program to a file (creates a file, overwrites existing files)
>>	redirect the output of a program to append to a file
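The difference between > and >> is easiest to see by trying both on the same file. This is a small sketch in a scratch directory; log.txt is a made-up file name.

```shell
scratch=$(mktemp -d)
cd "$scratch"

echo "first"  > log.txt     # > creates log.txt
echo "second" >> log.txt    # >> appends: the file now has two lines
echo "third"  > log.txt     # > again: the file is overwritten
cat log.txt                 # only "third" is left

echo "fourth" >> log.txt    # append once more
lines=$(wc -l < log.txt)    # < feeds the file to wc as its input
echo "$lines"

ls /usr/bin | wc -l         # pipe: count the programs instead of listing them
```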

Extras

gzip

Compressing and decompressing files is a fundamental part of working with large data sets. gzip and gunzip are commands for compressing and uncompressing files respectively. They each have default behaviors you can modify.

Command	Description
gzip filename	compresses filename to filename.gz and removes original file
gunzip filename.gz	decompresses filename.gz to filename and removes compressed file
gunzip -c filename.gz	decompresses filename.gz non-destructively to stdout, so you can pipe it through another process.

We use the -c option frequently with gunzip to leave the original file unmodified while still reading data from the .gz file. The .gz files might be in an archival location (read-only). We can read them this way without having to copy them.
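Here is a small round trip you can run to watch the files appear and disappear. The file chroms.txt and its contents are made up for the example.

```shell
scratch=$(mktemp -d)
cd "$scratch"
printf 'chr2L\nchr2R\nchr3L\n' > chroms.txt

gzip chroms.txt                         # chroms.txt is replaced by chroms.txt.gz
ls

# Read the compressed data without unpacking it; the .gz file is untouched
n=$(gunzip -c chroms.txt.gz | wc -l)
echo "$n"
ls                                      # still only chroms.txt.gz

gunzip chroms.txt.gz                    # chroms.txt is back, the .gz is gone
ls
```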

wget

Sometimes you want to retrieve something from the web to your local environment. The wget command is for retrieving web-based content by URL. It will download web content to a local file. You can also use it to mirror websites, etc.

# download that table of genes from the supplement
wget https://science.sciencemag.org/highwire/filestream/708652/field_highwire_adjunct_files/5/aaq1736_Table-S5.xlsx

Examples

Log into franklin.stowers.org and step through these examples to get familiar with executing commands. There is a directory of genomic resources at /home/cws/CompGenomics/dm6. List them:

ls /home/cws/CompGenomics/dm6

List the dates associated with the files

ls -l /home/cws/CompGenomics/dm6

Let's examine a file of data on all fly genes:

# Lines starting with # are comments, the system ignores them
# but you can use them to explain what you are doing.

# Here is a file with fly gene information:
# /home/cws/CompGenomics/dm6/dm6.Ens_98.gene_data.txt
# count the lines:
wc /home/cws/CompGenomics/dm6/dm6.Ens_98.gene_data.txt

Let's make a local copy of this file, but first we should confirm where we are in the system, and perhaps make a project directory for exploring this file.

# where are we?
pwd

# make a project directory
mkdir flygenes

# go into the directory
cd flygenes

# make a local copy of the fly gene file
cp /home/cws/CompGenomics/dm6/dm6.Ens_98.gene_data.txt ./

# did it work? How big is the file?
ls -l

# examine what kind of data is in the file
head dm6.Ens_98.gene_data.txt

### How many different chromosomes are in the file? ###

# Can we get just the chromosomes?
cut -f3 dm6.Ens_98.gene_data.txt | head

# list the unique entries
cut -f3 dm6.Ens_98.gene_data.txt | sort -u

# count them
cut -f3 dm6.Ens_98.gene_data.txt | sort -u | wc

# do they all have chr in their name?
cut -f3 dm6.Ens_98.gene_data.txt | sort -u | grep chr | wc

# an alternative to sort -u is uniq
cut -f3 dm6.Ens_98.gene_data.txt | sort | uniq | wc

# uniq has a nice option for counting.
# what does the following tell us?
cut -f3 dm6.Ens_98.gene_data.txt | sort | uniq -c

# Put just the gene ids into a file
cut -f1 dm6.Ens_98.gene_data.txt | sort -u > gene_ids.txt
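If you are not logged into the server, you can rehearse the same counting pipeline on a toy file. The sketch below invents a six-line table (toy_genes.txt is a made-up name and the contents are not real annotations) and adds one more step, sort -rn, to put the most common chromosome first.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Made-up table: gene id, gene name, chromosome (tab-separated)
printf 'g1\ttwi\tchr2R\ng2\tsna\tchr2L\ng3\teve\tchr2R\n' > toy_genes.txt
printf 'g4\tbcd\tchr3R\ng5\then\tchr3R\ng6\ten\tchr2R\n' >> toy_genes.txt

# Count genes per chromosome, largest count first
cut -f3 toy_genes.txt | sort | uniq -c | sort -rn

# Keep just the top line
top=$(cut -f3 toy_genes.txt | sort | uniq -c | sort -rn | head -n 1)
echo "$top"
```

uniq only collapses lines that sit next to each other, which is why the sort has to come before it.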

Cleaning up is an important part of file management. We don't need these files anymore, so let's get rid of them.

# remove the files (rmdir only works on an empty directory)
rm dm6.Ens_98.gene_data.txt gene_ids.txt

# move out of the directory
cd ../

# remove the directory
rmdir flygenes

Go to exercises