Introduction to the Unix operating system
Goals
- Learn how to work in Unix
- Accomplish a few exercises
- Learn how to use a text editor
- Learn some markdown to create documentation
- Characterize two gene lists
Resources:
- Unix tutorial written by Michael Stonebank.
- Handy cheat sheet with many basic commands. Print it out and use it!
- Text editors. Pick one, learn it, use it:
- Julia's twist ChIP-Seq data can be found here: ~cws/CompGenomics/Data
Unix Introduction
bash and ssh
When you log into a system, your commands are interpreted and executed by a program called the shell. Most Linux systems use bash, the Bourne Again SHell, by default. The name is a pun on the highly influential Bourne shell, written by Stephen Bourne and released in 1979. You connect and log into a shell with a client such as Terminal.app (Mac) or putty.exe (PC), which uses ssh (secure shell), a protocol for communicating back and forth with your shell. The "secure" part is that the information is encrypted as it passes back and forth over the network.
- On a Mac, you have a program called Terminal.app in your Utilities folder (inside Applications). It's a good idea to add it to your Dock.
Terminology
host
The host is the name of the computer you want to connect to or work on.
- To log into a system you need a username, and a password.
- On a Linux system, you might log in to another host with the ssh command:
ssh username@host
Directories
Directories are folders. When you're logged into a system, you are always in some directory (akin to entering a building, you are always in some room). The system is laid out and organized by directories and how they are connected to each other. Directories can contain files or other directories (aka subdirectories). You can refer to any directory using a combination of names and symbols.
- A dot (.) means "the current directory" - which is the directory you are currently in.
- Two dots (..) means the directory above you.
- If there is a directory in your current directory (aka a subdirectory) you refer to it by name, as you do most other directories.
- When you log into the system, you are in your home directory: /home/username
- A shorthand for your home dir is the tilde symbol (~).
- A shorthand for someone else's home directory is tilde followed by their username (thus ~cws is how you would specify my home dir).
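The shorthand above can be tried out safely in a scratch area. The sketch below is only an illustration: it points HOME at a temporary directory (an assumption made so that nothing in your real home directory is touched) and creates a hypothetical Twist folder there.

```shell
# Work in a throwaway "home" so nothing real is touched (assumption for this demo).
scratch=$(mktemp -d)
HOME="$scratch/home/username"
mkdir -p "$HOME/Twist"

cd ~                # ~ expands to your home directory ($HOME)
here=$(pwd)         # the current directory, also known as "."

cd Twist            # a subdirectory is referred to by name
cd ..               # ".." is the directory above, so we are home again
back=$(pwd)

cd ./Twist          # "./Twist" means "Twist inside the current directory"
twist=$(pwd)

echo "home:  $here"
echo "back:  $back"
echo "twist: $twist"
```

On a real system you would skip the first three lines and simply start from your own home directory.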
Forward slash (/)
The forward slash is used to separate directory names and build a "path" to any particular directory.
- If you have a folder in your home directory called: Twist, you could refer to it in a number of ways:
- if you are in your home directory:
./Twist
- or:
/home/username/Twist
(where username = your user name)
- or:
~/Twist
Can you tell how each of these is different? Would they all work equally well to get to your Twist data for any user on the system?
Directory and File names
A rule of thumb for how to name files and directories in a UNIX system is to use any name that you can type into a web browser address bar. The rules for internet domain names allow:
- the characters of the alphabet (a-z, A-Z)
- the numbers 0-9,
- the dash "-" character.
No spaces and no other odd characters are allowed. The reason is that spaces have meaning in UNIX: they separate a command from its arguments, and the arguments from one another. Thus, if you have a directory called "Twist Data", and you try to specify its name on the command line like this:
~/Twist Data
The system will think you are naming two things: one is something called Twist in your home dir, and the other is something called Data in your current dir (wherever you happen to be while typing the above). If you do happen to have a folder with a space in the name you can still reference it, but you'd have to do it like this:
~/Twist\ Data
This is called escaping the space. By putting a backslash (\) in front of the space, you are telling the system not to interpret it as a separator. Many other characters have meaning in UNIX as well. Thus, to be safe, it is wise to name your files and directories using domain-name-like rules, with the addition of the underscore "_", and to take advantage of the fact that UNIX is case sensitive. This means you can use something called CamelCase to create compound words or phrases without using a space, simply by altering capitalization.
Do not use spaces or odd characters in file names. It's not necessary, and you will regret it.
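To see what the shell actually does with a space, here is a small sketch that counts how many separate arguments it sees in each case. It runs in a temporary directory, and the directory names are just for illustration. Quoting the name, which the text above does not cover, is shown as an alternative to escaping.

```shell
# Scratch directory so the demo does not touch real data.
scratch=$(mktemp -d)
cd "$scratch"

mkdir "Twist Data"     # quotes keep the space as part of one name
mkdir TwistData        # CamelCase name: nothing to escape

# "set --" replaces the shell's argument list; $# is how many arguments it holds.
set -- Twist Data      # unescaped: the shell sees two things, Twist and Data
unquoted_count=$#

set -- Twist\ Data     # escaped space: one argument
escaped_count=$#

set -- "Twist Data"    # quoted: also one argument
quoted_count=$#

echo "unquoted: $unquoted_count, escaped: $escaped_count, quoted: $quoted_count"

# Either form reaches the directory with the space in its name.
ls -d Twist\ Data
```

The echo line should report 2, 1 and 1, which is exactly the "two things" problem described above.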
The * symbol
The * symbol is called a wildcard because it is used in pattern matching, especially when you are working with files. If you wanted to reference all files that start with the word "twist" in their name, you could type: twist* and this would match any file names that start with the word twist.
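A quick way to get a feel for wildcards is to let echo show you what a pattern expands to before you use it with a real command. The sketch below uses a temporary directory and made-up file names, and touch, which simply creates empty files.

```shell
# Scratch directory with a few empty, hypothetical files.
scratch=$(mktemp -d)
cd "$scratch"
touch twist_rep1.bed twist_rep2.bed twist_notes.txt control.bed

# The shell replaces the pattern with the matching file names
# before the command (here echo) ever runs.
matches=$(echo twist*)
echo "$matches"

# The wildcard can go anywhere in the pattern.
beds=$(echo *.bed)
echo "$beds"
```

Checking a pattern with echo or ls first is a good habit, especially before handing it to rm.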
Command structure and options
Commands are executed by typing their name. Sometimes you also need to specify parameters that control how the command runs. The parameters, or options, are often specified by a single dash followed by other characters.
For instance, typing date will return the local time, but if you specify an option:
date -u
this alters the command's behavior by telling it to return Coordinated Universal Time (UTC).
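Here is the same comparison as a short sketch you can paste into your shell. The +%Z argument is an addition not covered above: it is a format string asking date to print only the time zone abbreviation, which makes the effect of -u easy to see.

```shell
# Same command, with and without the -u option.
local_time=$(date)
utc_time=$(date -u)
echo "local: $local_time"
echo "UTC:   $utc_time"

# Options can be combined with other arguments.
# +%Z prints just the time zone abbreviation.
utc_zone=$(date -u +%Z)
echo "zone with -u: $utc_zone"
```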
Using man and info to learn more
To find out more about any command you can use the info utility. For example, to get information on the date command:
info date
Type q to quit the help viewer. Or use return or the space bar to scroll through the information.
An older and more commonly available method of getting help is to look up the manual page for a command using man. It works the same way as info, but can be a little terse.
Common Commands
- ls
List files. This is the most common command you will use. The only way to "look around" is to list things.
ls
list files
ls -a
list all files, even "hidden" files (names start with a dot).
ls -al
list all files, use long format (details).
ls -altrh
what does this mean? How would you find out?
- cd
- Change directory.
cd destination
where destination is the location of some other directory.
cd .. # go to the directory above me
cd /home/cws/CompGenomics # go to the genomes directory (use ls to see what's there)
cd # with no other arguments, this command takes you to your home dir
Use the tab key for autocompletion while specifying locations.
- pwd
- Print working directory. If you lose track of where you are, use this to remind yourself of your current location.
- cp
- Copy a file(s). The cp command has the syntax:
cp original new
- mv
- Move a file from one location to another, or rename it:
mv source destination
- rm
- Remove one or more files. Be careful with this one: there is no trash can, and a removed file cannot be recovered.
Examples
rm filename # remove the file
rm *.bed # remove all files with the .bed suffix
rm -i filename # ask me if I'm sure before removing a file
rm -f filename # don't ask me, just do it (force)
On some systems, administrators have set rm to be rm -i by default, to keep people from accidentally removing files.
- mkdir
- Create a directory:
mkdir dirname
- wget
- grab a file or web page from the internet:
wget url
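The commands above are easiest to learn by using them together. The following is a minimal sketch of a practice session in a throwaway directory. The file and directory names are made up, and touch (which creates an empty file) is used only to have something to copy and move.

```shell
# Start in a throwaway directory so mistakes cost nothing.
scratch=$(mktemp -d)
cd "$scratch"
pwd                               # confirm where we are

mkdir project                     # create a directory
cd project                        # go into it
touch peaks.bed                   # create an empty, hypothetical file

cp peaks.bed peaks_copy.bed       # copy: original first, then the new name
mv peaks_copy.bed backup.bed      # mv to a new name in the same place renames
mkdir archive
mv backup.bed archive/            # mv into a directory moves the file
ls -l . archive                   # look around

rm peaks.bed                      # remove the original
remaining=$(ls)                   # what is left in project?
echo "$remaining"
```

After the last step only the archive directory remains in project, with backup.bed inside it.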
Alias
Try typing alias and see what you get.
You can create your own aliases using the alias command.
# create an alias to examine file attributes
alias fa='ls -alh'
An alias will go away when you log out. To create them so they are always there when you log in, you can add them to your .bashrc file.
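One way to add an alias to a startup file without opening an editor is to append a line to it. The sketch below writes to a stand-in file in a temporary directory (an assumption, so that your real .bashrc is not modified while you experiment).

```shell
# Stand-in for ~/.bashrc so the demo does not edit your real startup file.
rcfile="$(mktemp -d)/bashrc_demo"

# Append the alias definition as a new line at the end of the file.
echo "alias fa='ls -alh'" >> "$rcfile"

# Confirm the line is there: grep -c counts the matching lines.
saved=$(grep -c "^alias fa=" "$rcfile")
echo "alias lines saved: $saved"
```

To do this for real you would use ~/.bashrc in place of the stand-in file, then log in again (or, in bash, run source ~/.bashrc) for the alias to take effect.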
Handy Commands
These commands are often used to work with and interrogate data in files and streams.
command | description |
---|---|
head | examine the first 10 lines of a file (can be customized: head -n 20 filename) |
tail | examine the last 10 lines of a file (can be customized the same way: tail -n 20 filename) |
wc | word count: returns the number of lines, words, and bytes in a file |
cut | cut a column of data from a tab-delimited file: cut -f2 filename |
sort | sort a file or a stream of data |
grep | search lines of a file for a pattern |
comm | compare two sorted files or streams for what is common or unique to each |
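Here is a small sketch that runs each command in the table on made-up data, so you can see what comes back. The gene table and the two gene lists are invented for illustration, and printf is used only to create them. Note that comm expects its input files to be sorted.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Made-up, tab-delimited gene table: id, name, chromosome.
printf 'g1\ttwi\tchr2R\ng2\tsna\tchr2L\ng3\tdl\tchr2L\ng4\teve\tchr2R\n' > genes.txt
# Two made-up gene lists, already in sorted order for comm.
printf 'dl\nsna\ntwi\n' > listA.txt
printf 'eve\nsna\ntwi\n' > listB.txt

head -n 2 genes.txt                        # first two lines
tail -n 1 genes.txt                        # last line
n_lines=$(wc -l < genes.txt)               # number of lines only
chroms=$(cut -f3 genes.txt | sort -u)      # third column, unique values
on_2L=$(grep -c chr2L genes.txt)           # count lines that mention chr2L
shared=$(comm -12 listA.txt listB.txt)     # lines found in both lists
only_A=$(comm -23 listA.txt listB.txt)     # lines found only in listA

echo "lines: $n_lines"
echo "genes on chr2L: $on_2L"
echo "shared: $shared"
echo "only in A: $only_A"
```

With comm, the options -1, -2 and -3 each hide one of its three output columns (only in the first file, only in the second, in both), so -12 leaves just the shared lines.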
Redirect Output
The > and | symbols are for redirection of output.
These symbols allow you to link commands together and control where the output of a command goes. By default, the standard output of commands goes to the screen. If you're looking at too much data to fit on your screen, or if you want to do something else with the data besides look at it, this can be a problem. The answer is to redirect where the data goes.
For example, the programs installed on the system are in: /usr/bin. List them:
ls /usr/bin
Did you catch all that? If not, try sending the output to the paging program less using the pipe symbol.
ls /usr/bin | less
Now you can page through them with the space bar, or type q to quit the viewer.
operator | description |
---|---|
\| | pipe symbol, connects the output of one program to the input of another |
> | redirect the output of a program to a file (creates a file, overwrites existing files) |
>> | redirect the output of a program to append to a file |
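The difference between the three operators is easiest to see side by side. The sketch below works in a temporary directory with made-up file names, so the overwriting behavior of > cannot hurt anything.

```shell
scratch=$(mktemp -d)
cd "$scratch"

echo "first run" > log.txt       # > creates log.txt
echo "second run" > log.txt      # > again: the old contents are overwritten
echo "third run" >> log.txt      # >> adds a line to the end instead

head log.txt                     # shows "second run" and "third run"
n_lines=$(head log.txt | wc -l)  # | feeds the output of head into wc

# Operators can be chained: list /usr/bin, keep the first 5 names, save them.
ls /usr/bin | head -n 5 > first_five.txt
head first_five.txt
```

Because > silently replaces an existing file, it is worth a quick ls before redirecting to a name you may have used already.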
Extras
gzip
Compressing and decompressing files is a fundamental part of working with large data sets. gzip and gunzip are commands for compressing and uncompressing files respectively. They each have default behaviors you can modify.
Command | Description |
---|---|
gzip filename | compresses filename to filename.gz and removes original file |
gunzip filename.gz | decompresses filename.gz to filename and removes compressed file |
gunzip -c filename.gz | decompresses filename.gz non-destructively to stdout, so you can pipe it through another process. |
We use the -c option frequently with gunzip to leave the original file unmodified while still reading data from the .gz file. The .gz files might be in an archival location (read-only). We can read them this way without having to copy them.
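The following sketch walks through all three rows of the table on a tiny made-up file, in a temporary directory, so you can watch which files appear and disappear.

```shell
scratch=$(mktemp -d)
cd "$scratch"
printf 'g1\tchr2L\ng2\tchr2R\ng3\tchr2L\n' > genes.txt   # made-up data

gzip genes.txt                   # compress: only genes.txt.gz is left
ls

# Read the compressed data without unpacking it on disk.
n_lines=$(gunzip -c genes.txt.gz | wc -l)
gunzip -c genes.txt.gz | cut -f2 | sort -u
echo "lines: $n_lines"

ls                               # still only genes.txt.gz

gunzip genes.txt.gz              # decompress: genes.txt is back, the .gz is gone
ls
```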
wget
Sometimes you want to retrieve something from the web to your local environment. The wget command is for retrieving web-based content with URLs.
It will download web content to a local file. You can also use it to mirror websites, etc.
# download that table of genes from the supplement
wget https://science.sciencemag.org/highwire/filestream/708652/field_highwire_adjunct_files/5/aaq1736_Table-S5.xlsx
Examples
Log into franklin.stowers.org and step through these examples to get familiar with executing commands.
There is a directory of genomic resources at /home/cws/CompGenomics/dm6. List them:
ls /home/cws/CompGenomics/dm6
List the dates associated with the files:
ls -l /home/cws/CompGenomics/dm6
Let's examine a file of data on all fly genes:
# Lines starting with # are comments, the system ignores them
# but you can use them to explain what you are doing.
# Here is a file with fly gene information:
# /home/cws/CompGenomics/dm6/dm6.Ens_98.gene_data.txt
# count the lines:
wc -l /home/cws/CompGenomics/dm6/dm6.Ens_98.gene_data.txt
Let's make a local copy of this file, but first we should confirm where we are in the system, and perhaps make a project directory for exploring this file.
# where are we?
pwd
# make a project directory
mkdir flygenes
# go into the directory
cd flygenes
# make a local copy of the fly gene file
cp /home/cws/CompGenomics/dm6/dm6.Ens_98.gene_data.txt ./
# did it work? How big is the file?
ls -l
# examine what kind of data is in the file
head dm6.Ens_98.gene_data.txt
### How many different chromosomes are in the file? ###
# Can we get just the chromosomes?
cut -f3 dm6.Ens_98.gene_data.txt | head
# list the unique entries
cut -f3 dm6.Ens_98.gene_data.txt | sort -u
# count them
cut -f3 dm6.Ens_98.gene_data.txt | sort -u | wc
# do they all have chr in their name?
cut -f3 dm6.Ens_98.gene_data.txt | sort -u | grep chr | wc
# an alternative to sort -u is uniq
cut -f3 dm6.Ens_98.gene_data.txt | sort | uniq | wc
# uniq has a nice option for counting.
# what does the following tell us?
cut -f3 dm6.Ens_98.gene_data.txt | sort | uniq -c
# Put just the gene ids into a file
cut -f1 dm6.Ens_98.gene_data.txt | sort -u > gene_ids.txt
Cleaning up is an important part of file management. We don't need these files anymore, so let's get rid of them.
# remove the files (rmdir below only works on an empty directory)
rm dm6.Ens_98.gene_data.txt gene_ids.txt
# move out of the directory
cd ../
# remove the directory
rmdir flygenes