Working with Affy GeneChip Sequence and Annotation Files in Access

Earl F. Glynn
Stowers Institute


Affymetrix provides sequence and annotation information as raw data files for use with a variety of software tools.  This page discusses some details that my be helpful in working with these raw data files.  Mouse and Chicken GeneChips will be used in the following discussion.

Where are the raw data files containing Affy sequence and annotation information?  For some time the starting point to get GeneChip information is the Affy products page. The selections on that page recently changed.  Look for a dropdown box under "Expression Analysis Arrays" on that page and select an array of interest.

I find Affy's naming conventions somewhat confusing for their mouse chips.  The older chip was called the "Murine Genome U74Av2" (MGU74Av2).  Newer chips are called "Mouse Expression Set 430" and are sold in "A" and "B" chips, MOE430A and MOE430B.  The annotations were better for the "A" chip in general, with many ESTs present on the "B" chip..  Then these "A" and "B" chips were combined into a single chip, Mouse430_2.  For some reason there is an "A" version of this chip now, Mouse430A_2.  Having two chips with similar names, MOE430A and Mouse430A_2, is a confusing convention

After you are certain which chip you care about and have found its web page, there should be an area at the lower right labeled "ADDITIONAL SUPPORT".  One of the links under this heading is "Technical Notes."  To find data, follow the link labeled "All Support Materials...".  Here are direct links to these support materials:  Mouse430_2   Chicken

On each support materials page you'll want the links under "NetAffx Annotation Files" and "Sequence Files."  For example, with Mouse430_2 you'll see these files:

NetAffx Annotations Files

  • BLASTP
  • BLASTX
  • CSV
  • MAGE-ML XML
  • Orthologs

Sequence Files

  • Consensus Sequences
  • Control Sequences
  • Probe Sequences, FASTA or Tabular
  • Target Sequences, FASTA
 

Affy updates these files roughly quarterly (March, June, Sept, Dec), and they provide no automatic notification when the files are updated.  As far as I know, the only way to find out if new files exist is to periodically check for updates. For now, Affy provides some release notes on this page identifying the significant changes. 

Automating the download process:  I never would have found Affy's automatic download tool without calling and asking Affy on the phone: On the main Affy page, select the Support link from the tabsheet horizontal menu bar.  Along the left-side, click on the Developers' Network link.  Next, select the DevNet Tools link.  Look for the section "NetAffx Download Utility."  Download and install this program under Windows.  This utility is setup to check for changes in the files every day (even though most files change quarterly at most). To automatically detect changes, one needs to check the annotation directory every day and check for changes.  Here's what this utility looks like:

DO NOT specify a network drive for the Annotation Directory or the service will just die silently without transferring any files. 

The NetAffyx Download Utility is not a "normal" program.  You control this program via an icon in the Windows Control Panel.


What information is in the CSV Annotation file?  The Mouse430_2 annotation file has significant information for the mouse GeneChip.  However, very little information is currently available for the Chick GeneChip.  Knowing which fields have information, and which fields should be ignored because they are constant or missing, for a particular genome is useful information.

In the past I have loaded the annotation CSV files into Access databases to "connect" Affy results with this annotation information.  I found the default Access definitions often truncated a number of data fields.  I would need to changed any "text" fields that were longer than 256 bytes to a "memo" to avoid this truncation.

To solve both problems (to find which fields have useful data, and to see how long the fields are), I wrote a simple R script to create a summary table. Copy and paste this line into R and select a file to process an Affy xxx_annot.csv file:

source("http://research.stowers-institute.org/efg/ScientificSoftware/Applications/Affy/Annotations/AffyStats.R")

Here are the results from processing the file Mouse430_2_annot.csv:

              GeneChip.Array Species.Scientific.Name Annotation.Date
1 Mouse Genome 430 2.0 Array            Mus musculus    Sep 16, 2005

                               name MinLength MaxLength UniqueCount MissingCount
1                      Probe.Set.ID        10        29       45101            0
2                    GeneChip.Array        26        26           1            0
3           Species.Scientific.Name        12        23           5            0
4                   Annotation.Date        12        12           1            0
5                     Sequence.Type        16        18           2            0
6                   Sequence.Source         7        31           2            0
7       Transcript.ID.Array.Design.         6        26       39079            0
8                Target.Description        76       814       42783            0
9          Representative.Public.ID         6        26       39079            0
10         Archival.UniGene.Cluster         3         9       32169         2812
11                       UniGene.ID         3        23       20579         5919
12                   Genome.Version        20        20           1            0
13                       Alignments         3       996       37658         2472
14                       Gene.Title         3      6259       24385         2770
15                      Gene.Symbol         1      2234       21842         4607
16             Chromosomal.Location         3      1752        4304        10091
17             Unigene.Cluster.Type         3        19           4        11330
18                          Ensembl         3      2847       15932        12127
19                      Entrez.Gene         3      1777       21727         4475
20                        SwissProt         3      4142       16969        10308
21                               EC         3       169         483        43472
22                             OMIM         3         3           1        45101
23                RefSeq.Protein.ID         3      2423       17922         9332
24             RefSeq.Transcript.ID         3      2121       17941         9300
25                          FlyBase         3         3           1        45101
26                              AGI         3         3           1        45101
27                         WormBase         3         3           1        45101
28                         MGI.Name         3      2047       13558        23541
29                         RGD.Name         3         4           2        45100
30             SGD.accession.number         3         3           1        45101
31 Gene.Ontology.Biological.Process         3     10734        5022        24729
32 Gene.Ontology.Cellular.Component         3      5147        2847        24266
33 Gene.Ontology.Molecular.Function         3     10091        5365        22245
34                          Pathway         3       923         438        39949
35                 Protein.Families         3      1056         423        44355
36                  Protein.Domains         3      1267         618        43936
37                         InterPro         3      2401        4050        21957
38                   Trans.Membrane         3      2205        4994        37696
39                              QTL         3        40           2        45100
40           Annotation.Description       146       189         200            0
41    Annotation.Transcript.Cluster         6      6000       41566            0
42           Transcript.Assignments        48     29972       43111            0
 43                 Annotation.Notes         3     29972       19485        25510 

Notes about Mouse430_2 Data Summary:

  • The UniqueCount for Probe.Set.ID gives the number of probes in the GeneChip, which is 45,101 in this case.
  • Many of the MinLength values are 3, since Affy uses "---" for missing values.
  • Fields with MaxLength > 255 cannot be loaded into Microsoft Access database table as "Text" fields.  A "Memo" field must be used to avoid field truncations.
  • Fields with only a single unique value, e.g., GeneChip.Array, Annotation.Date, do not have any "per probe" information.  These fields are constant, at least for this GeneChip.
  • Fields with considerable missing data are clearly identified from the high MissingCount column with UniqueValue=1, e.g., FlyBase, AGI, and WormBase fields.
  • The MGI.Name field dropped from 18,249 unique identifiers on March 31, 2005 to only 13,073 on June 20, 2005.  Most of the other fields had similar statistics in the last two versions of this file.
  • R added periods to field names to force compliance with its conventions. 

Here are the results from processing the file Chicken_annot.csv:

   GeneChip.Array Species.Scientific.Name Annotation.Date
1  Chicken Array           Gallus gallus    Sep 16, 2005

                               name MinLength MaxLength UniqueCount MissingCount
1                      Probe.Set.ID        12        25       38535            0
2                    GeneChip.Array        13        13           1            0
3           Species.Scientific.Name        13        13           1            0
4                   Annotation.Date        12        12           1            0
5                     Sequence.Type        18        18           1            0
6                   Sequence.Source         3        31           3        12784
7       Transcript.ID.Array.Design.         6        16       33534            0
8                Target.Description        35       708       34597            0
9          Representative.Public.ID         3        21       33920          689
10         Archival.UniGene.Cluster         3         9       16829        17361
11                       UniGene.ID         3        46       14995        16335
12                   Genome.Version        39        39           1            0
13                       Alignments         3       903       33864         2415
14                       Gene.Title         2      4248       21758         4298
15                      Gene.Symbol         1      1241       13804        15857
16             Chromosomal.Location         3         7          31        38267
17             Unigene.Cluster.Type         3        11           3        33028
18                          Ensembl         3         3           1        38535
19                      Entrez.Gene         3       974       13736        15644
20                        SwissProt         3       589        3443        32543
21                               EC         3        29          73        38373
22                             OMIM         3         3           1        38535
23                RefSeq.Protein.ID         3      1419       13596        15833
24             RefSeq.Transcript.ID         3      1241       13597        15832
25                          FlyBase         3         3           1        38535
26                              AGI         3         3           1        38535
27                         WormBase         3         3           1        38535
28                         MGI.Name         3         3           1        38535
29                         RGD.Name         3         4           2        38534
30             SGD.accession.number         3         3           1        38535
31 Gene.Ontology.Biological.Process         3       561           2        38534
32 Gene.Ontology.Cellular.Component         3       184           2        38534
33 Gene.Ontology.Molecular.Function         3       522           2        38534
34                          Pathway         3       606          41        38446
35                 Protein.Families         3       734         485        37492
36                  Protein.Domains         3      1174         745        36663
37                         InterPro         3       374        3689        18341
38                   Trans.Membrane         3      7417        3537        32115
39                              QTL         3        40           2        38534
40           Annotation.Description       133       190         148            0
41    Annotation.Transcript.Cluster         0      3853       32917            0
42           Transcript.Assignments        32     19852       36945            0
43                 Annotation.Notes         3      5581       14663        23763

Notes about Chicken GeneChip Data Summary:

  • The UniqueCount for Probe.Set.ID gives the number of probes in the GeneChip, which is 38,535 in this case.
  • In the June 20, 2005 database almost all the fields were missing, however in this Sept 16, 2005 database there are a significant number of values defined for these fields:  Archival.UniGene.Cluster, UniGene.ID, Alignments, Gene.Title, Gene.Symbol, Entrez.Gene, RefSeq.Protein.ID, RefSeq.Transcript.ID.  Three months ago there was only one probe annotated with information in all these fields, so remarkable progress has been made.
  • It's clear from this statistical summary that some fields are nearly entirely "missing data":  When MissingCount is 38,535, there is absolutely no data in the given field.
  • Many of the MinLength values are 3, since Affy uses "---" for missing values, but seems to have not used this convention in fields 40-42..

Importing Annotation CSV Files into Access.  A few additional notes from Affymetrix would make this process much easier.  In particular, Affy's Probe Set Annotations in Tabular Format document should include some of the following notes.

Why not Excel instead of Access?  Unbelievably, Microsoft still has a limit of 65536  (i.e., 64K) rows in Excel 2002.  If you try to load more rows than that, like with the Orthologs file, Excel will show this message

You won't see records past row 65536 (The Orthologs file for the older MOE430A has over 200,000 rows). 

Access provides database "joins" and other ways to manipulate the data not done easily in Excel.  (The same functionality is available in R using the "merge" statement. I hope to replace some work I've been doing in Access and use R instead.


Loading the Annotation Table in Access.  The Microsoft Access defaults are not always helpful in importing this file.  Start with an empty Access database:

File | Get External Data | Import | files of type:  Text Files (*.csv) |
select the MOE430A_annot.csv file | Import | Delimited | Next

Check:  First Row Contains Field Names

Next | In a New Table | Next | Next | No Primary Key | Next

Select the Advanced button.

Microsoft's Import Specification dialog box is a bit rigid and doesn't allow resizing so all the fields can be viewed at one time. 

As shown below, none of the fields need to be indexed for now.

Several of the Data Types need to be changed from the usual default Text data type (that only works with strings with 255 or fewer characters) to a Memo data type since the strings are fairly long:

Check the MaxLength column in the above AffyStats summary table to determine which fields must be a "memo" type to avoid truncation.

Many of these "Memo" fields are so long because they contain "denormalized data" from a database perspective.  These fields contain "repeating groups," such as multiple Gene Ontology identifiers within a single field.  

I named the new table "Annotation" to be consistent with the column name on the Affy web page.

Specify the name ("Annotation" as shown above) | Finish

If all goes well, you'll see the above message.  Select OK.

When you have problems, you're likely to see a message such as this:

Fields that are truncated should be changed from "Text" to "Memo".

The Probe Set ID can be used as a primary key since it's unique, which can be changed, if desired, in table design mode.  Here's what the table looks like in design mode:

Here are observations about the various data fields for MOE430A:

Comments about Annotation Data File for MOE430A GeneChip

Field Comments
Probe Set ID

This field is unique, so it can be used as a primary key.
MOE430A has 22690 probesets.

GeneChip Array Constant:  Mouse Genome MOE430A Array
Species Scientific Name Mus musculus  (count = 22,645)

Bacillius subtilis (count = 24)
Enterobacteria phage P1 (count = 4)
Escherichia coli (count = 14)
Homo sapiens (count = 3)

All non-mouse species are part of the "Control Sequence" probesets. See Sequence Type.

Annotation Date Mar 21, 2005
Sequence Type Consensus (count = 22,626)  or Control (count = 64)
[Note:  "Control" sequences have a prefix of "AFFX" in their Probe Set ID.]
Sequence Source GenBank  (count = 22,626) or "---" (count = 64)
Target Description

Denomalized field that could be parsed into additional fields. For example: 

gb:BC024686.1 /DB_XREF=gi:19354080 /FEA=FLmRNA
/CNT=416 /TID=Mm.26422.1 /TIER=FL+Stack /STK=110
/UG=Mm.26422 /LL=54161 /UG_GENE=Copg1 /DEF=Mus
musculus, coatomer protein complex, subunit gamma 1, clone
MGC:30335 IMAGE:3992144, mRNA, complete cds.
/PROD=coatomer protein complex, subunit gamma 1
/FL=gb:AF187079.1 gb:BC024686.1 gb:NM_017477.1
gb:BC024896.1

Genome Version Constant:  May 2004 (NCBI 33)
Alignments

Denomalized field that could be parsed into additional fields. 
For example: 

chr6:88224568-88250219 (+) // 97.36 // qD1

Unigene Cluster Type "full length" (count=18643), "est" (count=33), "est /// full length" (count=4), "---" (count=4010)
LocusLink

Usually a single ID, but can show repeating groups:
For example:

14017 /// 216984

SwissProt

Often shows repeating groups in this single field:

P41230 /// Q6ZQF8 /// Q80XQ9 /// Q8CGG4

RefSeq Protein ID Can show repeating groups:  NP_034291 /// NP_666135
RefSeq Transcript ID Can show repeating groups:  NM_010161 /// NM_146023

FlyBase
AGI
WormBase

constant:  "---"
Gene Ontology Biological Process

Denomalized field that could be parsed into additional fields. 
For example: 

6886 // intracellular protein transport // inferred from sequence or structural similarity /// 15031 // protein transport // inferred from sequence or structural similarity /// 6810 // transport // inferred from sequence or structural similarity

Gene Ontology Cellular Component

Denomalized field that could be parsed into additional fields. 
For example: 

5794 // Golgi apparatus // inferred from sequence or structural similarity /// 16020 // membrane // inferred from sequence or structural similarity

Gene Ontology Molecular Function

Denomalized field that could be parsed into additional fields. 
For example: 

5525 // GTP binding // inferred from electronic annotation /// 3743 // translation initiation factor activity // inferred from electronic annotation

Pathway

Sparse information, but is a denormalized field:

Example:  Nucleotide_Metabolism // GenMAPP

Protein Families

Sparse information, but is a denormalized field:

Example:
ec // TDH_XANCP // TDH_XANCP EC:1.1.1.103:THREONINE 3-DEHYDROGENASE (EC 1.1.1.103). // 2.0E-14 /// ec // TDH_XANCP // TDH_XANCP EC:1.1.1.103:THREONINE 3-DEHYDROGENASE (EC 1.1.1.103). // 3.0E-15

Protein Domains

Sparse information, but is a denormalized field:

Example:

scop // a.4.3.ARID domain // All alpha proteins; DNA/RNA-binding 3-helical bundle; ARID-like; ARID domain // 2.0E-31

InterPro

Very sparse information.  Defined for only 10 probesets.

Example:

IPR002586 // Cobyrinic acid a,c-diamide synthase

Trans Membrane
QTL
Constant:  "---"
Annotaton Description

78 possible values.  Most common (count=14249)

This probe set was annotated using the Matching Probes based pipeline to a Locus Link identifier using 1 transcripts. // false // Matching Probes // A

Annotation Transcript Cluster

Field contains repeating groups:

Example:  AB017026(11),AK050641(11),BC046466(9),NM_207530(11)

Transcript Assignments

Field contains subfields that could be parsed:

Example: 
AB091827 // Mus musculus Peg10 mRNA for paternally expressed 10, complete cds. // gb // 11 // ---

Annotation Notes

"---" for over half (count=11820)

Field contains subfields that could be parsed:

Example: 

AB054987 // gb // 3 // Cross Hyb Matching Probes

 


Loading the BLASTP and BLASTX tables.  These CSV files also cannot be loaded into Access databases using default parameters.  The "Hit Description" must be made into a Memo field.  The "E-value" can be stored as a "double" instead of a character string.  The Probe Set IDs are not unique, so this field cannot be made a primary key, but making it an index is probably a good idea to speed future processing.


 



Loading the Orthologs table.  This one is easy since the Microsoft Access defaults work fine.

 

The Probe Set IDs are not unique in this table, but an index can be added to this field to speed joins with other tables.  This table has 18,663 unique Probe Set IDs.

 

Ortholog Chip Total Curated
Ortholog
Putative
 Ortholog
AtGenome1 715   715
ATH1-121501 1,525   1,525
C. elegans 2,062   2,062
DrosGenome1 4,293   4,293
HC-G110 2,613 2,544 69
HG-Focus 11,844 11,205 639
HG-U133A 27,617 25,519 2,098
HG-U133B 13,680 11,502 2,178
HG-U133-PLUS 44,992 40,150 4,842
HG-U95Av2 16,040 14,984 1,056
HG-U95B 8,514 7,314 1,200
HG-U95C 6,665 5,679 986
HG-U95D 4,762 4,149 613
HG-U95E 7,294 6,330 964
Hu35KsubA 9,542 8,452 1,090
Hu35KsubB 4,974 4,272 702
Hu35KsubC 6,056 5,372 684
Hu35KsubD 6,070 5,390 680
HuGeneFL 8,493 8,039 454
RAE230A 8,054 550 7,504
RAE230B 610 47 563
RG-U34A 7,433 520 6,913
RG-U34B 1,631 125 1,506
RG-U34C 1,932 109 1,823
RN-U34 1,453 79 1,374
RT-U34 939 44 895
TOTAL 209,803 162,375 47,428

Loading Lists of Probe Set IDs for Consensus, Control and Target Sequences.  The following files contain FASTA sequence information, so they cannot be loaded into Access tables directly:

  • MOE430A_consensus
  • MOE430A_control
  • MOE430A_target

A short UNIX script can be used to extract the probeset IDs from these files, which can then be loaded into Access:

cat create_probeset_lists
#! /bin/sh

echo "Probe Set ID" > consensus.txt
echo "Probe Set ID" > control.txt
echo "Probe Set ID" > target.txt

grep '>' MOE430A_consensus | cut -d':' -f3 | cut -d';' -f1 >> consensus.txt
grep '>' MOE430A_control   | cut -d':' -f3 | cut -d';' -f1 >> control.txt
grep '>' MOE430A_target    | cut -d':' -f3 | cut -d';' -f1 >> target.txt

The consensus.txt file can be loaded into an Access table, named Consensus, like this:

File | Get External Data | Import | consensus.txt | Import | Delimited | Next | Comma Delimiter |

First Row Contains Field Names | Next

In a New Table | Next | Next | "Probe Set ID" as primary key | Next

Import to Table:  Consensus

Finish

Repeat the above procedure for the control.txt  and target.txt files and change the table names accordingly. 

This process would also work with the MOE430A_probe_fasta file, which is in FASTA format.  However, the MOE430A_probe_tab file is already tab delimited delimited, so that file can be loaded directly into Access.  To get Access to recognize the file a little easier, let's add a ".txt" suffix to MOE430_probe_tab.  Use these similar steps to load the file:

File | Get External Data | Import | MOE430A_probe_tab.txt | Import | Delimited | Next | Tab Delimiter |

First Row Contains Field Names | Next

In a New Table | Next | Next | No Primary Key | Next

Import to Table:  Probe

Finish

In Design Mode, change the field "Probe Set Name" to "Probe Set ID" for consistency with other tables.  This field is not unique so it cannot be a primary key, but make it an indexed field with "duplicates OK".

Here's what's in the Probe table:


 

Each probe set has a number of rows in this table.  For example the first probe set, 1415670_at, has eleven probe pairs.  Here is the distribution of the number of probe pairs per probe set for MOE430A:

 

Number of Probe Pairs
Per Probeset
Frequency Comments
8 1 1424262_at
9 4 1439244_a_at, 1420464_s_at, 1434127_a_at, 1437835_a_at
10 11  
11 22,631  
20 40 controls
21 3 controls
Total 22,690  

 

The (Probe X, Probe Y) coordinates in this table specify the location of the probe pair in the Affy .DAT file.


Load the list of "mask" maintenance gene probeset IDs.   From http://www.affymetrix.com/support/technical/mask_files.affx

Each GeneChip® expression array listed below features approximately 100 maintenance genes which serve as a tool to normalize and scale your data prior to performing data comparisons.

To facilitate the use of these genes in scaling or normalization, the mask files that identify these maintenance probe sets on each array are available below.

Download the file http://www.affymetrix.com/Auth/support/downloads/mask_files/moe430anorm.zip.  Unzip this file to create the file moe430anorm.msk.  Copy this file to moe430A_maintenance.txt.  Replace the first three lines ("MOE430A", "[Call]", "[Comp]") with "Probe Set ID". 

Load this file into an Access table and call it "Maintenance":  File | Get External Data | Import | moe4380Aa_maintenance.txt | Import | Delimited | Next | First Row Contains Data | Next | Next | Next | select Probe Set ID as primary key | Next | Maintenance | Finish

A "join" of the Annotation and Maintenance tables can be used to see additional info about these maintenance genes.

 

 

For MOE430A, all of the maintenance probe sets had exactly 11 probes.

NOTE:  The "Control" genes have a Probe Set ID starting with "AFFX" so there is no overlap between the "Control" genes and the "Maintenance" genes.


Summary.  The notes above show how to load these MOE430A sequence and annotation tables into an Access database:

Sequence Files

Table Records Unique ProbeSet IDs Comments
Consensus 26,626 26,626 ID List only
Control 64 64 ID List only
Target 22,690 22,690 ID List only
Probe 249,958 22,690  

Annotation Files

Table Records Unique ProbeSet IDs Comments
Annotation 22,690 22,690  
blastp 36,115 17,743  
blastx 33,336 9,861  
Orthologs 209,803 18,663  

Other Files

Table Records Unique ProbeSet IDs Comments
Maintenance 100 100 ID List only

 

Also see:  Affy GeneChip MOE430A:  Probe Sequences in a ProbeSet Example

 


e f g @ s t o w e r s - i n s t i t u t e . o r g

Updated
7 Oct 2005