Instructions for using tRMA:

tools for R Microarray Analysis (Version 1.7.0)

 

Dale Wilson   Mike Buckley

CSIRO Mathematical and Information Sciences

 

 

February 21st, 2002

 

İ COPYRIGHT Commonwealth Scientific and Industrial Research Organisation (CSIRO), Australia 2001.

 

 

All rights in this software and its documentation are reserved to CSIRO.  You are only permitted to have this software and documentation in your possession and to make use of it if you have agreed to a Licence Agreement with CSIRO.

 

All enquiries for the use of this software should be referred to: trma@cmis.csiro.au

 

 

1.   Introduction

 

This document outlines the use and details of the functions contained in tRMA (tools for R Microarray  Analysis). The document consists of a broad overview of the current functionality of tRMA, installation instructions, a tutorial using the example datasets provided with the software, and details of the individual functions available in tRMA.

 

The functions in tRMA currently allow you to load certain types of (cDNA) microarray data files, examine the data through display and plot functions, normalise the data, average the data, and extract genes that are highly differentially expressed. The functions run inside a freely available statistical software packaged called R (4), which must be installed separately by you. The R software package can be downloaded from http://www.r-project.org/. Note that, in order to run tRMA you must also download and install the cluster package, the multiv package, and the GeneSOM package. All of these are available from pages linked to from the R home page. (The links you go to from the r-project page are “CRAN”,  “Precompiled Binary Distributions – Windows”, and then “contrib”. Then look for the appropriate zip files.). Once you have downloaded cluster.zip, multiv.zip and GeneSOM.zip you need to install them by double clicking on the executable file rwinst.exe. You would have needed to use this file when you installed R on your PC, so it should still be floating around your directories somewhere. You need to use the “add a package” option when the executable’s running, and select the “library” directory inside the R directory structure on your C drive as the place to install the two packages. Note that you should not unzip the two packages yourself – they will be unpacked correctly by rwinst.exe.

 

tRMA currently allows you to load three different formats of data, gpr files created using GenePix image analysis software, text files downloaded from the Stanford microarray database, and dat files created using the Spot image analysis software. There is provision made to cope with slides in which the dyes had been swapped. When the GenePix files are created (inside GenePix) there is a facility for you to enter a list of gene names and gene IDs. These two columns must be present in the gpr files loaded into tRMA for tRMA to work correctly. The gene names may or may not be known at the time the gpr file is created. The gene IDs should be generated by you inside GenePix and can consist of both letters and numbers. You should give each spot that came from the same cDNA aliquot the same gene ID. Spots with the same gene ID, both within and across slides, may be combined in the tRMA software. Hence, if two spots did not originate from the same aliquot of cDNA they should have different gene IDs. For other information on the formats of gpr files, Stanford database files and the Spot dat files please refer to documents specifically related to those file types.

 

The functions available in tRMA for displaying data include the following:

·         plotting the data on various types of axes (e.g. R versus G, log(R) versus log(G), etc), with an optional interactive feature where gene IDs for individual selected points on the plot can be displayed;

·         displaying images of the data on various axes, where each pixel in the image represents the individual data for each spot; and

·         printing subsets of the data to the screen for examination of individual spot values.

 

There are currently four different normalisation methods available in tRMA, slide normalisation, pin normalisation, spatial normalisation, and a normalisation based on housekeeping genes. All of the methods perform non-linear normalisation for an individual slide by removing biases in differential gene expression within that slide due to average expression level and, in the pin and spatial methods, biases due to the physical location of each spot on the slide. The resulting dataset will be centred and possibly scaled.

 

The slide, pin and spatial normalisation methods require each slide to contain a large proportion of the genes to be non-differentially expressed. This proportion may be of the order of 80% of the spots on the slide. This is because these normalisation methods estimate the biases from the levels of differential expression in the mass of non-differentially expressed genes. Hence, these methods will not work on “boutique arrays” where there are only a small number of genes on each slide (300 perhaps), all of which have been selected because they are highly differentially expressed. A further requirement on the use of the pin and spatial methods is that the highly differentially expressed genes should be spaced randomly across the slide. This is difficult to do accurately, because for most of the genes on the slide you will not know in advance what their differential expression is likely to be. However, some effort must be made to avoid having all of the differentially expressing genes clustered together in one (or more) physical regions on the slide(s). If these genes are clumped together they will introduce biases into the normalisation procedures and final results may be skewed. The housekeeping gene normalisation is specifically designed for use on specialist, boutique arrays where most of the genes on the slide are expected to be highly differentially expressed.

 

Two functions are provided in tRMA for averaging cDNA microarray data. You may wish to average over gene replicates (specified as spots with the same gene IDs) and/or over slide replicates. The averaging is performed over normalised data only, so that biases are removed prior to averaging.

 

A function is available in tRMA for determining highly differentially expressed genes within an individual slide (which may be the result of averaging over several slides). The selection method is an empirical one, based on the dataset being analysed (i.e. no pre-specified cut-off is used[1]), using assumptions on the distribution of the data after normalisation has taken place. Results from the selection are output to files.

 

Comparison of results from selecting highly differentially expressed genes across several slides is possible within tRMA, whereby genes that are found to be highly differentially expressed in a (user-)specified percentage of a set of slides can be found. Results from this selection are also output to files.

 

There are several multivariate functions available in tRMA, including principal components analysis, hierarchical clustering and K-Means clustering.

1.1. Installation

 

To install the tRMA functions into R you must do the following:

1.       Save the zipped file in a directory of your choice.

2.       Unzip the files.

3.       Start R.

4.       From inside R, source the function SourceFunctions.r by either clicking on the yellow (open folder) button on the tool-bar, or selecting “Source R code” from the “File” menu. You may need to browse through your directory structure to find the SourceFunctions.r file.

5.       On the command line type SourceFunctions("C:/functions/"), where the text inside the quotes will correspond to the directory where you have unzipped the tRMA functions. 

 

When you have completed these steps you should be able to use any of the functions described below in this help document.

 

To get a statement of how to use a particular function you can type the function name followed by left and right round brackets (with no spaces). Note that there is no usage statement for functions that take no parameters. (However, these functions are usually very simple to use.)

 

Note that, because when you type SourceFunctions("C:/functions/") on the command line, the two required libraries (multiv and cluster) get loaded, you must type SourceFunctions("C:/functions/") on the command line every time you start a new R session. If you don’t rerun the SourceFunctions() command in every new R session, none of the clustering functions in tRMA will work.

 

A number of functions provided in tRMA save results to a directory on you hard drive (or local network). You can specify which directory these results go to by performing the following steps:

  1. Create an R icon on your desktop or on the tool bar
  2. Right-click on the R icon and select the “Properties” option
  3. On the Shortcut tab you will find a “Start in” box, where you should type in the directory (D:\tRMA\results, for example) where you would like the files to be saved
  4. Click on the “OK” button

 

1.2. Warnings

 

Strange errors will occur if your data file names begin with a number. Please change all data file names so that they start with a letter. The filenames can contain numbers, but just not start with one. Also, do not have spaces or the underscore character (_) in file names.

 

You should close all Excel files that you think may need to be overwritten by any of the R functions. An example might be the interesting.genes.compared.txt file, which often gets written to by the analysis functions. If you do have a file open in Excel and you run an R function that tries to write to that file, the function will terminate with an error and the results will not be saved. If this happens, you should close the relevant file in Excel and try running the R function again.

 

2.   Tutorial

 

This tutorial outlines a sample analysis that might be done on the example datasets that were provided with this software and documentation, slide1.gpr, slide2.gpr, slide3.gpr and slide4.gpr. Specific details of the parameter settings for each function are not given here and can be found in the latter sections of this help document.

 

1.       Firstly you would want load the data files:

Ĝ      LoadGenePixFile("D:/tRMA/slide1.gpr",F,4,4)

Ĝ      LoadGenePixFile("D:/tRMA/slide2.gpr",F,4,4)

Ĝ      LoadGenePixFile("D:/tRMA/slide3.gpr",T,4,4)

Ĝ      LoadGenePixFile("D:/tRMA/slide4.gpr",T,4,4)

Note that the location of the files will depend on where you have stored the example datasets. The last two slides, slide3.gpr and slide4.gpr, have had their dyes swapped, so the swap.dyes parameter has been set to TRUE for these two datasets. (N.B. T/F can be used instead of TRUE/FALSE.) An extra column called “Regions” has been manually added (in Excel) to slide3.gpr. Details of what this column represents are given in the explanation of the LoadGenePixFile()function. The column is given in one of the example files to illustrate what such a column could look like.

 

2.       After loading the data, you may want to examine it using some of the display functions:

Ĝ      PlotData(3,slide1.gpr,"slide1.gpr",T)

Ĝ      ImageData(3,slide3.gpr,"slide3.gpr")

Ĝ      DisplayData("slide2.gpr",ID="1226")

Note both the dependence of the differential expression of average expression (which can be seen in the plot) and the dependence of the differential expression on spatial location (which can be seen in the images).

 

3.       The next step would be to normalise the data. Here we use the spatial normalisation method, although similar calls would be made if slide or pin normalisation were desired:

 

Ĝ      SpatiallyNormalise(slide1.gpr,"slide1.gpr",T)

Ĝ      SpatiallyNormalise(slide2.gpr,"slide2.gpr",T)

Ĝ      SpatiallyNormalise(slide3.gpr,"slide3.gpr",T)

Ĝ      SpatiallyNormalise(slide4.gpr,"slide4.gpr",T)

 

4.       The example datasets each contain two replicates of each of 3872 gene, as determined by the gene ID. You may want to generate another dataset (“med.slide1”) that contains specified information for a slide (e.g. median)within which the gene replicate data have been combined:

 

Ĝ      MedianData("spat",T,1,"med.slide1","slide1.gpr")

 

Alternatively, you may want to generate a new dataset (“medall”) that is the average (e.g. median) over all of the example slide replicates:

 

Ĝ      MedianData("spat",T,4,"medall","slide1.gpr","slide2.gpr","slide3.gpr","slide4.gpr")

 

Here we are putting the averaged result in an object called “medall”, which can be plotted and displayed like other datasets. In this example averaging is performed over gene replicates and slide replicates.

 

5.       Having performed some form of normalisation on the data, you would now be able to generate a list of highly differentially expressed genes, or “interesting” genes, for each individual slide:

 

Ĝ      FindDiffExpGenes("spat",slide1.gpr,"slide1.gpr",T)

Ĝ      FindDiffExpGenes("spat",slide2.gpr,"slide2.gpr",T)

Ĝ      FindDiffExpGenes("spat",slide3.gpr,"slide3.gpr",T)

Ĝ      FindDiffExpGenes("spat",slide4.gpr,"slide4.gpr",T)

The results are stored within R and also output to files. The text written to the screen as the function is completed states how to access these results.

 

You may also want to look for “interesting” genes in the dataset that you created from the average of the four example slides:

 

Ĝ      FindDiffExpGenes("spat",medall,"medall",T)

 

6.       After finding “interesting” genes in the four individual slides, you may want to compare the results, to see how many genes appeared as interesting in at least 75% (for example) of your slides. This can be done as follows:

Ĝ      CompareInterestingGenes("spat",4,75,T,"slide1.gpr","slide2.gpr","slide3.gpr","slide4.gpr")

 

You may also want to compare the outputs from the previous function with the genes that were found to be interesting in the averaged dataset. This can be done by displaying each of the results to the screen:

 

 

 

 

Ĝ      DisplayData("interesting.genes.compared")

9

8

7

6

5

4

3

2

1

 

910

2840

800

206

91

459

3430

1582

2275

   ID

unknown

unknown

unknown

unknown

unknown

unknown

unknown

unknown

unknown

Name

8.475254

8.707975

7.261781

7.950969

8.494704

11.830051

8.373859

11.088503

9.891909

MedianAvgLogRLogG

-6.665852

-5.002861

-4.924113

-4.564483

-3.769152

5.881142

6.168855

7.022197

8.199394

MedianSpatNormLogRdivG

3.801347

7.869473

4.958922

4.845781

1.913393

1.751360

3.677305

7.453606

5.552771

SpatNormStdDev

0.4215012

0.3918926

0.6208846

0.6632662

0.5960748

1.7163293

2.6460047

3.5544505

4.3003407

MedianSpatNorm.BacktoRatio

3

4

3

4

3

3

4

4

3

NumberOfSlides

YES

YES

YES

YES

YES

YES

YES

YES

YES

FrequentlyInteresting

 

 

Ĝ      DisplayData("interesting.genes.medall")

 

 

ID

Name

SpatNormLogRdivG  

StdDev

SpatNorm.BacktoRatio

Flags

Regions

Significant

1

2275

unknown        

7.054288

5.690537    

3.2354274

0

1

YES

2

1582

unknown        

7.022197

7.453606    

3.5544505

0

1

YES

3

3430

unknown        

6.168855

3.677305    

2.6460047

0

1

YES

4

800

unknown       

-4.924113

4.750433    

0.6208846

0

1

YES

5

2840

unknown       

-5.002861

7.869473    

0.3918926

0

1

YES

6

910

unknown       

-5.523238

4.279068    

0.4997441

0

1

YES

 

That concludes the tutorial. There are useful functions available in tRMA that were not included in this tutorial, so it would be worth reading the Function Specifications section to find out what other features are available to you.

 

There is often some text and plots output to the console and graphics window when some of the functions are called, and it is worth examining this information when you perform analyses. You will not need to read most of the text after you have used the programme a few times.

 

3.   Function Specifications

3.1. Loading, viewing, saving and removing data

 

LoadGenePixFile(“filename”,swap.dyes,num.blocks.cols,num.blocks.rows)

e.g. LoadGenePixFile("D:/trma/slide1.gpr",TRUE,4,4)

 

This function is used to load GenePix files (*.gpr files) into R. The swap.dyes parameter should be TRUE if you would like the dyes to be swapped, and FALSE otherwise. The parameters num.blocks.cols and num.blocks.rows should be given as the number of columns of blocks (i.e. columns of print tips) and the numbers of rows of blocks (i.e. rows of print tips). 

 

Note that spots that were manually flagged as bad in Genepix (flagged as -100 in flag column of the gpr file) are removed at this stage.

 

This load function either (a) reads in the last column of the data file as the “spot regions”, which have been manually entered by the user prior to starting tRMA, or  (b) it adds a “spot regions” column to the data as it is loaded, so that all of the spots fall into the one region. The spot region labels that could have been manually entered into the original data file can be used as labels of different re-suspension solutions for the cDNA clones (for example).  The labels may be 1's and 2's, for example. Note that the labels must be numbers. If they are not, errors may occur in later analyses. These labels do not have to be put into the data file by the user prior to loading the data into tRMA. If there are no labels, all spots are automatically labelled as being in the same region. This labelling is used later in the SlideNormalise() function.

 

 LoadStanfordFile(“filename”,swap.dyes,num.blocks.cols,num.blocks.rows)

e.g. LoadStanfordFile("D:/trma/SMD1234.txt",TRUE,4,8)

 

This function is used to load files down-loaded from one of the Stanford databases into R. The swap.dyes parameter should be TRUE if you would like the dyes to be swapped, and FALSE otherwise. The parameters num.blocks.cols and num.blocks.rows should be given as the number of columns of blocks (i.e. columns of print tips) and the numbers of rows of blocks (i.e. rows of print tips).

 

Note that spots that were manually flagged as bad in Genepix (flagged as -100 in flag column of the gpr file) are removed at this stage.

 

This load function either (a) reads in the last column of the data file as the “spot regions”, which have been manually entered by the user prior to starting tRMA, or (b) it adds a “spot regions” column to the data as it is loaded, so that all of the spots fall into the one region. The spot region labels that could have been  manually entered into the original data file can be used as labels of different re-suspension solutions for the cDNA clones (for example).  The labels may be 1's and 2's, for example. Note that the labels must be numbers. If they are not, errors may occur in later analyses. These labels do not have to be put into the data file by the user prior to loading the data into tRMA. If there are no labels, all spots are automatically labelled as being in the same region. This labelling is used later in the SlideNormalise() function.

 

LoadSpotFile(“filename”,swap.dyes,num.blocks.cols,num.blocks.rows)

e.g. LoadSpotFile("D:/trma/slide10.dat",TRUE,4,4)

 

This function is used to load Spot files (*.dat files) (Spot being a microarray image analysis package) into R. The swap.dyes parameter should be TRUE if you would like the dyes to be swapped, and FALSE otherwise. The parameters num.blocks.cols and num.blocks.rows should be given as the number of columns of blocks (i.e. columns of print tips) and the numbers of rows of blocks (i.e. rows of print tips). 

 

Note that spots that were manually flagged as bad in Spot (flagged as 1 in badspot column of the dat file) are removed at this stage.

 

This load function adds a “spot regions” column to the data as it is loaded, so that all of the spots fall into the one region. This labelling is used later in the SlideNormalise() function.

 

ListDataAttributes(data)

e.g. ListDataAttributes(slide1.gpr)

 

This function can be used to print out the normalisation attributes of a particular dataset (one that has either been loaded, or is the result of using the MeanData() or MedianData() functions). The slide, pin and spatial normalisation attributes will be listed. The first two will be either “none” or “lowess” depending on whether the respective normalisation procedure has already been called, and the spatial normalisation attribute will be either “none” or “spat”.

 

ListData()

e.g. ListData()

 

This function can be used to list all of the current data objects stored in the R workspace. Note that all of the tRMA functions are also contained in the R workspace, but these will not be listed by the ListData() function.

 

DisplayData(“data”,first,last,ID)

e.g. DisplayData(“slide1.gpr”,1,10,ID=“2037”)

 

This function can be used to display a (single) aded using the LoadGenePixFile() function, or a multivariate dataset generated using the CreateMultivariateData() function. The parameters first, last and ID are all optional, with the default values being the first row of the dataset, the last row of the dataset, and all of the genes in the dataset (i.e. all of the IDs), respectively.  If you choose to put in a value for ID, but leave first and last as their default values, you must include ID= in the command line call to the function (e.g. DisplayData(“slide1.gpr”,ID=“1204”)).

 

Be aware that if you only include a dataset name in the parameter list to the function call, the entire contents of that dataset will be printed to the screen. If the dataset is large, you may not be able to scroll back to view the first rows of the dataset. Hence, using the first and last options may prove to be more useful.

 

RemoveGene(“data”,“ID”)

e.g. RemoveGene(“slide1.gpr”,“1204”)

 

This function can be used to remove all genes with the specified gene ID from the input dataset (stated inside quotes). Note that replicates of the gene will all be removed (since they all have the same gene ID). The dataset will be permanently changed in the R workspace.  To restore your data to the original you would need to reload it.

 

SaveData(“data”,multivariate)

e.g. SaveData(“saved.data”,FALSE)

 

This function can be used to save a dataset to a file (possibly after some normalisation or averaging procedures have been used). The multivariate parameter should be set to TRUE of FALSE. If the parameter is set to FALSE, the file into which the data is saved will be called new.*.txt (e.g. new.saved.data.txt), and will contain the entire dataset. It will be stored in the R working directory[2]. The file can then be opened outside R using text editor software or in Excel. If the multivariate parameter is set to TRUE, the input dataset should be a file created using the CreateMultivariateData() function. If this is not the case an error will occur. If you are saving multivariate data, two files will be saved with file names new.*.normdiff.txt and new.*.avg.txt  (e.g. new.saved.data.normdiff.txt and new.saved.data.avg.txt). These two files contain the matrix of normalised log(R/G) (multiple slide) data and the (log(R)+log(G))/2 (multiple slide) data, respectively.

 

Datasets that are saved using the SaveData() function (and then subsequently deleted, using the CleanWorkspace() function, for example) can be reloaded into the R workspace using the LoadData() function.

 

LoadData(“data”,multivariate)

e.g. LoadData(“saved.data”,FALSE)

 

The LoadData() function is a general function for loading data that has been saved using the SaveData() function. The data that has been saved will be a text file with no header information, but with column names and row labels. This function looks for the file(s) in the R working directory2. If the file(s) do(es) not exist there, an error will result. The parameter multivariate should be set to FALSE if the saved data was a simple dataset, and should be set to TRUE if the saved data was a multivariate data (generated using the CreateMultivariateData() function (see later text)).

 

If multivariate was set to FALSE when the dataset had been previously saved, the name of the file would have been new.*.txt, where the * was the input data name (e.g. “saved.data”). To reload this data into the R workspace, the “data” parameter in the LoadData() function should be set to the name matching that used in the SaveData() function (in this case, “saved.data”). If multivariate was set to TRUE when the dataset had been previously saved, the names of the two saved files would have been new.*.normdiff.txt and new.*.avg.txt, where the * was the input data name (e.g. “saved.multivariate.data”). To reload these two datasets into the R workspace, the “data” parameter in the LoadData() function should be set to the name matching that used in the SaveData() function (in this case, “saved.multivariate.data”).

 

The LoadData() function will cause errors if you try to use it to load a GenePix file, or most other file types output from array image analysis packages. The load functions designed for specific file types should be used in these instances. The main use of the LoadData() function, in conjunction with the SaveData() function, is to get around some of the memory problems that have been encountered in R. With the increasingly large numbers of slides (and genes) that are being analysed, the memory requirements can be too large to handle. Hence, some of the data may need to be analysed then saved to a file outside R. The R workspace can then be cleaned (using the CleanWorkspace() function, for example) and a new set of data analysed. If previously analysed data is required again it can easily be reloaded into the R workspace.

 

PlotData(type.of.plot,data,“data”,interactive)

e.g. PlotData(1,slide1.gpr,“slide1.gpr”,TRUE)

 

This function can be used to plot the data that has been loaded using one of the load functions. There are 7 options for the plotting, where the parameter type.of.plot can be set to a number between 1 and 7. The types of plots that result are as follows:

1.       R versus G

2.       log(R) versus log(G)

3.       (log(R)+log(G))/2 versus log(R/G)

4.       (log(R)+log(G))/2 versus slide normalised log(R/G)

5.       (log(R)+log(G))/2 versus pin normalised log(R/G)

6.       (log(R)+log(G))/2 versus spatially normalised log(R/G)

7.       log(R-Rbg) versus log(G-Gbg)

8.       (log(R-Rbg)+log(G-Gbg))/2 versus log((R-Rbg)/(G-Gbg))

 

Note that many of the plots are not possible on averaged data. This is because only (slide, pin or spatially) normalised data can be averaged, and the original R, G, log(R), log(G), etc. values do not exist as averaged values. A message will appear if you try to attempt certain unavailable plots for averaged data and no plot will be displayed.

 

The interactive parameter is an optional one, for which the default is set to FALSE. If you call PlotData() with interactive set to TRUE, you will be able to click on points in the plot to label them with their gene ID. To label the point, click on it with the left mouse button. The plot will remain interactive (and you won't be able to use the command line) until you click the right mouse button and select stop. If you happen to choose to kill the interactive plotting function by hitting the ESC key when the command line window is active you will succeed in stopping the interactive nature of the plot window. However, you will also mess up the menus that appear at the top of the R window that allow you to save the plot to a file. You will not be able to rectify this problem until you quit out of R and restart it. Hence, the best way to terminate the interactive plotting window is by using the right mouse button and selecting stop.

 

ImageData(type.of.image,data,“data”)

e.g. ImageData(1,slide1.gpr,“slide1.gpr”)

 

This function can be used to display an image of the data that has been loaded using one of the load functions. There are 7 options for the image display, where the parameter type.of.image can be set to a number between 1 and 7. The types of image displays that result are as follows:

 

1.       an R image and a G image

2.       a log(R) image and a log(G) image

3.       a (log(R)+log(G))/2 image and a log(R/G) image

4.       a (log(R)+log(G))/2 image and a slide normalised log(R/G) image

5.       a (log(R)+log(G))/2 image and a pin normalised log(R/G) image

6.       a (log(R)+log(G))/2 image and a spatially normalised log(R/G) image

7.       a log(R-Rbg) image and a log(G-Gbg) image

8.       a (log(R-Rbg)+log(G-Gbg))/2 image and a log((R-Rbg)/(G-Gbg)) image

 

Note that some of the images are not possible on averaged data, in particular, the images that use R (or R-Rbg) or G (or G-Gbg) data. This is because only (slide, pin or spatially) normalised data can be averaged, and the original R, G, log(R), log(G), etc. values do not exist as averaged values. Similarly, if gene replicates have been averaged, the x and y locations of the spots on the slide become meaningless (because spots from different locations have been averaged). Hence, displaying images of this data cannot be done.  A message will appear if you try to attempt certain unavailable images and no images will be displayed.

 

Note that appropriate colours for the images are used. The colours should be red for R data, green for G data, both red and green for log(R/G) data (and the normalised data) (for which red means higher values in the red channel and green means higher values in the green channel), and grey intensities for the (log(R)+log(G))/2 plots. Note that the colour is meaningless in the images of the average level of expression - it is only the intensity that is important. In the grey images, intensities closer to white have higher values and intensities closer to black have lower values. Pixels that are pure white, in any of the images, represent data that was removed. A spot could be removed if it has a flag of less than –50 (as manually flagged as bad in GenePix) or, for the images involving background subtraction, the spot value was negative after the background subtraction in either of the channels.

 

PlotProfile(("id",multivariate,add,norm,"data1","data2",...)

e.g. PlotProfile(“9999”,FALSE,FALSE,“spat”,“slide1.gpr”,“slide2.gpr”,“slide3.gpr”,“slide4.gpr”)

 

This function can be used to plot designated genes (determined by the input parameter id) across arrays. Only one gene ID can be plotted at a time, but the add parameter allows you to add as many genes as you like to an existing plot. Note that if you choose to add genes to a plot, but you don’t have an existing (corresponding) plot already on the graphics window, you will get some strange looking plot in the window. You can list as many as 100 datasets.

 

You can choose to list the datasets from the plotted data should come, or you can list one multivariate dataset that you have already created using the CreateMultivariateData() function. If you are plotting data from a multivariate dataset, you should set the multivariate parameter to TRUE.

 

The norm parameter determines the type of normalised data that will be plotted. If you are listing the datasets individually in the function call (i.e. multivariate is set to FALSE) you will get an error if you one or more of the listed datasets have not been normalised according to the norm parameter. This parameter is obviously redundant when you are plotting data from a multivariate dataset (since the type of normalised data that exists in the multivariate dataset was determined when you called the CreateMultivariateData() function). However, the parameter must still be included when you call the PlotProfile() function.

 

The first gene profile that is plotted onto a set of axes (i.e. when the add parameter is set to FALSE) will always be in black. As you add more gene profiles to the plot, these will be plotted in randomly selected colours. There is no way to ensure that the colours will not be the same for each gene that you add, because there is no memory of what was plotted onto the graph in the previous function call.

 

If gene replicates exist in the dataset(s) that you are plotting, all of the replicates will be plotted onto the graph.

 

If you make repeated calls to the PlotProfile() function adding genes to the current plot, and you change the list of datasets that you are plotting, then the graph will become messy. That is, labels will be overwritten, or lines may appear to extend off the graph. There is no message printed to the screen to alert you that you are making a mistake. The onus is on you to make sure that you are consistent with the data that you are plotting.

 

CleanWorkspace()

e.g. CleanWorkspace()

 

This function can be used to remove all of the files that are currently sitting in your R workspace, except the functions for tRMA. Be warned - this is a brutal function and will remove all of your current results and data. It should only be used when you have a large amount of data inside R, which may cause R to crash if it runs out of memory.

 

 

 

 

3.2. Averaging datasets

 

MeanData(norm,combine-genes,number,“newdataname”,“data1”,“data2”,...)

e.g. MeanData(“spat”,TRUE,3,“meanslide”,“slide1.gpr”,“slide2.gpr”,“slide3.gpr”)

 

Once all of the relevant slides have been loaded using one of the load functions, this function can be used to average the data from those slides (and/or possibly the gene replicates within those slides) together. The function can compute the mean for 1 slide or as many as 20 slides and will put the resulting values in a new dataset named with the text you have input as newdataname. The number of slides listed in the function call must match the number input as the third parameter in the function call. Note when quotes are needed for the input parameters.

 

The averaging is performed with normalised data, either slide, pin, spatially or housekeeping gene normalised, as given in the norm parameter. (Note that if you have already performed housekeeping gene normalisation using the SlideNormalise() function, you should set norm to “slide”.) Normalisation does not have to have already been performed on the input datasets - the function will check to see whether the appropriate normalisation has already been done and will perform it automatically as necessary. However, only the slide, pin or spatial normalisations can be done automatically. If you would like housekeeping gene normalised data you must have performed the normalisation prior to calling the MeanData() function.

 

Averaging is done with the normalised log(R/G) data (which will, by default, have been rescaled) for each spot, using the conventional method for computing means. The standard deviation of the values for each spot is also computed and listed in one of the columns of the newly created dataset. Averaging is also carried out on normalised log(R/G) data that has not been rescaled, and from this data a set of back-transformed values are computed. The back-transformed value for some number x (which is on the log (base 2) scale) will be 2x. There are complexities associated with performing the back-transformation on the averaged unscaled data, which will not be detailed here. However, the back-transformed values will be roughly on the same scale as the original ratio values and may be used to compare the current results with historical results where only (un-normalised) ratio values are reported.

 

If the combine-genes parameter is set to TRUE, the mean is computed over gene replicates (within a slide) as well as over slide replicates. If the combine-genes parameter is set to FALSE, the individual gene replicates will only be averaged across slides. The identifier of an individual gene replicate in this case is the gene ID and the x and y location of the spot on the slide. Hence, the assumption is that gene replicates are always placed in the same locations on the slide replicates. Note that if the function is called with only one dataset, and the combine-genes parameter is set to TRUE, the result will contain the data from the original dataset with the gene replicates averaged.

 

When gene replicates are not combined (i.e. when the combine-genes parameter is set to FALSE), the output dataset (given as “newdataname”) will contain a column of gene IDs that resemble 1234/2/34. The first set of characters before the / is the gene ID, the second set is the x location of the gene replicate and the third set is the y location of the gene replicate. If the gene replicates are combined, the gene IDs listed in the output dataset are simply the original gene IDs.

 

The minimum flag value out of all of the values averaged for each spot is listed in the dataset (and file) of “interesting genes”. Hence, it is possible to see if an “average” gene selected as “interesting” was influenced by any spots flagged with -50.

 

Recall that the spots flagged as -100 were removed when the data were loaded. Hence, it may be possible to have genes present in one slide replicate but not others. This can cause NAs to appear in the column of standard deviations, even when more than one slide is being averaged (since only one valid value may exist for a particular gene ID).

 

MedianData(norm,combine-genes,number,“newdataname”,“data1”,“data2”,...)

e.g. MedianData(“spat”,TRUE,3,“medslide”,“slide1.gpr”,“slide2.gpr”,“slide3.gpr”)

 

Once all of the relevant slides have been loaded using one of the load functions, this function can be used to compute the median of each of the spots on those slides, across the slides and possibly across the gene replicates within the slides. This function is very similar to the MeanData() function, except that the median is computed instead of the mean.

 

The function can compute the medians for 1 slide or as many as 20 slides, and will put the resulting values in a new dataset corresponding to what was input as newdataname. The number of slides listed in the function call must match the number input as the third parameter in the function call. Note when quotes are needed for the input parameters.

 

The median computation is performed with normalised data, either slide, pin, spatially or housekeeping gene normalised, as given in the norm parameter. (Note that if you have already performed housekeeping gene normalisation using the SlideNormalise() function, you should set norm to “slide”.) Normalisation does not have to have already been performed on the input datasets - the function will check to see whether the appropriate normalisation has already been done and will perform it automatically as necessary. However, only the slide, pin or spatial normalisations can be done automatically. If you would like housekeeping gene normalised data you must have performed the normalisation prior to calling the MedianData() function.

 

The median is computed with the normalised log(R/G) data for each spot, using the conventional method for computing medians. The standard deviation of the values for each spot is also computed and listed in one of the columns of the newly created dataset. Note that the standard deviation is not computed in a robust manner (unlike the median), and will therefore be the same as the standard deviations computed in the MeanData() function. As with the MeanData() function the median is also computed for the normalised log(R/G) data that has not been rescaled, and from this data a set of back-transformed values are computed. The back-transformed value for some number x (which is on the log (base 2) scale) will be 2x. There are complexities associated with performing the back-transformation on the averaged unscaled data, which will not be detailed here. However, the back-transformed values will be roughly on the same scale as the original ratio values and may be used to compare the current results with historical results where only (un-normalised) ratio values are reported.

 

If the combine-genes parameter is set to TRUE, the median is computed over gene replicates (within a slide) as well as over slide replicates. If the combine-genes parameter is set to FALSE, the individual gene replicates will only be averaged across slides. The identifier of an individual gene replicate in this case is the gene ID and the x and y location of the spot on the slide. Hence, the assumption is that gene replicates are always placed in the same locations on the slide replicates. Note that if the function is called with only one dataset, and the combine-genes parameter is set to TRUE, the result will contain the data from the original dataset with the gene replicates averaged.

 

When gene replicates are not combined (i.e. when the combine-genes parameter is set to FALSE), the output dataset (given as “newdataname”) will contain a column of gene IDs that resemble 1234/2/34. The first set of characters before the / is the gene ID, the second set is the x location of the gene replicate and the third set is the y location of the gene replicate. If the gene replicates are combined, the gene IDs listed in the output dataset are simply the original gene IDs.

 

The minimum flag value out of all of the values averaged for each spot is listed in the dataset (and file) of “interesting genes”. Hence, it is possible to see if an “average” gene selected as “interesting” was influenced by any spots flagged with -50.

 

Recall that the spots flagged as -100 were removed when the data were loaded. Hence, it may be possible to have genes present in one slide replicate but not others. This can cause NAs to appear in the column of standard deviations, even when more than one slide is being averaged (since only one valid value may exist for a particular gene ID).

 

3.3. Normalisation Functions

 

SlideNormalise(data,“data”,housekeeping,scale)

e.g. SlideNormalise(slide1.gpr,“slide1.gpr”,FALSE,TRUE)

 

This function will add a column to the input data file (given as data), which represents the slide normalised data. Slide normalisation is performed by robustly[3] fitting a curve (called a “lowess” curve) to the input data (or a subset of the data) as plotted on the (log(R)+log(G))/2 versus log(R/G) axes. This method of fitting a single lowess curve to the data on these axes to centre them was first introduced by Dudoit et al. (1) and Yang et al. (7). Centring of the data is always performed, but scaling (adjusting the variance of the data) is only done if scale is set to TRUE. If no value for scale is given it takes the default value of TRUE.

 

The centring is achieved by fitting the lowess curve and computing the residuals between the data and the curve, thus centring the data about zero. The scaling is achieved by dividing the centred data by a robust estimate of the standard deviation of the data on the (log(R)+log(G))/2 versus log(R/G) axes, called the mean absolute deviation. This gives the data a standard deviation of 1.

 

Note that if the function ListDataAttributes() is now called, the slide normalisation attribute of the dataset should now be “lowess” instead of “none”.

 

The input parameter housekeeping can be set to TRUE or FALSE, depending on whether you would like to perform the normalisation on pre-specified housekeeping genes, or on the entire set of data. If housekeeping is set to TRUE, the lowess curve is fit only to those genes with a * as the last character of their gene ID. Specifying these genes should be done at some stage prior to loading the data into tRMA. When the lowess curve has been fitted it is extrapolated to cover the entire fluorescence range of the input dataset and then used to centre all of the spots in the dataset. If housekeeping is set to FALSE, the lowess curve (or curves, in the case of several spot regions (see below)) is simply fit to all of the points in the input dataset, regardless of their gene ID. If housekeeping is set to TRUE, the specified housekeeping genes are highlighted in the results plots.

 

If you specified “spot regions” in your original dataset (prior to loading it into tRMA), the lowess curve fits will be done to each of the sets of “spot regions”. The identifier can be used to categorise spots for various reasons, however we have provided this facility here primarily because of the differing bias effect of different re-suspension fluids used for the cDNA samples before they were spotted onto the slides. Note that if the “spot regions” are specified in the original data file the labels should be numbers only (for example, with the numbers 1 and 2 for specifying two different re-suspension fluids). If other characters are included they may cause unusual errors in certain functions.

 

This function also generates a further two columns (besides the normalised data) in the input data file (given as data): one that is the normalised log(R/G) data that has not been rescaled, and one that is the unscaled normalised data that has been back-transformed. The back-transformed value for some number x (which is on the log (base 2) scale) will be 2x. The back-transformed values will be roughly on the same scale as the original ratio values and may be used to compare the current results with historical results where only (un-normalised) ratio values are reported.

 

A few words of warning should be issued for the situation when housekeeping genes are being used for normalisation purposes. Housekeeping genes are usually to be employed when the input dataset is a “boutique array”, where a large proportion of the genes on the slide are highly differentially expressed, hence rendering the standard slide normalisation as spurious. If you are trying to normalise a “boutique array”, the housekeeping genes that are used should have ideally been calculated from a pilot set of slides using the SelectHousekeepingGenes() function. This pilot set should be slides whereby only a small proportion of genes are highly differentially expressed (say, less than 20%), and where the probes are “similar” to those used in the boutique array. The definition of the term “similar” is extremely subjective. The idea is that the genes that were selected as housekeeping genes in the pilot slides will have similar expression patterns in the boutique arrays, and hence can be used to perform the normalisation. Vigilance of the user is key here: if genes that were selected as housekeeping genes appear to be highly differentially expressed in the array you are trying to normalise, then those genes should be removed from the housekeeping gene list (i.e. the * removed from their gene ID).

 

The genes that are selected in the SelectHousekeepingGenes() function should span most of the fluorescence range of the pilot slide(s). These genes may not span the entire fluorescence range of the boutique arrays. Because of this, it may be advisable to use negative controls in the boutique arrays, to boost the number and fluorescence range of the genes that are being used as housekeeping genes. As for other types of housekeeping genes, if the negative controls appear to be highly differentially expressed in the array you are trying to normalise, then those genes should be removed from the housekeeping gene list (i.e. the * removed from their gene ID).

 

UNDOSlideNormalise(data,“data”)

e.g. UNDOSlideNormalise(slide1.gpr,“slide1.gpr”)

 

This function will undo slide normalisation on a dataset. If slide normalisation has not already been performed, no change is made to other parts of the dataset. You should not normally need to use this function.

 

PinNormalise(data,“data”,scale)

e.g. PinNormalise(slide1.gpr,“slide1.gpr”,TRUE)

 

This function works in a very similar way to the slide normalisation function. The only difference being that the centring is done with not just a single curve fitted to the data, but a curve fitted to each subset of the data corresponding to each pin. This method of fitting a lowess curve to the data for each pin on these axes was first introduced by Dudoit et al. (1) and Yang et al. (7). This centring technique should remove any effects of the different pins in the differential expression, and also remove some of the spatial effects across the slide.

 

Scaling in this function is done in the same way as in the slide normalisation function. As with slide normalisation, centring of the data is always performed, but scaling is only done if scale is set to TRUE. If no value for scale is given it takes the default value of TRUE.

 

Note that if the function ListDataAttributes() is called after pin normalisation has been done, the pin normalisation attribute of the dataset should now be “lowess” instead of “none”.

 

Note that the pin normalisation can cause blocking effects in the data. These effects can be seen if the data is viewed as an image (similar to the original GenePix images). These effects will not appear if the SpatiallyNormalise() function is used instead.

 

This function also generates a further two columns in the input data file (given as data): one that is the normalised log(R/G) data that has not been rescaled, and one that is the unscaled normalised data that has been back-transformed. The back-transformed value for some number x (which is on the log (base 2) scale) will be 2x. The back-transformed values will be roughly on the same scale as the original ratio values and may be used to compare the current results with historical results where only (un-normalised) ratio values are reported.

 

UNDOPinNormalise(data,“data”)

e.g. UNDOPinNormalise(slide1.gpr,“slide1.gpr”)

 

This function will undo pin normalisation on a dataset. If pin normalisation has not already been performed, no change is made to the dataset.  You should not normally need to use this function.

 

SpatiallyNormalise(data,“data”,scale)

e.g. SpatiallyNormalise(slide1.gpr,“slide1.gpr”,TRUE)

 

This function will add a column to the input data file (given as data), which represents the spatially normalised data. Spatial normalisation is performed by firstly performing slide normalisation, but generating data that is centred only and has not been scaled. (See the entry on the SlideNormalise() function for more details.) The datum for each spot is then represented  as a pixel in an image. That is, the single value for each spot can be given as a single pixel in an image. Any spatial biases in the data can then be picked up. A smoothed version of the image is then created by using what is known as a “median filter”[4], which essentially estimates the general spatial trend over the image. This smoothed image can be subtracted from the original data (when it is represented as an image) to give the new spatially normalised image. Finally, the data may be rescaled, depending on whether scale is set to TRUE (the default value), by dividing the new data by a robust estimate of the standard deviation of the data on the (log(R)+log(G))/2 versus log(R/G) axes, called the mean absolute deviation. This gives the final data a standard deviation of 1.

 

This spatial normalisation method should remove biases in the log(R/G) data due to the average expression level, and due to the spatial location of the spots in the slide. This method is recommended over both the slide and pin normalisation methods.

 

Note that if the function ListDataAttributes() is now called, the spatial normalisation attribute of the dataset should now be “spat” instead of “none”.

 

This function also generates a further two columns in the input data file (given as data): one that is the normalised log(R/G) data that has not been rescaled, and one that is the unscaled normalised data that has been back-transformed. The back-transformed value for some number x (which is on the log (base 2) scale) will be 2x. The back-transformed values will be roughly on the same scale as the original ratio values and may be used to compare the current results with historical results where only (un-normalised) ratio values are reported.

 

The details of this algorithm are discussed in a paper that will be in press shortly (6).

 

UNDOSpatiallyNormalise(data,“data”)

e.g. UNDOSpatiallyNormalise(slide1.gpr,“slide1.gpr”)

 

This function will undo spatial normalisation on a dataset. If spatial normalisation has not already been performed, no change is made to the dataset.  You should not normally need to use this function.

 

SelectHousekeepingGenes(norm,number,"data1","data2",....)

e.g. SelectHousekeepingGenes(“spat”,4,“slide1.gpr”,“slide2.gpr”,“slide3.gpr”,“slide4.gpr”)

 

This function will select a set of genes as “housekeeping genes” from a set of experiments where there are a large number of genes on each slide (say more than 3000), where only a maximum of approximately 10% of the genes on each slide are differentially expressed. The idea is to find genes that have low differential expression after normalisation, and whose variability across the slides is small. These genes could then hopefully be used as housekeeping genes on boutique arrays, where there are a much smaller number of genes on the slides (say, 500) and a large percentage of genes will be differentially expressed. Because of this large percentage, the normalisation functions that are usually used (eg SpatiallyNormalise() ) will not work, and there is a need for the specially selected housekeeping genes. It is assumed that the RNA samples used in the boutique arrays are from similar conditions as those from the slides used to select the housekeeping genes. This means that ideally the low differentially expressed, low variability genes selected in the original slides maintain those properties in the boutique arrays and can therefore be used for normalisation. The selection of housekeeping genes is only made from the set of genes that have a flag (as determined from a software package such as GenePix) greater than –50 in all of the input data sets.

 

The first parameter to the SelectHousekeepingGenes() function is a normalisation parameter, norm, which can be set to “slide”, “pin”, or “spat”. This determines which normalisation is used before selecting the housekeeping genes. The number parameter specifies the number of slides that will be used to determine the housekeeping genes. This number must match the number of data sets specified (in quotes) as the rest of the function parameters. Note that the more slides that are used to select the potential housekeeping genes, the better the genes are likely to be for normalisation genes in future boutique arrays.

 

The method works by firstly normalising all of the input slides according to the normalisation method specified by the norm parameter. This may have already been done, in which case it is not repeated. We know that the bias in the differential expression of a spot is dependent on the “average” fluorescence of that spot. Hence, we need to select future housekeeping genes across the entire range of fluorescence in order to accurately estimate the bias in future boutique arrays. To achieve this, the “average” fluorescence ((log(R)+log(G))/2) range is divided into 20 segments, ranging from the lowest fluorescing spots to the highest fluorescing spots. We will try and select a maximum of 3 housekeeping genes from each segment, to span the range.

 

The genes are selected according to the following criteria. Look at only the genes within a single segment at a time. On the normalised log(R/G) values for these genes compute the mean and standard deviation for each gene across the input slides. Then, for each gene compute a confidence interval for the mean. That is, compute meanħ(1.96*stddev)/number.of.slides for each gene. For the genes in that segment, select only those genes for which the confidence interval encompasses zero (i.e. the lower confidence limit is below zero, and the upper confidence limit is above zero), and then choose the 3 genes with the smallest confidence interval (i.e. the smallest variance). Hence, we’re trying to select 3 housekeeping genes for each fluorescence segment that, after normalisation, have low differential expression and the lowest variance across the input slides. Note that in some segments there may be less than 3 genes selected. This may be because there are simply less than 3 genes in that segment, or that less than 3 genes fit into the criteria of having a confidence interval that encompasses zero. Sometimes there may be no genes that fit the selection criteria (including the restriction that the flag on all spots of each gene must be greater than –50), particularly for segments at very low or very high fluorescence levels.

 

The results of the selection are output to a file in the R working directory called “housekeeping.genes.*.txt”, where the * is replaced by a list of the input slide names. The results are also saved to a dataset in the R workspace called “housekeeping.genes” (which will be overwritten each time you run the SelectHousekeepingGenes() function). This dataset can be viewed using the DisplayData() function.

 

3.4. Detection of differentially expressed genes

 

FindDiffExpGenes(norm,data,“data”,combine-genes,cut-off)

e.g.  FindDiffExpGenes(“spat”,medslides,“medslides”,TRUE,3)

 

The results of this function are saved to two files called “interesting.genes.*.txt” (where the * corresponds to your input data name) and “all.genes.*.txt”. In the “interesting.genes” file only those genes that were found to have high enough differential expression are listed. In the “all.genes” file all genes are listed. In both files there is a column entitled “Significant” which has a YES if a gene was found to be highly differentially expressed and a NO if the differential expression was not high enough. (In the “interesting.genes” files all of the entries in the “Significant” column are YES.) The files are saved into your R working directory. The files will have a date stamp and the name of the dataset from which the results were derived.

 

In the files there are columns for gene names, IDs, normalised (centred, and possibly scaled) log(R/G) values, and the standard deviations of the log(R/G) values, where the log(R/G) values may be the values resulting from an averaging of datasets. If this is the case, the standard deviations column will list the standard deviations of the spot values across the averaged slides and possibly over averaged gene replicates. If the data are from a single slide then the standard deviation column could be NAs (if there are no gene replicates averaged), or the standard deviations over gene replicates with the single slide. Normalisation of the data must be done prior to selecting differentially expressed genes. There is also a column of back-transformed data values, as they were listed in the input file.[5]

 

In each of the output files there is also a column of flags, indicating the flag for each particular spot when it was extracted from the original image file (using some software such as GenePix). The next column shows the spot region for each spot. This data either came from the spot region list added to the original data file by the user (possibly labelling for different re-suspension fluids), or one that was automatically generated when the data file was originally loaded. If the latter is the case then all of the spots will be labelled as coming from the same region. The final column is the “Significant” column.

 

The combine-genes parameter should be set to TRUE or FALSE, depending on whether or not you would like the gene replicates within a slide to be averaged. By “averaged” we imply that the median of the gene replicates for each individual gene ID on the slide will be computed. If you set combine-genes to TRUE and the gene replicates have already been combined, no further averaging will be done. Also, if the dataset given as a parameter for this function is an averaged dataset (i.e. an output from MedianData() or MeanData()) the gene replicates will not be combined, regardless of what value the combine-genes parameter takes. This is because, in order to combine the gene replicates, the information on a spot's location in the slide is required and this data is lost during the MedianData() and MeanData() functions. 

 

The last parameter, cut-off, is an optional manual over-ride parameter. If a suitable value is given in the parameter list only those genes with a (normalised) log(R/G) absolute value greater than the cut-off will be listed in the results file. The value of cut-off is considered suitable if at least one gene has a (normalised) log(R/G) absolute value greater than it. If it is too large an error will occur. If no value for cut-off is given when the function is called, the automatic method of determining the “interesting” genes, as given below, will be used.

 

The gene selection method works by comparing the log(R/G) data (after some form of normalisation) with a normal distribution (with mean 0 and standard deviation 1). Note that, because the data distribution is compared to a distribution with a standard deviation of 1, it is important that scaling is included in the normalisation of the data. The outliers, or highly differentially expressed genes, will make the tails of the distribution of the log(R/G) data appear heavier than the normal distribution. The outliers are removed one by one, starting with the outlier with the largest absolute p-value (i.e. the outlier furthest from the mean, which is zero). As each data value is removed, the data are again compared to the (standard) normal distribution. At some point the distribution of the subset of log(R/G) data will have tails that are not heavy enough when compared to the normal distribution. When this occurs the algorithm terminates, suggesting that all of the data values that have been removed so far can be considered as outliers, and hence represent highly differentially expressed genes.

 

Recall that the spots flagged as -100 were removed when the data were loaded. If the input data for FindDiffExpGenes() is an averaged dataset (from using either MeanData() or MedianData()) it may be possible to have genes in the gene list that were only present in one of the slide replicates. If you are searching for interesting genes in this averaged data you may find that some NAs appear in the column of standard deviations. This can occur even when the data is the average of more than one slide, because only one valid value may exist for a particular gene ID and standard deviations will not be computed over that one value.

 

CompareInterestingGenes(norm,number.to.compare,percent,keep-inconsistent-genes,“data1”,“data2”,...)

e.g.  CompareInterestingGenes(“spat”,3,100,FALSE,“slide1.gpr”,“slide2.gpr”,“slide3.gpr”)

 

This function enables you to compare the results for several datasets from the FindDiffExpGenes() function. Up to 20 sets of results can be compared. Note that the number.to.compare parameter must match the number of datasets listed in the function call.

 

The function FindDiffExpGenes() must have been called prior to calling CompareInterestingGenes(). You will receive an error if you try to run CompareInterestingGenes() without having run the FindDiffExpGenes() function. Similarly, if you have reloaded any data since running FindDiffExpGenes(), running CompareInterestingGenes()on the reloaded data will not execute and you will be asked to re-run the appropriate functions.

 

If you have run the FindDiffExpGenes() function and not combined the gene replicates for data being input into CompareInterestingGenes(), an error will be reported and you will be asked to re-run the FindDiffExpGenes() function. This is because we would like the comparisons between lists of “interesting” genes from different slides to be consistent, and not have some lists which may contain gene replicates and other lists which may have only one instance of a gene ID, which represents an averaged value over gene replicates.

 

The results from the comparisons show the spots that appear in the specified percent of the lists of “interesting genes”. (Note that the percentage given as a parameter when the function is called will be rounded, so that the actual number compared will vary depending on the number of datasets listed.) The percent parameter must be between 0 and 100.

 

The keep-inconsistent-genes parameter can be set to TRUE or FALSE. If it is set to FALSE, if a gene appears as “interesting” in the required number of slides, but the sign of the normalised log ratio changes across the slides, that gene will not be listed in the results file. This might be used if you assumed that the differential gene expression is unstable (with high biological variation) and is not of current interest. If the keep-inconsistent-genes parameter is set to TRUE the genes that are “interesting” in a sufficient number of slides but flip signs across those slides will appear in the results file.

 

The results are saved to two files called “interesting.genes.compared.*.txt” and “all.genes.compared.*.txt” in the R working directory, where the names of the datasets included in the comparisons are inserted instead of the *. The “interesting.genes” file contains only those genes found to be interesting in a large enough number of slides, whereas the “all.genes” slide contains all of the genes that appear in the slides. The files will have a date stamp, a list of all of the files used in the comparisons, and the number of slides in which a gene must be “interesting” before it will be selected in the CompareInterestingGenes() function. The files will also contain columns for the spot IDs, spot names, median average log values, median log ratio values, standard deviations, and the number of slides in which each gene ID was estimated to be “interesting”. In the “interesting.genes.compared.*.txt” file the median values and standard deviations are computed using only those spots deemed “interesting” by the FindDiffExpGenes() function. That is, if 4 slides are to be compared, and the percent is set as 75, then the medians and standard deviations will be computed over 3 or 4 slides (and the gene replicates within those slides), depending on how many times each particular gene ID was estimated to be differentially expressed. In the “all.genes.compared.*.txt” file the median values and standard deviations are computed over all of the slides for which there is a useable spot value for the relevant gene ID. This implies that the medians and standard deviations recorded in the two files may not be exactly the same. There is also a column of flags, indicating the minimum flag for each selected gene (as stated in the output file from a software such as GenePix) across the “interesting.genes.*” files being compared. The final column indicates a YES if the gene ID appeared as “interesting” in enough of the slides and a NO otherwise. Note that all of the entries in this column of the “interesting.genes.compared.*” file will be YES.

 

The output file will also contain a column of back-transformed data values. These values are computed by firstly calculating the median over unscaled normalised log(R/G) values for only those spots for each gene that are deemed “interesting” by the FindDiffExpGenes() function. These median values are then back-transformed:  for some number x (which is on the log (base 2) scale) its back-transformed value will be 2x. There are complexities associated with performing the back-transformation on the averaged unscaled data, which will not be detailed here. However, the back-transformed values will be roughly on the same scale as the original ratio values and may be used to compare the current results with historical results where only (un-normalised) ratio values are reported.

 

The results are also stored as the dataset within R, called “interesting.genes.compared” and “all.genes.compared”, which can be displayed using the DisplayData() function.

 

SaveInterestingGenes(norm,number.to.compare,“data1”,“data2”,...)

e.g.  SaveInterestingGenes(“spat”,3,“slide1.gpr”,“slide2.gpr”,“slide3.gpr”)

 

This function creates an output file of the results for several datasets from the FindDiffExpGenes() function. If a gene has been found to be “interesting” in any of the input datasets it will appear in the output file. Up to 20 sets of results can be compared. Note that the number.to.compare parameter must match the number of datasets listed in the function call.

 

The function FindDiffExpGenes() must have been called prior to calling SaveInterestingGenes(). You will receive an error if you try to run SaveInterestingGenes() without having run the FindDiffExpGenes() function. Similarly, if you have reloaded any data since running the FindDiffExpGenes(), running SaveInterestingGenes() on the reloaded data will not execute and you will be asked to re-run the appropriate functions.

 

If you have run the FindDiffExpGenes() function and not combined the gene replicates for data being input into SaveInterestingGenes(), an error will be reported and you will be asked to re-run the FindDiffExpGenes() function. This is because we would like the comparisons between lists of “interesting” genes from different slides to be consistent, and not have some lists which may contain gene replicates and other lists which may have only one instance of a gene ID, which represents an averaged value over gene replicates.

 

The results from the comparisons show the spots that appear in at least one of the lists of “interesting genes”. The results are saved to a file called “minimal.interesting.genes.*.txt” in the R working directory, where the names of the datasets included in the comparisons are inserted where the * is. The file will contain a date stamp and a list of all of the files used in the comparisons. The file will also contain columns for the spot Ids and a column for each of the datasets being compared. The data in these latter columns are the normalised log(R/G) data for each respective gene ID.

 

The results are also stored as a dataset within R, called “minimal.interesting.genes.compared”, which can be displayed using the DisplayData() function. The results are displayed in the R plot window as a plot of the median data (over all of the datasets being compared), with the gene IDs that are “interesting” in at least one of the slides being plotted in a different colour.

 

3.5. Multivariate functions

 

CreateMultivariateData(norm,number,“newdataname”,“data1”,“data2”,...)

e.g. CreateMultivariateData(“spat”,4,“combined.slides”,“slide1.gpr”,“slide2.gpr”,“slide3.gpr”)

 

This function combines the normalised data columns from the input files into one dataset or appends the relevant columns to the specified existing multivariate dataset. It also combines the average gene expression columns ((log(R)+log(G))/2) from the input files into another dataset (or appends then to an already existing dataset). These two datasets are stored within R and are called “newdataname.normdiff” and “newdataname.avg” respectively. Each of the new/appended datasets also contains a column of gene names and a column of gene IDs. Currently only the data in “newdataname.normdiff” is used in other multivariate analyses in tRMA. However, the “newdataname.avg” dataset is created here for potential use in the future functionality of tRMA. If you have previously created a multivariate dataset of the same name you will be asked if you would like to overwrite it or append to it the data from your input slides. The datasets that are input into the CreateMultivariateData() function can be either original data files (such as a *.gpr or *.dat file), after normalisation, or the output from one of the data averaging functions such as MedianData() or MeanData().

 

The norm parameter must be either “slide”, “pin”, or “spat” depending on the type of normalisation that is desired. The normalisation must have been done prior to calling the CreateMultivariateData() function. If the requested normalisation has not been previously performed an error will result and you will be asked to run the desired normalisation before re-running the CreateMultivariateData() function.

 

The number parameter is the number of datasets that you wish to be combined. It must match the number of dataset names you list when you call the CreateMultivariateData() function. The maximum number of datasets that can be combined is 100. However, this is a constraint we have put into the CreateMultivariateData() function. It is possible that R (or your PC) will not have the memory capacity to cope with this many datasets, and analyses may have to be done separately on subsets of the input files.

 

If you try to create, overwrite, or append to a multivariate dataset whereby there are slides of the same name trying to be listed, you will get an error. All of the slide columns that are put into the resulting multivariate dataset must have unique names.

 

Slides with gene replicates cannot be put into a multivariate dataset. This is because the multivariate dataset will be used in functions that require unique identifiers for the gene IDs. There are two ways around this problem. Both involve either the MedianData() function or the MeanData() function. The gene replicates can be combined using either of these two averaging functions, or either of these two functions can be run on the individual slide, but where the combine-genes parameter is set to FALSE (see the help information on the MedianData() and MeanData() functions). This would result in a new dataset of the same number of rows as the original slide. However, all of the gene IDs would be amended to include the gene ID and the x and y locations of each spot on the slide. In this way, the new dataset could be combined into a multivariate dataset and clustering could be performed on the individual spots (with the gene replicates remaining labelled as those individual spots).

 

If you try to combine columns from two or more slides where there are the same gene IDs included in both the slides, but the corresponding gene names in each of the slides are different, then an error message will be displayed and the CreateMultivariateData() function terminated. When combined data from different slides, the gene-ID/gene-name  pairs in each slide must match exactly.

 

The CreateMultivariateData() function includes the facility to be able to add more columns to an already existing multivariate dataset because of memory problem that have been encountered in R. As the number of slides in your experiment increase, so too do the memory requirements of R for your analyses. What may be necessary is to load and perform the normalisation on some of your datasets, add them to the multivariate dataset and then save that dataset outside R (using the SaveData() function). The R workspace can be cleaned (using the CleanWorkspace() function) and a new set of slides loaded an normalised. The multivariate dataset can then be reloaded in the R workspace (using the LoadData() function) and the new slide data appended to the multivariate dataset. These steps could be repeated as many times as necessary to get all of the required data into a multivariate dataset. This is a bit of a fiddle, but may be necessary under the current memory restrictions on R.

 

ComputeKMeans("data",genes.or.slides,number.of.clusters,

number.of.genes,ordering)

e.g. ComputeKMeans(“combined.slides”,“genes”,5,“compared”,“median”)

 

This function uses the K-means algorithm (3) to compute clusters in the input multivariate dataset. The input dataset must be the name of one that was created using the CreateMultivariateData() function. The genes.or.slides parameter must be set to “genes” or “slides”, depending on what you would like to cluster on. For example, if you are looking to group genes with similar patterns toegther, then you would set genes.or.slides to “genes”. You must specify the desired number of clusters with the number.of.clusters parameter. If the genes.or.slides parameter is set to “genes”, then the number.of.clusters must be set to a number between 1 and the total subset of genes used for clustering (exclusively). Alternatively, if the genes.or.slides parameter is set to “slides”, then the number.of.clusters must be set to a number between 1 and the total number of slides (exclusively). That is, the algorithm will not run if you attempt to put all the samples into one large cluster, or if you attempt to put each sample into an individual cluster.

 

The fourth parameter, number.of.genes, determines the subset of the input data that is clustered upon. The parameter can be set to either a number between 1 and the total number of genes in the input data set (exclusively), the text “compared”, or the text “minimal”. This is similar to the number.of.genes parameter in ComputeHierarchicalClustering(). If number.of.genes is set to “minimal”, the genes used in the clustering are those that were selected in the SaveInterestingGenes() function. That is, these are genes that appeared as “interesting” in at least one of the slides being analysed. If the SaveInterestingGenes() function has not been run, or if it has been run most recently on a different set of slides from those input into the CreateMultivariateData() function,  ComputeHierarchicalClustering() will not run and an error will be returned. If number.of.genes is set to “compared”, the genes used in the clustering are those that were selected in the CompareInterestingGenes() function. That is, these are genes that appeared as “interesting” in a specified percentage of the slides being analysed. If the CompareInterestingGenes() function has not been run, or if it has been run most recently on a different set of slides from those input into the CreateMultivariateData() function,  ComputeHierarchicalClustering() will not run and an error will be returned. If number.of.genes is set to a number, then the median of each of the (absolute normalised log(R/G)) gene values across the slides is computed. The genes are sorted according to these median values, and the top number.of.genes are used for the clustering. The selection of the gene subset according to this parameter is performed regardless of whether you have chosen to cluster on “genes” or on “slides”. There is no parameter to select a subset of the slides, as this would already have been done when the multivariate dataset was created using the CreateMultivariateData() function.

 

To illustrate the results, there are three displays produced, a silhouette plot of the results, a plot of means (centroids) for each of the clusters, and an image with the genes/slides (depending on what you entered for the genes.or.slides parameter) ordered into their respective clusters. The silhouette plot is a bar plot of the estimates of the degree of “clusteredness” of a point within its cluster in relation to the other clusters. Observations with a large silhouette value (almost 1) are very well clustered, observations with a small silhouette value (around 0) means that the observation lies between clusters, and observations with a negative silhouette value have probably been placed in the wrong cluster. When a gene/slide has been clustered into a group with only one element, the silhouette value cannot be computed and will appear as an NaN (“not a number”) in the output file. However, in order to provide suitable values for the silhouette plot, these genes/slides will appear in the plot with a value of 1 and the bar representing them will be coloured black. Note that this situation will generally only occur when the genes.or.slides parameter is set to “slides”. If this does occur when the genes.or.slides parameter is set to “genes”, the entry for these genes in the silhouette plot will be difficult to tell apart from the other genes because of the density of the plot. Note also that in very dense silhouette plots (where there are many values to be plotted) some of the vertical lines may appear to be coloured black, when in fact they are not. This is simply due to the restriction of the resolution of your computer screen. If you are in any doubt as to the silhouette value of an observation, you should check the output file.

 

The final parameter, ordering, is used to order the genes within each cluster in the image that is used to display the results. There are two options: typing the text “median” (in quotes) or typing the text for one of the slide names in quotes (e.g. “slide1.gpr”). An error will be produced if the text you enter does not match any of these options. Either option can be used when you are clustering on “genes”, but only the “median” option can be used when you are clustering on “slides”. If you enter the “median” option, the group of genes/slides (depending on what you entered as the genes.or.slides parameter) in each cluster will be ordered according to increasing median (absolute normalised log(R/G)) values. If you enter one of the slide names, the group of genes in each cluster (since can only cluster on “genes” with this option) will be ordered according to increasing (absolute normalised log(R/G)) values for the specified slide.

 

The results from the clustering algorithm are output to a file in the R working directory called “kmeans.*.txt”, where the * is replaced by the input data name. The file contains columns for gene names, and gene IDs, and a column for the normalised log ratio data for each of the slides involved in the analysis. If the genes.or.slides parameter was set to “genes”, then the last column of the data lists the cluster to which each gene was assigned, and the cluster means are listed in the last rows of the file. If the genes.or.slides parameter was set to “slides”, then the last row of the data lists the cluster to which each slide was assigned, and the cluster means are listed in the last columns of the file. If the genes.or.slides parameter was set to “genes”, then the order of the genes in the output file will be done according to what you specified for the ordering parameter, and the slides will be the same as in the file from which the data was taken (such as the “interesting.genes.compared” file). If the genes.or.slides parameter was set to “slides”, then the order of the slides in the output file will be done according to what you specified for the ordering parameter, and the genes will be the same as in the file from which the data was taken.

 

The K-means algorithm does not work if there are any missing values in the dataset. If the genes.or.slides parameter is set to “genes” the ComputeKMeans() algorithm removes all genes that have at least one missing value across the slides. If the genes.or.slides parameter is set to “slides” the ComputeKMeans() algorithm removes all slides that have at least one missing value across the genes. If it is the case, for example, that every slide has at least one missing value, then all of the slides will be removed from the input dataset. Hence the function would not be able to run. If this occurs you are given the option of removing all of the genes that contain at least one missing value. If you choose this option you would be removing data that could possibly be useful in the clustering algorithm. However, when the alternative is not being able to run the clustering algorithm at all, clustering on the reduced dataset can at least give you some results about your data.

 

The K-means algorithm will return an error if, at any point in the algorithm iterations, it reaches a configuration where there are empty clusters. The ComputeKMeans() function will terminate and no output file will be generated. This problem may possibly be resolved by re-running the algorithm and trying to cluster the data into a smaller number of groups.

 

Note that the K-means algorithm is an iterative algorithm in which the results are initially randomly selected and then optimised over the algorithm iterations. Because of this random initialisation, calling the ComputeKMeans() function repeatedly with the same input parameters may result in varying results over the separate function calls.

 

Occasionally the ComputeKMeans() algorithm will terminate because of a lack of memory in the R workspace. If this happens your first step should be to exit from R, saving your workspace, and then start R up again. You will need to run the SourceFunctions() function before resuming to the ComputeKMeans() function. If this does not work, we would suggest that you manually remove some of the objects stored in R workspace. This can be done by firstly using the ListData() function to find out what objects exist in the workspace, and then running the rm() function (which is a function that exists in R), with the text rm(objectname), where objectname is the name of the object you wish to remove. It is advisable that you do not run the CleanWorkspace() function in this instance, because it will remove all your data from the R workspace, including the multivariate dataset that you are trying to cluster on.

 

The basic idea of the K-means algorithm is to begin by estimating number.of.clusters initial cluster means. The genes/slides are then assigned to the cluster with the nearest mean, and the values of the means are updated according to the current elements in the cluster. The assignment and updating steps are iterated until the means only shift by some minimal amount between iterations.

 

ComputeSOMs("data",number.of.xclusters,number.of.yclusters,

number.of.genes,ordering)

e.g. ComputeSOMs(“combined.slides”,4,5,“compared”,“median”)

 

This function uses the self-organising maps algorithm (5) to compute clusters in the input multivariate dataset. The input dataset must be the name of one that was created using the CreateMultivariateData() function. The clustering is currently only done on the genes. The algorithm clusters the data into groups positioned on a hypothetical two-dimensional grid. The grid has length number.of.xclusters on the horizontal axis and length number.of.yclusters on the vertical axis. The total number of clusters (number.of.xclusters * number.of.yclusters) must be between 1 and the total number of genes (exclusively). That is, the algorithm will not run if you attempt to put all the genes into one large cluster, or if you attempt to put each gene into an individual cluster. If one of either number.of.xclusters or number.of.yclusters is given the value 1, then the clustering results may not be very satisfactory because of the particular structure of the self-organising maps algorithm.

 

The fourth parameter, number.of.genes, determines the subset of the input data that is clustered upon. The parameter can be set to either a number between 1 and the total number of genes in the input data set (exclusively), the text “compared”, or the text “minimal”. This is similar to the number.of.genes parameter in ComputeHierarchicalClustering(). If number.of.genes is set to “minimal”, the genes used in the clustering are those that were selected in the SaveInterestingGenes() function. That is, these are genes that appeared as “interesting” in at least one of the slides being analysed. If the SaveInterestingGenes() function has not been run, or if it has been run most recently on a different set of slides from those input into the CreateMultivariateData() function, ComputeHierarchicalClustering() will not run and an error will be returned. If number.of.genes is set to “compared”, the genes used in the clustering are those that were selected in the CompareInterestingGenes() function. That is, these are genes that appeared as “interesting” in a specified percentage of the slides being analysed. If the CompareInterestingGenes() function has not been run, or if it has been run most recently on a different set of slides from those input into the CreateMultivariateData() function, ComputeHierarchicalClustering() will not run and an error will be returned. If number.of.genes is set to a number, then the median of each of the (absolute normalised log(R/G)) gene values across the slides is computed. The genes are sorted according to these median values, and the top number.of.genes are used for the clustering. The selection of the gene subset according to this parameter is performed regardless of whether you have chosen to cluster on “genes” or on “slides”. There is no parameter to select a subset of the slides, as this would already have been done when the multivariate dataset was created using the CreateMultivariateData() function.

 

To illustrate the results, there are three displays produced, a silhouette plot of the results, a plot of means (centroids) for each of the clusters (arranged in their hypothetical two-dimensional grid), and an image with the genes/slides (depending on what you entered for the genes.or.slides parameter) ordered into their respective clusters. On the plot of the cluster means, there will only be means plotted for clusters that have data in them. The cluster number and the number of elements of each cluster are given at the top of each plot in the two-dimensional hypothetical grid.

 

The silhouette plot is a bar plot of the estimates of the degree of “clusteredness” of a point within its cluster in relation to the other clusters. Observations with a large silhouette value (almost 1) are very well clustered, observations with a small silhouette value (around 0) means that the observation lies between clusters, and observations with a negative silhouette value have probably been placed in the wrong cluster. When a gene/slide has been clustered into a group with only one element, the silhouette value cannot be computed and will appear as an NaN (“not a number”) in the output file. However, in order to provide suitable values for the silhouette plot, these genes/slides will appear in the plot with a value of 1 and the bar representing them will be coloured black. If this occurs the entry for these genes in the silhouette plot will be difficult to tell apart from the other genes because of the density of the plot. Note also that in very dense silhouette plots (where there are many values to be plotted) some of the vertical lines may appear to be coloured black, when in fact they are not. This is simply due to the restriction of the resolution of your computer screen. If you are in any doubt as to the silhouette value of an observation, you should check the output file.

 

The final parameter, ordering, is used to order the genes within each cluster in the image that is used to display the results. There are two options: typing the text “median” (in quotes) or typing the text for one of the slide names in quotes (e.g. “slide1.gpr”). An error will be produced if the text you enter does not match any of these options. If you enter the “median” option, the group of genes in each cluster will be ordered according to increasing median (absolute normalised log(R/G)) values. If you enter one of the slide names, the group of genes in each cluster will be ordered according to increasing (absolute normalised log(R/G)) values for the specified slide.

 

The results from the clustering algorithm are output to a file in the R working directory called “soms.*.txt”, where the * is replaced by the input data name. The file contains columns for gene names, and gene IDs, and a column for the normalised log ratio data for each of the slides involved in the analysis. The last column of the data lists the cluster to which each gene was assigned, and the cluster means are listed in the last rows of the file. The order of the genes in the output file will be done according to what you specified for the ordering parameter, and the slides will be the same as in the file from which the data was taken (such as the “interesting.genes.compared” file).

 

The self-organising maps algorithm does not work if there are any missing values in the dataset. To cope with this the algorithm removes all genes that have at least one missing value across the slides.

 

The self-organising map algorithm is very similar to the basic K-means algorithm: initialise means for each of the clusters, assign genes to the closest cluster, update the cluster means, and iterate until the means are in stable locations. The self-organising maps algorithm is a variant on the K-means algorithm in that the cluster means are adjusted according to a weighting scheme after the update step. The cluster mean adjustment step is included at every iteration.

 

Each cluster label is mapped to a location in a hypothetical two-dimensional grid. At each iteration of the algorithm, a cluster mean is adjusted to be a weighted average of the neighbouring means in hypothetical grid. The weights of the neighbouring means are dependent on the number of elements in the respective cluster and the distance between the clusters in grid. A “cooling schedule” is employed so that the apparent distance between the centroids in the grid increases to infinity, so that the eventual influence of neighbouring clusters tends to zero as the algorithm iterates.

 

The clustering results can be displayed on a plot of the cluster means in the hypothetical two-dimensional grid. This plot is shown as the second display in the graphics window. The plot is important because the algorithm is constructed so that elements in neighbouring clusters (in the grid) should be similar. For example, a data set containing two main classes of genes may be clustered into 20 clusters using self-organising maps. Although the clustering results may suggest that the genes fall into many more than two groups, it may be found that the clusters occur in two “regions” in the grid. For example, there may be 5 clusters containing genes in the top left of the grid and 3 clusters containing genes in the bottom right of the grid. This grouping pattern could only be discovered by plotting the clustering results on the grid. Note that such a grouping is invariant to a swapping of the horizontal and vertical axes of the grid.

 

ComputeHierarchicalClustering(“data”,number.of.clusters,

number.of.genes,ordering)

e.g. ComputeHierarchicalClustering(“combined.slides”,5,“minimal”,“median”)

 

This function uses one of the hierarchical clustering routines provided in the R package, cluster, to perform agglomerative, hierarchical clustering on the genes of a multi-slide dataset. Based on the results of the hierarchical clustering, groups of genes with similar profiles across the slides are determined. The number of groups that are generated is given by the input parameter, number.of.clusters. The input dataset upon which the clustering is performed is given in the parameter data. This dataset must have been constructed using the CreateMultivariateData() function. Note that there is currently only functionality to cluster on genes. Clustering on slides in this function has not yet been implemented.

 

The third parameter, number.of.genes, can take one of three different types: the text “minimal”, the text “compared”, or a number between 1 and the total number of genes on the slide. However, choosing a large number of genes (say, over 1000), will possibly crash the function because R may run out of memory. If number.of.genes is set to “minimal”, the genes used in the clustering are those that were selected in the SaveInterestingGenes() function. That is, these are genes that appeared as “interesting” in at least one of the slides being analysed. If the SaveInterestingGenes() function has not been run, or if it has been run most recently on a different set of slides from those input into the CreateMultivariateData() function,  ComputeHierarchicalClustering() will not run and an error will be returned. Note that if the number of genes selected in the SaveInterestingGenes() function is too large, again the function may crash because of memory problems. If number.of.genes is set to “compared”, the genes used in the clustering are those that were selected in the CompareInterestingGenes() function. That is, these are genes that appeared as “interesting” in a specified percentage of the slides being analysed. If the CompareInterestingGenes() function has not been run, or if it has been run most recently on a different set of slides from those input into the CreateMultivariateData() function,  ComputeHierarchicalClustering() will not run and an error will be returned. If number.of.genes is set to a number, then the median of each of the (absolute normalised log(R/G)) gene values across the slides is computed. The genes are sorted according to these median values, and the top number.of.genes are used for the clustering.

 

The results are output into a file called “hier.clust.*.txt”, where * is replaced by the text input as the “data” parameter. Listed in the results file are the gene IDs and names for the genes that were used for the clustering, their normalised values for each of the slides involved in the analysis, and a cluster label indicating which group each gene was assigned to. At the bottom of the file the means (centroids) for each cluster is given. That is, the mean (a set of numbers, where there is a number for each slide) for each cluster is computed across all of the genes assigned to that cluster.

 

To illustrate the results there are four displays produced, a dendogram of the hierarchical clustering outcome, a silhouette plot of the results, a plot of means (centroids) for each of the clusters, and an image with the genes ordered into their respective clusters. If there are many genes used in the analysis the dendogram may be fairly unreadable. The silhouette plot is a bar plot of the estimates of the degree of “clusteredness” of a point within its cluster in relation to the other clusters. Observations with a large silhouette value (almost 1) are very well clustered, observations with a small silhouette value (around 0) means that the observation lies between clusters, and observations with a negative silhouette value have probably been placed in the wrong cluster. When a gene has been clustered into a group with only one element, the silhouette value cannot be computed and will appear as an NaN (“not a number”) in the output file. However, in order to provide suitable values for the silhouette plot, these genes will appear in the plot with a value of 1 and the bar representing them will be coloured black. If this occurs the entry for these genes in the silhouette plot will be difficult to tell apart from the other genes because of the density of the plot. Note also that in very dense silhouette plots (where there are many values to be plotted) some of the vertical lines may appear to be coloured black, when in fact they are not. This is simply due to the restriction of the resolution of your computer screen. If you are in any doubt as to the silhouette value of an observation, you should check the output file.

 

The final parameter, ordering, is used to order the genes within each cluster in the image that is used to display the results. There are three options: leaving the parameter out (i.e. not putting anything), typing the text “median” (in quotes) or typing the text for one of the slide names in quotes (e.g. “slide1.gpr”). An error will be produced if the text you enter does not match any of these options. If you leave the parameter blank, the ordering of the genes in the image will be the same as the ordering of the genes in the dendogram. If you enter the “median” option, the group of genes in each cluster will be ordered according to increasing median (absolute normalised log(R/G)) values. If you enter one of the slide names, the group of genes in each cluster will be ordered according to increasing (absolute normalised log(R/G)) values for the specified slide.

 

ComputeBiplot(“data”,genes.or.slides,number.of.clusters,number.of.genes)

e.g. ComputeBiplot(“combined.slides”,“genes”,5,“minimal”)

 

This function produces a display known as a “biplot” (2) which can be used to examine your data. The biplot is a method for displaying the results of a principal components analysis. This analysis is a transformation of the data so that, instead of slides (if genes.or.slides is set to “genes”), you have a set of “new samples” in which the maximal variance across the genes (within a slide/“new sample”) is captured in the first “new sample”, the next largest amount of variance across the genes is captured in the second “new sample”, etc. There are as many “new samples” as there were original slides (again, if genes.or.slides is set to “genes”) in the multivariate data set. However, only the data points in the first two “new samples” are displayed in the biplot.

 

If genes.or.slides is set to “slides” the analysis is a transformation of the data so that, instead of genes you have a set of “new genes” in which the maximal variance across the slides is captures in the first “new gene”, the next largest amount of variance across the slides is captured in the second “new gene”, etc. There are as many “new genes” as there were original genes in the multivariate data set. However, only the data points in the first two “new genes” are displayed in the biplot.

 

The biplot is a plot both of the results of the transformation and of how the transformation itself was done. The horizontal axis at the bottom of the biplot represents the first “new sample” (if genes.or.slides was set to “genes”, or “new gene” if genes.or.slides was set to “slides), and the vertical axis at the left of the biplot represents the second “new sample” (or “new gene”). On these axes the transformed genes (or slides) are plotted for these two “new samples” (or “new genes”), or “principal components”. There are two biplots displayed using the ComputeBiplot() function. The left plot has the genes (slides) labelled with their gene IDs (slide names). For the right hand plot, the genes (slides) (before the principal components transformation) have been clustered according the clustering method described in the ComputeKMeans() function. The clustering labels assigned to each gene (slide) are then used as the labels on the biplot. Note that, because the biplot represents the genes (slides) in a way that maximises the variance across them, the gene (slide) clusters can sometimes be seen quite clearly, particularly when there are only a small number of genes (slides) and a small number of clusters. Note also that the two biplots have exactly the same points plotted on them. However, they can sometimes look different (eg more dense points on the left plot) because the labels of the points on the left plot are larger than those on the right plot.

 

The other feature of the biplot is the vectors, which appear in red and are labelled with the slide names (gene IDs). The value on the top horizontal axis of a vector for each slide (gene) shows the weight of that slide (gene) that was used to transform the slide (gene) values to their first “new sample” (“new gene”. The value on the right vertical axis of a vector for each slide (gene) shows the weight of that slide (gene) that was used to transform the values to their second “new sample” (“new gene”). Hence, if a slide (gene) vector has a large value for the horizontal axis it means that the variability across the genes (slides) for that slide (gene) was large, and therefore that slide (gene) had a large weight in the transformation of the data points to their first “new sample” (“new gene”). If a slide (gene) vector has a small value for the horizontal axis it means that the variability across the genes (slides) for that slide (gene) was small, and therefore that slide (gene) only had a small weight in the transformation of the data points to their first “new sample” (“new gene”). A similar description can be given for the values of the slide (gene) vectors on the right vertical axis.

 

In the “new sample” (“new gene”) transformation, the largest variance across the genes (slides) is captured in the first “new sample” (“new gene”) (which is a combination of all the slides (genes), the weights of each slide (gene) being displayed by the vectors on the biplots). The percentage of the total variance across the genes (slides) that is captured in the first “new sample” (“new gene”) after the transformation is printed to the R screen. Similarly, the percentage of the total variance that is captured in the second “new sample” (“new gene”) after the transformation is also printed to the R screen.

 

The input dataset upon which the clustering is performed is given in the parameter data. This dataset must have been constructed using the CreateMultivariateData() function. The number of groups that are generated and used to assign gene labels for the second biplot is given by the input parameter, number.of.clusters. Note that there is only functionality to cluster on genes.

 

The third parameter, number.of.genes, can take one of three different types: the text “minimal”, the text “compared”, or a number between 1 and the total number of genes on the slide. However, choosing a large number of genes (say, over 1000), will possibly crash the function because R may run out of memory. If number.of.genes is set to “minimal”, the genes used in the clustering are those that were selected in the SaveInterestingGenes() function. That is, these are genes that appeared as “interesting” in at least one of the slides being analysed. If the SaveInterestingGenes() function has not been run, or if it has been run most recently on a different set of slides from those input into the CreateMultivariateData() function,  ComputeHierarchicalClustering() will not run and an error will be returned. Note that if the number of genes selected in the SaveInterestingGenes() function is too large, again the function may crash because of memory problems. If number.of.genes is set to “compared”, the genes used in the clustering are those that were selected in the CompareInterestingGenes() function. That is, these are genes that appeared as “interesting” in a specified percentage of the slides being analysed. If the CompareInterestingGenes() function has not been run, or if it has been run most recently on a different set of slides from those input into the CreateMultivariateData() function,  ComputeHierarchicalClustering() will not run and an error will be returned. If number.of.genes is set to a number, then the median of each of the (absolute normalised log(R/G)) gene values across the slides is computed. The genes are sorted according to these median values, and the top number.of.genes are used for the clustering.

 

Currently no results from the ComputeBiplot() function are saved to a file.

 

 

 

 

Reference List

 

   1.   Dudoit, S., Yang, Y. H., Callow, M. J., and Speed, T. P. Statistical methods for identifying expressed genes in replicated cDNA microarray experiments.  2001.
Notes: Technical report 578, Stanford University.

   2.   Gabriel, K. R. (1971) Biometrika 58, 453-467.

   3.   Hartigan, J. A. & Wong, M. A. (1979) Applied Statistics 28, 100-108.

   4.   Ihaka, R. & Gentleman, R. (1996)  Journal of Computational and Graphical Statistics 5, 299-314.

   5.   Kohonen, T. (1990) Proceedings of the IEEE 78, 1464-1480.

   6.   Wilson, D. L. , Buckley, M., Helliwell, C., and Wilson, I. New normalization methods for cDNA microarray data.  2002.
Notes: Submitted for publication.

   7.   Yang, Y. H., Dudoit, S., Luu, P., Lin, D. M., Peng, V., Ngai, J., & Speed, T. P. (2002) Nucleic Acids Research 30, e15.

 



[1] However, note that using a cut-off is provided as an option.

[2] The working directory can be reset by specifying the directory of your choice in the Target box in the R properties dialog box.

[3] Robust estimation implies that the estimation method will not be greatly affected by outliers.

[4] The size of the median filter is currently set to a 3*3 window.

[5] The back-transformed value for some number x (which is on the log (base 2) scale) will be 2x. The back-transformed values will be roughly on the same scale as the original ratio values and may be used to compare the current results with historical results where only (un-normalised) ratio values are reported.