February 21st, 2002
İ COPYRIGHT Commonwealth Scientific and
Industrial Research Organisation (CSIRO), Australia 2001.
All rights in this software and its
documentation are reserved to CSIRO.
You are only permitted to have this software and documentation in your
possession and to make use of it if you have agreed to a Licence Agreement with
CSIRO.
All enquiries for the use of this software
should be referred to: trma@cmis.csiro.au
This document outlines the use and details of the functions
contained in tRMA (tools for R Microarray Analysis). The document consists of a
broad overview of the current functionality of tRMA, installation
instructions, a tutorial using the example datasets provided with the software,
and details of the individual functions available in tRMA.
The functions in tRMA currently allow you to load certain
types of (cDNA) microarray data files, examine the data through display and
plot functions, normalise the data, average the data, and extract genes that
are highly differentially expressed. The functions run inside a freely
available statistical software packaged called R (4), which must be installed separately by
you. The R software package can be downloaded from http://www.r-project.org/. Note that, in order to run tRMA
you must also download and install the cluster package, the multiv
package, and the GeneSOM package. All of these are available from pages
linked to from the R home page. (The links you go to from the r-project
page are CRAN, Precompiled Binary
Distributions Windows, and then contrib. Then look for the appropriate zip
files.). Once you have downloaded cluster.zip, multiv.zip and GeneSOM.zip
you need to install them by double clicking on the executable file rwinst.exe.
You would have needed to use this file when you installed R on your PC,
so it should still be floating around your directories somewhere. You need to
use the add a package option when the executables running, and select the
library directory inside the R directory structure on your C drive as
the place to install the two packages. Note that you should not unzip the two
packages yourself they will be unpacked correctly by rwinst.exe.
tRMA
currently allows you to load three different formats of data, gpr files created
using GenePix image analysis software, text files downloaded from the Stanford
microarray database, and dat files created using the Spot image analysis
software. There is provision made to cope with slides in which the dyes had
been swapped. When the GenePix files are created (inside GenePix) there is a
facility for you to enter a list of gene names and gene IDs. These two columns
must be present in the gpr files loaded into tRMA for tRMA to
work correctly. The gene names may or may not be known at the time the gpr file
is created. The gene IDs should be generated by you inside GenePix and can
consist of both letters and numbers. You should give each spot that came from
the same cDNA aliquot the same gene ID. Spots with the same gene ID, both
within and across slides, may be combined in the tRMA software. Hence,
if two spots did not originate from the same aliquot of cDNA they should
have different gene IDs. For other information on the formats of gpr files,
Stanford database files and the Spot dat files please refer to documents
specifically related to those file types.
The functions available in tRMA for displaying data include
the following:
·
plotting the
data on various types of axes (e.g. R versus G, log(R) versus log(G), etc),
with an optional interactive feature where gene IDs for individual selected
points on the plot can be displayed;
·
displaying
images of the data on various axes, where each pixel in the image represents
the individual data for each spot; and
·
printing
subsets of the data to the screen for examination of individual spot values.
There are currently four different normalisation methods available
in tRMA, slide normalisation, pin normalisation, spatial normalisation,
and a normalisation based on housekeeping genes. All of the methods perform
non-linear normalisation for an individual slide by removing biases in
differential gene expression within that slide due to average expression level
and, in the pin and spatial methods, biases due to the physical location of
each spot on the slide. The resulting dataset will be centred and possibly scaled.
The slide, pin and spatial normalisation methods require each
slide to contain a large proportion of the genes to be non-differentially
expressed. This proportion may be of the order of 80% of the spots on the
slide. This is because these normalisation methods estimate the biases from the
levels of differential expression in the mass of non-differentially expressed
genes. Hence, these methods will not work on boutique arrays where there are
only a small number of genes on each slide (300 perhaps), all of which have
been selected because they are highly differentially expressed. A further
requirement on the use of the pin and spatial methods is that the highly
differentially expressed genes should be spaced randomly across the slide. This
is difficult to do accurately, because for most of the genes on the slide you
will not know in advance what their differential expression is likely to be.
However, some effort must be made to avoid having all of the differentially
expressing genes clustered together in one (or more) physical regions on the
slide(s). If these genes are clumped together they will introduce biases into
the normalisation procedures and final results may be skewed. The housekeeping
gene normalisation is specifically designed for use on specialist, boutique
arrays where most of the genes on the slide are expected to be highly
differentially expressed.
Two functions are provided in tRMA for averaging cDNA
microarray data. You may wish to average over gene replicates (specified as
spots with the same gene IDs) and/or over slide replicates. The averaging is
performed over normalised data only, so that biases are removed prior to
averaging.
A function is available in tRMA for determining highly
differentially expressed genes within an individual slide (which may be the
result of averaging over several slides). The selection method is an empirical
one, based on the dataset being analysed (i.e. no pre-specified cut-off is used[1]), using assumptions on the distribution
of the data after normalisation has taken place. Results from the selection are
output to files.
Comparison of results from selecting highly differentially
expressed genes across several slides is possible within tRMA, whereby
genes that are found to be highly differentially expressed in a
(user-)specified percentage of a set of slides can be found. Results from this
selection are also output to files.
There are several multivariate functions available in tRMA,
including principal components analysis, hierarchical clustering and K-Means
clustering.
To install the tRMA functions into R you must do the
following:
1.
Save the
zipped file in a directory of your choice.
2.
Unzip the
files.
3.
Start R.
4.
From inside
R, source the function SourceFunctions.r by either clicking on the yellow
(open folder) button on the tool-bar, or selecting Source R code from the
File menu. You may need to browse through your directory structure to find
the SourceFunctions.r file.
5.
On the
command line type SourceFunctions("C:/functions/"), where the text inside the quotes will
correspond to the directory where you have unzipped the tRMA
functions.
When you have completed these steps you should be able to use any
of the functions described below in this help document.
To get a statement of how to use a particular function you can
type the function name followed by left and right round brackets (with no
spaces). Note that there is no usage statement for functions that take no
parameters. (However, these functions are usually very simple to use.)
Note that, because when you type SourceFunctions("C:/functions/") on the command line, the two
required libraries (multiv and cluster) get loaded, you must type
SourceFunctions("C:/functions/") on the command line every time you start a new R
session. If you dont
rerun the SourceFunctions() command in every new R session, none
of the clustering functions in tRMA will work.
A number of functions provided in tRMA save results to a directory on you hard drive (or local network). You can specify which directory these results go to by performing the following steps:
Strange errors will occur if your data file names begin with a
number. Please change all data file names so that they start with a letter. The
filenames can contain numbers, but just not start with one. Also, do not have
spaces or the underscore character (_) in file names.
You should close all Excel files that you think may need to be
overwritten by any of the R functions. An example might be the interesting.genes.compared.txt
file, which often gets written to by the analysis functions. If you do have a
file open in Excel and you run an R function that tries to write to that file,
the function will terminate with an error and the results will not be saved. If
this happens, you should close the relevant file in Excel and try running the R
function again.
This tutorial outlines a sample analysis that might be done on the
example datasets that were provided with this software and documentation,
slide1.gpr, slide2.gpr, slide3.gpr and slide4.gpr. Specific details of the
parameter settings for each function are not given here and can be found in the
latter sections of this help document.
1. Firstly you would want load the data
files:
Ĝ LoadGenePixFile("D:/tRMA/slide1.gpr",F,4,4)
Ĝ LoadGenePixFile("D:/tRMA/slide2.gpr",F,4,4)
Ĝ LoadGenePixFile("D:/tRMA/slide3.gpr",T,4,4)
Ĝ LoadGenePixFile("D:/tRMA/slide4.gpr",T,4,4)
Note that the location of the files will
depend on where you have stored the example datasets. The last two slides,
slide3.gpr and slide4.gpr, have had their dyes swapped, so the swap.dyes
parameter has been set to TRUE for these two datasets. (N.B. T/F can be used
instead of TRUE/FALSE.) An extra column called Regions has been manually added
(in Excel) to slide3.gpr. Details of what this column represents are given in
the explanation of the LoadGenePixFile()function.
The column is given in one of the example files to illustrate what such a
column could look like.
2.
After
loading the data, you may want to examine it using some of the display
functions:
Ĝ PlotData(3,slide1.gpr,"slide1.gpr",T)
Ĝ ImageData(3,slide3.gpr,"slide3.gpr")
Ĝ DisplayData("slide2.gpr",ID="1226")
Note both the dependence of the
differential expression of average expression (which can be seen in the plot)
and the dependence of the differential expression on spatial location (which
can be seen in the images).
3.
The next
step would be to normalise the data. Here we use the spatial normalisation
method, although similar calls would be made if slide or pin normalisation were
desired:
Ĝ SpatiallyNormalise(slide1.gpr,"slide1.gpr",T)
Ĝ SpatiallyNormalise(slide2.gpr,"slide2.gpr",T)
Ĝ SpatiallyNormalise(slide3.gpr,"slide3.gpr",T)
Ĝ SpatiallyNormalise(slide4.gpr,"slide4.gpr",T)
4.
The example
datasets each contain two replicates of each of 3872 gene, as determined by the
gene ID. You may want to generate another dataset (med.slide1) that contains
specified information for a slide (e.g. median)within which the gene replicate
data have been combined:
Ĝ MedianData("spat",T,1,"med.slide1","slide1.gpr")
Alternatively, you may want to generate a
new dataset (medall) that is the average (e.g. median) over all of the
example slide replicates:
Ĝ MedianData("spat",T,4,"medall","slide1.gpr","slide2.gpr","slide3.gpr","slide4.gpr")
Here we are putting the averaged result in
an object called medall, which can be plotted and displayed like other
datasets. In this example averaging is performed over gene replicates and slide
replicates.
5.
Having
performed some form of normalisation on the data, you would now be able to
generate a list of highly differentially expressed genes, or interesting
genes, for each individual slide:
Ĝ FindDiffExpGenes("spat",slide1.gpr,"slide1.gpr",T)
Ĝ FindDiffExpGenes("spat",slide2.gpr,"slide2.gpr",T)
Ĝ FindDiffExpGenes("spat",slide3.gpr,"slide3.gpr",T)
Ĝ FindDiffExpGenes("spat",slide4.gpr,"slide4.gpr",T)
The results are stored within R and also
output to files. The text written to the screen as the function is completed
states how to access these results.
You may also want to look for
interesting genes in the dataset that you created from the average of the
four example slides:
Ĝ FindDiffExpGenes("spat",medall,"medall",T)
6.
After
finding interesting genes in the four individual slides, you may want to
compare the results, to see how many genes appeared as interesting in at least
75% (for example) of your slides. This can be done as follows:
Ĝ CompareInterestingGenes("spat",4,75,T,"slide1.gpr","slide2.gpr","slide3.gpr","slide4.gpr")
You may also want to compare the outputs
from the previous function with the genes that were found to be interesting in
the averaged dataset. This can be done by displaying each of the results to the
screen:
Ĝ DisplayData("interesting.genes.compared")
|
9 |
8 |
7 |
6 |
5 |
4 |
3 |
2 |
1 |
|
|
910 |
2840 |
800 |
206 |
91 |
459 |
3430 |
1582 |
2275 |
ID |
|
unknown |
unknown |
unknown |
unknown |
unknown |
unknown |
unknown |
unknown |
unknown |
Name |
|
8.475254 |
8.707975 |
7.261781 |
7.950969 |
8.494704 |
11.830051 |
8.373859 |
11.088503 |
9.891909 |
MedianAvgLogRLogG |
|
-6.665852 |
-5.002861 |
-4.924113 |
-4.564483 |
-3.769152 |
5.881142 |
6.168855 |
7.022197 |
8.199394 |
MedianSpatNormLogRdivG |
|
3.801347 |
7.869473 |
4.958922 |
4.845781 |
1.913393 |
1.751360 |
3.677305 |
7.453606 |
5.552771 |
SpatNormStdDev |
|
0.4215012 |
0.3918926 |
0.6208846 |
0.6632662 |
0.5960748 |
1.7163293 |
2.6460047 |
3.5544505 |
4.3003407 |
MedianSpatNorm.BacktoRatio |
|
3 |
4 |
3 |
4 |
3 |
3 |
4 |
4 |
3 |
NumberOfSlides |
|
YES |
YES |
YES |
YES |
YES |
YES |
YES |
YES |
YES |
FrequentlyInteresting |
Ĝ DisplayData("interesting.genes.medall")
|
|
ID |
Name |
SpatNormLogRdivG |
StdDev |
SpatNorm.BacktoRatio |
Flags |
Regions |
Significant |
|
1 |
2275 |
unknown |
7.054288 |
5.690537 |
3.2354274 |
0 |
1 |
YES |
|
2 |
1582 |
unknown |
7.022197 |
7.453606 |
3.5544505 |
0 |
1 |
YES |
|
3 |
3430 |
unknown |
6.168855 |
3.677305 |
2.6460047 |
0 |
1 |
YES |
|
4 |
800 |
unknown |
-4.924113 |
4.750433 |
0.6208846 |
0 |
1 |
YES |
|
5 |
2840 |
unknown |
-5.002861 |
7.869473 |
0.3918926 |
0 |
1 |
YES |
|
6 |
910 |
unknown |
-5.523238 |
4.279068 |
0.4997441 |
0 |
1 |
YES |
That concludes the tutorial. There are useful functions available
in tRMA that were not included in this tutorial, so it would be worth
reading the Function Specifications section to find out what other features are
available to you.
There is often some text and plots output to the console and
graphics window when some of the functions are called, and it is worth
examining this information when you perform analyses. You will not need to read
most of the text after you have used the programme a few times.
e.g. LoadGenePixFile("D:/trma/slide1.gpr",TRUE,4,4)
This function is used to load GenePix files (*.gpr files) into R.
The swap.dyes parameter should be TRUE if you would like the dyes to be
swapped, and FALSE otherwise. The parameters num.blocks.cols and num.blocks.rows
should be given as the number of columns of blocks (i.e. columns of print tips)
and the numbers of rows of blocks (i.e. rows of print tips).
Note that spots that were manually flagged as bad
in Genepix (flagged as -100 in flag column of the gpr file) are removed at this stage.
This load function either (a) reads in the last column of the data
file as the spot regions, which have been manually entered by the user prior
to starting tRMA, or (b) it adds
a spot regions column to the data as it is loaded, so that all of the spots
fall into the one region. The spot region labels that could have been manually
entered into the original data file can be used as labels of different
re-suspension solutions for the cDNA clones (for example). The labels may be 1's and 2's, for example.
Note that the labels must be numbers. If they are not, errors may occur in
later analyses. These labels do not have to be put into the data file by the
user prior to loading the data into tRMA. If there are no labels, all
spots are automatically labelled as being in the same region. This labelling is
used later in the SlideNormalise()
function.
e.g. LoadStanfordFile("D:/trma/SMD1234.txt",TRUE,4,8)
This function is used to load files down-loaded from one of the
Stanford databases into R. The swap.dyes parameter should be TRUE if you
would like the dyes to be swapped, and FALSE otherwise. The parameters num.blocks.cols
and num.blocks.rows should be given as the number of columns of blocks
(i.e. columns of print tips) and the numbers of rows of blocks (i.e. rows of
print tips).
Note that spots that were manually flagged as bad
in Genepix (flagged as -100 in flag column of the gpr file) are removed at this stage.
This load function either (a) reads in the last column of the data
file as the spot regions, which have been manually entered by the user prior
to starting tRMA, or (b) it adds a spot regions column to the data as
it is loaded, so that all of the spots fall into the one region. The spot
region labels that could have been
manually entered into the original data file can be used as labels of
different re-suspension solutions for the cDNA clones (for example). The labels may be 1's and 2's, for example.
Note that the labels must be numbers. If they are not, errors may occur in
later analyses. These labels do not have to be put into the data file by the
user prior to loading the data into tRMA. If there are no labels, all
spots are automatically labelled as being in the same region. This labelling is
used later in the SlideNormalise()
function.
e.g. LoadSpotFile("D:/trma/slide10.dat",TRUE,4,4)
This function is used to load Spot files (*.dat files) (Spot being
a microarray image analysis package) into R. The swap.dyes parameter
should be TRUE if you would like the dyes to be swapped, and FALSE otherwise.
The parameters num.blocks.cols and num.blocks.rows should be
given as the number of columns of blocks (i.e. columns of print tips) and the
numbers of rows of blocks (i.e. rows of print tips).
Note that spots that were manually flagged as bad
in Spot (flagged as 1 in badspot column of the dat file) are removed at this stage.
This load function adds a spot regions column to the data as it
is loaded, so that all of the spots fall into the one region. This labelling is
used later in the SlideNormalise()
function.
e.g. ListDataAttributes(slide1.gpr)
This function can be used to print out the normalisation
attributes of a particular dataset (one that has either been loaded, or is the
result of using the MeanData()
or MedianData() functions).
The slide, pin and spatial normalisation attributes will be listed. The first
two will be either none or lowess depending on whether the respective
normalisation procedure has already been called, and the spatial normalisation
attribute will be either none or spat.
e.g. ListData()
This function can be used to list all of the current data objects
stored in the R workspace. Note that all of the tRMA functions are also
contained in the R workspace, but these will not be listed by the ListData() function.
e.g. DisplayData(slide1.gpr,1,10,ID=2037)
This function can be used to display a (single) aded using the LoadGenePixFile() function, or a multivariate dataset
generated using the CreateMultivariateData() function. The parameters first,
last and ID are all optional, with the default values being the
first row of the dataset, the last row of the dataset, and all of the genes in
the dataset (i.e. all of the IDs), respectively. If you choose to put in a value for ID, but leave first
and last as their default values, you must include ID= in the
command line call to the function (e.g. DisplayData(slide1.gpr,ID=1204)).
Be aware that if you only include a dataset name in the parameter
list to the function call, the entire contents of that dataset will be printed
to the screen. If the dataset is large, you may not be able to scroll back to
view the first rows of the dataset. Hence, using the first and last
options may prove to be more useful.
e.g. RemoveGene(slide1.gpr,1204)
This function can be used to remove all genes with the specified
gene ID from the input dataset (stated inside quotes). Note that replicates of
the gene will all be removed (since they all have the same gene ID). The
dataset will be permanently changed in the R workspace. To restore your data to the original
you would need to reload it.
e.g. SaveData(saved.data,FALSE)
This function can be used to save a dataset to a file (possibly
after some normalisation or averaging procedures have been used). The multivariate
parameter should be set to TRUE of FALSE. If the parameter is set to FALSE, the
file into which the data is saved will be called new.*.txt (e.g. new.saved.data.txt),
and will contain the entire dataset. It will be stored in the R working
directory[2]. The file can then be opened outside R using text editor software
or in Excel. If the multivariate parameter is set to TRUE, the input
dataset should be a file created using the CreateMultivariateData()
function. If this is not the case an error will occur. If you are saving
multivariate data, two files will be saved with file names new.*.normdiff.txt
and new.*.avg.txt (e.g. new.saved.data.normdiff.txt
and new.saved.data.avg.txt). These two files contain the matrix of
normalised log(R/G) (multiple slide) data and the (log(R)+log(G))/2 (multiple
slide) data, respectively.
Datasets that are saved using the SaveData() function (and
then subsequently deleted, using the CleanWorkspace() function, for
example) can be reloaded into the R workspace using the LoadData()
function.
e.g. LoadData(saved.data,FALSE)
The LoadData() function is a general function for loading
data that has been saved using the SaveData() function. The data that
has been saved will be a text file with no header information, but with column
names and row labels. This function looks for the file(s) in the R working
directory2. If the file(s) do(es) not exist there, an
error will result. The parameter multivariate should be set to FALSE if
the saved data was a simple dataset, and should be set to TRUE if the saved
data was a multivariate data (generated using the CreateMultivariateData()
function (see later text)).
If multivariate was set to FALSE when the dataset had been
previously saved, the name of the file would have been new.*.txt, where
the * was the input data name (e.g. saved.data). To reload this data into the
R workspace, the data parameter in the LoadData() function should be
set to the name matching that used in the SaveData() function (in this
case, saved.data). If multivariate was set to TRUE when the dataset
had been previously saved, the names of the two saved files would have been new.*.normdiff.txt
and new.*.avg.txt, where the * was the input data name (e.g.
saved.multivariate.data). To reload these two datasets into the R workspace,
the data parameter in the LoadData() function should be set to the
name matching that used in the SaveData() function (in this case,
saved.multivariate.data).
The LoadData() function will cause errors if you try to use
it to load a GenePix file, or most other file types output from array image
analysis packages. The load functions designed for specific file types should
be used in these instances. The main use of the LoadData() function, in
conjunction with the SaveData() function, is to get around some of the
memory problems that have been encountered in R. With the increasingly large
numbers of slides (and genes) that are being analysed, the memory requirements
can be too large to handle. Hence, some of the data may need to be analysed
then saved to a file outside R. The R workspace can then be cleaned (using the CleanWorkspace()
function, for example) and a new set of data analysed. If previously analysed
data is required again it can easily be reloaded into the R workspace.
e.g. PlotData(1,slide1.gpr,slide1.gpr,TRUE)
This function can be used to plot the data that has been loaded
using one of the load functions. There are 7 options for the plotting, where
the parameter type.of.plot can be set to a number between 1 and 7. The
types of plots that result are as follows:
1.
R versus G
2.
log(R)
versus log(G)
3.
(log(R)+log(G))/2
versus log(R/G)
4.
(log(R)+log(G))/2
versus slide normalised log(R/G)
5.
(log(R)+log(G))/2
versus pin normalised log(R/G)
6.
(log(R)+log(G))/2
versus spatially normalised log(R/G)
7.
log(R-Rbg)
versus log(G-Gbg)
8.
(log(R-Rbg)+log(G-Gbg))/2
versus log((R-Rbg)/(G-Gbg))
Note that many of the plots are not possible on averaged data.
This is because only (slide, pin or spatially) normalised data can be averaged,
and the original R, G, log(R), log(G), etc. values do not exist as averaged
values. A message will appear if you try to attempt certain unavailable plots
for averaged data and no plot will be displayed.
The interactive parameter is an optional one, for which the
default is set to FALSE. If you call PlotData() with interactive
set to TRUE, you will be able to click on points in the plot to label them with
their gene ID. To label the point, click on it with the left mouse button. The
plot will remain interactive (and you won't be able to use the command line)
until you click the right mouse button and select stop. If you happen to
choose to kill the interactive plotting function by hitting the ESC key
when the command line window is active you will succeed in stopping the
interactive nature of the plot window. However, you will also mess up the menus
that appear at the top of the R window that allow you to save the plot to a
file. You will not be able to rectify this problem until you quit out of R and
restart it. Hence, the best way to terminate the interactive plotting window is
by using the right mouse button and selecting stop.
e.g. ImageData(1,slide1.gpr,slide1.gpr)
This function can be used to display an image of the data that has
been loaded using one of the load functions. There are 7 options for the image
display, where the parameter type.of.image can be set to a number
between 1 and 7. The types of image displays that result are as follows:
1.
an R image
and a G image
2.
a log(R)
image and a log(G) image
3.
a
(log(R)+log(G))/2 image and a log(R/G) image
4.
a (log(R)+log(G))/2
image and a slide normalised log(R/G) image
5.
a
(log(R)+log(G))/2 image and a pin normalised log(R/G) image
6.
a
(log(R)+log(G))/2 image and a spatially normalised log(R/G) image
7.
a log(R-Rbg)
image and a log(G-Gbg) image
8.
a (log(R-Rbg)+log(G-Gbg))/2
image and a log((R-Rbg)/(G-Gbg)) image
Note that some of the images are not possible on averaged data, in
particular, the images that use R (or R-Rbg) or G (or G-Gbg)
data. This is because only (slide, pin or spatially) normalised data can be
averaged, and the original R, G, log(R), log(G), etc. values do not exist as
averaged values. Similarly, if gene replicates have been averaged, the x and y
locations of the spots on the slide become meaningless (because spots from
different locations have been averaged). Hence, displaying images of this data
cannot be done. A message will appear
if you try to attempt certain unavailable images and no images will be
displayed.
Note that appropriate colours for the images are used. The colours
should be red for R data, green for G data, both red and green for log(R/G)
data (and the normalised data) (for which red means higher values in the red
channel and green means higher values in the green channel), and grey
intensities for the (log(R)+log(G))/2 plots. Note that the colour is
meaningless in the images of the average level of expression - it is only the
intensity that is important. In the grey images, intensities closer to white
have higher values and intensities closer to black have lower values. Pixels
that are pure white, in any of the images, represent data that was removed. A
spot could be removed if it has a flag of less than 50 (as manually flagged as
bad in GenePix) or, for the images involving background subtraction, the spot
value was negative after the background subtraction in either of the channels.
e.g.
PlotProfile(9999,FALSE,FALSE,spat,slide1.gpr,slide2.gpr,slide3.gpr,slide4.gpr)
This function can be used to plot designated genes (determined by
the input parameter id) across arrays. Only one gene ID can be plotted
at a time, but the add parameter allows you to add as many genes as you
like to an existing plot. Note that if you choose to add genes to a plot, but
you dont have an existing (corresponding) plot already on the graphics window,
you will get some strange looking plot in the window. You can list as many as
100 datasets.
You can choose to list the datasets from the plotted data should
come, or you can list one multivariate dataset that you have already created
using the CreateMultivariateData() function. If you are plotting data
from a multivariate dataset, you should set the multivariate parameter
to TRUE.
The norm parameter determines the type of normalised data
that will be plotted. If you are listing the datasets individually in the
function call (i.e. multivariate is set to FALSE) you will get an error
if you one or more of the listed datasets have not been normalised according to
the norm parameter. This parameter is obviously redundant when you are
plotting data from a multivariate dataset (since the type of normalised data
that exists in the multivariate dataset was determined when you called the CreateMultivariateData()
function). However, the parameter must still be included when you call the PlotProfile()
function.
The first gene profile that is plotted onto a set of axes (i.e.
when the add parameter is set to FALSE) will always be in black. As you
add more gene profiles to the plot, these will be plotted in randomly selected
colours. There is no way to ensure that the colours will not be the same for
each gene that you add, because there is no memory of what was plotted onto the
graph in the previous function call.
If gene replicates exist in the dataset(s) that you are plotting,
all of the replicates will be plotted onto the graph.
If you make repeated calls to the PlotProfile() function
adding genes to the current plot, and you change the list of datasets that you
are plotting, then the graph will become messy. That is, labels will be
overwritten, or lines may appear to extend off the graph. There is no message
printed to the screen to alert you that you are making a mistake. The onus is
on you to make sure that you are consistent with the data that you are
plotting.
e.g. CleanWorkspace()
This function can be used to remove all of the files that are
currently sitting in your R workspace, except the functions for tRMA. Be
warned - this is a brutal function and will remove all of your current
results and data. It should only be used when you have a large amount of data
inside R, which may cause R to crash if it runs out of memory.
e.g. MeanData(spat,TRUE,3,meanslide,slide1.gpr,slide2.gpr,slide3.gpr)
Once all of the relevant slides have been loaded using one of the
load functions, this function can be used to average the data from those slides
(and/or possibly the gene replicates within those slides) together. The
function can compute the mean for 1 slide or as many as 20 slides and will put
the resulting values in a new dataset named with the text you have input as newdataname.
The number of slides listed in the function call must match the number input as
the third parameter in the function call. Note when quotes are needed for the
input parameters.
The averaging is performed with normalised data, either slide,
pin, spatially or housekeeping gene normalised, as given in the norm
parameter. (Note that if you have already performed housekeeping gene
normalisation using the SlideNormalise() function, you should set norm
to slide.) Normalisation does not have to have already been performed on the
input datasets - the function will check to see whether the appropriate
normalisation has already been done and will perform it automatically as
necessary. However, only the slide, pin or spatial normalisations can be done
automatically. If you would like housekeeping gene normalised data you must have
performed the normalisation prior to calling the MeanData() function.
Averaging is done with the normalised log(R/G) data (which will,
by default, have been rescaled) for each spot, using the conventional method
for computing means. The standard deviation of the values for each spot is also
computed and listed in one of the columns of the newly created dataset.
Averaging is also carried out on normalised log(R/G) data that has not been
rescaled, and from this data a set of back-transformed values are computed. The
back-transformed value for some number x (which is on the log (base 2)
scale) will be 2x. There are complexities associated with
performing the back-transformation on the averaged unscaled data, which will
not be detailed here. However, the back-transformed values will be roughly on
the same scale as the original ratio values and may be used to compare the
current results with historical results where only (un-normalised) ratio values
are reported.
If the combine-genes parameter is set to TRUE, the mean is
computed over gene replicates (within a slide) as well as over slide
replicates. If the combine-genes parameter is set to FALSE, the
individual gene replicates will only be averaged across slides. The identifier
of an individual gene replicate in this case is the gene ID and the x
and y location of the spot on the slide. Hence, the assumption is
that gene replicates are always placed in the same locations on the slide
replicates. Note that if the function is called with only one dataset,
and the combine-genes parameter is set to TRUE, the result will contain
the data from the original dataset with the gene replicates averaged.
When gene replicates are not combined (i.e. when the combine-genes
parameter is set to FALSE), the output dataset (given as newdataname)
will contain a column of gene IDs that resemble 1234/2/34. The first set of
characters before the / is the gene ID, the second set is the x location
of the gene replicate and the third set is the y location of the gene
replicate. If the gene replicates are combined, the gene IDs listed in the
output dataset are simply the original gene IDs.
The minimum flag value out of all of the values averaged for each
spot is listed in the dataset (and file) of interesting genes. Hence, it is possible
to see if an average gene selected as interesting was influenced by any
spots flagged with -50.
Recall that the spots flagged as -100 were removed when the data
were loaded. Hence, it may be possible to have genes present in one slide
replicate but not others. This can cause NAs to appear in the column of
standard deviations, even when more than one slide is being averaged (since
only one valid value may exist for a particular gene ID).
e.g.
MedianData(spat,TRUE,3,medslide,slide1.gpr,slide2.gpr,slide3.gpr)
Once all of the relevant slides have been loaded using one of the
load functions, this function can be used to compute the median of each of the
spots on those slides, across the slides and possibly across the gene
replicates within the slides. This function is very similar to the MeanData() function, except that the median is
computed instead of the mean.
The function can compute the medians for 1 slide or as many as 20
slides, and will put the resulting values in a new dataset corresponding to
what was input as newdataname. The number of slides listed in the
function call must match the number input as the third parameter in the
function call. Note when quotes are needed for the input parameters.
The median computation is performed with normalised data, either
slide, pin, spatially or housekeeping gene normalised, as given in the norm
parameter. (Note that if you have already performed housekeeping gene normalisation
using the SlideNormalise() function, you should set norm to
slide.) Normalisation does not have to have already been performed on the
input datasets - the function will check to see whether the appropriate
normalisation has already been done and will perform it automatically as
necessary. However, only the slide, pin or spatial normalisations can be done
automatically. If you would like housekeeping gene normalised data you must
have performed the normalisation prior to calling the MedianData() function.
The median is computed with the normalised log(R/G) data for each
spot, using the conventional method for computing medians. The standard
deviation of the values for each spot is also computed and listed in one of the
columns of the newly created dataset. Note that the standard deviation is not
computed in a robust manner (unlike the median), and will therefore be the same
as the standard deviations computed in the MeanData() function. As with the MeanData() function the median is also computed for the normalised
log(R/G) data that has not been rescaled, and from this data a set of
back-transformed values are computed. The back-transformed value for some
number x (which is on the log (base 2) scale) will be 2x.
There are complexities associated with performing the back-transformation on
the averaged unscaled data, which will not be detailed here. However, the
back-transformed values will be roughly on the same scale as the original ratio
values and may be used to compare the current results with historical results
where only (un-normalised) ratio values are reported.
If the combine-genes parameter is set to TRUE, the median
is computed over gene replicates (within a slide) as well as over slide
replicates. If the combine-genes parameter is set to FALSE, the
individual gene replicates will only be averaged across slides. The identifier
of an individual gene replicate in this case is the gene ID and the x
and y location of the spot on the slide. Hence, the assumption is
that gene replicates are always placed in the same locations on the slide
replicates. Note that if the function is called with only one dataset,
and the combine-genes parameter is set to TRUE, the result will contain
the data from the original dataset with the gene replicates averaged.
When gene replicates are not combined (i.e. when the combine-genes
parameter is set to FALSE), the output dataset (given as newdataname)
will contain a column of gene IDs that resemble 1234/2/34. The first set of
characters before the / is the gene ID, the second set is the x location
of the gene replicate and the third set is the y location of the gene
replicate. If the gene replicates are combined, the gene IDs listed in the
output dataset are simply the original gene IDs.
The minimum flag value out of all of the values averaged for each
spot is listed in the dataset (and file) of interesting genes. Hence, it is
possible to see if an average gene selected as interesting was influenced
by any spots flagged with -50.
Recall that the spots flagged as -100 were removed when the data
were loaded. Hence, it may be possible to have genes present in one slide
replicate but not others. This can cause NAs to appear in the column of
standard deviations, even when more than one slide is being averaged (since
only one valid value may exist for a particular gene ID).
e.g. SlideNormalise(slide1.gpr,slide1.gpr,FALSE,TRUE)
This function will add a column to the input data file (given as data),
which represents the slide normalised data. Slide normalisation is performed by
robustly[3] fitting a curve (called a lowess curve)
to the input data (or a subset of the data) as plotted on the (log(R)+log(G))/2
versus log(R/G) axes. This method of fitting a single lowess curve to the data
on these axes to centre them was first introduced by Dudoit et al. (1) and Yang et al. (7). Centring of the data is always
performed, but scaling (adjusting the variance of the data) is only done if scale
is set to TRUE. If no value for scale is given it takes the default
value of TRUE.
The centring is achieved by fitting the lowess curve and computing
the residuals between the data and the curve, thus centring the data about
zero. The scaling is achieved by dividing the centred data by a robust estimate
of the standard deviation of the data on the (log(R)+log(G))/2 versus log(R/G)
axes, called the mean absolute deviation. This gives the data a standard
deviation of 1.
Note that if the function ListDataAttributes() is now called, the slide
normalisation attribute of the dataset should now be lowess instead of
none.
The input parameter housekeeping can be set to TRUE or
FALSE, depending on whether you would like to perform the normalisation on
pre-specified housekeeping genes, or on the entire set of data. If housekeeping
is set to TRUE, the lowess curve is fit only to those genes with a * as the
last character of their gene ID. Specifying these genes should be done at some
stage prior to loading the data into tRMA. When the lowess curve has
been fitted it is extrapolated to cover the entire fluorescence range of the
input dataset and then used to centre all of the spots in the dataset. If housekeeping
is set to FALSE, the lowess curve (or curves, in the case of several spot
regions (see below)) is simply fit to all of the points in the input dataset,
regardless of their gene ID. If housekeeping is set to TRUE, the
specified housekeeping genes are highlighted in the results plots.
If you specified spot regions in your original dataset (prior to
loading it into tRMA), the lowess curve fits will be done to each of the
sets of spot regions. The identifier can be used to categorise spots for
various reasons, however we have provided this facility here primarily because
of the differing bias effect of different re-suspension fluids used for the
cDNA samples before they were spotted onto the slides. Note that if the spot
regions are specified in the original data file the labels should be numbers
only (for example, with the numbers 1 and 2 for specifying two different
re-suspension fluids). If other characters are included they may cause unusual
errors in certain functions.
This function also generates a further two columns (besides the
normalised data) in the input data file (given as data): one that is the
normalised log(R/G) data that has not been rescaled, and one that is the
unscaled normalised data that has been back-transformed. The back-transformed
value for some number x (which is on the log (base 2) scale) will be 2x.
The back-transformed values will be roughly on the same scale as the original
ratio values and may be used to compare the current results with historical
results where only (un-normalised) ratio values are reported.
A few words of warning should be issued for the situation when
housekeeping genes are being used for normalisation purposes. Housekeeping
genes are usually to be employed when the input dataset is a boutique array,
where a large proportion of the genes on the slide are highly differentially
expressed, hence rendering the standard slide normalisation as spurious. If you
are trying to normalise a boutique array, the housekeeping genes that are
used should have ideally been calculated from a pilot set of slides using the SelectHousekeepingGenes()
function. This pilot set should be slides whereby only a small proportion of
genes are highly differentially expressed (say, less than 20%), and where the
probes are similar to those used in the boutique array. The definition of the
term similar is extremely subjective. The idea is that the genes that were
selected as housekeeping genes in the pilot slides will have similar expression
patterns in the boutique arrays, and hence can be used to perform the
normalisation. Vigilance of the user is key here: if genes that were selected
as housekeeping genes appear to be highly differentially expressed in the array
you are trying to normalise, then those genes should be removed from the
housekeeping gene list (i.e. the * removed from their gene ID).
The genes that are selected in the SelectHousekeepingGenes()
function should span most of the fluorescence range of the pilot slide(s).
These genes may not span the entire fluorescence range of the boutique arrays.
Because of this, it may be advisable to use negative controls in the boutique
arrays, to boost the number and fluorescence range of the genes that are being
used as housekeeping genes. As for other types of housekeeping genes, if the
negative controls appear to be highly differentially expressed in the array you
are trying to normalise, then those genes should be removed from the
housekeeping gene list (i.e. the * removed from their gene ID).
e.g. UNDOSlideNormalise(slide1.gpr,slide1.gpr)
This function will undo slide normalisation on a dataset. If slide
normalisation has not already been performed, no change is made to other parts
of the dataset. You should not normally need to use this function.
e.g. PinNormalise(slide1.gpr,slide1.gpr,TRUE)
This function works in a very similar way to the slide
normalisation function. The only difference being that the centring is done
with not just a single curve fitted to the data, but a curve fitted to each
subset of the data corresponding to each pin. This method of fitting a lowess
curve to the data for each pin on these axes was first introduced by Dudoit et
al. (1) and Yang et al. (7). This centring technique should remove
any effects of the different pins in the differential expression, and also
remove some of the spatial effects across the slide.
Scaling in this function is done in the same way as in the slide
normalisation function. As with slide normalisation, centring of the data is
always performed, but scaling is only done if scale is set to TRUE. If
no value for scale is given it takes the default value of TRUE.
Note that if the function ListDataAttributes() is called after pin normalisation
has been done, the pin normalisation attribute of the dataset should now be
lowess instead of none.
Note that the pin normalisation can cause blocking effects in the
data. These effects can be seen if the data is viewed as an image (similar to
the original GenePix images). These effects will not appear if the SpatiallyNormalise() function is used instead.
This function also generates a further two columns in the input
data file (given as data): one that is the normalised log(R/G) data that
has not been rescaled, and one that is the unscaled normalised data that
has been back-transformed. The back-transformed value for some number x
(which is on the log (base 2) scale) will be 2x. The
back-transformed values will be roughly on the same scale as the original ratio
values and may be used to compare the current results with historical results
where only (un-normalised) ratio values are reported.
e.g. UNDOPinNormalise(slide1.gpr,slide1.gpr)
This function will undo pin normalisation on a dataset. If pin
normalisation has not already been performed, no change is made to the
dataset. You should not normally need
to use this function.
e.g. SpatiallyNormalise(slide1.gpr,slide1.gpr,TRUE)
This function will add a column to the input data file (given as data),
which represents the spatially normalised data. Spatial normalisation is
performed by firstly performing slide normalisation, but generating data that
is centred only and has not been scaled. (See the entry on the SlideNormalise() function for more details.) The
datum for each spot is then represented
as a pixel in an image. That is, the single value for each spot can be
given as a single pixel in an image. Any spatial biases in the data can then be
picked up. A smoothed version of the image is then created by using what is
known as a median filter[4], which essentially estimates the general
spatial trend over the image. This smoothed image can be subtracted from the
original data (when it is represented as an image) to give the new spatially
normalised image. Finally, the data may be rescaled, depending on whether scale
is set to TRUE (the default value), by dividing the new data by a robust
estimate of the standard deviation of the data on the (log(R)+log(G))/2 versus
log(R/G) axes, called the mean absolute deviation. This gives the final
data a standard deviation of 1.
This spatial normalisation method should remove biases in the
log(R/G) data due to the average expression level, and due to the spatial
location of the spots in the slide. This method is recommended over both the
slide and pin normalisation methods.
Note that if the function ListDataAttributes() is now called, the spatial
normalisation attribute of the dataset should now be spat instead of none.
This function also generates a further two columns in the input
data file (given as data): one that is the normalised log(R/G) data that
has not been rescaled, and one that is the unscaled normalised data that
has been back-transformed. The back-transformed value for some number x
(which is on the log (base 2) scale) will be 2x. The
back-transformed values will be roughly on the same scale as the original ratio
values and may be used to compare the current results with historical results
where only (un-normalised) ratio values are reported.
The details of this algorithm are discussed in a paper that will
be in press shortly (6).
e.g. UNDOSpatiallyNormalise(slide1.gpr,slide1.gpr)
This function will undo spatial normalisation on a dataset. If
spatial normalisation has not already been performed, no change is made to the
dataset. You should not normally need
to use this function.
e.g.
SelectHousekeepingGenes(spat,4,slide1.gpr,slide2.gpr,slide3.gpr,slide4.gpr)
This function will select a set of genes as housekeeping genes
from a set of experiments where there are a large number of genes on each slide
(say more than 3000), where only a maximum of approximately 10% of the genes on
each slide are differentially expressed. The idea is to find genes that have
low differential expression after normalisation, and whose variability across
the slides is small. These genes could then hopefully be used as housekeeping
genes on boutique arrays, where there are a much smaller number of genes on the
slides (say, 500) and a large percentage of genes will be differentially
expressed. Because of this large percentage, the normalisation functions that
are usually used (eg SpatiallyNormalise() ) will not work, and there is
a need for the specially selected housekeeping genes. It is assumed that the
RNA samples used in the boutique arrays are from similar conditions as those
from the slides used to select the housekeeping genes. This means that ideally
the low differentially expressed, low variability genes selected in the
original slides maintain those properties in the boutique arrays and can
therefore be used for normalisation. The selection of housekeeping genes is
only made from the set of genes that have a flag (as determined from a software
package such as GenePix) greater than 50 in all of the input data sets.
The first parameter to the SelectHousekeepingGenes()
function is a normalisation parameter, norm, which can be set to
slide, pin, or spat. This determines which normalisation is used before
selecting the housekeeping genes. The number parameter specifies the
number of slides that will be used to determine the housekeeping genes. This
number must match the number of data sets specified (in quotes) as the rest of
the function parameters. Note that the more slides that are used to select the
potential housekeeping genes, the better the genes are likely to be for
normalisation genes in future boutique arrays.
The method works by firstly normalising all of the input slides
according to the normalisation method specified by the norm parameter.
This may have already been done, in which case it is not repeated. We know that
the bias in the differential expression of a spot is dependent on the average
fluorescence of that spot. Hence, we need to select future housekeeping genes
across the entire range of fluorescence in order to accurately estimate the
bias in future boutique arrays. To achieve this, the average fluorescence
((log(R)+log(G))/2) range is divided into 20 segments, ranging from the lowest
fluorescing spots to the highest fluorescing spots. We will try and select a
maximum of 3 housekeeping genes from each segment, to span the range.
The genes are selected according to the following criteria. Look
at only the genes within a single segment at a time. On the normalised log(R/G)
values for these genes compute the mean and standard deviation for each gene
across the input slides. Then, for each gene compute a confidence interval for
the mean. That is, compute meanħ(1.96*stddev)/number.of.slides for each gene. For the genes in that
segment, select only those genes for which the confidence interval encompasses
zero (i.e. the lower confidence limit is below zero, and the upper confidence
limit is above zero), and then choose the 3 genes with the smallest confidence
interval (i.e. the smallest variance). Hence, were trying to select 3
housekeeping genes for each fluorescence segment that, after normalisation,
have low differential expression and the lowest variance across the input
slides. Note that in some segments there may be less than 3 genes selected.
This may be because there are simply less than 3 genes in that segment, or that
less than 3 genes fit into the criteria of having a confidence interval that
encompasses zero. Sometimes there may be no genes that fit the selection
criteria (including the restriction that the flag on all spots of each gene
must be greater than 50), particularly for segments at very low or very high
fluorescence levels.
The results of the selection are output to a file in the R
working directory called housekeeping.genes.*.txt, where the * is replaced by
a list of the input slide names. The results are also saved to a dataset in the
R workspace called housekeeping.genes (which will be overwritten each
time you run the SelectHousekeepingGenes() function). This dataset can
be viewed using the DisplayData() function.
e.g.
FindDiffExpGenes(spat,medslides,medslides,TRUE,3)
The results of this function are saved to two files called
interesting.genes.*.txt (where the * corresponds to your input data name) and
all.genes.*.txt. In the interesting.genes file only those genes that were
found to have high enough differential expression are listed. In the
all.genes file all genes are listed. In both files there is a column entitled
Significant which has a YES if a gene was found to be highly differentially
expressed and a NO if the differential expression was not high enough. (In the
interesting.genes files all of the entries in the Significant column are
YES.) The files are saved into your R working directory. The files will have a
date stamp and the name of the dataset from which the results were derived.
In the files there are columns for gene names, IDs, normalised
(centred, and possibly scaled) log(R/G) values, and the standard deviations of
the log(R/G) values, where the log(R/G) values may be the values resulting from
an averaging of datasets. If this is the case, the standard deviations column
will list the standard deviations of the spot values across the averaged slides
and possibly over averaged gene replicates. If the data are from a single slide
then the standard deviation column could be NAs (if there are no gene
replicates averaged), or the standard deviations over gene replicates with the
single slide. Normalisation of the data must be done prior to selecting
differentially expressed genes. There is also a column of back-transformed data
values, as they were listed in the input file.[5]
In each of the output files there is also a column of flags,
indicating the flag for each particular spot when it was extracted from the
original image file (using some software such as GenePix). The next column
shows the spot region for each spot. This data either came from the spot region
list added to the original data file by the user (possibly labelling for
different re-suspension fluids), or one that was automatically generated when
the data file was originally loaded. If the latter is the case then all of the
spots will be labelled as coming from the same region. The final column is the
Significant column.
The combine-genes parameter should be set to TRUE or FALSE,
depending on whether or not you would like the gene replicates within a slide
to be averaged. By averaged we imply that the median of the gene replicates
for each individual gene ID on the slide will be computed. If you set combine-genes
to TRUE and the gene replicates have already been combined, no further
averaging will be done. Also, if the dataset given as a parameter for this
function is an averaged dataset (i.e. an output from MedianData() or
MeanData()) the gene
replicates will not be combined, regardless of what value the combine-genes
parameter takes. This is because, in order to combine the gene replicates, the
information on a spot's location in the slide is required and this data is lost
during the MedianData() and MeanData() functions.
The last parameter, cut-off, is an optional manual
over-ride parameter. If a suitable value is given in the parameter list only
those genes with a (normalised) log(R/G) absolute value greater than the
cut-off will be listed in the results file. The value of cut-off is
considered suitable if at least one gene has a (normalised) log(R/G) absolute
value greater than it. If it is too large an error will occur. If no value for cut-off
is given when the function is called, the automatic method of determining the
interesting genes, as given below, will be used.
The gene selection method works by comparing the log(R/G) data
(after some form of normalisation) with a normal distribution (with mean 0 and
standard deviation 1). Note that, because the data distribution is compared to
a distribution with a standard deviation of 1, it is important that scaling is
included in the normalisation of the data. The outliers, or highly
differentially expressed genes, will make the tails of the distribution of the
log(R/G) data appear heavier than the normal distribution. The outliers are
removed one by one, starting with the outlier with the largest absolute p-value
(i.e. the outlier furthest from the mean, which is zero). As each data value is
removed, the data are again compared to the (standard) normal distribution. At
some point the distribution of the subset of log(R/G) data will have tails that
are not heavy enough when compared to the normal distribution. When this occurs
the algorithm terminates, suggesting that all of the data values that have been
removed so far can be considered as outliers, and hence represent highly
differentially expressed genes.
Recall that the spots flagged as -100 were removed when the data
were loaded. If the input data for FindDiffExpGenes() is an averaged dataset (from using either MeanData() or MedianData()) it may be possible to have genes in
the gene list that were only present in one of the slide replicates. If you are
searching for interesting genes in this averaged data you may find that some
NAs appear in the column of standard deviations. This can occur even when the
data is the average of more than one slide, because only one valid value may
exist for a particular gene ID and standard deviations will not be computed
over that one value.
e.g.
CompareInterestingGenes(spat,3,100,FALSE,slide1.gpr,slide2.gpr,slide3.gpr)
This function enables you to compare the results for several datasets
from the FindDiffExpGenes() function. Up to 20 sets of results can be
compared. Note that the number.to.compare parameter must match the
number of datasets listed in the function call.
The function FindDiffExpGenes()
must have been called prior to calling CompareInterestingGenes(). You will receive an error if you
try to run CompareInterestingGenes()
without having run the FindDiffExpGenes()
function. Similarly, if you have reloaded any data since running FindDiffExpGenes(), running CompareInterestingGenes()on the reloaded data will not execute
and you will be asked to re-run the appropriate functions.
If you have run the FindDiffExpGenes() function and not combined the gene replicates for data
being input into CompareInterestingGenes(),
an error will be reported and you will be asked to re-run the FindDiffExpGenes() function. This is because we would
like the comparisons between lists of interesting genes from different slides
to be consistent, and not have some lists which may contain gene replicates and
other lists which may have only one instance of a gene ID, which represents an
averaged value over gene replicates.
The results from the comparisons show the spots that appear in the
specified percent of the lists of interesting genes. (Note that the
percentage given as a parameter when the function is called will be rounded, so
that the actual number compared will vary depending on the number of datasets
listed.) The percent parameter must be between 0 and 100.
The keep-inconsistent-genes parameter can be set to TRUE or
FALSE. If it is set to FALSE, if a gene appears as interesting in the
required number of slides, but the sign of the normalised log ratio changes
across the slides, that gene will not be listed in the results file. This might
be used if you assumed that the differential gene expression is unstable (with
high biological variation) and is not of current interest. If the keep-inconsistent-genes
parameter is set to TRUE the genes that are interesting in a sufficient
number of slides but flip signs across those slides will appear in the results
file.
The results are saved to two files called
interesting.genes.compared.*.txt and all.genes.compared.*.txt in the R
working directory, where the names of the datasets included in the comparisons
are inserted instead of the *. The interesting.genes file contains only those
genes found to be interesting in a large enough number of slides, whereas the
all.genes slide contains all of the genes that appear in the slides. The files
will have a date stamp, a list of all of the files used in the comparisons, and
the number of slides in which a gene must be interesting before it will be
selected in the CompareInterestingGenes()
function. The files will also contain columns for the spot IDs, spot names,
median average log values, median log ratio values, standard deviations, and
the number of slides in which each gene ID was estimated to be interesting.
In the interesting.genes.compared.*.txt file the median values and standard deviations
are computed using only those spots deemed interesting by the FindDiffExpGenes() function. That is, if 4
slides are to be compared, and the percent is set as 75, then the medians and
standard deviations will be computed over 3 or 4 slides (and the gene
replicates within those slides), depending on how many times each particular
gene ID was estimated to be differentially expressed. In the
all.genes.compared.*.txt file the median values and standard deviations are
computed over all of the slides for which there is a useable spot value for the
relevant gene ID. This implies that the medians and standard deviations
recorded in the two files may not be exactly the same. There is also a column
of flags, indicating the minimum flag for each selected gene (as stated in the
output file from a software such as GenePix) across the interesting.genes.*
files being compared. The final column indicates a YES if the gene ID appeared
as interesting in enough of the slides and a NO otherwise. Note that all of the
entries in this column of the interesting.genes.compared.* file will be YES.
The output file will also contain a column of back-transformed
data values. These values are computed by firstly calculating the median over
unscaled normalised log(R/G) values for only those spots for each gene that are
deemed interesting by the FindDiffExpGenes() function. These median values are then
back-transformed: for some number x (which is on the log (base 2)
scale) its back-transformed value will be 2x. There are
complexities associated with performing the back-transformation on the averaged
unscaled data, which will not be detailed here. However, the back-transformed
values will be roughly on the same scale as the original ratio values and may
be used to compare the current results with historical results where only
(un-normalised) ratio values are reported.
The results are also stored as the dataset within R, called
interesting.genes.compared and all.genes.compared, which can be displayed
using the DisplayData()
function.
e.g.
SaveInterestingGenes(spat,3,slide1.gpr,slide2.gpr,slide3.gpr)
This function creates an output file of the results for several
datasets from the FindDiffExpGenes() function. If a gene has been found to be
interesting in any of the input datasets it will appear in the output file.
Up to 20 sets of results can be compared. Note that the number.to.compare parameter
must match the number of datasets listed in the function call.
The function FindDiffExpGenes()
must have been called prior to calling SaveInterestingGenes(). You will receive an error if you
try to run SaveInterestingGenes() without
having run the FindDiffExpGenes()
function. Similarly, if you have reloaded any data since running the FindDiffExpGenes(), running SaveInterestingGenes() on the reloaded data will not
execute and you will be asked to re-run the appropriate functions.
If you have run the FindDiffExpGenes() function and not combined the gene replicates for data
being input into SaveInterestingGenes(),
an error will be reported and you will be asked to re-run the FindDiffExpGenes() function.
This is because we would like the comparisons between lists of interesting
genes from different slides to be consistent, and not have some lists which may
contain gene replicates and other lists which may have only one instance of a
gene ID, which represents an averaged value over gene replicates.
The results from the comparisons show the spots that appear in at
least one of the lists of interesting genes. The results are saved to a file
called minimal.interesting.genes.*.txt in the R working directory, where the
names of the datasets included in the comparisons are inserted where the * is.
The file will contain a date stamp and a list of all of the files used in the
comparisons. The file will also contain columns for the spot Ids and a column
for each of the datasets being compared. The data in these latter columns are
the normalised log(R/G) data for each respective gene ID.
The results are also stored as a dataset within R, called
minimal.interesting.genes.compared, which can be displayed using the DisplayData() function.
The results are displayed in the R plot window as a plot of the median data (over
all of the datasets being compared), with the gene IDs that are interesting
in at least one of the slides being plotted in a different colour.
e.g. CreateMultivariateData(spat,4,combined.slides,slide1.gpr,slide2.gpr,slide3.gpr)
This function combines the normalised data columns from the input
files into one dataset or appends the relevant columns to the specified
existing multivariate dataset. It also combines the average gene expression
columns ((log(R)+log(G))/2) from the input files into another dataset (or
appends then to an already existing dataset). These two datasets are stored
within R and are called newdataname.normdiff and newdataname.avg
respectively. Each of the new/appended datasets also contains a column of gene
names and a column of gene IDs. Currently only the data in
newdataname.normdiff is used in other multivariate analyses in tRMA.
However, the newdataname.avg dataset is created here for potential use in the
future functionality of tRMA. If you have previously created a
multivariate dataset of the same name you will be asked if you would like to
overwrite it or append to it the data from your input slides. The datasets that
are input into the CreateMultivariateData() function can be either
original data files (such as a *.gpr or *.dat file), after normalisation, or
the output from one of the data averaging functions such as MedianData()
or MeanData().
The norm parameter must be either slide, pin, or spat
depending on the type of normalisation that is desired. The normalisation must
have been done prior to calling the CreateMultivariateData() function.
If the requested normalisation has not been previously performed an error will
result and you will be asked to run the desired normalisation before re-running
the CreateMultivariateData() function.
The number parameter is the number of datasets that you
wish to be combined. It must match the number of dataset names you list when
you call the CreateMultivariateData() function. The maximum number of
datasets that can be combined is 100. However, this is a constraint we have put
into the CreateMultivariateData() function. It is possible that R
(or your PC) will not have the memory capacity to cope with this many datasets,
and analyses may have to be done separately on subsets of the input files.
If you try to create, overwrite, or append to a multivariate
dataset whereby there are slides of the same name trying to be listed, you will
get an error. All of the slide columns that are put into the resulting
multivariate dataset must have unique names.
Slides with gene replicates cannot be put into a multivariate
dataset. This is because the multivariate dataset will be used in functions
that require unique identifiers for the gene IDs. There are two ways around
this problem. Both involve either the MedianData() function or the MeanData()
function. The gene replicates can be combined using either of these two
averaging functions, or either of these two functions can be run on the
individual slide, but where the combine-genes parameter is set to FALSE
(see the help information on the MedianData() and MeanData()
functions). This would result in a new dataset of the same number of rows as
the original slide. However, all of the gene IDs would be amended to include
the gene ID and the x and y locations of each spot on the slide.
In this way, the new dataset could be combined into a multivariate dataset and
clustering could be performed on the individual spots (with the gene replicates
remaining labelled as those individual spots).
If you try to combine columns from two or more slides where there
are the same gene IDs included in both the slides, but the corresponding gene
names in each of the slides are different, then an error message will be
displayed and the CreateMultivariateData() function terminated. When
combined data from different slides, the gene-ID/gene-name pairs in each slide must match exactly.
The CreateMultivariateData() function includes the facility
to be able to add more columns to an already existing multivariate dataset
because of memory problem that have been encountered in R. As the number of
slides in your experiment increase, so too do the memory requirements of R for
your analyses. What may be necessary is to load and perform the normalisation
on some of your datasets, add them to the multivariate dataset and then save
that dataset outside R (using the SaveData() function). The R workspace
can be cleaned (using the CleanWorkspace() function) and a new set of
slides loaded an normalised. The multivariate dataset can then be reloaded in
the R workspace (using the LoadData() function) and the new slide data
appended to the multivariate dataset. These steps could be repeated as many
times as necessary to get all of the required data into a multivariate dataset.
This is a bit of a fiddle, but may be necessary under the current memory
restrictions on R.
e.g.
ComputeKMeans(combined.slides,genes,5,compared,median)
This function uses the K-means algorithm (3) to compute clusters in the input
multivariate dataset. The input dataset must be the name of one that was
created using the CreateMultivariateData() function. The genes.or.slides
parameter must be set to genes or slides, depending on what you would like
to cluster on. For example, if you are looking to group genes with similar
patterns toegther, then you would set genes.or.slides to genes. You
must specify the desired number of clusters with the number.of.clusters
parameter. If the genes.or.slides parameter is set to genes, then the number.of.clusters
must be set to a number between 1 and the total subset of genes used for
clustering (exclusively). Alternatively, if the genes.or.slides
parameter is set to slides, then the number.of.clusters must be set to
a number between 1 and the total number of slides (exclusively). That is, the
algorithm will not run if you attempt to put all the samples into one large
cluster, or if you attempt to put each sample into an individual cluster.
The fourth parameter, number.of.genes, determines the
subset of the input data that is clustered upon. The parameter can be set to
either a number between 1 and the total number of genes in the input data set
(exclusively), the text compared, or the text minimal. This is similar to
the number.of.genes parameter in ComputeHierarchicalClustering().
If number.of.genes is set to minimal, the genes used in the clustering
are those that were selected in the SaveInterestingGenes() function.
That is, these are genes that appeared as interesting in at least one of the
slides being analysed. If the SaveInterestingGenes() function has not
been run, or if it has been run most recently on a different set of slides from
those input into the CreateMultivariateData() function, ComputeHierarchicalClustering() will
not run and an error will be returned. If number.of.genes is set to
compared, the genes used in the clustering are those that were selected in
the CompareInterestingGenes() function. That is, these are genes that
appeared as interesting in a specified percentage of the slides being
analysed. If the CompareInterestingGenes() function has not been run, or
if it has been run most recently on a different set of slides from those input
into the CreateMultivariateData() function, ComputeHierarchicalClustering() will not run and an error
will be returned. If number.of.genes is set to a number, then the median
of each of the (absolute normalised log(R/G)) gene values across the slides is
computed. The genes are sorted according to these median values, and the top number.of.genes
are used for the clustering. The selection of the gene subset according to this
parameter is performed regardless of whether you have chosen to cluster on
genes or on slides. There is no parameter to select a subset of the slides,
as this would already have been done when the multivariate dataset was created
using the CreateMultivariateData() function.
To illustrate the results, there are three displays produced, a
silhouette plot of the results, a plot of means (centroids) for each of the
clusters, and an image with the genes/slides (depending on what you entered for
the genes.or.slides parameter) ordered into their respective clusters.
The silhouette plot is a bar plot of the estimates of the degree of
clusteredness of a point within its cluster in relation to the other
clusters. Observations with a large silhouette value (almost 1) are very well
clustered, observations with a small silhouette value (around 0) means that the
observation lies between clusters, and observations with a negative silhouette
value have probably been placed in the wrong cluster. When a gene/slide has
been clustered into a group with only one element, the silhouette value cannot
be computed and will appear as an NaN (not a number) in the output file.
However, in order to provide suitable values for the silhouette plot, these
genes/slides will appear in the plot with a value of 1 and the bar representing
them will be coloured black. Note that this situation will generally only occur
when the genes.or.slides parameter is set to slides. If this does
occur when the genes.or.slides parameter is set to genes, the entry
for these genes in the silhouette plot will be difficult to tell apart from the
other genes because of the density of the plot. Note also that in very dense
silhouette plots (where there are many values to be plotted) some of the
vertical lines may appear to be coloured black, when in fact they are not. This
is simply due to the restriction of the resolution of your computer screen. If
you are in any doubt as to the silhouette value of an observation, you should
check the output file.
The final parameter, ordering, is used to order the genes
within each cluster in the image that is used to display the results. There are
two options: typing the text median (in quotes) or typing the text for one of
the slide names in quotes (e.g. slide1.gpr). An error will be produced if the
text you enter does not match any of these options. Either option can be used
when you are clustering on genes, but only the median option can be used
when you are clustering on slides. If you enter the median option, the
group of genes/slides (depending on what you entered as the genes.or.slides
parameter) in each cluster will be ordered according to increasing median
(absolute normalised log(R/G)) values. If you enter one of the slide names, the
group of genes in each cluster (since can only cluster on genes with this
option) will be ordered according to increasing (absolute normalised log(R/G))
values for the specified slide.
The results from the clustering algorithm are output to a file in
the R working directory called kmeans.*.txt, where the * is replaced
by the input data name. The file contains columns for gene names, and gene IDs,
and a column for the normalised log ratio data for each of the slides involved
in the analysis. If the genes.or.slides parameter was set to genes,
then the last column of the data lists the cluster to which each gene was
assigned, and the cluster means are listed in the last rows of the file. If the
genes.or.slides parameter was set to slides, then the last row of the
data lists the cluster to which each slide was assigned, and the cluster means
are listed in the last columns of the file. If the genes.or.slides
parameter was set to genes, then the order of the genes in the output file
will be done according to what you specified for the ordering parameter,
and the slides will be the same as in the file from which the data was taken
(such as the interesting.genes.compared file). If the genes.or.slides
parameter was set to slides, then the order of the slides in the output file
will be done according to what you specified for the ordering parameter,
and the genes will be the same as in the file from which the data was taken.
The K-means algorithm does not work if there are any missing
values in the dataset. If the genes.or.slides parameter is set to
genes the ComputeKMeans() algorithm removes all genes that have at
least one missing value across the slides. If the genes.or.slides
parameter is set to slides the ComputeKMeans() algorithm removes all
slides that have at least one missing value across the genes. If it is the
case, for example, that every slide has at least one missing value, then all of
the slides will be removed from the input dataset. Hence the function would not
be able to run. If this occurs you are given the option of removing all of the
genes that contain at least one missing value. If you choose this option you
would be removing data that could possibly be useful in the clustering
algorithm. However, when the alternative is not being able to run the
clustering algorithm at all, clustering on the reduced dataset can at least
give you some results about your data.
The K-means algorithm will return an error if, at any point in the
algorithm iterations, it reaches a configuration where there are empty
clusters. The ComputeKMeans() function will terminate and no output file
will be generated. This problem may possibly be resolved by re-running the
algorithm and trying to cluster the data into a smaller number of groups.
Note that the K-means algorithm is an iterative algorithm in which
the results are initially randomly selected and then optimised over the
algorithm iterations. Because of this random initialisation, calling the ComputeKMeans()
function repeatedly with the same input parameters may result in varying
results over the separate function calls.
Occasionally the ComputeKMeans() algorithm will terminate
because of a lack of memory in the R workspace. If this happens your first step
should be to exit from R, saving your workspace, and then start R up again. You
will need to run the SourceFunctions() function before resuming to the ComputeKMeans()
function. If this does not work, we would suggest that you manually remove some
of the objects stored in R workspace. This can be done by firstly using the ListData()
function to find out what objects exist in the workspace, and then running the rm()
function (which is a function that exists in R), with the text rm(objectname),
where objectname is the name of the object you wish to remove. It is
advisable that you do not run the CleanWorkspace() function in this
instance, because it will remove all your data from the R workspace, including
the multivariate dataset that you are trying to cluster on.
The basic idea of the K-means algorithm is to begin
by estimating number.of.clusters initial cluster means. The genes/slides
are then assigned to the cluster with the nearest mean, and the values of the
means are updated according to the current elements in the cluster. The
assignment and updating steps are iterated until the means only shift by some
minimal amount between iterations.
e.g. ComputeSOMs(combined.slides,4,5,compared,median)
This function uses the self-organising maps algorithm (5) to compute clusters in the input
multivariate dataset. The input dataset must be the name of one that was
created using the CreateMultivariateData() function. The clustering is
currently only done on the genes. The algorithm clusters the data into groups
positioned on a hypothetical two-dimensional grid. The grid has length number.of.xclusters
on the horizontal axis and length number.of.yclusters on the vertical
axis. The total number of clusters (number.of.xclusters * number.of.yclusters)
must be between 1 and the total number of genes (exclusively). That is, the
algorithm will not run if you attempt to put all the genes into one large
cluster, or if you attempt to put each gene into an individual cluster. If one
of either number.of.xclusters or number.of.yclusters is given the
value 1, then the clustering results may not be very satisfactory because of
the particular structure of the self-organising maps algorithm.
The fourth parameter, number.of.genes, determines the
subset of the input data that is clustered upon. The parameter can be set to
either a number between 1 and the total number of genes in the input data set
(exclusively), the text compared, or the text minimal. This is similar to
the number.of.genes parameter in ComputeHierarchicalClustering().
If number.of.genes is set to minimal, the genes used in the clustering
are those that were selected in the SaveInterestingGenes() function.
That is, these are genes that appeared as interesting in at least one of the
slides being analysed. If the SaveInterestingGenes() function has not
been run, or if it has been run most recently on a different set of slides from
those input into the CreateMultivariateData() function, ComputeHierarchicalClustering()
will not run and an error will be returned. If number.of.genes is set to
compared, the genes used in the clustering are those that were selected in
the CompareInterestingGenes() function. That is, these are genes that
appeared as interesting in a specified percentage of the slides being
analysed. If the CompareInterestingGenes() function has not been run, or
if it has been run most recently on a different set of slides from those input
into the CreateMultivariateData() function, ComputeHierarchicalClustering()
will not run and an error will be returned. If number.of.genes is set to
a number, then the median of each of the (absolute normalised log(R/G)) gene
values across the slides is computed. The genes are sorted according to these
median values, and the top number.of.genes are used for the clustering.
The selection of the gene subset according to this parameter is performed regardless
of whether you have chosen to cluster on genes or on slides. There is no
parameter to select a subset of the slides, as this would already have been
done when the multivariate dataset was created using the CreateMultivariateData()
function.
To illustrate the results, there are three displays produced, a
silhouette plot of the results, a plot of means (centroids) for each of the
clusters (arranged in their hypothetical two-dimensional grid), and an image
with the genes/slides (depending on what you entered for the genes.or.slides
parameter) ordered into their respective clusters. On the plot of the cluster
means, there will only be means plotted for clusters that have data in them.
The cluster number and the number of elements of each cluster are given at the
top of each plot in the two-dimensional hypothetical grid.
The silhouette plot is a bar plot of the estimates of the degree
of clusteredness of a point within its cluster in relation to the other
clusters. Observations with a large silhouette value (almost 1) are very well
clustered, observations with a small silhouette value (around 0) means that the
observation lies between clusters, and observations with a negative silhouette
value have probably been placed in the wrong cluster. When a gene/slide has
been clustered into a group with only one element, the silhouette value cannot
be computed and will appear as an NaN (not a number) in the output file.
However, in order to provide suitable values for the silhouette plot, these
genes/slides will appear in the plot with a value of 1 and the bar representing
them will be coloured black. If this occurs the entry for these genes in the
silhouette plot will be difficult to tell apart from the other genes because of
the density of the plot. Note also that in very dense silhouette plots (where
there are many values to be plotted) some of the vertical lines may appear to
be coloured black, when in fact they are not. This is simply due to the
restriction of the resolution of your computer screen. If you are in any doubt
as to the silhouette value of an observation, you should check the output file.
The final parameter, ordering, is used to order the genes
within each cluster in the image that is used to display the results. There are
two options: typing the text median (in quotes) or typing the text for one of
the slide names in quotes (e.g. slide1.gpr). An error will be produced if the
text you enter does not match any of these options. If you enter the median
option, the group of genes in each cluster will be ordered according to
increasing median (absolute normalised log(R/G)) values. If you enter one of
the slide names, the group of genes in each cluster will be ordered according
to increasing (absolute normalised log(R/G)) values for the specified slide.
The results from the clustering algorithm are output to a file in
the R working directory called soms.*.txt, where the * is replaced by
the input data name. The file contains columns for gene names, and gene IDs,
and a column for the normalised log ratio data for each of the slides involved
in the analysis. The last column of the data lists the cluster to which each
gene was assigned, and the cluster means are listed in the last rows of the
file. The order of the genes in the output file will be done according to what
you specified for the ordering parameter, and the slides will be the
same as in the file from which the data was taken (such as the
interesting.genes.compared file).
The self-organising maps algorithm does not work if there are any
missing values in the dataset. To cope with this the algorithm removes all
genes that have at least one missing value across the slides.
The self-organising map algorithm is very similar to the basic K-means algorithm: initialise means for each of the clusters, assign genes to the closest cluster, update the cluster means, and iterate until the means are in stable locations. The self-organising maps algorithm is a variant on the K-means algorithm in that the cluster means are adjusted according to a weighting scheme after the update step. The cluster mean adjustment step is included at every iteration.
Each cluster label is mapped to a location in a
hypothetical two-dimensional grid. At each iteration of the algorithm, a
cluster mean is adjusted to be a weighted average of the neighbouring means in
hypothetical grid. The weights of the neighbouring means are dependent on the
number of elements in the respective cluster and the distance between the
clusters in grid. A cooling schedule is employed so that the apparent
distance between the centroids in the grid increases to infinity, so that the
eventual influence of neighbouring clusters tends to zero as the algorithm
iterates.
The clustering results can be displayed on a plot of
the cluster means in the hypothetical two-dimensional grid. This plot is shown
as the second display in the graphics window. The plot is important because the
algorithm is constructed so that elements in neighbouring clusters (in the
grid) should be similar. For example, a data set containing two main classes of
genes may be clustered into 20 clusters using self-organising maps. Although
the clustering results may suggest that the genes fall into many more than two
groups, it may be found that the clusters occur in two regions in the grid.
For example, there may be 5 clusters containing genes in the top left of the
grid and 3 clusters containing genes in the bottom right of the grid. This
grouping pattern could only be discovered by plotting the clustering results on
the grid. Note that such a grouping is invariant to a swapping of the
horizontal and vertical axes of the grid.
e.g. ComputeHierarchicalClustering(combined.slides,5,minimal,median)
This function uses one of the hierarchical clustering routines
provided in the R package, cluster, to perform agglomerative,
hierarchical clustering on the genes of a multi-slide dataset. Based on the
results of the hierarchical clustering, groups of genes with similar profiles
across the slides are determined. The number of groups that are generated is
given by the input parameter, number.of.clusters. The input dataset upon
which the clustering is performed is given in the parameter data. This dataset
must have been constructed using the CreateMultivariateData() function.
Note that there is currently only functionality to cluster on genes. Clustering
on slides in this function has not yet been implemented.
The third parameter, number.of.genes, can take one of three
different types: the text minimal, the text compared, or a number between 1
and the total number of genes on the slide. However, choosing a large number of
genes (say, over 1000), will possibly crash the function because R may run out
of memory. If number.of.genes is set to minimal, the genes used in the
clustering are those that were selected in the SaveInterestingGenes()
function. That is, these are genes that appeared as interesting in at least
one of the slides being analysed. If the SaveInterestingGenes() function
has not been run, or if it has been run most recently on a different set of
slides from those input into the CreateMultivariateData() function, ComputeHierarchicalClustering() will
not run and an error will be returned. Note that if the number of genes
selected in the SaveInterestingGenes() function is too large, again the
function may crash because of memory problems. If number.of.genes is set
to compared, the genes used in the clustering are those that were selected in
the CompareInterestingGenes() function. That is, these are genes that
appeared as interesting in a specified percentage of the slides being
analysed. If the CompareInterestingGenes() function has not been run, or
if it has been run most recently on a different set of slides from those input
into the CreateMultivariateData() function, ComputeHierarchicalClustering() will not run and an error
will be returned. If number.of.genes is set to a number, then the median
of each of the (absolute normalised log(R/G)) gene values across the slides is
computed. The genes are sorted according to these median values, and the top number.of.genes
are used for the clustering.
The results are output into a file called hier.clust.*.txt,
where * is replaced by the text input as the data parameter. Listed in
the results file are the gene IDs and names for the genes that were used for
the clustering, their normalised values for each of the slides involved in the
analysis, and a cluster label indicating which group each gene was assigned to.
At the bottom of the file the means (centroids) for each cluster is given. That
is, the mean (a set of numbers, where there is a number for each slide) for
each cluster is computed across all of the genes assigned to that cluster.
To illustrate the results there are four displays produced, a
dendogram of the hierarchical clustering outcome, a silhouette plot of the
results, a plot of means (centroids) for each of the clusters, and an image
with the genes ordered into their respective clusters. If there are many genes
used in the analysis the dendogram may be fairly unreadable. The silhouette
plot is a bar plot of the estimates of the degree of clusteredness of a point
within its cluster in relation to the other clusters. Observations with a large
silhouette value (almost 1) are very well clustered, observations with a small
silhouette value (around 0) means that the observation lies between clusters,
and observations with a negative silhouette value have probably been placed in
the wrong cluster. When a gene has been clustered into a group with only one
element, the silhouette value cannot be computed and will appear as an NaN
(not a number) in the output file. However, in order to provide suitable
values for the silhouette plot, these genes will appear in the plot with a
value of 1 and the bar representing them will be coloured black. If this occurs
the entry for these genes in the silhouette plot will be difficult to tell
apart from the other genes because of the density of the plot. Note also that
in very dense silhouette plots (where there are many values to be plotted) some
of the vertical lines may appear to be coloured black, when in fact they are
not. This is simply due to the restriction of the resolution of your computer
screen. If you are in any doubt as to the silhouette value of an observation,
you should check the output file.
The final parameter, ordering, is used to order the genes
within each cluster in the image that is used to display the results. There are
three options: leaving the parameter out (i.e. not putting anything), typing
the text median (in quotes) or typing the text for one of the slide names in
quotes (e.g. slide1.gpr). An error will be produced if the text you enter
does not match any of these options. If you leave the parameter blank, the
ordering of the genes in the image will be the same as the ordering of the
genes in the dendogram. If you enter the median option, the group of genes in
each cluster will be ordered according to increasing median (absolute
normalised log(R/G)) values. If you enter one of the slide names, the group of
genes in each cluster will be ordered according to increasing (absolute
normalised log(R/G)) values for the specified slide.
e.g. ComputeBiplot(combined.slides,genes,5,minimal)
This function produces a display known as a biplot (2) which can be used to examine your data.
The biplot is a method for displaying the results of a principal components
analysis. This analysis is a transformation of the data so that, instead of
slides (if genes.or.slides is set to genes), you have a set of new
samples in which the maximal variance across the genes (within a slide/new
sample) is captured in the first new sample, the next largest amount of variance
across the genes is captured in the second new sample, etc. There are as many
new samples as there were original slides (again, if genes.or.slides
is set to genes) in the multivariate data set. However, only the data points
in the first two new samples are displayed in the biplot.
If genes.or.slides is set to slides the analysis is a
transformation of the data so that, instead of genes you have a set of new
genes in which the maximal variance across the slides is captures in the first
new gene, the next largest amount of variance across the slides is captured
in the second new gene, etc. There are as many new genes as there were
original genes in the multivariate data set. However, only the data points in
the first two new genes are displayed in the biplot.
The biplot is a plot both of the results of the transformation and
of how the transformation itself was done. The horizontal axis at the bottom of
the biplot represents the first new sample (if genes.or.slides was set
to genes, or new gene if genes.or.slides was set to slides), and
the vertical axis at the left of the biplot represents the second new sample
(or new gene). On these axes the transformed genes (or slides) are plotted
for these two new samples (or new genes), or principal components. There
are two biplots displayed using the ComputeBiplot() function. The left
plot has the genes (slides) labelled with their gene IDs (slide names). For the
right hand plot, the genes (slides) (before the principal components
transformation) have been clustered according the clustering method described
in the ComputeKMeans() function. The clustering labels assigned to each
gene (slide) are then used as the labels on the biplot. Note that, because the
biplot represents the genes (slides) in a way that maximises the variance
across them, the gene (slide) clusters can sometimes be seen quite clearly,
particularly when there are only a small number of genes (slides) and a small
number of clusters. Note also that the two biplots have exactly the same points
plotted on them. However, they can sometimes look different (eg more dense
points on the left plot) because the labels of the points on the left plot are
larger than those on the right plot.
The other feature of the biplot is the vectors, which appear in
red and are labelled with the slide names (gene IDs). The value on the top
horizontal axis of a vector for each slide (gene) shows the weight of that
slide (gene) that was used to transform the slide (gene) values to their first
new sample (new gene. The value on the right vertical axis of a vector for
each slide (gene) shows the weight of that slide (gene) that was used to
transform the values to their second new sample (new gene). Hence, if a
slide (gene) vector has a large value for the horizontal axis it means that the
variability across the genes (slides) for that slide (gene) was large, and
therefore that slide (gene) had a large weight in the transformation of the
data points to their first new sample (new gene). If a slide (gene) vector
has a small value for the horizontal axis it means that the variability across
the genes (slides) for that slide (gene) was small, and therefore that slide
(gene) only had a small weight in the transformation of the data points to
their first new sample (new gene). A similar description can be given for
the values of the slide (gene) vectors on the right vertical axis.
In the new sample (new gene) transformation, the largest
variance across the genes (slides) is captured in the first new sample (new
gene) (which is a combination of all the slides (genes), the weights of each
slide (gene) being displayed by the vectors on the biplots). The percentage of
the total variance across the genes (slides) that is captured in the first new
sample (new gene) after the transformation is printed to the R screen.
Similarly, the percentage of the total variance that is captured in the second
new sample (new gene) after the transformation is also printed to the R
screen.
The input dataset upon which the clustering is performed is given
in the parameter data. This dataset must have been constructed using the
CreateMultivariateData() function. The number of groups that are
generated and used to assign gene labels for the second biplot is given by the
input parameter, number.of.clusters. Note that there is only
functionality to cluster on genes.
The third parameter, number.of.genes, can take one of three
different types: the text minimal, the text compared, or a number between 1
and the total number of genes on the slide. However, choosing a large number of
genes (say, over 1000), will possibly crash the function because R may run out
of memory. If number.of.genes is set to minimal, the genes used in the
clustering are those that were selected in the SaveInterestingGenes()
function. That is, these are genes that appeared as interesting in at least
one of the slides being analysed. If the SaveInterestingGenes() function
has not been run, or if it has been run most recently on a different set of
slides from those input into the CreateMultivariateData() function, ComputeHierarchicalClustering() will
not run and an error will be returned. Note that if the number of genes
selected in the SaveInterestingGenes() function is too large, again the
function may crash because of memory problems. If number.of.genes is set
to compared, the genes used in the clustering are those that were selected in
the CompareInterestingGenes() function. That is, these are genes that
appeared as interesting in a specified percentage of the slides being
analysed. If the CompareInterestingGenes() function has not been run, or
if it has been run most recently on a different set of slides from those input
into the CreateMultivariateData() function, ComputeHierarchicalClustering() will not run and an error
will be returned. If number.of.genes is set to a number, then the median
of each of the (absolute normalised log(R/G)) gene values across the slides is
computed. The genes are sorted according to these median values, and the top number.of.genes
are used for the clustering.
Currently no results from the ComputeBiplot() function are
saved to a file.
1. Dudoit,
S., Yang, Y. H., Callow, M. J., and Speed, T. P. Statistical methods for
identifying expressed genes in replicated cDNA microarray experiments. 2001.
Notes: Technical report 578, Stanford University.
2. Gabriel,
K. R. (1971) Biometrika 58, 453-467.
3. Hartigan,
J. A. & Wong, M. A. (1979) Applied Statistics 28, 100-108.
4. Ihaka,
R. & Gentleman, R. (1996) Journal of Computational and Graphical Statistics 5,
299-314.
5. Kohonen,
T. (1990) Proceedings of the IEEE 78, 1464-1480.
6. Wilson,
D. L. , Buckley, M., Helliwell, C., and Wilson, I. New normalization methods for
cDNA microarray data. 2002.
Notes: Submitted for publication.
7. Yang,
Y. H., Dudoit, S., Luu, P., Lin, D. M., Peng, V., Ngai, J., & Speed, T. P.
(2002) Nucleic Acids Research 30, e15.
[1] However, note that using a cut-off is provided as an option.
[2] The working directory can be reset by specifying the directory of your choice in the Target box in the R properties dialog box.
[3] Robust estimation implies that the estimation method will not be greatly affected by outliers.
[4] The size of the median filter is currently set to a 3*3 window.
[5]
The
back-transformed value for some number x (which is on the log (base 2)
scale) will be 2x. The back-transformed values will be
roughly on the same scale as the original ratio values and may be used to
compare the current results with historical results where only (un-normalised)
ratio values are reported.