ExpressionSieve TM  Product Manual
Getting Started
How to obtain and install JVM(Java virtual machine)
System recommendations
Start ExpressionSieve


Basic function
Load in data
Preprocess the data
Select data analysis settings
Pattern recognition analysis
View previous analysis results
Color PCA dot by annotations
Color hierarchical tree by annotations
Save analysis results - general
Save analysis results - Save hierarchical clustering results
Save analysis results - Save PCA analysis results
Save analysis results - Save KMean analysis results
Save analysis results - Save nearest neighbours analysis results


Advanced functions
Pattern annotations - Gene Ontology
Pattern annotations - Yeast Function Catalog
Pattern annotations - Biological Pathway


How to obtain and install JVM(Java virtual machine)
Go to http://java.sun.com/products/archive/index.html, download the J2SE release for your operating system, and follow the installation instructions provided there.

System Recommendations
All of the implementations are carefully crafted to achieve both speed and memory efficiency. To give an idea of the CPU power and RAM capacity required, here is a sample of the systems that ExpressionSieve runs smoothly on:
Recommended screen resolution: 1024 x 768
Recommended JVM version: J2SE version 1.3.0_01 or later

Start ExpressionSieve TM
From a Linux/Unix or Windows command prompt, type: java -jar ExpressionSieve.jar. On Windows, you can also double-click the .jar file.
ExpressionSieve application interface: The main window is partitioned into three adjustable areas. The upper right part (the biggest by default) is the main view area for displaying and navigating through the various analysis results. The bottom part is the message area, which gives a short summary in response to each action taken. The left part is the database browse area for navigating through the databases of all analysis results.

Load in data
ExpressionSieve currently accepts two tab-delimited file formats.
1) Stanford file format, with array names in the 1st row and gene names in the 1st column. Each cell contains the expression value, which is either the 2-channel ratio, log(ratio), or signal. Taking the log can be done later through "Preprocess the data".
     From the menu, choose File - Open..., and select the data file you wish to load. The data table appears in the display area.
2) Affy file format, with array name, "call", and "p-value" in the 1st row and gene names in the 1st column. Each cell under an array name is the expression signal, each cell under "call" is the call symbol, and each cell under "p-value" is the p-value.
    From the menu, choose File - Open Affy Data..., and select the data file you wish to load. The data table appears in the display area. Blank cells represent "A" call values, which are treated as missing values and handled through "Preprocess the data".
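
To make the Stanford layout concrete, the following Python sketch reads such a file into a dictionary. It is an illustration only and not part of ExpressionSieve: the function name parse_stanford, the sample gene and array names, and the choice to represent blank cells as None are all assumptions made for this example.

```python
import csv
import io


def parse_stanford(text):
    """Parse tab-delimited text: array names in row 1, gene names in column 1.

    Returns (array_names, {gene_name: [values]}); blank cells become None.
    (Hypothetical helper, not ExpressionSieve's own loader.)
    """
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    array_names = header[1:]  # the first header cell labels the gene column
    data = {}
    for row in body:
        if not row:
            continue  # skip empty lines
        gene, cells = row[0], row[1:]
        # Pad short rows so every gene has one entry per array.
        cells += [""] * (len(array_names) - len(cells))
        data[gene] = [float(c) if c.strip() else None for c in cells]
    return array_names, data


sample = "gene\tarray1\tarray2\nYAL001C\t1.5\t0.8\nYAL002W\t\t2.0\n"
arrays, table = parse_stanford(sample)
print(arrays)
print(table)
```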

In either case, information about the data file is displayed in the message area. Click on a column title, and a new window pops up displaying a histogram for that array's data. To see the histogram for a different experiment, either click on its column title or use the scroll bar in the new window. Scrolling to the very end shows the last two histograms, "Histogram for Std" and "Histogram for Span", which can help you decide on the cutoff values for data filtering; see "Preprocess the data". Click on a row title, and a new window pops up displaying the expression profile for that gene across all arrays.
We have noticed that on Linux the new window does not pop up automatically but stays as an icon; click on the icon to view the histogram.

Preprocess the data
Preprocessing handles duplicate gene ids, missing values, data transformation and normalization, and data filtering.
From the menu, choose "PreProcess"; there are three types of preprocessing. If preprocessing is skipped and an analysis (see "Pattern recognition analysis") is run right after loading the data, the default actions are taken.
Data Integrity Check & Fix: offers options for checking and fixing duplicate gene ids and missing values.
Duplicate gene ids: the choices for handling these are "Append Suffix", "Average", and "Discard all". The default action is "Append Suffix".
Missing values - Max % Allowed in column: if the percentage of missing values in a column exceeds this value, the column is discarded. The default is "100"%.
Missing values - Max No. Allowed in row: if the number of missing values in a row exceeds this value, the row is discarded. The default is "0".
Missing values - Order of Filter: the sequence of filtering, column first or row first. The default is "Row First".
Missing values - Estimate Method: the method used to estimate the remaining missing values. The default is "Column Mean".
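
The missing-value options above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not ExpressionSieve's implementation: the function name handle_missing and its parameter names are hypothetical, missing values are represented as None, and a column with no observed values falls back to 0.0.

```python
def handle_missing(rows, max_pct_col=100.0, max_n_row=0, row_first=True):
    """rows: list of lists, with None marking a missing value.

    Defaults mirror the documented defaults (100 %, 0, row first).
    Returns a cleaned copy with remaining gaps filled by the column mean.
    """

    def drop_rows(data):
        # Discard rows with more missing values than allowed.
        return [r for r in data if sum(v is None for v in r) <= max_n_row]

    def drop_cols(data):
        # Discard columns whose percentage of missing values exceeds the limit.
        if not data:
            return data
        n = len(data)
        keep = [
            j for j in range(len(data[0]))
            if 100.0 * sum(r[j] is None for r in data) / n <= max_pct_col
        ]
        return [[r[j] for j in keep] for r in data]

    steps = (drop_rows, drop_cols) if row_first else (drop_cols, drop_rows)
    for step in steps:
        rows = step(rows)
    if not rows:
        return rows

    # Estimate what is still missing with the mean of the observed column values.
    means = []
    for j in range(len(rows[0])):
        seen = [r[j] for r in rows if r[j] is not None]
        means.append(sum(seen) / len(seen) if seen else 0.0)  # 0.0 is an assumed fallback
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]


data = [[1.0, None, 3.0], [2.0, 4.0, None], [None, None, None]]
print(handle_missing(data, max_n_row=1))
print(handle_missing(data))
```

Note that with the default settings (row first, at most 0 missing values per row), every row containing a missing value is discarded, so nothing is left for the column-mean estimate to fill in.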
Data Transform & Normalization: offers options for log transformation and normalization of the data.
Take Log?: enter the base for the log transformation. If the log transformation is applied to negative values, a dialog box pops up with options for handling them. The default is "Take Log".
Normalization Policy: centers the data around the row/column mean/median. The default is "Do Nothing".
Row/column mean center: subtracts the row-wise/column-wise mean from the values in each row/column, so that the mean value of each row/column is 0.
Row/column median center: subtracts the row-wise/column-wise median from the values in each row/column, so that the median value of each row/column is 0.
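
The centering options can be illustrated with a short Python sketch. The function names center_rows and center_columns are hypothetical and chosen only for this example; the arithmetic follows the definitions above.

```python
import statistics


def center_rows(rows, how="mean"):
    """Subtract each row's mean (or median) from the values in that row."""
    stat = statistics.mean if how == "mean" else statistics.median
    centered = []
    for r in rows:
        c = stat(r)
        centered.append([v - c for v in r])
    return centered


def center_columns(rows, how="mean"):
    """Center columns by transposing, centering rows, and transposing back."""
    transposed = [list(col) for col in zip(*rows)]
    return [list(r) for r in zip(*center_rows(transposed, how))]


values = [[1.0, 2.0, 6.0], [4.0, 4.0, 10.0]]
print(center_rows(values, "mean"))
print(center_rows(values, "median"))
print(center_columns(values, "mean"))
```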

Data Filtering: offers options to filter out genes that do not meet the desired criteria. The default is no filtering.
Check STD?: Any gene whose standard deviation (std) across all arrays is less than "STD" will be filtered out.
Check Span?: Any gene whose span (MaxValue - MinValue) is less than "Span" will be filtered out.
Check ABS?: Any gene whose maximum absolute value across all arrays is less than "ABS" will be filtered out.
The three data filters can be combined in any way and in any order. The "Histogram for Std" and "Histogram for Span" discussed in "Load in data" can help you decide on the values of "STD" and "Span".
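
As a sketch of how the three filters combine, the Python example below keeps only the genes that pass every enabled check. The function name filter_genes and its parameters are hypothetical, and the use of the sample standard deviation (statistics.stdev) is an assumption; the manual does not say which estimator ExpressionSieve uses.

```python
import statistics


def filter_genes(data, std_min=None, span_min=None, abs_min=None):
    """data: {gene_id: [values]}. A filter is skipped when its threshold is None."""
    kept = {}
    for gene, values in data.items():
        # Check STD: sample standard deviation across all arrays (assumed estimator).
        if std_min is not None and statistics.stdev(values) < std_min:
            continue
        # Check Span: MaxValue - MinValue.
        if span_min is not None and max(values) - min(values) < span_min:
            continue
        # Check ABS: maximum absolute value across all arrays.
        if abs_min is not None and max(abs(v) for v in values) < abs_min:
            continue
        kept[gene] = values
    return kept


genes = {
    "g1": [0.1, -0.1, 0.0],
    "g2": [2.0, -1.0, 0.5],
    "g3": [3.0, 3.1, 2.9],
}
print(sorted(filter_genes(genes, span_min=1.0)))
print(sorted(filter_genes(genes, abs_min=1.0)))
```

Because a gene must pass every enabled check, the order in which the checks run does not change the result.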
The data generated by each preprocessing step, as well as the original data, can be saved as tab-delimited files (see "Save analysis results") and viewed in table format in the main view area (see "View previous analysis results").
 
Select data analysis settings
From the menu, choose "Setting", which lets you select the analysis dimension (gene or experiment) and the similarity metric.

Setting - Gene or Experiment: Lets you choose the analysis dimension, i.e. whether to run the analysis on "gene" or "experiment".
Setting - Similarity Measure: Lets you choose the similarity metric from eight different types. After you make this selection, the distance distribution is shown in the main view area.
Show Current Settings: Shows the current selection of analysis dimension and similarity measure.
Show Default Settings: Shows the default selection of analysis dimension and similarity measure, which is "gene" and "Pearson".
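
For reference, the default "Pearson" measure can be sketched in Python as below. The conversion to a distance as 1 - r is an assumption made for illustration; the manual does not state how ExpressionSieve turns the similarity into the distance it displays.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles.

    Raises ZeroDivisionError if either profile is constant.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def pearson_distance(x, y):
    """Assumed conversion: 0 for perfectly correlated, 2 for anti-correlated."""
    return 1.0 - pearson(x, y)


print(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(pearson_distance([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))
```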
 
Pattern recognition analysis
From the menu, choose "Analysis", which lists the pattern recognition algorithms implemented.
Analysis - PCA: runs the PCA analysis and generates the PCA projection dot plot.
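
To show what a PCA projection computes, here is a small Python sketch that finds the first principal component by power iteration on the covariance matrix. Power iteration is used here only because it needs no external libraries; the manual does not say how ExpressionSieve computes PCA, and the function name first_principal_component is hypothetical.

```python
import math


def first_principal_component(rows, iterations=200):
    """rows: observations x variables. Returns (eigenvalue, direction, scores).

    Power iteration can fail if the start vector is orthogonal to the
    leading direction; the all-ones start below is a simple assumption.
    """
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    x = [[r[j] - means[j] for j in range(m)] for r in rows]  # center each column
    cov = [
        [sum(x[i][a] * x[i][b] for i in range(n)) / (n - 1) for b in range(m)]
        for a in range(m)
    ]
    v = [1.0] * m
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    cv = [sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)]
    eigenvalue = sum(v[a] * cv[a] for a in range(m))
    scores = [sum(x[i][j] * v[j] for j in range(m)) for i in range(n)]  # projections
    return eigenvalue, v, scores


points = [[2.0, 0.0], [0.0, 1.0], [-2.0, 0.0], [0.0, -1.0]]
eigenvalue, direction, scores = first_principal_component(points)
print(eigenvalue, direction, scores)
```

The scores are the coordinates that would be plotted along the first axis of a projection dot plot; further components require removing the first direction and repeating.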

Analysis - Single Link Cluster, or Average Link Cluster: runs hierarchical clustering and generates the color map with one dendrogram for the analysis dimension chosen. Selecting hierarchical clustering again for the other dimension (for example, clustering on gene first, then on experiment) generates the color map with two dendrograms. The calculations on the two analysis dimensions (gene and experiment) are completely independent and can be done using different similarity metrics and different hierarchical clustering algorithms. Clicking a node on the dendrogram or dragging a box on the color map pops up a new window displaying the selected area of the color map and dendrogram(s), and at the same time highlights the selected tree on the original dendrogram.
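
The difference between single link and average link clustering can be seen in the following Python sketch of agglomerative clustering. It is a naive illustration, not ExpressionSieve's algorithm; the function name hierarchical_cluster and the one-dimensional sample data are assumptions.

```python
def hierarchical_cluster(points, dist, linkage="single"):
    """Agglomerative clustering over point indices.

    Returns the merges in order as (members_a, members_b, distance);
    these are the node heights a dendrogram would draw.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []

    def between(a, b):
        ds = [dist(points[i], points[j]) for i in a for j in b]
        # Single link: closest pair; average link: mean over all pairs.
        return min(ds) if linkage == "single" else sum(ds) / len(ds)

    while len(clusters) > 1:
        # Find the closest pair of clusters under the chosen linkage.
        d, ia, ib = min(
            (between(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        merges.append((clusters[ia], clusters[ib], d))
        merged = clusters[ia] + clusters[ib]
        clusters = [c for k, c in enumerate(clusters) if k not in (ia, ib)] + [merged]
    return merges


values = [0.0, 1.0, 3.0, 7.0]
gap = lambda a, b: abs(a - b)
single = hierarchical_cluster(values, gap, "single")
average = hierarchical_cluster(values, gap, "average")
print([m[2] for m in single])
print([m[2] for m in average])
```

On this data both linkages merge the same items in the same order, but average link reports larger merge distances, which is why the two dendrograms can differ in shape.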

Analysis - KMean Cluster: runs the KMean analysis. First, a dialog box pops up for you to enter the number of clusters you wish to partition the data into; the result is then displayed according to the analysis dimension (gene or experiment).
    For KMean clustering on gene, a table is displayed describing, for each cluster, the size (how many genes are in the cluster), the center (the gene id which is the center of the cluster), Min Dist (the shortest distance from all genes to the cluster center), Max Dist (the maximum distance from all genes to the cluster center) and Average Dist (the average of the distances from all genes to the cluster center). Click on a cluster number, and a new window pops up displaying the expression profiles and the list of genes for that cluster. Clicking on a gene id in the list on the right-hand side of the popped-up window highlights that gene id and its expression profile in blue, as does clicking on one of the profiles. The "Scan All Clusters" button displays the expression profiles for all the clusters; you can then click on any cluster to display the expression profiles for just that cluster. The "Scan All Sketches" button displays the center profile for all the clusters; again, you can click on any cluster to display the expression profiles for just that cluster. Back in the main KMean result window, the "View KMean Result in PCA Projection Plane" button generates a PCA projection dot plot in which each dot is colored and shaped according to the KMean cluster it belongs to. Clicking on a cluster in the table on the left-hand side of the window pops up a new window displaying the PCA projection for just that cluster. Clicking on either an id in the list on the right-hand side of the new window or a dot on the PCA projection highlights both the id and the dot for that gene. To go back from the PCA projection view to the table view for KMean clustering on gene, use the "View" menu; see "View previous analysis results".
    For KMean clustering on experiment, the PCA projection plot is the only view.
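
The columns of the KMean result table can be illustrated with the Python sketch below, which summarizes one cluster. Several points are assumptions, since the manual does not define them precisely: distances are Euclidean and measured to the mean profile of the cluster, and the "center" is reported as the gene closest to that mean profile. The function name cluster_summary is hypothetical.

```python
import math


def cluster_summary(cluster):
    """cluster: {gene_id: profile}. Returns one row of the summary table."""
    profiles = list(cluster.values())
    n = len(profiles)
    # Mean profile of the cluster (assumed definition of the cluster center).
    centroid = [sum(col) / n for col in zip(*profiles)]
    dists = {g: math.dist(p, centroid) for g, p in cluster.items()}
    return {
        "size": n,
        "center": min(dists, key=dists.get),  # gene nearest the mean profile
        "min_dist": min(dists.values()),
        "max_dist": max(dists.values()),
        "avg_dist": sum(dists.values()) / n,
    }


summary = cluster_summary({"g1": [0.0, 0.0], "g2": [3.0, 0.0], "g3": [3.0, 3.0]})
print(summary)
```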
Analysis - Nearest Neighbours: depending on the analysis dimension, sorts genes or experiments by increasing distance to a selected center. First, a dialog box pops up for you to enter the center ID, either a gene ID or an experiment ID. Here is a sample result.
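
A nearest-neighbours ranking of this kind can be sketched in Python as follows. The function name nearest_neighbours is hypothetical, Euclidean distance stands in for whichever similarity measure is selected, and leaving the center itself out of the result is an assumption.

```python
import math


def nearest_neighbours(data, center_id, dist=math.dist):
    """data: {id: profile}. Returns [(id, distance)] sorted by increasing distance."""
    center = data[center_id]
    others = [(dist(p, center), g) for g, p in data.items() if g != center_id]
    return [(g, d) for d, g in sorted(others)]


profiles = {
    "g1": [0.0, 0.0],
    "g2": [3.0, 4.0],
    "g3": [1.0, 0.0],
    "g4": [0.0, 2.0],
}
ranked = nearest_neighbours(profiles, "g1")
print(ranked)
```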

View previous analysis results
From the menu, choose "View", which lets you view previous analysis results again. If "View" is selected before "Analysis", i.e. no calculation has been done earlier, "View" works similarly to "Analysis", with slight differences for some pattern recognition algorithms as discussed below.

View - Data Table: Lets you view the original or preprocessed data tables.
View - Statistical Summary: Lets you view the statistical summary for the original or preprocessed data. It is the same summary you get when you click a column title on a data table, as described in "Load in data".
View - PCA Eigenvalue: Displays the eigenvalue table. Clicking on the column title pops up a new window displaying the same information in an XY plot.
View - PCA Projection: Displays the PCA projection dot plot. If no PCA has been calculated before, selecting either "PCA Eigenvalue" or "PCA Projection" runs the PCA analysis and displays the results accordingly.
View - Single Link, or Average Link: Displays the color map and dendrogram(s). If the analysis has not been done, it is done first. As mentioned in "Pattern recognition analysis", the display has one or two dendrograms along with the color map, depending on whether one or two analysis dimensions have been analyzed.
View - KMean: Displays the KMean clustering results. As mentioned in "Pattern recognition analysis", the default display differs depending on the analysis dimension currently chosen. If KMean clustering has not been calculated, it is calculated using the default number of clusters, which is 1.
View - Nearest Neighbours: Displays the previous "Nearest Neighbours" analysis result. "Analysis - Nearest Neighbours" has to be run first.
View - Color PCA: see explanation below.
View - Color Hierarchical Tree:  see explanation below.

Color PCA dot by annotation
Color PCA dots by experiment annotations, such as cell type, compound treatment, etc.
View - View PCA Colored by Attributes...: the first time this function is called for a dataset, a dialog box pops up for loading the experiment or gene annotation file, depending on the analysis dimension currently chosen.
Here is a sample of an experiment annotation file (tab-delimited). After the file is loaded, the title line from the annotation file is displayed; click a title to display the colored PCA. You can toggle between multiple titles (screen shot). Click an annotation name, and a new window pops up with the data in the selected cluster. Dots and names are both clickable; clicking one highlights the other.
PCA on gene can also be colored by gene annotations by following similar steps.
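
Coloring by annotation amounts to grouping IDs by the value found under the selected title. The Python sketch below illustrates this under stated assumptions: the sample annotation file is not reproduced here, so the layout (IDs in the first column, one annotation title per remaining column), the sample values, and the function name group_by_annotation are all hypothetical.

```python
import csv
import io


def group_by_annotation(text, title):
    """Group IDs (first column, assumed) by their value under the chosen title."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    id_column = reader.fieldnames[0]
    groups = {}
    for row in reader:
        groups.setdefault(row[title], []).append(row[id_column])
    return groups


annotations = (
    "experiment\tcell type\ttreatment\n"
    "exp1\tT cell\tdrug A\n"
    "exp2\tB cell\tdrug A\n"
    "exp3\tT cell\tcontrol\n"
)
by_cell_type = group_by_annotation(annotations, "cell type")
print(by_cell_type)
```

Each group would then be drawn in its own color; toggling to another title simply regroups the same IDs.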

Color hierarchical tree by annotation
Color hierarchical clustering result by experiment annotations, such as cell type, compound treatment, etc. It takes a few steps to achieve this.

1) Run the hierarchical clustering analysis as described in "Pattern recognition analysis", under "Analysis - Single Link Cluster, or Average Link Cluster".
2) File - Open Experiment ID Description...: load a file containing experiment annotations. Here is a sample of the tab-delimited file.
3) View - Color Tree: displays a color bar in which each sample is colored according to its annotations. You can toggle between multiple annotations.

Save analysis results

Copy/paste or save to disk a list of genes, a list of experiments, or expression data for future analysis. The menu is opened with a right-click and is available for most of the views. However, not all options are valid for every view.
Save Result...: Saves a data table, either original or preprocessed, when the main area displays a data table, whether right after loading the data or through "View - Data Table".
Save as Gif File... and Save as JPEG File...: Save the content to an image file.
Save Gene List...: Save the list of gene IDs.
Save Gene List with EPV...: Save the list of genes along with expression values for all experiments.
Copy Gene List...: Copy the list of genes onto the system clipboard for pasting into other applications.
Save Experiment List...: Save the list of experiment IDs.
Save Experiment List with EPV: Save the list of experiments along with expression values for all genes.
Copy Experiment List...: Copy the list of experiments onto the system clipboard for pasting into other applications.
Save Both List...: Save both lists of gene IDs and experiment IDs.
Save Both List with EPV...: Save both gene IDs and experiment IDs along with expression values.
Copy Both List...: Copy both gene IDs and experiment IDs onto the system clipboard for pasting into other applications.

Save hierarchical clustering results
After hierarchical clustering analysis, you can save the part of the tree you are interested in. First, highlight the part to be saved by either 1) clicking a node on the dendrogram or 2) dragging a box on the color map. Either way, a new window pops up containing the selected part of the data. Right-clicking with the mouse over the new window displays the menu of saving options. All saving options are available for this view.

Save PCA analysis results
For the annotation-colored PCA view, click an annotation name, and a new window pops up with the data in the selected cluster. Right-clicking with the mouse over the new window displays the menu of saving options. The available options save an image, or text file data related to the analysis dimension chosen. The image saving option is also available for the uncolored PCA.