BSPC Algorithm

Digital Transformation


The first step in the algorithm is to transform the real-valued (i.e. analog) gene expression levels into a small number of discrete (i.e. digital) levels. This is accomplished using a clustering algorithm applied to the expression levels from each gene, one gene at a time. In other words we are performing a one-dimensional clustering where the single dimension is the expression level values of all the samples of a single gene. Any one of a number of clustering algorithms could be used for this step, however, it is important that the number of resulting clusters should not be fixed in advance. In particular we wish to allow genes in which the samples' expression levels distribute in what is apparently a unimodal distribution to form a single cluster.


For the examples shown here, we utilized a single linkage (nearest neighbor) hierarchical clustering followed by the application of the Gap Statistic Method of Tibshirani et al. to generate clusters from the hierarchical tree. Each sample's expression level is replaced by the mean of the cluster to which it belongs. These mean values act as cluster identifiers while maintaining information about the relative positions of the clusters, information that is important for the next stage of the processing.

Gap Stat

The basic idea of the gap method is as follows. If one partitions data with n independent observations into k clusters and measures as Wk the pooled within-cluster sum of squares around the cluster mean, the error measure Wk will decrease as the number of clusters k increases, reaching zero when k = n. For some k, this decrease often flattens markedly, and this may be taken as an indication of the optimal number of clusters. The gap statistic developed by Tibshirani et al. is an attempt at formalizing this. Wk for real data is compared to a reference distribution computed from a uniform distribution over the range of observed values. The optimal number of clusters k is then taken to be the smallest k for which Gap(k) > Gap(k+1) - sk+1.

For further definitions and details, the reader is referred to the original publication "Estimating the number of clusters in a data set via the gap statistic" by Tibshirani R, Walther G, Hastie published in The Journal of the Royal Statistical Society Series B - Statistical Methodology 63: 411-423 Part 2 2001 and also at www-stat.stanford.edu/~tibs/research.html.



Binary-State Patterns


To compensate for the potential that a gene's expression levels might erroneously break into more clusters than there are actual gene states, we utilized a procedure that allows these breaks to be mended by combining clusters in a restricted manner. For genes generating three or more clusters, unions of different combinations of clusters are made each time forming a different view of the distribution of the gene levels into states. These views are stored as binary strings which we refer to as state patterns. Each putative state of each gene is represented by a corresponding binary state pattern.

A binary state pattern string has length T, where T is the number of samples in the dataset. Each position within the string contains either a 0 or a 1, each corresponding to a single sample using an arbitrary but fixed ordering of the samples. A 1 indicates that the expression level for this gene (the one associated with this gene state pattern) and this sample (the one associated with this bit position) is believed to be in this state (the one corresponding to this gene state pattern). A 0 indicates that the sample's gene is not in the state.

The following pseudo-code describes the algorithm for combining clusters into putative states and forming the corresponding gene state pattern. Note that for each gene the output of the one-dimensional single linkage hierarchical clustering algorithm is a series of floating-point values, one value for each sample. These values act as identifiers for the cluster to which the sample belongs. If all the values/identifiers for a given gene are the same then only a single cluster was found for this gene. These identifiers also happen to be the mean of the expression levels of the samples belonging to the corresponding cluster. The mean level is known to be a suitable unique identifier because these clusters cannot overlap.





1) for each gene determine the list of unique identifiers
2) if there is only a single unique identifier go on to the next gene at step 1
3) initialize a binary state pattern and its complement to all 0's and all 1's, respectively
4) sort the unique identifiers (remember these are also the cluster mean values)
5) for each unique identifier
6) for each position in the state pattern
     7) set position to a 1 if the identifier corresponding 
          to this position is equal to the current unique identifier, 
          set a 0 in the complement
     8) go to step 6 until all the positions/samples have been visited
9) save the current state pattern or its complement depending upon which has fewer 1's, if they have an equal number of 1's save the pattern which is numerically higher (when viewed as a unsigned binary number)
10) go to step 5 until each unique identifier has been visited
11) go to step 1 until each gene has been visited
12) done
Note that since complementary patterns contain redundant information we have systematically thrown away one pattern from each complementary pair of patterns.

Matching Patterns


A search for matching binary state patterns can either be conducted in a supervised fashion (i.e. search for matches to a specific pattern) or in an unsupervised fashion (i.e. search for all patterns that match one another). In either case we will refer to the number of patterns that all match as being the match-count. In general, large match-counts are less likely to occur by chance alone than small match-counts. For the supervised search case, the probability of finding a match with a match-count greater than or equal to a specified integer j is determined by the following expression:
BSPC formula
And where k is the number of 1's in the reference pattern, Nk is the number of gene state patterns having k 1's and T is the length of the gene state pattern.

In the unsupervised search case there may be multiple gene state patterns that have matches. For our purposes, we will refer to these groups of matching patterns as "clusters". A cluster identifies both a group of samples (samples corresponding to the 1's in the gene state pattern) and a group of genes (those which originally gave rise to the gene state pattern). Clusters that exceed a specified threshold in their match-count will be referred to as "passing clusters". An independent search for clusters can be performed separately for each set of gene state patterns having a specified number of 1's. Each of these tests has a probability of finding a match with a match-count greater than or equal to a specified integer j that is described by the following expression:
BSPC formula
Since these tests are independent of one another, the overall probability of finding even a single passing cluster during any of these tests can be fixed to be below a desired alpha-level through a judicious choice of match-count threshold for each test (see next section) and then summing the probabilities over all tests.

Match-count Threshold

The probability of finding random matching patterns in one subset is independent of finding matches in any of the other subsets. Thus, the search for matches occurring in any of a group of subsets constitutes multiple independent tests, the overall probability of which can be calculated by summing the probabilities for each individual subset. The probability for each subset can be individually controlled by considering only matches involving greater than a threshold number of patterns. Since this "match-count" threshold is an integer, we do not have complete control over the probability of a "passing" match (i.e. a match involving more than the threshold number of patterns), but by increasing the match-count threshold we can insure that the probability is less than a specified target single-test alpha-level.

The goal here is to determine a different threshold for each subset Sk such that the overall probability of finding a cluster within any subset is as close as possible to but not above the desired overall alpha-level. Simply setting the target alpha-level for each subset to be the overall alpha-level divided by the number of subsets to be tested (i.e. the number of independent tests) will generally result in a overall test that is too stringent because the integer thresholds typically result in probabilities below the target and in composite will be well below the desired overall alpha-level. A higher single-test target alpha-level can therefore be used when adjusting the match-count thresholds to get an overall probability as close as possible to but still below the desired overall alpha-level. Although some subsets will end up being more stringent than others, this is an unavoidable consequence of the integer nature of the match-count threshold; they cannot be made to be equal. By using the same target for all subsets we are treating the subsets as equally as is possible.

The following pseudo-code describes the iterative algorithm used to determine the target single-test alpha-level and the corresponding match-count thresholds. Since we know that the optimal target must be somewhere between the desired overall alpha-level and this alpha-level divided by the number of subsets to be tested, we will use these values as the initial known too-high and too-low target levels, respectively. The initial test guess of the target single-test alpha-level will be half-way between the too-high and too-low levels. As test target levels are found to yield cumulative probabilities that either too high or too low relative to the desired overall alpha-level, these become the new too-high and too-low bounds on the target and the new target is set at half the distance to the appropriate bound (the too-low bound if the target was found to be too high and vice-versa).

1) initialize the too-high bound to the desired overall alpha-level, the too-low bound to the alpha-level divided by the number of subsets to be tested and the single-test target to be halfway between these
2) loop
3) for each subset of interest

4) increment the match-count threshold, calculating the probability for each from equation (see below) until this probability falls below the target and take this probability and this threshold as the probability and match-count threshold for this subset
5) accumulate the probabilities
6) go to step 3 until all subsets of interest have been visited

7) if the cumulative probability is greater than the desired overall alpha-level, set the new too-high bound to the current target, set the new target to halfway between its current value and the too-low bound and calculate the change in the target

8) else, set the new too-low bound to the current target, set the new target to halfway between its current value and the too-high bound and calculate the change in the target

9) go to step 2 until the change in the target is less than a small fraction (1%) of the desired overall alpha-level and the cumulative probability is below this alpha-level.
10) done