Breast cancer prognosis by combinatorial analysis of gene expression data

Alexe, Gabriela; Alexe, Sorin; Axelrod, David E; Bonates, Tibérius O; Lozina, Irina I; Reiss, Michael; Hammer, Peter L

doi:10.1186/bcr1512

Research article
Open access
Published: 19 July 2006

Breast Cancer Research volume 8, Article number: R41 (2006)

Abstract

Introduction

The potential of applying data analysis tools to microarray data for diagnosis and prognosis is illustrated on the recent breast cancer dataset of van 't Veer and coworkers. We re-examine that dataset using the novel technique of logical analysis of data (LAD), with the double objective of discovering patterns characteristic for cases with good or poor outcome, using them for accurate and justifiable predictions; and deriving novel information about the role of genes, the existence of special classes of cases, and other factors.

Method

Data were analyzed using the combinatorics and optimization-based method of LAD, recently shown to provide highly accurate diagnostic and prognostic systems in cardiology, cancer proteomics, hematology, pulmonology, and other disciplines.

Results

LAD identified a subset of 17 of the 25,000 genes capable of fully distinguishing between patients with poor and good prognoses. An extensive list of 'patterns' or 'combinatorial biomarkers' (that is, combinations of genes and limitations on their expression levels) was generated, and 40 patterns were used to create a prognostic system, shown to have 100% and 92.9% weighted accuracy on the training and test sets, respectively. The prognostic system uses fewer genes than other methods, and its accuracy is similar to or better than that reported in other studies. Out of the 17 genes identified by LAD, three were shown to play a significant role in determining poor prognosis and five in determining good prognosis. Two new classes of patients (described by similar sets of covering patterns, gene expression ranges, and clinical features) were discovered. As a by-product of the study, it is shown that the training and the test sets of van 't Veer have differing characteristics.

Conclusion

The study shows that LAD provides an accurate and fully explanatory prognostic system for breast cancer using genomic data (that is, a system that, in addition to predicting good or poor prognosis, provides an individualized explanation of the reasons for that prognosis for each patient). Moreover, the LAD model provides valuable insights into the roles of individual and combinatorial biomarkers, allows the discovery of new classes of patients, and generates a vast library of biomedical research hypotheses.

Introduction

Microarray gene expression technology has provided extensive datasets that describe patients with cancer in a new way. Several methodologies have been used to extract information from these datasets. In this study we used the methodology of logical analysis of data (LAD) [1, 2] to reanalyze the publicly available microarray dataset reported by van 't Veer and coworkers [3]. The motivation for using yet another method to analyze these data was the expectation that the specific aspects of LAD, and especially the combinatorial nature of its approach, would allow the extraction of new information on the problem of metastasis-free survival of breast cancer patients, and in particular on the role of various significant combinations of genes that may have an influence on this outcome.

The main goal of the study by van 't Veer and coworkers was to predict the clinical outcome of breast cancer (that is, to identify those patients who will develop metastases within 5 years) based on analysis of gene expression signatures. The crucial importance of this problem arises from the fact that the available adjuvant (chemo or hormone) therapy, which reduces by about one-third the risk for distant metastases, is not really necessary for 70–80% of the patients who currently receive it. Moreover, this therapy can have serious side effects and involves high medical costs. The study by van 't Veer and coworkers illustrates clearly that machine learning techniques, data mining, and other new techniques applied to DNA microarray analysis can outperform most clinical predictors currently in use for breast cancer. The study concludes that the new findings, '... provide a strategy to select patients who would benefit from adjuvant therapy'.

A specific feature of datasets coming from genomics is the presence of a very large number of measurements concerning gene expressions but only a relatively small number of observations. For instance, the attributes in the van 't Veer study correspond to more than 25,000 human genes, whereas the number of cases was only 97. In that dataset, each case is described by the expression levels of 25,000 genes, as measured by fluorescence intensities of RNA hybridized to microarrays of oligonucleotides. The cases included in the dataset are 97 lymph-node-negative breast cancer patients, who are grouped into a training set of 78 and a test set of 19 cases. The training set includes 34 positive cases (having a 'poor prognosis' signature; that is, having fewer than 5 years of metastasis-free survival) and 44 negative cases (having a 'good prognosis' signature; i.e. having more than 5 years of metastasis-free survival). The test set includes 12 positive and seven negative cases.

The van 't Veer study used DNA microarray analysis in primary breast tumors, and "applied supervised classification to identify gene expression signature strongly predictive of a short interval to distant metastases ('poor prognosis' signature) in patients without tumor cells in local lymph nodes at diagnosis (lymph node negative)". The study identified 231 genes as being significant markers of metastases, all of whose correlations with outcome exceeded 0.3 in absolute value, and it constructed an optimal prognosis classifier based on the best 70 genes. In the training set the system predicted correctly the class of 65 of the 78 cases (that is, with an accuracy of 83.3%, corresponding to a weighted accuracy of 83.6%), whereas in the test set it predicted correctly the class of 17 of the 19 cases (that is, with an accuracy of 89.5%, corresponding to a weighted accuracy of 88.7%). Weighted accuracy is defined as the average of the proportion of correctly predicted cases within the set of positive cases and that of correctly predicted negative cases in the dataset.
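The definition of weighted accuracy can be made concrete with a short sketch. The Python functions below are illustrative only; their names and arguments are not taken from the study. The example counts assume that the two test-set errors of the 70-gene classifier were one positive and one negative case, an inference that reproduces the reported 89.5% and 88.7% but is not stated in the text.

```python
def plain_accuracy(tp, fn, tn, fp):
    """Proportion of all cases that are classified correctly."""
    return (tp + tn) / (tp + fn + tn + fp)


def weighted_accuracy(tp, fn, tn, fp):
    """Average of the correct-prediction rates within each class."""
    sensitivity = tp / (tp + fn)   # correctly predicted positive cases
    specificity = tn / (tn + fp)   # correctly predicted negative cases
    return (sensitivity + specificity) / 2


# Test set sizes from the text: 12 positive and 7 negative cases.
# ASSUMPTION: one positive and one negative case are misclassified.
tp, fn, tn, fp = 11, 1, 6, 1

print(f"{plain_accuracy(tp, fn, tn, fp):.1%}")     # 89.5%
print(f"{weighted_accuracy(tp, fn, tn, fp):.1%}")  # 88.7%
```

Because each class contributes equally to the average, weighted accuracy is not inflated when one class is much larger than the other.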

Numerous statistical and machine learning methods have been successfully applied to the analysis of microarray datasets; these methods include cluster analysis (hierarchical clustering [4–7], self-organizing maps [8–10], and two-way clustering [11]), regression analysis [12], nearest neighborhood methods [13], decision trees [14–17], artificial neural networks [18, 19], support vector machines [20–23], principal component analysis [24–28], singular value decomposition [29–32], and multidimensional scaling [33, 34]. A pattern-based recognition method has been developed using other kinds of data for prediction of outcome in preclinical and clinical trials of cancer patients [35, 36].

The present study uses LAD, a combinatorics-, optimization-, and logic-based methodology for the analysis of data. Specific features of the LAD approach include the exhaustive examination of the entire set of genes (without excluding those that have low statistical correlations with the outcome, or those that have low expression levels); a focus on the classification power of combinations of genes (without confining attention only to individual genes); and the possibility of extracting novel information on the role of genes and of combinations of genes through the analysis of these exhaustive lists.

LAD has been shown to offer important insights into problems ranging from oil exploration [2], labor productivity analysis [37], and country creditworthiness evaluation [38], to medical applications (for example, risk evaluation among cardiac patients [39, 40]), polymer design for artificial bones [41], computerized pulmonology [42], genomic-based diagnosis and prognosis of lymphoma [43], and proteomics-based ovarian cancer diagnosis [44].

The present study uses LAD to analyze a breast cancer genomic dataset [3]. Our goals in re-examining that dataset are to evaluate the potential of LAD in developing a prognostic system for breast cancer using genomic data; to derive additional information about the influence of certain genes and combinations of genes; and to identify new classes of patients.

We present an introduction to LAD, and develop a new type of classification model that can distinguish patients who will have a metastasis-free survival of 5 years from the others. The structure of the paper is as follows. In the Materials and methods section we briefly present the concepts and methodology of LAD, illustrating them on a small 'demonstration model', which can distinguish between poor and good prognosis based on the expression levels of six genes. In the Results section we present an 'enhanced model' with improved accuracy, involving 17 genes and having excellent sensitivity and specificity on both the training and the test sets. It is shown that this model distinguishes between positive and negative cases in the training set with a weighted accuracy of 100%, and exhibits a weighted accuracy of 82.5% in cross-validation experiments. On the test set, the model correctly classifies 18 of the 19 cases. Numerous other findings concerning the influence of various genes, as well as differences discovered between the structures of the training and the test sets, are also presented in the Results section.

The presentation of the 'enhanced model' not only allows the construction of a high-accuracy prognostic model, but it also makes possible the derivation of interesting conclusions about the dataset, about significant genes and combinations of genes, and about new classes of patients, among other factors.

Materials and methods

LAD concepts

It can be expected that 'large' or 'small' values of the expression levels of certain genes can determine the good or poor prognosis of a breast cancer patient. In order to express such relations in more precise terms, it is natural to replace terms such as 'large' and 'small' with conditions of the type '... is more than' or '... is less than' a certain value. It is therefore natural to examine the role of well chosen cut points associated with the expression levels of genes. For instance, the observation that low intensity levels of gene Contig15031_RC are (more or less) characteristic for a poor prognosis is imprecise; it can be reformulated as the ultra-simplistic classification system, 'If the intensity level of gene Contig15031_RC is at most 0.055 then the patient has a poor prognosis'. The condition of this rule is satisfied by 25 positive and 11 negative cases in the training set (that is, the rule has a sensitivity of 25/34 = 73.5% and a specificity of 33/44 = 75%).

Combinations of such cut point based conditions naturally extend this idea. For instance, the combined requirement of satisfying simultaneously the three conditions 'The intensity level of gene Contig15031_RC is at most 0.055' and 'The intensity level of gene NM_004035 is at least -0.106' and 'The intensity level of the gene NM_003239 is at most -0.014' is fulfilled by 22 of the 34 positive cases in the dataset and by none of the negative ones. Again, these three requirements could be viewed as a classification system of poor prognosis cases, having a sensitivity of 64.7% and a specificity of 100%.
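How a single cut point rule and a combination of such rules are scored can be sketched as follows. The gene names and cut points are those quoted in the text; the four cases and their expression values are invented for illustration, and the function names are hypothetical.

```python
import operator

OPS = {"<=": operator.le, ">=": operator.ge}

# Each condition is (gene, comparison, cut point); values are from the text.
SINGLE_RULE = [("Contig15031_RC", "<=", 0.055)]
COMBINED_RULE = [
    ("Contig15031_RC", "<=", 0.055),
    ("NM_004035", ">=", -0.106),
    ("NM_003239", "<=", -0.014),
]


def satisfies(case, conditions):
    """True if the case meets every condition simultaneously."""
    return all(OPS[op](case[gene], cut) for gene, op, cut in conditions)


def sensitivity_specificity(cases, labels, conditions):
    """Score the rule 'conditions hold => poor prognosis (label 1)'."""
    pos = [c for c, y in zip(cases, labels) if y == 1]
    neg = [c for c, y in zip(cases, labels) if y == 0]
    sens = sum(satisfies(c, conditions) for c in pos) / len(pos)
    spec = sum(not satisfies(c, conditions) for c in neg) / len(neg)
    return sens, spec


# HYPOTHETICAL toy cases (not data from the study).
cases = [
    {"Contig15031_RC": 0.01, "NM_004035": 0.20, "NM_003239": -0.10},
    {"Contig15031_RC": 0.03, "NM_004035": -0.30, "NM_003239": -0.20},
    {"Contig15031_RC": 0.20, "NM_004035": 0.10, "NM_003239": -0.05},
    {"Contig15031_RC": 0.02, "NM_004035": 0.00, "NM_003239": 0.10},
]
labels = [1, 1, 0, 0]  # 1 = poor prognosis, 0 = good prognosis

print(sensitivity_specificity(cases, labels, SINGLE_RULE))    # (1.0, 0.5)
print(sensitivity_specificity(cases, labels, COMBINED_RULE))  # (0.5, 1.0)
```

On the toy data, adding conditions lowers sensitivity and raises specificity, which is the same trade-off seen in the text when moving from the single rule (73.5%, 75%) to the three-condition rule (64.7%, 100%).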

Such ideas are at the foundation of LAD. The essence of LAD is to detect patterns, or combinatorial biomarkers (that is, simple classifiers consisting of restrictions imposed on the expression levels of a combination of several genes); to generate patterns exhaustively and in an algorithmically efficient way; to use the collection of patterns as a prognostic system and thoroughly validate it; to extract from this collection as much additional information as possible about the role and nature of genes in the dataset (that is, to detect promoters and blockers); and to study the common characteristics of groups of patients that satisfy similar patterns.

We describe below the basic concepts used in LAD, including some of its computational aspects. In particular, we describe more precisely the concepts of support sets, patterns, pandects, and LAD-based classification systems, and we discuss the validation techniques used.

Cut points and binarization

One of the underlying principles of LAD is to disregard the exact values of a variable (for example, a gene), specifying for each patient only whether the corresponding value of this variable is sufficiently 'large' or 'small'. The implementation of this principle requires the determination of several cut points $c'_j, c''_j, \ldots$ for the intensity levels $I_j$ of each gene $j$, such that the conditions requiring that the gene's expression level is low (or high) can be formalized as $I_j \le c'_j$ (or $I_j \ge c''_j$), and so on.

LAD associates to each variable $x_j$ and each possible cut point $c_j$ a binary variable $y_j$ that is equal to 1 whenever $x_j \ge c_j$, and to 0 otherwise. In this way, a numerical variable (for example, the expression level of a gene $j$) is transformed into a large number of binary variables. Binarization therefore further increases the size of a dataset that was already very large; this problem is handled by carrying out a 'filtering' process, which retains only a 'support set' consisting of a very small number of these variables.
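The binarization step can be illustrated with a minimal sketch. The text does not say how candidate cut points are chosen; placing them midway between consecutive distinct observed values is an assumption made here for illustration, and the function names and sample values are hypothetical.

```python
def candidate_cut_points(values):
    """ASSUMPTION: one candidate midway between consecutive distinct values."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


def binarize(values, cut_points):
    """One binary variable per cut point: 1 if value >= cut point, else 0."""
    return [[1 if x >= c else 0 for c in cut_points] for x in values]


# Hypothetical expression levels of one gene across four patients.
levels = [0.25, -0.5, 0.75, 0.25]

cuts = candidate_cut_points(levels)
print(cuts)                    # [-0.125, 0.5]
print(binarize(levels, cuts))  # [[1, 0], [0, 0], [1, 1], [1, 0]]
```

Even in this tiny example a single numerical variable becomes two binary ones, which shows why the binarized dataset grows and why a subsequent filtering step is needed.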

Support sets

In order to distinguish between measurements of good and of poor prognosis patients, only a tiny fraction of the information contained in the (original or binarized) dataset is needed. In particular, all of the information about the vast majority of the genes in the dataset is redundant. Moreover, even for the genes that are not redundant, only a few (usually only one) of the corresponding binary variables are needed. A set of binary variables that are sufficient to distinguish poor from good prognosis cases will be called a support set. A support set is called 'minimal' if none of its proper subsets is a support set; clearly, not every minimal support set is of minimum size. It is important to note that a dataset may admit hundreds or thousands of minimal support sets. The reduction of a large dataset to a substantially smaller one that includes only the variables in the chosen support set allows a major simplification of the problem, and has great importance for diagnosis and prognosis (although, in some cases, the presence of a limited number of redundant variables may be acceptable in terms of ensuring greater stability of results).

The problem of finding minimal support sets has been modeled elsewhere [1, 2, 45] as a typical 'set-covering' problem, and numerous methods are known in combinatorial optimization for the solution of this problem. In our case, the excessive dimensions of the associated set-covering problem (approximately 20,000 constraints involving between 2 and 3 million 0–1 variables) required the use of powerful heuristics to trim down the size of the problem. In order to be able to handle the large problems typical for genomic and proteomic datasets, a general heuristic size-reduction procedure has been developed [46]. The essence of this method is to balance the conflicting criteria of minimizing size and maximizing discrimination between positive and negative observations. In contrast to many statistically based methods, the support set generation procedures of LAD are guided by the collective strength of the subsets of variables, without being necessarily restricted to those variables that have the highest individual correlation coefficients with the outcome.
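The set-covering view can be sketched as follows: for every pair made of one positive and one negative case, a support set must contain at least one binary variable on which the two cases differ. The code below uses the plain greedy heuristic for set covering, swapped in here purely for illustration; it is not the size-reduction procedure of [46], and it does not guarantee a minimal or minimum-size support set. The function name and toy data are hypothetical.

```python
from itertools import product


def greedy_support_set(X, y):
    """Pick binary variables until every positive/negative pair is separated.

    X is a list of 0/1 rows, y a list of 0/1 labels.
    Returns the indices of the chosen variables, in order of selection.
    """
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    # Each (positive, negative) pair is one constraint to be covered.
    uncovered = set(product(range(len(pos)), range(len(neg))))
    chosen = []
    while uncovered:
        def separated(j):
            return {(p, n) for p, n in uncovered if pos[p][j] != neg[n][j]}

        best = max(range(len(X[0])), key=lambda j: len(separated(j)))
        newly_covered = separated(best)
        if not newly_covered:
            raise ValueError("a positive and a negative case are identical")
        chosen.append(best)
        uncovered -= newly_covered
    return chosen


# Hypothetical binarized data: two positive and two negative cases.
X = [[1, 1, 0], [0, 0, 0], [1, 0, 0], [0, 1, 0]]
y = [1, 1, 0, 0]
print(greedy_support_set(X, y))  # [0, 1]
```

In the toy data neither variable 0 nor variable 1 separates the classes alone, but together they do, while variable 2 is redundant. This mirrors the remark that selection is guided by the collective strength of subsets of variables.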

The feature selection procedure [46] applied to the van 't Veer dataset consists of two stages. In a first 'filtering' stage, a relatively small subset of relevant features was identified on the basis of several combinatorial, statistical, and information-theoretic criteria (for example, separation measure, envelope eccentricity, system entropy, signal to noise ratio). In the second stage, the importance of the variables selected in the first stage was evaluated on the basis of the frequency of their participation in the set of all maximal patterns (see below), which were generated using an efficient, total polynomial time algorithm [47], and a large proportion of the low-impact variables was eliminated. This step was applied iteratively until a Pareto-optimal support set was reached, balancing the conflicting criteria of simplicity and accuracy; in the construction of the demonstration and enhanced models this support set consisted of only 6 and 17 of the 25,000 genes, respectively.

The high sensitivity and specificity of the prognostic system built on these small sets of genes are to a large extent due to the qualities of the underlying support set.

Logical patterns

A 'conjunction' is a set of conditions that require that the binary variables appearing in a selected subset of the support set take specific (0 or 1) values (that is, that the expression levels of the corresponding genes should be below or above certain cut points). The typical conjunctions appearing in most data analysis studies fix the values of not more than two or three binary variables. A conjunction is called a positive (or negative) pattern if its set of conditions is satisfied simultaneously by 'sufficiently many' of the positive (or negative) cases, and by 'sufficiently few' of the negative (or positive) cases.

For example, in the van 't Veer breast cancer dataset, if 'sufficiently many' is defined as 'at least 30%', then the three conditions 'The intensity level of the gene Contig15031_RC is at most 0.055' and 'The intensity level of the gene NM_004035 is at least -0.106' and 'The intensity level of the gene NM_003239 is at most -0.014' are fulfilled by 22 of the 34 positive cases in the training set and by none of the negative cases. Therefore, the simultaneous fulfillment of these three conditions describes a positive pattern (to be denoted P1). Similarly, the three conditions 'The intensity level of the gene AF018081 is at most 0.071' and 'The intensity level of the gene Contig26768_RC is at most 0.098' and 'The intensity level of the gene Contig15031_RC is at least 0.0915' are fulfilled by 15 of the 44 negative cases in the dataset and by none of the positive cases. Therefore, the simultaneous fulfillment of these three conditions describes a negative pattern (to be denoted N1).

Two of the most important characteristics of a pattern are its 'degree' and its 'coverage'. The degree of a pattern is simply the number of variables (genes) involved in its defining conditions. In our example, both P1 and N1 have degree 3. A case C is said to 'display' a pattern, or to be 'covered' by it, if the corresponding intensity levels of the gene expressions satisfy the defining conditions of that pattern. The prevalence of a positive (or negative) pattern is simply the proportion of positive (or negative) cases covered by it. For example, the three defining conditions of P1 are satisfied simultaneously by 22 of the 34 positive cases (that is, the prevalence of P1 is 64.7%). Similarly, N1 covers 15 of the 44 negative cases (that is, its prevalence is 34.1%). Patterns that cover only positive or only negative cases are called 'pure' patterns. Clearly, both P1 and N1 are pure patterns. Usually, datasets that admit pure patterns of low degrees and high prevalences allow the construction of reliable LAD diagnostic and prognostic systems.

Several combinatorial algorithms [47–50] are available for the efficient generation of libraries of patterns. These pattern extraction algorithms are intended to identify exhaustively the collections of positive and negative patterns hidden in the dataset, without any prior knowledge of the distribution of the data domain.

As an indication of their efficiency, we note that the generation of the 133,920 potential patterns examined for this study and the selection of the 385 maximal pure patterns required a total computer time of 5.1 s.

It should be noted that the concept of patterns resembles that of rules, which appears in expert systems and in various decision tree-based methods. It should also be mentioned that the number of rules in a dataset is exponentially large, and therefore the generation of every possible rule is not realistic. Although most of the rule-based methods generate a relatively small number of potentially significant rules, one of the major characteristics of LAD is the systematic generation of an extremely large collection of potentially significant rules, and in a subsequent stage the 'filtering' of this collection in order to retain only a reasonably sized collection that can jointly explain the positive or negative nature of every case in the dataset. This approach not only ensures that there is the possibility of selecting those rules or patterns that, taken individually, carry the greatest amount of information (for example, have low degrees and high coverages); it also maximizes the collective inference power of the selected family of patterns. In essence, the pattern generation system of LAD consists of a systematic, exhaustive combinatorial enumeration process, which is guided by clear optimization criteria.

Pandects

The pandect (that is, the collection of all of the positive and negative patterns corresponding to a dataset) is an important component of LAD because it allows construction of diagnostic and prognostic systems, analysis of the importance and role of variables, and identification of new classes of observations, among other factors. In view of the enormous number of patterns corresponding to a dataset, the construction of the entire pandect is not realistic. However, it has been seen in numerous case studies that the knowledge of special subsets of the pandect is sufficient for accurate analysis of datasets. The set of all positive (or negative) patterns of degree at most $d^+$ (or $d^-$) and prevalence at least $p^+$ (or $p^-$) is called the $(d^+, p^+)$ positive pandect (or the $(d^-, p^-)$ negative pandect). The best pandect-defining parameters $d^+$, $d^-$, $p^+$, and $p^-$ for the analysis of a particular dataset are determined experimentally by carrying out a series of k-fold cross-validation experiments. The computational complexity of generating the pandect depends mostly on the values of $d^+$ and $d^-$. Because in most cases very small values (usually not more than 2 or 3) of $d^+$ and $d^-$ are sufficient for the generation of an extremely useful pandect, this component of LAD can be calculated in a very efficient way. The particular pandect used in the present study is defined by $d^+ = d^- = 3$ and $p^+ = p^- = 15\%$, and consists of 215 positive and 170 negative patterns. Whereas individual patterns can be viewed as tests that are indicative of a good or poor prognosis, the pandect plays the role of a high-powered prognostic battery of tests. Clearly, the pandect is not a minimal system because it may contain many redundant patterns, without which the system can still remain accurate. As a matter of fact, the pandect of the van 't Veer dataset contains several minimal separating subsets of patterns (called 'models'); two such models are discussed in this report: a 'demonstration model' consisting of nine positive and seven negative patterns, and an 'enhanced model' consisting of 20 positive and 20 negative patterns. It should be added that the built-in redundancy of the large pandect of 215 + 170 patterns can substantially increase [51] the prognostic system's 'stability' or 'robustness' when it is applied to new cases.
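The role of the degree and prevalence bounds can be illustrated with a brute-force sketch that lists every pure pattern of degree at most d and prevalence at least p over binarized data. This is a naive enumeration written for clarity; it is not one of the pattern generation algorithms cited in the text, it handles only pure patterns, and it makes no attempt to keep only maximal ones. The function names and toy data are hypothetical.

```python
from itertools import combinations, product


def covers(pattern, row):
    """A pattern is a tuple of (variable index, required 0/1 value) pairs."""
    return all(row[j] == v for j, v in pattern)


def pure_pandect(X, y, target, max_degree, min_prevalence):
    """Pure patterns for class `target` within the degree/prevalence bounds."""
    own = [row for row, label in zip(X, y) if label == target]
    other = [row for row, label in zip(X, y) if label != target]
    found = []
    for degree in range(1, max_degree + 1):
        for idxs in combinations(range(len(X[0])), degree):
            for vals in product((0, 1), repeat=degree):
                pattern = tuple(zip(idxs, vals))
                if any(covers(pattern, row) for row in other):
                    continue  # not pure: covers a case of the other class
                prevalence = sum(covers(pattern, row) for row in own) / len(own)
                if prevalence > 0 and prevalence >= min_prevalence:
                    found.append((pattern, prevalence))
    return found


# Hypothetical binarized data: two positive and two negative cases.
X = [[1, 1, 0], [0, 0, 0], [1, 0, 0], [0, 1, 0]]
y = [1, 1, 0, 0]

for pattern, prevalence in pure_pandect(X, y, 1, 2, 0.5):
    print(pattern, prevalence)
# ((0, 0), (1, 0)) 0.5
# ((0, 1), (1, 1)) 0.5
```

The number of candidate conjunctions grows rapidly with the degree bound, which is consistent with the remark that the cost of generating the pandect depends mostly on the degree parameters.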

Pattern space

In the given dataset, each patient is described in terms of approximately 25,000 attributes (genes) by specifying their respective expression levels. Taking into account the fact that LAD patterns can be viewed as logically synthesized attributes that can be expected to reflect more closely the condition of a patient than the original 'raw data', it is reasonable to assume that a description of patients specifying exactly the set of patterns satisfied by each individual should represent more precisely the patient's condition. This pattern-based representation of the observations can be achieved by associating to each patient and to each pattern in the pandect an indicator variable that shows whether the patient satisfies (indicator = 1) or does not satisfy (indicator = 0) the conditions that define that pattern. In this way, each patient is characterized by a sequence of 0–1 values of the indicator variables associated with the positive and negative patterns in the pandect.
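The mapping into pattern space amounts to building a 0–1 indicator matrix with one row per patient and one column per pattern. The sketch below assumes binarized input and represents a pattern as a tuple of (variable index, required value) pairs; this representation, the function name, and the toy patterns are illustrative choices, not taken from the study.

```python
def to_pattern_space(X, patterns):
    """Indicator matrix: entry is 1 if the case satisfies the pattern, else 0."""
    return [
        [1 if all(row[j] == v for j, v in pattern) else 0 for pattern in patterns]
        for row in X
    ]


# Hypothetical binarized cases (rows) over three binary variables.
X = [[1, 1, 0], [0, 0, 0], [1, 0, 0], [0, 1, 0]]

# Hypothetical patterns: the first two positive, the last two negative.
patterns = [
    ((0, 0), (1, 0)),  # variable 0 is 0 and variable 1 is 0
    ((0, 1), (1, 1)),  # variable 0 is 1 and variable 1 is 1
    ((0, 1), (1, 0)),  # variable 0 is 1 and variable 1 is 0
    ((0, 0), (1, 1)),  # variable 0 is 0 and variable 1 is 1
]

for indicators in to_pattern_space(X, patterns):
    print(indicators)
# [0, 1, 0, 0]
# [1, 0, 0, 0]
# [0, 0, 1, 0]
# [0, 0, 0, 1]
```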

Calibration

The quality of the prognosis given by the pandect is a consequence of the choice of several control parameters. The collection of control parameters includes the number of cut points per gene, upper bounds on the size of support sets, pattern degrees, and lower bounds on pattern prevalence. The control parameters uniquely define the pandect. The best values of the control parameters are determined iteratively by assigning some values to them, constructing the associated pandect, verifying the correctness of its predictions, reassigning the values, and continuing this sequence of steps until one arrives at a pandect with highly accurate predictions. The verification process is based on well known statistical cross-validation techniques.

The most frequently used cross-validation techniques are the leave-one-out (or jackknifing) method and k-folding. All of the cross-validation techniques are conducted within the training set (that is, they do not involve any observation in the test set). In leave-one-out, one of the cases is taken as the verification set, the pandect is built on the remaining cases (the learning set), and its prognosis is checked on the unique case in the verification set, with this experiment being repeated for each case in the training set. In k-folding, the training set is partitioned randomly into k (for example, 2, 5, or 10) subsets; one of these subsets is then selected as the verification set, the pandect is constructed on the remainder of the training set (viewed as the learning set), and the prognosis of the pandect is checked on the verification set. This experiment is repeated k times, for each of the k possible selections of the verification set.
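The two splitting schemes can be sketched as index generators. The function names, the fixed random seed, and the way cases are dealt into folds are illustrative assumptions; only the training set size of 78 comes from the text.

```python
import random


def k_fold_splits(n_cases, k, seed=0):
    """Yield (learning, verification) index lists for each of the k folds."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)       # random partition
    folds = [indices[i::k] for i in range(k)]  # k nearly equal subsets
    for i, verification in enumerate(folds):
        learning = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield learning, verification


def leave_one_out_splits(n_cases):
    """Yield (learning, verification) with a single verification case each."""
    for i in range(n_cases):
        yield [j for j in range(n_cases) if j != i], [i]


# The training set in the text contains 78 cases.
for learning, verification in k_fold_splits(78, 10):
    assert not set(learning) & set(verification)  # the two sets never overlap
    assert len(learning) + len(verification) == 78

print(sum(1 for _ in leave_one_out_splits(78)))  # 78
```

In both schemes every training case is used for verification exactly once, and the test set is never touched.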

The entire calibration process is conducted only on the training set and it is intended to identify the best parameters to be used in the construction of the LAD models, and not to validate the LAD predictions (that process is described below).

Validation

Validation of the LAD results can be carried out in two ways. First, the predictions of the pandect built on the training set must be checked on the test set. This is the most frequently used validation method. In order to increase the reliability of the proposed pandect, an additional validation procedure can be applied. In this second validation procedure, a new dataset is created that consists of all of the observations in the original training and test sets. The second validation consists of the application of the usual cross-validation techniques (k-folding and/or leave-one-out) to this augmented dataset, using the parameters found at the calibration stage.

Illustration with a demonstration model

The LAD method was trained and calibrated on the same training set of 78 samples used by van 't Veer and coworkers [3]. The prognosis results for LAD were validated on the same test set of 19 samples used by van 't Veer and coworkers. The samples in the test set were disregarded during the training procedure.

Support set selection

The LAD method starts with a pre-processing procedure for the selection of a significant support set of genes, on which the proposed prognostic system will be constructed. Because these systems are expected to have high accuracy, we restricted our study only to those 13,387 genes whose log-ratio measurements of fluorescence intensities are known for every single patient (that is, we eliminated those genes that include missing data). Part of our feature selection uses some statistical measures, and for this purpose we normalize the data by applying the following formula: x → (x - x_min)/(x_max - x_min).
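The normalization formula can be written out as a short function. The text does not specify whether the minimum and maximum are taken per gene or over the whole dataset; the sketch assumes per-gene scaling, and the handling of a constant gene (all values mapped to 0) is likewise an assumption. The function name and sample values are hypothetical.

```python
def min_max_normalize(values):
    """Map values to [0, 1] via (x - x_min) / (x_max - x_min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # ASSUMPTION: a gene with constant expression maps to all zeros.
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]


# Hypothetical log-ratio measurements of one gene across three patients.
print(min_max_normalize([-0.5, 0.0, 1.5]))  # [0.0, 0.25, 1.0]
```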

After removing variables based on these measures, the original variables are reintroduced and a support set is determined. We recall that a support set consists of a subset of variables with the property that a model can be built on them (not including any variable outside the support set) that can distinguish positive cases from negative ones.

In our dataset, from the set of 13,387 genes, using the method presented by Alexe and coworkers [46], we have extracted several support sets, including one consisting of six genes (Table 1), on which we shall build a 'demonstration model' (Table 2). Out of the six genes in the support set, one is involved in cell growth and three are enzymes [52].

Table 1 The six-gene support set of the demonstration model
