Preview only show first 10 pages with watermark. For full document please download

Tnb Gene

   EMBED


Share

Transcript

IEEE TRANSACTIONS ON NANOBIOSCIENCE, VOL. 9, NO. 2, JUNE 2010 121 Estimating Sparse Gene Regulatory Networks Using a Bayesian Linear Regression Pinaki Sarder∗ , Student Member, IEEE, William Schierding, J. Perren Cobb, and Arye Nehorai, Fellow, IEEE Abstract—In this paper, we propose a gene regulatory network (GRN) estimation method, which assumes that such networks are typically sparse, using time-series microarray datasets. We represent the regulatory relationships between the genes using weights, with the “net” regulation influence on a gene’s expression being the summation of the independent regulatory inputs. We estimate the weights using a Bayesian linear regression method for sparse parameter vectors. We apply our proposed method to the extraction of differential gene expression software selected genes of a human buffy-coat microarray expression profile dataset of ventilator-associated pneumonia (VAP), and compare the estimation result with the GRNs estimated using both a correlation coefficient method and a database-based method ingenuity pathway analysis. A biological analysis of the resulting consensus network that is derived using the GRNs, estimated with both our and the correlation-coefficient methods results in four biologically meaningful subnetworks. Also, our method performs either better than or competitively with the existing well-established GRN estimation methods. Moreover, it performs comparatively with respect to: 1) the ground-truth GRNs for the in silico 50- and 100-gene datasets reported recently in the DREAM3 challenge and 2) the GRN estimated using a mutual information-based method for the topranked Bayesian analysis of time series (a Bayesian user-friendly software for analyzing time-series microarray experiments) selected genes of the VAP dataset. Index Terms—BANJO, Bayesian analysis of time series (BATS), Bayesian linear regression, correlation coefficient, DREAM, extraction of differential gene expression software (EDGE), gene regulatory network (GRN), ingenuity pathway analysis (IPA), mutual information, network identification by multiple regression (NIR) [time series network identification (TSNI)], sparsity, ventilatorassociated pneumonia (VAP). I. INTRODUCTION STIMATING gene regulatory networks (GRNs) from microarray data using reverse engineering is an important problem in genomics [1]. We propose a GRN estimation method from time-series microarray data using a Bayesian approach, E Manuscript received September 30, 2009; revised February 1, 2010. Date of current version June 3, 2010. Asterisk indicates corresponding author. ∗ P. Sarder is with the Department of Electrical and Systems Engineering, Washington University, St. Louis, MO 63130 USA (e-mail: psarde1@ ese.wustl.edu). W. Schierding is with the Department of Genetics, Washington University, St. Louis, MO 63110 USA (e-mail: [email protected]). J. P. Cobb is with the Center for Critical Illness and Health Engineering and the Departments of Surgery and Genetics, Washington University School of Medicine, St. Louis, MO 63110, USA (e-mail: [email protected]). A. Nehorai is with the Department of Electrical and Systems Engineering, Washington University, St. Louis, MO 63130 USA (e-mail: nehorai@ ese.wustl.edu). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNB.2010.2043444 considering the regulatory interactions between the genes in the network to be sparse. Our approach is motivated by the insight provided by earlier studies on GRN, demonstrating that such networks in most biological systems are typically sparse [2]–[5]. In our study, we use a precise physical model to describe the interactions among the genes, particularly1 we represent the putative regulatory relationship between the genes using sparse linear coefficients or weights [6]. We estimate the weights by employing a Bayesian linear regression method for sparse parameter vectors [7]. There exist several methods to model and estimate the related GRN from a microarray dataset, for example, GRN estimations using a genetic algorithm [8], a neural network [9], and Bayesian models [10] are well-known classical methods. In another early intriguing approach, Vogelstein et al. [11] and Marcotte [12] model the GRN using patterns similar to complex Webs or electrical circuits. Yeung et al. [13] propose linear models to describe the gene interactions, and use singular value decomposition to estimate the GRN. In a similar study, di Bernardo et al. [14] develop an expectation–maximization algorithm to estimate the GRN. Currently, the three most well-established approaches in the community are [15]: 1) an ordinary differential-equationbased method, known variously as network identification by multiple regression (NIR) or time series network identification (TSNI) [16], [17]; 2) a dynamic Bayesian-network-based method, known as BANJO [18]; and 3) information theoretic approaches [19]–[21]. In a recent study, Cantone et al. [15] build a synthetic network called In-vivo Reverse-engineering and Modelling Assessment (IRMA), consisting of five genes, for assessing the performance of the existing GRN estimation approaches. They conclude that NIR (TSNI) and BANJO are able to correctly infer the regulatory interactions from the timeseries experimental data of their implemented IRMA network. We, therefore, compare the performance of our proposed method with NIR (TSNI) and BANJO using the two published time-series datasets in [15]. Our method outperforms the other two methods in one dataset, and shows a performance competitive with them in the other. There are two immediate advantages of our method compared to the other two methods, as we understand in the performance comparison. First, our method is simple and automatic, and it does not require additional experiments to estimate the model parameters; in contrast, NIR (TSNI) requires additional experiments to estimate the complex kinetics parameters involved in its related model for estimation [15]. Second, the performance of our method should not be affected by the number of time points in the time-series microarray data that we analyze; in contrast, improving the performance of BANJO requires collection of additional time-series measurements. Note 1536-1241/$26.00 © 2010 IEEE Authorized licensed use limited to: WASHINGTON UNIVERSITY LIBRARIES. Downloaded on June 08,2010 at 22:13:31 UTC from IEEE Xplore. Restrictions apply. 122 IEEE TRANSACTIONS ON NANOBIOSCIENCE, VOL. 9, NO. 2, JUNE 2010 that IRMA is a small network. We are, however, well grounded in applying our method to large microarray datasets, since the algorithms that perform well with small networks also perform well with larger networks [15], [20], [22]. We further evaluate the performance of our proposed method by applying it to the in silico time-series datasets, which consist of 50 and 100 genes, reported recently in the DREAM3 challenge [23], [24]. These datasets represent gene interactions under several external perturbations, and our method performs significantly with respect to their ground-truth GRNs. We apply our proposed method to the extraction of differential gene expression software (EDGE) [25] selected genes of a human circulating white blood cell (“buffy-coat”) microarray dataset of ventilator-associated pneumonia (VAP) [26]. We analyze the common VAP network edges that we estimate using our method, and that McDunn et al. [26] estimate using a correlation-coefficient-based method and a commercial database-based method ingenuity pathway analysis (IPA) [27]. We hypothesize here that the common edges, based on a consensus between the networks that our proposed and the correlationcoefficient-based methods predict, and with significantly high absolute weights that our proposed method estimates, are significant. However, to our knowledge, the two computational networks, the consensus network, and the IPA are each equally valid. There are definite advantages to supplementing results from the consensus network with the analytic capabilities of IPA. By doing so, we obtain four biologically meaningful subnetworks from the consensus network. We then further apply our method to the top-ranked BATS [28] selected genes of the VAP dataset. Our method here performs significantly with respect to the GRN, estimated using a mutual-information-based method [21], and thus reconfirms its own validity. The paper is organized as follows. In Section II, we first discuss the biophysical model [6] and our proposed biostatistical model to describe time-series gene microarray data. We then briefly discuss our strategy to estimate the GRN. In Section III, we illustrate the results obtained in comparing our proposed method with NIR (TSNI) and BANJO. In Section IV, we present the results obtained by applying our method to the DREAM3 in silico challenge data. In Section V, we present an analysis of the VAP dataset using our proposed method. Finally, we conclude in Section VI. II. ESTIMATION In this section, we first briefly review the biophysical model presented in [6], and then, propose a corresponding biostatistical model to describe a time-series gene microarray data. We then discuss our strategy to estimate the GRN. A. Biophysical Model We denote the measured expression of the kth gene of the ith patient or experiment at the time-point tj as xik (tj ), where k ∈ {1, 2, . . . , G}, i ∈ {1, 2, . . . , P }, and j ∈ {1, 2, . . . , N }. Here, the measurement from one patient or experiment to the other is independent with each other. We then denote the total regulatory input to the kth gene by all other genes as rki (tj ). In vector form, Fig. 1. Example of the dose-response function for the kth gene with α k = 1 and β k = 0, where rki (tj ) is the input and sik (tj + 1 ) is the output. we define the gene expression for the ith patient or experiment at time-point tj as xi (tj ) = [xi1 (tj ), xi2 (tj ), . . . , xiG (tj )]T and the i (tj )]T , total regulatory input as r i (tj ) = [r1i (tj ), r2i (tj ), . . . , rG where T denotes the matrixn transpose operation. We denote the gene regulation matrix as W of size G × G, with elements wk l for k, l ∈ {1, 2, . . . , G}. The net regulatory effect of the lth gene on the kth gene at time t is simply the expression level xil (t) of the lth gene multiplied by its regulatory influence wk l on the kth gene. The total regulatory input rki (t) to the kth gene is derived by summing across all the genes in the system [6]:  rki (t) = wk l xil (t). (1) l Hence, we have r i (t) = W xi (t). (2) A positive or negative value of wk l models the lth gene stimulating or repressing the expression of the kth gene, while a zero value indicates that the lth gene does not influence the transcription of the kth gene. Thus, in addition to zero values, each gene has multiple inputs, both positive and negative, of differing strengths. The response of each gene to the regulatory input rki (t) is given by using a dose-response function [6] sik (tj +1 ) = 1 , 1 + exp[−(αk rki (tj ) + βk )] 0 < sik (tj +1 ) < 1 (3) where αk and βk are two gene-specific constants defining the dose-response function and sik (tj +1 ) denotes the relative expression level of the kth gene at time tj +1 , see Fig. 1. Briefly, each gene has a static dose-dependent response to activate and repress regulatory influences. The parameter αk can be any positive real number that defines the slope of the curve at its inflection point. Genes with a large αk will shift rapidly from Authorized licensed use limited to: WASHINGTON UNIVERSITY LIBRARIES. Downloaded on June 08,2010 at 22:13:31 UTC from IEEE Xplore. Restrictions apply. SARDER et al.: ESTIMATING SPARSE GENE REGULATORY NETWORKS USING A BAYESIAN LINEAR REGRESSION Fig. 2. 123 Biophysical model (shown in the dotted box) fitting between the observations xi (tj ) and xi (tj + 1 ) [6]. near zero expression to near maximal expression, when the activating input surpasses some gene-specific thresholds, while those with small αk will have a nearly linear response over the biologically relevant range of regulatory inputs. The βk constant can be any real constant, and it represents the gene’s basal level of expression. A positive βk represents a gene with high basal levels of transcription, and negative βk represents the converse. We convert the relative expression level sik (tj +1 ) to the actual expression level for the kth gene by multiplying it with its maximal expression mk (∀k ∈ {1, 2, . . . , G}). Hence, the measured expression level xik (tj +1 ) of the kth gene at time tj +1 is approximately equivalent to mk sik (tj +1 ). This is shown in Fig. 2, where m = [m1 , m2 , . . . , mG ]T . Thus, for the kth gene of the ith patient or experiment at time tj +1 , we have   m  k wk l xil (tj ) + βk − 1 ≈ αk − ln i xk (tj +1 ) l  ⇒ yki (tj +1 ) ≈ αk wk l xil (tj ) + βk (4) l mk − 1]. where we define yki (tj +1 ) = − ln[ x i (t j+1) k B. Biostatistical Model We propose a biostatistical model related to the biophysical model described in Section II-A, namely, the dose-response curve imperfectly models the biological response of a gene, and hence, the model fitting in (4) incurs some error, which we consider to be biological noise. We introduce this error while fitting αk l wk l xil (tj ) + βk to yki (tj +1 ) in (4). Thus, we have  yki (tj +1 ) = αk wk l xil (tj ) + βk + eik (tj +1 ) (5) l where eik (tj +1 ) is the model-fitting error, distributed as an additive Gaussian random variable with mean zero and variance σe2 , independently and identically distributed (i.i.d.) from measurement to measurement for k ∈ {1, 2, . . . , G}, i ∈ {1,2, . . . , P }, and j ∈ {1, 2, . . . , N − 1}. Thus, for the ith patient or experiment, we have from (5) i  i i y (tj +1 ) = W x (tj ) + β + e (tj +1 ) (6) i where y i (tj ) = [y1i (tj ), y2i (tj ), . . . , yG (tj )]T , W  is a G × G  matrix with elements wk l = αk wk l for k, l ∈ {1, 2, . . . , G}, β = [β1 , β2 , . . . , βG ]T , and ei (tj +1 ) = [ei1 (tj +1 ), ei2 (tj +1 ), . . . , eiG (tj +1 )]T . We express the kth (∀k ∈ {1, 2, . . . , G}) entry of the vector y i (tj +1 ) in (6) as follows:  w  k1 yki (tj +1 ) = [ xi1 (tj ) xi2 (tj ) ··· xiG (tj )  w   k2   .  .  1]  .      wk G  βk + eik (tj +1 ). (7) We then stack yki (tj +1 ) for the ith patient or experiment (∀i ∈ {1, 2, . . . , P }) at time-points t ∈ {t1 , t2 , . . . , tN }, and obtain y ik = X i hk + eik (8) where y ik = [yki (t2 ), yki (t3 ), . . . , yki (tN )]T  i x1 (t1 ) xi2 (t1 ) ··· xiG (t1 )  xi (t ) xi2 (t2 ) ··· xiG (t2 )  1 2  .. .. ..  . . ··· . Xi =   .. .. ..   . . ··· . xi1 (tN −1 ) xi2 (tN −1 ) · · · 1  1  ..  .  ..   . xiG (tN −1 ) 1 hk = [wk 1 , wk 2 , . . . , wk G , βk ]T , and eik = [eik (t2 ), eik (t3 ), . . . , eik (tN )]T . We further stack y ik for all patients or experiments i ∈ {1, 2, . . . , P }, and obtain  y1   1   e1  X k k  y2   X 2   e2   k    k  .   ..     ..  =  .  hk +  ...  ⇒ y k = Xhk + ek (9)    .     ..   .   ..   .   .   .  P P X yk ePk with y k and ek being vectors of size P (N − 1), hk a vector of size (G + 1), and X a matrix of size P (N − 1) × (G + 1). We assume in our analysis that the number of patients or experiments P and the number of time-points N are such that P (N − 1) > (G + 1). Authorized licensed use limited to: WASHINGTON UNIVERSITY LIBRARIES. Downloaded on June 08,2010 at 22:13:31 UTC from IEEE Xplore. Restrictions apply. 124 IEEE TRANSACTIONS ON NANOBIOSCIENCE, VOL. 9, NO. 2, JUNE 2010 C. Estimating the GRN We aim to estimate the GRN, i.e., the unknown gene regulation weight matrix W , using the linear model (9) for each k = 1, 2, . . . , G. In addition, we also aim to estimate the unknown parameters αk and βk of the kth (∀k ∈ {1, 2, . . . , G}) gene, which are nuisance parameters. Here, recall from the previous section that αk and the elements wk l (∀k, l ∈ {1, 2, . . . , G}) of W are not separable from each other for a given k, and thus, we estimate them together as wk l = αk wk l without loss of generality. Therefore, the unknown parameters we estimate in our analysis are wk l (∀k, l ∈ {1, 2, . . . , G}) and βk (∀k ∈ {1, 2, . . . , G}); note that these parameters are the elements of the vectors hk (∀k ∈ {1, 2, . . . , G}) in (9). Hence, we estimate hk from (9) using the Bayesian linear regression method proposed in [7] for sparse parameter vectors. Here, we assume that hk (∀k ∈ {1, 2, . . . , G}) are sparse vectors with the locations of the zeros unknown. This assumption is motivated by earlier studies [2]–[5], which concluded that GRNs in most of the biological systems are sparse. Note that we describe the GRN using wk l (∀k, l ∈ {1, 2, . . . , G}), which, in turn, are related with the elements of the vector hk (∀k, l ∈ {1, 2, . . . , G}) through wk l = αk wk l , see Section II-B, and so, we assume that hk (∀k ∈ {1, 2, . . . , G}) are sparse. III. PERFORMANCE COMPARISON WITH EXISTING APPROACHES In this section, we compare the performance of our proposed GRN estimation method with the NIR (TSNI) [16], [17] and BANJO [18] methods, the two most well-established methods for estimating GRNs from time-series microarray data. Here, we employ the two time-series datasets of the IRMA network, recently published in [15], implemented for assessing performances of the existing GRN estimation methods. Cantone et al. [15] conclude that the GRN estimation methods NIR (TSNI) and BANJO are able to correctly estimate the regulatory interactions from their collected time-series datasets. Therefore, we compare the performance of these two methods with our method for assessing our performance. As shown in Fig. 3, the IRMA network is implemented for the yeast Saccharomyces cerevisiae, considering five genes (SWI5, ASH1, CBF1, GAL4, and GAL80) regulating each other through a variety of interactions, without being affected by endogenous genes. The two time-series datasets in several independent experiments were collected by “switching” genes ON or OFF by culturing the yeast cells in a galactose medium or a glucose medium in 14 uniformly spaced time-points, respectively [15]. Cantone et al. [15] name these datasets “switch on” and “switch off.” They average these datasets over the independent experiments [15], and we use them to estimate the GRN with our proposed method. See [15] for a detailed description, and for brief reviews on the two GRN estimation methods NIR (TSNI) and BANJO. Here, in estimating the GRN using our proposed method, we use ∆Ct values of the published datasets in [15] as the gene Fig. 3. Ground-truth IRMA network. Red and blue lines signify stimulation and repression, respectively, from one gene to the other in the connecting edges. expression levels, and compute mk for the kth gene in (4) as mk = maxi,j xik (tj ). (10) In Fig. 4, we present the GRN estimation results using the switch on and switch off time-series datasets for the ground-truth network shown in Fig. 3, namely, Fig. 4(a) and (c) shows the GRN estimation results for the switch-on time-series dataset using the algorithms NIR (TSNI) and our proposed one, respectively. Similarly, in Fig. 4(b) and (d), we present the GRN estimation results for the switch-off time-series dataset using the algorithms NIR (TSNI) and our proposed one, respectively. We quantify the performance in terms of the ratio of the correctly predicted edges out of the total number of predicted edges [i.e., positive predicted value (PPV)], and in terms of the ratio of all the true edges that have been correctly identified by the employed algorithm [i.e., sensitivity (Se)] [20]. We test the significance of the employed algorithm by comparing its PPV with the PPV of a “random” performance that randomly assigns edges between a pair of genes in the estimated network. For example, for a fully connected network, the random algorithm would have a PPV = 1 for all sensitivity levels, as any pair of genes is connected in the network. In the network shown in Fig. 3, the expected PPV for a random guess of directed edges among genes is PPV = 0.4, so any value higher than 0.4 will be significant. We outperform the algorithms NIR (TSNI) and BANJO for the switch-off time-series datasets, see Fig. 4(b) and (d); BANJO shows similar performance to NIR (TSNI) [15]. For the switchon time-series dataset, we outperform BANJO,- and obtain performance competitive to NIR (TSNI). That is to say, NIR (TSNI) performs worse than our proposed method in terms of Se, and it performs better than our method in terms of PPV; the BANJO algorithm fails to perform better than the random algorithm [15]. Also note that we consider in the performance comparison only the accuracy in estimating the directed edges using our method, and not the accuracy in estimating whether the directed edges are stimulating or repressing. During the performance comparison, we realize two immediate advantages of our proposed method in comparison with the other two methods. First, our method is simple and automatic, and it does not require additional experiments to estimate Authorized licensed use limited to: WASHINGTON UNIVERSITY LIBRARIES. Downloaded on June 08,2010 at 22:13:31 UTC from IEEE Xplore. Restrictions apply. SARDER et al.: ESTIMATING SPARSE GENE REGULATORY NETWORKS USING A BAYESIAN LINEAR REGRESSION 125 Fig. 4. Estimated GRN using NIR (TSNI) for: (a) the switch-on dataset and (b) the switch-off dataset. Estimated GRN using our proposed method for: (c) the switch-on dataset and (d) the switch-off dataset. Red and blue lines signify stimulation and repression, respectively, from one gene to the other in the connecting edges. The dashed lines signify a false estimated connection, see the ground truth shown in Fig. 3. the model parameters; in contrast, NIR (TSNI) requires additional experiments to estimate the complex kinetics parameters involved in its related model for estimation [15]. Second, the performance of our method should not be affected by the number of time-points in the time-series microarray data that we analyze as long as P (N − 1) > (G + 1) and σe2 is negligible in (5), see Section II; in contrast, improving the performance of BANJO requires collection of additional time-series measurements. We are also well grounded in applying our method to microarray datasets larger than IRMA, since the algorithms that perform well with the small networks, such as IRMA, also perform well with larger networks [15], [20], [22], see also Section IV. IV. DREAM3 In Silico CHALLENGE NETWORK ANALYSIS We apply our proposed method to the in silico time-series datasets, consisting of 50 and 100 genes, reported recently in the DREAM3 challenge [23], [24]. Here, we employ: 1) the five datasets for the 50-gene networks and 2) the five datasets for the 100-gene networks—both described in the DREAM3 challenge. Each dataset represents the specific gene interactions of Escherichia coli or yeast under several external perturba- tions or in several independent experiments. The datasets were collected in 23 independent experiments for the 50-gene networks and in 46 independent experiments for the 100-gene networks. We compare the directed edges of our estimated GRNs with the ground-truth GRNs for each of these datasets, and obtain for each of them a higher value of the PPV using our method than we obtain using the random algorithm. Recall from Section III that if a method results in a higher value of the PPV than is obtained using the random algorithm, then that method is significant. We thus conclude that our method performs significantly with respect to the ground-truth GRNs for the DREAM3 in silico challenge datasets, see Tables I and II. We summarize two key observations from our simulations for the results presented in Tables I and II. First, we find that the performance of our proposed method improves as we increase the number of independent experiments in our simulations, i.e., with increasing P in (9). We thus understand that the in silico datasets that we use here are very noisy. Second, we estimate many false edges for all of the in silico datasets. This would be the case when the datasets we employ are very noisy. Note that our method is purely computational, and requires a prior knowledge of the sparsity and the gene-expression variation [7]. We Authorized licensed use limited to: WASHINGTON UNIVERSITY LIBRARIES. Downloaded on June 08,2010 at 22:13:31 UTC from IEEE Xplore. Restrictions apply. 126 IEEE TRANSACTIONS ON NANOBIOSCIENCE, VOL. 9, NO. 2, JUNE 2010 TABLE I PERFORMANCE FOR THE 50-GENE NETWORK TABLE III DATA COLLECTION DAYS OF THE 11 VAP PATIENTS TABLE II PERFORMANCE FOR THE 100-GENE NETWORK currently replace these parameters with their empirical Bayesian estimates obtained from the data [7], and such estimates are often very inaccurate for noisy data. One possible way to improve the accuracy in estimating these parameters is using their knowledge from biology; as a result, the performance of our method improves. V. VAP DATA ANALYSIS VAP refers to a pneumonia that develops in patients who require mechanical ventilation through an endotracheal or tracheostomy tube. This disease is a common, severe, and costly complication that increases the likelihood of death from infection [29]. A. Experiment Circulating white blood cell microarray expression profiles were generated from 20 mechanically ventilated patients every two days for up to three weeks [26]. Eleven of these patients developed VAP, as diagnosed by the attending physician [26]. We present the data collection days of these patients in Table III, with day zero being the day that a patient was diagnosed as having VAP. Using EDGE software, 85 mRNA species were identified [25], with a false discovery rate of 0.1, whose abundance changed concordantly among the patients during the data collection window surrounding the day zero data [26]. We also identify 50 mRNA species from the average of the microarray expression profiles, which appear at the top of the ordered list, when the genes are ranked based on their differential expressions using a gene selection software BATS (a Bayesian user-friendly software for analyzing time-series microarray ex- Fig. 5. Venn diagram showing the number of common edges in any combination of the three methods, i.e., our proposed, and the correlation-coefficientand IPA-based methods, and the number of edges that appear in only a single method, but not in the other ones. periments) [28]. We use, in this ranking, the default parameters that are set in the BATS analysis window. B. Analysis Using the 85 EDGE Selected Genes We estimate the GRN for the 85 EDGE selected genes using our proposed method, and compare the resulting directed edges with those of McDunn et al. [26], using both a correlationcoefficient-based method [19] and the IPA [27] based method estimate. (IPA is a knowledge-based software that estimates the GRN based on the gene–gene interactions reported in the literature.) The goal of our comparison is to quantify the similarity or difference among the networks that the three methods (i.e., our method, the correlation-coefficient-based method, and the IPA-based method) derive. We expect that the common edges would be the most likely to be biologically significant. Note that we can perform a similar analysis using the 50 BATS selected genes. We, however, do not include such analysis here for brevity. 1) Analysis–1: Our proposed method estimates 4256 network edges with nonzero weights, out of a possible 7225 network edges. The correlation-coefficient- and IPA-based methods estimate 613 and 44 network edges, respectively. In Fig. 5, we present a venn diagram to show the number of common edges in any combination of these three methods and the number of edges that appear in only a single method. We see that all the three Authorized licensed use limited to: WASHINGTON UNIVERSITY LIBRARIES. Downloaded on June 08,2010 at 22:13:31 UTC from IEEE Xplore. Restrictions apply. SARDER et al.: ESTIMATING SPARSE GENE REGULATORY NETWORKS USING A BAYESIAN LINEAR REGRESSION 127 Fig. 6. Common GRN estimated using our proposed and the correlation-coefficient-based methods. Red and blue color lines signify stimulation and repression, respectively, from one gene to the other in the connecting edges. methods overlap. Our method and the correlation-coefficientbased method exhibit 321 common network edges. There is only one edge, between CCR1 and IL-1 beta, common to all the three estimated networks. This edge is well documented in the literature [27], [30]. However, we estimate this edge with a weight near zero using our proposed method, and hence, we consider this edge insignificant. Neither the correlation-coefficient nor our proposed method shares a significant number of edges with the network derived from the reported gene–gene associations in the IPA [27]. Furthermore, the common edges between the GRNs that our method and the IPA-based method estimate are estimated to be very weak using our method. Hence, we consider that these edges are also insignificant. We thus focus on the 321 edges that appear in the GRNs that both our proposed and the correlation-coefficient-based methods estimate. We use the weight values of these network edges, which our method estimates, as metrics for measuring their significance. For simplicity, we eliminate the edges with absolute estimated weights of less than 10% of the maximum absolute weight value of the common 321 branches, and obtain a consensus network of 118 branches. In Fig. 6, we present the consensus GRN, which resembles a directed graph, using these 118 branches. Here, a positive edge weight (denoted by red) signifies stimulation from one gene to the other, and a negative value (denoted by blue) signifies repression. The network in Panel A contains genes that are downregulated, i.e., their RNA abundances are decreasing, and the network in Panel B contains genes that are upregulated, i.e., their RNA abundances are increasing. Here, note that upregulation and downregulation are classified using the correlation-coefficient-based method. Next, we briefly summarize the consensus GRN analysis scheme, which is also represented in Fig. 7. 1) Select the significant genes of the microarray dataset of interest using EDGE. 2) First, estimate the GRN using our proposed method. Recall that the proposed method assumes the network is sparse, and estimates the network edges with nonzero weights. Moreover, our method estimates the directions of the estimated edges and classifies whether the connections signify stimulation or repression. 3) Next, estimate the GRN using the correlation-coefficientbased method proposed in [19]. The correlationcoefficient-based method generates an undirected network, but classifies whether the network is upregulatory or downregulatory [19]. 4) Then, obtain a consensus network using the common edges of the GRNs that our and the correlation-coefficientbased methods estimate. This consensus network is sparse, directed, and classified as upregulatory or downregulatory. The edges of this network are classified as stimulating or repressing from one gene to the other, and are weighted to specify the degree of stimulation or repression. One can apply a threshold to the edge weights to reduce this consensus network, but that is not necessary. 5) Use IPA to define the molecular and functional classifications of the consensus network and generate hypotheses. The lack of correlation between the consensus network and the IPA-based network suggests a fundamental concern that cannot be resolved from the current data, given our inability to validate either approach. Note that the IPA-based method suffers from its inability to restrict gene–gene connections based upon species or cell phenotype; it is also unable to connect genes of Authorized licensed use limited to: WASHINGTON UNIVERSITY LIBRARIES. Downloaded on June 08,2010 at 22:13:31 UTC from IEEE Xplore. Restrictions apply. 128 Fig. 7. IEEE TRANSACTIONS ON NANOBIOSCIENCE, VOL. 9, NO. 2, JUNE 2010 Block diagram of the consensus GRN analysis scheme. unknown functions, which is, however, one of the strengths of the other two methods. 2) Analysis–2: To gain biological insights, we eliminate the network edges with absolute estimated weights of less than 90% of the maximum absolute weight of the common 321 edges, which our proposed and the correlation-coefficient-based methods estimate. Thus, we obtain four subnetworks with the strongest connection weights, as shown in Fig. 8. They contain several interesting interactions. Two of them are upregulatory (networks A and B), and the other two are downregulatory. 3) Network A: This is the smallest network. It exhibits two-gene interaction, performs cell attraction [pro-platelet basic protein (PPBP), chemotaxis], and negatively controls cell growth and differentiation (PPP2R5A) [27]. This network contains PPBP as the most important gene, which is a growth factor that is a potent chemoattractant and activator of neutrophils [27], [30].This gene is also known as B cell translocation gene (BTG), and a decrease of the concentration ratio between BTG and PF4 is a sensitive indicator of in vitro platelet activation [27], [30]. 4) Network B: This network exhibits interaction between three genes: inhibin beta A (INHBA), cathelicidin antimicrobial peptide (CAMP), and Fms-related tyrosine kinase 1 (FLT1). Here, the central node mRNA is INHBA, which is postulated to be a growth/differentiation factor. It has a beta A subunit mRNA, which is identical to the erythroid differentiation factor subunit mRNA [27], [30]. Note that INHBA is heavily studied in cancer research [27], [30], and has only one gene that exists in the human genome [27], [30]. The second gene, CAMP, is an antimicrobial protein found in granulocytes [27]. It is reported as one of the strongest nodes by McDunn et al. [26] in their GRN analyses performed for the current 85 human mRNA VAP dataset and for a 219 mouse mRNA dataset. The last gene, FLT1, is a protein kinase that is important for the control of cell growth/differentiation [27]. 5) Network C: This network exhibits a three-gene interaction. The central node here is the chemokine receptor (CCR1), which is the only gene node common to all the three network analyses considered for the VAP dataset, as shown in Fig. 6. In this network, the mRNA LIMK2 mediates interactions for the actin cytoskeleton, which is reminiscent of ACTR3 in network D [27]. Also, network C provides an interaction (albeit indirect) between the actin cytoskeleton (LIMK2) and inhibition of apoptosis (BIRC1) [27]. 6) Network D: This is the largest network with eight genes. Here, the central node is the actin-related protein 3 (ACTR3), whose function is largely unknown, but which is related to nucleotide binding and cell motility [27]. Actin polymerization of ACTR3 is mediated by the Arp2/3 complex, which has been found to play a critical role in some pathogens’ intracellular motility [27], [30]. This gene inhibits the HIV-1’s ability to activate the Arp2/3 complex, and could be a potential chemotherapeutic intervention for acquired immunodeficiency syndrome (AIDS) [27], [30]. It also performs actin polymerization of the Authorized licensed use limited to: WASHINGTON UNIVERSITY LIBRARIES. Downloaded on June 08,2010 at 22:13:31 UTC from IEEE Xplore. Restrictions apply. SARDER et al.: ESTIMATING SPARSE GENE REGULATORY NETWORKS USING A BAYESIAN LINEAR REGRESSION 129 Fig. 8. Subnetworks formed using the most powerful stimulating and repressing connections. Red and blue color lines signify stimulation and repression, respectively, from one gene to the other in the connecting edges. host cell as necessary to combat for parasite infection [27], [30]. The other genes in network D perform diverse cell functions, including inhibition of apoptosis, cell interaction and differentiation, and ion transport [27], [30]. 7) Discussion: To summarize, we generate the following experimental tasks for future research using conventional molecular techniques. 1) The interaction between CCR1 and IL-1beta is the only overlap between all the three network analyses considered for the VAP dataset. This interaction is already well documented in the literature [27], [30]; however, we estimate this connection to be very weak using our proposed method. Thus, it would be of interest to validate the importance of this interaction, particularly for optimized host responses to bacterial pathogens. 2) The gene ACTR3 has many connections with the other genes, see Fig. 6. The function of this gene is largely unknown in the literature, especially with response to the host–pathogen interactions [27]. We will investigate this further in our future work. 3) Networks C and D both have similar edges, implying actin and apoptosis interaction, which is important to the host response to pneumonia, see Fig. 8. This interaction is defined as actin-mediated apoptosis in the literature [30]. Note that there is a close association between the actin organization and mitochondrial function, namely, the actin cytoskeleton has been implicated in the movement of mitochon- dria along the actin filaments via an ARP2/3-dependent propulsion mechanism [31]. Furthermore, a link between the actin-regulatory protein gelsolin and mitochondrially induced apoptosis has also been described in mammalian cells [32], [33]. A reduction in the actin dynamics caused by mutations in the actin itself leads to the development of apoptotic phenotypes, such as a loss of mitochondrial membrane potential, elevated reactive oxygen species levels, and DNA fragmentation [34]. Our current computational finding supports these studies for the human genome VAP dataset; however, neither the correlation-coefficientbased method nor our proposed method estimate any connection between LIMK2 and ACTR3, suggesting the need for further experimental verification in our future work. C. Analysis Using the 50 BATS Selected Genes We compare our proposed method with a mutual-informationbased method in estimating the GRN [21] using the 50 BATS selected genes. The software BATS is developed to select the differentially expressed genes from their time-series measurement [28]. Our method estimates the GRN from time-series microarray datasets, and thus, the resulting estimated GRN from the BATS selected genes should be important. On the other hand, the mutual-information-based method for estimating the GRN is new and well established in the community [21]. Thus, it is interesting to compare the directed edges of the GRN Authorized licensed use limited to: WASHINGTON UNIVERSITY LIBRARIES. Downloaded on June 08,2010 at 22:13:31 UTC from IEEE Xplore. Restrictions apply. 130 IEEE TRANSACTIONS ON NANOBIOSCIENCE, VOL. 9, NO. 2, JUNE 2010 that is estimated using the mutual-information-based method with those estimated using our method. We consider here that the GRN that is estimated using the mutual-information-based method is the ground-truth GRN. To compute the GRN using the mutual-information-based method, we employ the time-series microarray dataset of patients 1, 2, 3, 5, 6, 7, 9, and 10, at data collection days −3, −1, 1, and 3, with day zero being the day that a patient was diagnosed as having VAP, see Section V. We use the data from these collection days as they are common among most of the patients. We then average these data over the patients and use the average dataset to compute the GRN using the mutual-informationbased method [21]. To compute the GRN using our proposed method, we employ all the available time-series microarray expression profiles from all the patients; see Section V and Table III. We obtain the PPV= 0.07 using our method rather than the PPV= 0.05 using the random algorithm, and obtain Se= 0.17 using our method.Recall from Section III that if a method results in a higher value of the PPV than is obtained using the random algorithm, then that method is significant. We thus conclude that our method performs significantly with respect to the mutual information based method. ulatory or downregulatory. The edges of this network are classified as stimulating or repressing, and their estimated weights using our method specify their degree of stimulation or repression. We then analyzed the consensus network using IPA to define molecular and functional classifications and generate hypotheses, and obtained four biologically meaningful networks. Using the top-ranked BATS [28] selected genes of the VAP dataset, we further found that our method performs significantly with respect to the GRN estimated using a mutual-informationbased method [28], and thus reconfirmed the validity of our method. In our future experimental work, we will use conventional molecular analysis to analyze the four biologically meaningful subnetworks obtained by comparing the GRNs estimated from the VAP dataset using our method, the correlation-coefficientbased method, and the IPA-based method. VI. CONCLUSION REFERENCES In this paper, we proposed a new GRN estimation method derived from time-series gene-expression datasets. We described the time-series gene-expression data using the biophysical model suggested in [6], and proposed a related biostatistical model. Motivated by past studies indicating that most biological networks are typically sparse [2]–[5], we assumed that the GRN is sparse, and estimated it using a Bayesian linear regression method for sparse parameter vectors [7]. We compared the performance of our proposed method with NIR (TSNI) and BANJO, the two most well-established methods in the literature for estimating the GRN from a timeseries gene-expression data. For this comparison, we employed two time-series gene-expression datasets, recently published in [15], collected for assessing performances of the existing GRN estimation methods. The proposed method outperformed the methods NIR (TSNI) and BANJO in one dataset, and obtained a performance competitive with them in the other. We evaluated further the performance of our method using the in silico time-series datasets consisting of 50 and 100 genes, reported recently in the DREAM3 challenge [23], [24], and our method performs significantly with respect to their ground-truth GRNs. We also estimated the GRN for a VAP dataset published in [26] using our proposed method, and compared the result with the GRNs that McDunn et al. [26] estimated using correlation-coefficient-based [19] and IPA-based [27] methods. We found that the GRNs estimated using our and the correlationcoefficient-based methods display a large number of common edges. These two methods are unbiased and independent of each other, and hence, their large overlap is interesting. The resulting consensus network is sparse, directed, and classified as upreg- [1] J. L. DeRisi, V. R. Iyer, and P. O. Brown, “Exploring the metabolic and genetic control of gene expression on a genomic scale,” Science, vol. 278, pp. 680–686, 1997. [2] M. I. Arnone and E. H. Davidson, “The hardwiring of development: Organization and function of genomic regulatory systems,” Development, vol. 124, pp. 1851–1864, 1997. [3] D. Thieffry, A. Huerta, E. Perez-Rueda, and J. Collado-Vides, “From specific gene regulation to genomic networks: A global analysis of transcriptional regulation in Escherichia coli,” BioEssays, vol. 20, pp. 433–440, 1998. [4] H. Jeong, B. Tombor, R. Albert, Z. N. Oltvai, and A.-L. Barab´asi, “The large-scale organization of metabolic networks,” Nature, vol. 407, pp. 651–654, 2000. [5] H. Jeong, S. Mason, A.-L. Barab´asi, and Z. N. Oltvai, “Lethality and centrality in protein networks,” Nature, vol. 411, pp. 41–42, 2001. [6] D. C. Weaver, C. T. Workman, and G. D. Stormo, “Modeling regulatory networks with weight matrices,” Pac. Symp. Biocomput, vol. 4, pp. 112– 123, 1999. [7] E. G. Larsson and Y. Sel´en, “Linear regression with a sparse parameter vector,” IEEE Trans. Signal Process., vol. 55, no. 2, pp. 451–460, Feb. 2007. [8] M. Wahde and J. Hertz, “Coarse-grained reverse engineering of genetic regulatory networks,” BioSystems, vol. 55, pp. 129–136, 2000. [9] P. Dhaeseleer, S. Liang, and R. Somogyi, “Genetic network inference: From co-expression clustering to reverse engineering,” Bioinformatics, vol. 16, pp. 707–726, 2000. [10] A. Hartemink, D. Gifford, T. Jaakkola, and R. Young, “Using graphical models and genomic expression data to statistically validate models of genetic regulatory networks,” Pac. Symp. Biocomput., vol. 6, pp. 422– 433, 2001. [11] B. Vogelstein, D. Lane, and A. J. Levine, “Surfing the p53 network,” Nature, vol. 408, pp. 307–310, 2000. [12] E. M. Marcotte, “The path not taken,” Nat. Biotechnol., vol. 19, pp. 626– 627, 2001. [13] M. K. Stephen Yeung, J. Tegn´er, and J. J. Collins, “Reverse engineering gene networks using singular value decomposition and robust regression,” in Proc. Nat. Acad. Sci. USA, 2002, vol. 99, pp. 6163–6168. [14] D. di Bernardo, M. J. Thompson, T. S. Gardner, S. E. Chobot, E. L. Eastwood, A. P. Wojtovich, S. J. Elliott, S. E. Schaus, and J. J. Collins, “Chemogenomic profiling on a genome-wide scale using reverseengineered gene networks,” Nat. Biotechnol., vol. 23, pp. 377–383, 2005. ACKNOWLEDGMENT The authors would like to acknowledged the Systems Analysis Group, Washington University, St. Louis, for providing them valuable feedback to improve this paper. They also would like to thank the anonymous reviewers for their constructive comments. Authorized licensed use limited to: WASHINGTON UNIVERSITY LIBRARIES. Downloaded on June 08,2010 at 22:13:31 UTC from IEEE Xplore. Restrictions apply. SARDER et al.: ESTIMATING SPARSE GENE REGULATORY NETWORKS USING A BAYESIAN LINEAR REGRESSION [15] I. Cantone, L. Marucci1, F. Iorio, M. Aurelia Ricci, V. Belcastro, M. Bansal, S. Santini, M. di Bernardo, D. di Bernardo, and M. P. Cosma, “A yeast synthetic network for in vivo assessment of reverse-engineering and modeling approaches,” Cell, vol. 137, pp. 172–181, 2009. [16] G. Della Gatta, M. Bansal, A. Ambesi-Impiombato, D. Antonini, C. Missero, and D. di Bernardo, “Direct targets of the TRP63 transcription factor revealed by a combination of gene expression profiling and reverse engineering,” Genome Res., vol. 18, pp. 939–948, 2008. [17] T. S. Gardner, D. di Bernardo, D. Lorenz, and J. J. Collins, “Inferring genetic networks and identifying compound mode of action via expression profiling,” Science, vol. 301, pp. 102–105, 2003. [18] J. Yu, V. A. Smith, P. P. Wang, A. J. Hartemink, and E. D. Jarvis, “Advances to Bayesian network inference for generating causal networks from observational biological data,” Bioinformatics, vol. 20, pp. 3594–3603, 2004. [19] J. Ruan and W. Zhang, “Identification and evaluation of functional modules in gene co-expression networks,” in Proc. RECOMB Satellite Conf. Syst. Biol. Comput. Proteomics, 2006, pp. 57–76. [20] M. Bansal, V. Belcastro, A. Ambesi-Impiombato, and D. di Bernardo, “How to infer gene networks from expression profiles?” Mol. Syst. Biol., vol. 3, p. 78, 2007. [21] K. Basso, A. A. Margolin, G. Stolovitzky, U. Klein, R. Dalla-Favera, and A. Califano, “Reverse engineering of regulatory networks in human B cells,” Nat. Genet., vol. 37, pp. 382–390, 2005. [22] G. Stolovitzky, D. Monroe, and A. Califano, “Dialogue on reverse engineering assessment and methods: The DREAM of high-throughput pathway inference,” Ann. NY Acad. Sci., vol. 1115, pp. 1–22, 2007. [23] (2010). [Online]. Available: http://wiki.c2b2.columbia.edu/dream/index. php/D3c4full [24] D. Marbach, T. Schaffter, C. Mattiussi, and D. Floreano, “Generating realistic in silico gene networks for performance assessment of reverse engineering methods,” J. Comput. Biol., vol. 16, pp. 229–239, 2009. [25] (2010). [Online]. Available: http://www.edge-software.com/ [26] J. E. McDunn, K. D. Husain, A. D. Polpitiya, A. Burykin, J. Ruan, Q. Li, W. Schierding, N. Lin, D. Dixon, W. Zhang, C. M. Coopersmith, W. M. Dunne, M. Colonna, B. Ghosh, and J. P. Cobb, “Plasticity of the systematic inflammatory response to acute infection during critical illness,” PLoS One, vol. 3, p. e1564, 2008. [27] (2010). [Online]. Available: http://www.ingenuity.com/products/ pathways_analysis.html [28] C. Angelini, L. Cutillo, D. De Canditiis, M. Mutarelli, and M. Pensky, “BATS: A Bayesian user-friendly software for analyzing time series microarray experiments,” BMC Bioinf., vol. 9, p. 415, 2008. [29] N. Safdar, C. Dezfulian, H. R. Collard, and S. Saint, “Clinical and economic consequences of ventilator-associated pneumonia: A systematic review,” Crit. Care Med., vol. 33, pp. 2184–2193, 2005. [30] (2010). [Onlne]. Available: http://www.ncbi.nlm.nih.gov/sites/entrez [31] I. R. Boldogh, H. Yang, W. D. Nowakowski, S. L. Karmon, L. G. Hays, J. R. Yates, and L. A. Pon, “Arp2/3 complex and actin dynamics are required for actin-based mitochondrial motility in yeast,” in Proc. Nat. Acad. Sci. USA, vol. 98, pp. 3162–3167, 2001. [32] S. Ohtsu, S. Izumi, S. Iwanaga, N. Ohno, and T. Yadomae, “Analysis of mitogenic substances in Bupleurum chinenese by ESR spectroscopy,” Biol. Pharm. Bull., vol. 20, pp. 97–100, 1997. [33] D. Koya, M. Haneda, H. Nakagawa, K. Isshiki, and H. Sato, “Amelioration of accelerated diabetic mesangial expansion by treatment with a PKC β inhibitor in diabetic db/db mice, a rodent model for type 2 diabetes,” FASEB J., vol. 14, pp. 439–447, 2000. [34] C. W. Gourlay, L. N. Carpp, P. Timpson, S. J. Winder, and K. R. Ayscough, “A role for the actin cytoskeleton in cell death and aging in yeast,” J. Cell Biol., vol. 164, pp. 803–809, 2004. Pinaki Sarder (S’03) received the B.Tech. degree in electrical engineering from the Indian Institute of Technology, Kanpur, India, in 2003. He is currently working toward the Ph.D. degree in the Department of Electrical and System Engineering, Washington University, St. Louis (WUSTL), MO. He is also a Research Assistant in the Department of Electrical and System Engineering, Washington University. His research interests include statistical signal processing, biomedical imaging, and genomics. Mr. Sarder was a recipient of the Imaging Sciences Pathway Program Fellowship for Graduate Students at WUSTL from January 2007 to December 2007. 131 William Schierding, photograph and biography not available at the time of publication. J. Perren Cobb graduated (cum laude) with the M.D. degree from the University of Louisville School of Medicine, Louisville, KY. He was a trainee in general surgery at the University of California, San Francisco. He completed fellowships in critical care at the National Institute of Health and the University of Pittsburgh. His investigative work has been supported by the National Institutes of Health, the American Association for the Surgery of Trauma, the Society of Critical Care Medicine, and the Barnes-Jewish Hospital Foundation. He is currently the Director of the Center for Critical Illness and Health Engineering, Washington University School of Medicine, St. Louis, MO, where he is also a Professor of surgery and anesthesiology, and an Associate Professor of genetics. His clinical specialty is surgical critical care with a research interest in the pathophysiology of sepsis and injury. His current grants focus on the application of functional genomics to critical illness and injury, and the development of novel sepsis diagnostics. Prof. Cobb is the Chair of the Steering Committee for the U.S. Critical Illness and Injury Trials Group. He was the President of the Association for Academic Surgery. He was the recipient of the Research Scholarship Award of the American Association for the Surgery of Trauma, the Founders Grant for Training in Critical Care Research of the Society of Critical Care Medicine, the George H. A. Clowes, Jr., Memorial Research Career Development Award of the American College of Surgeons, and the 2nd Annual Critical Care Medicine Distinguished Alumnus of the University of Pittsburgh. In 2005, Genome Technology Magazine named him a Microarray Innovator of the Year. Arye Nehorai (S’80–M’83–SM’90–F’94) received the B.Sc. and M.Sc. degrees in electrical engineering from the Technion, Haifa, Israel, and the Ph.D. degree in electrical engineering from Stanford University, Stanford, CA. From 1985 to 1995, he was a faculty member in the Department of Electrical Engineering, Yale University. In 1995, he joined the Department of Electrical Engineering and Computer Science, University of Illinois, Chicago, as a Full Professor, where from 2000 to 2001, he was the Chair of the department’s Electrical and Computer Engineering (ECE) Division. In 2001, he was named the University Scholar of the University of Illinois. In 2006, he became the Chair of the Department of Electrical and Systems Engineering, Washington University, St. Louis, MO, where he has been the inaugural holder of the Eugene and Martha Lohman Professorship, and the Director of the Center for Sensor Signal and Information Processing since 2006. Dr. Nehorai has been a Fellow of the Royal Statistical Society since 1996. He was the Editor-in-Chief of the IEEE TRANSACTIONS ON SIGNAL PROCESSING during 2000–2002. From 2003 to 2005, he was the Vice President (Publications) of the IEEE Signal Processing Society (SPS), the Chair of the Publications Board, a member of the Board of Governors, and a member of the Executive Committee of this Society. From 2003 to 2006, he was the Founding Editor of the special columns on Leadership Reflections in the IEEE SIGNAL PROCESSING MAGAZINE. He was the corecipient of the IEEE SPS 1989 Senior Award for Best Paper with P. Stoica, the coauthor of the 2003 Young Author Best Paper Award, and the corecipient of the 2004 Magazine Paper Award with A. Dogandzic. He was the Distinguished Lecturer of the IEEE SPS for the term 2004 to 2005. He was the recipient of the 2006 IEEE SPS Technical Achievement Award and the 2010 IEEE SPS Meritorious Service Award. He is the Principal Investigator of the Multidisciplinary University Research Initiative project entitled Adaptive Waveform Diversity for Full Spectral Dominance. Authorized licensed use limited to: WASHINGTON UNIVERSITY LIBRARIES. Downloaded on June 08,2010 at 22:13:31 UTC from IEEE Xplore. Restrictions apply.