
Running FALCON

Seb edited this page Jun 16, 2020 · 25 revisions

We provide a driver/run-through script “DriverFalcon” which integrates all the functionalities of FALCON listed above into a single pipeline. We also provide a series of drivers for a number of specific analyses related to regularized optimization.

Unfortunately, we no longer provide a GUI. As the complexity of FALCON increased and the interface capabilities of Matlab are limited, we chose to drop the graphical option and prioritize other developments.

1. Model definition

For this example we will run an analysis of the PDGF dataset included in the examples. This dataset is described in Trairatphisan et al., 2016.

InputFile = 'PDGF_model.xlsx';
MeasFile = 'PDGF_meas.xlsx';

Here we define the optimization problem by providing FALCON with the model topology and the measurements for some of the nodes of the network, in a series of conditions where the values of the input nodes change.

estim = FalconMakeModel(InputFile, MeasFile);

...reading network file...

...checking network consistency...

...reading datafile...

... ... ...

Network loaded: 30 nodes and 38 interactions (1 Boolean gates)

Data loaded: 9 inputs, 6 outputs and 6 experimental conditions

The model has 19 parameters for 36 datapoints

This step automatically defines a number of hyperparameters that users might want to change:

  • the number of rounds for each optimization is set by default to 4. There is no proof that our objective function is convex for all possible topologies, and as our gradient-descent algorithm is not stochastic, it is good practice to start several optimizations from different random initial conditions. More rounds are better; however, as 4 is usually below the number of CPU cores, it is a good starting point for estimating the convergence of the optimization. It can be changed to an arbitrary integer N with estim.optRound = N.

  • the maximum number of evaluations of the objective function is set by default to 1000 times the number of parameters, but fmincon will usually stop earlier when a minimum is reached. It can be changed to an arbitrary integer N with estim.MaxFunEvals = N.

  • the distribution of initial conditions is set by default to 'normal', meaning that the initial conditions will be drawn from a bounded normal distribution with mean 0.5 and standard deviation 0.167. This has been shown to accelerate optimizations, but can be changed to a uniform distribution with estim.IC_Dist = 'uniform' or to a near-zero distribution, where parameters are random but smaller than 1E-6, with estim.IC_Dist = 'scratch'. A user-defined seed can be set with estim.Seed = N.

  • the type of optimization is set by default to 'MSE', meaning that the estimated targets are the measurement means, without consideration of their error term. This error term is displayed in the fitting plots to estimate the adequacy of the optimization results, but is not taken into account during optimization. This can be changed to a maximum-likelihood estimation, in which the error for each datapoint is weighted by a term proportional to the inverse of its variance. Datapoints with a smaller measurement error will tend to be estimated more closely. This can be set with estim.ObjFunction = 'likelihood'.

  • the weights of the different datapoints are set by default to be uniform, meaning that all datapoints are equally weighted in the objective function. For exploratory purposes it can be useful to change the weights of some of the datapoints and observe the change in steady-state behavior. This can be changed by providing an array X with the same size as estim.Output with estim.Weights = X.

  • the steady-state threshold is set by default to be equal to 2.2204E-14 times the number of nodes, which corresponds to the limit on the sum of differences between the node values of two successive network updates below which the simulation stops. It can be increased to an arbitrary float F with estim.SSthresh = F, which sometimes helps accelerate computations, at the expense of a small decrease in accuracy.
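All of the hyperparameters above are fields of the estim structure and can be set between FalconMakeModel and the optimization call. A minimal sketch; the values shown here are arbitrary illustrations, not recommendations:

```matlab
% Adjust hyperparameters on the estim structure before optimizing.
% Field names are those listed above; the values are arbitrary examples.
estim.optRound    = 10;           % number of optimization rounds
estim.MaxFunEvals = 5000;         % cap on objective-function evaluations
estim.IC_Dist     = 'uniform';    % 'normal' (default), 'uniform' or 'scratch'
estim.Seed        = 42;           % user-defined random seed
estim.ObjFunction = 'likelihood'; % 'MSE' (default) or 'likelihood'
estim.Weights     = ones(size(estim.Output)); % uniform weights (the default)
estim.SSthresh    = 1e-12;        % looser steady-state threshold
```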

2. Optimization

During this step the network is initialized to random node values, initial conditions are chosen for the parameters, then the network is repeatedly updated until it reaches a steady state. The node values are then compared to the measured outputs and the cost is calculated. Parameter values are explored one by one to build a matrix of pseudo-gradients, and these partial derivatives are used to determine the next parameter estimate. The process is repeated until the objective function converges.
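The simulation at the heart of each cost evaluation can be sketched as a fixed-point iteration. This is an illustrative sketch, not FALCON's actual implementation; updateNetwork is a hypothetical stand-in for the parameterized logic update rules, and nNodes, params, outputIdx and measured are assumed variables:

```matlab
% Illustrative sketch of the steady-state simulation (not FALCON's code).
x = rand(nNodes, 1);                 % random initial node values
SSthresh = 2.2204e-14 * nNodes;      % default steady-state threshold
maxIter = 1000;
for iter = 1:maxIter
    xNew = updateNetwork(x, params); % one synchronous network update
    if sum(abs(xNew - x)) < SSthresh % change small enough: steady state
        break
    end
    x = xNew;
end
MSE = mean((x(outputIdx) - measured).^2); % cost against the measurements
```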

estim = FalconOptimize(estim)

Four lists of numbers roll down the screen, indicating respectively the MSE, the regularization cost, the sum of these two, and the BIC (Bayesian Information Criterion). Ideally the numbers should decrease continuously and converge to a single value across all rounds. When that is not the case, the optimization might not have converged, or rogue nodes (nodes without a parent and without pre-set values in the inputs) might be present. When the optimization finishes, the list of optimized parameter values is displayed. A log of the optimization process and tables with optimized parameters are saved in a results folder carrying the timestamp of the start of the analysis. The results of the analyses already completed are stored in the variable estim.Results.

3. Additional analyses

First, we can re-simulate the network with the optimized parameter values and retrieve the optimized node values, as well as display a number of important plots:

[estim, StateValues, MSE] = FalconSimul(estim)

By default, the condensed fitting plot is displayed, as well as the cross-error heatmap, the regression plot, and the display of the node values during network updates until steady-state. All figures are saved in the results folder.

Next, the evolution of fitting cost can be explored to assess the convergence of the gradient-descent algorithm:

estim = FalconFitEvol(estim)

By default, ten optimizations are performed. If the lines appear to converge to different minima, the analysis of the plot can reveal that the optimization stopped prematurely.

Next, the uncertainty on parameter estimates can be explored. As phosphoproteomic datasets are usually not large enough to allow bootstrapping (all nodes and conditions are important to define the functioning of the system), another strategy was implemented, in which data are resampled from their empirical (assumed normal) distribution. By resampling many datasets and repeatedly optimizing the same model on each of them, it is possible to retrieve important information about the constraints on the parameters, i.e. to what extent the datapoints restrict the range of possible values, and how the measurement error, which might be larger on some datapoints than others or in some conditions, influences the uncertainty of the estimates. By default 50 synthetic datasets are produced; this can be decreased if necessary to an arbitrary integer N with estim.NDatasets = N. The resampling is done with:

estim = FalconResample(estim)
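The resampling amounts to drawing each synthetic datapoint from a normal distribution centered on its measured mean. A minimal illustrative sketch, not FALCON's code; meas and sd are hypothetical arrays holding the measurement means and standard deviations, one entry per datapoint:

```matlab
% Illustrative sketch of the data resampling (not FALCON's code).
nDatasets = 50;                      % default number of synthetic datasets
synthetic = cell(nDatasets, 1);
for d = 1:nDatasets
    % draw each datapoint from N(mean, sd), clipped to the [0, 1] range
    sample = meas + sd .* randn(size(meas));
    synthetic{d} = min(max(sample, 0), 1);
end
% each synthetic dataset is then optimized with the same model; the spread
% of the resulting parameter estimates reflects their uncertainty
```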

In this example, parameter km3 has been estimated at around 0.4, but during resampling it spans a broad part of the range; the error term of the dataset therefore has a large influence on the uncertainty of our estimate for this parameter. In contrast, parameter k13 was estimated around the same value, 0.4, but appears more constrained by the dataset.

Another way to obtain the same information is to systematically optimize the network on all conditions while keeping one of the parameters fixed. By exploring different values for this parameter it is possible to assess its identifiability, i.e. the degree to which it must lie within a certain range for the system to successfully reproduce the data.

estim = FalconLPSA(estim)

The function outputs the plots of the minimal value of the objective function for each parameter along a series of fixed values, as well as the covariance matrix, which can be used to identify groups (usually pairs) of parameters that compensate for each other.

Next, the behavior of the system can be explored by systematically turning off each one of the interactions and studying the fitness of the resulting model:

estim = FalconKO(estim)

The same process can be applied to the nodes. When knocking out a node, two scenarios can be imagined: one in which the parameter values are allowed to be re-estimated to fit the dataset as well as possible:

estim = FalconKONodes(estim)

And one where the optimal parameter values are kept and the behavior of the system is evaluated without allowing the network to re-wire:

estim = FalconKONodes_fast(estim)

The difference in 'efficiency' between the two strategies can be examined by plotting their BIC values against each other:

X = estim.Results.KnockOutNodes.BIC_values;
Y = estim.Results.KnockOutNodesFast.BIC_values;

figure, plot(X, Y, '.k', 'MarkerSize', 15), hold on
plot(xlim, xlim, '--k') % diagonal: identical BIC with and without re-optimization

In this graph, the diagonal line represents the absence of any effect of the adaptation of the network on the fitting of the dataset. The bottom-left corner holds the model with the lowest BIC (in this case the base model). Distance from the bottom-left corner indicates loss of fitness of the system; however, molecules far from the diagonal are able to adapt the weights of the network in order to maintain functionality in the tested experimental conditions. In this example, knock-out of STAT5 or MEK results in a drastic loss of fitness, which is not recoverable. By contrast, knock-out of PLCg also results in an immediate loss of fitness, but this loss is recovered when the network is re-optimized.