Skip to contents

Internally, disize uses Stan to fit a Bayesian model that jointly estimates the effect of covariates (structured according to design_formula) on gene expression and any confounding batch-effects:

\begin{aligned} \mathbf{y}_g &\sim \text{NegBinom}(\mathbf{\mu}_g, \phi) \\ \log \mathbf{\mu}_g &= \mathbf{\alpha}_g + \mathbf{X} \mathbf{\beta}_g + \mathbf{Z}\mathbf{b}_g + \mathbf{o} \\ &= \mathbf{\alpha}_g + \mathbf{X} \mathbf{\beta}_g + \mathbf{Z}\mathbf{b}_g + (\mathbf{B} \mathbf{s}) \\ \mathbf{b}_g &\sim \text{Normal}(\mathbf{0}, \mathbf{G}_g) \end{aligned}

Where \mathbf{y}_g denotes the vector of counts of a gene g for all observations, which is realized from the negative binomial parameterized by the effect of the covariates (\mathbf{\alpha}_g + \mathbf{X} \mathbf{\beta}_g + \mathbf{Z}\mathbf{b}_g) and any batch-effects (\mathbf{B} \mathbf{s}).

The experimental design is specified by an R formula (design_formula) that constructs a “fixed-effects” model matrix \mathbf{X} (without an intercept) and a “random-effects” model matrix \mathbf{Z}; this by itself is a regular GLMM.

The confounding batch-effect is essentially an unknown offset \mathbf{o} = \mathbf{B} \mathbf{s}, where \mathbf{B} specifies the batch membership for each observation and \mathbf{s} contains the “size factors” which scale the true magnitude of expression.

At its face, however, this model should not be identifiable for most experimental designs (often the batch ID results in too many free parameters such as when it is identical to a measured covariate). This identifiability issue is overcome by constraining \mathbf{s} and assuming only a fraction of features are significantly affected by the covariates measured in the experiment; in other words, the estimated coefficients \mathbf{\beta}_g, \mathbf{b}_g (excluding the intercept) are sparse across genes.

This assumption is encoded in the model by placing distinct horseshoe priors on each of the model coefficients (both the fixed- and random-effects). This allows the priors to be learned independently of each other using the large number of features measured in RNAseq experiments.

Estimation

Since the posterior distribution for \mathbf{s} is heavily concentrated for typical datasets with thousands of features, it is sufficient to quickly provide a point estimate of \mathbf{s} and delegate estimation of the model coefficients to existing tools like DESeq2 or edgeR (only replacing the small normalization step of their workflows).

disize uses Stan’s L-BFGS optimization algorithm to find the model’s maximum a posteriori (MAP) for \mathbf{s}. We end up doing this in fewer iterations than needed for all parameters to converge by using a heuristic to guess how long the procedure should run for; this is followed up with a diagnostic to ensure the size factors have converged.