Implementation of disize • disize

Internally, disize uses Stan to fit a Bayesian model that jointly estimates the effect of covariates (structured according to design_formula) on gene expression and any confounding batch-effects:

Where \mathbf{y}_g denotes the vector of counts of a gene g for all observations, which is realized from the negative binomial parameterized by the effect of the covariates (\mathbf{\alpha}_g + \mathbf{X} \mathbf{\beta}_g + \mathbf{Z}\mathbf{b}_g) and any batch-effects (\mathbf{B} \mathbf{s}).

The experimental design is specified by an R formula (design_formula) that constructs a “fixed-effects” model matrix \mathbf{X} (without an intercept) and a “random-effects” model matrix \mathbf{Z}; this by itself is a regular GLMM.

The confounding batch-effect is essentially an unknown “offset” (not in the literal sense as it is being estimated) \mathbf{o} = \mathbf{B} \mathbf{s}, where \mathbf{B} specifies the batch membership for each observation and \mathbf{s} contains the “size factors” which scale the true magnitude of expression.

At its face, however, this model should not be joint identifiable for most experimental designs (often the batch ID is perfectly collinear with a measured covariate). This is overcome by constraining \mathbf{s} and assuming only a fraction of features are significantly affected by the covariates measured in the experiment; in other words, the estimated coefficients \mathbf{\beta}_g, \mathbf{b}_g (excluding the intercept) are sparse across genes.

This assumption is encoded in the model by placing distinct horseshoe priors on each of the model coefficients (both the fixed- and random-effects). This allows the priors to be learned independently of each other using the large number of features measured in RNAseq experiments:

\begin{aligned} \mathbf{y}_g &\sim \text{NegBinom}(\mathbf{\mu}_g, \phi) \\ \log \mathbf{\mu}_g &= \mathbf{\alpha}_g + \mathbf{X} \mathbf{\beta}_g + \mathbf{Z}\mathbf{b}_g + \mathbf{o} \\ &= \mathbf{\alpha}_g + \mathbf{X} \mathbf{\beta}_g + \mathbf{Z}\mathbf{b}_g + (\mathbf{M} \mathbf{s}) \\ \mathbf{\beta}_g &\sim \text{Normal}(\mathbf{0}, \mathbf{\lambda}_g^{(f)} \odot \mathbf{\tau}^{(f)}) \\ \mathbf{\lambda}_g^{(f)} &\sim \text{Half-Cauchy}(\mathbf{0}, \mathbf{1}) \\ \mathbf{b}_g &\sim \text{Normal}(\mathbf{0}, \mathbf{R} [ \mathbf{\lambda}_g^{(r)} \odot \mathbf{\tau}^{(r)} ]) \\ \mathbf{\lambda}_g^{(r)} &\sim \text{Half-Cauchy}(\mathbf{0}, \mathbf{1}) \\ \mathbf{1} \cdot \exp(\mathbf{s}) &= n_b \end{aligned}

Where \mathbf{R} specifies the random-effects term membership for each random-effect coefficient, and n_b denotes the number of batches.

Estimation

Since the posterior distribution for \mathbf{s} is heavily concentrated for typical datasets with thousands of features, it is sufficient to quickly provide a point estimate of \mathbf{s} and delegate estimation of the model coefficients to existing tools like DESeq2 or edgeR (only replacing the small normalization step of their workflows).

disize uses Stan’s L-BFGS optimization algorithm to find the model’s maximum a posteriori (MAP) for \mathbf{s}.