A Review of MoR and TMM Normalization
Set Up
Let y_{i,g} denote the observed count in a sample i \in \mathcal{I} (with n_i = |\mathcal{I}|) for a gene g \in \mathcal{G} (with n_g = |\mathcal{G}|) realized from a particular negative binomial distribution:
\begin{aligned} Y_{i,g} &\sim \text{NegBinom}(\mu_{i,g}, \phi_{g}) \\ \mu_{i,g} &= \eta_{i,g} \cdot \rho_{i} \end{aligned}
where \eta_{i,g} denotes the true magnitude of expression and \rho_{i} denotes the “size factor” that scales this expression away from its ground-truth value.
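To make the setup concrete, here is a minimal sketch of this generative model in R. The gamma/uniform/exponential choices for \eta_{i,g}, \rho_{i}, and \phi_{g} (and the toy dimensions) are arbitrary illustrations, not part of the model; note that rnbinom() parameterizes the dispersion through size = 1/\phi_{g}, so that the variance is \mu_{i,g} + \phi_g \mu_{i,g}^2.

# A toy draw from the model above (all parameter values are made up).
# rnbinom() uses size = 1 / phi, so Var(Y) = mu + phi * mu^2.
set.seed(1)
n_samples <- 4L
n_genes_demo <- 5L
eta <- matrix(
  rgamma(n_samples * n_genes_demo, shape = 2, rate = 0.01),
  nrow = n_samples
) # true expression eta_{i,g}
rho <- runif(n_samples, 0.5, 2)        # per-sample size factors rho_i
phi <- rexp(n_genes_demo, rate = 10)   # per-gene dispersions phi_g
mu <- eta * rho # mu_{i,g} = eta_{i,g} * rho_i (rho is recycled down each column)
y <- matrix(
  rnbinom(length(mu), mu = mu, size = rep(1 / phi, each = n_samples)),
  nrow = n_samples
) # observed counts y_{i,g} (rows = samples, columns = genes)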
DESeq2's Median of Ratios
We first focus on a particular gene g and compare the observed count for a sample i to the geometric average of counts across samples to get a ratio R_{i,g}:
R_{i,g} = \frac{ y_{i,g} }{ \left( \prod_{i' \in \mathcal{I}} y_{i',g} \right)^{ \frac{1}{n_i} } }
We then take the median of these ratios to get our size factor estimates!
\hat{s}_{i} = \underset{g \in \mathcal{G}}{\text{median }} R_{i,g}
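As a quick illustration, the estimator fits in a few lines of R. This is a simplified sketch rather than DESeq2's own code; DESeq2's estimateSizeFactorsForMatrix() works on the log scale and excludes genes with a zero count in any sample, which the sketch mimics by dropping genes whose log geometric mean is not finite.

# A simplified sketch of the median-of-ratios estimator, with counts laid out
# as rows = samples and columns = genes to match the notation above.
median_of_ratios <- function(y) {
  log_geo_mean <- colMeans(log(y)) # log geometric mean of each gene
  keep <- is.finite(log_geo_mean)  # drop any gene with a zero count
  apply(y, 1, function(counts) {
    exp(median(log(counts[keep]) - log_geo_mean[keep]))
  })
}
# e.g. with the simulated matrix from the set-up sketch:
# median_of_ratios(y)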
edgeR's Trimmed Mean of M Values
We first “normalize” the observed count profile for a sample i by the total number of counts N_i = \sum_{g \in \mathcal{G}} y_{i,g} in order to get proportions:
y'_{i,g} = \frac{y_{i,g}}{N_i}
We next select a reference sample i^{\dagger} \in \mathcal{I} and compute, for each gene, the log-ratio of proportions (the “M-value”) together with a precision weight, taken as the inverse of the log-ratio's delta-method variance:
\begin{aligned} R_{i,g} &= \log_2 \frac{ y'_{i,g} }{ y'_{i^{\dagger}, g} } \\ w_{i,g} &= \left( \frac{ N_i - y_{i,g} }{ N_i y_{i,g} } + \frac{ N_{i^{\dagger}} - y_{i^{\dagger},g} }{ N_{i^{\dagger}} y_{i^{\dagger},g} } \right)^{-1} \end{aligned}
We then filter the genes to a subset \mathcal{G}'_i \subset \mathcal{G} by symmetrically “trimming” away the smallest and largest ratios for a sample i (with edgeR's defaults, 30% of genes are trimmed from each tail of the ratios, plus a milder 5% trim on each tail of the genes' average log-expressions).
We finally take the weighted average of the remaining ratios, which gives us the size factor on the log scale; raising 2 to this power recovers \hat{s}_i itself:
\log_2 \hat{s}_i = \frac{ \sum_{g \in \mathcal{G}'_i} w_{i,g} R_{i,g} }{ \sum_{g \in \mathcal{G}'_i} w_{i,g} }
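Putting the steps together, a simplified TMM sketch for one sample against a chosen reference might look as follows. The 30%/5% symmetric trims on the M- and A-values mirror edgeR's defaults, but this is only an illustration of the idea, not edgeR's implementation; edgeR::calcNormFactors() additionally picks the reference sample automatically and rescales the resulting factors so they multiply to one.

# A simplified TMM sketch for sample i against a reference sample ref
# (rows = samples, columns = genes).
tmm_one_sample <- function(y, i, ref, m_trim = 0.30, a_trim = 0.05) {
  N_i <- sum(y[i, ])
  N_ref <- sum(y[ref, ])
  p_i <- y[i, ] / N_i        # proportions y'_{i,g}
  p_ref <- y[ref, ] / N_ref

  ok <- p_i > 0 & p_ref > 0  # genes observed in both samples
  M <- log2(p_i[ok] / p_ref[ok])       # log-ratios (M-values)
  A <- 0.5 * log2(p_i[ok] * p_ref[ok]) # average log-expression (A-values)
  w <- 1 / ((N_i - y[i, ok]) / (N_i * y[i, ok]) +
            (N_ref - y[ref, ok]) / (N_ref * y[ref, ok])) # precision weights

  # Symmetric trimming: keep the middle of the M- and A-value distributions
  n <- length(M)
  keep <- rank(M) > m_trim * n & rank(M) <= (1 - m_trim) * n &
          rank(A) > a_trim * n & rank(A) <= (1 - a_trim) * n

  2^(sum(w[keep] * M[keep]) / sum(w[keep])) # undo the log2
}
# e.g. size factors of every sample relative to sample 1:
# sapply(seq_len(nrow(y)), function(i) tmm_one_sample(y, i, ref = 1))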
Simulation Framework
Comparing Two Conditions
# Simulation settings
n_donors <- 6L
n_genes <- 1000L
model_formula <- reformulas::splitForm(~ 0 + cond_id + (1 | donor_id))

# Construct model matrices
metadata <- data.frame(
  donor_id = 1:n_donors,
  cond_id = cut(1:n_donors, 2),
  batch_id = 1:n_donors
)

# Fixed-effect design matrix from the fixed-effect part of the formula
fe_design <- model.matrix(
  model_formula$fixedFormula,
  data = metadata
)

# Random-effect design matrices (one per random-effect term), transposed
# back from the Zt form returned by mkReTrms()
re_design <- reformulas::mkReTrms(
  model_formula$reTrmFormulas,
  fr = metadata,
  calc.lambdat = FALSE
)$Ztlist |> lapply(Matrix::t)
# Simulate effect of covariates on expression