Skip to contents

A Review of MoR and TMM Normalization

Set Up

Let y_{i,g} denote the observed count in a sample i \in \mathcal{I} (with n_i = |\mathcal{I}|) for a gene g \in \mathcal{G} (with n_g = |\mathcal{G}|) realized from a particular negative binomial distribution:

\begin{aligned} Y_{i,g} &\sim \text{NegBinom}(\mu_{i,g}, \phi_{g}) \\ \mu_{i,g} &= \eta_{i,g} \cdot \rho_{i} \end{aligned}

Where \eta_{i,g} denotes the true magnitude of expression and \rho_{i} denotes the “size factor” that scales this expression from its ground-truth.

DESeq2’s Median of Ratios

We first focus on a particular gene g and compare the observed count for a sample i to the geometric average of counts across samples to get a ratio R_{i,g}:

R_{i,g} = \frac{ y_{i,g} }{( \prod_{i \in \mathcal{I}} y_{i,g} )^{ \frac{1}{n_i} } }

We then take the median of these ratios to get our size factor estimates!

\hat{s}_{i} = \underset{g \in \mathcal{G}}{\text{median }} R_{i,g}

edgeR’s Trimmed Mean of M Values

We first “normalize” the observed count profile for a sample i by the total number of counts N_i = \sum_{g \in \mathcal{G}} y_{i,g} in order to get proportions:

y'_{i,g} = \frac{y_{i,g}}{N_i}

We next select a reference sample i^{\dagger} \in \mathcal{I} and compute both the ratio of log-transformed proportions and their weights:

\begin{aligned} R_{i,g} &= \frac{\log_2 y'_{i,g}}{\log_2 y'_{i^{\dagger}, g}} \\ w_{i,g} &= \frac{ N_i - Y_{i,g} }{ N_i Y_{i,g} } + \frac{ N_{i^{\dagger}} - Y_{i^{\dagger},g} }{ N_{i^{\dagger}} Y_{i^{\dagger},g} } \end{aligned}

We then filter the genes to a subset \mathcal{G}'_i \subset \mathcal{G} by symmetrically “trimming” away the smallest and largest ratios for a sample i to XX% of the original number (defaults to 70%).

We finally compute the size factor by taking the weighted average of these ratios and raising it to the second power:

\log_2 \hat{s}_i = \frac{ \sum_{g \in \mathcal{G}'_i} w_{i,g} R_{i,g} }{ \sum_{g \in \mathcal{G}'_i} w_{i,g} }

Simulation Framework

Comparing Two Conditions

# Simulating settings
n_donors <- 6L
n_genes <- 1000L

model_formula <- reformulas::splitForm(~ 0 + cond_id + (1 | donor_id))

# Construct model matrices
metadata <- data.frame(
    donor_id = 1:n_donors,
    cond_id = cut(1:n_donors, 2),
    batch_id = 1:n_donors
)

fe_design <- model.matrix(
    model_formula$fixedFormula,
    data = metadata
)
re_design <- reformulas::mkReTrms(
    model_formula$reTrmFormulas,
    fr = metadata,
    calc.lambdat = FALSE
)$Ztlist |> lapply(Matrix::t)

# Simulate effect of covariates on expression