An introduction to disize

disize revolves around its only export: disize::disize! It has the following required arguments:

design_formula: An R formula that specifies the experimental design. This is essentially describing the model that you would’ve liked to fit in the absence of batch-effects; it is the same R formula you would pass to tools like DESeq2 or edgeR for differential expression analysis (including predictors like condition, sex, etc measured in your study). All terms used in this formula should be present in metadata.
counts: A (observation x features) matrix containing the transcript counts. This can be dense or sparse; an internal coercion to a dense matrix will be done after subsetting relevant features and observations.
metadata: A dataframe containing the observation-level information(with observations as rows).
batch_name: The name of the column in metadata containing the batch identifier.

If the row indices of counts and metadata do not correspond to the same observation but are named, you can specify obs_name which will automatically re-order counts to match the indices in metadata.

Note: disize encourages users to set the n_threads argument in order to take advantage of parallelization.

An example with DESeq2

To see disize in action, we will be analyzing the following dataset from Li et al., 2025 with DESeq2.

This dataset consists of a purified macrophage subtype(“Mac2”, induced in an ‘activated’ state) that has been partitioned into four groups exposed to different conditions. The authors offer this information on how the samples were processed:

The sorted Mac2 cells were divided into four groups and stimulated at the 3-h time point with the same concentrations as previously described. Then, the RNA was extracted using the RNeasy Plus Micro Kit as per manufacturer instructions. Poly(A)mRNA was isolated using mRNA Capture Beads 2.0 (Yeasen Cat.12629ES, CHN) with two rounds of purification, followed by RNA fragmentation with magnesium ions at 94°C (Yeasen Cat.12340ES97, CHN). RNA sequencing library preparation was performed using the TruSeq RNA Library Prep Kit v2 (Illumina). Sequencing was carried out as paired-end 2×150 bp (PE150) on an Illumina Novaseq™ X Plus (LC-Bio Technologies).

The TruSeq RNA Library Prep Kit involves “tagging” transcripts with barcodes that identify distinct samples, allowing all prepared cDNA libraries to be pooled together before sequencing. Since batch-effects are usually attributed to separate sequencing runs, then we expect very small batch-effects to be present in this dataset if we define a “batch” as the unit subjected to RNA extraction (and all further processing).

Dependencies

suppressPackageStartupMessages({
    library(disize)
    library(curl)
    library(R.utils)
    library(data.table)
})

Downloading the data

# Download counts and construct metadata
counts_path <- curl::curl_download(
    url = "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE273924&format=file&file=GSE273924%5Fraw%5Fcounts%2Etsv%2Egz",
    destfile = paste0(tempdir(), "/counts.tsv.gz")
)
counts <- data.table::fread(counts_path)

metadata <- data.frame(
    "sample_id" = c(colnames(counts)[-1]),
    "condition" = factor(rep(c("control", "lps", "nelps", "ne"), each = 3))
)

# Coerce to formatted matrix
gene_names <- counts$gene_id
counts <- t(as.matrix(counts[,-1]))
colnames(counts) <- gene_names

Running `disize`

The metadata contains the information for the experimental design:

print(metadata)
#>    sample_id condition
#> 1      Ctrl1   control
#> 2      Ctrl2   control
#> 3      Ctrl3   control
#> 4       LPS1       lps
#> 5       LPS2       lps
#> 6       LPS3       lps
#> 7     NELPS1     nelps
#> 8     NELPS2     nelps
#> 9     NELPS3     nelps
#> 10       NE1        ne
#> 11       NE2        ne
#> 12       NE3        ne

For this dataset, the study was primarily interested in the effect of condition on expression, thus the formula we would input into disize is:

design_formula <- ~ condition

We can finally run disize to get the estimated size factors:

size_factors_1 <- disize::disize(
    design_formula,
    counts,
    metadata,
    batch_name = "sample_id",
    n_threads = parallel::detectCores()
)
#> Formatting data...
#> Estimating size factors... (Max ETA: ~86.8s)
#> Finised in 78s!
print(size_factors_1)
#>        Ctrl1        Ctrl2        Ctrl3         LPS1         LPS2         LPS3 
#>  0.094267893  0.015031281 -0.076917418  0.082696685  0.007454061  0.097203017 
#>          NE1          NE2          NE3       NELPS1       NELPS2       NELPS3 
#> -0.075320906  0.031300295  0.103935208 -0.166847677 -0.137895724 -0.022277557

Evidently the batch-effect is indeed small across most samples! The samples “NELPS1” and “NELPS2” seem to have been processed slightly worse, but otherwise the estimated size factors are approximately the same (within ~0.1).

We can confirm these estimates by rerunning disize with a larger n_feats:

size_factors_2 <- disize::disize(
    design_formula,
    counts,
    metadata,
    batch_name = "sample_id",
    n_feats = 15000,
    n_threads = parallel::detectCores()
)
#> Formatting data...
#> Estimating size factors... (Max ETA: ~152s)
#> Finised in 124.6s!
print(size_factors_2)
#>        Ctrl1        Ctrl2        Ctrl3         LPS1         LPS2         LPS3 
#>  0.105272847  0.009010748 -0.080962030  0.068038928 -0.015195995  0.084975827 
#>          NE1          NE2          NE3       NELPS1       NELPS2       NELPS3 
#> -0.047985535  0.054307253  0.126804892 -0.181847664 -0.152241943 -0.023239295

Indeed the estimates remain largely the same.

Running DESeq

Once the size factors are inserted into the DESeqDataSet object, the analysis then proceeds normally.

# Constructing DESeqDataSet object
dds <- DESeq2::DESeqDataSetFromMatrix(
    countData = t(counts),
    colData = metadata,
    design = design_formula
)
DESeq2::sizeFactors(dds) <- exp(size_factors_2) # Insert size factors

# Run analysis
dds <- DESeq2::DESeq(dds)
#> using pre-existing size factors
#> estimating dispersions
#> gene-wise dispersion estimates
#> mean-dispersion relationship
#> final dispersion estimates
#> fitting model and testing

# Print results
print(DESeq2::results(dds))
#> log2 fold change (MLE): condition nelps vs control 
#> Wald test p-value: condition nelps vs control 
#> DataFrame with 57132 rows and 6 columns
#>                       baseMean log2FoldChange     lfcSE      stat      pvalue
#>                      <numeric>      <numeric> <numeric> <numeric>   <numeric>
#> ENSMUSG00000028180 1.94400e+03     -0.7941853 0.0959234 -8.279367 1.23832e-16
#> ENSMUSG00000028182 2.81850e+01     -0.8954827 0.5068895 -1.766623 7.72914e-02
#> ENSMUSG00000028185 3.38437e-01     -0.0480489 4.4073563 -0.010902 9.91302e-01
#> ENSMUSG00000028184 1.03806e+04      0.7576360 0.0651975 11.620633 3.23730e-31
#> ENSMUSG00000028187 9.80161e+02     -0.1870775 0.1140778 -1.639912 1.01024e-01
#> ...                        ...            ...       ...       ...         ...
#> ENSMUSG00000106108   10.694170      6.7461601 1.4354838  4.699572 2.60707e-06
#> ENSMUSG00000042675 1582.859728     -0.0435406 0.0795679 -0.547212 5.84233e-01
#> ENSMUSG00000036959 1430.242522      0.5400550 0.0839045  6.436545 1.22224e-10
#> ENSMUSG00000036958    0.660686     -3.9086314 4.2775888 -0.913746 3.60850e-01
#> ENSMUSG00000042678    5.594502     -5.6048609 1.7066720 -3.284088 1.02313e-03
#>                           padj
#>                      <numeric>
#> ENSMUSG00000028180 1.09839e-15
#> ENSMUSG00000028182 1.46890e-01
#> ENSMUSG00000028185          NA
#> ENSMUSG00000028184 5.07829e-30
#> ENSMUSG00000028187 1.84655e-01
#> ...                        ...
#> ENSMUSG00000106108 1.04212e-05
#> ENSMUSG00000042675 7.22125e-01
#> ENSMUSG00000036959 7.32355e-10
#> ENSMUSG00000036958          NA
#> ENSMUSG00000042678 2.91412e-03

An example with DESeq2

Dependencies

Downloading the data

Running disize

Running DESeq

Running `disize`