disize
revolves around its only export:
disize::disize
! It has the following required
arguments:
design_formula
: An R formula that specifies the experimental design. This is essentially describing the model that you would’ve liked to fit in the absence of batch-effects; it is the same R formula you would pass to tools likeDESeq2
oredgeR
for differential expression analysis (including predictors likecondition
,sex
, etc measured in your study). All terms used in this formula should be present inmetadata
.counts
: A (observation x features) matrix containing the transcript counts. This can be dense or sparse; an internal coercion to a dense matrix will be done after subsetting relevant features and observations.metadata
: A dataframe containing the observation-level information(with observations as rows).batch_name
: The name of the column inmetadata
containing the batch identifier.
If the row indices of
counts
andmetadata
do not correspond to the same observation but are named, you can specifyobs_name
which will automatically re-ordercounts
to match the indices inmetadata
.
Note:
disize
encourages users to set then_threads
argument in order to take advantage of parallelization.
An example with DESeq2
To see disize
in action, we will be analyzing the
following dataset from Li et al., 2025 with DESeq2
.
This dataset consists of a purified macrophage subtype(“Mac2”, induced in an ‘activated’ state) that has been partitioned into four groups exposed to different conditions. The authors offer this information on how the samples were processed:
The sorted Mac2 cells were divided into four groups and stimulated at the 3-h time point with the same concentrations as previously described. Then, the RNA was extracted using the RNeasy Plus Micro Kit as per manufacturer instructions. Poly(A)mRNA was isolated using mRNA Capture Beads 2.0 (Yeasen Cat.12629ES, CHN) with two rounds of purification, followed by RNA fragmentation with magnesium ions at 94°C (Yeasen Cat.12340ES97, CHN). RNA sequencing library preparation was performed using the TruSeq RNA Library Prep Kit v2 (Illumina). Sequencing was carried out as paired-end 2×150 bp (PE150) on an Illumina Novaseq™ X Plus (LC-Bio Technologies).
The TruSeq RNA Library Prep Kit involves “tagging” transcripts with barcodes that identify distinct samples, allowing all prepared cDNA libraries to be pooled together before sequencing. Since batch-effects are usually attributed to separate sequencing runs, then we expect very small batch-effects to be present in this dataset if we define a “batch” as the unit subjected to RNA extraction (and all further processing).
Downloading the data
# Download counts and construct metadata
counts_path <- curl::curl_download(
url = "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE273924&format=file&file=GSE273924%5Fraw%5Fcounts%2Etsv%2Egz",
destfile = paste0(tempdir(), "/counts.tsv.gz")
)
counts <- data.table::fread(counts_path)
metadata <- data.frame(
"sample_id" = c(colnames(counts)[-1]),
"condition" = factor(rep(c("control", "lps", "nelps", "ne"), each = 3))
)
# Coerce to formatted matrix
gene_names <- counts$gene_id
counts <- t(as.matrix(counts[,-1]))
colnames(counts) <- gene_names
Running disize
The metadata
contains the information for the
experimental design:
print(metadata)
#> sample_id condition
#> 1 Ctrl1 control
#> 2 Ctrl2 control
#> 3 Ctrl3 control
#> 4 LPS1 lps
#> 5 LPS2 lps
#> 6 LPS3 lps
#> 7 NELPS1 nelps
#> 8 NELPS2 nelps
#> 9 NELPS3 nelps
#> 10 NE1 ne
#> 11 NE2 ne
#> 12 NE3 ne
For this dataset, the study was primarily interested in the effect of
condition
on expression, thus the formula we would input
into disize
is:
design_formula <- ~ condition
We can finally run disize
to get the estimated size
factors:
size_factors_1 <- disize::disize(
design_formula,
counts,
metadata,
batch_name = "sample_id",
n_threads = parallel::detectCores()
)
#> Formatting data...
#> Estimating size factors... (Max ETA: ~86.8s)
#> Finised in 78s!
print(size_factors_1)
#> Ctrl1 Ctrl2 Ctrl3 LPS1 LPS2 LPS3
#> 0.094267893 0.015031281 -0.076917418 0.082696685 0.007454061 0.097203017
#> NE1 NE2 NE3 NELPS1 NELPS2 NELPS3
#> -0.075320906 0.031300295 0.103935208 -0.166847677 -0.137895724 -0.022277557
Evidently the batch-effect is indeed small across most samples! The samples “NELPS1” and “NELPS2” seem to have been processed slightly worse, but otherwise the estimated size factors are approximately the same (within ~0.1).
We can confirm these estimates by rerunning disize
with
a larger n_feats
:
size_factors_2 <- disize::disize(
design_formula,
counts,
metadata,
batch_name = "sample_id",
n_feats = 15000,
n_threads = parallel::detectCores()
)
#> Formatting data...
#> Estimating size factors... (Max ETA: ~152s)
#> Finised in 124.6s!
print(size_factors_2)
#> Ctrl1 Ctrl2 Ctrl3 LPS1 LPS2 LPS3
#> 0.105272847 0.009010748 -0.080962030 0.068038928 -0.015195995 0.084975827
#> NE1 NE2 NE3 NELPS1 NELPS2 NELPS3
#> -0.047985535 0.054307253 0.126804892 -0.181847664 -0.152241943 -0.023239295
Indeed the estimates remain largely the same.
Running DESeq
Once the size factors are inserted into the DESeqDataSet
object, the analysis then proceeds normally.
# Constructing DESeqDataSet object
dds <- DESeq2::DESeqDataSetFromMatrix(
countData = t(counts),
colData = metadata,
design = design_formula
)
DESeq2::sizeFactors(dds) <- exp(size_factors_2) # Insert size factors
# Run analysis
dds <- DESeq2::DESeq(dds)
#> using pre-existing size factors
#> estimating dispersions
#> gene-wise dispersion estimates
#> mean-dispersion relationship
#> final dispersion estimates
#> fitting model and testing
# Print results
print(DESeq2::results(dds))
#> log2 fold change (MLE): condition nelps vs control
#> Wald test p-value: condition nelps vs control
#> DataFrame with 57132 rows and 6 columns
#> baseMean log2FoldChange lfcSE stat pvalue
#> <numeric> <numeric> <numeric> <numeric> <numeric>
#> ENSMUSG00000028180 1.94400e+03 -0.7941853 0.0959234 -8.279367 1.23832e-16
#> ENSMUSG00000028182 2.81850e+01 -0.8954827 0.5068895 -1.766623 7.72914e-02
#> ENSMUSG00000028185 3.38437e-01 -0.0480489 4.4073563 -0.010902 9.91302e-01
#> ENSMUSG00000028184 1.03806e+04 0.7576360 0.0651975 11.620633 3.23730e-31
#> ENSMUSG00000028187 9.80161e+02 -0.1870775 0.1140778 -1.639912 1.01024e-01
#> ... ... ... ... ... ...
#> ENSMUSG00000106108 10.694170 6.7461601 1.4354838 4.699572 2.60707e-06
#> ENSMUSG00000042675 1582.859728 -0.0435406 0.0795679 -0.547212 5.84233e-01
#> ENSMUSG00000036959 1430.242522 0.5400550 0.0839045 6.436545 1.22224e-10
#> ENSMUSG00000036958 0.660686 -3.9086314 4.2775888 -0.913746 3.60850e-01
#> ENSMUSG00000042678 5.594502 -5.6048609 1.7066720 -3.284088 1.02313e-03
#> padj
#> <numeric>
#> ENSMUSG00000028180 1.09839e-15
#> ENSMUSG00000028182 1.46890e-01
#> ENSMUSG00000028185 NA
#> ENSMUSG00000028184 5.07829e-30
#> ENSMUSG00000028187 1.84655e-01
#> ... ...
#> ENSMUSG00000106108 1.04212e-05
#> ENSMUSG00000042675 7.22125e-01
#> ENSMUSG00000036959 7.32355e-10
#> ENSMUSG00000036958 NA
#> ENSMUSG00000042678 2.91412e-03