Many studies never explicitly define a “batch”; instead, you have to inspect the methods section, or even contact the authors, to learn how samples were processed (whether samples were prepared on the same day with the same reagents, where library pooling was done, whether multiple sequencing runs were used, etc.). You then have to decide which steps are likely to induce a non-negligible batch effect; anecdotally, separate sequencing runs are often the greatest concern (though earlier steps can still contribute). Luckily, disize allows observations to be partitioned into batches any way you like. To see this in action we’ll use a dataset from the pasilla package.
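If the example packages aren’t installed yet, pasilla and DESeq2 are both available from Bioconductor (installing disize itself isn’t covered here):
# Install the example data package and DESeq2 from Bioconductor (run once)
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install(c("pasilla", "DESeq2"))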
Reading in data
# Grab filepaths
counts_path <- system.file(
  "extdata",
  "pasilla_gene_counts.tsv",
  package = "pasilla",
  mustWork = TRUE
)
metadata_path <- system.file(
  "extdata",
  "pasilla_sample_annotation.csv",
  package = "pasilla",
  mustWork = TRUE
)
# Read in counts (genes x samples on disk) and transpose to samples x genes
counts <- as.matrix(read.csv(counts_path, sep = "\t", row.names = "gene_id")) |>
  t()
# Derive sample IDs that match the count matrix and keep the relevant columns
metadata <- read.csv(metadata_path)
metadata <- metadata |>
  dplyr::mutate(sample_id = sub("fb", "", file)) |>
  dplyr::select(sample_id, condition, type) |>
  dplyr::mutate(dplyr::across(c(condition, type), as.factor))
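As a quick sanity check, the rows of the transposed count matrix should be exactly the samples described by the metadata:
# Every sample in 'metadata' should have a matching row of counts
stopifnot(setequal(rownames(counts), metadata$sample_id))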
Inspecting the study design
The dataset is derived from a study investigating the effect of various RNA-binding proteins (RBPs) on alternative splicing regulation. The authors partitioned D. melanogaster cells into two conditions: samples with condition = treated were exposed to dsRNAs to knock down expression via RNAi, while samples with condition = untreated were left alone as a control. Additionally, two sets of duplicates were processed for each condition with different sequencing chemistry (single-read vs. paired-end). Therefore, a suitable definition of a “batch” might group these duplicates together:
metadata <- metadata |>
  dplyr::mutate(id_1 = interaction(type, condition, sep = ":"))
However, the most granular definition would allow for batch effects even within these duplicates:
metadata <- metadata |>
  dplyr::mutate(id_2 = sample_id)
print(metadata)
#> sample_id condition type id_1 id_2
#> 1 treated1 treated single-read single-read:treated treated1
#> 2 treated2 treated paired-end paired-end:treated treated2
#> 3 treated3 treated paired-end paired-end:treated treated3
#> 4 untreated1 untreated single-read single-read:untreated untreated1
#> 5 untreated2 untreated single-read single-read:untreated untreated2
#> 6 untreated3 untreated paired-end paired-end:untreated untreated3
#> 7 untreated4 untreated paired-end paired-end:untreated untreated4
(Evidently one treated sample was dropped, although this does not pose a problem for anything downstream.)
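Tabulating the coarser batch labels makes the imbalance easy to see:
# Batch sizes under the coarser definition; note that the
# single-read:treated batch contains only a single sample
table(metadata$id_1)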
Estimating size factors
Setting batch effects aside, we are interested in the effect of condition, so our design_formula should be:
design_formula <- ~ condition
Note that the batch variable stays out of the formula entirely: rather than absorbing the batch into the design (e.g., ~ type + condition), disize receives the batch partition separately through its batch_name argument.
With all the required arguments in hand, we can now run disize for both batch definitions:
size_factors_1 <- disize::disize(
  design_formula,
  counts,
  metadata,
  batch_name = "id_1",
  obs_name = "sample_id", # needed to order 'counts' correctly
  n_threads = parallel::detectCores()
)
#> Formatting data...
#> Estimating size factors... (Max ETA: ~58.1s)
#> Finised in 49.3s!
size_factors_2 <- disize::disize(
  design_formula,
  counts,
  metadata,
  batch_name = "id_2",
  obs_name = "sample_id",
  n_threads = parallel::detectCores()
)
#> Formatting data...
#> Estimating size factors... (Max ETA: ~59.6s)
#> Finised in 50.7s!
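Both calls return a named vector of log-scale size factors, one per batch (hence the exp() when handing them to DESeq2 below):
# Named vectors with one entry per batch label
names(size_factors_1) # the four 'id_1' batches
names(size_factors_2) # the seven 'id_2' batches, i.e. one per sample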
Comparing definitions
# Map each sample to its batch-level estimate under both definitions
# (as.character is needed for id_1: subsetting a named vector by a factor
# would otherwise index by the factor's integer codes, not its labels)
size_factors <- data.frame(
  sample_id = metadata$sample_id,
  id_1 = unname(size_factors_1[as.character(metadata$id_1)]),
  id_2 = unname(size_factors_2[metadata$id_2])
)
print(size_factors)
#> sample_id id_1 id_2
#> 1 treated1 0.4237065 0.45382540
#> 2 treated2 -0.4065548 -0.34422601
#> 3 treated3 -0.4065548 -0.25613031
#> 4 untreated1 0.2060039 0.01351995
#> 5 untreated2 0.2060039 0.49017730
#> 6 untreated3 -0.5487210 -0.49525578
#> 7 untreated4 -0.5487210 -0.37507949
Although there are some similarities (e.g., the two definitions give approximately equal estimates for untreated3 and treated2), there are enough differences between samples sharing an id_1 batch (compare untreated1 with untreated2) that the more granular definition is a better choice for this analysis.
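For a quick visual comparison, the per-sample estimates can be plotted against each other; points far from the diagonal are samples where the coarser batches mask within-batch variation:
# Scatter the two sets of estimates; the dashed line marks perfect agreement
plot(
  size_factors$id_1,
  size_factors$id_2,
  xlab = "Log size factor (id_1)",
  ylab = "Log size factor (id_2)"
)
abline(0, 1, lty = 2)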
Running DESeq
Once the size factors are inserted into the DESeqDataSet object, the analysis proceeds normally.
# Construct the DESeqDataSet object (DESeq2 expects genes x samples)
dds <- DESeq2::DESeqDataSetFromMatrix(
  countData = t(counts[metadata$sample_id, ]),
  colData = metadata,
  design = design_formula
)
# Insert size factors: index by sample ID to match the column order,
# and exponentiate since disize returns them on the log scale
DESeq2::sizeFactors(dds) <- exp(size_factors_2[metadata$sample_id])
# Run analysis
dds <- DESeq2::DESeq(dds)
#> using pre-existing size factors
#> estimating dispersions
#> gene-wise dispersion estimates
#> mean-dispersion relationship
#> final dispersion estimates
#> fitting model and testing
# Print results
print(DESeq2::results(dds))
#> log2 fold change (MLE): condition untreated vs treated
#> Wald test p-value: condition untreated vs treated
#> DataFrame with 14599 rows and 6 columns
#> baseMean log2FoldChange lfcSE stat pvalue padj
#> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
#> FBgn0000003 0.184560 -0.9959969 3.803883 -0.261837 0.7934472 NA
#> FBgn0000008 102.781102 0.0409561 0.220432 0.185799 0.8526022 0.965876
#> FBgn0000014 1.155164 0.5825494 2.103552 0.276936 0.7818292 NA
#> FBgn0000015 0.915909 1.9207946 2.072944 0.926602 0.3541331 NA
#> FBgn0000017 4707.654896 0.2802176 0.121269 2.310716 0.0208485 0.143939
#> ... ... ... ... ... ... ...
#> FBgn0261571 9.07419e-02 -0.8510821 3.809148 -0.223431 0.823200 NA
#> FBgn0261572 6.71910e+00 1.0044574 0.766641 1.310205 0.190126 0.550984
#> FBgn0261573 2.42379e+03 0.0291687 0.118174 0.246829 0.805041 0.954300
#> FBgn0261574 5.24839e+03 0.0348369 0.190160 0.183197 0.854643 0.966593
#> FBgn0261575 1.14954e+01 -0.0962191 0.917024 -0.104925 0.916435 0.981692
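From here the usual DESeq2 post-processing applies, for example:
# Summarize the results and pull out the top hits by adjusted p-value
res <- DESeq2::results(dds)
summary(res)
head(res[order(res$padj), ])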