Many studies never explicitly define a “batch”; instead, you have to inspect the methods section, or even contact the authors, to learn how samples were processed (whether samples were prepared on the same day with the same reagents, where library pooling was done, whether multiple sequencing runs were used, etc.). You then have to decide which steps are likely to induce a non-negligible batch effect; anecdotally, separate sequencing runs are often the greatest concern (though earlier steps can still contribute). Luckily, disize allows observations to be partitioned into batches any way you like. To see this in action we’ll use a dataset from the pasilla package.
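If the example packages aren’t installed yet, pasilla and DESeq2 are both available from Bioconductor (installing disize itself isn’t covered here):
# Install the example data package and DESeq2 from Bioconductor (run once)
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install(c("pasilla", "DESeq2"))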
Reading in data
# Grab filepaths
counts_path <- system.file(
  "extdata",
  "pasilla_gene_counts.tsv",
  package = "pasilla",
  mustWork = TRUE
)
metadata_path <- system.file(
  "extdata",
  "pasilla_sample_annotation.csv",
  package = "pasilla",
  mustWork = TRUE
)
# Read in counts (genes x samples on disk) and transpose to samples x genes
counts <- as.matrix(read.csv(counts_path, sep = "\t", row.names = "gene_id")) |>
  t()
# Derive sample IDs that match the count matrix and keep the relevant columns
metadata <- read.csv(metadata_path)
metadata <- metadata |>
  dplyr::mutate(sample_id = sub("fb", "", file)) |>
  dplyr::select(sample_id, condition, type) |>
  dplyr::mutate(dplyr::across(c(condition, type), as.factor))
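As a quick sanity check, the rows of the transposed count matrix should be exactly the samples described by the metadata:
# Every sample in 'metadata' should have a matching row of counts
stopifnot(setequal(rownames(counts), metadata$sample_id))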
Inspecting the study design
The dataset is derived from a study investigating the effect of various RNA-binding proteins (RBPs) on alternative splicing regulation. The authors partitioned D. melanogaster cells into two conditions: samples with condition = treated were exposed to dsRNAs to knock down expression via RNAi, while samples with condition = untreated were left alone as a control. Additionally, two sets of duplicates were processed for each condition with different sequencing chemistry (single-read vs. paired-end). Therefore, a suitable definition of a “batch” might group these duplicates together:
metadata <- metadata |>
  dplyr::mutate(id_1 = interaction(type, condition, sep = ":"))
However, the most granular definition would allow for batch effects even within these duplicates:
metadata <- metadata |>
  dplyr::mutate(id_2 = sample_id)
print(metadata)
#> sample_id condition type id_1 id_2
#> 1 treated1 treated single-read single-read:treated treated1
#> 2 treated2 treated paired-end paired-end:treated treated2
#> 3 treated3 treated paired-end paired-end:treated treated3
#> 4 untreated1 untreated single-read single-read:untreated untreated1
#> 5 untreated2 untreated single-read single-read:untreated untreated2
#> 6 untreated3 untreated paired-end paired-end:untreated untreated3
#> 7 untreated4 untreated paired-end paired-end:untreated untreated4
(Evidently one treated sample was dropped, although this does not pose a problem for anything downstream.)
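Tabulating the coarser batch labels makes the imbalance easy to see:
# Batch sizes under the coarser definition; note that the
# single-read:treated batch contains only a single sample
table(metadata$id_1)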
Estimating size factors
Setting batch effects aside, we are interested in the effect of condition, so our design_formula should be:
design_formula <- ~ condition
Note that the batch variable stays out of the formula entirely: rather than absorbing the batch into the design (e.g., ~ type + condition), disize receives the batch partition separately through its batch_name argument.
With all the required arguments in hand, we can now run disize for both batch definitions:
size_factors_1 <- disize::disize(
  design_formula,
  counts,
  metadata,
  batch_name = "id_1",
  obs_name = "sample_id", # needed to order 'counts' correctly
  n_threads = parallel::detectCores()
)
#> Formatting data...
#> Estimating size factors... (Max ETA: ~58.1s)
#> Finised in 49.3s!
size_factors_2 <- disize::disize(
  design_formula,
  counts,
  metadata,
  batch_name = "id_2",
  obs_name = "sample_id",
  n_threads = parallel::detectCores()
)
#> Formatting data...
#> Estimating size factors... (Max ETA: ~59.6s)
#> Finised in 50.7s!
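Both calls return a named vector of log-scale size factors, one per batch (hence the exp() when handing them to DESeq2 below):
# Named vectors with one entry per batch label
names(size_factors_1) # the four 'id_1' batches
names(size_factors_2) # the seven 'id_2' batches, i.e. one per sample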
Comparing definitions
# Map each sample to its batch-level estimate under both definitions
# (as.character is needed for id_1: subsetting a named vector by a factor
# would otherwise index by the factor's integer codes, not its labels)
size_factors <- data.frame(
  sample_id = metadata$sample_id,
  id_1 = unname(size_factors_1[as.character(metadata$id_1)]),
  id_2 = unname(size_factors_2[metadata$id_2])
)
print(size_factors)
#> sample_id id_1 id_2
#> 1 treated1 0.4237065 0.45382540
#> 2 treated2 -0.4065548 -0.34422601
#> 3 treated3 -0.4065548 -0.25613031
#> 4 untreated1 0.2060039 0.01351995
#> 5 untreated2 0.2060039 0.49017730
#> 6 untreated3 -0.5487210 -0.49525578
#> 7 untreated4 -0.5487210 -0.37507949
Although there are some similarities (e.g., the two definitions give approximately equal estimates for untreated3 and treated2), there are enough differences between samples sharing an id_1 batch (compare untreated1 with untreated2) that the more granular definition is a better choice for this analysis.
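For a quick visual comparison, the per-sample estimates can be plotted against each other; points far from the diagonal are samples where the coarser batches mask within-batch variation:
# Scatter the two sets of estimates; the dashed line marks perfect agreement
plot(
  size_factors$id_1,
  size_factors$id_2,
  xlab = "Log size factor (id_1)",
  ylab = "Log size factor (id_2)"
)
abline(0, 1, lty = 2)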
Running DESeq
Once the size factors are inserted into the DESeqDataSet object, the analysis proceeds normally.
# Construct the DESeqDataSet object (DESeq2 expects genes x samples)
dds <- DESeq2::DESeqDataSetFromMatrix(
  countData = t(counts[metadata$sample_id, ]),
  colData = metadata,
  design = design_formula
)
# Insert size factors: index by sample ID to match the column order,
# and exponentiate since disize returns them on the log scale
DESeq2::sizeFactors(dds) <- exp(size_factors_2[metadata$sample_id])
# Run analysis
dds <- DESeq2::DESeq(dds)
#> using pre-existing size factors
#> estimating dispersions
#> gene-wise dispersion estimates
#> mean-dispersion relationship
#> final dispersion estimates
#> fitting model and testing
# Print results
print(DESeq2::results(dds))
#> log2 fold change (MLE): condition untreated vs treated
#> Wald test p-value: condition untreated vs treated
#> DataFrame with 14599 rows and 6 columns
#> baseMean log2FoldChange lfcSE stat pvalue padj
#> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
#> FBgn0000003 0.184560 -0.9959969 3.803883 -0.261837 0.7934472 NA
#> FBgn0000008 102.781102 0.0409561 0.220432 0.185799 0.8526022 0.965876
#> FBgn0000014 1.155164 0.5825494 2.103552 0.276936 0.7818292 NA
#> FBgn0000015 0.915909 1.9207946 2.072944 0.926602 0.3541331 NA
#> FBgn0000017 4707.654896 0.2802176 0.121269 2.310716 0.0208485 0.143939
#> ... ... ... ... ... ... ...
#> FBgn0261571 9.07419e-02 -0.8510821 3.809148 -0.223431 0.823200 NA
#> FBgn0261572 6.71910e+00 1.0044574 0.766641 1.310205 0.190126 0.550984
#> FBgn0261573 2.42379e+03 0.0291687 0.118174 0.246829 0.805041 0.954300
#> FBgn0261574 5.24839e+03 0.0348369 0.190160 0.183197 0.854643 0.966593
#> FBgn0261575 1.14954e+01 -0.0962191 0.917024 -0.104925 0.916435 0.981692
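From here the usual DESeq2 post-processing applies, for example:
# Summarize the results and pull out the top hits by adjusted p-value
res <- DESeq2::results(dds)
summary(res)
head(res[order(res$padj), ])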