SpaNorm: Spatially aware library size normalisation
Dharmesh D. Bhuva and Agus Salim
19 October 2024
Source:vignettes/SpaNorm.Rmd
SpaNorm.Rmd
Abstract
This package implements the spatially aware library size normalisation algorithm, SpaNorm. SpaNorm normalises out library size effects while retaining biology through the modelling of smooth functions for each effect. Normalisation is performed in a gene- and cell-/spot- specific manner, yielding library size adjusted data.
SpaNorm
SpaNorm is a spatially aware library size normalisation method that removes library size effects, while retaining biology. Library sizes need to be removed from molecular datasets to allow comparisons across observations, in this case, across space. Bhuva et al. (Bhuva et al. 2024) and Atta et al. (Atta et al. 2024) have shown that standard single-cell inspired library size normalisation approaches are not appropriate for spatial molecular datasets as they often remove biological signals while doing so. This is because library size confounds biology in spatial molecular data.
SpaNorm uses a unique approach to spatially constraint modelling approach to model gene expression (e.g., counts) and remove library size effects, while retaining biology. It achieves this through three key innovations:
- Optmial decomposition of spatial variation into spatially smooth library size associated (technical) and library size independent (biology) variation using generalized linear models (GLMs).
- Computing spatially smooth functions (using thin plate splines) to represent the gene- and location-/cell-/spot- specific size factors.
- Adjustment of data using percentile adjusted counts (PAC) (Salim et al. 2022), as well as other adjustment approaches (e.g., Pearson).
The SpaNorm package can be installed as follows:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
# release version
BiocManager::install("SpaNorm")
# development version from GitHub
BiocManager::install("bhuvad/SpaNorm")
Load count data
We begin by loading some example 10x Visium data profiling the
dorsolateral prefrontal cortex (DLPFC) of the human brain. The data has
~4,000 spots and covers genome-wide measurements. The example data here
is filtered to remove lowly expressed genes (using
filterGenes(HumanDLPFC, prop = 0.1)
). This filtering
retains genes that are expressed in at least 10% of cells.
library(SpaNorm)
library(SpatialExperiment)
library(ggplot2)
# load sample data
data(HumanDLPFC)
# change gene IDs to gene names
rownames(HumanDLPFC) = rowData(HumanDLPFC)$gene_name
HumanDLPFC
#> class: SpatialExperiment
#> dim: 5076 4015
#> metadata(0):
#> assays(1): counts
#> rownames(5076): NOC2L HES4 ... MT-CYB AC007325.4
#> rowData names(2): gene_name gene_biotype
#> colnames(4015): AAACAAGTATCTCCCA-1 AAACACCAATAACTGC-1 ...
#> TTGTTTCCATACAACT-1 TTGTTTGTGTAAATTC-1
#> colData names(3): cell_count sample_id AnnotatedCluster
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> spatialCoords names(2) : pxl_col_in_fullres pxl_row_in_fullres
#> imgData names(4): sample_id image_id data scaleFactor
# plot regions
p_region = plotSpatial(HumanDLPFC, colour = AnnotatedCluster, size = 0.5) +
scale_colour_brewer(palette = "Paired", guide = guide_legend(override.aes = list(shape = 15, size = 5))) +
ggtitle("Region")
p_region
The filterGenes
function returns a logical vector
indicating which genes should be kept.
# filter genes expressed in 20% of spots
keep = filterGenes(HumanDLPFC, 0.2)
table(keep)
#> keep
#> FALSE TRUE
#> 2568 2508
# subset genes
HumanDLPFC = HumanDLPFC[keep, ]
The log-transformed raw counts are visualised below for the gene MOBP which is a marker of oligodendrocytes enriched in the white matter (WM) (Maynard et al. 2021). Despite being a marker of this region, we see that it is in fact absent from the white matter region.
logcounts(HumanDLPFC) = log2(counts(HumanDLPFC) + 1)
p_counts = plotSpatial(
HumanDLPFC,
colour = MOBP,
what = "expression",
assay = "logcounts",
size = 0.5
) +
scale_colour_viridis_c(option = "F") +
ggtitle("logCounts")
p_region + p_counts
Normalise count data
SpaNorm normalises data in two steps: (1) fitting the SpaNorm model
of library sizes; (2) adjusting data using the fit model. A single call
to the SpaNorm()
function is enough to run these two steps.
To speed up computation, the model is fit using a smaller proportion of
spots/cells (default is 0.25). The can be modified using the
sample.p
parameter.
set.seed(36)
HumanDLPFC = SpaNorm(HumanDLPFC)
#> (1/2) Fitting SpaNorm model
#> 1004 cells/spots sampled to fit model
#> iter: 1, estimating gene-wise dispersion
#> iter: 1, log-likelihood: -3180454.444194
#> iter: 1, fitting NB model
#> iter: 1, iter: 1, log-likelihood: -3180454.444194
#> iter: 1, iter: 2, log-likelihood: -2560805.644205
#> iter: 1, iter: 3, log-likelihood: -2441673.154255
#> iter: 1, iter: 4, log-likelihood: -2404139.459938
#> iter: 1, iter: 5, log-likelihood: -2388905.294491
#> iter: 1, iter: 6, log-likelihood: -2383787.784655
#> iter: 1, iter: 7, log-likelihood: -2382083.583994
#> iter: 1, iter: 8, log-likelihood: -2381381.305476
#> iter: 1, iter: 9, log-likelihood: -2381058.488972
#> iter: 1, iter: 10, log-likelihood: -2380902.931337 (converged)
#> iter: 2, estimating gene-wise dispersion
#> iter: 2, log-likelihood: -2376651.402516
#> iter: 2, fitting NB model
#> iter: 2, iter: 1, log-likelihood: -2376651.402516
#> iter: 2, iter: 2, log-likelihood: -2375068.142673
#> iter: 2, iter: 2, log-likelihood: -2375068.142673
#> iter: 2, iter: 2, log-likelihood: -2375068.142673
#> iter: 2, iter: 3, log-likelihood: -2375068.142673 (converged)
#> iter: 3, estimating gene-wise dispersion
#> iter: 3, log-likelihood: -2375057.821580
#> iter: 3, fitting NB model
#> iter: 3, iter: 1, log-likelihood: -2375057.821580
#> iter: 3, iter: 1, log-likelihood: -2375057.821580
#> iter: 3, iter: 1, log-likelihood: -2375057.821580
#> iter: 3, iter: 2, log-likelihood: -2375057.821580
#> iter: 3, iter: 2, log-likelihood: -2375057.821580
#> iter: 3, iter: 2, log-likelihood: -2375057.821580
#> iter: 3, iter: 3, log-likelihood: -2375057.821580 (converged)
#> iter: 4, log-likelihood: -2375057.821580 (converged)
#> (2/2) Normalising data
HumanDLPFC
#> class: SpatialExperiment
#> dim: 2508 4015
#> metadata(1): SpaNorm
#> assays(2): counts logcounts
#> rownames(2508): ISG15 SDF4 ... MT-ND6 MT-CYB
#> rowData names(2): gene_name gene_biotype
#> colnames(4015): AAACAAGTATCTCCCA-1 AAACACCAATAACTGC-1 ...
#> TTGTTTCCATACAACT-1 TTGTTTGTGTAAATTC-1
#> colData names(3): cell_count sample_id AnnotatedCluster
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> spatialCoords names(2) : pxl_col_in_fullres pxl_row_in_fullres
#> imgData names(4): sample_id image_id data scaleFactor
The above output (which can be switched off by setting
verbose = FALSE
), shows the two steps of normalisation. In
the model fitting step, 1004 cells/spots are used to fit the negative
binomial (NB) model. Subsequent output shows that this fit is performed
by alternating between estimation of the dispersion parameter and
estimation of the NB parameters by fixing the dispersion. The output
also shows that each intermediate fit converges, and so does the final
fit. The accuracy of the fit can be controlled by modifying the
tolerance parameter tol
(default 1e-4
).
Next, data is adjusted using the fit model. The following approaches are implemented for count data:
-
adj.method = "logpac"
(default) - percentile adjusted counts (PAC) which estimates the count for each gene at each location/spot/cell using a model that does not contain unwanted effects such as the library size. -
adj.method = "person"
- Pearson residuals from factoring out unwanted effects. -
adj.method = "meanbio"
- the mean of each gene at each location estimated from the biological component of the model. -
adj.method = "medbio"
- the median of each gene at each location estimated from the biological component of the model.
These data are stored in the logcounts
assay of the
SpatialExperiment object. After normalisation, we see that MOBP is
enriched in the white matter.
p_logpac = plotSpatial(
HumanDLPFC,
colour = MOBP,
what = "expression",
assay = "logcounts",
size = 0.5
) +
scale_colour_viridis_c(option = "F") +
ggtitle("logPAC")
p_region + p_logpac
Computing alternative adjustments using a precomputed SpaNorm fit
As no appropriate slot exists for storing model parameters, we currently save them in the metadata slot with the name “SpaNorm”. This also means that subsetting features (i.e., genes) or observations (i.e., cells/spots/loci) does not subset the model. In such an instance, the SpaNorm function will realise that the model no longer matches the data and re-estimates when called. If instead the model is valid for the data, the existing fit is extracted and reused.
The fit can be manually retrieved as below for users wishing to reuse
the model outside the SpaNorm framework. Otherwise, calling
SpaNorm()
on an object containing the fit will
automatically use it.
# manually retrieve model
fit.spanorm = metadata(HumanDLPFC)$SpaNorm
fit.spanorm
#> SpaNormFit
#> Data: 2508 genes, 4015 cells/spots
#> Gene model: nb
#> Degrees of freedom (TPS): 6
#> Spots/cells sampled: 25%
#> Regularisation parameter: 1e-04
#> Batch: NULL
#> log-likelihood (per-iteration): num [1:3] -2380903 -2375068 -2375058
#> W: num [1:4015, 1:73] 0.2645 0.4736 0.0547 -0.1756 0.6039 ...
#> W: - attr(*, "dimnames")=List of 2
#> W: ..$ : chr [1:4015] "1" "2" "3" "4" ...
#> W: ..$ : chr [1:73] "logLS" "bs.xy1" "bs.xy2" "bs.xy3" ...
#> alpha: num [1:2508, 1:73] 1.02 1.02 1.02 1.02 1.02 ...
#> gmean: num [1:2508] -1.17 -1.181 -1.227 -1.5 -0.359 ...
#> psi: num [1:2508] 9.77e-05 9.77e-05 9.77e-05 9.77e-05 1.30e-02 ...
#> wtype: Factor w/ 3 levels "batch","biology",..: 3 2 2 2 2 2 2 2 2 2 ...
When a valid fit exists in the object, only the adjustment step is
performed. The model is recomputed if overwrite = TRUE
or
any of the following parameters change: degrees of freedom
(df.tps
), penalty parameters(lambda.a
), object
dimensions, or batch
specification. Alternative adjustments
can be computed as below and stored to the logcounts
assay.
# Pearson residuals
HumanDLPFC = SpaNorm(HumanDLPFC, adj.method = "pearson")
p_pearson = plotSpatial(
HumanDLPFC,
colour = MOBP,
what = "expression",
assay = "logcounts",
size = 0.5
) +
scale_colour_viridis_c(option = "F") +
ggtitle("Pearson")
# meanbio residuals
HumanDLPFC = SpaNorm(HumanDLPFC, adj.method = "meanbio")
p_meanbio = plotSpatial(
HumanDLPFC,
colour = MOBP,
what = "expression",
assay = "logcounts",
size = 0.5
) +
scale_colour_viridis_c(option = "F") +
ggtitle("Mean biology")
# meanbio residuals
HumanDLPFC = SpaNorm(HumanDLPFC, adj.method = "medbio")
p_medbio = plotSpatial(
HumanDLPFC,
colour = MOBP,
what = "expression",
assay = "logcounts",
size = 0.5
) +
scale_colour_viridis_c(option = "F") +
ggtitle("Median biology")
p_region + p_counts + p_logpac + p_pearson + p_meanbio + p_medbio + plot_layout(ncol = 3)
The mean biology adjustment shows a significant enrichment of the MOBP gene in the white matter. As the overall counts of this gene are low in this sample, other methods show less discriminative power.
Varying model complexity
The complexity of the spatial smoothing function is determined by the
df.tps
parameter where larger values result in more
complicated functions (default 6).
# df.tps = 2
HumanDLPFC_df2 = SpaNorm(HumanDLPFC, df.tps = 2)
p_logpac_2 = plotSpatial(
HumanDLPFC,
colour = MOBP,
what = "expression",
assay = "logcounts",
size = 0.5
) +
scale_colour_viridis_c(option = "F") +
ggtitle("logPAC (df.tps = 2)")
# df.tps = 6 (default)
p_logpac_6 = p_logpac +
ggtitle("logPAC (df.tps = 6)")
p_logpac_2 + p_logpac_6
Enhancing signal
As the counts for the MOBP gene are very low, we see artifacts in the adjusted counts. As we have a model for the genes, we can increase the signal by adjusting all means by a constant factor. Applying a scale factor of 4 shows how the adjusted data are more continuous, with significant enrichment in the white matter.
# scale.factor = 1 (default)
HumanDLPFC = SpaNorm(HumanDLPFC, scale.factor = 1)
p_logpac_sf1 = plotSpatial(
HumanDLPFC,
colour = MOBP,
what = "expression",
assay = "logcounts",
size = 0.5
) +
scale_colour_viridis_c(option = "F") +
ggtitle("logPAC (scale.factor = 1)")
# scale.factor = 4
HumanDLPFC = SpaNorm(HumanDLPFC, scale.factor = 4)
p_logpac_sf4 = plotSpatial(
HumanDLPFC,
colour = MOBP,
what = "expression",
assay = "logcounts",
size = 0.5
) +
scale_colour_viridis_c(option = "F") +
ggtitle("logPAC (scale.factor = 4)")
p_logpac_sf1 + p_logpac_sf4 + plot_layout(ncol = 2)
Session information
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.5 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
#>
#> locale:
#> [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
#> [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
#> [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] SpatialExperiment_1.14.0 SingleCellExperiment_1.26.0
#> [3] SummarizedExperiment_1.34.0 Biobase_2.64.0
#> [5] GenomicRanges_1.56.2 GenomeInfoDb_1.40.1
#> [7] IRanges_2.38.1 S4Vectors_0.42.1
#> [9] BiocGenerics_0.50.0 MatrixGenerics_1.16.0
#> [11] matrixStats_1.4.1 patchwork_1.3.0
#> [13] ggplot2_3.5.1 SpaNorm_0.99.2
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.2.1 viridisLite_0.4.2
#> [3] dplyr_1.1.4 farver_2.1.2
#> [5] fastmap_1.2.0 bluster_1.14.0
#> [7] digest_0.6.37 rsvd_1.0.5
#> [9] lifecycle_1.0.4 cluster_2.1.6
#> [11] statmod_1.5.0 magrittr_2.0.3
#> [13] compiler_4.4.1 rlang_1.1.4
#> [15] sass_0.4.9 tools_4.4.1
#> [17] igraph_2.0.3 utf8_1.2.4
#> [19] yaml_2.3.10 knitr_1.48
#> [21] dqrng_0.4.1 S4Arrays_1.4.1
#> [23] labeling_0.4.3 htmlwidgets_1.6.4
#> [25] DelayedArray_0.30.1 RColorBrewer_1.1-3
#> [27] abind_1.4-8 BiocParallel_1.38.0
#> [29] withr_3.0.1 desc_1.4.3
#> [31] grid_4.4.1 fansi_1.0.6
#> [33] beachmat_2.20.0 colorspace_2.1-1
#> [35] edgeR_4.2.2 scales_1.3.0
#> [37] cli_3.6.3 rmarkdown_2.28
#> [39] crayon_1.5.3 ragg_1.3.3
#> [41] generics_0.1.3 metapod_1.12.0
#> [43] httr_1.4.7 rjson_0.2.23
#> [45] DelayedMatrixStats_1.26.0 scuttle_1.14.0
#> [47] cachem_1.1.0 zlibbioc_1.50.0
#> [49] splines_4.4.1 parallel_4.4.1
#> [51] BiocManager_1.30.25 XVector_0.44.0
#> [53] vctrs_0.6.5 Matrix_1.7-0
#> [55] jsonlite_1.8.9 BiocSingular_1.20.0
#> [57] BiocNeighbors_1.22.0 irlba_2.3.5.1
#> [59] systemfonts_1.1.0 magick_2.8.5
#> [61] locfit_1.5-9.10 limma_3.60.6
#> [63] jquerylib_0.1.4 glue_1.8.0
#> [65] pkgdown_2.1.1 codetools_0.2-20
#> [67] gtable_0.3.5 UCSC.utils_1.0.0
#> [69] ScaledMatrix_1.12.0 munsell_0.5.1
#> [71] tibble_3.2.1 pillar_1.9.0
#> [73] htmltools_0.5.8.1 GenomeInfoDbData_1.2.12
#> [75] R6_2.5.1 textshaping_0.4.0
#> [77] sparseMatrixStats_1.16.0 evaluate_1.0.1
#> [79] lattice_0.22-6 highr_0.11
#> [81] BiocStyle_2.32.1 scran_1.32.0
#> [83] bslib_0.8.0 Rcpp_1.0.13
#> [85] SparseArray_1.4.8 xfun_0.48
#> [87] fs_1.6.4 prettydoc_0.4.1
#> [89] pkgconfig_2.0.3