Part I: Introduction to imaging-based spatial transcriptomics analysis (Bruker CosMx NSCLC)

Dharmesh D. Bhuva

Frazer Institute, The University of QueenslandSAiGENCI, The University of Adelaide
dharmesh.bhuva@adelaide.edu.au

Chin Wee Tan

Walter and Eliza Hall InstituteThe University of Melbourne
cwtan@wehi.edu.au

Nov 2025

Source: vignettes/workflow_spatial_cosmx.Rmd

workflow_spatial_cosmx.Rmd

R version: R version 4.5.2 (2025-10-31)
Bioconductor version: 3.22
Package version: 1.0.0

Introduction

Spatial molecular technologies have revolutionised the study of biological systems. We can now study cellular and sub-cellular structures in complex systems. However, these technologies generate complex datasets that require new computational approaches and ideas to analyse. Many of these pipelines are still being developed therefore what is state-of-the-art now will not remain so for too long. However, the biological aim of analysis will mostly remain the same with more being uncovered as computational methods advance. This workshop intends to demonstrate the types of biological questions that can be studied using spatial transcriptomics datasets. Our intention is NOT to use the state-of-the-art methods, but to show you how data can be analysed to study spatial hypotheses. As users get familiar with these ideologies, they can swap out older tools with newer, more powerful tools. As instructors, we hope to update this workshop as more and better tools become available. We look forward to receiving community contributions as well.

Pre-processing steps already performed

Data from the instrument comes in raw images that are often summerised to transcript tables. These tables are then coupled with immunofluorescence images for a few key proteins or marker such as DAPI which stains the nucleus, Pan-Cytokeratin which stains the membrane of cancer cells, a cocktail of membrane proteins, markers such as CD45 which stain immune cells, and a cocktail of cytoplasmic markers (technology specific). These proteins are used to perform cell segmentation as the measurements are obtained at the transcript level and not the cell level (as is the case with single-cell RNA-seq data). Platforms provide segmented data and users can opt to perform their own segmentation if needed. If you wish you re-segment the data yourself, you could try using methods such as BIDCell (Fu et al. 2024) that use a single cell reference dataset to perform segmentation using the marker proteins as well as transcriptomics measurements. Do note that vendors are constantly improving segmentation and recent improvements are accurate on difficult to segment cells such as neurons and astrocytes.

We will use the already segmented cell-level count data. We will also use a previously prepared SpatialExperiment object. If users want to prepare their own object, all they need is the cell matrix file, the feature table, the cell table and a table of spatial coordinates (if this information is not already in the cell table). With these, information on preparing your own SpatialExperiment object can be found in this package’s vignette. The design for this object is shown in the figure below.

SpatialExperiment object schematic

Loading prepared data

The data we will use in this workshop is from the Bruker/Nanostring CosMx SMI platform. Since the objective of this workshop is to introduce you to spatial transcriptomics analysis, we do not work with the full dataset, which contains over 700,000 cells from 8 samples and instead opt to work with a smaller subset of one sample. This region contains interesting features and is therefore enough to showcase a complete spatial transcriptomics analysis workflow. The data used in this workshop comes from 4 fields of view (FOVs) from the sample labeled Lung5_Rep3. This window is shown in the context of the full tissue below.

The CosMx non-small cell lung cancer (NSCLC) dataset.

library(SpatialExperiment)

data(lung5) # replicate 1
# preview object
lung5

## class: SpatialExperiment 
## dim: 980 13463 
## metadata(0):
## assays(1): counts
## rownames(980): VTN TGFBR2 ... ADGRG2 NegPrb8
## rowData names(2): gene genetype
## colnames(13463): Lung5_Rep3_16_1 Lung5_Rep3_16_10 ... Lung5_Rep3_22_998
##   Lung5_Rep3_22_999
## colData names(7): sample_id cell_id ... Level2 fov
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## spatialCoords names(2) : x y
## imgData names(1): sample_id

# retrieve column (cell) annotations
colData(lung5)

## DataFrame with 13463 rows and 7 columns
##                      sample_id     cell_id              region  technology
##                    <character> <character>         <character> <character>
## Lung5_Rep3_16_1     Lung5_Rep3        16_1              Tumour       CosMx
## Lung5_Rep3_16_10    Lung5_Rep3       16_10              Tumour       CosMx
## Lung5_Rep3_16_100   Lung5_Rep3      16_100 Abnormal epithelium       CosMx
## Lung5_Rep3_16_1000  Lung5_Rep3     16_1000 Abnormal epithelium       CosMx
## Lung5_Rep3_16_1001  Lung5_Rep3     16_1001 Abnormal epithelium       CosMx
## ...                        ...         ...                 ...         ...
## Lung5_Rep3_22_995   Lung5_Rep3      22_995              Stroma       CosMx
## Lung5_Rep3_22_996   Lung5_Rep3      22_996              Stroma       CosMx
## Lung5_Rep3_22_997   Lung5_Rep3      22_997              Stroma       CosMx
## Lung5_Rep3_22_998   Lung5_Rep3      22_998              Stroma       CosMx
## Lung5_Rep3_22_999   Lung5_Rep3      22_999              Stroma       CosMx
##                                    Level1              Level2         fov
##                               <character>         <character> <character>
## Lung5_Rep3_16_1    Tumour area with str.. Epithelium - panCK+       fov16
## Lung5_Rep3_16_10   Tumour area with str..              Stroma       fov16
## Lung5_Rep3_16_100  Abnormal but not tum..              Stroma       fov16
## Lung5_Rep3_16_1000 Abnormal but not tum..              Stroma       fov16
## Lung5_Rep3_16_1001 Abnormal but not tum..              Stroma       fov16
## ...                                   ...                 ...         ...
## Lung5_Rep3_22_995                      NA              Stroma       fov22
## Lung5_Rep3_22_996                      NA              Stroma       fov22
## Lung5_Rep3_22_997                      NA              Stroma       fov22
## Lung5_Rep3_22_998                      NA              Stroma       fov22
## Lung5_Rep3_22_999                      NA              Stroma       fov22

# retrieve row (gene/probe) annotations
rowData(lung5)

## DataFrame with 980 rows and 2 columns
##                gene    genetype
##         <character> <character>
## VTN             VTN        Gene
## TGFBR2       TGFBR2        Gene
## JUNB           JUNB        Gene
## FKBP11       FKBP11        Gene
## REG1A         REG1A        Gene
## ...             ...         ...
## TSC22D1     TSC22D1        Gene
## CD40           CD40        Gene
## GDNF           GDNF        Gene
## ADGRG2       ADGRG2        Gene
## NegPrb8     NegPrb8      NegPrb

# retrieve spatial coordinates
head(spatialCoords(lung5))

##                            x        y
## Lung5_Rep3_16_1    -8679.934 12706.31
## Lung5_Rep3_16_10   -8894.582 12709.94
## Lung5_Rep3_16_100  -7189.803 12549.21
## Lung5_Rep3_16_1000 -8028.512 11347.95
## Lung5_Rep3_16_1001 -7470.037 11347.64
## Lung5_Rep3_16_1002 -6631.154 11351.96

# retrieve counts - top 5 cells and genes
counts(lung5)[1:5, 1:5]

## 5 x 5 sparse Matrix of class "dgCMatrix"
##        Lung5_Rep3_16_1 Lung5_Rep3_16_10 Lung5_Rep3_16_100 Lung5_Rep3_16_1000
## VTN                  .                .                 2                  .
## TGFBR2               .                1                 .                  .
## JUNB                 .                1                 .                  1
## FKBP11               .                .                 .                  .
## REG1A                .                .                 .                  .
##        Lung5_Rep3_16_1001
## VTN                     .
## TGFBR2                  .
## JUNB                    .
## FKBP11                  .
## REG1A                   .

library(ggplot2)
library(patchwork)
library(SpaNorm)

# set default point size to 0.5 for this report
update_geom_defaults(geom = "point", new = list(size = 0.5))
plotSpatial(lung5, colour = region) +
  scale_colour_brewer(palette = "Dark2") +
  # improve legend visibility
  guides(colour = guide_legend(override.aes = list(shape = 15, size = 5)))

For users interested in analysing the whole dataset, it can be obtained using the code below.

Please note that the full dataset contains over 500,000 cells and will not scale for the purpose of this workshop. If interested, run the code below on your own laptop.

## DO NOT RUN during the workshop as the data are large and could crash your session
if (!require("ExperimentHub", quietly = TRUE)) {
  BiocManager::install("ExperimentHub")
}
if (!require("SubcellularSpatialData", quietly = TRUE)) {
  BiocManager::install("SubcellularSpatialData")
}

# load the metadata from the ExperimentHub
eh = ExperimentHub()
# download the transcript-level data
tx = eh[["EH8232"]]
# filter the Lung5_Rep1 sample
tx = tx[tx$sample_id == "Lung5_Rep1", ]
# summarise measurements for each cell
lung5 = tx2spe(tx, bin = "cell")

Quality control

Most spatial transcriptomics assays include negative control probes to aid in quality control. These probes do not bind to the transcripts of interest (signal), so represent noise in the assay. There are three different types of negative controls in the 10x Xenium assay. As these vary across technologies and serve different purposes, we treat them equally for our analysis and assess them collectively. The breakdown of the different types of negative control probes is given below.

table(rowData(lung5)$genetype)

## 
##   Gene NegPrb 
##    960     20

We now use compute and visualise quality control metrics for each cell. Some of the key metrics we assess are:

sum or total - Total number of transcripts and negative control probes per cell (more commonly called the library size).
detected - Number of transcripts and negative control probes detected per cell (count > 0).
subsets_neg_sum - Count of negative control probes (noise) per cell.
subsets_neg_detected - Number of negative control probes expressed per cell (count > 0).
subsets_neg_percent - subsets_neg_sum / sum.

Additionally, the vendor (10x) provides their own summary of quality through the qv score. This is usually provided per detected measurements but we compute the average per cell when deriving cellular summaries. Likewise, we compute the number of transcripts allocated to the nucleus (overlaps_nucleus).

library(scater)
library(scuttle)

# identify the negative control probes
is_neg = rowData(lung5)$genetype != "Gene"
# compute QC metrics for each cell
lung5 = addPerCellQCMetrics(lung5, subsets = list("neg" = is_neg))

Quality control is all about visualisation!

One of the best ways to understand the data and identify problematic cells or regions is to visualise the QC metrics. We need to visualise the metrics in various contexts to better understand what cells appear of poor qaulity. As such, we usually plot metrics against each other, we visualise them in space, and we visualise them in the context of the higher dimensional structure of the data. To aid the latter, we will compute lower dimensional representations using the uniform manifold approximation and projection (UMAP). As the UMAP embedding can be computationally intensive, we usually perform PCA prior to UMAP. This approach to quality control is identical to that used for single-cell RNA-seq analysis.

# compute log counts - to fix the skew with counts
logcounts(lung5) = log2(counts(lung5) + 1)
# compute PCA
lung5 = runPCA(lung5, ncomponents = 30)
# compute UMAP
lung5 = runUMAP(lung5, dimred = "PCA")

We plot the sum (library size) against the number of genes detected, and colouring the points based on the proportion of negative control probes (subsets_neg_percent). We see that as more genes are detected, more measurements are obtained (library size increases). There are cells where the library size is very low and the number of genes detected is low as well. Considering the smaller panel size of 960 target genes in this assay, we expect the number of genes detected in some cell types to be small. However, if we detect very few genes in a cell, we are less likely to obtain any useful information from them. As such, poor quality cells can be defined as those with very low library sizes and number of genes detected.

plotColData(lung5, "detected", "sum", colour_by = "region", scattermore = TRUE) +
  scale_colour_brewer(palette = "Dark2") +
  guides(colour = guide_legend(override.aes = list(shape = 15, size = 5), title = "Region")) +
  geom_hline(yintercept = 50, lty = 2) +
  geom_vline(xintercept = 75, lty = 2)

However, we need to identify appropriate thresholds to classify poor quality cells. As the panel sizes and compositions for these emerging technologies are variable, there are no standards as of yet for filtering. For instance, immune cells will always have smaller library sizes and potentially a smaller number of genes expressed in a targetted panel as each cell type may be represented by a few marker genes. Additionally, these cells are smaller in size therefore could be filtered out using an area-based filter. Desipte the lack of standards, the extreme cases where cells express fewer than 50 genes and have fewer than 75 detections are clearly of poor quality. As such, we can start with these thresholds and adjust based on other metrics and visualisations.

library(standR)

# UMAP plot
p1 = plotDR(lung5, dimred = "UMAP", colour = sum, alpha = 0.75, size = 0.5) +
  scale_colour_viridis_c(option = "F") +
  labs(x = "UMAP1", y = "UMAP2") +
  theme(legend.position = "bottom")
# spatial plot
p2 = plotSpatial(lung5, colour = sum, alpha = 0.75) +
  scale_colour_viridis_c(option = "F") +
  theme(legend.position = "bottom")
p1 + p2 + plot_annotation(title = "Sum (library size)")

# UMAP plot
p1 = plotDR(lung5, dimred = "UMAP", colour = detected, alpha = 0.75, size = 0.5) +
  scale_colour_viridis_c(option = "F") +
  labs(x = "UMAP1", y = "UMAP2") +
  theme(legend.position = "bottom")
# spatial plot
p2 = plotSpatial(lung5, colour = detected, alpha = 0.75) +
  scale_colour_viridis_c(option = "F") +
  theme(legend.position = "bottom")
p1 + p2 + plot_annotation(title = "Detected (number of genes expressed)")

# UMAP plot
p1 = plotDR(lung5, dimred = "UMAP", colour = subsets_neg_percent, alpha = 0.75, size = 0.5) +
  scale_colour_viridis_c(option = "F", limits = c(0, 5), oob = scales::squish) +
  labs(x = "UMAP1", y = "UMAP2") +
  theme(legend.position = "bottom")
# spatial plot
p2 = plotSpatial(lung5, colour = subsets_neg_percent, alpha = 0.75) +
  scale_colour_viridis_c(option = "F", limits = c(0, 5), oob = scales::squish) +
  theme(legend.position = "bottom")
p1 + p2 + plot_annotation(title = "Proportion of negative control probes")

Having visualised the quality metrics and their distrition across space and the genomic dimensions, we can now begin to test threshold values. We want to remove low quality cells but we have to be careful not to introduce bias (e.g., by preferentially removing biological entities such as regions or cells).

Try modifying the thresholds below to see the effect of over- or under- filtering.

What happens when you set the sum threshold to 300?
What about setting the detected threshold to 100?

thresh_sum = 75
thresh_detected = 50
thresh_neg = 5

# classify cells to keep
lung5$keep = lung5$sum > thresh_sum & lung5$detected > thresh_detected & lung5$subsets_neg_percent < thresh_neg

# UMAP plot
p1 = plotDR(lung5, dimred = "UMAP", colour = keep, alpha = 0.75, size = 0.5) +
  scale_color_brewer(palette = "Set1") +
  labs(x = "UMAP1", y = "UMAP2") +
  theme(legend.position = "bottom") +
  guides(colour = guide_legend(override.aes = list(shape = 15, size = 5)))
# spatial plot
p2 = plotSpatial(lung5, colour = keep, alpha = 0.75) +
  scale_color_brewer(palette = "Set1") +
  theme(legend.position = "bottom") +
  guides(colour = guide_legend(override.aes = list(shape = 15, size = 5)))
p1 + p2 + plot_annotation(title = "Cells retained after QC filtering")

In practice, after poor quality cells are filtered out, we tend to see clearer separation of clusters in the UMAP (or equivalent visualisations). We also notice that poor quality cells often seem like intermmediate states between clusters in these plots. Additionally, they tend to be randomly distributed in space. If they are concentrated in some sub-regions (as above - along the field of view boundaries), it is usually due to technical reasons. In the above plot, we poor quality primarily occurs along the field of view boundaries. It is unlikely that biology would be present in such a regular shape therefore we can use that region as a guide in selecting thresholds. Having selected the thresholds, we now apply them and remove the negative control probes from donwstream analysis. This phenomenon of lower library size and lower number of genes detected has been observed in the past and is due to cells being cut off from the field of view.

# apply the filtering
lung5 = lung5[, lung5$keep]
# remove negative probes
lung5 = lung5[!is_neg, ]

Newer quality control tools are being developed and are currently under peer-review. These include SpotSweeper (Totty, Hicks, and Guo 2024) and SpaceTrooper. As these software are not deposited to more permanent repositories, we do not use them in this workshop. We advise advanced users to use these tools for their QC as they help come up with additional QC summaries for each cell or region in space. SpaceTrooper for instance, computes a flag score that penalises cells that are too large or small, and whose aspect ratio is large. While these filters may make sense in most systems, they could penalise stomal or neuronal cells that are elongated and whose aspect ratio will be very large. Therefore always remember that QC is relative the biological system being studied.

As these technologies are still emerging, all quality control metrics should be evaluated on a case-by-case basis depending on the platform, panel set, and tissue. Knowing the biology to expect or what is definitely NOT biology helps in making these decisions. As any experienced bioinformatician will tell you, you will vary these thresholds a few times throughout your analysis because you learn more about the data, biology, and platform as you analyse the data. This cyclical approach can introduce biases so always rely on multiple avenues of evidence before changing parameters.

Normalisation

Above we plot the sum, which is commonly known as the library size, for the purpose of quality control. Assessing the library size is helpful in identifying poor quality cells. Beyond QC, variation in library size can affect donwstream analysis as it represents measurement bias. A gene may appear to be highly expressed in some cells simply because more measurements were sampled from the cell, and not because the gene is truly highly expressed. Whether it is sequencing-based or imaging-based transcriptomics, we are always sampling from the true pool of transcripts and are therefore likely to have library size variation.

While library sizes are a good measure of technical variation in conventional sequencing data (bulk and single-cell RNA-seq), they are not the most appropriate in assessing technical effects in targeted panels because the panel is biased towards measuring the biology of interest. As such, we need a purely technical measurement to adjust our data. Negative control probes offer this measure of technical biases in the data as they are designed to not measure any biology (i.e., the are synthetic probes designed to not hybridise with any transcripts in the system).

Using negative control probes to normalise spatial transcriptomics data.

Though it is ideal to use negative control probes for normalisation of targeted spatial transcriptomics data, the data we are analysing is relatively old and from a prototype instrument, therefore, the negative control probe counts are too low to be used for modelling. Additionally, the current Bruker CosMx instrument uses two kinds of negative control probes, conventional negative control probes (such as in this dataset) and false codes. The latter are best used for modelling technical variation in CosMx datasets as their counts are generally higher. We recommend using false code counts for normalising Bruker CosMx datasets.

lung5$logLS = log1p(lung5$sum) - mean(log1p(lung5$sum))
plotSpatial(lung5, colour = logLS) +
  scale_colour_distiller(palette = "PRGn", limits = c(-0.5, 0.5), oob = scales::squish) +
  theme(legend.position = "bottom")

To remove these technical biases, we perform normalisation. As measurements are independent in bulk and single-cell RNA-seq (reactions happen independently for each sample/cell), each cell or sample can be normalised independently by computing sample/cell-specific adjustment factors. Measurements are locally dependent in spatial transcriptomics datasets therefore technical effects (including conventional library size effects) confound biology (Bhuva et al. 2024). The plot above show this clearly as the technical effect demarcate tissue structures. These can be due to various causes such as:

The tissue structure in one region is more rigid, thereby making it difficult for reagents to permeate.
Reagents distributed unevenly due to technical issues leading to a sub-region within a tissue region being under sampled.

In both cases, each cell cannot be normalised independently as nearby cells are likely to experience similar technical effects. To tackle this, our team has developed the SpaNorm normalisation method that uses spatial dependence to decouple technical and biological effects (Salim et al. 2024). SpaNorm uses a spatially constrained regularised generalised linear model to model library size effects and subsequently adjust the data. This is done by calling the SpaNorm() function on a SpatialExperiment object.

A schematic diagram of the SpaNorm normalisation method.

The key parameters for SpaNorm normalisation that need to be adjusted are:

sample.p - This is the proportion of the data that is sub-sampled to fit the regularised GLM. As spatial transcriptomics datasets are very large, the model can be approximated using by sampling. In practice, we see that ~50,000 cells in a dataset with 500,000 cells works well enough under the assumption that all possible cellular states are sampled evenly across the tissue. This is likely to be true in most cases. Below we use 25% of the cells (which is the default as well) to fit the model (you can tweak this to 1 to use all cells).
df.tps - Spatial dependence is modelled using thin plate spline functions. This parameter controls the degrees of freedom (complexity) of the function. Complex functions will fit the data better but could overfit therefore in practice, we see that 6-10 degrees of freedom are generally good enough. If the degrees of freedom are increased, regularisation should also be increased (lambda.a parameter). We also allow separate specification of the degrees of freedom for the spline functions for biological and technical effects. Below we specify 8 d.f. for biology and 3 d.f. for technical effects. This works on the assumption that biology will vary more sharply across space than technical effects, which we expect to be relatively smooth.

set.seed(1000)
# perform SpaNorm normalisation
lung5 = SpaNorm(lung5, df.tps = c(8, 3), sample.p = 0.25, batch = lung5$fov)

Above, the model is fit and the default adjustment, logPAC, is computed. Alternative adjustments such as the mean biology (meanbio), pearson residuals similar to sctransform used in Seurat (pearson), and median biology (median) can be computed as well (this will be faster as the fit does not need to be repeated, however, the same parameters need to be passed to the function).

# compute mean biology
lung5_pearson = SpaNorm(lung5, df.tps = c(8, 3), batch = lung5$fov, adj.method = "pearson")

p1 = plotSpatial(lung5, what = "expression", assay = "logcounts", colour = EPCAM, alpha = 0.75) +
  scale_colour_viridis_c(option = "F") +
  ggtitle("Normalised expression") +
  theme(legend.position = "bottom")
p2 = plotSpatial(lung5_pearson, what = "expression", assay = "logcounts", colour = EPCAM, alpha = 0.75) +
  scale_colour_viridis_c(option = "F", limits = c(NA, 3), oob = scales::squish) +
  ggtitle("Mean from SpaNorm model") +
  theme(legend.position = "bottom")
p1 + p2

Since the SpaNorm model learns the functions for biological and technical effects for each gene, we can also investigate these to ensure that the correct biology is learnt and modelled. The plotCovariate() functions can be used to study these. It is worth remembering that while the biological function is independent of covariates (like an intercept), the technical function is a function of the library size. To enable some interpretation, we show how the function evaluates at the mean library size below.

p1 = plotCovariate(lung5, covariate = "biology", colour = EPCAM, alpha = 0.75) +
  scale_colour_viridis_c(option = "F") +
  ggtitle("Biological variation") +
  theme(legend.position = "bottom")
p2 = plotCovariate(lung5, covariate = "ls", colour = EPCAM, alpha = 0.75) +
  scale_colour_viridis_c(option = "G") +
  ggtitle("Technical variation") +
  theme(legend.position = "bottom")
p3 = plotCovariate(lung5, covariate = "batch", colour = EPCAM, alpha = 0.75) +
  scale_colour_viridis_c(option = "A") +
  ggtitle("Batch variation") +
  theme(legend.position = "bottom")
p1 + p2 + p3

EPCAM is an epithelial cell marker gene that will also be expressed in lung cancer cells that are of epithelial origin. The SpaNorm model correctly models the regions where the cancer cells exist (mean) and is able to normalise technical effects without affecting the biology. The technical function looks similar to the function for biology and this is expected as technical effects due to library size biases confound biology, however, SpaNorm is able to perform the adjustment while preserving most of the biology.

Library size normalisation is important to ensure technical variation is removed from gene expression. In the most extreme cases, a gene that is highly expressed in a region of interest may seem to be lowly expressed because of a drop in library sizes in said region (see manuscript). We have seen that negative probes across most technologies are also capturing library size effects very well and are experimenting with using these to model technical variation and normalise data. In some technologies/datasets, these are very lowly expressed, not because of better data quality but by design as well, and therefore would NOT faithfully represent technical variation. As with every other step, this is still an active area of research.

Identifying spatially variable genes (SVGs)

Not all genes in the study are likely to be meaningful, therefore, a key step in any transcriptomic analysis is to identify informative features/genes. In spatial transcriptomics, we are interest in identifying features that are spatially non-random. These genes should vary spatially along with biology and are thus referred to as spatially variable genes (SVGs). Common approaches to identify these genes use approaches inspired from spatial statistics such as Moran’s I. However, these approaches generally work with unnormalised data and often apply their own non-spatial normalisation to the data prior to identifying SVGs. As we have a spatial normalisation approach, we want to retain our normalisation.

We have recently developed a model-based SVG detection approach which makes use of the model fit for the normalisation step of SpaNorm. Note that the SpaNorm model contains coefficients to represent biological and technical variation. Genes that are spatially variable, and therefore biologically informative should have non-zero coefficients for biology. We can test for the importance of the coefficients for the biological components using a likelihood ratio test (LRT) by comparing the SpaNorm model (Model 1 below) to a nested model (Model 2) containing only the technical components. This allows us to identify statistically significant spatial variation. The SpaNormSVG() function fits this additional nested model, compares it to the full SpaNorm model, and performs the likelihood ratio test to identify spatially variable genes (SVGs). Results of the test are stored in the rowData of the SpatialExperiment object, but can also be retrieved using the topSVGs() function.

A schematic diagram of the SpaNormSVG normalisation method.

set.seed(1000)
lung5 = SpaNormSVG(lung5)
topsvgs = topSVGs(lung5, n = 9, fdr = 0.05)

lapply(rownames(topsvgs), function(gene) {
  plotSpatial(
    lung5,
    what = "expression",
    assay = "logcounts",
    colour = !!sym(gene),
    size = 0.2
  ) +
    scale_colour_viridis_c(option = "F")
}) |> 
  wrap_plots(ncol = 3)

Dimensional reduction using GLM-PCA

Spatial transcriptomics data is usually in the form of sparse counts. Principal components analysis (PCA), which relies on Euclidean geometry is not appropriate with such data. Subsequently, PCA is usually performed on log normalised counts, a transformation which makes the data more amenable to PCA. However, this approach also introduces biases to the PCA estimation (discussed in depth in (Townes et al. 2019)). The GLM-PCA approach was proposed to compute PCA directly on counts, however, the approach is computationally intensive. A more practical approach was proposed by (Townes et al. 2019) which approximated the GLM-PCA by computing PCA on the deviance residuals from a null model (a model consisting of technical variation alone). This model is already fit to identify SVGs in the previous step (Model 2) therefore GLM-PCA can be easily computed. The SpaNormPCA() function can be used to compute GLM-PCA. In most cases, it will be comparable to PCA on the logPAC counts generated by SpaNorm, however, the GLM-PCA approximation is theoretically more rigorous and will poses better properties as discussed in (Townes et al. 2019).

lung5 = SpaNormPCA(lung5, svg.fdr = 0.1, ncomponents = 20, residuals = "pearson")
lung5 = runUMAP(lung5, min_dist = 0.3)

# UMAP plot
p1 = plotDR(lung5, dimred = "UMAP", colour = region, alpha = 0.75, size = 0.5) +
  scale_color_brewer(palette = "Dark2") +
  guides(colour = guide_legend(override.aes = list(shape = 15, size = 5))) +
  labs(x = "UMAP1", y = "UMAP2") +
  theme(legend.position = "bottom")
p2 = plotDR(lung5, dimred = "UMAP", colour = fov, alpha = 0.75, size = 0.5) +
  scale_color_brewer(palette = "Set2") +
  guides(colour = guide_legend(override.aes = list(shape = 15, size = 5))) +
  labs(x = "UMAP1", y = "UMAP2") +
  theme(legend.position = "bottom")
p1 + p2

Cell typing

Cell typing can be performed using various approaches. Most of these were developed for the analysis of single-cell RNA-seq datasets and port across well to spatial transcriptomics datasets. The two common strategies are reference-based and reference-free annotations. In the former, annotations are mapped across from a previously annotated dataset. For a thorough discussion on this topic, review the Orchestrating Single-Cell Analysis (OSCA) with Bioconductor book (Amezquita et al. 2020). Here, we use the SingleR package (Aran et al. 2019) along with a reference non-small cell lung cancer (NSCLC) dataset (salcher2022high?).

SingleR is a simple yet powerful algorithm that only requires a single prototype of what the cell type’s expression profile should look like. As such, counts are aggregated (pseudobulked) across cells to form a reference dataset where a single expression vector is created for each cell type. Aggregation is performed using the scater::aggregateAcrossCells function. While the original dataset contains >96,000 cells, we only provide the aggregated counts for this workshop. Aggregated counts tend to be more robust to technical variation and are therefore more suitable for cell typing.

library(scran)

# load reference dataset - Salcher et al., Cancer Cell, 2022
data(ref_luca)
# normalise the reference dataset
cl = quickCluster(ref_luca)
ref_luca = computeSumFactors(ref_luca, cluster = cl)
ref_luca = logNormCounts(ref_luca)
# cell types in the reference dataset
levels(ref_luca$cell_type)

##  [1] "Alveolar cell type 1"       "Alveolar cell type 2"      
##  [3] "B cell"                     "B cell dividing"           
##  [5] "Ciliated"                   "Club"                      
##  [7] "DC mature"                  "Endothelial cell arterial" 
##  [9] "Endothelial cell capillary" "Endothelial cell lymphatic"
## [11] "Endothelial cell venous"    "Fibroblast adventitial"    
## [13] "Fibroblast alveolar"        "Fibroblast peribronchial"  
## [15] "Macrophage"                 "Macrophage alveolar"       
## [17] "Mast cell"                  "Mesothelial"               
## [19] "Monocyte classical"         "Monocyte non-classical"    
## [21] "NK cell"                    "Neutrophils"               
## [23] "Pericyte"                   "Plasma cell"               
## [25] "Plasma cell dividing"       "ROS1+ healthy epithelial"  
## [27] "Smooth muscle cell"         "T cell CD4"                
## [29] "T cell CD8"                 "T cell dividing"           
## [31] "T cell regulatory"          "Tumor cells"               
## [33] "cDC1"                       "cDC2"                      
## [35] "myeloid dividing"           "pDC"                       
## [37] "stromal dividing"           "transitional club/AT2"

library(SingleR)

# predict cell types using SingleR
prediction = SingleR(test = lung5, ref = ref_luca, labels = ref_luca$cell_type)
# add prediction to the SpatialExperiment object
lung5$cell_type_fine = prediction$labels

Since the reference datasets contains highly detailed cell annotation, we are able to annotate our data to such a high degree. However, for the sake of demonstration, we simplify the annotation to collapse them to a higher order classification below.

# load the cell type map
data(luca_map)
lung5$cell_type = luca_map[lung5$cell_type_fine]

plotSpatial(lung5, colour = cell_type, alpha = 0.75) +
  ggtitle("Predicted cell type") +
  scale_colour_brewer(palette = "Paired") +
  theme(legend.position = "bottom") +
  guides(colour = guide_legend(override.aes = list(shape = 15, size = 5, alpha = 1)))

Though discrete cell type identities are appealing due to their ease of interpretation, in many biological systems, metastable cell states (also referred to as plasticity) are likely to be present. Thus, such discrete cell type annotation may force cells into states that are not representative of true biology. PhiSpace, covered in the second part of this workshop series, is a method developed to deal with this issue. PhiSpace assigns a continuous score for each cell belonging to a cell type. These scores can either be thought of as the degree of evidence towards a cell conforming to a certain cell type, or the location of a cell in a plastic trajectory. In cancer sytems, where plasticity is dominant due to regression to stem-like states, the continuous cell annotation PhiSpace provides is powerful. Part 2 of this workshop describes PhiSpace in detail and demonstrates the types of analyses it powers.

Spatial domain identification

Regions were annotated by an expert pathologist in our dataset. However, this is not always feasible, either due to access to a pathologist, or because the features are not visible in histology images alone. In such instances, a computational approach is required to identify spatial domains or cellular niches. These domains are generally, but not always, defined based on the homogeneity of biology. As this is still a developing area, there are many algorithms available, however, they generally fall within two broad categories of methods:

Cell type-based - These methods use the cell type information to identify regions where the cell type composition is homogeneous.
Expression-based - These methods use the full expression profile of cells to identify regions of the tissue with homogeneous gene expression profiles.

Approach 1 - Cell type-based domain identification

These methods use the cell types annotated in the previous type to determine the homogeneity of cell types within a local region of space around each cell. The hoodscanR method is one such method that computes the probability of each cell type being in a local neighbourhood of each cell in the dataset (Liu et al. 2024). The neighbourhood of each cell is determined based on the k parameter which specifies how many of the nearest neighbouring cells should be used to estimate the cell type probabilities.

library(hoodscanR)

# find neighbouring cells using an approximate nearest neighbour search
set.seed(1000)
nb_cells = findNearCells(lung5, k = 100, anno_col = "cell_type")
nb_cells$cells[1:5, 1:5]

##                    nearest_cell_1      nearest_cell_2      nearest_cell_3
## Lung5_Rep3_16_1        Fibroblast          Fibroblast          Fibroblast
## Lung5_Rep3_16_10       Fibroblast         Endothelial          Fibroblast
## Lung5_Rep3_16_100     Endothelial          Fibroblast Monocyte/Macrophage
## Lung5_Rep3_16_1000         T cell          Fibroblast          Fibroblast
## Lung5_Rep3_16_1001         B cell Monocyte/Macrophage              T cell
##                    nearest_cell_4 nearest_cell_5
## Lung5_Rep3_16_1        Fibroblast     Fibroblast
## Lung5_Rep3_16_10       Fibroblast     Fibroblast
## Lung5_Rep3_16_100          T cell        Myeloid
## Lung5_Rep3_16_1000         Plasma         Plasma
## Lung5_Rep3_16_1001        Myeloid    Endothelial

nb_cells$distance[1:5, 1:5]

##                    nearest_cell_1 nearest_cell_2 nearest_cell_3 nearest_cell_4
## Lung5_Rep3_16_1          59.66036       71.23732       93.28319      105.50952
## Lung5_Rep3_16_10         79.42413       80.90802       87.58466      103.09469
## Lung5_Rep3_16_100        42.83032       47.33316       66.48095       76.16124
## Lung5_Rep3_16_1000       58.31700       66.59987       66.92916       69.88571
## Lung5_Rep3_16_1001       65.97415       69.78375       85.52106       89.67675
##                    nearest_cell_5
## Lung5_Rep3_16_1         116.67380
## Lung5_Rep3_16_10        110.36708
## Lung5_Rep3_16_100        82.58831
## Lung5_Rep3_16_1000      100.66268
## Lung5_Rep3_16_1001      103.71212

# estimate proportion of each cell type in the neighbourhood of each cell
probmat = scanHoods(nb_cells$distance)
dim(probmat)

## [1] 11721   100

# compute probability of each each cell type in the neighbour hood of each cell
neighbourhoods = mergeByGroup(probmat, nb_cells$cells)
neighbourhoods[1:5, 1:5]

##                        B cell           DC Endothelial Fibroblast
## Lung5_Rep3_16_1    0.00000000 0.0958606703  0.02710968  0.6440314
## Lung5_Rep3_16_10   0.00000000 0.0504909298  0.17678046  0.5933056
## Lung5_Rep3_16_100  0.00601473 0.0002870623  0.36437939  0.1244843
## Lung5_Rep3_16_1000 0.01026742 0.0002508496  0.05354242  0.3221676
## Lung5_Rep3_16_1001 0.12641712 0.1079731780  0.08390277  0.2046618
##                    Monocyte/Macrophage
## Lung5_Rep3_16_1            0.013978776
## Lung5_Rep3_16_10           0.049942312
## Lung5_Rep3_16_100          0.188404569
## Lung5_Rep3_16_1000         0.009734457
## Lung5_Rep3_16_1001         0.080494328

# add neighbourhood information to the SpatialExperiment object
lung5 = mergeHoodSpe(lung5, neighbourhoods)

Once these probabilies are assigned around each cell, a secondary clustering can be performed using any desired method. The package currently only supports k-means clustering on the neighbourhood matrix but advanced users can cluster the matrix directly using their preferred method.

# identify spatial domains based on neighbourhood composition (5 clusters)
set.seed(1000)
lung5 = clustByHood(lung5, pm_cols = colnames(neighbourhoods), k = 5, val_name = "hoodscanr_region")
# plot predictions
p1 = plotSpatial(lung5, colour = hoodscanr_region, alpha = 0.75) +
  scale_colour_brewer(palette = "Set1", na.value = "#333333", name = "predicted") +
  ggtitle("Predicted pathology") +
  theme(legend.position = "bottom") +
  guides(colour = guide_legend(nrow = 3, override.aes = list(shape = 15, size = 5)))
p2 = plotSpatial(lung5, colour = region) +
  scale_colour_brewer(palette = "Dark2") +
  ggtitle("Pathology annotations") +
  guides(colour = guide_legend(nrow = 3, override.aes = list(shape = 15, size = 5)))
p1 + p2 + plot_annotation(title = "Spatial domain identification")

The plot above shows that hoodscanR is able to differentiate the key regions and is clearly able to delineate the tumour regions. It is also able to identify more sophosticated structures such as region 4 which was not identified in the pathological annotation.

plotProbDist(lung5, pm_cols = colnames(neighbourhoods), by_cluster = TRUE, plot_all = TRUE, show_clusters = as.character(seq(5)), val_name = "hoodscanr_region")

we see that cluster 5 is entirely defined by the presence of malignant cells while cluster 4 is composed of immune cells, particularly B cells, T cells and plasma cells, indicating the formation of lymphoid aggregates. Lymphoid aggregates are interesting structures in the context of solid cancers, and depending on their expression profile, these can be beneficial for patients.

# calculate metrics of heterogeneity - entropy
lung5 = calcMetrics(lung5, pm_cols = colnames(neighbourhoods))
# plot metric
p1 = plotSpatial(lung5, colour = cell_type, alpha = 0.75) +
  scale_colour_brewer(palette = "Paired", na.value = "#333333") +
  ggtitle("Predicted cell type") +
  theme(legend.position = "bottom") +
  guides(colour = guide_legend(override.aes = list(shape = 15, size = 5)))
p2 = plotSpatial(lung5, colour = entropy, alpha = 0.75) +
  scale_colour_viridis_c(option = "G") +
  ggtitle("Cell type heterogeneity") +
  theme(legend.position = "bottom")
p1 + p2 + plot_annotation(title = "Cell type heterogeneity")

Looking at the heterogeneity of cell types across space, we see that cell composition is very homogeneous within the tumour regions while it is highest in the lymphoid aggregates which are often characterised by the presence of various different types of immune cells. We emphasise here that the homogeneity of the tumour regions is determined based on their cell type call alone and not on their expression profile. As such, the tumours themselves could be heterogeneous if finer cell typing or cell phenotyping were to be performed. Additionally, any errors made in cell typing will propage to domain identification.

The approach that follows allows an unbiased identification of domains using the gene expression data directly.

Approach 2 - Expression-based domain identification

Expression-based methods can be further sub-categorised into three categories:

Graph Neural Network (GNN) or Graph Convolutional Network (GCN) based algorithms.
Probabilistic models based on markov random fields.
Neighbourhood diffusion followed by graph-based clustering.

We note that this is not a comprehensive survey of the literature but just a summary of the most popular methods in this space. The first two approaches can be very powerful, however, are time consuming and will not always scale with larger datasets. Diffusion followed by graph-clustering is a relatively simple, yet powerful approach that scales to large datasets. Banksy is one such algorithm that first shares information within a local neighbourhood, and then performs a standard single-cell RNA-seq inspired graph-based clustering to identify spatial domains (Singhal et al. 2024).

The following key parameters determine the final clustering results:

lambda - This controls the degree of information sharing in the neighbourhood (range of [0, 1]) where larger values result in higher diffusion and thereby smoother clusters.
resolution - The resolution parameter for the graph-based clustering using either the Louvain or Leiden algorithms. Likewise, there are other parameters to control the graph-based clustering. The OSCA book discusses choices for these parameters in depth.
npcs - The number of principal components to use to condense the feature space. The OSCA book discusses choices for this in depth.
k_geom - The number of nearest neighbours used to share information. The authors recommend values within [15, 30] and note that variation is minimal within this range.

library(Banksy)

# parameters
lambda = 0.2
npcs = 30
k_geom = 50
res = 0.4

# compute the features required by Banksy
lung5 = computeBanksy(lung5, assay_name = "logcounts", k_geom = k_geom)

# compute PCA on the features
set.seed(1000)
lung5 = runBanksyPCA(lung5, lambda = lambda, npcs = npcs)

# perform graph-based clusering
set.seed(1000)
lung5 = clusterBanksy(lung5, lambda = lambda, npcs = npcs, resolution = res)

# store to the column banksy_region
lung5$banksy_region = lung5$`clust_M0_lam0.2_k50_res0.4`

plotSpatial(lung5, colour = banksy_region, alpha = 0.75) +
  scale_colour_brewer(palette = "Set1", na.value = "#333333", name = "predicted") +
  ggtitle("Mean from SpaNorm model") +
  theme(legend.position = "bottom") +
  guides(colour = guide_legend(override.aes = list(shape = 15, size = 5)))

Try varying the parameters to see how the clustering changes. Do note that the results are automatically stored in a new column in the colData. Unfortunately, the function does not allow specification of the new column name so the only approach is to look at the column names using colnames(colData(lung5)) and find the name of the new column (usually the last one) containing the clustering results. We then store these to a column of our choosing to make it easier to reference it.

Banksy is able to identify the tumour regions (4 and 6), lymphoid aggregates (5 and 9) and other stromal regions. However, we notice that it is not able to merge regions from the different fields of view (FOVs) which have batch effects. This is because the transcriptomic profiles may yet have batch effects following adjustment. This is one of the pit falls of methods that use expression data compared to cell-based methods. The former are more sensitive to batch effects than the latter, which use higher-order phenotypes (cell types) that are less likely to be altered due to batch effects.

We highlight here that when the full dataset is used and there are more cells to learn structure from, Banksy is able to identify merged clusters across the tissue. It is worth remembering that this workshop only provides a demonstration of methods. When methods are optimised and run on more data, perfomance will vary and will generally improve.

p1 = plotSpatial(lung5, colour = banksy_region, alpha = 0.75) +
  scale_colour_brewer(palette = "Set1", na.value = "#333333", name = "predicted") +
  ggtitle("Predicted pathology") +
  theme(legend.position = "bottom") +
  guides(colour = guide_legend(nrow = 3, override.aes = list(shape = 15, size = 5)))
p2 = plotSpatial(lung5, colour = region) +
  scale_colour_brewer(palette = "Dark2") +
  ggtitle("Pathology annotations") +
  guides(colour = guide_legend(nrow = 3, override.aes = list(shape = 15, size = 5)))
p1 + p2 + plot_annotation(title = "Spatial domain identification")

The choice of the better approach amongst the two demonstrated here should be based on the question of interest. They both have their strengths and limitations. BANKSY allows us to identify within tumour differences because it assesses the expression profile directly, however, it does not offer any additional interpretation of the domains it identifies and is therefore normally coupled with differential expression and differential compositional analysis. Errors in the domain identification would propagate to these downstream tasks as well. On the other hand, hoodscanR inherently provides interpretation of the domains it identifies, but it is limited by the accuracy and resolution of cell type annotation.

Differential expression

Cell type identification or clustering allows us to identify the different cellular states in a biological system. The power of spatial molecular datasets is that we can study how cellular states change in response to different structures in the tissue. For instance, in our tissue we notice immune cells in lymphoid aggregates (region 4) and immune cells dispersed elsewhere in the tissue (region 3). We may be interested in studying the functional differences of cells in different domains. We can cluster the cells using a combination of cell types and regions and study phenotypic differences in a cell type within different regions.

# create clusters to study how cell types vary across domains
clusters = paste(lung5$cell_type, lung5$hoodscanr_region, sep = "_")
clusters[lung5$hoodscanr_region %in% c("3", "4")] |>
  table() |>
  sort() |> 
  tail(10)

## 
##               Tumor_4              T cell_3                  DC_4 
##                   190                   198                   235 
## Monocyte/Macrophage_4              T cell_4              Plasma_4 
##                   253                   284                   394 
##             Stromal_4         Endothelial_4              B cell_3 
##                   445                   520                   695 
##          Fibroblast_4 
##                  1188

The analysis below is only meant to be a demonstration of the potential of spatial transcriptomics data. A proper statistical analysis would require replicate tissue samples from across different patients. In such a case, we would aggregate transcript counts across cell types using the scater::sumCountsAcrossCells to create pseudo-bulk samples, and follow up with a differential expression analysis using the edgeR or limma pipelines. Alternatively, a more advanced tool such as nicheDE could be used to perform analysis on a single tissue, however, it would also be more robust with biological replicates. This tool is outside the scope of this introductory workshop therefore is not covered.

# identify markers
marker.info = scoreMarkers(lung5, clusters, pairings = list(c("B cell_4", "T cell_4"), c("B cell_3", "T cell_3")))
marker.info

## List of length 2
## names(2): B cell_4 T cell_4

Below we compare B cells from region 4 (lymphoid aggregates) against those from region 3 (dispersed). Investigating and visualising the top genes shows that CD74, MS4A1, and IGHM are up-regulated in B cells within lymphoid aggregates compared to dispersed B cells. These genes are indicative of increased antigen presentation, B cell maturation and increased activity against the adjacent tumour regions respectively.

# study markers of B cell_4
chosen_bcell = marker.info[["B cell_4"]]
ordered_bcell = chosen_bcell[order(chosen_bcell$rank.AUC), ]
head(ordered_bcell[, 1:4])

## DataFrame with 6 rows and 4 columns
##          self.average other.average self.detected other.detected
##             <numeric>     <numeric>     <numeric>      <numeric>
## CD74         2.862862      2.421214      0.951724       0.869083
## BGN          0.976033      0.452248      0.531034       0.275256
## COL3A1       0.635565      0.266700      0.413793       0.182305
## HLA-DRB1     1.733911      1.286152      0.813793       0.613179
## COL1A1       0.521940      0.199646      0.379310       0.152046
## IGHA1        2.682250      1.987282      0.827586       0.686425

plotExpression(lung5[, lung5$cell_type %in% "B cell"],
  features = head(rownames(ordered_bcell), 9),
  x = "hoodscanr_region",
  colour_by = "hoodscanr_region",
  ncol = 3
) +
  geom_vline(xintercept = c(3, 4), lwd = 0.1, lty = 2) +
  scale_colour_brewer(palette = "Paired", na.value = "#333333", name = "cluster") +
  theme(legend.position = "bottom") +
  guides(colour = guide_legend(override.aes = list(shape = 15, size = 5)))

Next we compare T cells from region 4 (lymphoid aggregates) against those from region 3 (dispersed). Investigating and visualising the top genes shows that FYB1, FYN, CD37, and CXCR4 are up-regulated in T cells. The first two genes indicate increased T-cell receptor (TCR) signalling while the latter two are involved in immune cell recruitment and infiltration. These genes suggest increased activity and infiltration of T cells in the lymphoid aggregates.

# study markers of T cell_4
chosen_tcell = marker.info[["T cell_4"]]
ordered_tcell = chosen_tcell[order(chosen_tcell$rank.AUC), ]
head(ordered_tcell[, 1:4])

## DataFrame with 6 rows and 4 columns
##        self.average other.average self.detected other.detected
##           <numeric>     <numeric>     <numeric>      <numeric>
## IL32       1.058549      0.674651      0.654930       0.444081
## COL9A2     2.676123      2.192122      0.971831       0.907093
## S100A4     0.733507      0.326125      0.464789       0.238184
## CD3E       1.120072      0.810657      0.725352       0.513527
## COL1A1     0.497963      0.199646      0.369718       0.152046
## IL7R       0.733384      0.367926      0.436620       0.257812

plotExpression(lung5[, lung5$cell_type %in% "T cell"],
  features = head(rownames(ordered_tcell), 9),
  x = "hoodscanr_region",
  colour_by = "hoodscanr_region",
  ncol = 3
) +
  geom_vline(xintercept = c(3, 4), lwd = 0.1, lty = 2) +
  scale_colour_brewer(palette = "Paired", na.value = "#333333", name = "cluster") +
  theme(legend.position = "bottom") +
  guides(colour = guide_legend(override.aes = list(shape = 15, size = 5)))

Functional analysis

The example dataset used here was generated using the Bruker/NanoString CosMx SMI platform that measured 960 genes. More recent versions of this platform provide whole transcriptome measurements (>20k genes). It is relatively easy to scan through the strongest differentially expressed genes from a list of at most 960 genes to understand the processes underpinning the system. With larger panel sizes, we need computational tools to perform such analyses, often referred to as gene-set enrichment analysis or functional analysis. As this is an extensive topic, it is not covered in depth in this workshop, however, we have previously developed a workshop on gene-set enrichment analysis and refer the attendees to that if they are interested. A web-based interface of some of the tools in our workshop can be accessed through vissE.cloud.

Optional: You can also explore the full dataset from this workshop on vissE.cloud through the following link. This exploratory analysis performs principal components analysis (PCA) or non-negative matrix factorisation (NMF) on the dataset to identify macroscopic biological programs. It then performs gene-set enrichment analysis on the factor loadings to explain the biological processes.

Packages used

This workflow depends on various packages from version 3.22 of the Bioconductor project, running on R version 4.5.2 (2025-10-31) or higher. The complete list of the packages used for this workflow are shown below:

sessionInfo()

## R version 4.5.2 (2025-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## time zone: UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] Banksy_1.6.0                       hoodscanR_1.7.2                   
##  [3] CosMxSpatialAnalysisWorkshop_1.0.0 singscore_1.30.0                  
##  [5] SpaNorm_1.3.4                      SpatialExperiment_1.20.0          
##  [7] SingleR_2.12.0                     standR_1.14.0                     
##  [9] scater_1.38.0                      scran_1.38.0                      
## [11] scuttle_1.20.0                     SingleCellExperiment_1.32.0       
## [13] SummarizedExperiment_1.40.0        Biobase_2.70.0                    
## [15] GenomicRanges_1.62.0               Seqinfo_1.0.0                     
## [17] IRanges_2.44.0                     S4Vectors_0.48.0                  
## [19] BiocGenerics_0.56.0                generics_0.1.4                    
## [21] MatrixGenerics_1.22.0              matrixStats_1.5.0                 
## [23] patchwork_1.3.2                    ggplot2_4.0.0                     
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3        jsonlite_2.0.0           
##   [3] magrittr_2.0.4            ggbeeswarm_0.7.2         
##   [5] magick_2.9.0              farver_2.1.2             
##   [7] rmarkdown_2.30            fs_1.6.6                 
##   [9] ragg_1.5.0                vctrs_0.6.5              
##  [11] memoise_2.0.1             DelayedMatrixStats_1.32.0
##  [13] prettydoc_0.4.1           htmltools_0.5.8.1        
##  [15] S4Arrays_1.10.0           BiocNeighbors_2.4.0      
##  [17] SparseArray_1.10.1        sass_0.4.10              
##  [19] bslib_0.9.0               htmlwidgets_1.6.4        
##  [21] desc_1.4.3                plyr_1.8.9               
##  [23] cachem_1.1.0              igraph_2.2.1             
##  [25] lifecycle_1.0.4           pkgconfig_2.0.3          
##  [27] rsvd_1.0.5                Matrix_1.7-4             
##  [29] R6_2.6.1                  fastmap_1.2.0            
##  [31] aricode_1.0.3             digest_0.6.37            
##  [33] AnnotationDbi_1.72.0      dqrng_0.4.1              
##  [35] irlba_2.3.5.1             textshaping_1.0.4        
##  [37] RSQLite_2.4.3             beachmat_2.26.0          
##  [39] sccore_1.0.6              labeling_0.4.3           
##  [41] httr_1.4.7                abind_1.4-8              
##  [43] compiler_4.5.2            bit64_4.6.0-1            
##  [45] withr_3.0.2               S7_0.2.0                 
##  [47] BiocParallel_1.44.0       viridis_0.6.5            
##  [49] DBI_1.2.3                 DelayedArray_0.36.0      
##  [51] rjson_0.2.23              bluster_1.20.0           
##  [53] tools_4.5.2               vipor_0.4.7              
##  [55] beeswarm_0.4.0            glue_1.8.0               
##  [57] dbscan_1.2.3              grid_4.5.2               
##  [59] cluster_2.1.8.1           reshape2_1.4.4           
##  [61] gtable_0.3.6              tidyr_1.3.1              
##  [63] data.table_1.17.8         BiocSingular_1.26.0      
##  [65] ScaledMatrix_1.18.0       metapod_1.18.0           
##  [67] XVector_0.50.0            RcppAnnoy_0.0.22         
##  [69] RANN_2.6.2                ggrepel_0.9.6            
##  [71] pillar_1.11.1             stringr_1.6.0            
##  [73] limma_3.66.0              RcppHungarian_0.3        
##  [75] splines_4.5.2             dplyr_1.1.4              
##  [77] lattice_0.22-7            bit_4.6.0                
##  [79] annotate_1.88.0           tidyselect_1.2.1         
##  [81] locfit_1.5-9.12           Biostrings_2.78.0        
##  [83] knitr_1.50                gridExtra_2.3            
##  [85] edgeR_4.8.0               scattermore_1.2          
##  [87] xfun_0.54                 statmod_1.5.1            
##  [89] stringi_1.8.7             yaml_2.3.10              
##  [91] evaluate_1.0.5            codetools_0.2-20         
##  [93] tibble_3.3.0              BiocManager_1.30.26      
##  [95] graph_1.88.0              cli_3.6.5                
##  [97] uwot_0.2.3                leidenAlg_1.1.5          
##  [99] xtable_1.8-4              systemfonts_1.3.1        
## [101] jquerylib_0.1.4           Rcpp_1.1.0               
## [103] png_0.1-8                 XML_3.99-0.20            
## [105] parallel_4.5.2            pkgdown_2.2.0            
## [107] blob_1.2.4                mclust_6.1.2             
## [109] ggalluvial_0.12.5         sparseMatrixStats_1.22.0 
## [111] viridisLite_0.4.2         GSEABase_1.72.0          
## [113] scales_1.4.0              purrr_1.2.0              
## [115] crayon_1.5.3              rlang_1.1.6              
## [117] cowplot_1.2.0             KEGGREST_1.50.0

References

Allaire, JJ, Yihui Xie, Christophe Dervieux, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, et al. 2025. Rmarkdown: Dynamic Documents for r. https://github.com/rstudio/rmarkdown.

Amezquita, Robert A, Aaron TL Lun, Etienne Becht, Vince J Carey, Lindsay N Carpp, Ludwig Geistlinger, Federico Marini, et al. 2020. “Orchestrating Single-Cell Analysis with Bioconductor.” Nature Methods 17 (2): 137–45.

Aran, Dvir, Agnieszka P Looney, Leqian Liu, Esther Wu, Valerie Fong, Austin Hsu, Suzanna Chak, et al. 2019. “Reference-Based Analysis of Lung Single-Cell Sequencing Reveals a Transitional Profibrotic Macrophage.” Nature Immunology 20 (2): 163–72.

Bhuva, Dharmesh D, Chin Wee Tan, Agus Salim, Claire Marceaux, Marie A Pickering, Jinjin Chen, Malvika Kharbanda, et al. 2024. “Library Size Confounds Biology in Spatial Transcriptomics Data.” Genome Biology 25 (1): 99.

Fu, Xiaohang, Yingxin Lin, David M Lin, Daniel Mechtersheimer, Chuhan Wang, Farhan Ameen, Shila Ghazanfar, Ellis Patrick, Jinman Kim, and Jean YH Yang. 2024. “BIDCell: Biologically-Informed Self-Supervised Learning for Segmentation of Subcellular Spatial Transcriptomics Data.” Nature Communications 15 (1): 509.

Liu, Ning, Jarryd Martin, Dharmesh D Bhuva, Jinjin Chen, Mengbo Li, Samuel C Lee, Malvika Kharbanda, et al. 2024. “hoodscanR: Profiling Single-Cell Neighborhoods in Spatial Transcriptomics Data.” bioRxiv, 2024–03.

Qiu, Yixuan. 2021. Prettydoc: Creating Pretty Documents from r Markdown. https://github.com/yixuan/prettydoc.

Salim, Agus, Dharmesh D Bhuva, Carissa Chen, Pengyi Yang, Melissa J Davis, and Jean YH Yang. 2024. “SpaNorm: Spatially-Aware Normalisation for Spatial Transcriptomics Data.” bioRxiv, 2024–05.

Singhal, Vipul, Nigel Chou, Joseph Lee, Yifei Yue, Jinyue Liu, Wan Kee Chock, Li Lin, et al. 2024. “BANKSY Unifies Cell Typing and Tissue Domain Segmentation for Scalable Spatial Omics Data Analysis.” Nature Genetics 56 (3): 431–41.

Totty, Michael, Stephanie C Hicks, and Boyi Guo. 2024. “SpotSweeper: Spatially-Aware Quality Control for Spatial Transcriptomics.” bioRxiv.

Townes, F William, Stephanie C Hicks, Martin J Aryee, and Rafael A Irizarry. 2019. “Feature Selection and Dimension Reduction for Single-Cell RNA-Seq Based on a Multinomial Model.” Genome Biology 20 (1): 295.

Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus Wilke, Kara Woo, Hiroaki Yutani, Dewey Dunnington, and Teun van den Brand. 2025. Ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. https://ggplot2.tidyverse.org.

Xie, Yihui. 2025. Knitr: A General-Purpose Package for Dynamic Report Generation in r. https://yihui.org/knitr/.