This function will find the most spatially relevant cluster label for each gene.
Usage
lasso_markers(
gene_mt,
cluster_mt,
sample_names,
keep_positive = TRUE,
coef_cutoff = 0.05,
background = NULL,
n_fold = 10
)
Arguments
- gene_mt
A matrix containing the transcript count in each grid. Each row refers to a grid, and each column refers to a gene. The column names must be specified and refer to the genes. This can be the output from the function get_vectors.
- cluster_mt
A matrix containing the number of cells of a specific cluster in each grid. Each row refers to a grid, and each column refers to a cluster. The column names must be specified and refer to the clusters. Please do not use integers as column names. This can be the output from the function get_vectors.
- sample_names
A vector specifying the names of the samples.
- keep_positive
A logical flag indicating whether to return positively correlated clusters (TRUE) or negatively correlated clusters (FALSE). Default is TRUE.
- coef_cutoff
A positive number giving the coefficient cutoff value. A gene whose top cluster has a coefficient value smaller than the cutoff does not pass the threshold. Default is 0.05.
- background
Optional. A matrix providing the background information. Each row refers to a grid, and each column refers to one category of background information. The number of rows must equal the number of rows in gene_mt and cluster_mt. Can be obtained by providing only the coordinates matrices cluster_info to the function get_vectors.
- n_fold
Optional. A positive number giving the number of folds used for cross-validation. This parameter is passed to cv.glmnet to calculate a penalty term for every gene. Default is 10.
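The requirements on the input matrices can be checked before the call. The following is a minimal base R sketch, not part of the package: check_lasso_inputs is a hypothetical helper, and the small toy matrices only stand in for the output of get_vectors.

```r
# Hypothetical helper (not part of the package): checks the documented
# requirements on gene_mt, cluster_mt and the optional background matrix.
check_lasso_inputs <- function(gene_mt, cluster_mt, background = NULL) {
  stopifnot(is.matrix(gene_mt), is.matrix(cluster_mt))
  # column names must be specified for genes and clusters
  stopifnot(!is.null(colnames(gene_mt)), !is.null(colnames(cluster_mt)))
  # rows are grids, so both matrices need the same number of rows
  stopifnot(nrow(gene_mt) == nrow(cluster_mt))
  # cluster names should not be integers such as "1" or "2"
  if (any(grepl("^[0-9]+$", colnames(cluster_mt)))) {
    stop("cluster_mt column names must not be integers")
  }
  if (!is.null(background)) {
    stopifnot(is.matrix(background), nrow(background) == nrow(gene_mt))
  }
  invisible(TRUE)
}

# toy inputs: 4 grids, 2 genes, 2 clusters (values are made up)
gene_mt <- matrix(c(5, 0, 3, 1, 0, 4, 1, 6), nrow = 4,
                  dimnames = list(NULL, c("gene_A1", "gene_B1")))
cluster_mt <- matrix(c(2, 0, 1, 0, 0, 3, 0, 2), nrow = 4,
                     dimnames = list(NULL, c("A", "B")))
ok <- check_lasso_inputs(gene_mt, cluster_mt)

# integer-like cluster names are rejected by the helper
bad_mt <- cluster_mt
colnames(bad_mt) <- c("1", "2")
res <- tryCatch(check_lasso_inputs(gene_mt, bad_mt),
                error = function(e) "rejected")

# prefixing the names is one simple way to make them valid
colnames(bad_mt) <- paste0("c", colnames(bad_mt))
```

Prefixing integer cluster labels with a letter, as in the last line, is only one possible convention; any non-integer naming scheme satisfies the requirement.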
Value
A list of two matrices with the following components:
- lasso_top_result
A matrix with detailed information for each gene and the most relevant cluster label.
  - gene
  Gene name.
  - top_cluster
  The name of the most relevant cluster after thresholding the coefficients.
  - glm_coef
  The coefficient of the selected cluster in the generalised linear model.
  - pearson
  Pearson correlation between the gene vector and the selected cluster vector.
  - max_gg_corr
  A number showing the maximum Pearson correlation between this gene vector and all other gene vectors in the input gene_mt.
  - max_gc_corr
  A number showing the maximum Pearson correlation between this gene vector and every cluster vector in the input cluster_mt.
- lasso_full_result
A matrix with detailed information for each gene and each significant cluster label.
  - gene
  Gene name.
  - cluster
  The name of the significant cluster.
  - glm_coef
  The coefficient of the selected cluster in the generalised linear model.
  - pearson
  Pearson correlation between the gene vector and the selected cluster vector.
  - max_gg_corr
  A number showing the maximum Pearson correlation between this gene vector and all other gene vectors in the input gene_mt.
  - max_gc_corr
  A number showing the maximum Pearson correlation between this gene vector and every cluster vector in the input cluster_mt.
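To make the correlation columns concrete, the sketch below computes quantities analogous to pearson, max_gg_corr and max_gc_corr with base R on simulated toy matrices. It illustrates the definitions given above and is not the package's internal code; the matrices and their values are made up.

```r
set.seed(42)
n_grid <- 50

# toy cluster vectors: cell counts per grid for two clusters
cluster_mt <- cbind(A = rpois(n_grid, 4), B = rpois(n_grid, 4))

# toy gene vectors that follow one cluster each, plus count noise
gene_mt <- cbind(
  gene_A1 = cluster_mt[, "A"] + rpois(n_grid, 1),
  gene_A2 = cluster_mt[, "A"] + rpois(n_grid, 1),
  gene_B1 = cluster_mt[, "B"] + rpois(n_grid, 1)
)

# Pearson correlation between every gene vector and every cluster vector
gc_corr <- cor(gene_mt, cluster_mt, method = "pearson")

# analogue of max_gc_corr: largest gene-cluster correlation per gene
max_gc_corr <- apply(gc_corr, 1, max)

# analogue of max_gg_corr: largest correlation with any other gene,
# so the diagonal (a gene with itself) is excluded first
gg_corr <- cor(gene_mt, method = "pearson")
diag(gg_corr) <- NA
max_gg_corr <- apply(gg_corr, 1, max, na.rm = TRUE)
```

A gene with a high max_gg_corr but a low max_gc_corr is strongly co-located with another gene without matching any supplied cluster well, which is one way these two columns can be read together.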
Details
This function will take the converted gene and cluster vectors from function
get_vectors, and return the most relevant cluster label for
each gene. If there are multiple samples in the dataset, this function
will find shared markers across different samples by including additional
sample vectors in the input cluster_mt.
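One way to picture the additional sample vectors is as indicator columns appended to the stacked cluster matrix. The sketch below assumes a 0/1 encoding with one column per sample; this encoding is an assumption made for illustration, and in practice the matrix returned by get_vectors should be used as is.

```r
# toy cluster counts for two samples, 3 grids each (hypothetical values)
rep1 <- cbind(A = c(2, 0, 1), B = c(0, 3, 1))
rep2 <- cbind(A = c(1, 1, 0), B = c(0, 2, 2))

# stack the grids of both samples into one matrix
stacked <- rbind(rep1, rep2)
sample_id <- factor(rep(c("rep1", "rep2"),
                        times = c(nrow(rep1), nrow(rep2))))

# ASSUMPTION: one 0/1 indicator column per sample
sample_vecs <- model.matrix(~ 0 + sample_id)
colnames(sample_vecs) <- levels(sample_id)

# cluster vectors and sample vectors together form the feature matrix
cluster_mt <- cbind(stacked, sample_vecs)
```

With the sample columns included as features, sample-specific differences in overall abundance can be absorbed by those columns instead of being attributed to a cluster.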
This function treats all input cluster vectors as features, and creates a penalized linear model for one gene vector with lasso regularization. Clusters with a non-zero coefficient will be selected, and these clusters will be used to formulate a generalised linear model for this gene vector.
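The two-step procedure described above can be sketched for a single gene as follows. This is an illustration under stated assumptions, not the package's implementation: it assumes the glmnet package is installed, uses a Gaussian model, and takes the coefficients at lambda.min; the toy data and variable names are made up.

```r
library(glmnet)

set.seed(1)
n_grid <- 200

# toy cluster vectors: cell counts per grid for three hypothetical clusters
cluster_mt <- cbind(A = rpois(n_grid, 3),
                    B = rpois(n_grid, 3),
                    C = rpois(n_grid, 3))

# toy gene vector that follows cluster A plus noise
gene_vec <- 2 * cluster_mt[, "A"] + rnorm(n_grid)

# step 1: lasso (alpha = 1) with the penalty chosen by 10-fold cross-validation
cv_fit <- cv.glmnet(x = cluster_mt, y = gene_vec, alpha = 1, nfolds = 10)
lasso_coef <- as.matrix(coef(cv_fit, s = "lambda.min"))[-1, 1]  # drop intercept
selected <- names(lasso_coef)[lasso_coef != 0]

# step 2: refit an unpenalised linear model on the selected clusters only
refit_df <- data.frame(gene = gene_vec,
                       cluster_mt[, selected, drop = FALSE])
refit <- glm(gene ~ ., data = refit_df)

# coefficient table with estimates and p-values for the selected clusters
coef_tab <- summary(refit)$coefficients
```

Refitting without the penalty gives coefficient estimates and p-values that are not shrunk towards zero, which is what the later selection by sign and p-value relies on.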
If the input keep_positive is TRUE, the clusters with a positive coefficient and a significant p-value will be saved in the output matrix lasso_full_result. The cluster with a positive coefficient and the minimum p-value will be regarded as the most relevant cluster for this gene and will be saved in the output matrix lasso_top_result.
If the input keep_positive is FALSE, the clusters with a negative coefficient and a significant p-value will be saved in the output matrix lasso_full_result. The cluster with a negative coefficient and the minimum p-value will be regarded as the most relevant cluster for this gene and will be saved in the output matrix lasso_top_result.
If there are no clusters with a significant p-value, the string "NoSig" will be returned for this gene.
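The selection rule can be sketched as a small base R function operating on a per-gene coefficient table. pick_top_cluster is a hypothetical helper, the column names of the toy table are chosen for this example, and the 0.05 significance threshold is an assumption, since the threshold is not stated here.

```r
# Hypothetical helper (not part of the package): picks the most relevant
# cluster from a coefficient table with columns cluster, glm_coef, p_value.
# ASSUMPTION: "significant" means p_value < 0.05.
pick_top_cluster <- function(coef_tab, keep_positive = TRUE, p_cutoff = 0.05) {
  # keep only coefficients with the requested sign
  sign_ok <- if (keep_positive) coef_tab$glm_coef > 0 else coef_tab$glm_coef < 0
  kept <- coef_tab[sign_ok & coef_tab$p_value < p_cutoff, , drop = FALSE]
  # no significant cluster with the requested sign
  if (nrow(kept) == 0) return("NoSig")
  # the cluster with the smallest p-value is the most relevant one
  kept$cluster[which.min(kept$p_value)]
}

# toy coefficient table for one gene (values are made up)
coef_tab <- data.frame(
  cluster  = c("A", "B", "C"),
  glm_coef = c(1.8, -0.6, 0.2),
  p_value  = c(1e-6, 1e-3, 0.40),
  stringsAsFactors = FALSE
)

top_pos <- pick_top_cluster(coef_tab, keep_positive = TRUE)    # "A"
top_neg <- pick_top_cluster(coef_tab, keep_positive = FALSE)   # "B"
none    <- pick_top_cluster(coef_tab[3, ], keep_positive = TRUE)  # "NoSig"
```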
The parameter background can be used to capture unwanted noise
patterns in the dataset. For example, we can include negative control
genes as a background cluster in the model. If the most relevant cluster
selected for a gene matches one of the background "clusters",
we will return "NoSig" for this gene.
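The background rule can be pictured with the base R sketch below. mask_background is a hypothetical helper and the toy matrices and labels are made up; it only illustrates the idea that background columns act as extra features and that a gene matched to one of them is reported as "NoSig".

```r
# Hypothetical helper (not part of the package): replaces the top cluster
# with "NoSig" whenever it is one of the background columns.
mask_background <- function(top_cluster, background_names) {
  top_cluster[top_cluster %in% background_names] <- "NoSig"
  top_cluster
}

# toy cluster matrix and background matrix with the same number of grids
cluster_mt <- cbind(A = c(2, 0, 1, 0), B = c(0, 3, 0, 2))
background <- cbind(neg_ctrl = c(1, 1, 0, 2))  # e.g. negative control counts
stopifnot(nrow(background) == nrow(cluster_mt))

# background columns are modelled alongside the real clusters
features <- cbind(cluster_mt, background)

# toy top-cluster assignments for three genes (made up)
top_cluster <- c(gene_A1 = "A", gene_B1 = "B", gene_X = "neg_ctrl")
masked <- mask_background(top_cluster, colnames(background))
```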
Examples
# assumes the package providing get_vectors() and lasso_markers() is attached
set.seed(100)
# simulate coordinates for clusters
df_clA = data.frame(x = rnorm(n=100, mean=20, sd=5),
                    y = rnorm(n=100, mean=20, sd=5), cluster="A")
df_clB = data.frame(x = rnorm(n=100, mean=100, sd=5),
                    y = rnorm(n=100, mean=100, sd=5), cluster="B")
clusters = rbind(df_clA, df_clB)
clusters$sample="rep1"
# simulate coordinates for genes
trans_info = data.frame(rbind(cbind(x = rnorm(n=100, mean=20, sd=5),
                                    y = rnorm(n=100, mean=20, sd=5),
                                    feature_name="gene_A1"),
                              cbind(x = rnorm(n=100, mean=20, sd=5),
                                    y = rnorm(n=100, mean=20, sd=5),
                                    feature_name="gene_A2"),
                              cbind(x = rnorm(n=100, mean=100, sd=5),
                                    y = rnorm(n=100, mean=100, sd=5),
                                    feature_name="gene_B1"),
                              cbind(x = rnorm(n=100, mean=100, sd=5),
                                    y = rnorm(n=100, mean=100, sd=5),
                                    feature_name="gene_B2")))
# cbind() with a character column stores x and y as strings, so convert back
trans_info$x=as.numeric(trans_info$x)
trans_info$y=as.numeric(trans_info$y)
# window covering all transcript and cell coordinates
w_x = c(min(floor(min(trans_info$x)),
            floor(min(clusters$x))),
        max(ceiling(max(trans_info$x)),
            ceiling(max(clusters$x))))
w_y = c(min(floor(min(trans_info$y)),
            floor(min(clusters$y))),
        max(ceiling(max(trans_info$y)),
            ceiling(max(clusters$y))))
data = list(trans_info = trans_info)
# convert the coordinates to gene and cluster vectors on a 20 x 20 square grid
vecs_lst = get_vectors(data_lst=list(rep1=data), cluster_info = clusters,
                       bin_type = "square",
                       bin_param = c(20,20),
                       all_genes = c("gene_A1","gene_A2","gene_B1","gene_B2"),
                       w_x = w_x, w_y = w_y)
# find the most relevant cluster for each gene
lasso_res = lasso_markers(gene_mt = vecs_lst$gene_mt,
                          cluster_mt = vecs_lst$cluster_mt,
                          sample_names = c("rep1"),
                          keep_positive = TRUE,
                          coef_cutoff = 0.05,
                          background = NULL)
#> Warning: NAs introduced by coercion