Find marker genes with spatial coordinates

This function will find the most spatially relevant cluster label for each gene.

Usage

lasso_markers(
  gene_mt,
  cluster_mt,
  sample_names,
  keep_positive = TRUE,
  coef_cutoff = 0.05,
  background = NULL,
  n_fold = 10
)

Arguments

gene_mt: A matrix contains the transcript count in each grid. Each row refers to a grid, and each column refers to a gene. The column names must be specified and refer to the genes. This can be the output from the function get_vectors.
cluster_mt: A matrix contains the number of cells in a specific cluster in each grid. Each row refers to a grid, and each column refers to a cluster. The column names must be specified and refer to the clusters. Please do not assign integers as column names. This can be the output from the function get_vectors.
sample_names: A vector specifying the names for the samples.
keep_positive: A logical flag indicating whether to return positively correlated clusters or not.
coef_cutoff: A positive number giving the coefficient cutoff value. Genes whose top cluster showing a coefficient vlaue smaller than the cutoff will be . Default is 0.05.
background: Optional. A matrix providing the background information. Each row refers to a grid, and each column refers to one category of background information. Number of rows must equal to the number of rows in gene_mt and cluster_mt. Can be obtained by only providing coordinates matrices cluster_info. to function get_vectors.
n_fold: Optional. A positive number giving the number of folds used for cross validation. This parameter will pass to cv.glmnet to calculate a penalty term for every gene.

Value

a list of two matrices with the following components

lasso_top_result

A matrix with detailed information for each gene and the most relevant cluster label.

gene Gene name
top_cluster The name of the most revelant cluster after thresholding the coefficients.
glm_coef The coefficient of the selected cluster in the generalised linear model.
pearson Pearson correlation between the gene vector and the selected cluster vector.
max_gg_corr A number showing the maximum pearson correlation for this gene vector and all other gene vectors in the input gene_mt
max_gc_corr A number showing the maximum pearson correlation for this gene vector and every cluster vectors in the input cluster_mt

lasso_full_result

A matrix with detailed information for each gene and the most relevant cluster label.

gene Gene name
cluster The name of the significant cluster after
glm_coef The coefficient of the selected cluster in the generalised linear model.
pearson Pearson correlation between the gene vector and the selected cluster vector.
max_gg_corr A number showing the maximum pearson correlation for this gene vector and all other gene vectors in the input gene_mt
max_gc_corr A number showing the maximum pearson correlation for this gene vector and every cluster vectors in the input cluster_mt

Details

This function will take the converted gene and cluster vectors from function get_vectors, and return the most relevant cluster label for each gene. If there are multiple samples in the dataset, this function will find shared markers across different samples by including additional sample vectors in the input cluster_mt.

This function treats all input cluster vectors as features, and create a penalized linear model for one gene vector with lasso regularization. Clusters with non-zero coefficient will be selected, and these clusters will be used to formulate a generalised linear model for this gene vector.

If the input keep_positive is TRUE, the clusters with positive coefficient and significant p-value will be saved in the output matrix lasso_full_result. The cluster with a positive coefficient and the minimum p-value will be regarded as the most relevant cluster to this gene and be saved in the output matrix lasso_result.
If the input keep_positive is FALSE, the clusters with negative coefficient and significant p-value will be saved in the output matrix lasso_full_result. The cluster with a negative coefficient and the minimum p-value will be regarded as the most relevant cluster to this gene and be saved in the output matrix lasso_result.

If there is no clusters with significant p-value, the a string "NoSig" will be returned for this gene.

The parameter background can be used to capture unwanted noise pattern in the dataset. For example, we can include negative control genes as a background cluster in the model. If the most relevant cluster selected by one gene matches the background "clusters", we will return "NoSig" for this gene.

Examples


set.seed(100)
#  simulate coordinates for clusters
df_clA = data.frame(x = rnorm(n=100, mean=20, sd=5),
                 y = rnorm(n=100, mean=20, sd=5), cluster="A")
df_clB = data.frame(x = rnorm(n=100, mean=100, sd=5),
                y = rnorm(n=100, mean=100, sd=5), cluster="B")

clusters = rbind(df_clA, df_clB)
clusters$sample="rep1"

# simulate coordinates for genes
trans_info = data.frame(rbind(cbind(x = rnorm(n=100, mean=20,sd=5),
                                y = rnorm(n=100, mean=20, sd=5),
                                 feature_name="gene_A1"),
                           cbind(x = rnorm(n=100, mean=20, sd=5),
                                 y = rnorm(n=100, mean=20, sd=5),
                                 feature_name="gene_A2"),
                           cbind(x = rnorm(n=100, mean=100, sd=5),
                                 y = rnorm(n=100, mean=100, sd=5),
                                 feature_name="gene_B1"),
                           cbind(x = rnorm(n=100, mean=100, sd=5),
                                 y = rnorm(n=100, mean=100, sd=5),
                                 feature_name="gene_B2")))
trans_info$x=as.numeric(trans_info$x)
trans_info$y=as.numeric(trans_info$y)
w_x =  c(min(floor(min(trans_info$x)),
         floor(min(clusters$x))),
      max(ceiling(max(trans_info$x)),
          ceiling(max(clusters$x))))
w_y =  c(min(floor(min(trans_info$y)),
          floor(min(clusters$y))),
      max(ceiling(max(trans_info$y)),
          ceiling(max(clusters$y))))
data = list(trans_info = trans_info)
vecs_lst = get_vectors(data_lst=list(rep1=data), cluster_info = clusters,
                    bin_type = "square",
                    bin_param = c(20,20),
                    all_genes =c("gene_A1","gene_A2","gene_B1","gene_B2"),
                    w_x = w_x, w_y=w_y)
lasso_res = lasso_markers(gene_mt=vecs_lst$gene_mt,
                        cluster_mt = vecs_lst$cluster_mt,
                        sample_names=c("rep1"),
                        keep_positive=TRUE,
                        coef_cutoff=0.05,
                        background=NULL)
#> Warning: NAs introduced by coercion

Usage

Arguments

Value

Details

See also

Examples