cell2clone - identifying lymphocyte clonal expansions from a single-cell dataset

[1]:

import scanpy as sc
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import cell2clone as c2c

%config InlineBackend.figure_format = "retina"

import warnings
warnings.filterwarnings("ignore")

In this notebook we demonstrate the use of cell2clone to estimate clonality of cell type compartments across samples, and subsequently to assign single cells into clones in highly clonal samples. Taking advantage of mutually exclusive expression of immunoglobulin light chains (IGKC/IGLC) in single B cells, and T cell receptor constant chains (TRBC1/TRBC2) in single T cells, cell2clone classifies each B or T cell into kappa/lambda or TRBC1/TRBC2 classes to infer clonal restriction within phenotypically defined cell populations. Non-clonal cells, such as naïve B or T cells show constant proportions of the two classes across samples, whereas cell types containing clonal expansions exhibit skewing of the fraction to either class in samples harbouring clonal expansions.

[2]:

# Load data (https://covid19.cog.sanger.ac.uk/submissions/release1/BCR_Moderate.h5ad, from https://www.covid19cellatlas.org/index.patient.html)
adata = sc.read_h5ad("data/BCR_Moderate.h5ad")

We load a dataset of B cells from COVID-19 patients (Stephenson et al. Nature Medicine 2021) with paired transcriptome and BCR sequencing data. Cell2clone is specifically intended for datasets without VDJ data. Here we use BCR sequencing information to validate cell2clone inferences.

[3]:

adata

[3]:

AnnData object with n_obs × n_vars = 11613 × 31279
    obs: 'sample_id', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'leiden', 'initial_clustering', 'Resample', 'Collection_Day', 'patient_id', 'Sex', 'Age', 'Ethnicity', 'Swab_result', 'Status', 'Smoker', 'Status_on_day_collection', 'Status_on_day_collection_summary', 'Status_3_days_post_collection', 'Status_7_days_post_collection', 'Days_from_onset', 'Site', 'time_after_LPS', 'Worst_Clinical_Status', 'Outcome', 'filter_rna', 'has_bcr', 'filter_bcr_quality', 'filter_bcr_heavy', 'filter_bcr_light', 'bcr_QC_pass', 'filter_bcr', 'initial_clustering_B', 'leiden_B', 'leiden_B2', 'celltype_B', 'celltype_B_v2', 'clone_id', 'clone_id_by_size', 'isotype', 'lightchain', 'status', 'vdj_status', 'productive', 'umi_counts_heavy', 'umi_counts_light', 'v_call_heavy', 'v_call_light', 'j_call_heavy', 'j_call_light', 'c_call_heavy', 'c_call_light', 'clone_centrality', 'clone_degree', 'clone_id_size', 'mu_freq_heavy', 'mu_freq_light', 'clone_id_size_max_3', 'clone_network_cluster_size_gini', 'clone_network_vertex_size_gini', 'clone_size_gini', 'clone_centrality_gini', 'junction_heavy', 'junction_light', 'junction_aa_heavy', 'junction_aa_light'
    var: 'feature_types', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'Site_colors', 'Status_on_day_collection_summary_colors', 'bcr_QC_pass_colors', 'celltype_B_colors', 'celltype_B_v2_colors', 'filter_bcr_colors', 'hvg', 'initial_clustering_colors', 'isotype_colors', 'leiden', 'leiden_B_colors', 'leiden_colors', 'neighbors', 'pca', 'rank_genes_groups', 'rna_neighbors', 'umap'
    obsm: 'X_bcr', 'X_pca', 'X_pca_harmony', 'X_umap'
    obsp: 'bcr_connectivities', 'bcr_distances', 'connectivities', 'distances', 'rna_connectivities', 'rna_distances'

[4]:

# Visualise broad cell types
sc.pl.umap(adata, color="initial_clustering")

../_images/notebooks_001_cell2clone_identify_clonal_expansions_6_0.png

[5]:

# Visualise fine-grained cell types
sc.pl.umap(adata, color="celltype_B_v2")

../_images/notebooks_001_cell2clone_identify_clonal_expansions_7_0.png

[6]:

# Visualise sample IDs
sc.pl.umap(adata, color="sample_id")

../_images/notebooks_001_cell2clone_identify_clonal_expansions_8_0.png

[7]:

# Subset to B cells for simplicity
adata = adata[adata.obs["initial_clustering"].isin(["B_cell"])]

Classifying single cells based on light chain or constant chain usage

[8]:

# Define metadata columns to use
celltype = "celltype_B_v2"
sample = "sample_id"

We define each B cell as a kappa type if the expression of kappa gene is higher than a gene set of lambda genes, and as lambda type if the expression of the lambda gene set is higher thankappa gene expression. For T cells, we do the same using β-chain constant region genes TRBC1 and TRBC2. Cells in which neither gene or gene set is detected are classified as ambiguous.

[9]:

# Classify light chain class (B cells) or constant chain class (T cells) for each cell
c2c.classify_lightchain_or_trbc(
    adata,
    mode="B" # mode="T" if working with T cells
)

[10]:

adata

[10]:

AnnData object with n_obs × n_vars = 8604 × 31279
    obs: 'sample_id', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'leiden', 'initial_clustering', 'Resample', 'Collection_Day', 'patient_id', 'Sex', 'Age', 'Ethnicity', 'Swab_result', 'Status', 'Smoker', 'Status_on_day_collection', 'Status_on_day_collection_summary', 'Status_3_days_post_collection', 'Status_7_days_post_collection', 'Days_from_onset', 'Site', 'time_after_LPS', 'Worst_Clinical_Status', 'Outcome', 'filter_rna', 'has_bcr', 'filter_bcr_quality', 'filter_bcr_heavy', 'filter_bcr_light', 'bcr_QC_pass', 'filter_bcr', 'initial_clustering_B', 'leiden_B', 'leiden_B2', 'celltype_B', 'celltype_B_v2', 'clone_id', 'clone_id_by_size', 'isotype', 'lightchain', 'status', 'vdj_status', 'productive', 'umi_counts_heavy', 'umi_counts_light', 'v_call_heavy', 'v_call_light', 'j_call_heavy', 'j_call_light', 'c_call_heavy', 'c_call_light', 'clone_centrality', 'clone_degree', 'clone_id_size', 'mu_freq_heavy', 'mu_freq_light', 'clone_id_size_max_3', 'clone_network_cluster_size_gini', 'clone_network_vertex_size_gini', 'clone_size_gini', 'clone_centrality_gini', 'junction_heavy', 'junction_light', 'junction_aa_heavy', 'junction_aa_light', 'kappa_sum', 'lambda_sum', 'KL_class', 'KL_log_ratio'
    var: 'feature_types', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'Site_colors', 'Status_on_day_collection_summary_colors', 'bcr_QC_pass_colors', 'celltype_B_colors', 'celltype_B_v2_colors', 'filter_bcr_colors', 'hvg', 'initial_clustering_colors', 'isotype_colors', 'leiden', 'leiden_B_colors', 'leiden_colors', 'neighbors', 'pca', 'rank_genes_groups', 'rna_neighbors', 'umap', 'sample_id_colors'
    obsm: 'X_bcr', 'X_pca', 'X_pca_harmony', 'X_umap'
    obsp: 'bcr_connectivities', 'bcr_distances', 'connectivities', 'distances', 'rna_connectivities', 'rna_distances'

[11]:

# Check added columns
adata.obs[[celltype, "kappa_sum", "lambda_sum", "KL_class", "KL_log_ratio"]] # for T cells "TRBC1_sum", "TRBC2_sum", "TRBC_class", "TRBC_log_ratio"]

[11]:

	celltype_B_v2	kappa_sum	lambda_sum	KL_class	KL_log_ratio
BGCV01_AACCATGTCATTCACT-1	B_naive	1.192134	0.000000	kappa	13.991257
BGCV01_AACCGCGAGCGTGAAC-1	B_naive	0.000000	0.000000	ambiguous	0.000000
BGCV01_AACTCAGGTGGCGAAT-1	B_naive	1.965800	4.370118	lambda	-0.798891
BGCV01_AACTCCCCAGTGGAGT-1	B_naive	0.000000	1.927203	lambda	-14.471581
BGCV01_ACATGGTAGCTCTCGG-1	B_immature	0.000000	4.017279	lambda	-15.206116
...	...	...	...	...	...
TTTACTGTCAAGGCTT-MH9179825	B_naive	1.384348	0.000000	kappa	14.140740
TTTATGCCACATTTCT-MH9179825	B_naive	0.000000	2.449453	lambda	-14.711376
TTTCCTCAGACTAAGT-MH9179825	B_naive	1.950898	0.000000	kappa	14.483801
TTTGGTTCAGGAATCG-MH9179825	B_naive	2.209406	0.000000	kappa	14.608234
TTTGTCACAGAGTGTG-MH9179825	B_naive	0.000000	0.000000	ambiguous	0.000000

8604 rows × 5 columns

We see that most cells have been classified to either kappa or lambda light chain, while a minority remain ambiguous as neither gene is detected, likely due to dropouts.

Identifying clonal expansions across cell types and samples

We compute median-centered log2 ratio of kappa and lambda cells for each cell type in each sample. This metric will be highly positive or negative in cell types and samples harbouring large clonal expansions, and close to 0 in fully polyclonal cell populations. We additionally derive a directionless clonality metric as the absolute value of the log2 ratio and clone size score to quantify the deviation of the largest clone fraction of each cell type from the expected value in a fully polyclonal cell population. A score of 0 indicates that the observed clone fraction equals the expected value (i.e. fully polyclonal population), whereas a score of 1 corresponds to the largest possible deviation (i.e. fully monoclonal population).

We set the minimum number of cells required for clonality estimation relatively high (40 cells) to increase confidence. This also makes sense as we are looking for clonal expansions.

[12]:

# Estimate clonality for each cell type for each donor
df_class = c2c.compute_sample_celltype_clonality(
    adata,
    sample_col=sample,
    celltype_col=celltype,
    class_col="KL_class",
    state_labels=("kappa", "lambda"),
    clone_score_method="global",
    min_cells=40,
)

[13]:

columns_to_list = [sample, celltype, "kappa", "lambda", "frac_kappa", "KL_class_log2_ratio_scaled", "clonality", "clone_size_score"]

[14]:

df_class.sort_values("clone_size_score", ascending=False)[columns_to_list].head(60)

[14]:

	sample_id	celltype_B_v2	kappa	lambda	frac_kappa	KL_class_log2_ratio_scaled	clonality	clone_size_score
78	MH8919330	B_exhausted	50	2	0.961538	4.335733	4.335733	0.738166
51	BGCV07_CV0094	B_naive	207	12	0.945205	3.800402	3.800402	0.708641
4	AP3	B_exhausted	40	10	0.800000	1.691878	1.691878	0.446154
146	MH9143424	B_switched_memory	32	10	0.761905	1.369950	1.369950	0.377289
68	BGCV11_CV0257	B_naive	154	54	0.740385	1.203777	1.203777	0.338388
143	MH9143424	B_immature	34	57	0.373626	-1.053549	1.053549	0.324598
58	BGCV07_CV0134	B_naive	68	25	0.731183	1.135484	1.135484	0.321754
63	BGCV07_CV0137	B_naive	47	20	0.701493	0.924538	0.924538	0.268083
1	AP3	B_naive	107	47	0.694805	0.878756	0.878756	0.255994
21	BGCV02_CV0052	B_naive	67	31	0.683673	0.803771	0.803771	0.235871
136	MH9143423	B_immature	17	23	0.425000	-0.744221	0.744221	0.231731
102	MH9143322	B_naive	44	58	0.431373	-0.706672	0.706672	0.220211
177	MH9179825	B_naive	190	98	0.659722	0.647023	0.647023	0.192575
109	MH9143323	B_naive	71	86	0.452229	-0.584640	0.584640	0.182509
8	AP9	B_naive	27	32	0.457627	-0.553235	0.553235	0.172751
38	BGCV06_CV0037	B_naive	83	45	0.648438	0.575064	0.575064	0.172175
116	MH9143324	B_naive	100	55	0.645161	0.554374	0.554374	0.166253
97	MH9143275	B_switched_memory	24	27	0.470588	-0.478047	0.478047	0.149321
82	MH8919331	B_naive	55	61	0.474138	-0.457500	0.457500	0.142905
137	MH9143423	B_naive	77	85	0.475309	-0.450727	0.450727	0.140788
163	MH9179823	B_naive	71	74	0.489655	-0.367829	0.367829	0.114854
130	MH9143371	B_naive	37	38	0.493333	-0.346596	0.346596	0.108205
123	MH9143327	B_naive	44	44	0.500000	-0.308122	0.308122	0.096154
144	MH9143424	B_naive	299	285	0.511986	-0.238939	0.238939	0.074486
95	MH9143275	B_naive	135	127	0.515267	-0.219991	0.219991	0.068555
44	BGCV06_CV0234	B_naive	87	61	0.587838	0.204084	0.204084	0.062630
139	MH9143423	B_switched_memory	23	21	0.522727	-0.176878	0.176878	0.055070
157	MH9179821	B_naive	69	62	0.526718	-0.153794	0.153794	0.047857
15	BGCV01_CV0025	B_naive	39	29	0.573529	0.119299	0.119299	0.036765
170	MH9179824	B_naive	129	113	0.533058	-0.117074	0.117074	0.036395
89	MH9143270	B_naive	43	33	0.565789	0.073748	0.073748	0.022773
172	MH9179824	B_switched_memory	32	25	0.561404	0.048022	0.048022	0.014845
75	MH8919330	B_naive	26	21	0.553191	0.000000	0.000000	0.000000

[15]:

df_class[df_class[celltype]=="B_exhausted"].sort_values("clone_size_score", ascending=False)

[15]:

	sample_id	celltype_B_v2	kappa	lambda	n_cells	n_cells_sample_all	frac_cells_in_sample	frac_kappa	KL_class_abs_skew	KL_class_entropy	KL_class_log2_ratio	KL_class_log2_ratio_scaled	clonality	clone_size_score	clone_size_score_center
78	MH8919330	B_exhausted	50	2	52	229	0.227074	0.961538	0.461538	0.163024	4.643855	4.335733	4.335733	0.738166	0.553191
4	AP3	B_exhausted	40	10	50	400	0.125000	0.800000	0.300000	0.500402	2.000000	1.691878	1.691878	0.446154	0.553191

[16]:

fig, ax = c2c.plot_chain_skew_violin(
    df_class,
    adata,
    celltype_col=celltype,
    mode="B",
    xlim=(-5,5)
)

../_images/notebooks_001_cell2clone_identify_clonal_expansions_23_0.png

We perform a permutation test to identify cell types enriched for cells with unusually large clone size scores. Cell type labels are randomly shuffled while keeping clone size scores fixed, and the proportion of cells exceeding the reference threshold are recalculated for each permutation. This generates a null distribution against which the observed proportion is compared to determine whether the enrichment of high clone size scores is greater than expected by chance.

We use naive B cells as reference as they are not supposed to be clonal and set a percentile of the reference distribution that we consider to indicate sufficient deviation from polyclonality.

[17]:

res = c2c.permutation_test(
    df_class,
    reference_celltype="B_naive",
    celltype_col=celltype,
    sample_col=sample,
    percentile=0.975,
    n_perms=10000,
    min_samples=2
)
res

[17]:

	celltype	reference_celltype	percentile	threshold	n_samples	n_exceeding	pct_exceeding	p_perm	q_perm
0	B_exhausted	B_naive	0.975	0.486489	2	1	50.0	0.129287	0.387861
1	B_switched_memory	B_naive	0.975	0.486489	4	0	0.0	1.000000	1.000000
2	B_immature	B_naive	0.975	0.486489	2	0	0.0	1.000000	1.000000

Here none of the cell types reach significance due to a relatively small dataset.

[18]:

# define significance
alpha = 0.05
res["significant"] = res["q_perm"] < alpha

# table of significant hits (for figure annotation or reporting)
res_sig = res.loc[res["significant"]].sort_values("q_perm")
res_sig

[18]:

	celltype	reference_celltype	percentile	threshold	n_samples	n_exceeding	pct_exceeding	p_perm	q_perm	significant

[19]:

fig, ax = c2c.plot_clone_size_score_violin(
    df_class,
    res=res,
    celltype_col=celltype,
)
plt.show()

../_images/notebooks_001_cell2clone_identify_clonal_expansions_28_0.png

We identify two samples above the threshold defined based on naive B cell distribution. One sample has a putative clonal expansion in exhausted B cells, and the other in naive B cells, which is unexpected.

[20]:

# Add clonality information to adata
adata = c2c.add_df_class_to_adata_obs(
    adata,
    df_class,
    sample_col=sample,
    celltype_col=celltype,
)

[21]:

sc.pl.umap(
    adata,
    color=["clone_size_score", "frac_kappa", celltype],
    frameon=False,
)

../_images/notebooks_001_cell2clone_identify_clonal_expansions_31_0.png

Here the clonality information is defined at the level of pre-annotated cell types. We next look at how to assign single cells to clones independent of prior cell type annotation, which may provide higher resolution and identify clones comprising only a subset of a cell type in a sample.

Assigning single cells to clones

We next select a highly clonal sample/cell type and examine it as single-cell resolution, starting with exhausted B cells.

[22]:

res = c2c.subset_top_bottom_clonal_samples(
    df_clono=df_class,
    adata=adata,
    celltype_col=celltype,
    sample_col=sample,
    frac_celltype="B_exhausted",
    frac_percentile=0.5,
    top_n=1,
    bottom_n=0,
    score_agg="max",
)

print("Celltype frac threshold used:", res["frac_threshold"])
print("Top samples:", res["top_samples"])
print("Bottom samples:", res["bottom_samples"])

adata_sub = res["adata_selected"]

Celltype frac threshold used: 0.17603711790393012
Top samples: ['MH8919330']
Bottom samples: []

We identify highly variable genes after excluding TCR/BCR genes and compute PCA, neighbours and UMAP.

[23]:

# --- define gene families to exclude ---
exclude_regex = (
    r"^TR[ABDG][CDVJ]"   # TCR alpha/beta/delta/gamma (V/C/J)
    r"|^TRAC|^TRBC"     # TCR constant chains
    r"|^IG[HKL][VCJD]" # Ig heavy/kappa/lambda
    r"|^IGH|^IGK|^IGL"  # safety net
)

exclude_mask = adata_sub.var_names.str.contains(
    exclude_regex, regex=True
)

print(f"Excluding {exclude_mask.sum()} TCR/Ig genes from HVG selection")

# --- HVG selection on filtered gene set ---
adata_hvg = adata_sub[:, ~exclude_mask].copy()

sc.pp.highly_variable_genes(adata_hvg)

# --- map HVGs back to full adata_sub ---
hvg = np.zeros(adata_sub.n_vars, dtype=bool)
hvg[~exclude_mask] = adata_hvg.var["highly_variable"].values
adata_sub.var["highly_variable"] = hvg

# --- downstream ---
sc.tl.pca(adata_sub, use_highly_variable=True)
sc.pp.neighbors(adata_sub, use_rep="X_pca")
sc.tl.umap(adata_sub)

Excluding 644 TCR/Ig genes from HVG selection

We assign single cells to clones based on Leiden clusters containing one light/constant chain class above a purity cutoff. Here clones are called if a Leiden cluster is at least 80% pure. If multiple samples are included in the object, Leiden clustering is performed sample-wise as putative clones will be restricted to individuals.

[24]:

c2c.leiden_chain_clones_sampleaware(
    adata_sub,
    class_col="KL_class",
    valid_states=("kappa","lambda"),
    sample_col=sample,
    resolution=1,
    purity_cutoff=0.8,
    key_added="clone_call",
)

[24]:

AnnData object with n_obs × n_vars = 229 × 31279
    obs: 'sample_id', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'leiden', 'initial_clustering', 'Resample', 'Collection_Day', 'patient_id', 'Sex', 'Age', 'Ethnicity', 'Swab_result', 'Status', 'Smoker', 'Status_on_day_collection', 'Status_on_day_collection_summary', 'Status_3_days_post_collection', 'Status_7_days_post_collection', 'Days_from_onset', 'Site', 'time_after_LPS', 'Worst_Clinical_Status', 'Outcome', 'filter_rna', 'has_bcr', 'filter_bcr_quality', 'filter_bcr_heavy', 'filter_bcr_light', 'bcr_QC_pass', 'filter_bcr', 'initial_clustering_B', 'leiden_B', 'leiden_B2', 'celltype_B', 'celltype_B_v2', 'clone_id', 'clone_id_by_size', 'isotype', 'lightchain', 'status', 'vdj_status', 'productive', 'umi_counts_heavy', 'umi_counts_light', 'v_call_heavy', 'v_call_light', 'j_call_heavy', 'j_call_light', 'c_call_heavy', 'c_call_light', 'clone_centrality', 'clone_degree', 'clone_id_size', 'mu_freq_heavy', 'mu_freq_light', 'clone_id_size_max_3', 'clone_network_cluster_size_gini', 'clone_network_vertex_size_gini', 'clone_size_gini', 'clone_centrality_gini', 'junction_heavy', 'junction_light', 'junction_aa_heavy', 'junction_aa_light', 'kappa_sum', 'lambda_sum', 'KL_class', 'KL_log_ratio', 'kappa', 'lambda', 'frac_kappa', 'KL_class_log2_ratio_scaled', 'clonality', 'clone_size_score', 'sample_rank_class', 'sample_rank_score', 'sample_frac_B_exhausted', 'clone_call', 'clone_call_id'
    var: 'feature_types', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'Site_colors', 'Status_on_day_collection_summary_colors', 'bcr_QC_pass_colors', 'celltype_B_colors', 'celltype_B_v2_colors', 'filter_bcr_colors', 'hvg', 'initial_clustering_colors', 'isotype_colors', 'leiden', 'leiden_B_colors', 'leiden_colors', 'neighbors', 'pca', 'rank_genes_groups', 'rna_neighbors', 'umap', 'sample_id_colors', 'clone_call_summary'
    obsm: 'X_bcr', 'X_pca', 'X_pca_harmony', 'X_umap'
    varm: 'PCs'
    obsp: 'bcr_connectivities', 'bcr_distances', 'connectivities', 'distances', 'rna_connectivities', 'rna_distances'

[25]:

sc.pl.umap(adata_sub, color=[sample, celltype, 'KL_class', 'clone_call'], ncols=2, wspace=0.3)

../_images/notebooks_001_cell2clone_identify_clonal_expansions_40_0.png

We identify a large expansion of exhausted B cells in this sample, which cell2clone infers to be clonal and kappa light chain-restricted. From KL_class we see how the polyclonal cell population contains a mixture of kappa and lambda B cells, whereas the clonal population contains almost exclusively kappa B cells (in addition to ambiguous cells due to dropouts). Here essentially all exhausted B cells in this sample appear to consist of a single expanded clone.

We next examine BCR sequencing data to confirm the inference.

[26]:

sc.pl.umap(adata_sub, color=['clone_id_size'])

../_images/notebooks_001_cell2clone_identify_clonal_expansions_43_0.png

[27]:

# Find the largest clone and plot it on UMAP

# largest clone
largest_clone = (
    adata_sub.obs["clone_id"]
    .astype(str)
    .value_counts()
    .idxmax()
)

largest_n = (
    adata_sub.obs["clone_id"]
    .astype(str)
    .value_counts()
    .max()
)

# annotation for plotting
adata_sub.obs["largest_clone"] = pd.Categorical(
    adata_sub.obs["clone_id"]
    .astype(str)
    .eq(largest_clone)
    .map({True: largest_clone, False: "other"}),
    categories=["other", largest_clone]
)

# UMAP
sc.pl.umap(
    adata_sub,
    color="largest_clone",
    palette=["lightgrey", "darkred"],
    title=f"Largest clone: {largest_clone} (n={largest_n})"
)

../_images/notebooks_001_cell2clone_identify_clonal_expansions_44_0.png

We next check the other sample suggested to contain a clonal expansion based on cell2clone (in naive B cells).

[28]:

res = c2c.subset_top_bottom_clonal_samples(
    df_clono=df_class,
    adata=adata,
    celltype_col=celltype,
    sample_col=sample,
    frac_celltype="B_naive",
    frac_percentile=0,
    top_n=1,
    bottom_n=0,
    score_agg="max",
)

print("Celltype frac threshold used:", res["frac_threshold"])
print("Top samples:", res["top_samples"])
print("Bottom samples:", res["bottom_samples"])

adata_sub = res["adata_selected"]

Celltype frac threshold used: 0.2052401746724891
Top samples: ['BGCV07_CV0094']
Bottom samples: []

[29]:

# --- define gene families to exclude ---
exclude_regex = (
    r"^TR[ABDG][CDVJ]"   # TCR alpha/beta/delta/gamma (V/C/J)
    r"|^TRAC|^TRBC"     # TCR constant chains
    r"|^IG[HKL][VCJD]" # Ig heavy/kappa/lambda
    r"|^IGH|^IGK|^IGL"  # safety net
)

exclude_mask = adata_sub.var_names.str.contains(
    exclude_regex, regex=True
)

print(f"Excluding {exclude_mask.sum()} TCR/Ig genes from HVG selection")

# --- HVG selection on filtered gene set ---
adata_hvg = adata_sub[:, ~exclude_mask].copy()

sc.pp.highly_variable_genes(adata_hvg)

# --- map HVGs back to full adata_sub ---
hvg = np.zeros(adata_sub.n_vars, dtype=bool)
hvg[~exclude_mask] = adata_hvg.var["highly_variable"].values
adata_sub.var["highly_variable"] = hvg

# --- downstream ---
sc.tl.pca(adata_sub, use_highly_variable=True)
sc.pp.neighbors(adata_sub, use_rep="X_pca")
sc.tl.umap(adata_sub)

Excluding 644 TCR/Ig genes from HVG selection

[30]:

c2c.leiden_chain_clones_sampleaware(
    adata_sub,
    class_col="KL_class",
    valid_states=("kappa","lambda"),
    sample_col=sample,
    resolution=1,
    purity_cutoff=0.9,
    key_added="clone_call",
)

[30]:

AnnData object with n_obs × n_vars = 336 × 31279
    obs: 'sample_id', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'leiden', 'initial_clustering', 'Resample', 'Collection_Day', 'patient_id', 'Sex', 'Age', 'Ethnicity', 'Swab_result', 'Status', 'Smoker', 'Status_on_day_collection', 'Status_on_day_collection_summary', 'Status_3_days_post_collection', 'Status_7_days_post_collection', 'Days_from_onset', 'Site', 'time_after_LPS', 'Worst_Clinical_Status', 'Outcome', 'filter_rna', 'has_bcr', 'filter_bcr_quality', 'filter_bcr_heavy', 'filter_bcr_light', 'bcr_QC_pass', 'filter_bcr', 'initial_clustering_B', 'leiden_B', 'leiden_B2', 'celltype_B', 'celltype_B_v2', 'clone_id', 'clone_id_by_size', 'isotype', 'lightchain', 'status', 'vdj_status', 'productive', 'umi_counts_heavy', 'umi_counts_light', 'v_call_heavy', 'v_call_light', 'j_call_heavy', 'j_call_light', 'c_call_heavy', 'c_call_light', 'clone_centrality', 'clone_degree', 'clone_id_size', 'mu_freq_heavy', 'mu_freq_light', 'clone_id_size_max_3', 'clone_network_cluster_size_gini', 'clone_network_vertex_size_gini', 'clone_size_gini', 'clone_centrality_gini', 'junction_heavy', 'junction_light', 'junction_aa_heavy', 'junction_aa_light', 'kappa_sum', 'lambda_sum', 'KL_class', 'KL_log_ratio', 'kappa', 'lambda', 'frac_kappa', 'KL_class_log2_ratio_scaled', 'clonality', 'clone_size_score', 'sample_rank_class', 'sample_rank_score', 'sample_frac_B_naive', 'clone_call', 'clone_call_id'
    var: 'feature_types', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'Site_colors', 'Status_on_day_collection_summary_colors', 'bcr_QC_pass_colors', 'celltype_B_colors', 'celltype_B_v2_colors', 'filter_bcr_colors', 'hvg', 'initial_clustering_colors', 'isotype_colors', 'leiden', 'leiden_B_colors', 'leiden_colors', 'neighbors', 'pca', 'rank_genes_groups', 'rna_neighbors', 'umap', 'sample_id_colors', 'clone_call_summary'
    obsm: 'X_bcr', 'X_pca', 'X_pca_harmony', 'X_umap'
    varm: 'PCs'
    obsp: 'bcr_connectivities', 'bcr_distances', 'connectivities', 'distances', 'rna_connectivities', 'rna_distances'

[31]:

sc.pl.umap(adata_sub, color=[sample, celltype, 'KL_class', 'clone_call'], ncols=2, wspace=0.3)

../_images/notebooks_001_cell2clone_identify_clonal_expansions_49_0.png

We again identify a large B cell expansion predicted to be clonal and kappa light chain-restricted, in a subset of B cells annotated as B_naive. We again examine BCR sequencing data to confirm the inference.

[32]:

sc.pl.umap(adata_sub, color=['clone_id_size'])

../_images/notebooks_001_cell2clone_identify_clonal_expansions_51_0.png

[33]:

# Find the largest clone and plot it on UMAP

# largest clone
largest_clone = (
    adata_sub.obs["clone_id"]
    .astype(str)
    .value_counts()
    .idxmax()
)

largest_n = (
    adata_sub.obs["clone_id"]
    .astype(str)
    .value_counts()
    .max()
)

# annotation for plotting
adata_sub.obs["largest_clone"] = pd.Categorical(
    adata_sub.obs["clone_id"]
    .astype(str)
    .eq(largest_clone)
    .map({True: largest_clone, False: "other"}),
    categories=["other", largest_clone]
)

# UMAP
sc.pl.umap(
    adata_sub,
    color="largest_clone",
    palette=["lightgrey", "red"],
    title=f"Largest clone: {largest_clone} (n={largest_n})"
)

../_images/notebooks_001_cell2clone_identify_clonal_expansions_52_0.png

[34]:

adata

[34]:

AnnData object with n_obs × n_vars = 8604 × 31279
    obs: 'sample_id', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'leiden', 'initial_clustering', 'Resample', 'Collection_Day', 'patient_id', 'Sex', 'Age', 'Ethnicity', 'Swab_result', 'Status', 'Smoker', 'Status_on_day_collection', 'Status_on_day_collection_summary', 'Status_3_days_post_collection', 'Status_7_days_post_collection', 'Days_from_onset', 'Site', 'time_after_LPS', 'Worst_Clinical_Status', 'Outcome', 'filter_rna', 'has_bcr', 'filter_bcr_quality', 'filter_bcr_heavy', 'filter_bcr_light', 'bcr_QC_pass', 'filter_bcr', 'initial_clustering_B', 'leiden_B', 'leiden_B2', 'celltype_B', 'celltype_B_v2', 'clone_id', 'clone_id_by_size', 'isotype', 'lightchain', 'status', 'vdj_status', 'productive', 'umi_counts_heavy', 'umi_counts_light', 'v_call_heavy', 'v_call_light', 'j_call_heavy', 'j_call_light', 'c_call_heavy', 'c_call_light', 'clone_centrality', 'clone_degree', 'clone_id_size', 'mu_freq_heavy', 'mu_freq_light', 'clone_id_size_max_3', 'clone_network_cluster_size_gini', 'clone_network_vertex_size_gini', 'clone_size_gini', 'clone_centrality_gini', 'junction_heavy', 'junction_light', 'junction_aa_heavy', 'junction_aa_light', 'kappa_sum', 'lambda_sum', 'KL_class', 'KL_log_ratio', 'kappa', 'lambda', 'frac_kappa', 'KL_class_log2_ratio_scaled', 'clonality', 'clone_size_score'
    var: 'feature_types', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'Site_colors', 'Status_on_day_collection_summary_colors', 'bcr_QC_pass_colors', 'celltype_B_colors', 'celltype_B_v2_colors', 'filter_bcr_colors', 'hvg', 'initial_clustering_colors', 'isotype_colors', 'leiden', 'leiden_B_colors', 'leiden_colors', 'neighbors', 'pca', 'rank_genes_groups', 'rna_neighbors', 'umap', 'sample_id_colors'
    obsm: 'X_bcr', 'X_pca', 'X_pca_harmony', 'X_umap'
    obsp: 'bcr_connectivities', 'bcr_distances', 'connectivities', 'distances', 'rna_connectivities', 'rna_distances'

Cluster samples based on clonality across cell types

Finally, we identify groups of samples with clonal expansions in same cell types. This analysis is more informative with larger datasets but the approch is shown here as an example.

[35]:

# Make data frame of sample x celltype-specific clonality/clone size score

def wide_multi(df, value_cols):
    wide = (
        df
        .pivot_table(
            index=sample,
            columns=celltype,
            values=value_cols,
            aggfunc="first"
        )
    )

    # flatten multiindex columns
    wide.columns = [
        f"{ct}_{metric}"
        for metric, ct in wide.columns
    ]

    return wide


df_class_wide = wide_multi(df_class, ["clonality", "clone_size_score"])

[36]:

df_class_wide

[36]:

	B_immature_clonality	B_naive_clonality	B_switched_memory_clonality	B_exhausted_clonality	B_immature_clone_size_score	B_naive_clone_size_score	B_switched_memory_clone_size_score	B_exhausted_clone_size_score
sample_id
AP3	NaN	0.878756	NaN	1.691878	NaN	0.255994	NaN	0.446154
AP9	NaN	0.553235	NaN	NaN	NaN	0.172751	NaN	NaN
BGCV01_CV0025	NaN	0.119299	NaN	NaN	NaN	0.036765	NaN	NaN
BGCV02_CV0052	NaN	0.803771	NaN	NaN	NaN	0.235871	NaN	NaN
BGCV06_CV0037	NaN	0.575064	NaN	NaN	NaN	0.172175	NaN	NaN
BGCV06_CV0234	NaN	0.204084	NaN	NaN	NaN	0.062630	NaN	NaN
BGCV07_CV0094	NaN	3.800402	NaN	NaN	NaN	0.708641	NaN	NaN
BGCV07_CV0134	NaN	1.135484	NaN	NaN	NaN	0.321754	NaN	NaN
BGCV07_CV0137	NaN	0.924538	NaN	NaN	NaN	0.268083	NaN	NaN
BGCV11_CV0257	NaN	1.203777	NaN	NaN	NaN	0.338388	NaN	NaN
MH8919330	NaN	0.000000	NaN	4.335733	NaN	0.000000	NaN	0.738166
MH8919331	NaN	0.457500	NaN	NaN	NaN	0.142905	NaN	NaN
MH9143270	NaN	0.073748	NaN	NaN	NaN	0.022773	NaN	NaN
MH9143275	NaN	0.219991	0.478047	NaN	NaN	0.068555	0.149321	NaN
MH9143322	NaN	0.706672	NaN	NaN	NaN	0.220211	NaN	NaN
MH9143323	NaN	0.584640	NaN	NaN	NaN	0.182509	NaN	NaN
MH9143324	NaN	0.554374	NaN	NaN	NaN	0.166253	NaN	NaN
MH9143327	NaN	0.308122	NaN	NaN	NaN	0.096154	NaN	NaN
MH9143371	NaN	0.346596	NaN	NaN	NaN	0.108205	NaN	NaN
MH9143423	0.744221	0.450727	0.176878	NaN	0.231731	0.140788	0.055070	NaN
MH9143424	1.053549	0.238939	1.369950	NaN	0.324598	0.074486	0.377289	NaN
MH9179821	NaN	0.153794	NaN	NaN	NaN	0.047857	NaN	NaN
MH9179823	NaN	0.367829	NaN	NaN	NaN	0.114854	NaN	NaN
MH9179824	NaN	0.117074	0.048022	NaN	NaN	0.036395	0.014845	NaN
MH9179825	NaN	0.647023	NaN	NaN	NaN	0.192575	NaN	NaN

[37]:

## Create anndata based on clone size score matrix

# 1. Select clone_size_score columns

df_class_wide['sample'] = df_class_wide.index

clone_cols = [c for c in df_class_wide.columns if c.endswith("_clone_size_score")]

# 2. Build sample × celltype matrix
X = (
    df_class_wide[["sample"] + clone_cols]
    .groupby("sample")
    .mean()              # in case sample appears multiple times
)

# 3. Replace NaNs (absence of that celltype in sample at sufficient numbers → 0 clonality)
X = X.fillna(0)

# 4. Create AnnData
adata_clone = sc.AnnData(X)
adata_clone.obs_names = X.index
adata_clone.var_names = X.columns

[38]:

# 5. Highly variable features
sc.pp.highly_variable_genes(adata_clone)

# 6. PCA
sc.tl.pca(adata_clone, svd_solver="arpack")

# 7. Neighbours
sc.pp.neighbors(adata_clone, n_neighbors=15, n_pcs=20) # increase neigbour count when using with more samples

# 8. UMAP
sc.tl.umap(adata_clone)

# 9. Leiden clustering
sc.tl.leiden(adata_clone, resolution=1)

# 10. Plot
sc.pl.umap(adata_clone, color=["leiden"], size=200)

../_images/notebooks_001_cell2clone_identify_clonal_expansions_59_0.png

[39]:

sc.pl.umap(adata_clone, color=adata_clone.var_names, size=200)

../_images/notebooks_001_cell2clone_identify_clonal_expansions_60_0.png

Using this approach, we can visualise cell type-specific clonality across donors and identify groups of donors where certain cell types are clonally expanded.