SynthST for generation of synthetic spatial gene expression - DLPFC 151508

# Restart the runtime after every run
!git clone https://github.com/Zafar-Lab/spDDB.git
Cloning into 'spDDB'...
remote: Enumerating objects: 230, done.
remote: Counting objects: 100% (230/230), done.
remote: Compressing objects: 100% (187/187), done.
remote: Total 230 (delta 64), reused 196 (delta 42), pack-reused 0 (from 0)
Receiving objects: 100% (230/230), 28.38 MiB | 23.78 MiB/s, done.
Resolving deltas: 100% (64/64), done.
%cd spDDB/SynthST/Simulator_gene_expression/
!ls
/content/spDDB/SynthST/Simulator_gene_expression
Simulate_gene_expression_notebook.ipynb  simulate_gene_expression.py

Mounting Google Drive to access the input data

from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive

Installing Libraries

!pip install scanpy
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires pandas==2.2.2, but you have pandas 2.3.3 which is incompatible.
Successfully installed anndata-0.12.16 array-api-compat-1.14.0 donfig-0.8.1.post1 fast-array-utils-1.4.1 legacy-api-wrap-1.5 numcodecs-0.16.5 pandas-2.3.3 scanpy-1.12.1 scverse-misc-0.0.7 session-info2-0.4.1 zarr-3.2.1
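
The install log above reports a pandas version conflict with `google-colab`, so it can help to record which versions ended up in the runtime. A minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of spDDB:

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version or None} for the given distribution names."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


# Report the packages mentioned in the pip output above
for name, version in installed_versions(["scanpy", "anndata", "pandas"]).items():
    print(f"{name}: {version if version is not None else 'not installed'}")
```

Saving this output next to the simulated data makes it easier to reproduce a run later.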
import numpy as np
import pandas as pd
import scanpy as sc
import matplotlib.pyplot as plt
import anndata
import scipy
import seaborn as sns
from matplotlib.pyplot import figure
from sklearn.preprocessing import MinMaxScaler
import torch
from simulate_gene_expression import *

np.random.seed(0)
torch.manual_seed(0)
<torch._C.Generator at 0x7a85246c4b70>
"""
Guidelines:
scale_factor: user-defined hyperparameter, default = 10000.
Large datasets (SlideseqV2/VisiumHD, number of cells >= 20000): 100

All datasets use scale_factor = 10000, except the following:
Prostate Cancer, Hippocampus, Cerebellum, Kidney Human6 = 100
Lung cancer Visium HD and Kidney Human1 = 1000
No sampling from the multivariate Gaussian is performed for VisiumHD datasets due to memory constraints (the original UMI counts are used),
i.e. the sample_UMI() function is skipped and original_umi_counts is passed in place of sampled_umi to the scale_anndata() function.

For certain datasets, Average_SynthST_ReX_Norm is used in place of Average_SynthST_ReX.
"""
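
The guidelines above can be written down as a small lookup. This is only a sketch: `choose_scale_factor` and the dictionary keys are hypothetical labels, and letting the named exceptions take precedence over the size rule is an assumption (the guidelines do not state an order).

```python
DEFAULT_SCALE_FACTOR = 10000
LARGE_DATASET_SCALE_FACTOR = 100
LARGE_DATASET_MIN_CELLS = 20000

# Values copied from the guidelines; the keys are illustrative labels only
SCALE_FACTOR_OVERRIDES = {
    "Prostate Cancer": 100,
    "Hippocampus": 100,
    "Cerebellum": 100,
    "Kidney Human6": 100,
    "Lung cancer Visium HD": 1000,
    "Kidney Human1": 1000,
}


def choose_scale_factor(dataset_label, n_cells):
    """Pick a scale_factor from the dataset label and its number of cells/spots."""
    # Assumption: a named exception wins over the generic size rule
    if dataset_label in SCALE_FACTOR_OVERRIDES:
        return SCALE_FACTOR_OVERRIDES[dataset_label]
    if n_cells >= LARGE_DATASET_MIN_CELLS:
        return LARGE_DATASET_SCALE_FACTOR
    return DEFAULT_SCALE_FACTOR


print(choose_scale_factor("DLPFC", 4382))  # 10000
```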

Setting up paths

dataset_name = "DLPFC"
cell_type_col = "cell_type"
scale_factor = 10000
simulator_column = "Average_SynthST_ReX_Norm"

data_path = "/content/drive/MyDrive/Major_project/Benchmarking_Shared/spDDB_tutorials/1_data/"
cell2loc_st = data_path + "cell2location_parameters/st_151508.h5ad"

signature_path = data_path + "reference_signatures/all_data.h5ad"

st_path = data_path + "ST.h5ad"
simulated_data_path = data_path + "synthetic_spatial_gene_exp/simulated_spatial_gene_expression.h5ad"
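
Because every input is read from Google Drive, a quick existence check before loading can save a failed run. A minimal sketch, assuming a hypothetical helper `missing_inputs`; the demo uses a temporary directory rather than the real Drive paths:

```python
import tempfile
from pathlib import Path


def missing_inputs(paths):
    """Return the paths (as strings) that do not exist on disk."""
    return [str(p) for p in paths if not Path(p).exists()]


# Demo: one file is created, the other is left missing
with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / "ST.h5ad"
    present.touch()
    absent = Path(tmp) / "all_data.h5ad"
    print(missing_inputs([present, absent]))
```

In the notebook, the same call could be made with `[cell2loc_st, signature_path, st_path]`.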

sim_adata = sc.read_h5ad(data_path + "output_CTP/simulated_st.h5ad")
ctp_var_names = sim_adata.var_names
# ST Output of Cell2location
sam = sc.read_h5ad(cell2loc_st)
ctps = sam.obsm["q05_cell_abundance_w_sf"].sum(axis = 1)
dt_N = pd.DataFrame({"spot_id" : ctps.index, "sum" : ctps.values})
m_g = sam.uns["mod"]["post_sample_means"]["m_g"][0]
#N_s  = sam.uns["mod"]["post_sample_means"]["n_s_cells_per_location"]
print ("m_g", m_g)

# Output of SynthST
# Note: the filtered frame below is not assigned, so dt_N itself is left unchanged.
dt_N[dt_N["spot_id"].isin(sim_adata.obs_names)]
N_cell2loc = np.ceil(dt_N[dt_N.columns[1:]].sum(axis = 1))
#print (np.min(N_cell2loc), np.max(N_cell2loc))
obs_names = sim_adata.obs_names
decon_matrix = sim_adata.obsm[simulator_column]
spot_barcodes = sim_adata.obs_names
num_spots = len(decon_matrix)
num_celltypes = decon_matrix.shape[1]

# Original Spatial dataset
original_st_data = sc.read_h5ad(st_path)
original_st_data = original_st_data[original_st_data.obs_names.isin(spot_barcodes), :]

cov_matrix = np.zeros((num_spots,num_spots))
tissue_positions_list = original_st_data.obsm["spatial"]

# Single cell data signature
ann = sc.read_h5ad(signature_path)
ann = ann[:, ann.var_names.isin(sam.var_names)]
m_g [0.750724   1.3788478  0.7270636  ... 0.88990664 1.3223586  0.73166245]
/usr/local/lib/python3.12/dist-packages/anndata/_core/anndata.py:1884: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")
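
The cell above derives a per-spot cell count by summing the cell2location abundances across cell types and rounding up. A toy sketch of that arithmetic with NumPy; the spot names and values are made up, and restricting to the simulator's spots before summing is shown as one possible way to keep the two tables aligned:

```python
import numpy as np

# Toy stand-in for sam.obsm["q05_cell_abundance_w_sf"]: rows are spots, columns are cell types
spot_ids = np.array(["spot_a", "spot_b", "spot_c"])
abundance = np.array([
    [0.4, 1.3, 2.1],
    [0.0, 0.2, 0.5],
    [3.2, 0.1, 1.9],
])

# Toy stand-in for sim_adata.obs_names (spots kept by the simulator)
kept_spots = np.array(["spot_a", "spot_c"])

# Keep only the simulator's spots, then sum over cell types and round up
keep = np.isin(spot_ids, kept_spots)
cells_per_spot = np.ceil(abundance[keep].sum(axis=1))
print(dict(zip(spot_ids[keep], cells_per_spot)))
```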

Generating Simulated datasets


# Scale the number of cells in each spot to the range 5-15.
cells_at_spot = plot_histogram(N_cell2loc)
#cells_at_spot.to_csv(simulated_data_path + "simulated_cells_at_spot.csv")
print ("Generating simulated gene expression data")
adata = generate_expression(ann, sam, original_st_data, decon_matrix, cells_at_spot, ctp_var_names, m_g, spot_barcodes)
Generating simulated gene expression data
(52556, 11403)
(11403, 17)
../_images/a118a215aae2b9b5d2321229245dced31f13f35f897879e5fd397e1c4d234857.png
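
The comment in the cell above says the per-spot cell counts are scaled to the range 5-15. One way to do that is a min-max rescale; the sketch below is an assumption about the idea, not the code inside `plot_histogram`, and `rescale_counts` is a hypothetical name:

```python
import numpy as np


def rescale_counts(counts, low=5, high=15):
    """Min-max rescale counts into [low, high] and round to whole cells."""
    counts = np.asarray(counts, dtype=float)
    span = counts.max() - counts.min()
    if span == 0:
        # All spots have the same count: place them at the middle of the range
        return np.full(counts.shape, int(round((low + high) / 2)))
    scaled = low + (counts - counts.min()) / span * (high - low)
    return np.rint(scaled).astype(int)


print(rescale_counts([1, 6, 11]))  # [ 5 10 15]
```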

Downsampling the simulated datasets

print ("Compute distance matrix")
distance_matrix = compute_distane_matrix(num_spots, tissue_positions_list)

print ("Sample UMI from Multi-variate Gaussian distribution")
sampled_umi, original_umi_counts = sample_UMI(num_spots, distance_matrix, original_st_data, cov_matrix, scale_factor)
Compute distance matrix
Sample UMI from Multi-variate Gaussian distribution
0.5050671320639292 1.8163958966697298
../_images/3c55f4eef99e73d4142176be619591e5ccf5e2bb7dfe5e537d4657ac87a8a7f9.png
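
The two steps above (a spot-to-spot distance matrix, then a draw from a multivariate Gaussian) can be illustrated on four toy spots. This is a sketch under stated assumptions and not the code in `simulate_gene_expression.py`: the squared-exponential kernel, the `length_scale` and `variance` values, and sampling in log space are all illustrative choices.

```python
import numpy as np


def pairwise_distances(coords):
    """Euclidean distance between every pair of spot coordinates."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def sample_spatially_smooth(mean, distances, length_scale, variance, rng):
    """Draw one sample whose covariance decays with distance (assumed kernel)."""
    cov = variance * np.exp(-(distances ** 2) / (2.0 * length_scale ** 2))
    cov += 1e-8 * np.eye(len(mean))  # small jitter for numerical stability
    return rng.multivariate_normal(mean, cov)


rng = np.random.default_rng(0)
coords = np.array([[0, 0], [0, 1], [1, 0], [5, 5]])
dist = pairwise_distances(coords)

# Toy per-spot UMI totals; nearby spots receive correlated perturbations
log_umi = np.log(np.array([1200.0, 1500.0, 900.0, 3000.0]))
sampled = sample_spatially_smooth(log_umi, dist, length_scale=2.0, variance=0.05, rng=rng)
print(np.exp(sampled).round())
```

The covariance matrix has one row and column per spot, so it grows quadratically with the number of spots, which is consistent with the guideline to skip this step for VisiumHD datasets.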
print ("Scale Anndata")
adata = scale_anndata(adata, original_st_data, sampled_umi, original_umi_counts)
Scale Anndata
scaling  0.025484677667040788 2.3554850087601245
False
/content/spDDB/SynthST/Simulator_gene_expression/simulate_gene_expression.py:174: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.
  original_st_data.obs['umi_counts_actual'] = original_umi_counts
/usr/local/lib/python3.12/dist-packages/anndata/_core/anndata.py:1884: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")
WARNING: saving figure to file figures/spatialoriginal umi.png
/content/spDDB/SynthST/Simulator_gene_expression/simulate_gene_expression.py:179: FutureWarning: Argument `save` is deprecated and will be removed in a future version. Use `sc.pl.plot(show=False).figure.savefig()` instead.
  sc.pl.embedding(original_st_data, basis = "spatial", color = ["umi_counts_actual", "umi_sampled"], cmap='Reds',
../_images/2acf4e29fd2015501a3b9c5661d522190a9ae642a38ca742a90c7c2be8049e33.png
WARNING: saving figure to file figures/spatialoriginal umi_spectral.png
/content/spDDB/SynthST/Simulator_gene_expression/simulate_gene_expression.py:181: FutureWarning: Argument `save` is deprecated and will be removed in a future version. Use `sc.pl.plot(show=False).figure.savefig()` instead.
  sc.pl.embedding(original_st_data, basis = "spatial", color = ["diff"], cmap='Spectral',
../_images/9d32ed03b8e41f5071d72b8fe72d255967b76a30a8f34a2f8d9d439c9c677e7f.png
WARNING: saving figure to file figures/spatialsimulated umi.png
/content/spDDB/SynthST/Simulator_gene_expression/simulate_gene_expression.py:184: FutureWarning: Argument `save` is deprecated and will be removed in a future version. Use `sc.pl.plot(show=False).figure.savefig()` instead.
  sc.pl.embedding(adata, basis = "spatial", color = ["umi_count_before_scaling", "umi_count_after_scaling"], cmap='Reds',
../_images/5f66e5fe31c1639d413dd52ea41cb7aa74b4809badb338cbf19f5135515c28c3.png
WARNING: saving figure to file figures/spatialsimulated umi_spectral.png
/content/spDDB/SynthST/Simulator_gene_expression/simulate_gene_expression.py:186: FutureWarning: Argument `save` is deprecated and will be removed in a future version. Use `sc.pl.plot(show=False).figure.savefig()` instead.
  sc.pl.embedding(adata, basis = "spatial", color = ["diff"], cmap='Spectral',
../_images/08d4d114be1a4a5fb8c394abc518cdfc297c27e01629ca9fa14a59299aeece0d.png
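
The scaling step reports a minimum and maximum per-spot scaling factor and stores UMI counts before and after scaling. A minimal sketch of per-spot rescaling so that each row sums to a target UMI count; `scale_to_target` is a hypothetical helper, and the actual `scale_anndata` may compute its factors differently:

```python
import numpy as np


def scale_to_target(counts, target_umi):
    """Scale each spot (row) so its total matches the target UMI count."""
    counts = np.asarray(counts, dtype=float)
    target_umi = np.asarray(target_umi, dtype=float)
    current = counts.sum(axis=1)
    # Spots with no counts get a factor of 0 instead of a division by zero
    factors = np.divide(target_umi, current,
                        out=np.zeros_like(target_umi), where=current > 0)
    return counts * factors[:, None], factors


simulated = np.array([[2.0, 2.0], [10.0, 30.0]])
scaled, factors = scale_to_target(simulated, target_umi=[8.0, 20.0])
print("scaling", factors.min(), factors.max())
print(scaled.sum(axis=1))
```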
adata.X = np.round(adata.X)
adata.X = adata.X.astype(np.uint64)
print (np.min(adata.X))
0
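
The cell above rounds the scaled expression and casts it to `np.uint64`, then prints the minimum as a sanity check. Since an unsigned type cannot represent negative values, checking before the cast is a little safer than checking after. A sketch with a hypothetical helper `to_count_matrix`, assuming a dense array:

```python
import numpy as np


def to_count_matrix(values):
    """Round to the nearest integer and store as unsigned counts."""
    values = np.asarray(values, dtype=float)
    if np.any(values < 0):
        raise ValueError("negative values cannot be stored as unsigned counts")
    # np.rint rounds halves to the nearest even integer, like np.round
    return np.rint(values).astype(np.uint64)


counts = to_count_matrix([[0.4, 1.5], [2.5, 3.6]])
print(counts.dtype, counts.min())
```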
print (adata)
adata.write_h5ad(simulated_data_path)
AnnData object with n_obs × n_vars = 4382 × 11403
    obs: 'in_tissue', 'array_row', 'array_col', 'scaling_factors', 'umi_count_before_scaling', 'umi_count_after_scaling', 'diff'
    uns: 'spatial'
    obsm: 'spatial'