Phenocoder

The main class driving the end-to-end spatial phenotyping workflow.

class phenocoder.Phenocoder(**kwargs)[source]

Bases: object

A class for performing unsupervised morphometric spatial phenotyping on microscopy image data.

The Phenocoder class provides a complete workflow for spatial phenotyping using variational autoencoders. It supports both conditional and non-conditional models, and integrates with SpatialData objects for handling spatial omics data.

Parameters:

kwargs (Any)

sdata

The SpatialData object containing spatial omics data.

Type:

SpatialData | None

adata

AnnData object (deprecated, data should be in sdata.tables).

Type:

AnnData | None

model

The variational autoencoder model for phenotyping.

Type:

CVAE | CondCVAE | None

model_dir

Path to model directory for saving/loading models.

Type:

str | Path | None

model_oh_enc

One-hot encoder for conditional model inputs.

model_config

Configuration parameters for the model.

Type:

dict | None

sample_key

The key for identifying samples in the SpatialData object.

Type:

str | None

data_dir

Directory path for dataset storage.

Type:

str | Path | None

data_generator_train

Training data generator for model training.

data_generator_val

Validation data generator for model training.

df_conditions

DataFrame containing condition information for conditional models.

Type:

DataFrame | None

Example

>>> phenocoder = Phenocoder(
...     table_key="nuclei_features", sample_key="well", image_key="IF"
... )
>>> phenocoder.add_sdata(sdata)
>>> phenocoder.generate_dataset(
...     dataset="dataset_1",
...     dir_dataset="data/phenocoder",
...     spatial_key_index="spatial_index",
... )
>>> phenocoder.initialize_model(n_latent_dim=64, n_dense_dim=256, conditions=[])
>>> phenocoder.train(n_epochs=100)
>>> phenocoder.encode()  # writes latents to sdata.tables['phenocoder']
__init__(**kwargs)[source]

Initialize a new Phenocoder instance.

Parameters:

**kwargs (Any) – Optional keyword arguments to set initial attributes. Supported keys:

  • sdata: SpatialData instance

  • adata: AnnData instance

  • model: preconstructed model (CVAE or CondCVAE)

  • model_dir: path to model directory

  • model_oh_enc: one-hot encoder for conditional models

  • model_config: model configuration dict or path to config

  • sample_key: key used to identify samples in sdata tables

  • spatial_key: spatial key name/index

  • table_key: table key in sdata.tables

  • data_dir: base directory for datasets

  • datasets: list of dataset identifiers

  • data_generator_train: training data generator

  • data_generator_val: validation data generator

  • df_conditions: DataFrame of condition labels

  • image_key: key of images in sdata.images

Return type:

None

Any attributes not provided in kwargs will be initialized to None.
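
The behaviour described above (recognized keys become attributes, missing ones default to None) can be pictured with the following minimal sketch. The class name KwargsInit, the shortened attribute list, and the handling of unrecognized keys (silently ignored here) are assumptions for illustration, not Phenocoder's actual code.

```python
class KwargsInit:
    """Hypothetical stand-in showing the kwargs-to-attributes pattern."""

    # Shortened, illustrative subset of the supported keys listed above.
    _ATTRS = ("sdata", "adata", "model", "model_dir", "sample_key", "table_key")

    def __init__(self, **kwargs):
        for name in self._ATTRS:
            # Attributes missing from kwargs fall back to None.
            setattr(self, name, kwargs.get(name))


obj = KwargsInit(sample_key="well", table_key="nuclei_features")
print(obj.sample_key, obj.sdata)
```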

__repr__()[source]

Return a formatted string representation of the Phenocoder instance.

Returns:

A summary of the Phenocoder object’s structure and configuration.

Return type:

str

add_sdata(sdata)[source]

Add a SpatialData object to the Phenocoder instance.

Parameters:

sdata (SpatialData) – The SpatialData object containing spatial omics data and microscopy images to be processed.

Returns:

None

Return type:

None

Example

>>> phenocoder = Phenocoder()
>>> phenocoder.add_sdata(sdata)
generate_dataset(dataset, dir_dataset, patch_size=(128, 128), spatial_key_index=None, scale=True, metadata_keys=None, scale_percentile=1, scale_per_sample=True)[source]

Generate an image patch dataset for phenotyping from input microscopy images.

Creates image patches from input images and segmentation masks, with options for sampling strategies and multi-channel processing. The generated dataset is used for training the variational autoencoder model.

Parameters:
  • dataset (str) – Name/identifier for the dataset being generated.

  • dir_dataset (str | Path) – Directory path for storing the generated dataset.

  • patch_size (tuple[int, int], optional) – Patch (height, width) extracted around each object. Must match the height/width of the model’s input_shape. Defaults to (128, 128).

  • spatial_key_index (str | None, optional) – Spatial key index to use (integer coordinates relating to the z-index in the image array). If None, uses the instance's spatial_key attribute. Defaults to None.

  • scale (bool, optional) – Whether to scale the image patches. Defaults to True.

  • metadata_keys (list[str] | None, optional) – Additional columns from the table’s .obs to carry into patches.csv so they can be used as conditioning variables (e.g. a sample/donor column). Defaults to None.

  • scale_percentile (float, optional) – Lower percentile (0-100) of the per-slice intensity range used in normalization; the upper bound uses 100 - scale_percentile. Defaults to 1 (1/99 stretch).

  • scale_per_sample (bool, optional) – If True (default), normalize each sample to its own intensity range (per sample+channel). If False, use one global range per channel across all samples (original behaviour). NOTE: pass the SAME value to encode so training and inference scale identically.

Returns:

None

Return type:

None

Example

>>> phenocoder.generate_dataset(
...     dataset="experiment_001",
...     dir_dataset="/path/to/datasets",
...     patch_size=(32, 32),
...     spatial_key_index="spatial_index",
... )
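
As a rough illustration of what scale_percentile and scale_per_sample control, the sketch below applies a percentile stretch to one channel, using either a per-sample range or one range pooled across samples. The helper names (percentile, stretch, scale_channel), the linear-interpolation percentile, and the clipping to [0, 1] are assumptions made for this example; the library's own normalization may differ.

```python
def percentile(values, q):
    """Linear-interpolation percentile of a list of numbers (q in 0-100)."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def stretch(values, low, high):
    """Map [low, high] onto [0, 1]; clipping is an assumption of this sketch."""
    span = (high - low) or 1.0  # avoid division by zero on flat input
    return [min(1.0, max(0.0, (v - low) / span)) for v in values]


def scale_channel(samples, scale_percentile=1, scale_per_sample=True):
    """samples: dict mapping sample id -> list of intensities for one channel."""
    if scale_per_sample:
        # Each sample gets its own low/high range.
        ranges = {
            s: (percentile(v, scale_percentile), percentile(v, 100 - scale_percentile))
            for s, v in samples.items()
        }
    else:
        # One shared range computed over all samples pooled together.
        pooled = [x for v in samples.values() for x in v]
        shared = (
            percentile(pooled, scale_percentile),
            percentile(pooled, 100 - scale_percentile),
        )
        ranges = {s: shared for s in samples}
    return {s: stretch(v, *ranges[s]) for s, v in samples.items()}


dim = list(range(101))                    # dim sample: intensities 0..100
bright = [10 * x for x in range(101)]     # bright sample: intensities 0..1000
per_sample = scale_channel({"a": dim, "b": bright}, scale_per_sample=True)
pooled = scale_channel({"a": dim, "b": bright}, scale_per_sample=False)
```

With per-sample scaling both samples span the full output range; with a pooled range the dim sample stays compressed near zero, which is why the same setting should be used for training and encoding.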
initialize_model(n_latent_dim, n_dense_dim, conditions, dropout=0.25, batch_size=64, n_workers=1, input_shape=(128, 128, 4), conv_layers=(8, 16, 32, 64, 128), beta=0.01, flip=False, shuffle=True)[source]

Initialize a CVAE or conditional CVAE model with specified parameters.

Sets up the model architecture, data generators, and saves configuration files. Creates model directory structure and prepares the model for training.

Parameters:
  • n_latent_dim (int) – Dimensionality of the latent space.

  • n_dense_dim (int) – Dimensionality of dense layers in the model.

  • conditions (list[str]) – List of column names in the data to use as conditions for the conditional VAE. If an empty list or None is passed, a non-conditional CVAE is used.

  • dropout (float, optional) – Dropout rate for regularization. Defaults to 0.25.

  • batch_size (int, optional) – Batch size for training. Defaults to 64.

  • n_workers (int, optional) – Number of workers for data loading. Defaults to 1.

  • input_shape (tuple[int, int, int], optional) – Input image shape (height, width, channels). Defaults to (128, 128, 4).

  • conv_layers (tuple[int, ...], optional) – Number of filters in each convolutional layer. Defaults to (8, 16, 32, 64, 128).

  • beta (float, optional) – Beta parameter for beta-VAE (controls KL divergence weight). Defaults to 0.01.

  • flip (bool, optional) – Whether to flip images horizontally during training. Defaults to False.

  • shuffle (bool, optional) – Whether to shuffle the training data. Defaults to True.

Returns:

None

Return type:

None

Example

>>> # Non-conditional model
>>> phenocoder.initialize_model(
...     n_latent_dim=64,
...     n_dense_dim=256,
...     conditions=[],
...     dropout=0.25,
...     beta=0.01
... )
>>> # Conditional model
>>> phenocoder.initialize_model(
...     n_latent_dim=64,
...     n_dense_dim=256,
...     conditions=['dataset', 'z'],
...     dropout=0.25,
...     beta=0.01
... )
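
To make the role of beta concrete, here is a minimal sketch of the standard beta-VAE objective, in which beta weights the KL divergence against the reconstruction error. The function names and the use of a diagonal Gaussian posterior with a standard normal prior are assumptions; the exact loss implemented by the CVAE/CondCVAE models is not documented here.

```python
import math


def kl_divergence(mu, log_var):
    """KL(N(mu, exp(log_var)) || N(0, 1)), summed over latent dimensions."""
    return -0.5 * sum(
        1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var)
    )


def beta_vae_loss(reconstruction_error, mu, log_var, beta=0.01):
    """Reconstruction term plus beta-weighted KL term."""
    return reconstruction_error + beta * kl_divergence(mu, log_var)


# A small beta (such as the 0.01 default) lets reconstruction dominate.
loss = beta_vae_loss(2.0, mu=[1.0], log_var=[0.0], beta=0.01)
```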
load_model()[source]

Load a pre-trained phenocoder model from disk.

Reconstructs the model architecture from saved configuration and loads the trained weights. Also loads the one-hot encoder for conditional models.

Returns:

None

Return type:

None

Note

Requires model_config to be set to the path of the configuration file.

Example

>>> phenocoder.model_config = "path/to/config.yaml"
>>> phenocoder.load_model()
summarize_model()[source]

Display model architecture and configuration summary.

Prints the model configuration parameters and architecture summaries for both encoder and decoder components.

Returns:

None

Return type:

None

Example

>>> phenocoder.summarize_model()
train(n_epochs=100, learning_rate=0.001, min_learning_rate=0.0001, factor_learning_rate=0.2, learning_rate_patience=3, early_stopping_patience=5, plot=True, n_preview=300, plot_frac=0.001)[source]

Train the initialized model with specified hyperparameters and callbacks.

Performs model training with early stopping, learning rate reduction, and TensorBoard logging. Optionally generates visualization plots.

Parameters:
  • n_epochs (int, optional) – Maximum number of training epochs. Defaults to 100.

  • learning_rate (float, optional) – Initial learning rate for optimizer. Defaults to 0.001.

  • min_learning_rate (float, optional) – Minimum learning rate for learning rate scheduler. Defaults to 0.0001.

  • factor_learning_rate (float, optional) – Multiplicative factor by which the learning rate is reduced on plateau (ReduceLROnPlateau). Defaults to 0.2.

  • learning_rate_patience (int, optional) – Number of epochs without improvement before reducing learning rate. Defaults to 3.

  • early_stopping_patience (int, optional) – Number of epochs without improvement before stopping training. Defaults to 5.

  • plot (bool, optional) – Whether to generate visualization plots after training. Defaults to True.

  • n_preview (int, optional) – Number of samples to use for reconstruction plots. Defaults to 300.

  • plot_frac (float, optional) – Fraction of data to use for latent space visualization. Defaults to 0.001.

Returns:

None

Return type:

None

Example

>>> phenocoder.train(
...     n_epochs=200,
...     learning_rate=0.0005,
...     early_stopping_patience=10,
...     plot=True
... )
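
The interaction between learning_rate_patience, factor_learning_rate, min_learning_rate and early_stopping_patience can be explored with a simplified simulation. The function simulate_schedule is hypothetical and only mimics generic reduce-on-plateau and early-stopping logic; the callbacks used internally may differ in details such as improvement thresholds or cooldown.

```python
def simulate_schedule(val_losses, lr=0.001, min_lr=0.0001, factor=0.2,
                      lr_patience=3, stop_patience=5):
    """Return [(epoch, learning_rate), ...] for the epochs that would run."""
    best = float("inf")
    lr_wait = 0    # epochs without improvement since the last LR change
    stop_wait = 0  # epochs without improvement since the best loss
    history = []
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            lr_wait = 0
            stop_wait = 0
        else:
            lr_wait += 1
            stop_wait += 1
            if lr_wait >= lr_patience:
                # Reduce the learning rate, but never below the floor.
                lr = max(lr * factor, min_lr)
                lr_wait = 0
        history.append((epoch, lr))
        if stop_wait >= stop_patience:
            break  # early stopping
    return history


# Validation loss improves twice, then plateaus.
history = simulate_schedule([1.0, 0.9] + [0.95] * 7)
for epoch, lr in history:
    print(epoch, lr)
```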
encode(batch_size=64, scale=True, spatial_key_index=None, scale_percentile=1, scale_per_sample=True, spatial_message_passing_radius=None)[source]

Encode nuclei patches into latent space representations using the trained model.

Processes all samples in the SpatialData object, extracts nuclei patches, and encodes them into the learned latent space. Results are aggregated by nucleus label and written to self.sdata.tables['phenocoder'] as an AnnData object.

Parameters:
  • batch_size (int, optional) – Batch size for encoding predictions. Defaults to 64.

  • scale (bool, optional) – Whether to intensity-scale patches before encoding. Used only when the patch_generator is (re)built here. Defaults to True.

  • spatial_key_index (str | None, optional) – obsm key (integer z-index coords) used to extract patches. If None, falls back to self.spatial_key. Defaults to None.

  • scale_percentile (float, optional) – Percentile (0-100) for per-slice low/high, used only when the patch_generator is (re)built here. MUST match the value used in generate_dataset for this model’s dataset. Defaults to 1.

  • scale_per_sample (bool, optional) – Per-sample vs global normalization, used only when the patch_generator is (re)built here. MUST match generate_dataset for this model’s dataset, else inference scales differently than training. Defaults to True.

  • spatial_message_passing_radius (int | None, optional) – If set, smooth each sample’s latents over a spatial neighborhood graph of this radius (degree-normalized aggregation), stored in .layers['spatial_message_passing']. If None, no message passing is applied. Defaults to None.

Returns:

None. The encoded latents are written to self.sdata.tables['phenocoder'] as an AnnData object (latents in .X, object metadata carried in .obs/.obsm, latent dimensions named phc_latent_{i} in .var).

Return type:

None

Note

This method contains dataset-specific code (the 'z'/'dataset' conditions) that should be generalized.

Example

>>> phenocoder.encode(batch_size=128, spatial_message_passing_radius=50)
>>> phenocoder.sdata.tables['phenocoder']  # AnnData of latents
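
The smoothing enabled by spatial_message_passing_radius can be sketched as a degree-normalized average of latent vectors over a radius graph. This is a conceptual illustration only: the function smooth_latents is hypothetical, and including each point in its own neighborhood (a self-loop) is an assumption that may not match the actual aggregation.

```python
import math


def smooth_latents(coords, latents, radius):
    """Average each latent vector over all points within `radius`."""
    smoothed = []
    for ci in coords:
        # Neighbors within the radius; the point itself is included
        # (distance 0), which is an assumption of this sketch.
        nbrs = [j for j, cj in enumerate(coords) if math.dist(ci, cj) <= radius]
        dim = len(latents[0])
        # Degree normalization: divide the summed latents by the neighbor count.
        smoothed.append(
            [sum(latents[j][d] for j in nbrs) / len(nbrs) for d in range(dim)]
        )
    return smoothed


coords = [(0.0, 0.0), (10.0, 0.0), (100.0, 0.0)]
latents = [[1.0, 0.0], [3.0, 2.0], [10.0, 10.0]]
smoothed = smooth_latents(coords, latents, radius=50)
```

The first two points lie within 50 units of each other and end up with the same averaged latent, while the isolated third point keeps its original values.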
spatialgraph_stats(cluster_key='leiden', spatial_key='spatial', radii=(25, 50), table_key=None, stats=None, chull_min_nds=10, chull_min_degree=3, use_subunits=False, dim_subunit=(500, 500, 100), min_obs_per_subunit=100, max_obs_per_subunit=None, verbose=False)[source]

Generate statistics for spatial neighborhood graphs of each sample or subunit.

Computes spatial graph-based statistics such as neighborhood composition, spatial clustering coefficients, and other graph-based metrics for each sample (or spatial subunit within samples) using the SpatialGraphAnalyzer.

Parameters:
  • cluster_key (str, optional) – Key in adata.obs containing cluster labels. Defaults to ‘leiden’.

  • spatial_key (str, optional) – Key in adata.obsm containing spatial coordinates. Defaults to ‘spatial’.

  • radii (tuple[int, ...], optional) – Tuple of radii to use for spatial neighbor calculations. Defaults to (25, 50).

  • table_key (str | None, optional) – Key in sdata.tables to analyze. If None, uses self.table_key. Defaults to None.

  • stats (list[str] | None, optional) – Which stat groups to compute. Valid options: ‘interactions’, ‘centrality’, ‘connectivity’, ‘moran_features’, ‘moran_clusters’, ‘chull’. If None, all groups are computed. Defaults to None.

  • chull_min_nds (int, optional) – Minimum number of nodes per connected component for convex-hull statistics. Only used if ‘chull’ is in stats. Defaults to 10.

  • chull_min_degree (int, optional) – Minimum node degree before extracting convex-hull connected components. Only used if ‘chull’ is in stats. Defaults to 3.

  • use_subunits (bool, optional) – Whether to partition samples into spatial subunits and compute statistics per subunit instead of per sample. Defaults to False.

  • dim_subunit (tuple[int, int, int], optional) – Dimensions (x, y, z) of each spatial subunit in micrometers. Only used if use_subunits=True. Defaults to (500, 500, 100).

  • min_obs_per_subunit (int, optional) – Minimum number of observations required per subunit. Subunits with fewer observations are filtered out. Only used if use_subunits=True. Defaults to 100.

  • max_obs_per_subunit (int | None, optional) – Maximum number of observations per subunit. Subunits with more observations are randomly subsampled. If None, no subsampling is performed. Only used if use_subunits=True. Defaults to None.

  • verbose (bool, optional) – Whether to print progress information during subunit partitioning. Defaults to False.

Returns:

None

Raises:
  • ValueError – If table_key is not specified and self.table_key is None.

  • ValueError – If cluster_key is not found in the table’s obs.

Return type:

None

Note

Results are stored in self.adata as an AnnData object with one row per sample (if use_subunits=False) or per subunit (if use_subunits=True) containing all computed spatial statistics.

Example

>>> # Sample-level analysis
>>> phenocoder.spatialgraph_stats(
...     cluster_key='leiden',
...     radii=(25, 50, 100)
... )
>>> # Subunit-level analysis
>>> phenocoder.spatialgraph_stats(
...     cluster_key='leiden',
...     radii=(25, 50),
...     use_subunits=True,
...     dim_subunit=(500, 500, 100),
...     min_obs_per_subunit=100
... )
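
The subunit options (dim_subunit, min_obs_per_subunit, max_obs_per_subunit) can be pictured as binning observations into a regular grid, dropping sparse bins, and subsampling dense ones. The function partition_into_subunits, the grid origin at zero, and the seeded random subsampling are assumptions made for this sketch rather than the library's partitioning code.

```python
import random
from collections import defaultdict


def partition_into_subunits(coords, dim_subunit=(500, 500, 100),
                            min_obs=100, max_obs=None, seed=0):
    """Return {grid_cell: [observation indices]} for retained subunits."""
    rng = random.Random(seed)
    bins = defaultdict(list)
    for idx, point in enumerate(coords):
        # Grid cell index along each axis (origin assumed at 0).
        cell = tuple(int(c // d) for c, d in zip(point, dim_subunit))
        bins[cell].append(idx)

    subunits = {}
    for cell, members in bins.items():
        if len(members) < min_obs:
            continue  # too few observations: subunit is filtered out
        if max_obs is not None and len(members) > max_obs:
            # Too many observations: keep a random subset.
            members = sorted(rng.sample(members, max_obs))
        subunits[cell] = members
    return subunits


coords = [(10, 10, 5)] * 3 + [(600, 10, 5)] * 2 + [(10, 700, 150)]
subunits = partition_into_subunits(coords, min_obs=2, max_obs=2)
```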
spatialgraph_embedding(n_dim, scale=True, variable_features=False, batch_correction=False, batch_key=None, confounder_key=None, n_neighbors=15, umap=True, obs_keys=None)[source]

Generate spatial graph embeddings from all samples.

Creates low-dimensional embeddings that capture spatial relationships between nuclei across all samples in the dataset. This can be used for sample-level comparisons and spatial pattern analysis.

Parameters:
  • n_dim (int) – Number of principal components to compute.

  • scale (bool, optional) – Whether to scale the data. Defaults to True.

  • variable_features (bool, optional) – Whether to select highly variable features. Defaults to False.

  • batch_correction (bool, optional) – Whether to apply batch correction using BBKNN. Defaults to False.

  • batch_key (str | None, optional) – Column name in adata.obs or sdata.tables to use for batch correction. Required if batch_correction=True. Defaults to None.

  • confounder_key (str | list[str] | None, optional) – Column name(s) to use as confounders in batch correction. Defaults to None.

  • n_neighbors (int, optional) – Number of neighbors for neighbor graph construction. Used in both bbknn.bbknn and sc.pp.neighbors. Defaults to 15.

  • umap (bool, optional) – Whether to compute UMAP embedding. Defaults to True.

  • obs_keys (str | list[str] | None, optional) – Column name(s) in sdata.tables[table_key].obs to carry into self.adata.obs as per-sample metadata (e.g. condition/treatment groups), so the UMAP can be colored by them. Each value is taken per sample via groupby(sample_key).first() and must be constant within a sample. Defaults to None.

Returns:

None

Raises:
  • ValueError – If self.adata is None or not set.

  • ValueError – If batch_correction=True but batch_key is None.

  • ValueError – If batch_key is not found in adata.obs or sdata.tables.

Return type:

None

Note

Results are stored in self.adata with:

  • .layers['raw']: Raw data before scaling

  • .obsm['X_pca']: PCA coordinates

  • .obsm['X_umap']: UMAP coordinates (if umap=True)

  • .obsp: the neighbor graph

Clustering (e.g. leiden) is left to the caller.

Example

>>> phenocoder.spatialgraph_embedding(
...     n_dim=50,
...     batch_correction=True,
...     batch_key='plate_id',
...     n_neighbors=20
... )
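
The obs_keys requirement that a value be constant within a sample can be checked before calling this method. The sketch below is a plain-Python analogue of taking the first value per sample, with an added consistency check; the function per_sample_metadata and the decision to raise ValueError on conflicts are assumptions for illustration, and the column names are invented.

```python
def per_sample_metadata(rows, sample_key, obs_key):
    """rows: list of dicts, one per observation. Returns {sample: value}."""
    first = {}
    for row in rows:
        sample, value = row[sample_key], row[obs_key]
        if sample not in first:
            first[sample] = value  # first value seen for this sample
        elif first[sample] != value:
            raise ValueError(
                f"{obs_key!r} is not constant within sample {sample!r}"
            )
    return first


rows = [
    {"well": "A1", "treatment": "control"},
    {"well": "A1", "treatment": "control"},
    {"well": "B2", "treatment": "drug"},
]
metadata = per_sample_metadata(rows, sample_key="well", obs_key="treatment")
```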