Data generation

Patch extraction and the Keras data generators / dataset loaders used for training and encoding.

class phenocoder.generator.PatchGenerator(sdata, image_key, spatial_key, table_key, sample_key, scale, patch_size=(128, 128), metadata_keys=None, scale_percentile=1, scale_per_sample=True)[source]

Bases: object

Generator for image patches and image patch datasets from spatial data.

This class handles the extraction of image patches and statistics from spatial data objects, primarily for use in deep learning workflows.

init_patches()[source]

Initialize patch positions from spatial coordinates.

Extracts spatial coordinates from the data, filters positions that would result in patches extending beyond image boundaries, and assigns batch IDs.

Return type:

None
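The boundary filter described above can be sketched as follows. This is a minimal illustration, not the code PatchGenerator actually runs: the helper name `filter_in_bounds`, the integer (y, x) coordinate order, and the sequential ID assignment are all assumptions made for the example.

```python
import numpy as np

def filter_in_bounds(coords, image_shape, patch_size=(128, 128)):
    """Keep patch centers whose patch lies fully inside the image.

    Hypothetical helper: assumes coords are integer (y, x) centers.
    """
    coords = np.asarray(coords)
    ph, pw = patch_size
    height, width = image_shape
    # Top-left corner of the patch centered on each coordinate.
    top = coords[:, 0] - ph // 2
    left = coords[:, 1] - pw // 2
    keep = (top >= 0) & (top + ph <= height) & (left >= 0) & (left + pw <= width)
    return coords[keep], keep

coords = np.array([[10, 10], [100, 100], [250, 200], [64, 64], [192, 192]])
kept, mask = filter_in_bounds(coords, image_shape=(256, 256))
patch_ids = np.arange(len(kept))  # assumed: one sequential ID per kept position
```

Positions at (10, 10) and (250, 200) are dropped here because a 128 x 128 patch centered on them would cross the edge of a 256 x 256 image.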

extract_patch(img, id)[source]

Extract a patch from an image centered on specified coordinates.

Parameters:
  • img (ndarray) – Input image array

  • id (int) – Batch ID corresponding to patch position

Returns:

Extracted image patch

Return type:

ndarray
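As a rough sketch of centered patch extraction with optional percentile scaling, consider the function below. It is not the library's implementation: the name `extract_centered_patch`, the (channels, height, width) layout, passing the center directly instead of looking it up by ID, and clipping the scaled values to [0, 1] are assumptions for illustration.

```python
import numpy as np

def extract_centered_patch(img, center, patch_size=(128, 128), low=None, high=None):
    """Crop a patch around center=(y, x) from a (channels, height, width) array.

    Hypothetical helper. If per-channel low/high values are given, each
    channel is rescaled to [0, 1]; high must be greater than low.
    """
    cy, cx = center
    ph, pw = patch_size
    top, left = cy - ph // 2, cx - pw // 2
    patch = img[:, top:top + ph, left:left + pw].astype(np.float32)
    if low is not None and high is not None:
        low = np.asarray(low, dtype=np.float32)[:, None, None]
        high = np.asarray(high, dtype=np.float32)[:, None, None]
        patch = np.clip((patch - low) / (high - low), 0.0, 1.0)
    return patch

img = np.arange(2 * 256 * 256, dtype=np.float32).reshape(2, 256, 256)
patch = extract_centered_patch(img, center=(128, 128))
scaled = extract_centered_patch(
    img, center=(128, 128), low=[0, 0], high=[img.max(), img.max()]
)
```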

generate_image_stats(sample_id)[source]

Generate statistics for all patches in a sample.

Parameters:

sample_id (str or int) – Sample identifier for which to generate statistics

Return type:

None

select_patches(sample_id)[source]

Select all patches of a given sample.

Parameters:

sample_id (str or int) – Sample identifier for which to select patches

Returns:
  • df_patches_sample (pd.DataFrame) – DataFrame containing patch information

  • img (np.ndarray) – Image array

Return type:

tuple of (pd.DataFrame, np.ndarray)

write_patches(sample_id)[source]

Write all patches of a given sample to disk as numpy arrays.

Parameters:

sample_id (str or int) – Sample identifier for which to write patches

Return type:

None

get_patches(sample_id)[source]

Return all patches of a given sample.

Parameters:

sample_id (str or int) – Sample identifier for which to retrieve patches

Returns:
  • list of np.ndarray – List of patches as numpy arrays

  • pd.DataFrame – DataFrame containing patch information

Return type:

tuple of (list of np.ndarray, pd.DataFrame)

get_scaling_percentiles()[source]

Extract and set scaling percentiles from computed statistics.

Aggregates the per-slice percentile_low / percentile_high values in df_stats into a conservative range – minimum of lows (darkest) and maximum of highs (brightest) – used to normalize patches in extract_patch. The grouping depends on scale_per_sample:

  • scale_per_sample=True (default): aggregate per (sample, channel), so each sample is scaled to its own intensity range. Stored in sample_percentiles_low / sample_percentiles_high keyed by sample; select_patches activates the right one per sample.

  • scale_per_sample=False: aggregate per channel across all samples/slices (the original global behaviour). Stored directly in percentiles_low / percentiles_high.

Raises:

ValueError – If statistics have not been computed yet (df_stats is None or empty)

Return type:

None
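The two grouping modes can be illustrated with a small pandas sketch. The column names `sample` and `channel`, the function name `aggregate_percentiles`, and the toy values are assumptions; only `percentile_low`, `percentile_high`, and the min/max aggregation come from the description above.

```python
import pandas as pd

# Toy per-slice statistics; "sample" and "channel" are assumed column names.
df_stats = pd.DataFrame({
    "sample": ["A", "A", "A", "A", "B", "B"],
    "channel": [0, 0, 1, 1, 0, 1],
    "percentile_low": [5.0, 3.0, 10.0, 12.0, 8.0, 20.0],
    "percentile_high": [200.0, 220.0, 150.0, 140.0, 180.0, 300.0],
})

def aggregate_percentiles(df_stats, scale_per_sample=True):
    """Return the conservative range: minimum of lows, maximum of highs."""
    if df_stats is None or df_stats.empty:
        raise ValueError("Statistics have not been computed yet")
    keys = ["sample", "channel"] if scale_per_sample else ["channel"]
    return df_stats.groupby(keys).agg(
        percentile_low=("percentile_low", "min"),
        percentile_high=("percentile_high", "max"),
    )

per_sample = aggregate_percentiles(df_stats, scale_per_sample=True)
global_range = aggregate_percentiles(df_stats, scale_per_sample=False)
```

With per-sample scaling, sample A channel 0 gets the range (3.0, 220.0) while sample B channel 0 keeps (8.0, 180.0); with global scaling both share (3.0, 220.0).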

generate_dataset(dataset, dir_output, n_samples=None, n_patches=None)[source]

Generate complete dataset with patches and statistics.

Parameters:
  • dataset (str) – Name/identifier for the dataset being generated

  • dir_output (str) – Directory path for storing the generated dataset

  • n_samples (int, optional) – Number of samples to randomly select for processing. If None, processes all samples.

  • n_patches (int, optional) – Number of patches to randomly sample from all available patches. If None, uses all patches.

Return type:

None
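The optional random selection controlled by n_samples and n_patches might look like the sketch below. The helper `subsample`, its `seed` argument, and the choice to return everything when n exceeds the number of items are assumptions; the documentation above only states that None means "use all".

```python
import random

def subsample(items, n=None, seed=0):
    """Randomly pick n items without replacement; None keeps all items.

    Hypothetical helper: returning all items when n >= len(items) is an
    assumption, not documented behaviour.
    """
    items = list(items)
    if n is None or n >= len(items):
        return items
    return random.Random(seed).sample(items, n)

sample_ids = ["s1", "s2", "s3", "s4", "s5"]
chosen = subsample(sample_ids, n=3)
everything = subsample(sample_ids, n=None)
```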

class phenocoder.generator.SequenceGenerator(*args, **kwargs)[source]

Bases: Sequence

Keras Sequence generator for loading image patches from disk during training.

This generator loads patches from disk and applies optional data augmentation and normalization for training deep learning models.

Parameters:
  • ids (list)

  • batch_size (int)

  • dim (tuple)

  • n_channels (int)

  • shuffle (bool)

  • flip (bool)

  • conditions (np.ndarray | None)

  • return_conditions (bool)

on_epoch_end()[source]

Update indexes after each epoch.

Shuffles the order of patches if shuffle is enabled.
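The per-epoch reshuffle can be sketched with a dependency-free class that mimics the Keras Sequence protocol (`__len__`, `__getitem__`, `on_epoch_end`). `MinimalPatchSequence` is hypothetical: it does not subclass the Keras base class, returns IDs rather than loaded patches, and its `seed` argument is an assumption.

```python
import numpy as np

class MinimalPatchSequence:
    """Illustrative stand-in for a Keras Sequence that yields batches of IDs."""

    def __init__(self, ids, batch_size=64, shuffle=True, seed=0):
        self.ids = list(ids)
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.rng = np.random.default_rng(seed)
        self.on_epoch_end()  # build the initial index order

    def __len__(self):
        # Number of full batches per epoch.
        return len(self.ids) // self.batch_size

    def __getitem__(self, index):
        start = index * self.batch_size
        selected = self.indexes[start:start + self.batch_size]
        return [self.ids[i] for i in selected]

    def on_epoch_end(self):
        # Reset the index order, then shuffle it in place if requested.
        self.indexes = np.arange(len(self.ids))
        if self.shuffle:
            self.rng.shuffle(self.indexes)

ordered = MinimalPatchSequence(range(8), batch_size=4, shuffle=False)
shuffled = MinimalPatchSequence(range(8), batch_size=4, shuffle=True)
seen = sorted(shuffled[0] + shuffled[1])
```

Shuffling indexes rather than the ID list keeps the underlying data order stable, so every epoch visits each patch exactly once.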

class phenocoder.generator.DatasetLoader(datasets, dir_datasets, sample_key)[source]

Bases: object

Utility class for merging multiple datasets and their statistics.

This class combines statistics from multiple dataset directories and provides unified access to files and scaling parameters.

Parameters:
  • datasets (list)

  • dir_datasets (str)

  • sample_key (str)

load_datasets()[source]

Load and merge statistics from all specified datasets.

Combines stats.csv files from each dataset directory and creates unified dataframes with file paths.

Return type:

None
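Merging per-dataset statistics could be sketched as below. The directory layout `<dir_datasets>/<dataset>/stats.csv`, the added `dataset` column, and the helper name `merge_stats` are assumptions; file-path construction is omitted because the naming scheme is not documented here.

```python
import os
import tempfile

import pandas as pd

def merge_stats(datasets, dir_datasets):
    """Concatenate stats.csv from each dataset directory (assumed layout)."""
    frames = []
    for name in datasets:
        stats = pd.read_csv(os.path.join(dir_datasets, name, "stats.csv"))
        stats["dataset"] = name  # remember which dataset each row came from
        frames.append(stats)
    return pd.concat(frames, ignore_index=True)

# Demo on throwaway files so the sketch runs end to end.
with tempfile.TemporaryDirectory() as root:
    for name, n_rows in [("d1", 2), ("d2", 3)]:
        os.makedirs(os.path.join(root, name))
        pd.DataFrame({"channel": range(n_rows)}).to_csv(
            os.path.join(root, name, "stats.csv"), index=False
        )
    merged = merge_stats(["d1", "d2"], root)
```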

set_train_val_split(batch_size=64, split=0.8)[source]

Assign each patch to a train or validation split.

Splits are made at the sample level (grouped by sample_key and dataset) so all patches of a sample land in the same split, then each split is truncated to a whole number of batches. Adds split and file_path columns to self.patches.

Parameters:
  • batch_size (int) – Batch size used to drop the remainder so each split is batch-aligned. Defaults to 64.

  • split (float) – Fraction of samples assigned to the training split. Defaults to 0.8.

Return type:

None
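A sample-level split followed by batch alignment might be sketched as follows. This is an illustration under assumptions, not the method's actual logic: the function `assign_split`, the random permutation of samples, the `seed` argument, and dropping the trailing rows of each split are all hypothetical, and the file_path column is left out.

```python
import numpy as np
import pandas as pd

def assign_split(patches, sample_key, batch_size=64, split=0.8, seed=0):
    """Assign whole samples to train/val, then truncate to full batches."""
    rng = np.random.default_rng(seed)
    # One row per (dataset, sample) group so a sample never straddles splits.
    groups = patches[["dataset", sample_key]].drop_duplicates().reset_index(drop=True)
    order = rng.permutation(len(groups))
    n_train = int(round(split * len(groups)))
    groups["split"] = "val"
    groups.loc[order[:n_train], "split"] = "train"
    labelled = patches.merge(groups, on=["dataset", sample_key], how="left")
    # Drop the remainder so each split holds a whole number of batches.
    parts = []
    for _, part in labelled.groupby("split", sort=False):
        n_keep = (len(part) // batch_size) * batch_size
        parts.append(part.iloc[:n_keep])
    return pd.concat(parts, ignore_index=True)

patches = pd.DataFrame({
    "dataset": "d1",
    "sample_id": np.repeat(["s1", "s2", "s3", "s4", "s5"], 10),
})
result = assign_split(patches, sample_key="sample_id", batch_size=8, split=0.8)
counts = result["split"].value_counts().to_dict()
```

With five samples of ten patches each, four samples (40 patches, five full batches of 8) go to training and one sample is cut from 10 to 8 patches for validation.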

get_generators(conditions, batch_size=64, dim=(128, 128), n_channels=4, shuffle=True, flip=False, n_workers=1)[source]

Build the training and validation Keras Sequence generators.

Requires set_train_val_split to have been called (patches must have split and file_path columns).

Parameters:
  • conditions (list of str) – obs/patch columns to one-hot encode and feed as conditions. If empty, plain (non-conditional) generators are returned.

  • batch_size (int) – Number of patches per batch. Defaults to 64.

  • dim (tuple) – Spatial (height, width) of patches. Defaults to (128, 128).

  • n_channels (int) – Number of image channels. Defaults to 4.

  • shuffle (bool) – Whether to shuffle patch order each epoch. Defaults to True.

  • n_workers (int) – Number of worker processes for the Keras Sequence. Defaults to 1.

  • flip (bool)

Returns:

(train_generator, val_generator, one_hot_encoder) if conditions is non-empty, otherwise (train_generator, val_generator)

Return type:

tuple
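To show what one-hot encoding the condition columns produces, the sketch below uses `pandas.get_dummies` in place of whatever encoder object the method returns. The helper `one_hot_conditions`, the column names `batch` and `tissue`, and returning `(None, [])` for an empty list are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def one_hot_conditions(patches, conditions):
    """One-hot encode the given columns into a float32 matrix.

    Hypothetical helper; returns (None, []) when no conditions are requested.
    """
    if not conditions:
        return None, []
    dummies = pd.get_dummies(patches[conditions], columns=conditions)
    return dummies.to_numpy(dtype=np.float32), list(dummies.columns)

patches = pd.DataFrame({
    "batch": ["A", "B", "A"],
    "tissue": ["x", "x", "y"],
})
matrix, names = one_hot_conditions(patches, ["batch", "tissue"])
no_matrix, no_names = one_hot_conditions(patches, [])
```

Each row of the matrix has exactly one active entry per condition column, so a conditional generator could pass the row alongside its patch.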