_data_loader.py module

class _data_loader.FFPE_dataset(configs, learning_type, parent)

Bases: Dataset

FFPE_dataset: start of data loader

MCAR(arguments)

Simulates Missing Completely At Random (MCAR) by zeroing a fraction of elements in each tensor of the lists self.Xs and self.As_orig. The fraction of dropout assigned to each gene is a random variable between 0 and the given lambda value.

Parameters:

arguments (dict): A dictionary containing the following keys:

'lambda_counts': The maximum dropout rate for any gene in the expression matrix.
'lambda_edges': The maximum dropout rate for the adjacency matrix.

Returns:

tuple: A tuple containing the following elements:

List[torch.Tensor]: A list of new tensors with zeroed elements in the expression matrix.
List[torch.Tensor]: A list of tensors indicating the indices of zeroed locations within the count tensor.
List[torch.Tensor]: A list of new tensors with zeroed elements in the edge matrix.
List[torch.Tensor]: A list of tensors indicating the indices of zeroed locations within the edge tensor.
List[torch.Tensor]: A list of tensors indicating the indices of non-zeros in masked, padded adjacencies.
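
The per-gene dropout described above can be illustrated with a minimal sketch. It uses NumPy arrays rather than the module's torch tensors, and the name mcar_mask and its signature are hypothetical, not part of _data_loader.py. Each gene (column) draws its own dropout rate uniformly between 0 and lam, and entries are zeroed independently at that rate.

```python
import numpy as np

def mcar_mask(x, lam, rng):
    """Zero a random fraction of each column (gene) of x.

    Each gene gets its own dropout rate drawn uniformly from [0, lam].
    Returns a masked copy and a boolean array marking the zeroed entries.
    """
    n_samples, n_genes = x.shape
    rates = rng.uniform(0.0, lam, size=n_genes)          # one rate per gene
    dropped = rng.random((n_samples, n_genes)) < rates   # broadcast across rows
    masked = np.where(dropped, 0.0, x)                   # x itself is untouched
    return masked, dropped

rng = np.random.default_rng(0)
x = np.ones((200, 5))
masked, dropped = mcar_mask(x, lam=0.3, rng=rng)
```

The boolean array plays the role of the "indices of zeroed locations" in the return value; how the module itself encodes those indices is not specified here.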

check_fidelity_of_anndata()

This function checks each input AnnData object for the following:

A valid expression matrix (AnnData 'X' matrix).
Gene IDs (AnnData 'var' must exist with column 'gene').
Sample IDs (AnnData 'obs' table must exist; expects 'obs_names' to be sample IDs).
Presence of an Aj correlation matrix. If it is not found, configs is modified so that adjacency information is not considered in downstream computations.
A "sample_association" must exist in the obsm of each AnnData, indicating which samples are associated with each other (tau_1, tau_2, etc.). Consequently, the sample names between the AnnData tables (N=calT) are expected to be unique. If a sample is not associated with a certain tissue, the cell for that unassociated tissue should read NA or NaN.

Args: adatas (List[AnnData]): A list of original AnnData objects loaded from files.
Returns: bool: True if all checks pass.
Raises: None

clamp_tensor_values(tensor_list)

Clamps values of tensors in a list based on pre-defined thresholds:

Values greater than 10^6 are clamped to 10^6.
Values less than 10^(-10) are set to zero.
Tensors are also converted to float.

Parameters: tensor_list (List[torch.Tensor]): PyTorch tensors to be clamped.
Returns: torch.Tensor: A new tensor obtained by stacking the clamped tensors along a new dimension (dim=0).
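
As a rough illustration of the clamp-then-stack behaviour, the sketch below uses NumPy instead of torch. The function name clamp_and_stack and the constants UPPER and TINY are hypothetical; the threshold values are taken from the description above. All inputs are assumed to share one shape, since stacking requires it.

```python
import numpy as np

UPPER = 1e6   # upper bound, taken from the description above
TINY = 1e-10  # values below this are set to zero

def clamp_and_stack(arrays):
    """Clamp each array, cast to float32, and stack along a new axis 0."""
    clamped = []
    for a in arrays:
        a = np.asarray(a, dtype=np.float32)
        a = np.minimum(a, UPPER)          # cap large values
        a = np.where(a < TINY, 0.0, a)    # zero tiny (and negative) values
        clamped.append(a)
    return np.stack(clamped, axis=0).astype(np.float32)

stacked = clamp_and_stack([[5e6, 1e-12, 3.0], [2.0, 0.5, 1e7]])
```

In torch, the analogous calls would be torch.clamp and torch.stack.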

filter_NaNs_from_COO(Ms)

Filters NaN values from the data arrays of COO (Coordinate) sparse matrices.

This function takes lists of COO matrices (Ms), filters out the NaN values from their data arrays, and returns new lists of filtered COO matrices.

Args: Ms (list): List of COO matrices that need to be filtered
Returns: filtered_Ms (list): List of filtered COO matrices
Notes: This function assumes that the input matrices are in COO format
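
A COO matrix is stored as three parallel arrays (row indices, column indices, values), so dropping NaN entries amounts to applying one boolean mask to all three. The sketch below works on plain NumPy arrays; filter_nan_triplets is a hypothetical helper, not the module's function. With scipy.sparse.coo_matrix, the same arrays are available as the .row, .col, and .data attributes.

```python
import numpy as np

def filter_nan_triplets(row, col, data):
    """Drop every (row, col, value) triplet whose value is NaN."""
    keep = ~np.isnan(data)
    return row[keep], col[keep], data[keep]

row = np.array([0, 1, 2, 2])
col = np.array([1, 0, 2, 0])
data = np.array([0.5, np.nan, 0.9, np.nan])
f_row, f_col, f_data = filter_nan_triplets(row, col, data)
```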

from_anndata_2_numpy(adata)

Extracts information from the AnnData objects and creates several instance attributes for this dataset.

Args: anndata_orig: raw anndata structures read in from file. They have been checked for their fidelity (no information is missing)
Returns: A series of self variables are created in the dataset object
Raises: Not sure yet if there is anything to raise here.

from_numpy_2_tensors()

Converts multiple attributes from NumPy arrays to PyTorch tensors. It also clamps raw Rs, filters NaNs from As and Ss, and pads the adjacency matrices As and Ss.

Attributes Transformed:

Rs: Expression matrices are clamped (max 100000) and converted.
Xs: Same as Rs for zeroed expression matrices.
As: NaNs in adjacency matrices are filtered.
Ks: Converted to tensor after concatenating dataframes.
ks: Converted to tensor.
As_ej_index: Indices of non-zeros in padded adjacencies.

Note:

Assumes that self.Rs, As, and Ss are NumPy arrays or similar.
Assumes that self.Ks contains Pandas DataFrames.
Rs and Xs are log-transformed.

Returns: None. This function modifies the attributes in place.

get_ghost_indices(tissue)

Retrieves indices of "ghost" samples in a particular tissue sample set.

Parameters:

tissue (int): An integer representing the index of the tissue in the self.calT range.

Returns:

indices (list[int]): List of indices in self.anndatas[tissue].obs.index.value that correspond to the ghost samples in self.sample_tissue_map.

Raises:

AssertionError: If the input tissue is not within the range defined by self.calT.
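
The lookup can be pictured with the following sketch. The function ghost_indices and its arguments are hypothetical: it assumes the ghost sample names are already known as a collection, whereas the module derives them from self.sample_tissue_map.

```python
def ghost_indices(obs_names, ghost_names, tissue, n_tissues):
    """Return positions in obs_names that belong to ghost samples."""
    # Mirrors the documented AssertionError for an out-of-range tissue index.
    assert 0 <= tissue < n_tissues, "tissue index out of range"
    ghosts = set(ghost_names)
    return [i for i, name in enumerate(obs_names) if name in ghosts]

obs_names = ["s1", "ghost_a", "s2", "ghost_b"]
found = ghost_indices(obs_names, ["ghost_a", "ghost_b"], tissue=0, n_tissues=2)
```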

harmonize_samples(anndata)

This function reads "sample_association" in obsm and uses it to place associated samples on the same rows of X.

"sample_association" is a table of calT columns ("tau_1", etc.) that indicates which samples are related across tissues.
The "sample_association" tables are combined and used to order X, obs, and the obsm table "sample_sample_adj" (var/varm are unaffected).
If a sample doesn't have an associated sample in one or more tissues, the columns for those tissues in this table should be 'NaN'.
AnnData 'obs_names' are expected to be unique across all AnnData tables provided.

Returns:

List of AnnData objects of length calT, restructured so that associated samples occupy the same rows. If a sample has no association in a tissue, a "ghost" entry is added:

NA/NaN are converted to a unique name.
A new row with this name is added to 'obs'; 'NaN' added to each entry (for easy identification of ghost samples).
A new row with this name is added to 'X', where expression of each gene is zero.

The AnnData tables X, obs, and the required obsm sparse matrix "sample_sample_adj" are then re-arranged to align associated samples to the same row. It is expected that each sample has a maximum of one associated sample per calT (e.g. a sample can only have one association to a sample in a different tissue).
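
The row alignment can be sketched with plain Python lists. Here the association table is a list of rows with one sample name per tissue and None standing in for NA/NaN. The function name harmonize and the ghost naming scheme are assumptions made for illustration only.

```python
def harmonize(association):
    """Return one ordered list of sample names per tissue.

    association: list of rows, one entry per tissue; None marks a
    tissue in which the sample has no associated counterpart.
    """
    n_tissues = len(association[0])
    ordered = [[] for _ in range(n_tissues)]
    for row_idx, row in enumerate(association):
        for t, name in enumerate(row):
            if name is None:
                # Hypothetical unique placeholder for a ghost sample.
                name = f"ghost_{row_idx}_tau_{t + 1}"
            ordered[t].append(name)
    return ordered

association = [("s1", "t1"), ("s2", None), (None, "t3")]
ordered = harmonize(association)
```

After this step, position i in every tissue's list refers to the same association row, so X, obs, and "sample_sample_adj" can be reindexed with those lists.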

Raises:

None due to filtering step in 'check_fidelity_of_anndata'.

load_anndata()

Load AnnData files (h5ad) from a specified directory.

This method loads AnnData files from a directory specified in the configs dictionary. The expected file naming structure is *tau_x.h5ad, where x is an integer determining the order of tissues. The loaded AnnData objects are sorted based on the tau value.

The method handles different learning types ('train', 'validation', 'test', 'inference', 'impute_experiment') and adjusts the file loading path accordingly. It also checks for the presence of adjacency matrices in the AnnData objects and defaults to a simple model if they are missing.

Returns: A list of AnnData objects, sorted by the tau value.
Return type: list of anndata.AnnData

Note

The method expects the AnnData files to follow the *tau_x.h5ad naming convention.
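
A sketch of the filename ordering implied by the *tau_x.h5ad convention follows. The helper sort_by_tau and the example filenames are hypothetical. Sorting on the integer value matters because a plain string sort would place tau_10 before tau_2.

```python
import re

TAU_PATTERN = re.compile(r"tau_(\d+)\.h5ad$")

def sort_by_tau(filenames):
    """Keep files matching *tau_x.h5ad and sort them by the integer x."""
    matched = []
    for name in filenames:
        m = TAU_PATTERN.search(name)
        if m:
            matched.append((int(m.group(1)), name))
    return [name for _, name in sorted(matched)]

files = ["liver_tau_10.h5ad", "blood_tau_2.h5ad", "notes.txt", "brain_tau_1.h5ad"]
ordered_files = sort_by_tau(files)
```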

pad_adjacency_matrices(adj_mat, final_shape)

Pads input adjacency matrix with 0s to achieve specified shape.

Parameters:

adj_mat (torch.Tensor): 2D tensor adjacency matrix to be padded.
final_shape (int): The final shape that the adjacency matrix should have after padding.

Returns:

adj_mat_new (torch.Tensor): The padded adjacency matrix with shape as specified by final_shape.

Notes:

Padding is added to the right and bottom of the matrix.
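
Right-and-bottom zero padding can be sketched with NumPy's np.pad. The function pad_adjacency is hypothetical, and the sketch assumes final_shape is a single integer giving the side length of a square output. In torch, torch.nn.functional.pad serves the same purpose.

```python
import numpy as np

def pad_adjacency(adj, final_size):
    """Zero-pad a 2D matrix on the bottom and right to final_size x final_size."""
    rows, cols = adj.shape
    assert rows <= final_size and cols <= final_size, "matrix larger than target"
    pad_widths = ((0, final_size - rows), (0, final_size - cols))
    return np.pad(adj, pad_widths, mode="constant", constant_values=0)

adj = np.array([[1.0, 0.2], [0.2, 1.0]])
padded = pad_adjacency(adj, 4)
```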

prep_batch_iterator(trivial_batch=False)

Splits data into multiple batches of data tensors for model training (divided by sample).

Parameters:

configs (dict): A dictionary with the key 'mini_batch_size', specifying the number of samples/elements in each sample mini-batch.

Returns:

final_X_batches, final_R_batches, final_K_batches: Lists of zeroed expression (Xs), original expression (Rs), and sample information (Ks, from obs), randomly split into multiple "mini-batches" by sample.
final_idx_batches: Indices of which samples are in which minibatch.

Notes:

The original tensors in self.Xs, self.Rs, and self.Ks are not modified.
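
The random split by sample can be sketched as shuffling the sample indices once and cutting them into chunks of 'mini_batch_size'. The helper make_index_batches and its seed argument are hypothetical. Applying the same index batches to Xs, Rs, and Ks keeps the three views of each sample aligned.

```python
import random

def make_index_batches(n_samples, mini_batch_size, seed=None):
    """Shuffle sample indices and split them into consecutive mini-batches."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [
        indices[start:start + mini_batch_size]
        for start in range(0, n_samples, mini_batch_size)
    ]

idx_batches = make_index_batches(10, 4, seed=0)
```

The last batch may be smaller than the others when the sample count is not a multiple of the batch size.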

select_genes(anndatas)

Perform gene filtering steps, if desired.

Limit to the most highly variable genes, based on the 'select_genes' parameter.
Remove the most highly expressed genes by trimming, if 'trim_high_expressed_genes' is true.
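
The first filtering step can be pictured as ranking genes by variance across samples and keeping the top k. The sketch below assumes 'select_genes' is the number of genes to keep and that plain variance is the ranking criterion; neither is confirmed by the documentation, and top_variable_genes is a hypothetical name. The trimming step is not sketched because its rule is not described.

```python
import numpy as np

def top_variable_genes(x, k):
    """Return the column indices of the k genes with the largest variance."""
    variances = x.var(axis=0)
    top = np.argsort(variances)[::-1][:k]   # indices sorted by decreasing variance
    return np.sort(top)                     # restore original gene order

x = np.array([[0.0, 0.0, 1.0],
              [0.0, 10.0, 1.0],
              [0.0, 20.0, 3.0]])
kept = top_variable_genes(x, 2)
```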

select_samples(anndata)

Randomly selects samples from the AnnData tables based on the value set by configs['select_samples']. If configs['select_samples'] is float('inf') or <= 0, all samples are kept.

Args:

anndata (list): A list of AnnData objects (X sorted to match each other after harmonize_samples()).

Returns:

anndatas (list): Updated list of anndata files.
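
The selection rule can be sketched as follows. The helper choose_sample_indices is hypothetical, and treating a request for more samples than exist as "keep all" is an added assumption. Because rows are aligned across tissues after harmonize_samples(), the same indices would be applied to every AnnData in the list.

```python
import math
import random

def choose_sample_indices(n_samples, select_samples, seed=None):
    """Pick row indices to keep; inf or <= 0 means keep every sample."""
    if math.isinf(select_samples) or select_samples <= 0:
        return list(range(n_samples))
    if select_samples >= n_samples:  # assumption: cannot select more than exist
        return list(range(n_samples))
    picked = random.Random(seed).sample(range(n_samples), int(select_samples))
    return sorted(picked)

all_rows = choose_sample_indices(5, float("inf"))
some_rows = choose_sample_indices(5, 3, seed=1)
```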

setup_categorical_variables()

Configures the categorical variables for each tissue and sample in the dataset.

Uses:

self.configs['vars_to_correct'] (list[str]): List of column names in sample_metadata DataFrames that need to be one-hot encoded.
self.configs['adjust_target_batch'] (list of tuples): List of tuples where the first element is the column name to adjust for and the second element is a target value for the adjustment. This is used during inference (only) to adjust the expression levels to a specific target batch.

Returns:

None: This function modifies the instance variables in place.

Attributes Modified:

self.Ks: List of one-hot encoded DataFrames for each sample's correction variables.
self.ks: List of integers representing the number of one-hot encoded columns for each sample's correction variables.

Raises:

AssertionError: If the column names in 'correction_vars' do not match the column names in 'self.sample_metadata'.
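
One-hot encoding of a single metadata column can be sketched in plain Python as below. The function one_hot is a hypothetical stand-in; the module works on DataFrames, where pandas.get_dummies would be the usual tool. The number of categories corresponds to the per-variable column count of the kind stored in self.ks.

```python
def one_hot(values):
    """One-hot encode a list of categorical values.

    Returns the sorted category labels and one 0/1 row per input value.
    """
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

batch_column = ["a", "b", "a", "c"]  # illustrative values for one column
categories, encoded = one_hot(batch_column)
n_columns = len(categories)
```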