BaseDataSynthesizer

Overview¶

A data_synthesizer is a component that is able to synthesizer a dataset given a sample dataset. The intention it to allow building of datasets of arbitrarily large size given a representative sample.

Attributes¶

BaseDataSynthesizer contains no default attributes.

Configuration¶

BaseDataSynthesizer contains no default or required configuration.

Interface¶

The following methods are part of BaseDataSynthesizer and should be implemented in any class that inherits from this base class:

spit_data¶

Performs a data split on the given data.

def load_data(self, source_file, *args, **kwargs) -> tupe[Any, Any]

Arguments:

source_file (object): File location of the source data to use as a sample for synthesis.

Returns:

data (object): An object representing the loaded data. Very like a pandas.DataFrame.
metadata (Any): A python object containing metadata about the loaded data.

model_data¶

Creates a model for synthesizing data given a sample.

def model_data(self, data, *args, **kwargs) -> Any

Arguments:

data (object): Data to model.

Returns:

model (Any): A model that can synthesize new data.

sample_data¶

Generates new sample data given a synthetic model.

def sample_data(self, model, num_rows, *args, **kwargs) -> Any

Arguments:

model (object): The synthetic model to use.
num_rows (int): The number of rows to generate.

Returns:

data (object): The generated data. Most likely a pandas.DataFrame, or similar.

evaluate_data¶

Evaluates synthetic data given a sample of real data.

def evaluate_data(self, real_data, synthetic_data, *args, **kwargs) -> list[Any]:

Arguments:

real_data (object): Real data from the original dataset.
synthetic_data (object): A sample of synthetic data, likely generated from sample_data.

Returns:

list (object): A list of reports generated to evaluate the model.