SDVDataSynthesizer

This class is a subclass of BaseDataSynthesizer that operates in the Synthetic Data Vault (SDV) environment. It provides public methods to load, model, sample, and evaluate synthetic data, along with private helper methods used internally.

Configuration

Required Configuration

The SDV data synthesizer requires the following configuration:

  • local_dir: Location of a local directory to output files generated by this component.

Optional Configuration

The SDV data synthesizer has no optional configuration.

Default Configuration

The SDV data synthesizer uses the following default configuration:

  • sdv_quality_report_name: The file name of the generated quality report. Defaults to sdv_quality_report.pkl.
  • sdv_diagnostic_report_name: The file name of the generated diagnostic report. Defaults to sdv_diagnostic_report.pkl.
  • synthesizer: The synthesizer class to use. Defaults to SingleTablePreset.
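
As a rough illustration of how the required and default settings above could combine, the sketch below merges a user-supplied dictionary over the documented defaults. The helper name resolve_config and the ./sdv_output path are hypothetical and not part of the component; how SDVDataSynthesizer actually applies its defaults is not shown in this document.

```python
# Default values as listed in the "Default Configuration" section.
DEFAULTS = {
    "sdv_quality_report_name": "sdv_quality_report.pkl",
    "sdv_diagnostic_report_name": "sdv_diagnostic_report.pkl",
    "synthesizer": "SingleTablePreset",
}

# Keys listed in the "Required Configuration" section.
REQUIRED = ("local_dir",)


def resolve_config(user_config):
    """Hypothetical helper: check required keys, then overlay user values on defaults."""
    missing = [key for key in REQUIRED if key not in user_config]
    if missing:
        raise KeyError(f"Missing required configuration keys: {missing}")
    # User-supplied values take precedence over defaults.
    return {**DEFAULTS, **user_config}


# "./sdv_output" is an arbitrary example path.
config = resolve_config({"local_dir": "./sdv_output"})
```

With only local_dir supplied, the resulting dictionary carries the three default values unchanged.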

Methods

load_data

This method loads tabular data from a given file. It returns a pandas DataFrame of the data together with metadata describing the loaded data.

def load_data(self, source_file_path, file_type="csv", *args, **kwargs)

Arguments:

  • source_file_path: (String). Path to the file containing the data to be loaded
  • file_type: (String). Type of file containing the data. Default is "csv".

Returns:

  • a pandas DataFrame of the loaded data
  • metadata describing the loaded data
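
To make the shape of the (data, metadata) pair more concrete, here is a minimal standard-library sketch of a loader that dispatches on file_type and derives simple column metadata. It is not the component's implementation: load_table and _is_number are hypothetical names, a list of dictionaries stands in for the pandas DataFrame, and the {"columns": {name: {"sdtype": ...}}} layout is only an assumption about what the metadata dictionary may look like.

```python
import csv
import io


def _is_number(text):
    """Return True if the cell text parses as a float."""
    try:
        float(text)
    except (TypeError, ValueError):
        return False
    return True


def load_table(source, file_type="csv"):
    """Hypothetical stand-in for load_data: returns (rows, metadata)."""
    if file_type != "csv":
        raise ValueError(f"Unsupported file_type: {file_type!r}")
    rows = list(csv.DictReader(source))
    columns = {}
    for name in (rows[0].keys() if rows else []):
        values = [row[name] for row in rows]
        # A column is "numerical" only if every value parses as a number.
        is_numeric = all(_is_number(value) for value in values)
        columns[name] = {"sdtype": "numerical" if is_numeric else "categorical"}
    return rows, {"columns": columns}


# Tiny in-memory table used purely for illustration.
sample = io.StringIO("age,workclass\n39,State-gov\n50,Private\n")
rows, metadata = load_table(sample)
```

The point of the sketch is the return contract: the second element describes each column of the first, which is why later methods take both values together.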

model_data

This method creates a synthesizer in the SDV environment and fits it on the given data. It returns the fitted synthesizer model.

def model_data(self, data, metadata, synthesizer_str=None, *args, **kwargs)

Arguments:

  • data: (object) pandas DataFrame of the input data.
  • metadata: (object) dictionary object representing metadata of the input data, as returned by the load_data method.
  • synthesizer_str: (str) Optional string naming the synthesizer to use. If not provided, a preset is used as the default.

Returns:

  • The synthesizer object of the model fit on the input data.

sample_data

This method samples synthetic data from an already fitted synthesizer object and returns it as a pandas DataFrame.

def sample_data(self, synthesizer, num_rows, *args, **kwargs)

Arguments:

  • synthesizer: synthesizer object returned by the model_data method.
  • num_rows: (int) Number of rows of data to generate.

Returns:

  • A pandas DataFrame containing the generated synthetic data.

evaluate_data

This method returns quality and diagnostic reports on the generated synthetic data.

def evaluate_data(self, real_data, synthetic_data, metadata, synthesizer_str, *args, **kwargs)

Arguments:

  • real_data: pandas DataFrame of the original data.
  • synthetic_data: pandas DataFrame of the generated synthetic data.
  • metadata: dictionary object returned by the load_data method.
  • synthesizer_str: string representing the synthesizer. If not specified, the default synthesizer is used.

Returns:

  • quality_report: report object describing the quality of the synthetic data
  • diagnostic_report: report object containing diagnostic results for the synthetic data
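
The default report names end in .pkl, which suggests the reports are serialized under local_dir. The sketch below shows one way such a round trip could work with Python's pickle module. This is an assumption: the document does not say whether the component uses pickle directly or a save method on the report objects. The functions save_report and load_report are hypothetical, and a plain dictionary stands in for a real report object.

```python
import pickle
import tempfile
from pathlib import Path


def save_report(report, local_dir, file_name):
    """Hypothetical helper: pickle a report object into local_dir."""
    out_dir = Path(local_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / file_name
    with path.open("wb") as handle:
        pickle.dump(report, handle)
    return path


def load_report(path):
    """Hypothetical helper: read a pickled report back from disk."""
    with Path(path).open("rb") as handle:
        return pickle.load(handle)


# Placeholder object; a real quality report would come from evaluate_data.
stand_in = {"kind": "quality", "note": "placeholder, not a real SDV report"}

with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_report(stand_in, tmp_dir, "sdv_quality_report.pkl")
    saved_name = saved_path.name
    restored = load_report(saved_path)
```

Only unpickle files you produced yourself, since pickle can execute arbitrary code while loading.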

_get_synthesizer_class

This method returns a class object as specified by the synthesizer string. Private method used internally by the class.

def _get_synthesizer_class(self, synthesizer)

Arguments:

  • synthesizer: string representing the synthesizer class to be returned.

Returns:

  • corresponding synthesizer class
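
Resolving a class from a string is commonly done with a lookup table. The sketch below illustrates that pattern under stated assumptions: the two classes are empty local placeholders (not the SDV classes of the same names), and get_synthesizer_class with its SYNTHESIZERS table is a hypothetical stand-in for the private method, whose actual implementation is not shown in this document.

```python
class SingleTablePreset:
    """Local placeholder, not the SDV class."""


class GaussianCopulaSynthesizer:
    """Local placeholder, not the SDV class."""


# Map each class name to the class object itself.
SYNTHESIZERS = {
    cls.__name__: cls
    for cls in (SingleTablePreset, GaussianCopulaSynthesizer)
}


def get_synthesizer_class(name):
    """Return the class registered under name, or raise ValueError."""
    try:
        return SYNTHESIZERS[name]
    except KeyError:
        known = sorted(SYNTHESIZERS)
        raise ValueError(
            f"Unknown synthesizer {name!r}; expected one of {known}"
        ) from None


synthesizer_cls = get_synthesizer_class("SingleTablePreset")
instance = synthesizer_cls()  # the caller instantiates the returned class
```

Returning the class rather than an instance lets the caller decide which constructor arguments to pass, which fits a method that only takes the synthesizer string.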

_get_evaluator_class

This method returns an evaluator class object as specified by the synthesizer string. Private method used internally by the class.

def _get_evaluator_class(self, synthesizer)

Arguments:

  • synthesizer: string representing the evaluator class to be returned.

Returns:

  • corresponding evaluator class

_get_diagnostic_class

This method returns a diagnostic class object as specified by the synthesizer string. Private method used internally by the class.

def _get_diagnostic_class(self, synthesizer)

Arguments:

  • synthesizer: string representing the diagnostic class to be returned

Returns:

  • corresponding diagnostic class

Usage Example

Below is an example of loading data, fitting a model, sampling synthetic data, and evaluating it using SDVDataSynthesizer.

from lopop.component import SDVDataSynthesizer

config = {
    # insert component config here (local_dir is required)
}

sdv_synthesizer = SDVDataSynthesizer(conf=config)

# load data
DATA_PATH = "./data/adult_data.csv"
data, metadata = sdv_synthesizer.load_data(DATA_PATH)

# fit the model
synthesizer = sdv_synthesizer.model_data(data, metadata)

# Generate synthetic data
syn_data = sdv_synthesizer.sample_data(synthesizer, num_rows=10000)

# Evaluate synthetic data
reports = sdv_synthesizer.evaluate_data(data, syn_data, metadata)