SDVDataSynthesizer¶
This class is a subclass of BaseDataSynthesizer that operates in the Synthetic Data Vault (SDV) environment. It provides four public methods to load, model, sample, and evaluate synthetic data, along with three private helper methods.
Configuration¶
Required Configuration¶
The SDV data synthesizer requires the following configuration:
local_dir: Location of a local directory to output files generated by this component.
Optional Configuration¶
The SDV data synthesizer has no optional configuration.
Default Configuration¶
The SDV data synthesizer uses the following default configuration values:
sdv_quality_report_name: The file name of the generated quality report. Defaults to sdv_quality_report.pkl.
sdv_diagnostic_report_name: The file name of the generated diagnostic report. Defaults to sdv_diagnostic_report.pkl.
synthesizer: The synthesizer class to use. Defaults to SingleTablePreset.
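As a minimal sketch of how the required and default settings fit together, the snippet below overlays a user configuration on the documented defaults. The helper `resolve_config` and the `./output` path are hypothetical and serve only to illustrate the keys; the component's own handling of defaults may differ.

```python
# Default values as documented above; the merge logic itself is an assumption.
DEFAULT_CONFIG = {
    "sdv_quality_report_name": "sdv_quality_report.pkl",
    "sdv_diagnostic_report_name": "sdv_diagnostic_report.pkl",
    "synthesizer": "SingleTablePreset",
}
REQUIRED_KEYS = ("local_dir",)


def resolve_config(user_config):
    """Return the defaults overlaid with user values (hypothetical helper)."""
    missing = [key for key in REQUIRED_KEYS if key not in user_config]
    if missing:
        raise ValueError(f"Missing required configuration: {missing}")
    # User-supplied values take precedence over the defaults.
    return {**DEFAULT_CONFIG, **user_config}


config = resolve_config({"local_dir": "./output"})  # example path only
```

A dictionary like `config` could then be passed as the `conf` argument shown in the usage example.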
Methods¶
load_data¶
This method loads tabular data from a given file. It returns a pandas DataFrame of the data together with the metadata of the loaded data.
def load_data(self, source_file_path, file_type="csv", *args, **kwargs)
Arguments:
source_file_path: (String). Path to the file containing the data to be loaded.
file_type: (String). Type of file containing the data. Default is "csv".
Returns:
- a pandas DataFrame of the loaded data
- metadata of the data
model_data¶
This method creates a synthesizer and fits it on the given data using the SDV environment. It returns the fitted synthesizer model.
def model_data(self, data, metadata, synthesizer_str=None, *args, **kwargs)
Arguments:
data: (object) pandas DataFrame of the input data.
metadata: (object) dictionary object representing metadata of the input data, which should be the output of the load_data method.
synthesizer_str: (str) Optional string representing the synthesizer to use. If not provided, a preset is used as default.
Returns:
- The synthesizer object of the model fit on the input data.
sample_data¶
This method samples data from an already fitted synthesizer object and returns a pandas DataFrame of synthesized data.
def sample_data(self, synthesizer, num_rows, *args, **kwargs)
Arguments:
synthesizer: synthesizer object obtained from the output of the model_data method.
num_rows: integer. Number of rows of data to generate.
Returns:
- A pandas DataFrame containing the generated synthetic data.
evaluate_data¶
This method returns reports on the quality of the generated synthetic data.
def evaluate_data(self, real_data, synthetic_data, metadata, synthesizer_str, *args, **kwargs)
Arguments:
real_data: pandas DataFrame of the original data.
synthetic_data: pandas DataFrame of the generated synthetic data.
metadata: dictionary object obtained from the output of the load_data method.
synthesizer_str: string representing the synthesizer. If not specified, the default synthesizer is used.
Returns:
- quality_report: report object describing the quality of the synthetic data
- diagnostic_report: report object containing diagnostic results for the synthetic data
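The `.pkl` defaults for the report names suggest that reports are stored as pickle files under `local_dir`. The sketch below shows one way such a report object could be written and read back, assuming plain `pickle` serialization. The helpers `save_report` and `load_report` are hypothetical, and a placeholder dictionary stands in for a real report object.

```python
import pickle
import tempfile
from pathlib import Path


def save_report(report, local_dir, file_name):
    """Pickle a report object into local_dir (hypothetical helper)."""
    out_dir = Path(local_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / file_name
    with path.open("wb") as handle:
        pickle.dump(report, handle)
    return path


def load_report(path):
    """Read a previously pickled report back into memory."""
    with Path(path).open("rb") as handle:
        return pickle.load(handle)


# Round trip using a temporary directory and a placeholder object.
placeholder_report = {"placeholder": True}
with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_report(placeholder_report, tmp_dir, "sdv_quality_report.pkl")
    restored_report = load_report(saved_path)
```

Only unpickle files from a trusted location, since loading a pickle can execute arbitrary code.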
_get_synthesizer_class¶
This method returns a class object as specified by the synthesizer string. Private method used internally by the class.
def _get_synthesizer_class(self, synthesizer)
Arguments:
synthesizer: string representing the synthesizer class to be returned.
Returns:
- corresponding synthesizer class
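A common way to resolve a class from a string is a lookup table keyed by class name. The sketch below illustrates that pattern with empty stand-in classes so that it runs without SDV installed. The registry, the function name, and the error handling are assumptions and not the component's actual implementation.

```python
class SingleTablePreset:
    """Stand-in for the SDV class of the same name."""


class GaussianCopulaSynthesizer:
    """Stand-in for the SDV class of the same name."""


# Map each class name to the class object itself.
SYNTHESIZER_CLASSES = {
    cls.__name__: cls for cls in (SingleTablePreset, GaussianCopulaSynthesizer)
}


def get_synthesizer_class(synthesizer):
    """Return the class registered under the given name (hypothetical helper)."""
    try:
        return SYNTHESIZER_CLASSES[synthesizer]
    except KeyError:
        raise ValueError(f"Unknown synthesizer: {synthesizer!r}") from None


synthesizer_class = get_synthesizer_class("SingleTablePreset")
```

The same lookup pattern would apply to the evaluator and diagnostic helpers, each with its own table.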
_get_evaluator_class¶
This method returns an evaluator class object as specified by the synthesizer string. Private method used internally by the class.
def _get_evaluator_class(self, synthesizer)
Arguments:
synthesizer: string representing the evaluator class to be returned.
Returns:
- corresponding evaluator class
_get_diagnostic_class¶
This method returns a diagnostic class object as specified by the synthesizer string. Private method used internally by the class.
def _get_diagnostic_class(self, synthesizer)
Arguments:
synthesizer: string representing the diagnostic class to be returned
Returns:
- corresponding diagnostic class
Usage Example¶
Below is an example of loading data, fitting a model, and sampling and evaluating synthetic data using SDVDataSynthesizer.
from lopop.component import SDVDataSynthesizer
config = {
    # insert component config here (local_dir is required)
}
sdv_synthesizer = SDVDataSynthesizer(conf=config)
# load data
DATA_PATH = "./data/adult_data.csv"
data, metadata = sdv_synthesizer.load_data(DATA_PATH)
# fit the model
synthesizer = sdv_synthesizer.model_data(data, metadata)
# Generate synthetic data
syn_data = sdv_synthesizer.sample_data(synthesizer, num_rows=10000)
# Evaluate synthetic data
reports = sdv_synthesizer.evaluate_data(data, syn_data, metadata)