BaseProcess
Overview¶
The process
pipeline is a pipeline that performs common data processing activities, such as making data transformations, versioning data, performing data checks and data drift analysis, etc.
Attributes¶
BaseProcess
contains the following attributes:
datasets_used
(list): A list of datasets used in the workflow. The intention here is to gather relevant datasets used in model construction for purposes of generating lineage downstream.
Configuration¶
BaseProcess
contains no default or required configuration.
Interface¶
The following methods are part of BaseProcess
and should be implemented in any class that inherits from this base class:
transform_data¶
Executes a data transformation
def transform_data(self, source, *args, **kwargs) -> Any
Arguments:
source
(str): The source identifier of the input data (or possibly transform job). This should be something like a file path or a table name, etc.
Returns:
data
(object): The transformed data
track_data¶
Versions a dataset.
def track_data(self, data, id, *args, **kwargs) -> Any
Arguments:
data
(object): The dataset to version.id
(object): The dataset id.
Returns:
dataset_version
(object): The dataset_version corresponding to the versioned data.
profile_data¶
Creates an EDA-style data profile.
def profile_data(self, data, dataset_version, *args, **kwargs)
Arguments:
data
(object): The data to profile.dataset_version
(object): The dataset version to save the profile in.
Returns:
Nothing
check data¶
Runs data checks on the dataset to detect issues, such as large number of nulls, class imbalance, etc. Checks will vary based on problem type.
def check_data(self, data, dataset_version, *args, **kwargs)
Arguments:
data
(object): The data to check.dataset_version
(object): The dataset version to save the data check report in.
Returns:
Nothing
compare_data¶
Runs a data drift analysis and generates a drift report.
def compare_data(self, data, dataset_version, *args, **kwargs)
Arguments:
data
(object): The data to analyze.dataset_version
(object): The dataset version to save the data drift report in.
Returns:
Nothing