Time Series Forecasting
This guide walks through a quick example of forecasting sales for a grocery store chain in Ecuador.
Setup
-
First, let's create a virtual environment for this example:
python3 -m venv ~/venv/grocery_sales
Feel free to replace ~/venv/grocery_sales with any path where you'd like to store the virtual environment.
Now, activate the virtual environment:
source ~/venv/grocery_sales/bin/activate
-
Now let's install the packages we'll need for this example:
pip3 install 'lolpop[cli,mlflow,duckdb,timeseries,dvc,optuna]'
pip3 install tslumen
Note
The tslumen package conflicts with some other lolpop dependencies that are not used in the time series examples. As such, it's not included in an extra, but it's safe to install for time series workflows.
-
This example versions data with dvc, which requires a git repository. For the purposes of this guide, we'll create a git repository in our provider of choice named lolpop_grocery_sales_example. You'll then want to clone this repo locally, over either HTTPS or SSH, via something like one of the following:
git clone https://github.com/<git_user>/lolpop_grocery_sales_example.git
git clone git@github.com:<git_user>/lolpop_grocery_sales_example.git
Now let's clone the lolpop repository to get the files we need for our example, again over either HTTPS or SSH:
git clone https://github.com/jordanvolz/lolpop.git
git clone git@github.com:jordanvolz/lolpop.git
And then we'll copy the grocery_sales example files into our new repository:
cp -r lolpop/examples/time_series/grocery_sales/* lolpop_grocery_sales_example
Now, we'll move into the lolpop_grocery_sales_example directory and set up dvc:
cd lolpop_grocery_sales_example
mkdir dvc
dvc init
mkdir /tmp/artifacts
dvc remote add -d local /tmp/artifacts
Note: if you already use dvc, feel free to use a different remote. In this example, we're simply using the local directory /tmp/artifacts to store our dvc artifacts. If you use a remote other than local, you'll need to configure the dvcVersionControl class in the dev.yaml file to point to the correct remote.
-
Now we'll download the data for the example from Kaggle. If you already use kaggle from the command line you can simply execute the following:
kaggle competitions download -c store-sales-time-series-forecasting
unzip -j -o store-sales-time-series-forecasting.zip train.csv test.csv holidays_events.csv -d data
Note
You'll need to accept the terms of the kaggle contest to be able to download the data. If you've not already done that, you may wish to just manually download the data from the link below.
Or, manually download the data from the following link and unzip it. You should now have train.csv, test.csv, and holidays_events.csv files in the lolpop_grocery_sales_example/data directory.
-
In this example we'll also use duckdb as our main data source. To do this, we'll want to ingest our data into duckdb. You can do this via the following command:
python3 setup_duckdb.py
This sets up two tables in duckdb that we'll use in this example: total_store_forecast_train and total_store_forecast_test.
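Before running the workflow, it can help to confirm that the three CSV files from the download step are where this guide expects them. The following is a minimal sketch, not part of lolpop; the helper name missing_data_files is hypothetical, and it assumes you run it from inside the lolpop_grocery_sales_example directory.

```python
from pathlib import Path

# File names taken from the unzip command in this guide.
EXPECTED_FILES = ("train.csv", "test.csv", "holidays_events.csv")


def missing_data_files(data_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in data_dir."""
    data_path = Path(data_dir)
    return [name for name in expected if not (data_path / name).is_file()]


if __name__ == "__main__":
    # Assumes the current directory is lolpop_grocery_sales_example.
    missing = missing_data_files("data")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All data files are in place.")
```

An empty list means every expected file was found.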
Running the Workflow
-
To run our workflow, we'll simply need to execute the following command:
python3 run.py
-
You should see a bunch of INFO messages generated in your terminal. After several minutes your workflow should be finished. The end of your output will look something like:
...
2023/08/18 03:50:00.606712 [INFO] <MLFlowMetricsTracker> ::: Saving metric=store_sales_forecast_predictions.num_predictions, value=16 in run 410932cae602426dbe56f4de494f0c68
2023/08/18 03:50:00.622685 [INFO] <dvcVersionControl> ::: Executing command: `dvc add dvc/store_sales_forecast_predictions.csv`
100% Adding...|█████████████████████████████████████████████████████████████████████████████████|1/1 [00:00, 17.35file/s]
2023/08/18 03:50:03.269537 [INFO] <dvcVersionControl> ::: Executing command: `dvc commit dvc/store_sales_forecast_predictions.csv`
2023/08/18 03:50:06.988464 [INFO] <dvcVersionControl> ::: Committed and pushed file dvc/store_sales_forecast_predictions.csv.dvc. Result: 37c96fb..fb16892
2023/08/18 03:50:06.989114 [INFO] <dvcVersionControl> ::: Executing command: `dvc push --remote local`
2023/08/18 03:50:08.824089 [INFO] <dvcVersionControl> ::: Executing command: `dvc get /Users/jordanvolz/github/lolpop_petfinder_example dvc/store_sales_forecast_predictions.csv --show-url`
2023/08/18 03:50:10.795488 [INFO] <MLFlowMetadataTracker> ::: Saving tag key=store_sales_forecast_predictions.exists, value=True to run 410932cae602426dbe56f4de494f0c68
2023/08/18 03:50:10.803912 [INFO] <MLFlowMetadataTracker> ::: Saving tag key=store_sales_forecast_predictions.hexsha, value=fb16892da5eee6d9f7940c8150e49ac0730f9f6f to run 410932cae602426dbe56f4de494f0c68
2023/08/18 03:50:10.806612 [INFO] <MLFlowMetadataTracker> ::: Saving tag key=store_sales_forecast_predictions.uri, value=/tmp/artifacts/5b/1b70482d648e3aa779f559946ed75c to run 410932cae602426dbe56f4de494f0c68
exiting...
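If you capture this output to a file, a few lines of Python can pull out the tags that were saved for the run. This is only a sketch: the regular expression is written against the "Saving tag" lines in the sample output above, the helper name extract_tags is hypothetical, and the log format may differ between lolpop versions.

```python
import re

# Pattern for the "Saving tag" lines shown in the sample output above.
TAG_LINE = re.compile(
    r"Saving tag key=(?P<key>\S+), value=(?P<value>\S+) to run (?P<run>\w+)"
)


def extract_tags(log_text):
    """Map each tag key found in the log text to its logged value."""
    return {m.group("key"): m.group("value") for m in TAG_LINE.finditer(log_text)}


# Two lines copied from the sample output in this guide.
sample = (
    "2023/08/18 03:50:10.795488 [INFO] <MLFlowMetadataTracker> ::: Saving tag "
    "key=store_sales_forecast_predictions.exists, value=True to run "
    "410932cae602426dbe56f4de494f0c68\n"
    "2023/08/18 03:50:10.806612 [INFO] <MLFlowMetadataTracker> ::: Saving tag "
    "key=store_sales_forecast_predictions.uri, "
    "value=/tmp/artifacts/5b/1b70482d648e3aa779f559946ed75c to run "
    "410932cae602426dbe56f4de494f0c68\n"
)
print(extract_tags(sample))
```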
Exploring the Example
-
This example uses MLflow as the metadata tracker, which tracks all the artifacts we're creating. To view them, simply launch the MLflow UI locally:
mlflow ui
-
You can then open up your web browser and navigate to http://localhost:5000. You'll notice an experiment named store_sales_forecast with a recently completed run. Feel free to dig around and look at what information got logged.
-
Predictions from the model get saved to duckdb in the total_store_sales_predictions table. You can view them via:
import duckdb
con = duckdb.connect(database="duckdb/duck.db")
# .show() prints the result when this runs as a script rather than in a REPL.
con.sql("select * from total_store_sales_predictions").show()
Understanding the Example
To get a feel for tracing through the runner, pipeline and component code, see the quickstart guide. This example follows the same principles; it just does much more. In summary, this workflow accomplishes the following:
Data Processing:
-
Data is transformed via a data_transformer component. This would typically involve doing some light data engineering or feature engineering to get data into a single dataset.
-
Data is versioned and tracked using the metadata_tracker and the resource_version_control.
-
Data is profiled via a data_profiler. This performs an EDA-style analysis of the data and saves the output report.
-
Data is run through a series of data checks via a data_checker. These checks are typically associated with data quality and integrity. Again, the output report is saved into the metadata_tracker.
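To make the profiling and checking steps above concrete, here is a minimal, library-free sketch of what an EDA-style summary and a data quality check might compute. It is purely illustrative and does not use lolpop's API: the function names profile and check and the sample rows are all hypothetical.

```python
from statistics import mean

# Hypothetical rows standing in for the transformed sales data.
rows = [
    {"date": "2017-08-01", "sales": 120.0},
    {"date": "2017-08-02", "sales": None},
    {"date": "2017-08-03", "sales": 95.5},
]


def profile(rows, column):
    """EDA-style summary of one numeric column, ignoring missing values."""
    values = [r[column] for r in rows if r[column] is not None]
    return {
        "count": len(rows),
        "missing": len(rows) - len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
    }


def check(rows, column):
    """Data quality checks; returns a list of human-readable failures."""
    failures = []
    if any(r[column] is None for r in rows):
        failures.append(f"{column} has missing values")
    if any(r[column] is not None and r[column] < 0 for r in rows):
        failures.append(f"{column} has negative values")
    return failures


report = profile(rows, "sales")
failures = check(rows, "sales")
print(report)
print(failures)
```

In the real workflow, reports like these would be saved to the metadata tracker rather than printed.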
Model Training:
-
The input dataset is split into training, validation, and test datasets using a data_splitter.
-
A model is trained! This fits a model using a hyperparameter_tuner or just a single model_trainer, depending on the configuration. All experiments are tracked in the metadata_tracker. All models created are versioned and tracked in the resource_version_control, along with the data splits themselves.
-
The model is analyzed. This performs tasks like running cross validation on the model, running a baseline comparison, and creating visualizations with a model_visualizer.
-
The model lineage is created and tracked in the metadata_tracker.
-
Optionally, if configured by the user, the workflow can retrain the model using all available data. This retrains the model on the entire dataset using the configuration of the best experiment.
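For time series data, splits are usually made in time order rather than at random, and a baseline comparison is often a simple naive forecast. The sketch below illustrates both ideas in plain Python. It is an assumption about the general technique, not lolpop's data_splitter or baseline implementation, and the function names and sample values are hypothetical.

```python
def chronological_split(series, valid_size, test_size):
    """Split an ordered series into train, validation, and test segments without shuffling."""
    if valid_size < 0 or test_size < 0 or valid_size + test_size >= len(series):
        raise ValueError("holdout sizes must leave at least one training point")
    train_end = len(series) - valid_size - test_size
    valid_end = len(series) - test_size
    return series[:train_end], series[train_end:valid_end], series[valid_end:]


def naive_forecast(train, horizon):
    """Baseline: repeat the last observed value for every future step."""
    return [train[-1]] * horizon


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical daily sales totals, oldest first.
sales = [100, 102, 98, 105, 110, 108, 115, 120, 118, 125]
train, valid, test = chronological_split(sales, valid_size=2, test_size=2)
baseline = naive_forecast(train, horizon=len(valid))
print(train, valid, test)
print(mean_absolute_error(valid, baseline))
```

A trained model is only worth keeping if its validation error beats a baseline of this kind.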
Model Deployment:
-
Promotes a model using a model_repository.
-
Checks if a model has been approved. If so, the model will be deployed using a model_deployer.
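The promote-then-gate pattern described above can be sketched in a few lines. Everything here is hypothetical and for illustration only: ModelRecord, promote, and deploy_if_approved are not lolpop classes or functions, and the stage name "staging" is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ModelRecord:
    """Hypothetical registry entry; not a lolpop class."""
    name: str
    version: int
    stage: str = "none"
    approved: bool = False


def promote(record, stage="staging"):
    """Register the model version under a named stage."""
    record.stage = stage
    return record


def deploy_if_approved(record, deploy):
    """Call deploy(record) only when the record has been approved."""
    if not record.approved:
        return None
    return deploy(record)


def fake_deploy(record):
    """Stand-in for a real deployment step."""
    return f"deployed {record.name} v{record.version}"


record = promote(ModelRecord(name="store_sales_forecast", version=1))
print(deploy_if_approved(record, fake_deploy))  # not approved yet, so nothing is deployed
record.approved = True
print(deploy_if_approved(record, fake_deploy))
```

Keeping approval separate from promotion lets a person or an automated check sign off before anything is deployed.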
Model Inference:
-
Uses the model_trainer to generate predictions for the new data.
-
Tracks and versions the new predictions using the metadata_tracker and the resource_version_control.
-
Saves the predictions to a target location via a data_connector.
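As a rough illustration of saving and versioning predictions, the sketch below writes prediction rows to a CSV file and computes a content digest that could serve as a version identifier. This is an assumption-laden stand-in, not how lolpop or dvc track files internally; the function name save_predictions, the column names, and the sample values are hypothetical.

```python
import csv
import hashlib
import tempfile
from pathlib import Path


def save_predictions(rows, path):
    """Write prediction rows to a CSV file and return a SHA-256 digest of its contents."""
    path = Path(path)
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["date", "prediction"])
        writer.writeheader()
        writer.writerows(rows)
    return hashlib.sha256(path.read_bytes()).hexdigest()


# Hypothetical forecast rows; a real run would take these from the trained model.
predictions = [
    {"date": "2017-08-16", "prediction": 4321.5},
    {"date": "2017-08-17", "prediction": 4410.0},
]

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "predictions.csv"
    digest = save_predictions(predictions, target)
    print(target.name, digest[:12])
```

Identical predictions always produce the same digest, so a changed digest signals that a new version of the predictions was written.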