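AutoGluon is a very convenient machine learning library: it can tackle complex machine learning problems in as few as three lines of code.
Below is the time series forecasting example I found in the official documentation.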
[Machine learning](https://auto.gluon.ai/stable/tutorials/timeseries/forecasting-indepth.html)

    import pandas as pd
    from autogluon.timeseries import TimeSeriesDataFrame, TimeSeriesPredictor

This imports the libraries. Note that the autogluon package can be installed directly from PyCharm; the installation failed for me while my VPN was on.
Time series with static features
Load the data

    train_data = TimeSeriesDataFrame.from_path(
        "https://autogluon.s3.amazonaws.com/datasets/timeseries/m4_daily_subset/train.csv",
    )
    train_data.head()

AutoGluon expects static features as a pandas DataFrame whose index contains the item_id of each time series.

    static_features = pd.read_csv(
        "https://autogluon.s3.amazonaws.com/datasets/timeseries/m4_daily_subset/metadata.csv",
        index_col="item_id",
    )
    static_features.head()

Attach the static features to the TimeSeriesDataFrame:

    train_data.static_features = static_features

The domain column is inferred to be a categorical feature, as the log printed during fitting shows:

    predictor = TimeSeriesPredictor(prediction_length=14).fit(train_data)

    ...
    Following types of static features have been inferred:
        categorical: ['domain']
        continuous (float): []
    ...

Time series with time-varying covariates
Generate the covariates

    import numpy as np

    # Log-transformed target: only known for the past
    train_data["log_target"] = np.log(train_data["target"])

    # Weekend indicator: known for any date, past or future
    WEEKEND_INDICES = [5, 6]
    timestamps = train_data.index.get_level_values("timestamp")
    train_data["weekend"] = timestamps.weekday.isin(WEEKEND_INDICES).astype(float)
    train_data.head()

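Note the weekend column in the code: it captures the weekly "seasonality" of the data. The log printed during fitting reports how each column is used:

    ...
    Provided dataset contains following columns:
        target:           'target'
        known covariates: ['weekend']
        past covariates:  ['log_target']
    ...
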
Finally, to make predictions, we generate the known covariates for the forecast horizon:

    from autogluon.timeseries.utils.forecast import get_forecast_horizon_index_ts_dataframe

    future_index = get_forecast_horizon_index_ts_dataframe(train_data, prediction_length=14)
    future_timestamps = future_index.get_level_values("timestamp")
    known_covariates = pd.DataFrame(index=future_index)
    known_covariates["weekend"] = future_timestamps.weekday.isin(WEEKEND_INDICES).astype(float)
    known_covariates.head()

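To make the idea of a "known" covariate concrete, here is a minimal standard-library sketch that builds the weekend flag for the days following the last observation. It does not use AutoGluon at all; the function name `future_weekend_flags` and the example date are my own assumptions, chosen only for illustration.

```python
from datetime import date, timedelta

WEEKEND_INDICES = (5, 6)  # Monday is 0, so 5 and 6 are Saturday and Sunday


def future_weekend_flags(last_observed, prediction_length):
    """Map each of the next `prediction_length` days to a 0.0/1.0 weekend flag.

    Hypothetical helper: it mimics the weekend covariate for a daily series.
    """
    flags = {}
    for step in range(1, prediction_length + 1):
        day = last_observed + timedelta(days=step)
        flags[day] = float(day.weekday() in WEEKEND_INDICES)
    return flags


# Assumed example: the series ends on Thursday 2022-01-06, horizon of 4 days
flags = future_weekend_flags(date(2022, 1, 6), prediction_length=4)
for day, flag in flags.items():
    print(day.isoformat(), flag)
```

The point is that the flag depends only on the calendar, so it can be computed for future dates before any target values exist. A column such as log_target cannot, which is why it counts as a past covariate.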
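The training data must contain at least some time series of length ≥ 2 * prediction_length + 1. This is required so that some data is available for the internal validation set.
All time series must be regularly sampled.
The data must not contain missing values.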
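The three requirements above can be written down as simple checks. The sketch below is plain Python under my own assumptions (a series is a list of dates plus a list of values, sampled daily); the helper names `enough_history` and `check_series` are hypothetical and are not part of AutoGluon.

```python
import math
from datetime import date, timedelta


def enough_history(series_lengths, prediction_length):
    """True if at least one series has length >= 2 * prediction_length + 1."""
    min_length = 2 * prediction_length + 1
    return any(length >= min_length for length in series_lengths)


def check_series(timestamps, values, step=timedelta(days=1)):
    """Return (is_regular, has_missing) for a single series."""
    # Regular sampling: every consecutive pair of timestamps is one step apart
    is_regular = all(b - a == step for a, b in zip(timestamps, timestamps[1:]))
    # Missing values: None or a float NaN
    has_missing = any(
        v is None or (isinstance(v, float) and math.isnan(v)) for v in values
    )
    return is_regular, has_missing


print(enough_history([10, 29], prediction_length=14))
print(check_series([date(2022, 1, 1), date(2022, 1, 2), date(2022, 1, 4)], [1, 2, 3]))
```

With prediction_length=14 the minimum length is 29, so the first call passes only because of the series of length 29. The second call flags the series as irregular because of the gap on 2022-01-03.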
How to handle irregular data and missing values
Here is an example of a dataset with an irregular time index:

    df_irregular = TimeSeriesDataFrame(
        pd.DataFrame(
            {
                "item_id": [0, 0, 0, 1, 1],
                "timestamp": ["2022-01-01", "2022-01-02", "2022-01-04", "2022-01-01", "2022-01-04"],
                "target": [1, 2, 3, 4, 5],
            }
        )
    )
    df_irregular

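Clearly, some timestamps are missing: item 0 has no observation on 2022-01-03, and item 1 has none on 2022-01-02 or 2022-01-03.
Once the data has been converted to a regular daily index (stored here as df_regular), we can verify that the index is now regular and has daily frequency:

    print(f"Data has frequency '{df_regular.freq}'")

The result is D, which means the timestamps follow a daily frequency.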
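The conversion to a regular daily index can be pictured with a small standard-library sketch. This is only my illustration of the idea, not AutoGluon's implementation; `to_daily_grid` is a hypothetical helper, and the data mirrors item 0 from the example above.

```python
from datetime import date, timedelta


def to_daily_grid(observations):
    """Spread {date: value} onto every calendar day between the first and last date."""
    start, end = min(observations), max(observations)
    grid = []
    for offset in range((end - start).days + 1):
        day = start + timedelta(days=offset)
        # Days without an observation get NaN, which marks the value as missing
        grid.append((day, observations.get(day, float("nan"))))
    return grid


# Item 0 from the irregular example: 2022-01-03 is absent
item_0 = {date(2022, 1, 1): 1.0, date(2022, 1, 2): 2.0, date(2022, 1, 4): 3.0}
regular_item_0 = to_daily_grid(item_0)
for day, value in regular_item_0:
    print(day.isoformat(), value)
```

The grid now has one row per day, and the previously absent day shows up as NaN. That is why a regular index usually comes with missing values that still need to be filled.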
However, the data now contains missing values, represented by NaN. To fill them, we use the TimeSeriesDataFrame.fill_missing_values() method, which implements several imputation strategies.

    df_filled = df_regular.fill_missing_values()
    df_filled

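As far as I understand, the default strategy is a forward fill followed by a backward fill, but treat that as my assumption and check the documentation for your version. The sketch below shows that strategy on a plain list; `fill_missing` is a hypothetical helper, not the library method.

```python
import math


def is_missing(value):
    """True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fill_missing(values):
    """Forward-fill gaps, then backward-fill whatever is still missing at the start."""
    filled = list(values)
    last_seen = None
    for i, value in enumerate(filled):  # forward pass
        if is_missing(value):
            if last_seen is not None:
                filled[i] = last_seen
        else:
            last_seen = value
    next_seen = None
    for i in range(len(filled) - 1, -1, -1):  # backward pass
        if is_missing(filled[i]):
            if next_seen is not None:
                filled[i] = next_seen
        else:
            next_seen = filled[i]
    return filled


nan = float("nan")
print(fill_missing([nan, 1.0, nan, 3.0, nan]))
```

The forward pass covers gaps in the middle and at the end of a series. The backward pass is only needed for leading gaps, where there is no earlier value to carry forward.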
How to evaluate forecast accuracy?
How the dataset is split matters: the last prediction_length steps of each series are held out for testing.

    prediction_length = 48
    data = TimeSeriesDataFrame.from_path("https://autogluon.s3.amazonaws.com/datasets/timeseries/m4_hourly_subset/train.csv")
    train_data, test_data = data.train_test_split(prediction_length)

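The split can be sketched on plain Python lists. I am assuming, based on the plot titles further down ("past" versus "past + future"), that the test set keeps each full series while the training set drops the last prediction_length steps. `hold_out_last` is a hypothetical helper written for this post.

```python
def hold_out_last(series_by_item, prediction_length):
    """Split each series: train drops the last `prediction_length` values, test keeps all."""
    if prediction_length <= 0:
        raise ValueError("prediction_length must be positive")
    train, test = {}, {}
    for item_id, values in series_by_item.items():
        test[item_id] = list(values)  # past + future
        train[item_id] = list(values[:-prediction_length])  # past only
    return train, test


# Toy data with made-up values
toy = {"H1": [1, 2, 3, 4, 5, 6], "H2": [10, 20, 30, 40]}
train, test = hold_out_last(toy, prediction_length=2)
print(train)
print(test)
```

Because the split happens along the time axis, the model never sees the values it is later scored on. A random row-wise split would leak future values into training.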
Plot the data

    import matplotlib.pyplot as plt
    import numpy as np

    item_id = "H1"
    fig, (ax1, ax2) = plt.subplots(nrows=2, figsize=[10, 4], sharex=True)
    train_ts = train_data.loc[item_id]
    test_ts = test_data.loc[item_id]
    ax1.set_title("Train data (past time series values)")
    ax1.plot(train_ts)
    ax2.set_title("Test data (past + future time series values)")
    ax2.plot(test_ts)
    # Shade the forecast horizon, i.e. the part present only in the test data
    for ax in (ax1, ax2):
        ax.fill_between(np.array([train_ts.index[-1], test_ts.index[-1]]), test_ts.min(), test_ts.max(), color="C1", alpha=0.3, label="Forecast horizon")
    plt.legend()
    plt.show()

Model evaluation

    predictor = TimeSeriesPredictor(prediction_length=prediction_length, eval_metric="MASE").fit(train_data)
    predictor.evaluate(test_data)

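For readers unfamiliar with MASE (mean absolute scaled error), here is a minimal sketch of the textbook definition for a single series: the mean absolute forecast error divided by the in-sample error of a seasonal naive forecast. The function and the toy numbers are my own, and the library's exact computation and sign convention may differ.

```python
def mase(past, actual, forecast, seasonal_period=1):
    """Mean absolute scaled error for one series (textbook definition)."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    if len(past) <= seasonal_period:
        raise ValueError("not enough past values for the naive baseline")
    # In-sample error of the seasonal naive forecast y[t] ~ y[t - seasonal_period]
    naive_errors = [
        abs(past[t] - past[t - seasonal_period])
        for t in range(seasonal_period, len(past))
    ]
    scale = sum(naive_errors) / len(naive_errors)
    if scale == 0:
        raise ValueError("naive baseline error is zero; MASE is undefined")
    forecast_error = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return forecast_error / scale


# Toy numbers, made up for illustration
score = mase(past=[1, 2, 3, 4], actual=[5, 6], forecast=[4, 8])
print(score)
```

A value below 1 means the forecast beats the naive baseline on average, and a value above 1 means it does worse.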