Interval-based time series classification in sktime#
Interval-based approaches look at phase-dependent intervals of the full series, calculating summary statistics from selected subseries to use in classification.
Currently, 5 univariate interval-based approaches are implemented in sktime: Time Series Forest (TSF) [1], the Random Interval Spectral Ensemble (RISE) [2], the Supervised Time Series Forest (STSF) [3], the Canonical Interval Forest (CIF) [4] and the Diverse Representation Canonical Interval Forest (DrCIF). Both CIF and DrCIF have multivariate capabilities.
In this notebook, we will demonstrate how to use these classifiers on the ItalyPowerDemand and BasicMotions datasets.
References:#
[1] Deng, H., Runger, G., Tuv, E., & Vladimir, M. (2013). A time series forest for classification and feature extraction. Information Sciences, 239, 142-153.
[2] Flynn, M., Large, J., & Bagnall, A. (2019). The contract random interval spectral ensemble (c-RISE): the effect of contracting a classifier on accuracy. In International Conference on Hybrid Artificial Intelligence Systems (pp. 381-392). Springer, Cham.
[3] Cabello, N., Naghizade, E., Qi, J., & Kulik, L. (2020). Fast and Accurate Time Series Classification Through Supervised Interval Search. In IEEE International Conference on Data Mining.
[4] Middlehurst, M., Large, J., & Bagnall, A. (2020). The Canonical Interval Forest (CIF) Classifier for Time Series Classification. arXiv preprint arXiv:2008.09172.
[5] Lubba, C. H., Sethi, S. S., Knaute, P., Schultz, S. R., Fulcher, B. D., & Jones, N. S. (2019). catch22: CAnonical Time-series CHaracteristics. Data Mining and Knowledge Discovery, 33(6), 1821-1852.
1. Imports#
[ ]:
import numpy as np
from sklearn import metrics
from sklearn.pipeline import Pipeline
from sktime.classification.interval_based import (
CanonicalIntervalForest,
DrCIF,
RandomIntervalSpectralEnsemble,
SupervisedTimeSeriesForest,
TimeSeriesForestClassifier,
)
from sktime.datasets import load_basic_motions, load_italy_power_demand
from sktime.transformations.panel.compose import ColumnConcatenator
2. Load data#
[ ]:
X_train, y_train = load_italy_power_demand(split="train", return_X_y=True)
X_test, y_test = load_italy_power_demand(split="test", return_X_y=True)
X_test = X_test[:50]
y_test = y_test[:50]
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
X_train_mv, y_train_mv = load_basic_motions(split="train", return_X_y=True)
X_test_mv, y_test_mv = load_basic_motions(split="test", return_X_y=True)
X_train_mv = X_train_mv[:50]
y_train_mv = y_train_mv[:50]
X_test_mv = X_test_mv[:50]
y_test_mv = y_test_mv[:50]
print(X_train_mv.shape, y_train_mv.shape, X_test_mv.shape, y_test_mv.shape)
3. Time Series Forest (TSF)#
TSF is an ensemble of tree classifiers built on the summary statistics of randomly selected intervals. For each tree, sqrt(series length) intervals are randomly selected. From each of these intervals, the mean, standard deviation and slope of the time series are extracted and concatenated into a feature vector. These new features are then used to build a tree, which is added to the ensemble.
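As a rough sketch of this interval extraction step (plain numpy, not sktime's actual implementation; interval bounds and the minimum interval length of 3 are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)


def tsf_interval_features(x, n_intervals=None, rng=rng):
    """Extract mean, std and slope from randomly chosen intervals of x."""
    m = len(x)
    if n_intervals is None:
        n_intervals = int(np.sqrt(m))  # sqrt(series length) intervals per tree
    features = []
    for _ in range(n_intervals):
        # Random interval of length >= 3 so that a slope is well defined
        start = rng.integers(0, m - 3)
        end = rng.integers(start + 3, m + 1)
        interval = x[start:end]
        t = np.arange(len(interval))
        slope = np.polyfit(t, interval, 1)[0]  # least-squares slope
        features.extend([interval.mean(), interval.std(), slope])
    return np.array(features)


x = np.sin(np.linspace(0, 4 * np.pi, 24))
feats = tsf_interval_features(x)
print(feats.shape)  # 3 statistics per interval, int(sqrt(24)) = 4 intervals -> (12,)
```

Each tree in the ensemble sees a different random draw of intervals, which is where the diversity of the forest comes from.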
[ ]:
tsf = TimeSeriesForestClassifier(n_estimators=50, random_state=47)
tsf.fit(X_train, y_train)
tsf_preds = tsf.predict(X_test)
print("TSF Accuracy: " + str(metrics.accuracy_score(y_test, tsf_preds)))
[ ]:
tsf = Pipeline(
[
("column_concatenator", ColumnConcatenator()),
("classify", TimeSeriesForestClassifier(n_estimators=50, random_state=47)),
]
)
tsf.fit(X_train_mv, y_train_mv)
tsf_preds = tsf.predict(X_test_mv)
print("TSF Accuracy: " + str(metrics.accuracy_score(y_test_mv, tsf_preds)))
[ ]:
temporal_feature_importance = tsf["classify"].feature_importances_
separators = range(0, tsf["classify"].series_length, len(X_train_mv.iloc[0, 0]))
ax = temporal_feature_importance.plot(figsize=(20, 10))
for index, separator in enumerate(separators):
ax.vlines(
separator,
temporal_feature_importance.min().min(),
temporal_feature_importance.max().max(),
color="r",
alpha=0.3,
)
ax.text(
separator, temporal_feature_importance.max().max(), X_train_mv.columns[index]
)
[ ]:
X_train_mv_columns = list(X_train_mv.columns)
np.random.shuffle(X_train_mv_columns)
X_train_shuffled = X_train_mv[X_train_mv_columns]
X_train_shuffled.columns = X_train_mv.columns
X_test_shuffled = X_test_mv[X_train_mv_columns]
X_test_shuffled.columns = X_test_mv.columns
tsf = Pipeline(
[
("column_concatenator", ColumnConcatenator()),
("classify", TimeSeriesForestClassifier(n_estimators=50, random_state=47)),
]
)
tsf.fit(X_train_shuffled, y_train_mv)
tsf_preds = tsf.predict(X_test_shuffled)
print("TSF Accuracy: " + str(metrics.accuracy_score(y_test_mv, tsf_preds)))
[ ]:
temporal_feature_importance = tsf["classify"].feature_importances_
separators = range(0, tsf["classify"].series_length, len(X_train_mv.iloc[0, 0]))
ax = temporal_feature_importance.plot(figsize=(20, 10))
for index, separator in enumerate(separators):
ax.vlines(
separator,
temporal_feature_importance.min().min(),
temporal_feature_importance.max().max(),
color="r",
alpha=0.3,
)
ax.text(
separator, temporal_feature_importance.max().max(), X_train_mv_columns[index]
)
[ ]:
tsf = Pipeline(
[
("column_concatenator", ColumnConcatenator()),
(
"classify",
TimeSeriesForestClassifier(
n_estimators=50, random_state=47, inner_series_length=100
),
),
]
)
tsf.fit(X_train_mv, y_train_mv)
tsf_preds = tsf.predict(X_test_mv)
print("TSF Accuracy: " + str(metrics.accuracy_score(y_test_mv, tsf_preds)))
[ ]:
temporal_feature_importance = tsf["classify"].feature_importances_
separators = range(0, tsf["classify"].series_length, len(X_train_mv.iloc[0, 0]))
ax = temporal_feature_importance.plot(figsize=(20, 10))
for index, separator in enumerate(separators):
ax.vlines(
separator,
temporal_feature_importance.min().min(),
temporal_feature_importance.max().max(),
color="r",
alpha=0.3,
)
ax.text(
separator, temporal_feature_importance.max().max(), X_train_mv.columns[index]
)
[ ]:
X_train_mv_columns = list(X_train_mv.columns)
np.random.shuffle(X_train_mv_columns)
X_train_shuffled = X_train_mv[X_train_mv_columns]
X_train_shuffled.columns = X_train_mv.columns
X_test_shuffled = X_test_mv[X_train_mv_columns]
X_test_shuffled.columns = X_test_mv.columns
tsf = Pipeline(
[
("column_concatenator", ColumnConcatenator()),
(
"classify",
TimeSeriesForestClassifier(
n_estimators=50, random_state=47, inner_series_length=100
),
),
]
)
tsf.fit(X_train_shuffled, y_train_mv)
tsf_preds = tsf.predict(X_test_shuffled)
print("TSF Accuracy: " + str(metrics.accuracy_score(y_test_mv, tsf_preds)))
[ ]:
temporal_feature_importance = tsf["classify"].feature_importances_
separators = range(0, tsf["classify"].series_length, len(X_train_mv.iloc[0, 0]))
ax = temporal_feature_importance.plot(figsize=(20, 10))
for index, separator in enumerate(separators):
ax.vlines(
separator,
temporal_feature_importance.min().min(),
temporal_feature_importance.max().max(),
color="r",
alpha=0.3,
)
ax.text(
separator, temporal_feature_importance.max().max(), X_train_mv_columns[index]
)
4. Random Interval Spectral Ensemble (RISE)#
RISE is a tree-based interval ensemble aimed at classifying audio data. Unlike TSF, it uses a single interval per tree, and it uses spectral features rather than summary statistics.
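The spectral features can be sketched as follows (a simplified illustration with numpy, not sktime's implementation; RISE combines power-spectrum and autocorrelation features of the chosen interval):

```python
import numpy as np


def rise_features(x):
    """Spectral features in the spirit of RISE: power spectrum and ACF of one interval."""
    x = np.asarray(x, dtype=float)
    # Power spectrum via the real FFT
    ps = np.abs(np.fft.rfft(x)) ** 2
    # Autocorrelation function, normalised by lag 0, dropping lag 0 itself
    xc = x - x.mean()
    acf = np.correlate(xc, xc, mode="full")[len(x) - 1:]
    acf = acf / acf[0]
    return np.concatenate([ps, acf[1:]])


rng = np.random.default_rng(1)
x = rng.standard_normal(32)  # stand-in for a randomly selected interval
feats = rise_features(x)
print(feats.shape)  # rfft of 32 points -> 17 bins, plus 31 ACF lags -> (48,)
```

A tree is then trained on these features, one random interval per ensemble member.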
[ ]:
rise = RandomIntervalSpectralEnsemble(n_estimators=50, random_state=47)
rise.fit(X_train, y_train)
rise_preds = rise.predict(X_test)
print("RISE Accuracy: " + str(metrics.accuracy_score(y_test, rise_preds)))
5. Supervised Time Series Forest (STSF)#
STSF makes a number of adjustments to the original TSF algorithm. A supervised method of selecting intervals replaces random selection. Features are extracted from intervals generated from additional representations such as the periodogram and first-order differences. The extracted statistics include the median, minimum, maximum and interquartile range.
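The two ingredients the paragraph mentions, the extra statistics and the extra series representations, can be sketched like this (illustrative numpy only; the supervised interval search itself is omitted):

```python
import numpy as np


def stsf_stats(interval):
    """Summary statistics STSF adds: median, min, max and interquartile range."""
    q75, q25 = np.percentile(interval, [75, 25])
    return [np.median(interval), interval.min(), interval.max(), q75 - q25]


def stsf_representations(x):
    """The series representations STSF extracts intervals from."""
    x = np.asarray(x, dtype=float)
    return {
        "original": x,
        "periodogram": np.abs(np.fft.rfft(x)) ** 2,
        "first_difference": np.diff(x),
    }


x = np.sin(np.linspace(0, 2 * np.pi, 16))
feats = [s for rep in stsf_representations(x).values() for s in stsf_stats(rep)]
print(len(feats))  # 4 statistics x 3 representations = 12
```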
[ ]:
stsf = SupervisedTimeSeriesForest(n_estimators=50, random_state=47)
stsf.fit(X_train, y_train)
stsf_preds = stsf.predict(X_test)
print("STSF Accuracy: " + str(metrics.accuracy_score(y_test, stsf_preds)))
6. Canonical Interval Forest (CIF)#
CIF extends the TSF algorithm. In addition to the 3 statistics used by TSF, CIF makes use of the features from the Catch22 [5] transform. To increase the diversity of the ensemble, the number of TSF and Catch22 attributes used is randomly subsampled per tree.
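The per-tree attribute subsampling can be illustrated as follows (a sketch only: the `catch22_*` labels are placeholders for the real catch22 features, and `att_subsample_size=8` mirrors the parameter used in the cells below):

```python
import numpy as np

rng = np.random.default_rng(47)

# CIF's candidate attribute pool: the 3 TSF statistics plus the 22 catch22 features.
attribute_pool = ["mean", "std", "slope"] + [f"catch22_{i}" for i in range(22)]

# Per tree, a random subsample of attributes is drawn without replacement,
# so different trees summarise their intervals with different feature sets.
att_subsample_size = 8
per_tree_attributes = [
    rng.choice(attribute_pool, size=att_subsample_size, replace=False)
    for _ in range(3)  # three example trees
]
for tree, atts in enumerate(per_tree_attributes):
    print(tree, sorted(atts))
```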
Univariate#
[ ]:
cif = CanonicalIntervalForest(n_estimators=50, att_subsample_size=8, random_state=47)
cif.fit(X_train, y_train)
cif_preds = cif.predict(X_test)
print("CIF Accuracy: " + str(metrics.accuracy_score(y_test, cif_preds)))
Multivariate#
[ ]:
cif_m = CanonicalIntervalForest(n_estimators=50, att_subsample_size=8, random_state=47)
cif_m.fit(X_train_mv, y_train_mv)
cif_m_preds = cif_m.predict(X_test_mv)
print("CIF Accuracy: " + str(metrics.accuracy_score(y_test_mv, cif_m_preds)))
7. Diverse Representation Canonical Interval Forest (DrCIF)#
DrCIF makes use of the periodogram and differences representations used by STSF, as well as the additional statistics from CIF.
Univariate#
[ ]:
drcif = DrCIF(n_estimators=5, att_subsample_size=10, random_state=47)
drcif.fit(X_train, y_train)
drcif_preds = drcif.predict(X_test)
print("DrCIF Accuracy: " + str(metrics.accuracy_score(y_test, drcif_preds)))
Multivariate#
[ ]:
drcif_m = DrCIF(n_estimators=5, att_subsample_size=10, random_state=47)
drcif_m.fit(X_train_mv, y_train_mv)
drcif_m_preds = drcif_m.predict(X_test_mv)
print("DrCIF Accuracy: " + str(metrics.accuracy_score(y_test_mv, drcif_m_preds)))