Canonical Time-series Characteristics (catch22) transform#
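catch22 [1] is a collection of 22 features drawn from the more than 7000 time series features in the hctsa [2][3] toolbox. To remove redundancy, hierarchical clustering was performed on the correlation matrix of the features that performed better than chance. The resulting clusters were ranked by balanced accuracy using a decision tree classifier, and one feature was selected from each of the 22 clusters, taking into account balanced accuracy, computational efficiency, and interpretability.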
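The reduction step described above can be sketched in a few lines of pure Python. This is a minimal illustration and not the published pipeline: the distance `1 - |r|`, the use of complete linkage, the threshold of 0.2, and the toy correlation matrix and scores are all assumptions made for the example, and the sketch picks the representative by score alone, whereas the description above also weighs computational efficiency and interpretability.

```python
from itertools import combinations


def complete_linkage_clusters(dist, threshold):
    """Agglomerative clustering that stops once the closest pair of
    clusters is farther apart than `threshold`.

    `dist` is a symmetric list-of-lists of pairwise distances. The distance
    between two clusters is the largest distance between any two of their
    members (complete linkage).
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        if d > threshold:
            break
        # Merge cluster b into cluster a (a < b, so deleting b is safe).
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


def pick_representatives(clusters, scores):
    """Keep the highest-scoring feature index from each cluster."""
    return sorted(max(c, key=lambda i: scores[i]) for c in clusters)


# Toy correlation matrix for 5 hypothetical features (values are made up).
corr = [
    [1.00, 0.95, 0.10, 0.05, 0.00],
    [0.95, 1.00, 0.15, 0.10, 0.05],
    [0.10, 0.15, 1.00, 0.90, 0.20],
    [0.05, 0.10, 0.90, 1.00, 0.10],
    [0.00, 0.05, 0.20, 0.10, 1.00],
]
dist = [[1 - abs(r) for r in row] for row in corr]
scores = [0.71, 0.78, 0.66, 0.64, 0.59]  # made-up balanced accuracies

clusters = complete_linkage_clusters(dist, threshold=0.2)
selected = pick_representatives(clusters, scores)
print(clusters)  # [[0, 1], [2, 3], [4]]
print(selected)  # [1, 2, 4]
```

Features 0 and 1 are nearly redundant, as are 2 and 3, so each pair collapses into one cluster and only the better-scoring member survives.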
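In this notebook, we demonstrate how to use the catch22 transformer on the univariate ItalyPowerDemand and multivariate BasicMotions datasets. We also show catch22 used for classification together with a random forest classifier.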
References:#
[1] Lubba, C. H., Sethi, S. S., Knaute, P., Schultz, S. R., Fulcher, B. D., & Jones, N. S. (2019). catch22: CAnonical Time-series CHaracteristics. Data Mining and Knowledge Discovery, 33(6), 1821-1852.
[2] Fulcher, B. D., & Jones, N. S. (2017). hctsa: A computational framework for automated time-series phenotyping using massive feature extraction. Cell Systems, 5(5), 527-531.
[3] Fulcher, B. D., Little, M. A., & Jones, N. S. (2013). Highly comparative time-series analysis: the empirical structure of time series and their methods. Journal of the Royal Society Interface, 10(83), 20130048.
1. Imports#
[1]:
from sklearn import metrics
from sktime.classification.feature_based import Catch22Classifier
from sktime.datasets import load_basic_motions, load_italy_power_demand
from sktime.transformations.panel.catch22 import Catch22
2. Load data#
[2]:
IPD_X_train, IPD_y_train = load_italy_power_demand(split="train", return_X_y=True)
IPD_X_test, IPD_y_test = load_italy_power_demand(split="test", return_X_y=True)
IPD_X_test = IPD_X_test[:50]
IPD_y_test = IPD_y_test[:50]
print(IPD_X_train.shape, IPD_y_train.shape, IPD_X_test.shape, IPD_y_test.shape)
BM_X_train, BM_y_train = load_basic_motions(split="train", return_X_y=True)
BM_X_test, BM_y_test = load_basic_motions(split="test", return_X_y=True)
print(BM_X_train.shape, BM_y_train.shape, BM_X_test.shape, BM_y_test.shape)
(67, 1) (67,) (50, 1) (50,)
(40, 6) (40,) (40, 6) (40,)
3. catch22 transform#
Univariate#
The catch22 features are provided as a transformer, Catch22. The transformed data can then be used for a variety of time series analysis tasks.
[3]:
c22_uv = Catch22()
c22_uv.fit(IPD_X_train, IPD_y_train)
[3]:
Catch22()
[4]:
transformed_data_uv = c22_uv.transform(IPD_X_train)
transformed_data_uv.head()
[4]:
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.158630 | -0.217227 | 8.0 | 0.291667 | -0.625000 | 3.0 | 6.0 | 0.468052 | 0.589049 | 0.836755 | ... | 3.0 | 1.000000 | 5.0 | 1.778748 | 0.750000 | 0.240598 | NaN | NaN | 0.040000 | NaN |
1 | 0.918162 | -0.214762 | 15.0 | 0.208333 | -0.666667 | 4.0 | 8.0 | 0.702775 | 0.196350 | 0.666160 | ... | 4.0 | 0.869565 | 5.0 | 1.730238 | 0.500000 | 0.388217 | NaN | NaN | 0.111111 | NaN |
2 | -0.273180 | -0.085856 | 4.0 | 0.875000 | 0.250000 | 2.0 | 5.0 | 0.310567 | 0.589049 | 0.865073 | ... | 2.0 | 0.913043 | 5.0 | 1.836012 | 0.666667 | 0.089104 | NaN | NaN | 0.034014 | NaN |
3 | 0.048411 | -0.450080 | 13.0 | 0.166667 | -0.625000 | 4.0 | 10.0 | 0.804047 | 0.196350 | 0.648309 | ... | 4.0 | 0.869565 | 6.0 | 1.605420 | 0.666667 | 0.332436 | NaN | NaN | 0.111111 | NaN |
4 | 0.426379 | 0.572566 | 16.0 | 0.291667 | -0.666667 | 4.0 | 7.0 | 0.675485 | 0.196350 | 0.657946 | ... | 4.0 | 0.913043 | 6.0 | 1.730238 | 0.500000 | 0.318405 | NaN | NaN | 0.111111 | NaN |
5 rows × 22 columns
Note that Catch22 does not use the labels (y) in its fit(X, y=None) method, so the separate fit and transform calls can be replaced by a single fit_transform call.
[5]:
c22_uv_single_step = Catch22()
transformed_data_uv_single_step = c22_uv_single_step.fit_transform(IPD_X_train)
transformed_data_uv_single_step.equals(transformed_data_uv)
[5]:
True
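Why the two-step and single-step calls agree can be seen with a toy stateless transformer. The class below is hypothetical (`ToyStatelessTransformer` is not part of sktime) and only imitates the fit/transform/fit_transform convention: `fit` accepts `y` but learns nothing, so `fit_transform(X)` returns the same result as `fit(X, y).transform(X)`.

```python
class ToyStatelessTransformer:
    """Hypothetical stand-in for a transformer whose fit() learns nothing."""

    def fit(self, X, y=None):
        # y is accepted for API compatibility but deliberately ignored.
        return self

    def transform(self, X):
        # One (min, max) pair per series; a placeholder for real features.
        return [(min(series), max(series)) for series in X]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


X_toy = [[0.1, 0.5, 0.3], [2.0, 1.0, 4.0]]
y_toy = ["a", "b"]

two_step = ToyStatelessTransformer().fit(X_toy, y_toy).transform(X_toy)
single_step = ToyStatelessTransformer().fit_transform(X_toy)
print(two_step == single_step)  # True
```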
Multivariate#
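Catch22 also supports multivariate data. As the column names in the output below show, the 22 features are computed for each dimension and the results are joined column-wise, so the six dimensions of BasicMotions yield 6 × 22 = 132 columns.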
[6]:
c22_mv = Catch22()
transformed_data_mv = c22_mv.fit_transform(BM_X_train)
transformed_data_mv.head()
[6]:
| | dim_0__0 | dim_0__1 | dim_0__2 | dim_0__3 | dim_0__4 | dim_0__5 | dim_0__6 | dim_0__7 | dim_0__8 | dim_0__9 | ... | dim_5__12 | dim_5__13 | dim_5__14 | dim_5__15 | dim_5__16 | dim_5__17 | dim_5__18 | dim_5__19 | dim_5__20 | dim_5__21 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -0.140988 | -0.268073 | 6.0 | -0.890 | 0.160 | 2.0 | 3.0 | 0.042638 | 0.736311 | 0.314500 | ... | 2.0 | 0.707071 | 7.0 | 1.907929 | 1.00 | 0.658286 | 0.828571 | 0.228571 | 0.012550 | 9.0 |
1 | -0.387256 | -0.126246 | 6.0 | -0.920 | -0.600 | 2.0 | 4.0 | 0.269591 | 0.490874 | 0.614552 | ... | 2.0 | 0.727273 | 6.0 | 1.875354 | 0.50 | 0.206944 | 0.600000 | 0.257143 | 0.028935 | 9.0 |
2 | 0.028412 | -0.224988 | 9.0 | -0.335 | -0.045 | 1.0 | 3.0 | 0.036650 | 1.030835 | 0.352408 | ... | 2.0 | 0.818182 | 7.0 | 1.789838 | 0.75 | 0.791912 | 0.828571 | 0.228571 | 0.054977 | 11.0 |
3 | -0.147338 | -0.199523 | 8.0 | -0.540 | 0.180 | 1.0 | 5.0 | 0.013833 | 1.030835 | 0.212988 | ... | 2.0 | 0.717172 | 6.0 | 1.904917 | 1.00 | 1.191592 | 0.600000 | 0.171429 | 0.015611 | 9.0 |
4 | -0.217645 | -0.252015 | 7.0 | -0.130 | 0.020 | 1.0 | 6.0 | 0.008072 | 0.883573 | 0.150597 | ... | 2.0 | 0.707071 | 7.0 | 1.880930 | 1.00 | 3.141568 | 0.800000 | 0.200000 | 0.002449 | 10.0 |
5 rows × 132 columns
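The column layout above can be reproduced with a small pure-Python sketch. It assumes, based on the `dim_<d>__<k>` column names in the output, that features are computed per dimension and placed side by side; the three toy features and the helper name `per_dimension_features` are illustrative and are not sktime's implementation.

```python
from statistics import fmean

# Toy stand-ins for the 22 catch22 features (assumption: any function
# mapping one series to one number can play this role in the sketch).
TOY_FEATURES = [fmean, min, max]


def per_dimension_features(case):
    """Compute every toy feature for every dimension of one multivariate case.

    `case` is a list of dimensions, each a list of floats. The returned dict
    is keyed like the column names shown above: dim_<d>__<k>.
    """
    row = {}
    for d, series in enumerate(case):
        for k, feature in enumerate(TOY_FEATURES):
            row[f"dim_{d}__{k}"] = feature(series)
    return row


# One toy case with 6 dimensions, mirroring the 6 dimensions of BasicMotions.
case = [[float(d + t) for t in range(5)] for d in range(6)]
row = per_dimension_features(case)

print(len(row))       # 18, i.e. 6 dimensions x 3 toy features
print(list(row)[:4])  # ['dim_0__0', 'dim_0__1', 'dim_0__2', 'dim_1__0']
```

With 22 features per dimension instead of 3, the same layout gives the 132 columns reported above.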
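We can also control how the columns are named: for example, col_names="short_str_feat" puts the short feature names in the column names.
If the location and spread of the raw time series distribution may be important, set catch24=True to include the additional Mean and StandardDeviation values.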
[7]:
c24_mv = Catch22(col_names="short_str_feat", catch24=True)
c24_mv.fit(BM_X_train)
[7]:
Catch22(catch24=True, col_names='short_str_feat')
[8]:
c24_mv.transform(BM_X_train).head()
[8]:
| | dim_0__mode_5 | dim_0__mode_10 | dim_0__stretch_decreasing | dim_0__outlier_timing_pos | dim_0__outlier_timing_neg | dim_0__acf_timescale | dim_0__acf_first_min | dim_0__centroid_freq | dim_0__low_freq_power | dim_0__forecast_error | ... | dim_5__stretch_high | dim_5__rs_range | dim_5__whiten_timescale | dim_5__embedding_dist | dim_5__dfa | dim_5__rs_range | dim_5__transition_matrix | dim_5__periodicity | dim_5__mean | dim_5__std |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -0.140988 | -0.268073 | 6.0 | -0.890 | 0.160 | 2.0 | 3.0 | 0.042638 | 0.736311 | 0.314500 | ... | 7.0 | 1.907929 | 1.00 | 0.658286 | 0.828571 | 0.228571 | 0.012550 | 9.0 | 0.054413 | 0.510274 |
1 | -0.387256 | -0.126246 | 6.0 | -0.920 | -0.600 | 2.0 | 4.0 | 0.269591 | 0.490874 | 0.614552 | ... | 6.0 | 1.875354 | 0.50 | 0.206944 | 0.600000 | 0.257143 | 0.028935 | 9.0 | -0.102407 | 0.661172 |
2 | 0.028412 | -0.224988 | 9.0 | -0.335 | -0.045 | 1.0 | 3.0 | 0.036650 | 1.030835 | 0.352408 | ... | 7.0 | 1.789838 | 0.75 | 0.791912 | 0.828571 | 0.228571 | 0.054977 | 11.0 | 0.031881 | 0.499788 |
3 | -0.147338 | -0.199523 | 8.0 | -0.540 | 0.180 | 1.0 | 5.0 | 0.013833 | 1.030835 | 0.212988 | ... | 6.0 | 1.904917 | 1.00 | 1.191592 | 0.600000 | 0.171429 | 0.015611 | 9.0 | 0.029537 | 0.248161 |
4 | -0.217645 | -0.252015 | 7.0 | -0.130 | 0.020 | 1.0 | 6.0 | 0.008072 | 0.883573 | 0.150597 | ... | 7.0 | 1.880930 | 1.00 | 3.141568 | 0.800000 | 0.200000 | 0.002449 | 10.0 | 0.013344 | 0.163754 |
5 rows × 144 columns
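The following sketch illustrates why location and spread can matter. Two series with the same shape but a different offset and scale become indistinguishable once z-normalised, so any feature computed on the normalised values cannot separate them; only the mean and standard deviation do. The sketch assumes the two extra values are the plain mean and standard deviation of each raw series. Whether the library divides by n or n − 1 is not stated here, so the population form (`statistics.pstdev`) is used as an assumption.

```python
from statistics import fmean, pstdev


def z_normalise(series):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu, sigma = fmean(series), pstdev(series)
    return [(x - mu) / sigma for x in series]


a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [100.0 + 10.0 * x for x in a]  # same shape, shifted and stretched

# After z-normalisation the two series agree up to floating-point error.
same_shape = all(
    abs(u - v) < 1e-9 for u, v in zip(z_normalise(a), z_normalise(b))
)
print(same_shape)

# The raw mean and standard deviation are what tell them apart.
print(fmean(a), pstdev(a))
print(fmean(b), pstdev(b))
```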
4. catch22 Forest Classifier#
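For classification tasks, the default classifier used with the catch22 features is a random forest. For ease of use, Catch22Classifier is provided: it builds on the catch22 features and uses the RandomForestClassifier from sklearn.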
[9]:
c22f = Catch22Classifier(random_state=0)
c22f.fit(IPD_X_train, IPD_y_train)
[9]:
Catch22Classifier(random_state=0)
[10]:
c22f_preds = c22f.predict(IPD_X_test)
print("C22F Accuracy: " + str(metrics.accuracy_score(IPD_y_test, c22f_preds)))
C22F Accuracy: 0.86
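To read this score: accuracy is the fraction of test cases predicted correctly, so 0.86 on the 50 retained test cases corresponds to 43 correct predictions. Below is a minimal pure-Python sketch of the metric. The labels are made up for illustration and are not the notebook's actual predictions; only the counts (50 cases, 43 correct) follow from the output above.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels for illustration only: 50 cases, 7 of them misclassified.
y_true_toy = ["1"] * 25 + ["2"] * 25
y_pred_toy = ["2"] * 7 + ["1"] * 18 + ["2"] * 25

print(accuracy(y_true_toy, y_pred_toy))  # 0.86
```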