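Time series interpolating with sktime#
Suppose we have a set of time series with different lengths, i.e. different numbers of time points. Currently, most of sktime's functionality requires equal-length time series, so to use sktime we first need to convert our data into equal-length time series. In this tutorial, you will learn how to use the TSInterpolator to do so.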
[1]:
import random
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sktime.classification.interval_based import TimeSeriesForestClassifier
from sktime.datasets import load_basic_motions
from sktime.transformations.panel.compose import ColumnConcatenator
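Ordinary situation#
Here is a normal situation, in which all time series have the same length. We load an example data set from sktime and train a classifier.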
[2]:
X, y = load_basic_motions()
X_train, X_test, y_train, y_test = train_test_split(X, y)
steps = [
    ("concatenate", ColumnConcatenator()),
    ("classify", TimeSeriesForestClassifier(n_estimators=100)),
]
clf = Pipeline(steps)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
[2]:
1.0
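If time series are unequal length, sktime's algorithm may raise an error#
Now we are going to spoil the data set a little bit by randomly cutting the time series. This leads to unequal-length time series. As a result, we get an error when we try to train a classifier.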
[3]:
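def random_cut(df):
    """Randomly cut the data series in-place."""
    for row_i in range(df.shape[0]):
        for dim_i in range(df.shape[1]):
            ts = df.iloc[row_i][f"dim_{dim_i}"]
            # Drop the last 3 to 5 points so that series end up with different lengths.
            # Note: this chained assignment only modifies df when df.iloc[row_i]
            # returns a view, which newer pandas versions may not guarantee.
            df.iloc[row_i][f"dim_{dim_i}"] = pd.Series(
                ts.tolist()[: random.randint(len(ts) - 5, len(ts) - 3)]  # noqa: S311
            )  # here is a problem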
X, y = load_basic_motions()
X_train, X_test, y_train, y_test = train_test_split(X, y)
for df in [X_train, X_test]:
    random_cut(df)
try:
    steps = [
        ("concatenate", ColumnConcatenator()),
        ("classify", TimeSeriesForestClassifier(n_estimators=100)),
    ]
    clf = Pipeline(steps)
    clf.fit(X_train, y_train)
    clf.score(X_test, y_test)
except ValueError as e:
    print(f"ValueError: {e}")
ValueError: Tabularization failed, it's possible that not all series were of equal length
/Users/mloning/.conda/envs/sktime-dev/lib/python3.7/site-packages/numpy/core/_asarray.py:136: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
return array(a, dtype, copy=False, order=order, subok=True)
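To confirm that unequal lengths are what triggers the error, we can compare the series lengths before fitting. The helpers below are a minimal sketch and not part of sktime: `length_summary` and `is_equal_length` are hypothetical names, and they assume only that each cell supports `len()`, as the `pd.Series` cells of a nested DataFrame column do.

```python
def length_summary(cells):
    """Return (shortest, longest) length over an iterable of sized series."""
    lengths = [len(cell) for cell in cells]
    if not lengths:
        raise ValueError("no series given")
    return min(lengths), max(lengths)


def is_equal_length(cells):
    """Return True if every series in `cells` has the same length."""
    shortest, longest = length_summary(cells)
    return shortest == longest


# Toy stand-in for one column of a nested DataFrame (hypothetical data).
toy_column = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7], [0.8, 0.9, 1.0, 1.1]]
print(length_summary(toy_column))   # (3, 4)
print(is_equal_length(toy_column))  # False
```

With the data from this tutorial, a call such as `is_equal_length(X_train["dim_0"])` should work the same way, assuming the column holds one series per row.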
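Now the interpolator enters#
Now we use our interpolator to resize time series of different lengths to a user-defined length. Internally, it uses linear interpolation from scipy and draws equidistant samples on the user-defined number of points.
After interpolating the data, the classifier works again.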
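To make the resizing step concrete, here is a minimal sketch of linear resampling to a fixed number of equidistant points. It is written in plain Python so the arithmetic is visible; it is not the TSInterpolator implementation, which relies on scipy. The function name `resample_linear` and the handling of single-point input are our own assumptions.

```python
def resample_linear(values, length):
    """Resample `values` to `length` equidistant points by linear interpolation."""
    n = len(values)
    if n == 0 or length < 1:
        raise ValueError("need at least one input value and length >= 1")
    if n == 1:
        # Assumption: a single point is extended as a constant series.
        return [float(values[0])] * length
    if length == 1:
        return [float(values[0])]

    resampled = []
    for j in range(length):
        # Position of the j-th output sample on the original index axis.
        pos = j * (n - 1) / (length - 1)
        lo = int(pos)            # index of the left neighbour
        hi = min(lo + 1, n - 1)  # index of the right neighbour
        frac = pos - lo          # relative distance from the left neighbour
        resampled.append(values[lo] + frac * (values[hi] - values[lo]))
    return resampled


print(resample_linear([0.0, 10.0], 5))  # [0.0, 2.5, 5.0, 7.5, 10.0]
```

Series of any input length come out with exactly `length` points, which is why the classifier can tabularize the data again after this step.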
[4]:
from sktime.transformations.panel.interpolate import TSInterpolator
X, y = load_basic_motions()
X_train, X_test, y_train, y_test = train_test_split(X, y)
for df in [X_train, X_test]:
    random_cut(df)
steps = [
    ("transform", TSInterpolator(50)),
    ("concatenate", ColumnConcatenator()),
    ("classify", TimeSeriesForestClassifier(n_estimators=100)),
]
clf = Pipeline(steps)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
[4]:
1.0