
Time series interpolation with sktime

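Suppose we have a set of time series of different lengths, i.e. with different numbers of time points. Currently, most of sktime's functionality requires equal-length time series, so the data must first be converted to equal length. In this tutorial, you will learn how to use the TSInterpolator to do so.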

[1]:
import random

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

from sktime.classification.interval_based import TimeSeriesForestClassifier
from sktime.datasets import load_basic_motions
from sktime.transformations.panel.compose import ColumnConcatenator

Ordinary situation

This is the ordinary situation: all time series have the same length. We load an example dataset from sktime and train a classifier.

[2]:
X, y = load_basic_motions()
X_train, X_test, y_train, y_test = train_test_split(X, y)

steps = [
    ("concatenate", ColumnConcatenator()),
    ("classify", TimeSeriesForestClassifier(n_estimators=100)),
]
clf = Pipeline(steps)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
[2]:
1.0

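If time series have unequal lengths, sktime's algorithms may raise an error

Now we corrupt the dataset slightly by randomly cutting the time series, which leaves them with unequal lengths. As a result, we get an error when we try to train a classifier.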

[3]:
def random_cut(df):
    """Randomly cut the data series in-place."""
    for row_i in range(df.shape[0]):
        for dim_i in range(df.shape[1]):
            ts = df.iat[row_i, dim_i]
            # drop the last 3 to 5 time points of each series
            n_keep = random.randint(len(ts) - 5, len(ts) - 3)  # noqa: S311
            # assign through one indexer; chained df.iloc[...][...] may write to a copy
            df.iat[row_i, dim_i] = pd.Series(ts.tolist()[:n_keep])


X, y = load_basic_motions()
X_train, X_test, y_train, y_test = train_test_split(X, y)

for df in [X_train, X_test]:
    random_cut(df)

try:
    steps = [
        ("concatenate", ColumnConcatenator()),
        ("classify", TimeSeriesForestClassifier(n_estimators=100)),
    ]
    clf = Pipeline(steps)
    clf.fit(X_train, y_train)
    clf.score(X_test, y_test)
except ValueError as e:
    print(f"ValueError: {e}")
ValueError: Tabularization failed, it's possible that not all series were of equal length
/Users/mloning/.conda/envs/sktime-dev/lib/python3.7/site-packages/numpy/core/_asarray.py:136: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  return array(a, dtype, copy=False, order=order, subok=True)
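
The error message points at tabularization, i.e. packing all series into one rectangular 2D array. The following is a minimal sketch of why that cannot work for ragged data. It uses plain NumPy rather than sktime's internal code, and the toy `series` list is made up for illustration.

```python
import numpy as np

# Three toy series with different numbers of time points (illustrative data).
series = [np.arange(100.0), np.arange(97.0), np.arange(95.0)]

# Collect the distinct lengths; more than one value means the panel is ragged.
lengths = {len(s) for s in series}

# Stacking into a 2D (n_series, n_timepoints) array needs equal lengths.
try:
    np.stack(series)
    stacked_ok = True
except ValueError:
    stacked_ok = False

print(sorted(lengths), stacked_ok)
```

Checking the set of lengths before fitting is a cheap way to detect this situation early.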

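Now the interpolator enters

Now we use the interpolator to resize time series of different lengths to a user-defined length. Internally, it uses linear interpolation from scipy and draws equidistant samples at the user-defined number of points.

After interpolating the data, the classifier works again.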

[4]:
from sktime.transformations.panel.interpolate import TSInterpolator

X, y = load_basic_motions()
X_train, X_test, y_train, y_test = train_test_split(X, y)

for df in [X_train, X_test]:
    random_cut(df)

steps = [
    ("transform", TSInterpolator(50)),
    ("concatenate", ColumnConcatenator()),
    ("classify", TimeSeriesForestClassifier(n_estimators=100)),
]
clf = Pipeline(steps)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
[4]:
1.0
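
For intuition, the resampling idea described above (linear interpolation evaluated at equidistant points) can be sketched for a single series. This is only an illustration under stated assumptions, not the TSInterpolator implementation: it swaps in `numpy.interp` for scipy, the helper name `resample_linear` is hypothetical, and mapping the time axis onto [0, 1] is an assumption made for the example.

```python
import numpy as np


def resample_linear(values, length):
    """Linearly interpolate a 1D series onto `length` equidistant points.

    Hypothetical helper for illustration; not part of sktime.
    """
    values = np.asarray(values, dtype=float)
    # Assumed convention: original samples are evenly spaced on [0, 1].
    old_positions = np.linspace(0.0, 1.0, num=len(values))
    # Equidistant positions at which the new samples are drawn.
    new_positions = np.linspace(0.0, 1.0, num=length)
    return np.interp(new_positions, old_positions, values)


# Ragged toy data: three series with 95, 97 and 96 time points.
ragged = [np.arange(95.0), np.arange(97.0), np.arange(96.0)]

# After resampling, every series has 50 points and stacking succeeds.
resampled = np.stack([resample_linear(s, 50) for s in ragged])
print(resampled.shape)
```

Because every series ends up with the same number of points, the result is a rectangular array that downstream estimators can consume. The first and last values of each series are preserved, since both grids share their endpoints.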
