Benchmarking with sktime#

The benchmarking module lets you orchestrate benchmarking experiments that compare the performance of one or more algorithms over one or more data sets. It also provides a number of statistical tests to check whether observed performance differences are statistically significant.

The benchmarking module is based on mlaut.

Preliminaries#

 [1]:

# import required functions and classes
import os
import warnings
from sklearn.metrics import accuracy_score
from sktime.benchmarking.data import UEADataset, make_datasets
from sktime.benchmarking.evaluation import Evaluator
from sktime.benchmarking.metrics import PairwiseMetric
from sktime.benchmarking.orchestration import Orchestrator
from sktime.benchmarking.results import HDDResults
from sktime.benchmarking.strategies import TSCStrategy
from sktime.benchmarking.tasks import TSCTask
from sktime.classification.interval_based import (
    RandomIntervalSpectralEnsemble,
    TimeSeriesForestClassifier,
)
from sktime.series_as_features.model_selection import PresplitFilesCV
# hide warnings
warnings.filterwarnings("ignore")

Set up paths#

 [2]:

# set up paths to data and results folder
import sktime
DATA_PATH = os.path.join(os.path.dirname(sktime.__file__), "datasets/data")
RESULTS_PATH = "results"

Create pointers to datasets on hard drive#

Here we use UEADataset, which follows the UEA/UCR format, together with some of the time series classification datasets included in sktime.

 [3]:

# Create individual pointers to datasets on the disk
datasets = [
    UEADataset(path=DATA_PATH, name="ArrowHead"),
    UEADataset(path=DATA_PATH, name="ItalyPowerDemand"),
]

 [4]:

# Alternatively, we can use a helper function to create them automatically
datasets = make_datasets(
    path=DATA_PATH, dataset_cls=UEADataset, names=["ArrowHead", "ItalyPowerDemand"]
)
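Both cells assume that the data files already exist on disk. As a rough sanity check before benchmarking, the sketch below lists any expected files that are missing. The layout `<path>/<name>/<name>_TRAIN.ts` and `<path>/<name>/<name>_TEST.ts` is an assumption about the UEA/UCR convention, and `missing_uea_files` is a hypothetical helper, not part of sktime.

```python
import os


def missing_uea_files(data_path, names, suffixes=("_TRAIN.ts", "_TEST.ts")):
    """Return the expected dataset files that do not exist on disk.

    Assumes the layout <data_path>/<name>/<name>_TRAIN.ts and
    <data_path>/<name>/<name>_TEST.ts (an assumption, not an sktime guarantee).
    """
    missing = []
    for name in names:
        for suffix in suffixes:
            file_path = os.path.join(data_path, name, name + suffix)
            if not os.path.isfile(file_path):
                missing.append(file_path)
    return missing
```

Under the assumed layout, `missing_uea_files(DATA_PATH, ["ArrowHead", "ItalyPowerDemand"])` returns an empty list when every expected file is present.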

For each dataset, we also need to specify a learning task#

The learning task encapsulates all the information and instructions that define the problem we’re trying to solve. In our case, we’re trying to solve classification tasks, and the key information we need is the name of the target variable that we’re trying to predict in each data set. Here all tasks are the same because the target variable has the same name in all data sets.

 [5]:

tasks = [TSCTask(target="target") for _ in range(len(datasets))]

Specify learning strategies#

Having set up the data sets and corresponding learning tasks, we need to define the algorithms we want to evaluate and compare.

 [6]:

# Specify learning strategies
strategies = [
    TSCStrategy(TimeSeriesForestClassifier(n_estimators=10), name="tsf"),
    TSCStrategy(RandomIntervalSpectralEnsemble(n_estimators=10), name="rise"),
]

Set up a results object#

The results object encapsulates where and how benchmarking results are stored. Here we choose to write them to the hard drive.

 [7]:

# Specify results object which manages the output of the benchmarking
results = HDDResults(path=RESULTS_PATH)

Run benchmarking#

Finally, we pass all specifications to the orchestrator. The orchestrator will automatically train and evaluate all algorithms on all data sets and write out the results.

 [8]:

# run orchestrator
orchestrator = Orchestrator(
    datasets=datasets,
    tasks=tasks,
    strategies=strategies,
    cv=PresplitFilesCV(),
    results=results,
)
orchestrator.fit_predict(save_fitted_strategies=False, overwrite_predictions=True)
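Conceptually, the orchestration step amounts to a loop over every combination of data set and strategy. The sketch below illustrates that idea in plain Python with a toy classifier; `MajorityClassifier`, `run_benchmark`, and the toy data are hypothetical and are not how sktime implements the Orchestrator.

```python
from collections import Counter


class MajorityClassifier:
    """Toy stand-in for a strategy: always predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def run_benchmark(datasets, strategies):
    """Fit every strategy on every data set's train split and predict its test split.

    datasets: dict mapping name -> (X_train, y_train, X_test, y_test)
    strategies: dict mapping name -> zero-argument factory returning a fresh model
    Returns a dict keyed by (strategy_name, dataset_name) -> (y_true, y_pred).
    """
    predictions = {}
    for dataset_name, (X_train, y_train, X_test, y_test) in datasets.items():
        for strategy_name, make_model in strategies.items():
            # A fresh model per data set, so nothing leaks between experiments.
            model = make_model().fit(X_train, y_train)
            predictions[(strategy_name, dataset_name)] = (y_test, model.predict(X_test))
    return predictions


# Hypothetical toy data: three training series and two test series.
toy = {"toy": ([[0, 1], [1, 2], [5, 6]], ["a", "a", "b"], [[0, 2], [6, 7]], ["a", "b"])}
out = run_benchmark(toy, {"majority": MajorityClassifier})
```

Storing the true and predicted labels per (strategy, data set) pair, rather than a score, keeps the choice of metric open until the evaluation step.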

Evaluate and compare results#

Having run the orchestrator, we can evaluate and compare the prediction strategies.

 [9]:

evaluator = Evaluator(results)
metric = PairwiseMetric(func=accuracy_score, name="accuracy")
metrics_by_strategy = evaluator.evaluate(metric=metric)
metrics_by_strategy.head()

 [9]:

	strategy	accuracy_mean	accuracy_stderr
0	rise	0.850126	0.019921
1	tsf	0.829563	0.020512
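To give some intuition for a mean score reported with a standard error, the sketch below treats each prediction as a 0/1 outcome and computes the accuracy together with the standard error of that mean. This is one common convention and an assumption made for illustration; it is not necessarily how the Evaluator computes `accuracy_stderr`, and `accuracy_with_stderr` is a hypothetical helper.

```python
import math


def accuracy_with_stderr(y_true, y_pred):
    """Return (accuracy, standard error), treating each prediction as a 0/1 outcome."""
    if len(y_true) != len(y_pred) or len(y_true) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    hits = [1.0 if t == p else 0.0 for t, p in zip(y_true, y_pred)]
    n = len(hits)
    mean = sum(hits) / n
    # Sample variance (ddof=1), then the standard error of the mean.
    variance = sum((h - mean) ** 2 for h in hits) / (n - 1)
    return mean, math.sqrt(variance / n)
```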

The evaluator offers a number of additional methods for evaluating and comparing strategies, including statistical hypothesis tests and visualisation tools, for example:

 [10]:

evaluator.rank()

 [10]:

	strategy	accuracy_mean_rank
0	rise	1.5
1	tsf	1.5
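A mean rank is typically obtained by ranking the strategies within each data set and then averaging those ranks across data sets. The sketch below shows that calculation with hypothetical accuracies; `average_ranks` is an illustrative helper, not the Evaluator's implementation, and it assumes every strategy has a score on every data set. With two strategies, a mean rank of 1.5 for both can arise, for instance, when each strategy wins on one of two data sets.

```python
def average_ranks(scores_by_dataset):
    """Rank strategies within each data set, then average ranks across data sets.

    Rank 1 is best (highest score); tied strategies share the mean of their ranks.
    scores_by_dataset: dict mapping dataset -> dict mapping strategy -> score.
    """
    totals = {}
    for scores in scores_by_dataset.values():
        ordered = sorted(scores, key=scores.get, reverse=True)
        i = 0
        while i < len(ordered):
            # Find the block of strategies tied with ordered[i].
            j = i
            while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
                j += 1
            shared_rank = (i + 1 + j + 1) / 2  # mean of positions i+1 .. j+1
            for name in ordered[i : j + 1]:
                totals[name] = totals.get(name, 0.0) + shared_rank
            i = j + 1
    n = len(scores_by_dataset)
    return {name: total / n for name, total in totals.items()}


# Hypothetical accuracies, chosen so that each strategy wins on one data set.
example = {
    "dataset_1": {"rise": 0.80, "tsf": 0.75},
    "dataset_2": {"rise": 0.90, "tsf": 0.95},
}
mean_ranks = average_ranks(example)
```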

Currently, the following functions are implemented:

evaluator.plot_boxplots()
evaluator.rank()
evaluator.t_test()
evaluator.sign_test()
evaluator.ranksum_test()
evaluator.t_test_with_bonferroni_correction()
evaluator.wilcoxon_test()
evaluator.friedman_test()
evaluator.nemenyi()
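To illustrate the kind of question these tests answer, the sketch below implements a two-sided exact sign test on paired per-data-set scores using only the standard library. It is a simplified illustration under the assumption that ties are dropped; `sign_test_p_value` is a hypothetical helper and does not reproduce `evaluator.sign_test()`.

```python
from math import comb


def sign_test_p_value(scores_a, scores_b):
    """Two-sided exact sign test on paired scores; ties are dropped."""
    wins_a = sum(a > b for a, b in zip(scores_a, scores_b))
    wins_b = sum(b > a for a, b in zip(scores_a, scores_b))
    n = wins_a + wins_b
    if n == 0:
        return 1.0  # only ties: no evidence of a difference
    k = min(wins_a, wins_b)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for the two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

With only two data sets, as in this example, such a test cannot reach conventional significance levels, so meaningful comparisons need results on many more data sets.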
