Shapelets and the Shapelet Transform with sktime¶

Introduced in [1], a shapelet is a time series subsequences that is identified as being representative of class membership. Shapelets are a powerful approach for measuring phase-independent similarity between time series; they can occur at any point within a series and offer interpretable results for how matches occur. The original research extracted shapelets to build a decision tree classifier.

The example below illustrates how leaf shape can be represented as a one-dimensional time series (blue line) to distinguish between two species.[2]

The highlighted red subsection of the time series (i.e., “subsequences”) above is the shapelet that distinguishes Verbena urticifolia from Urtica dioica.

The Shapelet Transform¶

Much research emphasis has been placed on shapelet-based approaches for time series classification (TSC) since the original research was proposed. The current state-of-the-art for shapelets is the shapelet transform (ST) [3, 4]. The transform improves upon the original use of shapelets by separating shapelet extraction from the classification algorithm, allowing interpretable phase-independent classification of time series with any standard classification algorithm (such as random/rotation forest, neural networks, nearest neighbour classifications, ensembles of all, etc.). To facilitate this, rather than recursively assessing data for the best shapelet, the transform evaluates candidate shapelets in a single procedure to rank them based on information gain. Then, given a set of k shapelets, a time series can be transformed into k features by calculating the distance from the series to each shapelet. By transforming a dataset in this manner any vector-based classification algorithm can be applied to a shapelet-transformed time series problem while the interpretability of shapelets is maintained through the ranked list of the best shapelets during transformation.

Shapelets can provide interpretable results, as seen in the figure below:

The shapelet has “discovered” where the two plant species distinctly differ. Urtica dioica has a stem that connects to the leaf at almost 90 degrees, whereas the stem of Verbena urticifolia connects to the leaf at a wider angle.

Having found shapelet, its distance to the nearest matching subsequence in all objects in the database can be recorded. Finally, a simple decision tree classifier can be built to determine whether an object \(Q\) has a subsequence within a certain distance from shapelet \(I\).

References¶

[1] Ye, Lexiang, and Eamonn Keogh. “Time series shapelets: a novel technique that allows accurate, interpretable and fast classification.” Data mining and knowledge discovery 22, no. 1-2 (2011): 149-182.

[2] Ye, Lexiang, and Eamonn Keogh. “Time series shapelets: a new primitive for data mining.” In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 947-956. 2009.

[3] Lines, Jason, Luke M. Davis, Jon Hills, and Anthony Bagnall. “A shapelet transform for time series classification.” In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 289-297. ACM, 2012.

[4] Hills, Jon, Jason Lines, Edgaras Baranauskas, James Mapp, and Anthony Bagnall. “Classification of time series by shapelet transformation.” Data Mining and Knowledge Discovery 28, no. 4 (2014): 851-881.

[5] Bostrom, Aaron, and Anthony Bagnall. “Binary shapelet transform for multiclass time series classification.” In Transactions on Large-Scale Data-and Knowledge-Centered Systems XXXII, pp. 24-46. Springer, Berlin, Heidelberg, 2017.

Example: The Shapelet Transform in sktime¶

The following workbook demonstrates a full workflow of using the shapelet transform in sktime with a scikit-learn classifier with the OSU Leaf dataset, which consists of one dimensional outlines of six leaf classes: Acer Circinatum, Acer Glabrum, Acer Macrophyllum, Acer Negundo, Quercus Garryana and Quercus Kelloggii.

 [1]:

 from sktime.datasets import load_osuleaf
from sktime.transformations.panel.shapelets import ContractedShapeletTransform
train_x, train_y = load_osuleaf(split="train", return_X_y=True)
test_x, test_y = load_osuleaf(split="test", return_X_y=True)

 [2]:

 # How long (in minutes) to extract shapelets for.
# This is a simple lower-bound initially;
# once time is up, no further shapelets will be assessed
time_contract_in_mins = 1
# The initial number of shapelet candidates to assess per training series.
# If all series are visited and time remains on the contract then another
# pass of the data will occur
initial_num_shapelets_per_case = 10
# Whether or not to print on-going information about shapelet extraction.
# Useful for demo/debugging
verbose = 2
st = ContractedShapeletTransform(
    time_contract_in_mins=time_contract_in_mins,
    num_candidates_to_sample_per_case=initial_num_shapelets_per_case,
    verbose=verbose,
)
st.fit(train_x, train_y)

visiting series: 119 (#1)
1.4688408374786377
Candidate finished. 00:58 remaining
5.909873962402344
Candidate finished. 00:52 remaining
Candidate finished. 00:48 remaining
Candidate finished. 00:46 remaining
Candidate finished. 00:42 remaining
Candidate rejected. 00:-20 remaining
Candidate finished. 00:36 remaining
Candidate finished. 00:33 remaining
Candidate rejected. 00:-28 remaining
Candidate rejected. 00:-31 remaining
visiting series: 160 (#2)
Candidate finished. 00:25 remaining
Candidate rejected. 00:-36 remaining
Candidate rejected. 00:-39 remaining
Candidate rejected. 00:-41 remaining
Candidate finished. 00:14 remaining
Candidate finished. 00:12 remaining
Candidate finished. 00:09 remaining
Candidate finished. 00:07 remaining
Candidate rejected. 00:-54 remaining
Candidate finished. 00:02 remaining
visiting series: 12 (#3)
No more time available! It's been 01:00
Stopping search

 [2]:

ContractedShapeletTransform(num_candidates_to_sample_per_case=10,
                            time_contract_in_mins=1, verbose=2)

 [3]:

 %matplotlib inline
import matplotlib.pyplot as plt
# for each extracted shapelet (in descending order of quality/information gain)
for s in st.shapelets[0:5]:
    # summary info about the shapelet
    print(s)
    # plot the series that the shapelet was extracted from
    plt.plot(train_x.iloc[s.series_id, 0], "gray")
    # overlay the shapelet onto the full series
    plt.plot(
        list(range(s.start_pos, (s.start_pos + s.length))),
        train_x.iloc[s.series_id, 0][s.start_pos : s.start_pos + s.length],
        "r",
        linewidth=3.0,
    )
    plt.show()

Series ID: 119, start_pos: 73, length: 167, info_gain: 0.3254739618843089,

Series ID: 12, start_pos: 342, length: 60, info_gain: 0.08997643577425246,

Series ID: 160, start_pos: 103, length: 188, info_gain: 0.07817874804764519,

 [4]:

 # for each extracted shapelet (in descending order of quality/information gain)
for i in range(0, min(len(st.shapelets), 5)):
    s = st.shapelets[i]
    # summary info about the shapelet
    print("#" + str(i) + ": " + str(s))
    # overlay shapelets
    plt.plot(
        list(range(s.start_pos, (s.start_pos + s.length))),
        train_x.iloc[s.series_id, 0][s.start_pos : s.start_pos + s.length],
    )
plt.show()

#0: Series ID: 119, start_pos: 73, length: 167, info_gain: 0.3254739618843089,
#1: Series ID: 12, start_pos: 342, length: 60, info_gain: 0.08997643577425246,
#2: Series ID: 160, start_pos: 103, length: 188, info_gain: 0.07817874804764519,

 [5]:

 import time
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sktime.datasets import load_osuleaf
train_x, train_y = load_osuleaf(split="train", return_X_y=True)
test_x, test_y = load_osuleaf(split="test", return_X_y=True)
# example pipeline with 1 minute time limit
pipeline = Pipeline(
    [
        (
            "st",
            ContractedShapeletTransform(
                time_contract_in_mins=time_contract_in_mins,
                num_candidates_to_sample_per_case=10,
                verbose=False,
            ),
        ),
        ("rf", RandomForestClassifier(n_estimators=100)),
    ]
)
start = time.time()
pipeline.fit(train_x, train_y)
end_build = time.time()
preds = pipeline.predict(test_x)
end_test = time.time()
print("Results:")
print("Correct:")
correct = sum(preds == test_y)
print("\t" + str(correct) + "/" + str(len(test_y)))
print("\t" + str(correct / len(test_y)))
print("\nTiming:")
print("\tTo build:   " + str(end_build - start) + " secs")
print("\tTo predict: " + str(end_test - end_build) + " secs")

Results:
Correct:
        137/242
        0.5661157024793388
Timing:
        To build:   72.31805896759033 secs
        To predict: 14.819790124893188 secs

Generated using nbsphinx. The Jupyter notebook can be found here.