Shapelets and the Shapelet Transform with sktime¶
Introduced in , a shapelet is a time series subsequences that is identified as being representative of class membership. Shapelets are a powerful approach for measuring phase-independent similarity between time series; they can occur at any point within a series and offer interpretable results for how matches occur. The original research extracted shapelets to build a decision tree classifier.
The example below illustrates how leaf shape can be represented as a one-dimensional time series (blue line) to distinguish between two species.
The highlighted red subsection of the time series (i.e., “subsequences”) above is the shapelet that distinguishes Verbena urticifolia from Urtica dioica.
The Shapelet Transform¶
Much research emphasis has been placed on shapelet-based approaches for time series classification (TSC) since the original research was proposed. The current state-of-the-art for shapelets is the shapelet transform (ST) [3, 4]. The transform improves upon the original use of shapelets by separating shapelet extraction from the classification algorithm, allowing interpretable phase-independent classification of time series with any standard classification algorithm (such as random/rotation forest, neural networks, nearest neighbour classifications, ensembles of all, etc.). To facilitate this, rather than recursively assessing data for the best shapelet, the transform evaluates candidate shapelets in a single procedure to rank them based on information gain. Then, given a set of k shapelets, a time series can be transformed into k features by calculating the distance from the series to each shapelet. By transforming a dataset in this manner any vector-based classification algorithm can be applied to a shapelet-transformed time series problem while the interpretability of shapelets is maintained through the ranked list of the best shapelets during transformation.
Shapelets can provide interpretable results, as seen in the figure below:
The shapelet has “discovered” where the two plant species distinctly differ. Urtica dioica has a stem that connects to the leaf at almost 90 degrees, whereas the stem of Verbena urticifolia connects to the leaf at a wider angle.
Having found shapelet, its distance to the nearest matching subsequence in all objects in the database can be recorded. Finally, a simple decision tree classifier can be built to determine whether an object \(Q\) has a subsequence within a certain distance from shapelet \(I\).
 Ye, Lexiang, and Eamonn Keogh. “Time series shapelets: a novel technique that allows accurate, interpretable and fast classification.” Data mining and knowledge discovery 22, no. 1-2 (2011): 149-182.
 Ye, Lexiang, and Eamonn Keogh. “Time series shapelets: a new primitive for data mining.” In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 947-956. 2009.
 Lines, Jason, Luke M. Davis, Jon Hills, and Anthony Bagnall. “A shapelet transform for time series classification.” In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 289-297. ACM, 2012.
 Hills, Jon, Jason Lines, Edgaras Baranauskas, James Mapp, and Anthony Bagnall. “Classification of time series by shapelet transformation.” Data Mining and Knowledge Discovery 28, no. 4 (2014): 851-881.
 Bostrom, Aaron, and Anthony Bagnall. “Binary shapelet transform for multiclass time series classification.” In Transactions on Large-Scale Data-and Knowledge-Centered Systems XXXII, pp. 24-46. Springer, Berlin, Heidelberg, 2017.
Example: The Shapelet Transform in sktime¶
The following workbook demonstrates a full workflow of using the shapelet transform in
sktime with a
scikit-learn classifier with the OSU Leaf dataset, which consists of one dimensional outlines of six leaf classes: Acer Circinatum, Acer Glabrum, Acer Macrophyllum, Acer Negundo, Quercus Garryana and Quercus Kelloggii.
from sktime.datasets import load_osuleaf from sktime.transformations.panel.shapelets import ContractedShapeletTransform train_x, train_y = load_osuleaf(split="train", return_X_y=True) test_x, test_y = load_osuleaf(split="test", return_X_y=True)
# How long (in minutes) to extract shapelets for. # This is a simple lower-bound initially; # once time is up, no further shapelets will be assessed time_contract_in_mins = 1 # The initial number of shapelet candidates to assess per training series. # If all series are visited and time remains on the contract then another # pass of the data will occur initial_num_shapelets_per_case = 10 # Whether or not to print on-going information about shapelet extraction. # Useful for demo/debugging verbose = 2 st = ContractedShapeletTransform( time_contract_in_mins=time_contract_in_mins, num_candidates_to_sample_per_case=initial_num_shapelets_per_case, verbose=verbose, ) st.fit(train_x, train_y)
visiting series: 119 (#1) 1.4688408374786377 Candidate finished. 00:58 remaining 5.909873962402344 Candidate finished. 00:52 remaining Candidate finished. 00:48 remaining Candidate finished. 00:46 remaining Candidate finished. 00:42 remaining Candidate rejected. 00:-20 remaining Candidate finished. 00:36 remaining Candidate finished. 00:33 remaining Candidate rejected. 00:-28 remaining Candidate rejected. 00:-31 remaining visiting series: 160 (#2) Candidate finished. 00:25 remaining Candidate rejected. 00:-36 remaining Candidate rejected. 00:-39 remaining Candidate rejected. 00:-41 remaining Candidate finished. 00:14 remaining Candidate finished. 00:12 remaining Candidate finished. 00:09 remaining Candidate finished. 00:07 remaining Candidate rejected. 00:-54 remaining Candidate finished. 00:02 remaining visiting series: 12 (#3) No more time available! It's been 01:00 Stopping search
ContractedShapeletTransform(num_candidates_to_sample_per_case=10, time_contract_in_mins=1, verbose=2)
%matplotlib inline import matplotlib.pyplot as plt # for each extracted shapelet (in descending order of quality/information gain) for s in st.shapelets[0:5]: # summary info about the shapelet print(s) # plot the series that the shapelet was extracted from plt.plot(train_x.iloc[s.series_id, 0], "gray") # overlay the shapelet onto the full series plt.plot( list(range(s.start_pos, (s.start_pos + s.length))), train_x.iloc[s.series_id, 0][s.start_pos : s.start_pos + s.length], "r", linewidth=3.0, ) plt.show()
Series ID: 119, start_pos: 73, length: 167, info_gain: 0.3254739618843089,
Series ID: 12, start_pos: 342, length: 60, info_gain: 0.08997643577425246,
Series ID: 160, start_pos: 103, length: 188, info_gain: 0.07817874804764519,
# for each extracted shapelet (in descending order of quality/information gain) for i in range(0, min(len(st.shapelets), 5)): s = st.shapelets[i] # summary info about the shapelet print("#" + str(i) + ": " + str(s)) # overlay shapelets plt.plot( list(range(s.start_pos, (s.start_pos + s.length))), train_x.iloc[s.series_id, 0][s.start_pos : s.start_pos + s.length], ) plt.show()
#0: Series ID: 119, start_pos: 73, length: 167, info_gain: 0.3254739618843089, #1: Series ID: 12, start_pos: 342, length: 60, info_gain: 0.08997643577425246, #2: Series ID: 160, start_pos: 103, length: 188, info_gain: 0.07817874804764519,
import time from sklearn.ensemble import RandomForestClassifier from sklearn.pipeline import Pipeline from sktime.datasets import load_osuleaf train_x, train_y = load_osuleaf(split="train", return_X_y=True) test_x, test_y = load_osuleaf(split="test", return_X_y=True) # example pipeline with 1 minute time limit pipeline = Pipeline( [ ( "st", ContractedShapeletTransform( time_contract_in_mins=time_contract_in_mins, num_candidates_to_sample_per_case=10, verbose=False, ), ), ("rf", RandomForestClassifier(n_estimators=100)), ] ) start = time.time() pipeline.fit(train_x, train_y) end_build = time.time() preds = pipeline.predict(test_x) end_test = time.time() print("Results:") print("Correct:") correct = sum(preds == test_y) print("\t" + str(correct) + "/" + str(len(test_y))) print("\t" + str(correct / len(test_y))) print("\nTiming:") print("\tTo build: " + str(end_build - start) + " secs") print("\tTo predict: " + str(end_test - end_build) + " secs")
Results: Correct: 137/242 0.5661157024793388 Timing: To build: 72.31805896759033 secs To predict: 14.819790124893188 secs