
Loading and working with data in sktime

Python provides a variety of useful ways to represent data, but NumPy arrays and pandas DataFrames are commonly used for data analysis. When using NumPy 2d-arrays or pandas DataFrames to analyze tabular data, the rows are commonly used to represent each instance (e.g. case or observation) of the data, while the columns are used to represent a given feature (e.g. variable or dimension) for an observation. Since timeseries data also has a time dimension for a given instance and feature, several alternative data formats could be used to represent this data, including nested pandas DataFrame structures, NumPy 3d-arrays, or multi-indexed pandas DataFrames.

Sktime is designed to work with timeseries data stored as nested pandas DataFrame objects. Similar to working with pandas DataFrames with tabular data, this allows instances to be represented by rows and the feature data for each dimension of a problem (e.g. variables or features) to be stored in the DataFrame columns. To accomplish this, the timepoints for each instance-feature combination are stored in a single cell in the input pandas DataFrame (see Sktime pandas DataFrame format for more details).

Users can load or convert data into sktime’s format in a variety of ways. Data can be loaded directly from a bespoke sktime file format (.ts) (see Representing data with .ts files) or supported file formats provided by other existing data sources (such as Weka ARFF and .tsv). Sktime also provides functions to convert data between sktime’s nested pandas DataFrame format and several other common ways of representing timeseries data using NumPy arrays or pandas DataFrames (see Converting between sktime and alternative timeseries formats).

The rest of this sktime tutorial will provide a more detailed description of the sktime pandas DataFrame format, a brief description of the .ts file format, how to load data from other supported formats, and how to convert between other common ways of representing timeseries data in NumPy arrays or pandas DataFrames.

## Sktime pandas DataFrame format

The core data structure for storing datasets in sktime is a nested pandas DataFrame, where rows of the DataFrame correspond to instances (cases or observations), and columns correspond to dimensions of the problem (features or variables). The multiple timepoints and their corresponding values for each instance-feature pair are stored as a pandas Series object nested within the applicable DataFrame cell.

For example, for a problem with n cases that each have data across c timeseries dimensions:

DataFrame:
index |   dim_0   |   dim_1   |    ...    |  dim_c-1
   0  | pd.Series | pd.Series | pd.Series | pd.Series
   1  | pd.Series | pd.Series | pd.Series | pd.Series
  ... |    ...    |    ...    |    ...    |    ...
  n-1 | pd.Series | pd.Series | pd.Series | pd.Series

Representing timeseries data in this way makes it easy to align the timeseries features for a given instance with non-timeseries information. For example, in a classification problem, it is easy to align the timeseries features for an observation with its (index-aligned) target class label:

index | class_val
  0   |   int
  1   |   int
 ...  |   ...
 n-1  |   int
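As a minimal sketch of this layout, the following builds a small nested DataFrame with plain pandas (not with any sktime helper), together with an index-aligned label Series. The sizes, the random values and the names n_cases, n_dims and n_timepoints are illustrative assumptions.

```python
import numpy as np
import pandas as pd

n_cases, n_dims, n_timepoints = 3, 2, 4  # illustrative sizes
rng = np.random.default_rng(0)

# Each cell holds one pd.Series: the timepoints for one instance-dimension pair.
X = pd.DataFrame(
    {
        f"dim_{d}": [
            pd.Series(rng.standard_normal(n_timepoints)) for _ in range(n_cases)
        ]
        for d in range(n_dims)
    }
)

# Non-timeseries information (here a class label) shares the instance index.
y = pd.Series([0, 1, 0], index=X.index, name="class_val")

print(X.shape)
print(type(X.iloc[0, 0]))
```

The outer DataFrame has one row per case and one column per dimension, while the length of each nested Series carries the time dimension.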

While sktime’s format uses pandas Series objects in its nested DataFrame structure, other data structures like NumPy arrays could be used to hold the timeseries values in each cell. However, the use of pandas Series objects facilitates simple storage of sparse data and makes it easy to accommodate series with non-integer timestamps (such as dates).
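To illustrate the point about sparse data and non-integer timestamps, here is a small sketch of a single cell value, using plain pandas. The dates and readings are made up for illustration.

```python
import pandas as pd

# Hypothetical readings observed on only three irregularly spaced dates.
dates = pd.to_datetime(["2021-01-01", "2021-01-03", "2021-01-10"])
cell = pd.Series([2.0, 5.0, 4.0], index=dates)

# Only the observed timepoints are stored; lookups use the timestamp itself.
print(len(cell))
print(cell.loc[pd.Timestamp("2021-01-03")])
```

A plain NumPy array in the same cell would need a separate structure to record which dates the three values belong to.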

## The .ts file format

One common use case is to load locally stored data. To make this easy, the .ts file format has been created for representing problems in a standard format for use with sktime.

Representing data with .ts files

A .ts file includes two main parts:

  • header information

  • data

The header information is used to facilitate simple representation of the data through including metadata about the structure of the problem. The header contains the following:

@problemName <problem name>
@timeStamps <true/false>
@univariate <true/false>
@classLabel <true/false> <space delimited list of possible class values>
@data

The data for the problem should begin after the @data tag. In the simplest case where @timeStamps is false, values for a series are expressed in a comma-separated list and the index of each value is relative to its position in the list (0, 1, …, m). An instance may contain one or more dimensions, where instances are line-delimited and dimensions within an instance are colon (:) delimited. For example:

2,3,2,4:4,3,2,2
13,12,32,12:22,23,12,32
4,4,5,4:3,2,3,2

This example data has 3 instances, corresponding to the three lines shown above. Each instance has 2 dimensions with 4 observations per dimension. For example, the initial instance’s first dimension has the timepoint values of 2, 3, 2, 4 and the second dimension has the values 4, 3, 2, 2.
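The delimiter rules can be sketched in a few lines of plain Python. The helper parse_ts_data_line is hypothetical and is not sktime's parser; it assumes @timeStamps is false, no class labels and no missing values.

```python
def parse_ts_data_line(line):
    """Split one .ts data line into a list of dimensions, each a list of floats."""
    return [
        [float(value) for value in dimension.split(",")]
        for dimension in line.strip().split(":")
    ]


data_lines = [
    "2,3,2,4:4,3,2,2",
    "13,12,32,12:22,23,12,32",
    "4,4,5,4:3,2,3,2",
]

# One parsed entry per line (instance), one inner list per colon-separated dimension.
instances = [parse_ts_data_line(line) for line in data_lines]

print(len(instances), len(instances[0]), len(instances[0][0]))
print(instances[0])
```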

Missing readings can be specified using ?. For example,

2,?,2,4:4,3,2,2
13,12,32,12:22,23,12,32
4,4,5,4:3,2,3,2

would indicate the second timepoint value of the initial instance’s first dimension is missing.
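As a rough illustration of handling the ? marker, the sketch below maps it to NaN while parsing a single dimension. The helper parse_ts_values is hypothetical and is not part of sktime.

```python
import math


def parse_ts_values(dimension, missing=math.nan):
    """Parse comma-separated readings, mapping '?' to a missing-value marker."""
    return [
        missing if token.strip() == "?" else float(token)
        for token in dimension.split(",")
    ]


first_dim = parse_ts_values("2,?,2,4")

# The second timepoint is missing; the others are ordinary floats.
print(first_dim)
print(math.isnan(first_dim[1]))
```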

Alternatively, for sparse datasets, readings can be specified by setting @timeStamps to true in the header and representing the data with tuples in the form of (timestamp, value) for only the observed readings. For example, the first instance in the example above could be specified in this representation as:

(0,2),(1,3),(2,2),(3,4):(0,4),(1,3),(2,2),(3,2)

Equivalently, the sparser example

2,5,?,?,?,?,?,5,?,?,?,?,4

could be represented with just the non-missing timestamps as:

(0,2),(1,5),(7,5),(12,4)
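The mapping from the dense form to the timestamped form can be checked with a short sketch. Both helpers are hypothetical and are not sktime functions; timestamps are assumed to be the zero-based positions in the dense list.

```python
def dense_to_timestamped(dense):
    """Keep only non-missing readings, paired with their zero-based position."""
    return [
        (position, float(token))
        for position, token in enumerate(dense.split(","))
        if token != "?"
    ]


def format_timestamped(pairs):
    """Render (timestamp, value) pairs in the tuple notation used by .ts files."""
    return ",".join(f"({timestamp},{value:g})" for timestamp, value in pairs)


pairs = dense_to_timestamped("2,5,?,?,?,?,?,5,?,?,?,?,4")
print(format_timestamped(pairs))  # (0,2),(1,5),(7,5),(12,4)
```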

When using the .ts file format to store data for timeseries classification problems, the class label for an instance should be specified in the last dimension and @classLabel should be set to true in the header information and be followed by the set of possible class values. For example, if a case consists of a single dimension and has a class value of 1 it would be specified as:

1,4,23,34:1
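Putting the header and data together, the sketch below assembles the text of a small univariate classification problem and then separates each series from its class label. The problem name Example, the second case and the helper split_label are made up for illustration; the splitting logic assumes a single dimension per case.

```python
header = "\n".join(
    [
        "@problemName Example",  # hypothetical problem name
        "@timeStamps false",
        "@univariate true",
        "@classLabel true 0 1",
        "@data",
    ]
)
body = "\n".join(["1,4,23,34:1", "2,3,5,8:0"])
ts_text = header + "\n" + body + "\n"


def split_label(line):
    """Split a univariate data line into its readings and trailing class label."""
    series_part, label = line.rsplit(":", 1)
    return [float(value) for value in series_part.split(",")], label


# Everything after the @data tag is one case per line.
data_lines = ts_text.split("@data\n", 1)[1].splitlines()
cases = [split_label(line) for line in data_lines]
print(cases[0])
```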

Loading from .ts file to pandas DataFrame

A dataset can be loaded from a .ts file using the following function in sktime.utils.data_io:

load_from_tsfile_to_dataframe(full_file_path_and_name, replace_missing_vals_with='NaN')

This can be demonstrated using the ArrowHead problem that is included in sktime under sktime/datasets/data.

[1]:
import os

import sktime
from sktime.utils.data_io import load_from_tsfile_to_dataframe

DATA_PATH = os.path.join(os.path.dirname(sktime.__file__), "datasets/data")

train_x, train_y = load_from_tsfile_to_dataframe(
    os.path.join(DATA_PATH, "ArrowHead/ArrowHead_TRAIN.ts")
)
test_x, test_y = load_from_tsfile_to_dataframe(
    os.path.join(DATA_PATH, "ArrowHead/ArrowHead_TEST.ts")
)

Train and test partitions of the ArrowHead problem have been loaded into nested dataframes with an associated array of class values. As an example, below are the first 5 rows from the train_x and train_y:

[2]:
train_x.head()
[2]:
dim_0
0 0 -1.9630 1 -1.9578 2 -1.9561 3 ...
1 0 -1.7746 1 -1.7740 2 -1.7766 3 ...
2 0 -1.8660 1 -1.8420 2 -1.8350 3 ...
3 0 -2.0738 1 -2.0733 2 -2.0446 3 ...
4 0 -1.7463 1 -1.7413 2 -1.7227 3 ...
[3]:
train_y[0:5]
[3]:
array(['0', '1', '2', '0', '1'], dtype='<U1')

## Loading other file formats

Researchers who have made timeseries data available have commonly used two other formats:

  • Weka ARFF files

  • UCR .tsv files

Loading from Weka ARFF files

It is also possible to load data from Weka’s attribute-relation file format (ARFF) files. Data for timeseries problems are made available in this format by researchers at the University of East Anglia (among others) at www.timeseriesclassification.com. The load_from_arff_to_dataframe method in sktime.utils.data_io supports reading data for both univariate and multivariate timeseries problems.

The univariate functionality is demonstrated below using data on the ArrowHead problem again (this time loading from ARFF file).

[4]:
from sktime.utils.data_io import load_from_arff_to_dataframe

X, y = load_from_arff_to_dataframe(
    os.path.join(DATA_PATH, "ArrowHead/ArrowHead_TRAIN.arff")
)
X.head()
[4]:
dim_0
0 0 -1.963009 1 -1.957825 2 -1.95614...
1 0 -1.774571 1 -1.774036 2 -1.77658...
2 0 -1.866021 1 -1.841991 2 -1.83502...
3 0 -2.073758 1 -2.073301 2 -2.04460...
4 0 -1.746255 1 -1.741263 2 -1.72274...

The multivariate BasicMotions problem is used below to illustrate the ability to read multivariate timeseries data from ARFF files into the sktime format.

[5]:
X, y = load_from_arff_to_dataframe(
    os.path.join(DATA_PATH, "BasicMotions/BasicMotions_TRAIN.arff")
)
X.head()
[5]:
dim_0 dim_1 dim_2 dim_3 dim_4 dim_5
0 0 0.079106 1 0.079106 2 -0.903497 3... 0 0.394032 1 0.394032 2 -3.666397 3... 0 0.551444 1 0.551444 2 -0.282844 3... 0 0.351565 1 0.351565 2 -0.095881 3... 0 0.023970 1 0.023970 2 -0.319605 3... 0 0.633883 1 0.633883 2 0.972131 3...
1 0 0.377751 1 0.377751 2 2.952965 3... 0 -0.610850 1 -0.610850 2 0.970717 3... 0 -0.147376 1 -0.147376 2 -5.962515 3... 0 -0.103872 1 -0.103872 2 -7.593275 3... 0 -0.109198 1 -0.109198 2 -0.697804 3... 0 -0.037287 1 -0.037287 2 -2.865789 3...
2 0 -0.813905 1 -0.813905 2 -0.424628 3... 0 0.825666 1 0.825666 2 -1.305033 3... 0 0.032712 1 0.032712 2 0.826170 3... 0 0.021307 1 0.021307 2 -0.372872 3... 0 0.122515 1 0.122515 2 -0.045277 3... 0 0.775041 1 0.775041 2 0.383526 3...
3 0 0.289855 1 0.289855 2 -0.669185 3... 0 0.284130 1 0.284130 2 -0.210466 3... 0 0.213680 1 0.213680 2 0.252267 3... 0 -0.314278 1 -0.314278 2 0.018644 3... 0 0.074574 1 0.074574 2 0.007990 3... 0 -0.079901 1 -0.079901 2 0.237040 3...
4 0 -0.123238 1 -0.123238 2 -0.249547 3... 0 0.379341 1 0.379341 2 0.541501 3... 0 -0.286006 1 -0.286006 2 0.208420 3... 0 -0.098545 1 -0.098545 2 -0.023970 3... 0 0.058594 1 0.058594 2 0.175783 3... 0 -0.074574 1 -0.074574 2 0.114525 3...

Loading from UCR .tsv Format Files

A further option is to load data into sktime from tab-separated value (.tsv) files. Researchers at the University of California, Riverside make a variety of timeseries data available in this format at https://www.cs.ucr.edu/~eamonn/time_series_data_2018.

The load_from_ucr_tsv_to_dataframe method in sktime.utils.data_io supports reading univariate problems. An example with ArrowHead is given below to demonstrate equivalence with loading from the .ts and ARFF file formats.

[6]:
from sktime.utils.data_io import load_from_ucr_tsv_to_dataframe

X, y = load_from_ucr_tsv_to_dataframe(
    os.path.join(DATA_PATH, "ArrowHead/ArrowHead_TRAIN.tsv")
)
X.head()
[6]:
dim_0
0 0 -1.963009 1 -1.957825 2 -1.95614...
1 0 -1.774571 1 -1.774036 2 -1.77658...
2 0 -1.866021 1 -1.841991 2 -1.83502...
3 0 -2.073758 1 -2.073301 2 -2.04460...
4 0 -1.746255 1 -1.741263 2 -1.72274...

## Converting between other NumPy and pandas formats

It is also possible to use data from sources other than .ts and .arff files by manually shaping the data into the format described above.

Functions to convert between the formats described below and sktime’s nested DataFrame format are provided in sktime.datatypes._panel._convert.

Using tabular data with sktime

One approach to representing timeseries data is a tabular DataFrame. As usual, each row represents an instance. In the tabular setting, each timepoint of the univariate timeseries measured for an instance is treated as a feature and stored as a primitive data type in the DataFrame’s cells.

In a univariate setting, where there are n instances of the series and each univariate timeseries has t timepoints, this would yield a pandas DataFrame with shape (n, t). In practice, this could be used to represent sensors measuring the same signal over time (features) on different machines (instances) or the same economic variable over time (features) for different countries (instances).

The function from_2d_array_to_nested converts an (n, t) tabular DataFrame to a nested DataFrame with shape (n, 1). To convert from a nested DataFrame to a tabular array, the function from_nested_to_2d_array can be used.

The example below uses 50 instances with 20 timepoints each.

[7]:
from numpy.random import default_rng

from sktime.datatypes._panel._convert import (
    from_2d_array_to_nested,
    from_nested_to_2d_array,
    is_nested_dataframe,
)

rng = default_rng()
X_2d = rng.standard_normal((50, 20))
print(f"The tabular data has the shape {X_2d.shape}")
The tabular data has the shape (50, 20)

The from_2d_array_to_nested function makes it easy to convert this to a nested DataFrame.

[8]:
X_nested = from_2d_array_to_nested(X_2d)
print(f"X_nested is a nested DataFrame: {is_nested_dataframe(X_nested)}")
print(f"The cell contains a {type(X_nested.iloc[0,0])}.")
print(f"The nested DataFrame has shape {X_nested.shape}")
X_nested.head()
X_nested is a nested DataFrame: True
The cell contains a <class 'pandas.core.series.Series'>.
The nested DataFrame has shape (50, 1)
[8]:
0
0 0 -0.778677 1 0.408768 2 0.451901 3...
1 0 0.659781 1 0.449760 2 1.470552 3...
2 0 0.751150 1 0.288642 2 -0.106472 3...
3 0 1.749922 1 -0.141803 2 -0.679641 3...
4 0 1.739100 1 2.389569 2 -0.450622 3...

This nested DataFrame can also be easily converted back to a tabular format using from_nested_to_2d_array.

[9]:
X_2d = from_nested_to_2d_array(X_nested)
print(f"The tabular data has the shape {X_2d.shape}")
The tabular data has the shape (50, 20)
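The idea behind this round trip can be mimicked with plain NumPy and pandas. The sketch below is not sktime's implementation; it only shows how an (n, t) array maps onto an (n, 1) nested DataFrame and back, using a small made-up array.

```python
import numpy as np
import pandas as pd

# Three instances with four timepoints each (illustrative values).
X_2d = np.arange(12, dtype=float).reshape(3, 4)

# Tabular -> nested: each row becomes one pd.Series stored in a single cell.
nested = pd.DataFrame({0: [pd.Series(row) for row in X_2d]})

# Nested -> tabular: stack the cell values back into rows.
back = np.vstack([cell.to_numpy() for cell in nested[0]])

print(nested.shape)
print(np.array_equal(back, X_2d))
```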

Using long-format data with sktime

Timeseries data can also be represented in long format where each row identifies the value for a single timepoint for a given dimension for a given instance.

This format may be encountered in a database where each row stores a single measurement identified by several identification columns. For example, case_id identifies a specific instance in the data, dim_id is an integer between 0 and d-1 for d dimensions in the data, reading_id is the index of timepoints for the associated case_id and dim_id, and value is the actual value of the observation. E.g.:

     | case_id | dim_id | reading_id | value
------------------------------------------------
  0  |   int   |  int   |    int     | double
  1  |   int   |  int   |    int     | double
  2  |   int   |  int   |    int     | double
  3  |   int   |  int   |    int     | double
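A long-format table of this kind can be put together directly with pandas. The sketch below uses small made-up sizes and random values, and does not rely on any sktime helper; it also shows how a single series is recovered by filtering on the identification columns.

```python
import numpy as np
import pandas as pd

n_cases, n_dims, series_len = 2, 2, 3  # illustrative sizes
rng = np.random.default_rng(0)

# One row per (case, dimension, timepoint) combination.
rows = [
    {"case_id": c, "dim_id": d, "reading_id": t, "value": rng.random()}
    for c in range(n_cases)
    for d in range(n_dims)
    for t in range(series_len)
]
X_long = pd.DataFrame(rows)
print(X_long.shape)  # (12, 4)

# Recover the series for case 0, dimension 1, indexed by timepoint.
mask = (X_long["case_id"] == 0) & (X_long["dim_id"] == 1)
series = X_long.loc[mask].set_index("reading_id")["value"]
print(series)
```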

Sktime provides functions to convert to and from the long data format in sktime.datatypes._panel._convert.

The from_long_to_nested function converts from a long format DataFrame to sktime’s nested format (with assumptions made on how the data is initially formatted). Conversely, from_nested_to_long converts from a sktime nested DataFrame into a long format DataFrame.

To demonstrate this functionality, the function below creates a dataset with 50 instances (cases), 5 dimensions and 20 timepoints per dimension.

[10]:
from sktime.utils.data_io import generate_example_long_table

X = generate_example_long_table(num_cases=50, series_len=20, num_dims=5)

X.head()
[10]:
case_id dim_id reading_id value
0 0 0 0 0.912509
1 0 0 1 0.695034
2 0 0 2 0.053899
3 0 0 3 0.159354
4 0 0 4 0.689003
[11]:
X.tail()
[11]:
case_id dim_id reading_id value
4995 49 4 15 0.368105
4996 49 4 16 0.477780
4997 49 4 17 0.715439
4998 49 4 18 0.491749
4999 49 4 19 0.258831

As shown below, applying the from_long_to_nested method returns a sktime-formatted dataset with individual dimensions represented by columns of the output dataframe.

[12]:
from sktime.datatypes._panel._convert import from_long_to_nested, from_nested_to_long

X_nested = from_long_to_nested(X)
X_nested.head()
[12]:
var_0 var_1 var_2 var_3 var_4
0 0 0.912509 1 0.695034 2 0.053899 3... 0 0.051646 1 0.118612 2 0.895839 3... 0 0.239257 1 0.799755 2 0.214799 3... 0 0.517281 1 0.985751 2 0.700742 3... 0 0.502050 1 0.715278 2 0.723479 3...
1 0 0.890478 1 0.019525 2 0.148067 3... 0 0.016530 1 0.888095 2 0.875295 3... 0 0.522592 1 0.816917 2 0.920839 3... 0 0.414255 1 0.878936 2 0.242917 3... 0 0.236753 1 0.621625 2 0.720342 3...
2 0 0.806029 1 0.387421 2 0.081276 3... 0 0.478750 1 0.881884 2 0.296138 3... 0 0.409165 1 0.667109 2 0.340916 3... 0 0.603253 1 0.153509 2 0.457244 3... 0 0.569114 1 0.577707 2 0.987618 3...
3 0 0.190946 1 0.244025 2 0.258714 3... 0 0.885913 1 0.983103 2 0.532775 3... 0 0.640145 1 0.485168 2 0.993418 3... 0 0.241162 1 0.021157 2 0.205136 3... 0 0.809745 1 0.518432 2 0.980226 3...
4 0 0.738218 1 0.676297 2 0.845292 3... 0 0.058199 1 0.770628 2 0.890301 3... 0 0.886766 1 0.943441 2 0.295226 3... 0 0.912242 1 0.456776 2 0.957471 3... 0 0.160228 1 0.266784 2 0.998907 3...

As expected the result is a nested DataFrame and the cells include nested pandas Series objects.

[13]:
print(f"X_nested is a nested DataFrame: {is_nested_dataframe(X_nested)}")
print(f"The cell contains a {type(X_nested.iloc[0,0])}.")
print(f"The nested DataFrame has shape {X_nested.shape}")
X_nested.iloc[0, 0].head()
X_nested is a nested DataFrame: True
The cell contains a <class 'pandas.core.series.Series'>.
The nested DataFrame has shape (50, 5)
[13]:
0    0.912509
1    0.695034
2    0.053899
3    0.159354
4    0.689003
Name: 0, dtype: float64

As shown below, the from_nested_to_long function can be used to convert the resulting nested DataFrame (or any nested DataFrame) to a long format DataFrame.

[14]:
X_long = from_nested_to_long(
    X_nested,
    instance_column_name="case_id",
    time_column_name="reading_id",
    dimension_column_name="dim_id",
)
X_long.head()
[14]:
case_id reading_id dim_id value
0 0 0 var_0 0.912509
1 0 1 var_0 0.695034
2 0 2 var_0 0.053899
3 0 3 var_0 0.159354
4 0 4 var_0 0.689003
[15]:
X_long.tail()
[15]:
case_id reading_id dim_id value
4995 49 15 var_4 0.368105
4996 49 16 var_4 0.477780
4997 49 17 var_4 0.715439
4998 49 18 var_4 0.491749
4999 49 19 var_4 0.258831

Using multi-indexed pandas DataFrames

Pandas deprecated its Panel object in version 0.20.1. Since that time pandas has recommended representing 3-dimensional data using a multi-indexed DataFrame.

Storing timeseries data in a pandas multi-indexed DataFrame is a natural option since many timeseries problems include data over the instance, feature and time dimensions.

Sktime provides the functions from_multi_index_to_nested and from_nested_to_multi_index in sktime.datatypes._panel._convert to easily convert between pandas multi-indexed DataFrames and sktime’s nested DataFrame structure.

The example below illustrates how these functions can be used to convert to and from the nested structure given data with 50 instances, 5 features (columns) and 20 timepoints per feature. In the multi-indexed DataFrame a row represents a unique combination of the instance and timepoint indices. Therefore, the resulting multi-indexed DataFrame should have the shape (1000, 5).

[16]:
from sktime.datatypes._panel._convert import (
    from_multi_index_to_nested,
    from_nested_to_multi_index,
)
from sktime.utils.data_io import make_multi_index_dataframe

X_mi = make_multi_index_dataframe(n_instances=50, n_columns=5, n_timepoints=20)

print(f"The multi-indexed DataFrame has shape {X_mi.shape}")
print(f"The multi-index names are {X_mi.index.names}")

X_mi.head()
The multi-indexed DataFrame has shape (1000, 5)
The multi-index names are ['case_id', 'reading_id']
[16]:
var_0 var_1 var_2 var_3 var_4
case_id reading_id
0 0 0.771363 0.037907 0.235545 0.853656 0.281851
1 0.265040 0.851836 0.481260 0.959612 0.009352
2 0.934614 0.489645 0.441902 0.689056 0.978110
3 0.799281 0.888585 0.362759 0.521939 0.413390
4 0.422916 0.018661 0.318501 0.811994 0.974894

The multi-indexed DataFrame can be easily converted to a nested DataFrame with shape (50, 5). Note that the conversion to the nested DataFrame has preserved the column names (it has also preserved the values of the instance index and the pandas Series objects nested in each cell have preserved the time index).

[17]:
X_nested = from_multi_index_to_nested(X_mi, instance_index="case_id")
print(f"X_nested is a nested DataFrame: {is_nested_dataframe(X_nested)}")
print(f"The cell contains a {type(X_nested.iloc[0,0])}.")
print(f"The nested DataFrame has shape {X_nested.shape}")
X_nested.head()
X_nested is a nested DataFrame: True
The cell contains a <class 'pandas.core.series.Series'>.
The nested DataFrame has shape (50, 5)
[17]:
var_0 var_1 var_2 var_3 var_4
0 0 0.771363 1 0.265040 2 0.934614 3... 0 0.037907 1 0.851836 2 0.489645 3... 0 0.235545 1 0.481260 2 0.441902 3... 0 0.853656 1 0.959612 2 0.689056 3... 0 0.281851 1 0.009352 2 0.978110 3...
1 0 0.930078 1 0.334428 2 0.438320 3... 0 0.448169 1 0.030304 2 0.907764 3... 0 0.844679 1 0.287600 2 0.876233 3... 0 0.812035 1 0.964712 2 0.484175 3... 0 0.265644 1 0.268937 2 0.667865 3...
2 0 0.562220 1 0.149496 2 0.350670 3... 0 0.828236 1 0.992163 2 0.672932 3... 0 0.852425 1 0.911694 2 0.468790 3... 0 0.082205 1 0.659805 2 0.776347 3... 0 0.382383 1 0.893029 2 0.318887 3...
3 0 0.298994 1 0.924429 2 0.249858 3... 0 0.650475 1 0.476007 2 0.812734 3... 0 0.783259 1 0.302699 2 0.439683 3... 0 0.219623 1 0.389333 2 0.669000 3... 0 0.537272 1 0.428196 2 0.695308 3...
4 0 0.772403 1 0.943499 2 0.640278 3... 0 0.601943 1 0.846388 2 0.065433 3... 0 0.190929 1 0.141046 2 0.922823 3... 0 0.481654 1 0.150438 2 0.392064 3... 0 0.137388 1 0.546390 2 0.065213 3...

Nested DataFrames can also be converted to a multi-indexed pandas DataFrame.

[18]:
X_mi = from_nested_to_multi_index(
    X_nested, instance_index="case_id", time_index="reading_id"
)
X_mi.head()
[18]:
var_0 var_1 var_2 var_3 var_4
case_id reading_id
0 0 0.771363 0.037907 0.235545 0.853656 0.281851
1 0.265040 0.851836 0.481260 0.959612 0.009352
2 0.934614 0.489645 0.441902 0.689056 0.978110
3 0.799281 0.888585 0.362759 0.521939 0.413390
4 0.422916 0.018661 0.318501 0.811994 0.974894
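The multi-index layout used in this section can also be built directly with pandas. The sketch below is an assumption-laden stand-in for the helper used above, not its actual implementation: it reuses the same index names and sizes, with random values.

```python
import numpy as np
import pandas as pd

n_instances, n_columns, n_timepoints = 50, 5, 20

# One row per (instance, timepoint) pair.
index = pd.MultiIndex.from_product(
    [range(n_instances), range(n_timepoints)], names=["case_id", "reading_id"]
)
rng = np.random.default_rng(0)
X_mi = pd.DataFrame(
    rng.random((n_instances * n_timepoints, n_columns)),
    index=index,
    columns=[f"var_{i}" for i in range(n_columns)],
)
print(X_mi.shape)  # (1000, 5)

# Selecting on the first index level returns all timepoints for one instance.
first_case = X_mi.loc[0]
print(first_case.shape)  # (20, 5)
```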

Using NumPy 3d-arrays with sktime

Another common approach for representing timeseries data is to use a 3-dimensional NumPy array with shape (n_instances, n_columns, n_timepoints).

Sktime provides the functions from_3d_numpy_to_nested and from_nested_to_3d_numpy in sktime.datatypes._panel._convert to let users easily convert between NumPy 3d-arrays and nested pandas DataFrames.

This is demonstrated using a 3d-array with 50 instances, 5 features (columns) and 20 timepoints, resulting in a 3d-array with shape (50, 5, 20).

[19]:
from sktime.datatypes._panel._convert import (
    from_3d_numpy_to_nested,
    from_multi_index_to_3d_numpy,
    from_nested_to_3d_numpy,
)

X_mi = make_multi_index_dataframe(n_instances=50, n_columns=5, n_timepoints=20)
X_3d = from_multi_index_to_3d_numpy(
    X_mi, instance_index="case_id", time_index="reading_id"
)

print(f"The 3d-array has shape {X_3d.shape}")
The 3d-array has shape (50, 5, 20)

The 3d-array can be easily converted to a nested DataFrame with shape (50, 5). Note that since NumPy arrays don’t have indices, the instance index is the numerical range over the number of instances and the column names are automatically assigned. Users can optionally supply their own column names via the column_names parameter.

[20]:
X_nested = from_3d_numpy_to_nested(X_3d)
print(f"X_nested is a nested DataFrame: {is_nested_dataframe(X_nested)}")
print(f"The cell contains a {type(X_nested.iloc[0,0])}.")
print(f"The nested DataFrame has shape {X_nested.shape}")
X_nested.head()
X_nested is a nested DataFrame: True
The cell contains a <class 'pandas.core.series.Series'>.
The nested DataFrame has shape (50, 5)
[20]:
var_0 var_1 var_2 var_3 var_4
0 0 0.654519 1 0.000529 2 0.737502 3... 0 0.455267 1 0.493692 2 0.454159 3... 0 0.125717 1 0.991244 2 0.749765 3... 0 0.541312 1 0.237882 2 0.923522 3... 0 0.540375 1 0.599886 2 0.592792 3...
1 0 0.783691 1 0.597093 2 0.316369 3... 0 0.918897 1 0.069424 2 0.882183 3... 0 0.398496 1 0.799342 2 0.947683 3... 0 0.348487 1 0.160653 2 0.705979 3... 0 0.562481 1 0.695452 2 0.559146 3...
2 0 0.544813 1 0.213060 2 0.279156 3... 0 0.855332 1 0.595745 2 0.273279 3... 0 0.403519 1 0.566566 2 0.507071 3... 0 0.937227 1 0.608138 2 0.176028 3... 0 0.198976 1 0.305317 2 0.978067 3...
3 0 0.042076 1 0.987119 2 0.152571 3... 0 0.151546 1 0.081442 2 0.181950 3... 0 0.270189 1 0.037995 2 0.565569 3... 0 0.625251 1 0.227008 2 0.720423 3... 0 0.859427 1 0.416729 2 0.527326 3...
4 0 0.822275 1 0.835125 2 0.445475 3... 0 0.029691 1 0.732877 2 0.009868 3... 0 0.188352 1 0.957949 2 0.576842 3... 0 0.313692 1 0.483743 2 0.175226 3... 0 0.637401 1 0.857875 2 0.886624 3...

Nested DataFrames can also be converted to NumPy 3d-arrays.

[21]:
X_3d = from_nested_to_3d_numpy(X_nested)
print(f"The resulting object is a {type(X_3d)}")
print(f"The shape of the 3d-array is {X_3d.shape}")
The resulting object is a <class 'numpy.ndarray'>
The shape of the 3d-array is (50, 5, 20)

Converting between NumPy 3d-arrays and pandas multi-indexed DataFrame

Although an example is not provided here, sktime lets users convert data between NumPy 3d-arrays and multi-indexed pandas DataFrames using the functions from_3d_numpy_to_multi_index and from_multi_index_to_3d_numpy in sktime.datatypes._panel._convert.
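The reshaping behind such a conversion can be sketched with plain NumPy and pandas, without calling the sktime functions. The sizes are illustrative, and the axis order (n_instances, n_columns, n_timepoints) and index names follow the conventions used earlier in this tutorial.

```python
import numpy as np
import pandas as pd

n_instances, n_columns, n_timepoints = 4, 3, 5  # illustrative sizes
X_3d = np.arange(n_instances * n_columns * n_timepoints, dtype=float).reshape(
    n_instances, n_columns, n_timepoints
)

# Move time next to the instance axis, then flatten (instance, time) into rows.
index = pd.MultiIndex.from_product(
    [range(n_instances), range(n_timepoints)], names=["case_id", "reading_id"]
)
X_mi = pd.DataFrame(
    X_3d.transpose(0, 2, 1).reshape(n_instances * n_timepoints, n_columns),
    index=index,
    columns=[f"var_{i}" for i in range(n_columns)],
)

# Reverse the reshape to recover the original 3d-array.
X_back = (
    X_mi.to_numpy()
    .reshape(n_instances, n_timepoints, n_columns)
    .transpose(0, 2, 1)
)
print(X_mi.shape, X_back.shape)
```

The transpose is needed because the multi-indexed rows iterate over timepoints within each instance, whereas the 3d-array stores time on its last axis.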

