Loading Data into Nested-Pandas#

This notebook provides a brief introduction to loading data into nested-pandas or converting data into a nested structure. For an introduction to nested-pandas, see the quick start tutorial or the readthedocs page

Installation and Imports#

With a valid Python environment, nested-pandas and its dependencies are easy to install using the pip package manager. The following command can be used to install it:

[1]:

# % pip install nested-pandas

[2]:

import os
import tempfile

import pandas as pd

from nested_pandas import NestedFrame, read_parquet
from nested_pandas.datasets import generate_parquet_file

Overview#

Nested-pandas provides multiple mechanisms for loading data or converting data to the nested format. Below we walk through some of the common approaches.

Converting Flat Data#

Commonly existing data sets will be provided in “flat” data structures such as dictionaries or Pandas DataFrames. In these cases the data consists of a rectangular table where each row represents an instance or observation. Multiple instances of the same top-level item are linked together through an ID. All rows with the same ID correspond to the same object/item.

We define one such flat dataframe consisting of 10 rows for 3 distinct items.

[3]:

flat_df = pd.DataFrame(
    data={
        "a": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
        "b": [2, 2, 2, 4, 4, 4, 6, 6, 6, 6],
        "c": [0, 2, 4, 1, 4, 3, 1, 4, 1, 1],
        "d": [5, 4, 7, 5, 3, 1, 9, 3, 4, 1],
    },
    index=[0, 0, 0, 1, 1, 1, 2, 2, 2, 2],
)
flat_df

[3]:

	a	b	c	d
0	1	2	0	5
0	1	2	2	4
0	1	2	4	7
1	2	4	1	5
1	2	4	4	3
1	2	4	3	1
2	3	6	1	9
2	3	6	4	3
2	3	6	1	4
2	3	6	1	1

The first column provides the object id. As we can see there are three rows with ID=0, three rows with ID=1, and four rows with ID=2. Some of the values are constant for each item. For example both columns “a” and “b” take a single value for object. We are wasting space by repeating them in every row. Other values are different per row (columns “c” and “d”).

As a concrete example, consider patient records. Each patient is assigned a unique id and has static data such as a date birth. They also have measurements that are new with every trip to the doctor, such as blood pressure or temperature.

Converting from Flat Pandas#

The easiest approach to converting the flat table above into a nested structure is to use NestedFrame.from_flat(). This function takes

a list of columns that are not nested (base_columns)
a list of columns to nest (nested_columns)
the name of the nested column (name) Rows are associated using the index by default, but a column name on which to join can also be provided.

[4]:

nf = NestedFrame.from_flat(
    flat_df,
    base_columns=["a", "b"],  # the columns not to nest
    nested_columns=["c", "d"],  # the columns to nest
    name="nested",  # name of the nested column
)
nf

[4]:

nested

c	d
0	5
+2 rows	...

c	d
1	5
+2 rows	...

c	d
1	9
+3 rows	...

3 rows x 3 columns

Inserting Nested Rows#

Alternatively, we can use the NestedFrame constructor to create our base frame from a dictionary of our columns (as we would do with a normal pandas DataFrame). This defines the top-level objects and the values that are constant across rows (“a” and “b”).

[5]:

nf = NestedFrame(
    data={
        "a": [1, 2, 3],
        "b": [2, 4, 6],
    },
    index=[0, 1, 2],
)
nf

[5]:

	a	b
0	1	2
1	2	4
2	3	6

3 rows × 2 columns

We can then create an additional pandas dataframes for the nested columns and pack them into our NestedFrame with NestedFrame.join_nested() function. join_nested will align the nest based on the index by default (one or more columns may be selected instead via the on kwarg), as we see the nested DataFrame has a repeated index corresponding to the nf NestedFrame.

[6]:

nested = pd.DataFrame(
    data={
        "c": [0, 2, 4, 1, 4, 3, 1, 4, 1, 1],
        "d": [5, 4, 7, 5, 3, 1, 9, 3, 4, 1],
    },
    index=[0, 0, 0, 1, 1, 1, 2, 2, 2, 2],
)

nf = nf.join_nested(nested, "nested")
nf

[6]:

nested

c	d
0	5
+2 rows	...

c	d
1	5
+2 rows	...

c	d
1	9
+3 rows	...

3 rows x 3 columns

The “index” parameter is used to perform the association. All of the values for index=0 are bundled together into a sub-table and stored in row 0’s “nested” column.

[7]:

nf.loc[0]["nested"]

[7]:

	c	d
0	0	5
1	2	4
2	4	7

We could add other nested columns by creating new sub-tables and adding them with join_nested(). Note that while the tables added with each join_nested() must be rectangular, they do not need to have the same dimensions between calls. We could add another nested row with a different number of observations.

[8]:

nested = pd.DataFrame(
    data={
        "c": [0, 1, 0, 1, 2, 0],
        "d": [5, 4, 5, 4, 3, 5],
    },
    index=[0, 0, 1, 1, 1, 2],
)

nf = nf.join_nested(nested, "nested2")
nf

[8]:

nested

nested2

c	d
0	5
+2 rows	...

c	d
0	5
1	4

c	d
1	5
+2 rows	...

c	d
0	5
+2 rows	...

c	d
1	9
+3 rows	...

c	d
0	5

3 rows x 4 columns

When the join key spans multiple columns, pass a list to on. The matching columns are used only as join keys and are dropped from the nested structure, while the original index is preserved.

[9]:

nested = pd.DataFrame(
    data={
        "a": [1, 1, 2, 3],
        "b": [2, 2, 4, 5],
        "c": [100, 200, 300, 400],
    }
)

nf = nf.join_nested(nested, "nested3", on=["a", "b"])
nf

[9]:

nested

nested2

nested3

c	d
0	5
+2 rows	...

c	d
0	5
1	4

c
100
200

c	d
1	5
+2 rows	...

c	d
0	5
+2 rows	...

c
300

c	d
1	9
+3 rows	...

c	d
0	5

None

3 rows x 5 columns

Loading Data from Parquet Files#

For larger datasets, we support loading data from parquet files. In the following cell, we generate a series of temporary parquet files with random data, and ingest them with the read_parquet method:

[10]:

# Note: that we use the `tempfile` module to create and then cleanup a temporary directory.
# You can of course remove this and use your own directory and real files on your system.
with tempfile.TemporaryDirectory() as temp_path:
    # Generates parquet files with random data within our temporary directory
    generate_parquet_file(10, {"nested1": 100, "nested2": 10}, os.path.join(temp_path, "test.parquet"))

    # Read the parquet file to a NestedFrame
    nf = read_parquet(os.path.join(temp_path, "test.parquet"))

Nested-Pandas nested columns are compatible with the parquet format, meaning they can be written and read from parquet natively.

[11]:

nf  # nf contains nested columns

[11]:

nested1

nested2

0.427066

0.136776

t	flux	flux_error	band
5.419617	43.849803	1.0	g
+99 rows	...	...	...

t	flux	flux_error	band
19.495253	8.251719	1.0	r
+9 rows	...	...	...

0.211671

1.326893

t	flux	flux_error	band
8.335402	85.171015	1.0	r
+99 rows	...	...	...

t	flux	flux_error	band
19.53081	96.874015	1.0	r
+9 rows	...	...	...

0.055414

1.082414

t	flux	flux_error	band
19.355955	94.628664	1.0	r
+99 rows	...	...	...

t	flux	flux_error	band
4.691422	11.562981	1.0	r
+9 rows	...	...	...

0.775171

0.387835

t	flux	flux_error	band
19.033199	75.098404	1.0	r
+99 rows	...	...	...

t	flux	flux_error	band
1.955557	80.127908	1.0	g
+9 rows	...	...	...

0.137772

1.039994

t	flux	flux_error	band
5.387827	51.88141	1.0	g
+99 rows	...	...	...

t	flux	flux_error	band
19.929865	74.132066	1.0	g
+9 rows	...	...	...

0.253176

0.974404

t	flux	flux_error	band
13.755509	14.049722	1.0	r
+99 rows	...	...	...

t	flux	flux_error	band
7.420792	83.609402	1.0	g
+9 rows	...	...	...

0.860085

0.090302

t	flux	flux_error	band
16.859351	79.247587	1.0	r
+99 rows	...	...	...

t	flux	flux_error	band
18.108278	77.695262	1.0	r
+9 rows	...	...	...

0.425566

0.216094

t	flux	flux_error	band
17.783718	20.052159	1.0	g
+99 rows	...	...	...

t	flux	flux_error	band
16.896547	99.459486	1.0	r
+9 rows	...	...	...

0.290725

0.863737

t	flux	flux_error	band
18.091149	90.084562	1.0	g
+99 rows	...	...	...

t	flux	flux_error	band
9.370695	52.067879	1.0	r
+9 rows	...	...	...

0.059254

0.704853

t	flux	flux_error	band
18.312317	52.530806	1.0	r
+99 rows	...	...	...

t	flux	flux_error	band
12.02237	45.384563	1.0	r
+9 rows	...	...	...

10 rows x 4 columns

Saving NestedFrames to Parquet Files#

Additionally we can save an existing NestedFrame as a parquet file using NestedFrame.to_parquet.

Note: Nested-Pandas converts any nested columns to pyarrow datatypes when writing to parquet, meaning that parquet files with nested columns can be read by a parquet reader from other packages so long as they understand pyarrow dtypes.

[12]:

# Note: that we use the `tempfile` module to create and then cleanup a temporary directory.
# You can of course remove this and use your own directory and real files on your system.
with tempfile.TemporaryDirectory() as temp_path:
    nf.to_parquet(
        os.path.join(temp_path, "output.parquet"),  # The output file path
    )

    # List the files in temp_path to ensure they were saved correctly.
    print("The NestedFrame was saved to the following parquet files :", os.listdir(temp_path))

The NestedFrame was saved to the following parquet files : ['output.parquet']

Loading Data into Nested-Pandas

Contents

Loading Data into Nested-Pandas#

Installation and Imports#

Overview#

Converting Flat Data#

Converting from Flat Pandas#

Inserting Nested Rows#

Loading Data from Parquet Files#

Saving NestedFrames to Parquet Files#