read_parquet

read_parquet(data: str | UPath | bytes, columns: list[str] | None = None, reject_nesting: list[str] | str | None = None, autocast_list: bool = False, is_dir: bool | None = None, use_pandas_metadata: bool = True, **kwargs) -> NestedFrame

Load a parquet object from a file path into a NestedFrame.

As a specialization of the pandas.read_parquet function, this function loads the data via existing pyarrow or fsspec.parquet methods, and then converts the data to a NestedFrame.

Parameters:
  • data (str, list of str, Path, UPath, or file-like object) – Path to the data, or a file-like object. A string can be a single file name, a directory name, or a remote path (e.g., HTTP/HTTPS or S3). A file-like object must support the read method. You can also pass a filesystem keyword argument with a pyarrow.fs object, which is passed along to the underlying file-reading method. A file URL can also be a path to a directory that contains multiple partitioned Parquet files. Both pyarrow and fastparquet support paths to directories as well as file URLs. A directory path could be file://localhost/path/to/tables/ or s3://bucket/partition_dir/ (note the trailing slash for web locations, since testing whether a path is a directory may be expensive). Directory reading is not supported for HTTP(S). A path to a single Parquet file is loaded using fsspec.parquet.open_parquet_file, which has optimized handling for remote Parquet files.

  • columns (list, default=None) – If not None, only these columns will be read from the file.

  • reject_nesting (list or str, default=None) – Column(s) to reject from being cast to a nested dtype. By default, nested-pandas assumes that any struct column with all fields being lists is castable to a nested column. However, this assumption is invalid if the lists within the struct have mismatched lengths for any given item. Columns specified here will be read using the corresponding pandas.ArrowDtype.

  • autocast_list (bool, default=False) – If True, automatically cast list columns to nested columns with NestedDtype.

  • is_dir (bool or None, default=None) – If True, the pointer represents a directory; if False, the pointer represents a file. In both cases there is no need to check the pointer’s content type. If is_dir is None (default), this function falls back to upath.is_dir() to identify the type of pointer. This argument is ignored for HTTP, because inferring the type over HTTP is particularly expensive: it requires downloading the contents of the pointer in its entirety.

  • use_pandas_metadata (bool, default=True) – If True (default), apply the pandas metadata stored in the Parquet file’s schema when constructing the NestedFrame (e.g. restoring the index and column dtypes). This matches the default behavior of pd.read_parquet. Set to False to ignore the metadata.

  • kwargs (dict) – Keyword arguments passed to pyarrow.parquet.read_table
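
The length condition described for reject_nesting can be checked on plain Python data. The sketch below is illustrative only and is not part of nested-pandas: the helper name has_consistent_list_lengths and the sample rows are hypothetical, and it assumes each row is a dict mapping field names to lists.

```python
def has_consistent_list_lengths(struct_rows):
    """Return True if, in every row, all list-valued fields share one length.

    Hypothetical helper: `struct_rows` is assumed to be a list of dicts,
    each mapping a field name to a list of values.
    """
    for row in struct_rows:
        lengths = {len(values) for values in row.values()}
        if len(lengths) > 1:
            return False
    return True


# Every row has equally long lists, so the struct could be treated as nested.
nestable = [
    {"t": [1, 2, 3], "flux": [0.1, 0.2, 0.3]},
    {"t": [4], "flux": [0.4]},
]

# "t" has 3 items but "flux" has 2: a candidate for reject_nesting.
not_nestable = [{"t": [1, 2, 3], "flux": [0.1, 0.2]}]

print(has_consistent_list_lengths(nestable))      # True
print(has_consistent_list_lengths(not_nestable))  # False
```

A column that fails a check like this is the kind of column one would name in reject_nesting.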

Return type:

NestedFrame

Notes

For paths to single Parquet files, this function uses fsspec.parquet.open_parquet_file, which performs intelligent precaching. This can significantly improve performance compared to standard PyArrow reading on remote files.
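
The file-versus-directory decision described for the is_dir parameter can be sketched in plain Python. This is a rough illustration under stated assumptions, not the library's implementation: classify_pointer and its probe argument are hypothetical names, and os.path.isdir stands in for the upath.is_dir() check mentioned above.

```python
import os
from urllib.parse import urlparse


def classify_pointer(path, is_dir=None, probe=os.path.isdir):
    """Classify `path` as "file" or "directory" (hypothetical helper).

    `probe` is an assumed stand-in for a filesystem check such as
    upath.is_dir(); it is only called when `is_dir` is None.
    """
    scheme = urlparse(path).scheme
    if scheme in ("http", "https"):
        # Directory reading is not supported for HTTP(S), and is_dir is ignored.
        return "file"
    if is_dir is not None:
        # The caller already knows the pointer type, so no check is needed.
        return "directory" if is_dir else "file"
    return "directory" if probe(path) else "file"


print(classify_pointer("s3://bucket/partition_dir/", is_dir=True))
print(classify_pointer("https://example.org/data.parquet", is_dir=True))
```

Passing is_dir explicitly skips the probe entirely, which is the saving the parameter is meant to offer on remote filesystems.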

PyArrow supports partial loading of nested structures from Parquet. For example, `pd.read_parquet("data.parquet", columns=["nested.a"])` loads the “a” column of the “nested” column. Standard pandas/pyarrow behavior returns “a” as a list-array base column named “a”. In nested-pandas, this behavior is changed so that the column is loaded as a sub-column of a nested column called “nested”. Be aware that this prohibits requests such as `columns=["nested.a", "nested"]`, because they imply both a full and a partial load of “nested”.
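
To make the full-versus-partial conflict concrete, a column list can be checked before reading. The helper below is a minimal sketch and not a nested-pandas function: the name find_full_and_partial_requests is hypothetical, and it assumes that a dot in a column name always separates a nested column from its sub-column.

```python
def find_full_and_partial_requests(columns):
    """Return nested column names requested both fully and partially.

    Hypothetical helper; assumes "parent.child" always denotes a sub-column.
    """
    full = {name for name in columns if "." not in name}
    partial_parents = {name.split(".", 1)[0] for name in columns if "." in name}
    return sorted(full & partial_parents)


# "nested" is requested in full and also partially via "nested.a".
print(find_full_and_partial_requests(["nested.a", "nested"]))  # ['nested']

# A top-level column plus one sub-column does not conflict.
print(find_full_and_partial_requests(["a", "nested.flux"]))    # []
```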

Additionally, be aware that nested-pandas (and pyarrow) only supports partial loading of struct-of-lists columns. Your data may be stored as a list of structs, which nested-pandas can read, but without support for partial loading. We try to raise a helpful error message in these cases.
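
The two layouts named above can be illustrated with plain Python objects for a single row. This sketch only shows the shape of the data and does not touch Parquet or nested-pandas; the helper name to_struct_of_lists and the field names "t" and "flux" are hypothetical.

```python
def to_struct_of_lists(records):
    """Convert a list of structs (dicts) into one struct (dict) of lists.

    Hypothetical helper; assumes every record has the same keys.
    """
    fields = list(records[0].keys()) if records else []
    return {field: [record[field] for record in records] for field in fields}


# List of structs: one dict per observation.
list_of_structs = [{"t": 1, "flux": 0.1}, {"t": 2, "flux": 0.2}]

# Struct of lists: one list per field, all of equal length.
struct_of_lists = to_struct_of_lists(list_of_structs)
print(struct_of_lists)  # {'t': [1, 2], 'flux': [0.1, 0.2]}
```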

Furthermore, a subcolumn can have the same name as a top-level column. For example, you may have a column “nested” with subcolumns “nested.a” and “nested.b”, and also a top-level column “a”. In these cases, keep in mind that if “nested” is in the reject_nesting list the operation will fail, which is consistent with the default pandas behavior. Nesting will still work normally.

Examples

Simple loading example:

>>> import nested_pandas as npd
>>> nf = npd.read_parquet("path/to/file.parquet")

Partial loading:

>>> # Load column "a" and only the "flux" sub-column of the "nested" column
>>> nf = npd.read_parquet("path/to/file.parquet", columns=["a", "nested.flux"])