Pandera

Latest version: v0.20.4

Page 10 of 16

0.10.0

Highlights

`pandera` now supports pyspark dataframe validation via `pyspark.pandas`

**The pandera [koalas](https://koalas.readthedocs.io/en/latest/index.html) integration has now been deprecated**

You can now `pip install pandera[pyspark]` and validate `pyspark.pandas` dataframes:

```python
import pyspark.pandas as ps
import pandas as pd
import pandera as pa

from pandera.typing.pyspark import DataFrame, Series


class Schema(pa.SchemaModel):
    state: Series[str]
    city: Series[str]
    price: Series[int] = pa.Field(in_range={"min_value": 5, "max_value": 20})


# create a pyspark.pandas dataframe that's validated on object initialization
df = DataFrame[Schema](
    {
        'state': ['FL', 'FL', 'FL', 'CA', 'CA', 'CA'],
        'city': [
            'Orlando',
            'Miami',
            'Tampa',
            'San Francisco',
            'Los Angeles',
            'San Diego',
        ],
        'price': [8, 12, 10, 16, 20, 18],
    }
)
print(df)
```


`PydanticModel` DataType Enables Row-wise Validation with a `pydantic` model

Pandera now supports row-wise validation by applying a pydantic model as a dataframe-level dtype:

```python
import pandas as pd
from pydantic import BaseModel

import pandera as pa
from pandera.engines.pandas_engine import PydanticModel


class Record(BaseModel):
    name: str
    xcoord: str
    ycoord: int


class PydanticSchema(pa.SchemaModel):
    """Pandera schema using the pydantic model."""

    class Config:
        """Config with dataframe-level data type."""

        dtype = PydanticModel(Record)
        coerce = True  # this is required, otherwise a SchemaInitError is raised
```

⚠️ **Warning:** This may lead to performance issues for very large dataframes.
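To make the row-wise semantics concrete, here is a minimal conceptual sketch, using pydantic directly rather than pandera's internals: each dataframe row becomes a dict that is parsed (and coerced) by the model. The sample rows are hypothetical.

```python
from pydantic import BaseModel


class Record(BaseModel):
    name: str
    xcoord: str
    ycoord: int


# Row-wise validation, conceptually: every row is parsed through the
# pydantic model, which coerces compatible values along the way.
rows = [
    {"name": "a", "xcoord": "1.0", "ycoord": 2},
    {"name": "b", "xcoord": "3.0", "ycoord": "4"},  # "4" is coerced to int
]
validated = [Record(**row) for row in rows]
print([r.ycoord for r in validated])
```

Because each row incurs a full model parse, this also illustrates why the warning above applies to very large dataframes.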

Improved conda installation experience

Before this release there were only two conda packages: one to install `pandera-core` and another to install `pandera` (which installs all extras functionality).

The conda packaging now supports finer-grained control:

```bash
conda install -c conda-forge pandera-hypotheses  # hypothesis checks
conda install -c conda-forge pandera-io          # yaml/script schema io utilities
conda install -c conda-forge pandera-strategies  # data synthesis strategies
conda install -c conda-forge pandera-mypy        # enable static type-linting of pandas
conda install -c conda-forge pandera-fastapi     # fastapi integration
conda install -c conda-forge pandera-dask        # validate dask dataframes
conda install -c conda-forge pandera-pyspark     # validate pyspark dataframes
conda install -c conda-forge pandera-modin       # validate modin dataframes
conda install -c conda-forge pandera-modin-ray   # validate modin dataframes with ray
conda install -c conda-forge pandera-modin-dask  # validate modin dataframes with dask
```


Enhancements

- [Add option to disallow duplicate column names](https://github.com/pandera-dev/pandera/commit/86e6eea0ac2c3f527c41c2002c11a4ba61b648c2) #758
- [Make SchemaModel use class name, define own config](https://github.com/pandera-dev/pandera/commit/5a484321e4dc60ab28fc400765a8ea00bf035ded) #761
- [implement coercion-on-initialization for DataFrame[SchemaModel] types](https://github.com/pandera-dev/pandera/commit/2cef93bc89f12e9d6e86e618d61ad609f0412a56) #772
- [Update filtering columns for performance reasons.](https://github.com/pandera-dev/pandera/commit/e4c7b54861c107acc82cdfa68db82cb4c8831cad) #777
- [implement pydantic model data type](https://github.com/pandera-dev/pandera/commit/3ea4f2c4a3a4443c13dd9741d35bb7c8cbb930ac) #779
- [make finding coerce failure cases faster](https://github.com/pandera-dev/pandera/commit/60ac7ef668e5ea23dcbbf94df2ef1708a898aa64) #792
- [add pyspark support, deprecate koalas](https://github.com/pandera-dev/pandera/commit/d62f8206ea366a53fbe0d1da66ce8081d8660554) #793
- [Add overloads to `schema.to_yaml`](https://github.com/pandera-dev/pandera/commit/20fd29a7c5fe1ddc1799e86db420dd0c8de0090a) #790
- [Add overloads to `infer_schema`](https://github.com/pandera-dev/pandera/commit/e2d2f308257d120cf846bd1126842b20c3e56e73) #789

Bugfixes

- [set default n_failure_cases to None](https://github.com/pandera-dev/pandera/commit/4efed31217b832b7b609bc6065e09144757ab94b) #784
- [🐛 Persist index uniqueness](https://github.com/pandera-dev/pandera/commit/c0c2d408f0f7d34abced02887f3f98158580712c) #801

Deprecations

- [deprecate allow_duplicates, pandas_dtype, transformer, PandasDtype en](https://github.com/pandera-dev/pandera/commit/6800848e9ca83f973c04114a3e7a76e51f0c67ef) #811
- [add pyspark support, deprecate koalas](https://github.com/pandera-dev/pandera/commit/d62f8206ea366a53fbe0d1da66ce8081d8660554) #793

Docs Improvements

- [add imports to fastapi docs](https://github.com/pandera-dev/pandera/commit/a19a1c7ef3d79922dfdee76b3058cf688c68d059)
- [add documentation for pandas_engine.DateTime](https://github.com/pandera-dev/pandera/commit/4c97ddcc55233b841e95852ceea082d445365508) #780
- [update docs for 0.10.0](https://github.com/pandera-dev/pandera/commit/8edfb63c473bef1171f42f06e163e3c7e916785a) #795
- [update docs with fastapi](https://github.com/pandera-dev/pandera/commit/a7268e058cc39ba03bd381724d35be2289919b17) #804

Testing Improvements

Misc Changes

- [update conda install instructions](https://github.com/pandera-dev/pandera/commit/b571e0f36526732fa77b271c0c08db68f7eb8220) #776
- [Adopt NEP 29-based deprecation policy](https://github.com/pandera-dev/pandera/commit/1feb54c918162de4b2dffecd0cfa691df97dd672) #727

Contributors

0.9.0

Highlights

FastAPI Integration [[Docs](https://pandera.readthedocs.io/en/stable/fastapi.html)]

`pandera` now integrates with [fastapi](https://fastapi.tiangolo.com/). You can decorate app endpoint arguments with `DataFrame[Schema]` types and the endpoint will validate incoming and outgoing data.

```python
from typing import Optional

from pydantic import BaseModel, Field

import pandera as pa


# schema definitions
class Transactions(pa.SchemaModel):
    id: pa.typing.Series[int]
    cost: pa.typing.Series[float] = pa.Field(ge=0, le=1000)

    class Config:
        coerce = True

class TransactionsOut(Transactions):
    id: pa.typing.Series[int]
    cost: pa.typing.Series[float]
    name: pa.typing.Series[str]

class TransactionsDictOut(TransactionsOut):
    class Config:
        to_format = "dict"
        to_format_kwargs = {"orient": "records"}
```

App endpoint example:

```python
from fastapi import FastAPI, File

from pandera.typing import DataFrame

app = FastAPI()

@app.post("/transactions/", response_model=DataFrame[TransactionsDictOut])
def create_transactions(transactions: DataFrame[Transactions]):
    output = transactions.assign(name="foo")
    # ... do other stuff, e.g. update backend database with transactions
    return output
```

Data Format Conversion [[Docs](https://pandera.readthedocs.io/en/stable/data_format_conversion.html)]

The class-based API now supports automatically deserializing/serializing pandas dataframes in the context of `pa.check_types`-decorated functions, `pydantic.validate_arguments`-decorated functions, and fastapi endpoint functions.

```python
import pandera as pa
from pandera.typing import DataFrame, Series

# base schema definitions
class InSchema(pa.SchemaModel):
    str_col: Series[str] = pa.Field(unique=True, isin=[*"abcd"])
    int_col: Series[int]

class OutSchema(InSchema):
    float_col: pa.typing.Series[float]

# read and validate data from a parquet file
class InSchemaParquet(InSchema):
    class Config:
        from_format = "parquet"

# output data as a list of dictionary records
class OutSchemaDict(OutSchema):
    class Config:
        to_format = "dict"
        to_format_kwargs = {"orient": "records"}

@pa.check_types
def transform(df: DataFrame[InSchemaParquet]) -> DataFrame[OutSchemaDict]:
    return df.assign(float_col=1.1)
```

The `transform` function can then take a filepath or buffer containing a parquet file that pandera automatically reads and validates:
```python
import io
import json

import pandas as pd

buffer = io.BytesIO()
data = pd.DataFrame({"str_col": [*"abc"], "int_col": range(3)})
data.to_parquet(buffer)
buffer.seek(0)

dict_output = transform(buffer)
print(json.dumps(dict_output, indent=4))
```

Output:

```
[
    {
        "str_col": "a",
        "int_col": 0,
        "float_col": 1.1
    },
    {
        "str_col": "b",
        "int_col": 1,
        "float_col": 1.1
    },
    {
        "str_col": "c",
        "int_col": 2,
        "float_col": 1.1
    }
]
```


Data Validation with GeoPandas [[Docs](https://pandera.readthedocs.io/en/stable/geopandas.html#supported-lib-geopandas)]

`DataFrameSchema`s can now validate `geopandas.GeoDataFrame` and `GeoSeries` objects:

```python
import geopandas as gpd
import pandas as pd
import pandera as pa
from shapely.geometry import Polygon

geo_schema = pa.DataFrameSchema({
    "geometry": pa.Column("geometry"),
    "region": pa.Column(str),
})

geo_df = gpd.GeoDataFrame({
    "geometry": [
        Polygon(((0, 0), (0, 1), (1, 1), (1, 0))),
        Polygon(((0, 0), (0, -1), (-1, -1), (-1, 0))),
    ],
    "region": ["NA", "SA"],
})

geo_schema.validate(geo_df)
```

You can also define `SchemaModel` classes with a `GeoSeries` field type annotation to create validated `GeoDataFrame`s, or use them in `pa.check_types`-decorated functions for input/output validation:

```python
from pandera.typing import Series
from pandera.typing.geopandas import GeoDataFrame, GeoSeries


class Schema(pa.SchemaModel):
    geometry: GeoSeries
    region: Series[str]


# create a geodataframe that's validated on object initialization
df = GeoDataFrame[Schema](
    {
        'geometry': [
            Polygon(((0, 0), (0, 1), (1, 1), (1, 0))),
            Polygon(((0, 0), (0, -1), (-1, -1), (-1, 0))),
        ],
        'region': ['NA', 'SA'],
    }
)
```

Enhancements

- Support GeoPandas data structures (732)
- Fastapi integration (741)
- add title/description fields (https://github.com/pandera-dev/pandera/pull/754)
- add nullable float dtypes (https://github.com/pandera-dev/pandera/pull/721)

Bugfixes

- typed descriptors and setup.py only includes pandera (739)
- `pa.dataframe_check` works correctly on pandas==1.1.5 (735)
- fix set_index with MultiIndex (751)
- strategies: correctly handle StringArray null values (748)

Docs Improvements

- fastapi docs, add to ci (https://github.com/pandera-dev/pandera/pull/753)

Testing Improvements

- Add Python 3.10 to CI matrix (https://github.com/pandera-dev/pandera/pull/724)

Contributors

Big shout out to the following folks for your contributions on this release 🎉🎉🎉
- roshcagra
- cristianmatache
- jamesmyatt
- smackesey
- vovavili

0.8.1

Enhancements

- add `__all__` declaration to root module for better editor autocompletion 42e60c63dddb38b58c6014a14b6fc97b6a3a1e0c
- fix: expose nullable boolean in pandera.typing 5f9c713c6a9e3654f23e3c1a24b7ad39a8979e2e
- type annotations for DataFrameSchema (700)
- add head of coerce failure cases (710)
- add mypy plugin (701)
- make SchemaError and SchemaErrors picklable (722)
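The mypy plugin above is enabled through mypy's configuration; a minimal sketch, assuming a standard `mypy.ini` (see the pandera mypy integration docs for the authoritative setup):

```ini
[mypy]
plugins = pandera.mypy
```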

Bugfixes

- Only concat and drop_duplicates if more than one of {sample,head,tail} are present d3bc974736b8e3b1fdc3b69fd1994a6541b8ac27, f75616680a36c55f807589cefa5c16b86f6a9fdc, 20a631fa61882db6b4c0820bb1d783b7914e4027
- fix field autocompletion (702)

Docs Improvements

- Update contributing documentation: how to add dependencies 696
- update package description in setup.py eb130b4117562e382efb646408002ee11bb5445f
- Fix broken links in dataframe_schemas.rst (708)

Contributors

Big shout out to the following folks for your contributions on this release 🎉🎉🎉
- smackesey
- GustavoGB
- zevisert
- gordonhart
- nickolay
- matthiashuschle

0.8.0

Community Announcements

Pandera now has a discord community! Join us if you need help, want to discuss features/bugs, or help other community members 🤝

[![Discord](https://img.shields.io/badge/discord-chat-purple?color=%235765F2&label=discord&logo=discord)](https://discord.gg/vyanhWuaKB)

Highlights

Schema support for Dask, Koalas, Modin

Excited to announce that `0.8.0` is the first release that adds built-in support for additional dataframe types beyond [Pandas](https://pandas.pydata.org/): you can now use the exact same `DataFrameSchema` objects or `SchemaModel` classes to validate [Dask](https://dask.org/), [Modin](https://modin.readthedocs.io/en/latest/), and [Koalas](https://koalas.readthedocs.io/en/latest/index.html) dataframes.

```python
import dask.dataframe as dd
import pandas as pd
import pandera as pa

from pandera.typing import Series
from pandera.typing import dask, koalas, modin

class Schema(pa.SchemaModel):
    state: Series[str]
    city: Series[str]
    price: Series[int] = pa.Field(in_range={"min_value": 5, "max_value": 20})

@pa.check_types
def dask_function(ddf: dask.DataFrame[Schema]) -> dask.DataFrame[Schema]:
    return ddf[ddf["state"] == "CA"]

@pa.check_types
def koalas_function(df: koalas.DataFrame[Schema]) -> koalas.DataFrame[Schema]:
    return df[df["state"] == "CA"]

@pa.check_types
def modin_function(df: modin.DataFrame[Schema]) -> modin.DataFrame[Schema]:
    return df[df["state"] == "CA"]
```

And `DataFrameSchema` objects will work on all dataframe types:

```python
schema: pa.DataFrameSchema = Schema.to_schema()

schema(dask_df)
schema(modin_df)
schema(koalas_df)
```


Pydantic Integration

`pandera.SchemaModel`s are fully compatible with pydantic:

```python
import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Series
import pydantic


class SimpleSchema(pa.SchemaModel):
    str_col: Series[str] = pa.Field(unique=True)


class PydanticModel(pydantic.BaseModel):
    x: int
    df: DataFrame[SimpleSchema]


valid_df = pd.DataFrame({"str_col": ["hello", "world"]})
PydanticModel(x=1, df=valid_df)

invalid_df = pd.DataFrame({"str_col": ["hello", "hello"]})
PydanticModel(x=1, df=invalid_df)
```


Error:

```python
Traceback (most recent call last):
...
ValidationError: 1 validation error for PydanticModel
df
  series 'str_col' contains duplicate values:
1    hello
Name: str_col, dtype: object (type=value_error)
```


Mypy Integration

Pandera now supports static type-linting of `DataFrame` types with mypy out of the box so you can catch certain classes of errors at lint-time.

```python
import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Series

class Schema(pa.SchemaModel):
    id: Series[int]
    name: Series[str]

class SchemaOut(pa.SchemaModel):
    age: Series[int]

class AnotherSchema(pa.SchemaModel):
    foo: Series[int]

def fn(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return df.assign(age=30).pipe(DataFrame[SchemaOut])  # mypy okay

def fn_pipe_incorrect_type(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return df.assign(age=30).pipe(DataFrame[AnotherSchema])  # mypy error
    # error: Argument 1 to "pipe" of "NDFrame" has incompatible type "Type[DataFrame[Any]]";
    # expected "Union[Callable[..., DataFrame[SchemaOut]], Tuple[Callable[..., DataFrame[SchemaOut]], str]]"  [arg-type]  # noqa

schema_df = DataFrame[Schema]({"id": [1], "name": ["foo"]})
pandas_df = pd.DataFrame({"id": [1], "name": ["foo"]})

fn(schema_df)  # mypy okay
fn(pandas_df)  # mypy error
# error: Argument 1 to "fn" has incompatible type "pandas.core.frame.DataFrame";
# expected "pandera.typing.pandas.DataFrame[Schema]"  [arg-type]
```



Enhancements

* 735e7fe implement dataframe types (672)
* 46dc3a2 Support mypy (650)
* 02063c8 Add Basic Dask Support (665)
* b7f6516 Modin support (660)
* cdf4667 Add Pydantic support (659)
* 12378ea Support Koalas (658)
* 62d689d improve lazy validation performance for nullable cases (655)

Bugfixes

* 7a98e23 bugfix: support nullable empty strategies (638)
* 5ec4611 Fix remaining unrecognized numpy dtypes (637)
* 96d6516 Correctly handling single string constraints (670)

Docs Improvements

* 1860685 add pyproject.toml, update doc typos
* 3c086a9 add discord link, update readme, docs (674)
* d75298f more detailed docstring of pandera.model_components.Field (671)
* 96415a0 Add strictly typed pandas to readme (649)

Testing Improvements

* 0a72a51 update suppression of health checks (653)

Internals Improvements

* fdcdb91 Reuse coerce in engines.utils (645)
* 655dd85 remove assumption from nullable strategies (641)

Contributors

Big shout out to the following folks for your contributions on this release 🎉🎉🎉
- sbrugman
- rbngz
- jeffzi
- bphillips-exos
- thorben-flapo
- tfwillems: special shout out here for contributing a good chunk of the code for the pydantic plugin 659

0.7.2

Bugfixes

- Strategies should not rely on pandas dtype aliases (620)
- support timedelta in data synthesis strats (621)
- fix multiindex error reporting (622)
- Pin pylint (629)
- exclude np.float128 type registration in MacM1 (624)
- fix numpy_pandas_coercible bug dealing with single element (626)
- update pylint (630)

0.7.1

Enhancements

- add support for Any annotation in schema model (594)
- add support for timezone-aware datetime strategies (595)
- `unique` keyword arg: replace and deprecate `allow_duplicates` (580)
- Add support for empty data type annotation in SchemaModel (602)
- support frictionless primary keys with multiple fields (608)
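The `unique`/`allow_duplicates` swap above boils down to a per-column uniqueness check; a plain-pandas sketch of what `unique=True` enforces (the sample data is hypothetical):

```python
import pandas as pd

# unique=True fails validation when a column contains repeated values;
# duplicated() marks every repeat after the first occurrence.
s = pd.Series(["a", "b", "b"])
has_duplicates = bool(s.duplicated().any())
print(has_duplicates)
```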

Bugfixes

- unify `typing.DataFrame` class definitions (576)
- schemas with multi-index columns correctly report errors (600)
- strategies module supports undefined checks in regex columns (599)
- fix validation of check raising error without message (613)

Docs Improvements

- Tutorial: docs/scaling - Bring Pandera to Spark and Dask (588)

Repo Improvements

- use virtualenv instead of conda in ci (578)

Dependency Changes

- remove frictionless from core pandera deps (609)
- docs/requirements.txt pin setuptools (611)

Contributors

🎉🎉 Big shout out to all the contributors on this release 🎉🎉

- admackin
- jeffzi
- tfwillems
- fkrull8
- kvnkho
