Highlights
FastAPI Integration [[Docs](https://pandera.readthedocs.io/en/stable/fastapi.html)]
`pandera` now integrates with [FastAPI](https://fastapi.tiangolo.com/). You can annotate app endpoint arguments with `DataFrame[Schema]` types, and the endpoint will validate incoming and outgoing data.
```python
import pandera as pa


# schema definitions
class Transactions(pa.SchemaModel):
    id: pa.typing.Series[int]
    cost: pa.typing.Series[float] = pa.Field(ge=0, le=1000)

    class Config:
        coerce = True


class TransactionsOut(Transactions):
    id: pa.typing.Series[int]
    cost: pa.typing.Series[float]
    name: pa.typing.Series[str]


class TransactionsDictOut(TransactionsOut):
    class Config:
        to_format = "dict"
        to_format_kwargs = {"orient": "records"}
```
App endpoint example:
```python
from fastapi import FastAPI
from pandera.typing import DataFrame

app = FastAPI()


@app.post("/transactions/", response_model=DataFrame[TransactionsDictOut])
def create_transactions(transactions: DataFrame[Transactions]):
    output = transactions.assign(name="foo")
    ...  # do other stuff, e.g. update backend database with transactions
    return output
```
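As a rough sketch of how a client might call this endpoint, the snippet below builds (but does not send) a POST request using only the standard library. The host and port, the `build_request` helper, and the column-oriented payload shape are assumptions made for illustration; see the linked docs for the request format the integration actually expects.

```python
import json
from urllib import request

# Hypothetical local address; adjust to wherever the app is served.
URL = "http://127.0.0.1:8000/transactions/"


def build_request(payload: dict) -> request.Request:
    """Wrap a JSON-serializable payload in a POST request for the endpoint."""
    body = json.dumps(payload).encode("utf-8")
    return request.Request(
        URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed payload shape: one list of values per schema column.
req = build_request({"id": [1], "cost": [10.99]})
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` against a running app would send the request. Given the endpoint above, the response should then contain records that include the added `name` field.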
Data Format Conversion [[Docs](https://pandera.readthedocs.io/en/stable/data_format_conversion.html)]
The class-based API now supports automatically deserializing/serializing pandas dataframes in the context of `pa.check_types`-decorated functions, `pydantic.validate_arguments`-decorated functions, and fastapi endpoint functions.
```python
import pandera as pa
from pandera.typing import DataFrame, Series


# base schema definitions
class InSchema(pa.SchemaModel):
    str_col: Series[str] = pa.Field(unique=True, isin=[*"abcd"])
    int_col: Series[int]


class OutSchema(InSchema):
    float_col: Series[float]


# read and validate data from a parquet file
class InSchemaParquet(InSchema):
    class Config:
        from_format = "parquet"


# output data as a list of dictionary records
class OutSchemaDict(OutSchema):
    class Config:
        to_format = "dict"
        to_format_kwargs = {"orient": "records"}


@pa.check_types
def transform(df: DataFrame[InSchemaParquet]) -> DataFrame[OutSchemaDict]:
    return df.assign(float_col=1.1)
```
The `transform` function can then take a filepath or buffer containing a parquet file that pandera automatically reads and validates:
```python
import io
import json

import pandas as pd

# write a small dataframe to an in-memory parquet buffer
buffer = io.BytesIO()
data = pd.DataFrame({"str_col": [*"abc"], "int_col": range(3)})
data.to_parquet(buffer)
buffer.seek(0)

dict_output = transform(buffer)
print(json.dumps(dict_output, indent=4))
```
Output:
```
[
    {
        "str_col": "a",
        "int_col": 0,
        "float_col": 1.1
    },
    {
        "str_col": "b",
        "int_col": 1,
        "float_col": 1.1
    },
    {
        "str_col": "c",
        "int_col": 2,
        "float_col": 1.1
    }
]
```
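To make the `{"orient": "records"}` setting concrete, here is a minimal plain-Python sketch of turning column-oriented data into a list of per-row records. `to_records` is a hypothetical helper written for illustration; it is not pandera's or pandas' implementation.

```python
def to_records(columns: dict) -> list:
    """Convert {column: [values...]} into a list of per-row dicts."""
    names = list(columns)
    rows = zip(*(columns[name] for name in names))
    return [dict(zip(names, row)) for row in rows]


columns = {
    "str_col": ["a", "b", "c"],
    "int_col": [0, 1, 2],
    "float_col": [1.1, 1.1, 1.1],
}
records = to_records(columns)
print(records[0])  # {'str_col': 'a', 'int_col': 0, 'float_col': 1.1}
```

Each record maps column names to the values of one row, which is the shape shown in the output above.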
Data Validation with GeoPandas [[Docs](https://pandera.readthedocs.io/en/stable/geopandas.html#supported-lib-geopandas)]
`DataFrameSchema`s can now validate `geopandas.GeoDataFrame` and `GeoSeries` objects:
```python
import geopandas as gpd
import pandera as pa
from shapely.geometry import Polygon

geo_schema = pa.DataFrameSchema({
    "geometry": pa.Column("geometry"),
    "region": pa.Column(str),
})

geo_df = gpd.GeoDataFrame({
    "geometry": [
        Polygon(((0, 0), (0, 1), (1, 1), (1, 0))),
        Polygon(((0, 0), (0, -1), (-1, -1), (-1, 0)))
    ],
    "region": ["NA", "SA"]
})

geo_schema.validate(geo_df)
```
You can also define `SchemaModel` classes with a `GeoSeries` field type annotation to create validated `GeoDataFrame`s, or use them in `pa.check_types`-decorated functions for input/output validation:
```python
from pandera.typing import Series
from pandera.typing.geopandas import GeoDataFrame, GeoSeries


class Schema(pa.SchemaModel):
    geometry: GeoSeries
    region: Series[str]


# create a geodataframe that's validated on object initialization
df = GeoDataFrame[Schema](
    {
        "geometry": [
            Polygon(((0, 0), (0, 1), (1, 1), (1, 0))),
            Polygon(((0, 0), (0, -1), (-1, -1), (-1, 0)))
        ],
        "region": ["NA", "SA"]
    }
)
```
Enhancements
- Support GeoPandas data structures (732)
- FastAPI integration (741)
- add title/description fields (https://github.com/pandera-dev/pandera/pull/754)
- add nullable float dtypes (https://github.com/pandera-dev/pandera/pull/721)
Bugfixes
- typed descriptors and setup.py only includes pandera (739)
- `pa.dataframe_check` works correctly on pandas==1.1.5 (735)
- fix set_index with MultiIndex (751)
- strategies: correctly handle StringArray null values (748)
Docs Improvements
- fastapi docs, add to ci (https://github.com/pandera-dev/pandera/pull/753)
Testing Improvements
- Add Python 3.10 to CI matrix (https://github.com/pandera-dev/pandera/pull/724)
Contributors
Big shout out to the following folks for your contributions on this release 🎉🎉🎉
- roshcagra
- cristianmatache
- jamesmyatt
- smackesey
- vovavili