We're proud to release version 0.3.0 of Daft! Please note that with this minor version increment, v0.3 contains several breaking changes:
- `daft.read_delta_lake`
  - This function was deprecated in favor of `daft.read_deltalake` in v0.2.26 and is now removed. (2663)
- `daft.read_parquet` / `daft.read_csv` / `daft.read_json`
  - Schema hints are deprecated in favor of `infer_schema` (whether to turn on schema inference) and `schema` (a definitive schema if `infer_schema` is False; otherwise a schema hint that is applied after inference). (2326)
- `Expression.str.normalize()`
  - All parameters are now False by default and need to be toggled on individually. (2647)
- `DataFrame.agg` / `GroupedDataFrame.agg`
  - Tuple syntax for aggregations was deprecated in v0.2.18 and is no longer supported. Please use aggregation expressions instead. (2663)
    - Ex: `df.agg([(col("x"), "sum"), (col("y"), "mean")])` should instead be written as `df.agg(col("x").sum(), col("y").mean())`
- `DataFrame.count`
  - Calling `.count()` with no arguments now returns a DataFrame with a single column "count" containing the length of the entire DataFrame, instead of the count for each of the columns. (1996)
- `DataFrame.with_column`
  - Resource requests should now be specified on UDF expressions (`udf(num_gpus=...)`) instead of on Projections (through `.with_column(..., resource_request=...)`). (2654)
- `DataFrame.join`
  - When joining two DataFrames, columns will now be merged only if they exactly match join keys. (2631)
  - Ex:

    ```python
    import daft
    from daft import col

    df1 = daft.from_pydict({
        "a": ["x", "y"],
        "b": [1, 2],
    })
    df2 = daft.from_pydict({
        "a": ["y", "z"],
        "b": [20, 30],
    })
    result_df = df1.join(
        df2,
        left_on=[col("a"), col("b")],
        right_on=[col("a"), col("b") / 10],  # NOTE THE "/10"
        how="outer",
    )
    result_df.sort("a").collect()
    ```

    Before:

    ```
    ╭──────┬───────╮
    │ a    ┆ b     │
    │ ---  ┆ ---   │
    │ Utf8 ┆ Int64 │
    ╞══════╪═══════╡
    │ x    ┆ 1     │
    ├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
    │ y    ┆ 2     │
    ├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
    │ z    ┆ 30    │
    ╰──────┴───────╯
    ```

    After:

    ```
    ╭──────┬───────┬─────────╮
    │ a    ┆ b     ┆ right.b │
    │ ---  ┆ ---   ┆ ---     │
    │ Utf8 ┆ Int64 ┆ Int64   │
    ╞══════╪═══════╪═════════╡
    │ x    ┆ 1     ┆ None    │
    ├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
    │ y    ┆ 2     ┆ 20      │
    ├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
    │ z    ┆ None  ┆ 30      │
    ╰──────┴───────╯─────────╯
    ```
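
The column-merging rule in the join example above can be modeled in a few lines of plain Python. This is only an illustrative sketch, not Daft's implementation: `joined_column_names` is a hypothetical helper, join keys are represented as strings (a bare column name, or any other text standing in for a computed expression such as `col("b") / 10`), and the `right.` prefix for clashing names is assumed from the "after" table.

```python
from typing import List


def joined_column_names(
    left_cols: List[str],
    right_cols: List[str],
    left_on: List[str],
    right_on: List[str],
) -> List[str]:
    """Model the output column names of a join under the v0.3 rule.

    Hypothetical helper for illustration only; it does not call Daft.
    """
    # A key pair is merged only when both sides name the same plain column.
    merged = {
        left_key
        for left_key, right_key in zip(left_on, right_on)
        if left_key == right_key
        and left_key in left_cols
        and right_key in right_cols
    }

    output = list(left_cols)
    for name in right_cols:
        if name in merged:
            continue  # already represented by the merged key column
        # Assumption: a clashing right-side name gets a "right." prefix,
        # as seen in the "after" table of the release notes.
        output.append(f"right.{name}" if name in left_cols else name)
    return output


# Mirrors the example: "a" matches exactly, "b" is joined against "b / 10".
columns = joined_column_names(
    left_cols=["a", "b"],
    right_cols=["a", "b"],
    left_on=["a", "b"],
    right_on=["a", "b / 10"],
)
print(columns)
```

Under this model, only `a` is merged, because the right-hand key for `b` is an expression rather than the column itself, so the right side's `b` survives as a separate `right.b` column. When both key pairs name identical columns, the model returns just `a` and `b`.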
Changes
✨ New Features
- [FEAT] Ellipsize scan task sources if too many Vince7778 (2695)
- [FEAT] Allow user provided schema and schema inference length for read\_sql colin-ho (2676)
- [FEAT] Add dataframe iteration on rows and change default buffer size jaychia (2685)
- [FEAT]: add to\_arrow\_iter universalmind303 (2681)
- [FEAT] Example Analyze for Local Execution Engine samster25 (2648)
- [FEAT] (ACTORS-1) Add DAFT\_ENABLE\_ACTOR\_POOL\_PROJECTS=1 feature flag and specifying concurrency jaychia (2668)
- [FEAT]: sql like \& ilike universalmind303 (2666)
- [FEAT] Changes the default count() behavior to perform a global row count instead jaychia (2653)
- [FEAT] Support passing in column name strings to `to_struct` Vince7778 (2671)
- [FEAT]: refactor tree display to get more info into physicalplan universalmind303 (2640)
- [FEAT] Add `to_struct` function for merging columns into a struct Vince7778 (2662)
- [FEAT] Add hashing and groupby on structs Vince7778 (2657)
- [FEAT]: `daft.sql_expr` universalmind303 (2656)
- [FEAT] Deprecates usage of resource\_request on df.with\_column API jaychia (2654)
- [FEAT] Add input batching for UDFs Vince7778 (2651)
- [FEAT] Add `cbrt` expression raunakab (2646)
- [FEAT] use ObfuscatedString to hide creds when Display IOConfig samster25 (2645)
- [FEAT]: more sql functions universalmind303 (2596)
- [FEAT] Support \_\_init\_\_ arguments for StatefulUDFs jaychia (2634)
- [FEAT] Move resource requests to UDFs instead of on with\_column jaychia (2632)
- [FEAT] Add wildcards in column expressions Vince7778 (2629)
- [FEAT] factor mermaid builder into it's own module to use independently samster25 (2636)
- [FEAT] Remote parquet streaming colin-ho (2620)
- [FEAT]: mermaid formatter universalmind303 (2619)
- [FEAT] Add ActorPoolProject logical and physical plans jaychia (2601)
- [FEAT] Enable broadcast strategy on anti and semi joins kevinzwang (2621)
- [FEAT] Add `.list.sort()` for sorting lists within a list column Vince7778 (2589)
- [FEAT] Streaming Local Parquet Reads colin-ho (2592)
🚀 Performance Improvements
- [PERF] Add ability to automatically choose broadcast for anti/semi joins kevinzwang (2699)
- [PERF] Swordfish Dynamic Pipelines samster25 (2599)
- [PERF] Dyn Compare + Probe Table samster25 (2618)
👾 Bug Fixes
- [BUG] Fix Parquet reads with chunk sizing desmondcheongzx (2658)
- [BUG]: repr mermaid fix universalmind303 (2688)
- [BUG] Use Daft Pickle instead of Ray Pickle and use bincode for serializing samster25 (2693)
- [BUG] Add timeout to analytics client raunakab (2670)
- [BUG] Fix swordfish inner joins colin-ho (2678)
- [BUG] Fix struct `.hash()` naming bug Vince7778 (2673)
- [BUG] Fix filter pushdown into non-inner joins kevinzwang (2659)
- [BUG] Fix issues where we check "is\_ray\_runner" on non-initialized contexts jaychia (2652)
- [BUG] Fix nested parquet reads for .show() and .limit() desmondcheongzx (2643)
- [BUG] Fix join op names and join key definition kevinzwang (2631)
- [BUG] Fix projection pushdowns not working with limits Vince7778 (2635)
- [BUG] Fix Expr::with\_new\_children for ScalarFunction kevinzwang (2624)
- [BUG] Fix pushdown past monotonically increasing id Vince7778 (2622)
📖 Documentation
- [CHORE] Fix FOTW 001 images notebook jaychia (2697)
- [DOCS] Add join types, renaming behavior, and example to join docs kevinzwang (2691)
- [FEAT] Add dataframe iteration on rows and change default buffer size jaychia (2685)
- [DOCS]: add docs for cosine\_distance universalmind303 (2675)
- [FEAT] Add `to_struct` function for merging columns into a struct Vince7778 (2662)
- [CHORE] Turn v0.3 deprecations into breaking changes kevinzwang (2663)
- [FEAT] Add `cbrt` expression raunakab (2646)
- [FEAT] Support \_\_init\_\_ arguments for StatefulUDFs jaychia (2634)
- [FEAT] Move resource requests to UDFs instead of on with\_column jaychia (2632)
- [FEAT] Add wildcards in column expressions Vince7778 (2629)
- [DOCS] Enable doc tests in CI colin-ho (2615)
- [FEAT] Add `.list.sort()` for sorting lists within a list column Vince7778 (2589)
- docs: Add fotw tutorial on working with images avriiil (2490)
🧰 Maintenance
- [CHORE] fix merge conflict in repr tests samster25 (2700)
- [CHORE] Fix FOTW 001 images notebook jaychia (2697)
- [CHORE] Deprecate schema hints colin-ho (2655)
- [CHORE] Add error snafus for local executor colin-ho (2660)
- [FEAT]: refactor tree display to get more info into physicalplan universalmind303 (2640)
- [CHORE] Turn v0.3 deprecations into breaking changes kevinzwang (2663)
- [CHORE]: Drop use of deprecated form "default\_features" universalmind303 (2665)
- [CHORE] bump dev version to 0.3.0 samster25 (2664)
- [CHORE]: fix feature flags universalmind303 (2661)
- [CHORE] Set `Expression.str.normalize()` options to False by default Vince7778 (2647)
- [CHORE] Improve swordfish error handling colin-ho (2628)
- [CHORE] Add ignore for helix editor raunakab (2642)
- [CHORE] Add toolchain check to Makefile Vince7778 (2641)
- [CHORE] Upgrade Rust toolchain to 2024-08-01 Vince7778 (2639)
- [CHORE] Track memory for swordfish tpch colin-ho (2633)
- [CHORE] Split resource-request and hashable-float-wrapper into utility crates jaychia (2630)
- [CHORE] Use parquet for native tpch benchmarks colin-ho (2609)
- [CHORE] Refactor UDFs to separate stateful and stateless jaychia (2597)