`apply` and `transform` Improvements
We added support for positional and keyword arguments in `apply`, `apply_batch`, `transform`, and `transform_batch` for `DataFrame`, `Series`, and `GroupBy`. (1484, 1485, 1486)
```py
>>> import databricks.koalas as ks
>>> ks.range(10).apply(lambda a, b, c: a + b + c, args=(1,), c=3)
   id
0   4
1   5
2   6
3   7
4   8
5   9
6  10
7  11
8  12
9  13
```
```py
>>> ks.range(10).transform_batch(lambda pdf, a, b, c: pdf.id + a + b + c, 1, 2, c=3)
0     6
1     7
2     8
3     9
4    10
5    11
6    12
7    13
8    14
9    15
Name: id, dtype: int64
```
```py
>>> kdf = ks.DataFrame(
...     {"a": [1, 2, 3, 4, 5, 6], "b": [1, 1, 2, 3, 5, 8], "c": [1, 4, 9, 16, 25, 36]},
...     columns=["a", "b", "c"])
>>> kdf.groupby(["a", "b"]).apply(lambda x, y, z: x + x.min() + y + z, 1, z=2)
    a   b   c
0   5   5   5
1   7   5  11
2   9   7  21
3  11   9  35
4  13  13  53
5  15  19  75
```
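The examples above use two calling conventions: `apply` on a DataFrame takes extra positional arguments through `args=(...)`, while `transform_batch` and `GroupBy.apply` take them directly after the function. The following plain-Python sketch models only that argument forwarding. The helpers `apply_like` and `transform_batch_like` are hypothetical names, the lists stand in for column data, and none of this reflects how Koalas is implemented internally.

```python
def apply_like(values, func, args=(), **kwds):
    """Model of the `apply` convention: extra positionals travel in `args`."""
    # Each value is passed first, followed by the forwarded arguments.
    return [func(v, *args, **kwds) for v in values]


def transform_batch_like(batch, func, *args, **kwargs):
    """Model of the `*_batch` convention: extra positionals follow the function."""
    # The whole batch is passed first, followed by the forwarded arguments.
    return func(batch, *args, **kwargs)


ids = list(range(10))  # stand-in for the `id` column of ks.range(10)

# b comes from args=(1,), c from the keyword argument.
applied = apply_like(ids, lambda a, b, c: a + b + c, args=(1,), c=3)

# a and b are given positionally, c as a keyword argument.
transformed = transform_batch_like(
    ids, lambda batch, a, b, c: [x + a + b + c for x in batch], 1, 2, c=3
)

print(applied)
print(transformed)
```

The two results line up with the values shown in the first two Koalas examples above (each `id` plus 4, and each `id` plus 6).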
Spark Schema
We added `spark_schema` and `print_schema` to inspect the underlying Spark schema. (1446)
```py
>>> import numpy as np
>>> import pandas as pd
>>> kdf = ks.DataFrame({'a': list('abc'),
...                     'b': list(range(1, 4)),
...                     'c': np.arange(3, 6).astype('i1'),
...                     'd': np.arange(4.0, 7.0, dtype='float64'),
...                     'e': [True, False, True],
...                     'f': pd.date_range('20130101', periods=3)},
...                    columns=['a', 'b', 'c', 'd', 'e', 'f'])

>>> # Print the schema out in Spark's DDL formatted string
>>> kdf.spark_schema().simpleString()
'struct<a:string,b:bigint,c:tinyint,d:double,e:boolean,f:timestamp>'
>>> kdf.spark_schema(index_col='index').simpleString()
'struct<index:bigint,a:string,b:bigint,c:tinyint,d:double,e:boolean,f:timestamp>'

>>> # Print out the schema in the same format as DataFrame.printSchema()
>>> kdf.print_schema()
root
 |-- a: string (nullable = false)
 |-- b: long (nullable = false)
 |-- c: byte (nullable = false)
 |-- d: double (nullable = false)
 |-- e: boolean (nullable = false)
 |-- f: timestamp (nullable = false)

>>> kdf.print_schema(index_col='index')
root
 |-- index: long (nullable = false)
 |-- a: string (nullable = false)
 |-- b: long (nullable = false)
 |-- c: byte (nullable = false)
 |-- d: double (nullable = false)
 |-- e: boolean (nullable = false)
 |-- f: timestamp (nullable = false)
```
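To make the `simpleString()` output above easier to read, here is a small stand-alone sketch that rebuilds the same `struct<...>` string from column names and pandas dtype names. The mapping `DTYPE_TO_SPARK` and the helper `simple_string` are hypothetical and cover only the six dtypes in the example; the `bigint` index type is an assumption taken from the output shown, not from Koalas' source. Note that `bigint` and `tinyint` here are the same Spark types that `print_schema` reports as `long` and `byte`.

```python
# Type names copied from the example output above; the mapping is illustrative
# and is not how Koalas derives the schema.
DTYPE_TO_SPARK = {
    "object": "string",
    "int64": "bigint",
    "int8": "tinyint",
    "float64": "double",
    "bool": "boolean",
    "datetime64[ns]": "timestamp",
}


def simple_string(columns, index_col=None):
    """Build a Spark-style 'struct<name:type,...>' string from (name, dtype) pairs."""
    fields = []
    if index_col is not None:
        # Assumption: the index is an integer index, shown as bigint.
        fields.append((index_col, "bigint"))
    fields.extend((name, DTYPE_TO_SPARK[dtype]) for name, dtype in columns)
    return "struct<" + ",".join(f"{name}:{typ}" for name, typ in fields) + ">"


columns = [
    ("a", "object"),
    ("b", "int64"),
    ("c", "int8"),
    ("d", "float64"),
    ("e", "bool"),
    ("f", "datetime64[ns]"),
]

schema = simple_string(columns)
schema_with_index = simple_string(columns, index_col="index")
print(schema)
print(schema_with_index)
```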
GroupBy Improvements
We fixed a number of `GroupBy` bugs, listed below.
- Fix groupby when as_index=False. (1457)
- Make groupby.apply in pandas<0.25 run the function only once per group. (1462)
- Fix Series.groupby on the Series from different DataFrames. (1460)
- Fix GroupBy.head to recognize agg_columns. (1474)
- Fix GroupBy.filter to follow complex group keys. (1471)
- Fix GroupBy.transform to follow complex group keys. (1472)
- Fix GroupBy.apply to follow complex group keys. (1473)
- Fix GroupBy.fillna to use GroupBy._apply_series_op. (1481)
- Fix GroupBy.filter and apply to handle agg_columns. (1480)
- Fix GroupBy apply, filter, and head to ignore temp columns when operating on different DataFrames. (1488)
- Fix GroupBy functions that rely on natural ordering to preserve that order when operating on different DataFrames. (1490)
Other new features and improvements
We added the following new feature:
SeriesGroupBy:
- `filter` (1483)
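For readers unfamiliar with group-wise `filter`, the following plain-Python sketch models the semantics of pandas' `SeriesGroupBy.filter`, which the new Koalas method is meant to mirror: values are kept, in their original order, only if their group satisfies the predicate. The helper `groupby_filter` and the sample data are hypothetical and serve only as a model of the behavior, not as Koalas code.

```python
from collections import defaultdict


def groupby_filter(values, keys, func):
    """Keep (position, value) pairs whose group satisfies `func`."""
    # Collect the members of each group.
    groups = defaultdict(list)
    for value, key in zip(values, keys):
        groups[key].append(value)

    # Evaluate the predicate once per group.
    keep = {key for key, members in groups.items() if func(members)}

    # Return surviving values in their original order, with their positions.
    return [(i, v) for i, (v, k) in enumerate(zip(values, keys)) if k in keep]


keys = ["foo", "bar", "foo", "bar", "foo", "bar"]
values = [1, 2, 3, 4, 5, 6]

# Keep only groups whose mean is greater than 3.
kept = groupby_filter(values, keys, lambda group: sum(group) / len(group) > 3)
print(kept)
```

Here the `foo` group has mean 3 and is dropped, while the `bar` group has mean 4 and is kept.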
Other improvements
- dtype for DateType should be np.dtype("object"). (1447)
- Make reset_index disallow the same name but allow it when drop=True. (1455)
- Fix named aggregation for MultiIndex. (1435)
- Raise a ValueError in a case where it was previously not raised. (1461)
- Fix get_dummies when the prefix parameter is a dict. (1478)
- Simplify DataFrame.columns setter. (1489)