The primary goal of this release was to improve the performance of longer data pipelines. The release also includes several API additions and a few minor breaking changes.
Performance Improvements
The largest under-the-hood change is that all operations are now lazy by default. `0.2.0` calculates a new list at every transformation. Laziness was initially implemented using generators, but this could lead to unexpected behavior because a generator can only be consumed once. The problem with this approach is highlighted in #20. Code sample below:
```python
from functional import seq

def gen():
    for e in range(5):
        yield e

nums = gen()
s = seq(nums)
s.map(lambda x: x * 2).sum()
# prints 20
s.map(lambda x: x * 2).sum()
# prints 0

s = seq([1, 2, 3, 4])
a = s.map(lambda x: x * 2)
a.sum()
# prints 20
a.sum()
# prints 0
```
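The root cause can be reproduced without the library. The following is a minimal sketch using only Python builtins (`map` and `sum`), not `ScalaFunctional` code: a generator yields each element once, so a second pass over it sees nothing.

```python
def gen():
    for e in range(5):
        yield e

nums = gen()

# The first pass consumes every element of the generator.
first = sum(map(lambda x: x * 2, nums))

# The second pass finds the generator already exhausted.
second = sum(map(lambda x: x * 2, nums))

print(first, second)  # prints 20 0
```

Any lazy design built directly on generators inherits this one-shot behavior unless it stores the results somewhere.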
Either `ScalaFunctional` would need to aggressively cache results, or a new approach was needed. That new approach is called lineage. The basic concept is that `ScalaFunctional`:
1. Tracks the most recent concrete data (e.g., a list of objects)
2. Tracks the list of transformations that need to be applied to that data to find the answer
3. Whenever an expression is evaluated, caches the result as the new concrete data for (1) and returns it
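The three steps above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the library's actual implementation: the class `ToySequence` and its methods `to_list` and `lineage` are hypothetical names invented for this example, and only `map` is supported.

```python
class ToySequence:
    """Toy model of lineage tracking (hypothetical, not ScalaFunctional's classes)."""

    def __init__(self, data, history=None, pending=None):
        # (1) Most recent concrete data; a generator is materialized once here.
        self._data = list(data)
        # Human-readable record of the steps, used only for display.
        self._history = list(history or ["sequence"])
        # (2) Element-wise functions that still need to be applied.
        self._pending = list(pending or [])

    def map(self, func):
        # Lazy: record the step and return a new object without computing anything.
        return ToySequence(self._data, self._history + ["map"], self._pending + [func])

    def to_list(self):
        # (3) Apply pending steps, then store the result as the new concrete data.
        if self._pending:
            data = self._data
            for func in self._pending:
                data = [func(x) for x in data]
            self._data = data
            self._pending = []
            self._history.append("cache")
        return list(self._data)

    def sum(self):
        return sum(self.to_list())

    def lineage(self):
        return " -> ".join(self._history)


def gen():
    for e in range(5):
        yield e


doubled = ToySequence(gen()).map(lambda x: x * 2)
before = doubled.lineage()  # "sequence -> map"
first = doubled.sum()       # 20
second = doubled.sum()      # 20 again, served from the cached data
after = doubled.lineage()   # "sequence -> map -> cache"
```

Because the concrete data is stored rather than streamed, repeated evaluation gives the same answer even when the input was a generator.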
As a result, the problems above are fixed. Below is an example showing how the backend calculates results:
```python
from functional import seq

In [8]: s = seq(1, 2, 3, 4)

In [9]: s._lineage
Out[9]: Lineage: sequence

In [10]: s0 = s.map(lambda x: x * 2)

In [11]: s0._lineage
Out[11]: Lineage: sequence -> map(<lambda>)

In [12]: s0
Out[12]: [2, 4, 6, 8]

In [13]: s0._lineage
Out[13]: Lineage: sequence -> map(<lambda>) -> cache
```
Note that initially, since the expression has not been evaluated, nothing is cached. Printing `s0` in the REPL calls `__repr__`, which evaluates the expression and caches the result, so it is not recomputed if `s0` is used again. You can also call `cache()` directly if desired. You may also notice that `seq` can now take its elements as separate arguments, as in `seq(1, 2, 3, 4)` (added in #27).
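To make the "evaluated once, then reused" behavior concrete, here is a small sketch of an object whose `__repr__` forces evaluation. The class `LazyDoubles` and its `evaluations` counter are hypothetical and exist only for this illustration; the real library's internals may differ.

```python
class LazyDoubles:
    """Hypothetical stand-in for a lazily evaluated sequence that doubles its input."""

    def __init__(self, data):
        self._source = list(data)
        self._cached = None   # filled in on first evaluation
        self.evaluations = 0  # counts how often the work is actually done

    def cache(self):
        # Evaluate once and remember the result; later calls reuse it.
        if self._cached is None:
            self.evaluations += 1
            self._cached = [x * 2 for x in self._source]
        return self

    def __repr__(self):
        # A REPL calls __repr__ to display a value, which forces evaluation here.
        return repr(self.cache()._cached)


s0 = LazyDoubles([1, 2, 3, 4])
evaluations_before = s0.evaluations  # 0, nothing has been computed yet
first = repr(s0)                     # "[2, 4, 6, 8]", triggers the computation
second = repr(s0)                    # same text, served from the cache
evaluations_after = s0.evaluations   # 1, the work ran only once
```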
Next up
Next up are improvements to the documentation and a redo of `README.md`. The next release will focus on extending `ScalaFunctional` to work with other data input/output and on further usability improvements. This release also marks relative stability in the collections API. Everything that seemed worth porting from Scala/Spark has been completed, with a few additions (predominantly left, right, inner, and outer joins). There are currently no foreseeable breaking changes.