### Added
- **Experimental:** Data ingest using `DataFusion` engine
  - It's an entirely new implementation of data readers and merge strategies
  - It's often over **100x faster** than the `Spark`-based ingest, as it has near-instant startup time and avoids container overhead entirely
  - The new merge strategies can work directly over S3 without downloading all data locally
  - It supports all existing data formats (Parquet, CSV, NdJson, GeoJson, Shapefile)
  - Some advanced CSV / JSON reader options are not yet implemented, most notably `timestampFormat`
  - `Spark` is still used by default for compatibility. To start using `DataFusion`, declare a (potentially no-op) `preprocess` step in your root dataset manifest ([see example](examples/currency_conversion/ca.bankofcanada.exchange-rates.daily.yaml) and the sketch after this list)
  - `Spark`-based ingest will be removed in a future version, with `DataFusion` becoming the default. We do, however, plan to keep supporting `Spark` and all other engines in the `preprocess` step, while `DataFusion` will still handle the initial reading of data and the merging of results
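  A rough sketch of what opting in could look like, assuming the `DatasetSnapshot` manifest layout used by the bundled examples; the source URL, SQL query, and merge settings below are illustrative placeholders rather than an authoritative configuration, so refer to the linked example for a real, working manifest:

  ```yaml
  # Illustrative root dataset manifest with a pass-through `preprocess` step.
  # Field values are placeholders; see the example linked above for the real file.
  kind: DatasetSnapshot
  version: 1
  content:
    name: ca.bankofcanada.exchange-rates.daily
    kind: Root
    metadata:
      - kind: SetPollingSource
        fetch:
          kind: Url
          url: https://example.org/exchange-rates.csv # placeholder source URL
        read:
          kind: Csv
          header: true
        # Declaring this step, even as a no-op, opts the dataset into the
        # DataFusion-based ingest described above.
        preprocess:
          kind: Sql
          engine: datafusion
          query: |
            SELECT * FROM input
        merge:
          kind: Ledger
          primaryKey:
            - date
  ```

  With a manifest along these lines, a subsequent `kamu pull` should go through the `DataFusion` ingest path described above, applying the `preprocess` query between reading and merging.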
### Changed
- All examples now use the `DataFusion` ingest where possible