daffodil Changelog

0.5.6

Add join() compatible with SQL JOIN; improve .name handling; improve .from_csv(), .from_pdf(); fix concat bug; add .attrs (similar to pandas).

This version provides some enhancements just prior to adding extensions for SQL proxy operation.

Added from_lot() class method. Perhaps these can be unified in main init function by examining type of the data passed.
Added join() method, including unit tests.
Added from_pdf() class method, used to parse PDF files with table structure across multiple pages.
Added name argument to from_lod()
Added name argument to from_csv_buff()
Using raw docstring format to avoid complaints about escape characters in derive_join_translator.
Added 'tag_other' boolean parameter to tag all other column names during join, to support chained joins.
Simplified translator_daf table so it is easier to produce by hand and use across many tables being joined.
Added name argument for join() method, to provide the name of the resulting joined instance.
Improve unit tests for derive_join_translator
Added 25 tests for various indexing modes.
corrected parsing of tuple of strings for krows and kcols.
Added name argument for clone_empty() method.
Added omit_other_cols parameter for 'derive_join_translator' method.
This can probably replace the "shared_fields" parameter.
Fixed omit_other_cols so it could be properly omitted.
concat (which is called by append when a daf array is appended) was not using a deep copy when
copying in the frame, so the appended frame shared row data with its source. Added deep copy to concat.
Added from_csv() which will load csv to Daf from local files, s3 path or http/s path.
improved operation of streaming from file to avoid buffer recopying.
from_csv_buff() still exists for those times when a buffer or file-like object already exists.
Added .attrs dictionary to core dataframe instance definition to allow for descriptors to be
provided to users of the dataframe, esp. between when the daf array is defined and built
and when it is used and modified.
Improve file closure by using context manager in from_csv() for local file usage.
Deprecated use of utils; use daf_utils instead.
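
The concat fix in this release comes down to the difference between appending rows by reference and appending a deep copy. The following is a minimal sketch on plain lists of lists, not the daffodil implementation; concat_shallow and concat_deep are hypothetical helper names used only for illustration.

```python
import copy

def concat_shallow(dest_rows, src_rows):
    """Append rows by reference: dest and src share the same inner lists."""
    dest_rows.extend(src_rows)
    return dest_rows

def concat_deep(dest_rows, src_rows):
    """Append independent copies, so later edits do not leak back to src."""
    dest_rows.extend(copy.deepcopy(src_rows))
    return dest_rows

# Shallow case: editing the combined rows also edits the source rows.
src = [[1, 'a'], [2, 'b']]
shared = concat_shallow([], src)
shared[0][1] = 'changed'
shallow_leaked = src[0][1]

# Deep case: the source rows stay untouched.
src = [[1, 'a'], [2, 'b']]
independent = concat_deep([], src)
independent[0][1] = 'changed'
deep_leaked = src[0][1]
```

In the shallow case the edit shows up in the source rows; in the deep case it does not.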

0.5.5

Largest improvements:
- introduction of PYON format.
- added indirect col and support in apply and reduce to handle embedded PYON.
- Added indexing with range and T_lor (list of range) types, for both column and row indexing.
- Added __contains__ method to allow " if key in my_daf: " to test if a given key exists.
- Added daf_sql.py mainly for testing, but will be the basis for extension to sqlite backing with same syntax.
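
Indexing with a range or a list of ranges can be pictured as expanding the selector into a flat list of integer positions before selecting. This is a sketch on a plain list of lists; expand_selector and select are hypothetical helpers, and the T_lor alias below only assumes the "list of range" meaning given above.

```python
from typing import List, Union

T_lor = List[range]  # assumed meaning of "list of range"

def expand_selector(sel: Union[range, T_lor]) -> List[int]:
    """Turn a range, or a list of ranges, into a flat list of indices."""
    if isinstance(sel, range):
        return list(sel)
    indices: List[int] = []
    for r in sel:
        indices.extend(r)
    return indices

def select(lol, rows, cols):
    """Pick the given rows and columns from a list of lists."""
    irows = expand_selector(rows)
    icols = expand_selector(cols)
    return [[lol[i][j] for j in icols] for i in irows]

# 4 x 5 table whose cell value encodes its (row, col) position.
table = [[r * 10 + c for c in range(5)] for r in range(4)]

# Rows 1..2, columns 0..1 plus column 4.
picked = select(table, range(1, 3), [range(0, 2), range(4, 5)])
```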

v0.5.5 (2024-12-27)
add daf_crm.py as demonstration.
add op_import()
add get_csv_column_names() as refactoring in daf_utils for reading csv.
precheck_csv_cols()
compare_lists() -- imported from utils
Introduce ops_daf for running operations, can also use in audit-engine.
operation descriptions taken from docstring.
added 'default_type' to apply_dtypes for any cols not specified in passed dtypes.
Improved preprocessing of csv file when line is commented out and embedded newlines exist in the line.
Improved Daf.from_lod() by using columns in dtypes dict if provided instead of relying only on first record of lod.
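
The from_lod() change matters when the first record lacks a key that later records have. The sketch below is a standalone function on plain Python data, not the Daf method; lod_to_cols_and_lol is a hypothetical name, and filling missing values with '' is an assumption made for the example.

```python
def lod_to_cols_and_lol(lod, dtypes=None):
    """Convert a list of dicts to (column names, list of lists).

    Columns come from the dtypes dict when one is given, otherwise
    from the keys of the first record.
    """
    if dtypes:
        cols = list(dtypes)
    elif lod:
        cols = list(lod[0])
    else:
        cols = []
    # Missing keys are filled with '' (an assumption for this sketch).
    lol = [[rec.get(col, '') for col in cols] for rec in lod]
    return cols, lol

lod = [{'a': 1}, {'a': 2, 'b': 'x'}]

# Relying on the first record alone drops column 'b'.
cols_first, lol_first = lod_to_cols_and_lol(lod)

# Passing dtypes keeps every declared column.
cols_typed, lol_typed = lod_to_cols_and_lol(lod, dtypes={'a': int, 'b': str})
```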

Added indexing with range and T_lor (list of range) types, for both column and row indexing.
Added __contains__ method to allow " if key in my_daf: " to test if a given key exists. Requires kd exists.
revised .sum_da() based on feedback from user group.
Improve formatting of README.md to include tables of examples.
improve daf_benchmarks.py to use objsize instead of pympler to evaluate memory use.
Corrected set_keyfield in daffodil to do nothing if daf is empty.
Added 'sparse_rows' to reduction 'by' type using an indirect_col.
Improve daf_sum() to support indirect_col.
Revised apply_in_place to support by='row_klist'. Func will modify row_klist and that will modify the array.
Changed name of keyword parameter in apply_in_place() from keylist to rowkeys to avoid confusion.
added astype parameter for to_list() and to_value()
Introduced standardization around PYON instead of JSON:
- Easier to convert esp. during serialization using csv.writer().
- Compatible with more Python data types.
- Still easy to convert to JSON.
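
To illustrate the points above, here is a small round trip through csv.writer(). It assumes PYON means Python literal notation, written with repr() and read back with ast.literal_eval(); that reading of the term is an assumption of this sketch, and no daffodil functions are used.

```python
import ast
import csv
import io

# A record holding types that JSON cannot represent directly (tuple, set).
record = {'id': 7, 'tags': ('a', 'b'), 'seen': {1, 2}, 'note': None}

# Write the record into a single CSV cell as a Python literal.
buff = io.StringIO()
csv.writer(buff).writerow(['row1', repr(record)])

# Read the cell back and evaluate it safely as a literal.
buff.seek(0)
name, cell = next(csv.reader(buff))
restored = ast.literal_eval(cell)
```

The tuple and the set come back with their original types, which a plain JSON round trip would not preserve.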
Copied function create_index_at_cursor() for sql tables in daf_benchmarks.py
Added daf_sql.py mainly to support benchmarks at this point.
This will be the last release before sql enhancements.

0.5.4

Add sort_by_colnames(self, colnames:T_ls, reverse: bool=False, length_priority: bool=False)
Add daf_utils.sort_lol_by_cols()
Add argument 'omit_nulls' to .to_list() method.
Change references to klist.values to ._values to avoid ambiguity with property getter and setters.
Add annotate_daf(self, other_daf: 'Daf', my_to_other_dict: T_ds) to effectively join two tables.
Fix value_counts_daf() by adding .to_list for total.
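
A multi-column sort over a list of lists, in the spirit of sort_lol_by_cols(), might look like the sketch below. sort_rows is a hypothetical helper, not the library function, and the meaning given to length_priority here (shorter strings sort first, so '9' precedes '10') is an assumption.

```python
def sort_rows(lol, icols, reverse=False, length_priority=False):
    """Sort rows by the string values found at the given column indices."""
    def key(row):
        parts = []
        for icol in icols:
            text = str(row[icol])
            # Assumed meaning of length_priority: compare length first.
            parts.append((len(text), text) if length_priority else text)
        return tuple(parts)
    return sorted(lol, key=key, reverse=reverse)

rows = [['10', 'x'], ['9', 'y'], ['100', 'z']]

plain = sort_rows(rows, [0])                            # pure string order
by_length = sort_rows(rows, [0], length_priority=True)  # shorter first
```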

0.5.3

added tests:
flatten()
to_json() not completely working.
from_json() not completely working.
added __format__ to allow use of {:,} and other f-string formatting. Invokes .to_value()
added alias for valuecounts_for_colname() to value_counts() to match Pandas syntax.

extend .iloc to support klist and list rtypes.
Added .to_klist() to return a record as KeyedList type.
extended .assign_col() to insert a column if the colname does not exist.
Enhanced KeyedList() to allow both args to be None, and thus initialize to empty KeyedList.
insert_col_in_lol_at_icol():
fix bug if icol resolves to add a column. --> '>' changed to '>='
allow empty lol and create a lol with one column if col_la exists.
Add .iter_list() to allow iteration over just lol without cols defined.
fixed __format__ so unadorned daf name prints summary. It takes more than {daf:} in an f-string to cause formatting.
Improved robustness of num_cols() to check first few rows.
TODO: It will probably be better to keep a value of the num cols and not calculate every time.
changed name of values in KeyedList to _values and created accessors.
added support for Iterables passed for row and col selection.
Added method "remove_dups()" which returns unique records and duplicated records based on keyfield.
Changed operation of assign_col to append col to the right if colname does not exist.
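
The __format__ behavior described in this release (an empty format spec prints a summary, a non-empty spec formats the single value) can be sketched with a toy class. MiniTable is hypothetical and only mimics the idea; it is not the Daf class.

```python
class MiniTable:
    """Toy container showing __format__ delegating to a single value."""

    def __init__(self, rows):
        self.rows = rows

    def to_value(self):
        # Return the single cell of a 1 x 1 table.
        return self.rows[0][0]

    def __format__(self, spec):
        if not spec:
            # Unadorned use, e.g. f"{t}" or f"{t:}", prints a summary.
            return f"<MiniTable {len(self.rows)} row(s)>"
        # Any real spec is applied to the underlying value.
        return format(self.to_value(), spec)

total = MiniTable([[1234567]])
formatted = f"{total:,}"
summary = f"{total}"
```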

Worked around error in Pympler.asizeof.asizeof() function, used in daf_benchmarks.
This appears to be resolved in later versions of pympler.

0.5.2

v0.5.1 (2024-05-25)
changed dependencies in pyproject.toml so they would allow newer versions.
Upgraded to Python 3.11 and upgraded all libraries to the latest.
Using venv311

v0.5.2 (2024-05-30)
Added .iter_dict() and .iter_klist() to force iteration to produce either dicts or KeyedLists.
Producing KeyedLists means the list is not copied into a dict but can be mutated and the lol will be mutated.
Corrected calculation of slice_len to fix column assignment from another column.
This may still have some ambiguity if a nested list structure is meant to be assigned to an array cell.
collist = my_daf[:, 'colname'].to_list()   # returns a list, but sometimes of only one value.
my_daf[:, 'colname2'] = collist            # ambiguous: should a one-item list be placed
                                           # in the cell, or just its value?
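
The difference between iterating as dicts and iterating as keyed lists, noted for .iter_dict() and .iter_klist() above, is whether the row is copied. The sketch below uses a hypothetical RowView class as a stand-in for a keyed list; it is not the library's KeyedList.

```python
class RowView:
    """Hypothetical keyed view: a header dict plus a reference to the row."""

    def __init__(self, hd, values):
        self.hd = hd
        self.values = values  # the same list object as the row, not a copy

    def __getitem__(self, key):
        return self.values[self.hd[key]]

    def __setitem__(self, key, value):
        self.values[self.hd[key]] = value

def iter_dicts(hd, lol):
    """Yield each row copied into a new dict."""
    for row in lol:
        yield dict(zip(hd, row))

def iter_views(hd, lol):
    """Yield each row wrapped in a view that shares the underlying list."""
    for row in lol:
        yield RowView(hd, row)

hd = {'name': 0, 'qty': 1}
lol = [['apple', 1], ['pear', 2]]

for rec in iter_dicts(hd, lol):
    rec['qty'] *= 10              # changes only the temporary dict
after_dicts = [row[:] for row in lol]

for rec in iter_views(hd, lol):
    rec['qty'] *= 10              # writes through to the row in lol
after_views = [row[:] for row in lol]
```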

0.5.0

v0.5.0 (2024-05-23)
Added split_where(self, where: Callable) which makes a single pass and splits the daf array in two:
true_daf and false_daf.
Added to Daffodil multi_groupby(), reduce_dodaf_to_daf() and multi_groupby_reduce()
Added class KeyedList() to provide a new data item that functions like a dict but is an hd (header dict) plus a list.
This can result in much better performance by not redistributing values into the dict structure.
This is not yet fully integrated into daffodil.
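
The single-pass split described for split_where() in this release can be sketched on a plain list of dicts. split_records is a hypothetical standalone function, not the Daf method; it only shows that one loop can fill both outputs.

```python
from typing import Callable, List, Tuple

def split_records(records: List[dict],
                  where: Callable[[dict], bool]) -> Tuple[List[dict], List[dict]]:
    """Split records in one pass into (matching, non-matching)."""
    true_part: List[dict] = []
    false_part: List[dict] = []
    for rec in records:
        # Each record is tested once and routed to exactly one output.
        (true_part if where(rec) else false_part).append(rec)
    return true_part, false_part

records = [{'n': 1}, {'n': 2}, {'n': 3}, {'n': 4}]
evens, odds = split_records(records, lambda r: r['n'] % 2 == 0)
```

Compared with two separate filtering passes, the predicate is evaluated once per record.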

Removed '_da' from many Daffodil methods and for keyword parameters, to allow future upgrade to KeyedList.
select_record_da() -> select_record()
record_append()
_basic_get_record_da -> _basic_get_record
assign_record_da() -> assign_record()
assign_record_da_irow -> assign_record_irow
update_by_keylist()
update_record_da_irow -> update_record_irow
changed test_daf accordingly.

Added _build_hd() to consistently build header dict structure.
Added to_json() and from_json() methods to allow generation of custom JSONEncoder.
Changed nomenclature in KeyedList class from dex to hd.
Added from_json and to_json to KeyedList class to allow custom JSONEncoder to be developed.

select_record() silently returns {} if self is empty.

fixed _itermode vs. itermode.
Added .strip() method.
correct icols when providing a single str column name, and when column names have more than one character each.
Added 'flatten' in '.to_list' method which will combine lol to a single list.
Added .num_rows() which will more robustly calculate the number of rows in edge cases.
Fix unflattening issue discovered when running edge_test_utils.py.
Updated documentation to reflect new approach to dtypes and flattening.
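
Flattening a list of lists into a single list, as the 'flatten' option in .to_list is described above, can be sketched with the standard library. flatten_lol is a hypothetical helper operating on plain lists, not the Daf method.

```python
from itertools import chain

def flatten_lol(lol):
    """Combine a list of lists into one flat list, preserving order."""
    return list(chain.from_iterable(lol))

flat = flatten_lol([[1, 2], [3], [], [4, 5]])
```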
