wrangles Changelog

1.12.0

Highlights
If conditions
Add a condition to a read, wrangle, write or run. The condition determines whether the action as a whole runs, in contrast to where, which filters the rows that the action applies to.
yml
# Send a failure email only if not testing
run:
  on_failure:
    - notification.email:
        if: ${environment} != 'test'
        ...

Matrix
Matrix can be used to run multiple actions defined by variables. Matrix was previously available for write. It is now also available for read, run and wrangles.

yml
# Read all the files in a folder into a single dataframe
read:
  - matrix:
      variables:
        filename: dir(my_folder)
      read:
        - file:
            name: ${filename}
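As a rough illustration of what the recipe above expresses, the following Python sketch reads every file in a folder and unions the rows into one result. It uses only the standard library and is not how the library implements matrix; `read_csv`, `matrix_read` and the sample files are hypothetical names made up for this example.

```python
import csv
import tempfile
from pathlib import Path


def read_csv(path):
    """Read one CSV file into a list of row dictionaries."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


def matrix_read(folder):
    """Run one read per file in the folder and union the rows together."""
    rows = []
    for filename in sorted(Path(folder).iterdir()):
        rows.extend(read_csv(filename))
    return rows


# Two small sample files stand in for the contents of my_folder
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "a.csv").write_text("id\n1\n")
    Path(folder, "b.csv").write_text("id\n2\n")
    combined = matrix_read(folder)

print(combined)
```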

Concurrent
By default, all functions executed by a recipe happen sequentially. The concurrent connector allows functions to be executed in parallel.

yml
# Write to a file and database simultaneously.
write:
  - concurrent:
      write:
        - file:
            name: file1.csv
        - postgres:
            host: postgres.domain
            ...
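The difference between sequential and parallel execution can be sketched in plain Python with a thread pool. This is only an analogy for the behaviour described above, not the connector's implementation; `write_file`, `write_database` and `run_concurrently` are hypothetical stand-ins, and the sleeps simulate slow I/O.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def write_file(rows):
    time.sleep(0.1)  # simulate a slow file write
    return f"file: {len(rows)} rows"


def write_database(rows):
    time.sleep(0.1)  # simulate a slow database write
    return f"database: {len(rows)} rows"


def run_concurrently(actions, rows):
    """Start every action at once and wait for all of them to finish."""
    with ThreadPoolExecutor(max_workers=len(actions)) as pool:
        futures = [pool.submit(action, rows) for action in actions]
        # Results are collected in the order the actions were listed
        return [future.result() for future in futures]


rows = [{"id": 1}, {"id": 2}]
results = run_concurrently([write_file, write_database], rows)
print(results)
```

Because both writes wait at the same time, the total run time is close to that of the slowest write rather than the sum of both.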

Try
Added a try wrangle. This allows trying a series of wrangles, catching any error that occurs and continuing anyway. Optionally, an except can be provided with wrangles to run in the event that the try fails, or a dictionary of keys and values to populate the dataframe with.
Try will fail as a whole, not per row.
yml
# Try a wrangle that might fail and run another if it does
wrangles:
  - try:
      wrangles:
        - risky_wrangle:
            input: column
      except:
        - backup_wrangle:
            input: column
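The "fails as a whole, not per row" behaviour can be sketched in plain Python. This is an illustration under assumptions rather than the library's code: rows are modelled as a list of dictionaries, and `run_try`, `risky` and the `on_except` argument are hypothetical names.

```python
def run_try(rows, wrangles, on_except=None):
    """Run wrangles over all rows; on any error fall back for every row."""
    try:
        result = [dict(row) for row in rows]
        for wrangle in wrangles:
            result = wrangle(result)
        return result
    except Exception:
        if on_except is None:
            # No except provided: continue with the data unchanged
            return rows
        if isinstance(on_except, dict):
            # A dictionary of keys and values populates every row
            return [{**row, **on_except} for row in rows]
        # Otherwise treat on_except as a list of fallback wrangles
        result = [dict(row) for row in rows]
        for wrangle in on_except:
            result = wrangle(result)
        return result


def risky(rows):
    # Raises ZeroDivisionError if any single row has value == 0
    return [{**row, "out": 10 / row["value"]} for row in rows]


rows = [{"value": 2}, {"value": 0}]
print(run_try(rows, [risky], on_except={"out": None}))
```

Only the second row triggers the error, yet both rows receive the fallback value, because the try is evaluated for the whole set of rows at once.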

Not Columns
Added the ability to use -name syntax for column names where wildcards are supported. This will exclude that column. Columns are evaluated from first to last.
yml
# Apply to all columns like column1, column2, ... except column2
wrangles:
  - convert.case:
      input:
        - column*
        - -column2
      case: upper
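The first-to-last evaluation can be illustrated with a short Python sketch. It is not the library's implementation: `resolve_columns` is a hypothetical helper, and it assumes `*` wildcards behave like shell-style globs.

```python
from fnmatch import fnmatchcase


def resolve_columns(patterns, available):
    """Expand patterns in order; a leading '-' removes matching columns."""
    selected = []
    for pattern in patterns:
        exclude = pattern.startswith("-")
        name = pattern[1:] if exclude else pattern
        matches = [col for col in available if fnmatchcase(col, name)]
        if exclude:
            selected = [col for col in selected if col not in matches]
        else:
            selected.extend(col for col in matches if col not in selected)
    return selected


columns = ["column1", "column2", "column3", "other"]
print(resolve_columns(["column*", "-column2"], columns))
# ['column1', 'column3']
```

Under this model the order matters: placing `-column2` before `column*` would exclude nothing, since the column has not been selected yet when the exclusion is evaluated.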

compare.text
Added _compare.text_ wrangle. This can be used to compare different strings.
yml
wrangles:
  - compare.text:
      input:
        - col1
        - col2
      output: output
      method: intersection

Available methods include:
- intersection: Return only the text in common between the inputs
- difference: Return the text that differs between the inputs
- overlap: Show the text in common between two inputs

Other Changes
- _log_:
  - If using alternative logging methods such as write, default not to write to the console unless specifically set.
  - Fixed a bug where log used excessive memory for large dataframes.
- _lookup_: Added schema definition for training read/write.
- _merge.concatenate_: When concatenating a single column of strings, return that value rather than treating strings as lists.
- _merge.coalesce_: Added support for coalescing lists within a single column.
- _rename_: Fixed a bug where rename dropped the column if attempting to rename to itself.
- _select.group_by_: Allow using custom functions as aggregation functions.
- _convert.to_yaml_ and _convert.to_json_: Accept multiple inputs for a single output. Will treat as a dictionary of the inputs.
- _extract.ai_:
  - Set default model to gpt-4o-mini.
  - Ensure examples are passed as an array even if set as a scalar value.
- _python_: Added an _except_ parameter to provide a value to return in the case of an error.
- _extract.brackets_:
  - Added _find_ parameter to specify which type of brackets to return values for.
  - Added _include_brackets_ boolean parameter to set whether to include the brackets in the returned strings. Default false.
- _select.highest_confidence_: Refinements to parameter and value handling.
- _split.tokenize_: Added a method parameter. The new methods 'boundary' and 'boundary_ignore_space' split on regex word boundaries, `regex:<pattern>` splits on a custom regex pattern, and `custom.<function>` uses a custom function.
- _S3_: Fixed a bug where reading or writing gzipped files directly failed.
- _input_: New connector. This can be used in a read to reference the dataframe that was passed as part of the recipe.run function, e.g. to union or join with another source.
- _recipe wrangle_:
  - Added support for using input, output and where to only apply the recipe to a subset of the data.
  - If the variables parameter is not used, default to passing through all variables rather than none.
  - Improved the behaviour of where and the handling of more edge cases.
  - Support multiple reads. If a specific aggregation isn't used, the default will be to union together into a single dataframe.
- Custom functions for variables can now reference other variables by name in the arguments. e.g. `def func(other_variable):`
- Full support for nested function calls for custom functions. e.g. `custom.module.class.function`
- Fixed a bug where regex special characters had unintended side effects when using wildcards (*) in column names.
- Allow read, wrangles, write and run to be defined as strings and not require an empty dict for parameters if no parameters are needed.
- Fixed a bug where variables were not passed correctly to the recipe when they contained falsy values.
- Fixed a bug where certain special characters were not escaped correctly in user credentials.
- Fixed a bug where matrix write would not wait for all threads to complete fully.
- Updates to tests due to backend changes.
- General improvements to error messages.
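One way to picture custom functions for variables that reference other variables by name is to inspect each function's signature and supply the matching values. The sketch below is an assumption about how such resolution could work, not the library's actual mechanism; `resolve_variables` and `output_file` are hypothetical names.

```python
import inspect


def resolve_variables(variables):
    """Call function-valued variables, passing other variables by name."""
    resolved = {k: v for k, v in variables.items() if not callable(v)}
    pending = {k: v for k, v in variables.items() if callable(v)}
    while pending:
        progressed = False
        for name, func in list(pending.items()):
            params = inspect.signature(func).parameters
            # Only call the function once all of its arguments are known
            if all(param in resolved for param in params):
                resolved[name] = func(**{p: resolved[p] for p in params})
                del pending[name]
                progressed = True
        if not progressed:
            raise ValueError(f"Unresolvable variables: {sorted(pending)}")
    return resolved


def output_file(input_file):
    # The argument name matches another variable, so its value is passed in
    return input_file.replace(".csv", "_clean.csv")


resolved = resolve_variables({"input_file": "data.csv", "output_file": output_file})
print(resolved)
```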

1.11.0

Lookup
The lookup wrangle is now widely available. This can be used to look up data from a saved lookup wrangle. Lookups support key (exact) or semantic (most similar meaning) based matches.

yml
wrangles:
  - lookup:
      input: column_to_be_looked_up
      output:
        - column1_from_lookup
        - column2_from_lookup
      model_id: <id>

Batch
Added a batch wrangle. This executes a series of wrangles over the data in batches. The batches can optionally be executed in parallel using the threads parameter, and on_error can define output values to catch errors.

yml
wrangles:
  - batch:
      batch_size: 10
      threads: 5
      on_error:
        error_column: error_value
      wrangles:
        - ...
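The batching idea can be sketched in plain Python. This is a simplified model, not the wrangle's implementation: rows are a list of dictionaries, `run_in_batches` and `double` are hypothetical names, and it assumes that on_error values are applied to every row of a batch that fails.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_batches(rows, func, batch_size, threads=1, on_error=None):
    """Split rows into batches, run func on each, and rejoin in order."""
    batches = [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]

    def run_one(batch):
        try:
            return func(batch)
        except Exception:
            if on_error is None:
                raise
            # Assumed behaviour: fill the failed batch with on_error values
            return [{**row, **on_error} for row in batch]

    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map yields results in the same order as the input batches
        return [row for batch in pool.map(run_one, batches) for row in batch]


def double(batch):
    if any(row["n"] < 0 for row in batch):
        raise ValueError("negative value")
    return [{**row, "doubled": row["n"] * 2} for row in batch]


rows = [{"n": n} for n in [1, 2, -3, 4, 5]]
output = run_in_batches(
    rows, double, batch_size=2, threads=2,
    on_error={"error_column": "error_value"},
)
print(output)
```

In this model only the batch containing the negative value receives the error output; the other batches complete normally.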

Others
- Fixed matrix set to preserve the order of values from the input column(s).
- Use gpt-4o for GPT based tests.
- select.sample: Allow strings for the rows parameter if they are valid numbers.
- Fix schema input/output for fraction_to_decimal to show arrays are allowed.
- Fix a bug where accordion would introduce an empty string for inputs containing empty lists.
- Improved devcontainer definition for codespaces.
- Explode: added a drop_empty parameter. If true, empty lists will not produce a row in the exploded output.
- Pass any additional parameters for cloud based wrangles as url params.
- If an empty list is passed to any server based wrangles, such as in the case of a where filtering all rows, shortcut and return an empty list immediately.
- Improve handling of special characters in variable names. Variables are available with all non-alphanumeric characters replaced by underscores, or within the kwargs dict using the original name.
- Enable where to work in conjunction with wrangles that use special output syntax like extract.ai and lookup.
- Fix a bug where `where` could cause data loss for falsy values.
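The handling of special characters in variable names can be pictured with the following Python sketch. It is an illustration only: `sanitize_name` and `custom_function` are hypothetical, and the way the two forms of the name are combined here is an assumption rather than the library's behaviour.

```python
import re


def sanitize_name(name):
    """Replace every non-alphanumeric character with an underscore."""
    return re.sub(r"[^0-9A-Za-z_]", "_", name)


def custom_function(my_variable, **kwargs):
    # The sanitized name works as a normal argument, while the
    # original name is still reachable through kwargs
    return my_variable, kwargs["my-variable"]


variables = {"my-variable": 42}
sanitized = {sanitize_name(key): value for key, value in variables.items()}

# Pass both the original and the sanitized names to the function
print(custom_function(**{**variables, **sanitized}))
```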

1.10.2

- Don't sort keys by default when converting to YAML.
- Ensure *create.embeddings* retries work even in the case of full network errors.

1.10.1

- Allow *split.list* to also work with JSON arrays.
- Allow *select.list_element* to also work with JSON arrays.
- Allow *select.dictionary_element* to also work with JSON objects.
- Allow *accordion* to also work with JSON arrays.
- Bump docker/login-action version to v3

1.10.0

Accordion
Added accordion. This allows applying a series of wrangles to the elements of a list individually.
yml
# ["a","b","c"] -> ["A","B","C"]
wrangles:
  - accordion:
      input: list_column
      output: modified_lists
      wrangles:
        - convert.case:
            input: list_column
            output: modified_lists
            case: upper
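Conceptually, this resembles expanding each list into its elements, transforming them, and regrouping them into lists. The Python sketch below models that idea on a list of lists; it is an assumption about the general approach, not the library's code, and `accordion` here is a hypothetical function.

```python
def accordion(lists, func):
    """Apply func to each element, keeping elements grouped by source list."""
    # Expand: remember which list each element came from
    exploded = [
        (index, element)
        for index, items in enumerate(lists)
        for element in items
    ]
    # Transform every element individually
    transformed = [(index, func(element)) for index, element in exploded]
    # Regroup: rebuild one output list per input list
    result = [[] for _ in lists]
    for index, element in transformed:
        result[index].append(element)
    return result


print(accordion([["a", "b", "c"], [], ["d"]], str.upper))
# [['A', 'B', 'C'], [], ['D']]
```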

Other Changes
- Added *order_by* parameter for read and write connectors using SQL syntax. e.g. Col1, Col2 DESC
yml
write:
  - file:
      name: file.csv
      order_by: Col1, Col2 DESC

- split.text
  - Improvements to output slicing. Can use step, is more tolerant of different syntax, and can use slicing when outputting to columns.
  - No longer requires output. If output is omitted, the input column will be overwritten in line with other wrangles.
yml
- split.text:
    input: column
    element: ':3'
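The slice syntax mirrors Python slicing, so the effect of an element such as ':3' can be sketched as follows. `parse_slice` is a hypothetical helper written for this illustration; the library's own parsing, which is described as more tolerant of different syntax, may differ.

```python
def parse_slice(text):
    """Convert text such as ':3', '1:3' or '::2' into an index or slice."""
    parts = [int(part) if part.strip() else None for part in text.split(":")]
    if len(parts) == 1:
        return parts[0]  # a plain index such as '2'
    return slice(*parts)  # start, stop and optional step


values = "a:b:c:d".split(":")
print(values[parse_slice(":3")])   # ['a', 'b', 'c']
print(values[parse_slice("::2")])  # ['a', 'c']
print(values[parse_slice("1")])    # b
```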

- *select.element*
  - Allow slicing lists or strings.
yml
- select.element:
    input: column[1:3]

  - The default behaviour is now to raise an error if a default isn't set.
- *split.dictionary*
  - Use output to choose only specific keys from the dictionaries either by name, using a wildcard or with regex.
yml
- split.dictionary:
    input: col1
    output: Out*

  - Allow output to use the syntax key: output_column_name to rename the resulting columns.
  - Add ability to rename columns dynamically using a wildcard (*) or regex.
yml
- split.dictionary:
    input: col1
    output:
      - "*": "*_SUFFIX"

- *select.dictionary_element*
  - Allow specifying multiple elements. If a list of elements is provided, the output will be a dictionary rather than a scalar value.
yml
# {'key1': 'value1', 'key2': 'value2', ...} -> {'key1': 'value1', 'key2': 'value2'}
- select.dictionary_element:
    input: column
    element:
      - key1
      - key2

  - Allow element selection to be dynamic with wildcards or regex.
yml
# {'key1': 'value1', 'key2': 'value2', ...} -> {'key1': 'value1', 'key2': 'value2'}
- select.dictionary_element:
    input: column
    element:
      - key*

  - Allow renaming output keys.
yml
# {'key1': 'value1', 'key2': 'value2', ...} -> {'key1': 'value1', 'renamed_key2': 'value2'}
- select.dictionary_element:
    input: column
    element:
      - key1
      - key2: renamed_key2

  - Allow using a dict for default to set the default for different keys.
yml
# {'key1': 'value1'} -> {'key1': 'value1', 'key2': 'A', 'key3': 'B'}
- select.dictionary_element:
    input: column
    element:
      - key1
      - key2
      - key3
    default:
      key2: A
      key3: B
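The combined effect of multiple elements, wildcards, renaming and per-key defaults can be modelled in a few lines of Python. This is a sketch under assumptions, not the wrangle's implementation: `select_elements` is hypothetical, wildcards are treated as shell-style globs, regex selection is left out, and a missing key without a default is assumed to raise an error.

```python
from fnmatch import fnmatchcase


def select_elements(data, elements, default=None):
    """Pick keys from a dict with wildcard, rename and default support."""
    default = default or {}
    result = {}
    for element in elements:
        # A one-item dict such as {"key2": "renamed_key2"} renames the key
        if isinstance(element, dict):
            (pattern, new_name), = element.items()
        else:
            pattern, new_name = element, None
        matches = [key for key in data if fnmatchcase(key, pattern)]
        if matches:
            for key in matches:
                result[new_name or key] = data[key]
        elif pattern in default:
            result[new_name or pattern] = default[pattern]
        else:
            raise KeyError(pattern)
    return result


row = {"key1": "value1", "key2": "value2", "other": "x"}
print(select_elements(row, ["key*"]))
print(select_elements(row, ["key1", {"key2": "renamed_key2"}]))
print(select_elements({"key1": "value1"}, ["key1", "key2", "key3"],
                      default={"key2": "A", "key3": "B"}))
```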

- *select.left*/*select.right*
  - Enable integer lengths to work even if set as a string i.e. '1' behaves as 1.
  - Allow negative values to remove characters from the left/right respectively.
- *create.embeddings*
  - Give a clear error message if the API Key is missing or invalid.
  - Set the default model to 'text-embedding-3-small'.
- SFTP Connector
  - Reuse the connection when transferring multiple files and ensure the connection is closed properly.
  - Add the filename to the error message if the file is not found when attempting to read.
- HTTP connector:
  - Added write function to the connector.
  - Added an option to do a pre-request for oauth authentication.
  - Added an orient parameter to define the JSON body structure.
  - Pass through kwargs to the request.
- Enable *extract.custom* to work with ai variants.
- Ensure *similarity* outputs a python float.
- Add bcc parameter for *notification.email*.
- Improve the logic for where by filtering the original dataframe using the index, reducing subtle data issues caused by executing the SQL query.
- Bugfix: Don't try to make a directory when writing a file to memory using the file connector.
- Provide clearer and more concise error messages for custom functions.
- Minor tidying of warnings from string escaping.
- Remove use of inplace due to upcoming pandas behaviour changes.
- Allow batching logic to deal with pandas tight dataframe dict format.
- Preparatory edits for releasing lookup wrangles. Not yet widely available.
- Added a devcontainer config for codespaces.

1.9.0

- Enable _extract_raw_ option for extract.custom.
- Make OpenAI tests more tolerant of variations in model results.
- Add run function to SFTP connector.
- Add inclusive option to split.text to toggle whether to include the split character in the output or not.
- Refactor and optimize split.text function.
- Validate input and output lengths in pandas.copy.
- Added automated recipe schema generation and tests.
