Summary
Improve the interactions of datafiles with datasets and make output datasets available in the cloud via expiring signed URLs after `Analysis.finalise` has been called.
Contents (https://github.com/octue/octue-sdk-python/pull/416)
IMPORTANT: There are 3 breaking changes.
New features
- Add the ability to generate expiring signed URLs for datasets and their datafiles
- Add the ability to instantiate datasets and datafiles from signed URLs instead of cloud paths
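
To illustrate what "expiring signed URL" means here, the following is a minimal standard-library sketch of the general idea: a URL carries an expiry time and a signature, and both are checked before access is granted. It is not the mechanism used by Google Cloud Storage or by this SDK; `SECRET_KEY`, `sign_url`, `is_valid` and the example host are all hypothetical names chosen for illustration.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlparse

SECRET_KEY = b"hypothetical-secret"  # Assumption: illustration only, not a real credential.


def _signature(path, expires):
    """Sign the URL path together with its expiry time."""
    message = f"{path}?expires={expires}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def sign_url(url, lifetime_seconds, now=None):
    """Return the URL with `expires` and `signature` query parameters appended."""
    now = time.time() if now is None else now
    expires = int(now) + lifetime_seconds
    query = urlencode({"expires": expires, "signature": _signature(urlparse(url).path, expires)})
    return f"{url}?{query}"


def is_valid(signed_url, now=None):
    """Check that the URL has not expired and that its signature matches."""
    now = time.time() if now is None else now
    parsed = urlparse(signed_url)
    parameters = parse_qs(parsed.query)

    try:
        expires = int(parameters["expires"][0])
        signature = parameters["signature"][0]
    except (KeyError, ValueError):
        return False

    if now > expires:
        return False

    return hmac.compare_digest(signature, _signature(parsed.path, expires))
```

Anyone holding the signed URL can use it until it expires, without needing credentials of their own, which is why it suits sharing output datasets.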
Enhancements
- **BREAKING CHANGE:** Re-engineer `Dataset.add` so that datafiles are actually copied/uploaded/downloaded to the dataset when they are added
- **BREAKING CHANGE:** Re-engineer `Analysis.finalise` to:
  - Provide signed URLs for uploaded datasets if a cloud path to a directory is provided
  - No longer support local data persistence
- **BREAKING CHANGE:** Remove redundant concepts of `output_dir` and output manifest path
- Add `Datafile.cloud_hash` and `Datafile.extension` properties
- Add `Dataset.bucket_name` and `Dataset.path_in_bucket` properties
- Add `Dataset.__contains__` method
- Avoid downloading cloud datafiles to determine their cloud hash values
- Multithread dataset and datafile instantiations/metadata-gathering in `Manifest` and `Dataset`
- Copy local datafiles to their new paths when setting their `local_path` properties
- Add `is_url` and `is_cloud_path` functions to `octue.cloud.storage.path`
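
As a rough sketch of the kind of check the two path helpers perform, the stand-ins below distinguish cloud paths from URLs. They share the names of the new functions but are not the SDK's implementations; the sketch assumes cloud paths use the `gs://` scheme and that URLs are HTTP(S).

```python
from urllib.parse import urlparse


def is_cloud_path(path):
    """Assumption: a cloud path is any path using the `gs://` scheme."""
    return path.startswith("gs://")


def is_url(path):
    """Assumption: a URL is any path with an `http` or `https` scheme."""
    return urlparse(path).scheme in {"http", "https"}


def describe(path):
    """Classify a path as a cloud path, a URL, or a local path."""
    if is_cloud_path(path):
        return "cloud path"
    if is_url(path):
        return "url"
    return "local path"
```

A check like this lets one constructor accept a local path, a cloud path, or a signed URL and branch accordingly.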
Fixes
- Set default `Dataset` path to the current working directory
- When calling `Datafile.to_cloud`, allow uploading of the file if nothing exists at the given cloud path
- Calculate the path of each datafile relative to its dataset correctly in `Dataset.to_cloud`
- Raise errors from threads used for concurrent download/upload of data/metadata from the cloud
- Call `Analysis.finalise` inside the template apps
- Create directories if missing when setting datafile local path
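
The fix for errors raised in threads can be sketched with the standard library, on the assumption that the concurrency is built on `concurrent.futures` (not confirmed here). An exception raised in a worker thread is stored on its future and only surfaces when `Future.result()` is called, so skipping that call silently swallows the error. `run_concurrently` is a hypothetical helper name.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_concurrently(function, items, max_workers=4):
    """Apply `function` to each item in a thread pool, re-raising any worker error."""
    results = {}

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(function, item): item for item in items}

        for future in as_completed(futures):
            # `result()` re-raises any exception that occurred in the worker thread.
            results[futures[future]] = future.result()

    return results
```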
Dependencies
- Use the latest `twined` version, allowing datasets to be optional if specified in `twine.json`
- Loosen the dependency on `twined` to allow freedom in the patch versions of the current minor version
Refactoring
- Reduce repetition and improve bucket retrieval in `GoogleCloudStorageClient`
- Replace `GoogleCloudStorageClient._create_intermediate_local_directories` with proper usage of `os.makedirs`
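
For reference, creating any missing intermediate directories needs only one `os.makedirs` call with `exist_ok=True`. The sketch below shows that pattern; `ensure_parent_directory` is a hypothetical helper name, not a function in the SDK.

```python
import os
import tempfile


def ensure_parent_directory(path):
    """Create the parent directory of `path`, and any missing ancestors, if needed."""
    directory = os.path.dirname(os.path.abspath(path))
    # `exist_ok=True` makes the call a no-op when the directory already exists.
    os.makedirs(directory, exist_ok=True)
    return directory


# Usage: prepare a nested location before writing a file to it.
with tempfile.TemporaryDirectory() as temporary_directory:
    file_path = os.path.join(temporary_directory, "nested", "directories", "file.txt")
    ensure_parent_directory(file_path)

    with open(file_path, "w") as f:
        f.write("some data")
```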
- Simplify `Manifest._instantiate_dataset` and remove support for datasets with no explicit path
Testing
- Test calculation and retrieval of datafile hash values
- Create and use a mock for generating signed URLs to avoid the limitations of workload identity federation in CI testing
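
The mocking approach can be sketched with `unittest.mock`. This is a minimal illustration rather than the mock added in this pull request: `FakeBlob`, `get_download_url`, `mock_generate_signed_url` and the example host are hypothetical, with `FakeBlob` standing in for any object whose URL signing needs credentials that are unavailable in CI.

```python
from unittest.mock import patch


class FakeBlob:
    """Hypothetical stand-in for a cloud storage blob."""

    def generate_signed_url(self, expiration):
        raise RuntimeError("Signing needs credentials that are unavailable in CI.")


def mock_generate_signed_url(blob, expiration):
    """Return a deterministic fake URL instead of contacting the cloud."""
    return f"https://storage.example.com/mock-signed-url?expiration={expiration}"


def get_download_url(blob, expiration=3600):
    """Code under test: it only needs *a* URL back from the blob."""
    return blob.generate_signed_url(expiration=expiration)


# Usage: patch the method for the duration of a test.
with patch.object(FakeBlob, "generate_signed_url", mock_generate_signed_url):
    url = get_download_url(FakeBlob())
    print(url)
```

Patching at the class level keeps the code under test unchanged, and the original method is restored when the `with` block exits.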
<!--- SKIP AUTOGENERATED NOTES --->