Highlights
- ETL - inline and offline dataset transformations, custom user-defined transformations via both user-provided containers and Python scripts, simplified ETL initialization, ETL directly to and from Cloud buckets;
<img src="https://github.com/NVIDIA/aistore/blob/master/docs/images/etl-v3.3.png" width="300">
- Multi-Cloud capability supporting co-existence and management of datasets originating from (or hosted by) different Cloud storages - !2736, !2737, !2748, !2792, !2793;
- Maintenance and decommission - the capability to put a clustered node in maintenance mode and/or safely and permanently remove it from the cluster - 947, !2935, !2957, !2983, !2990, !3094;
- Volume metadata (`VMD`) - persistent information that describes each clustered node's storage configuration (including data drives, local filesystems, `mountpaths`) further used to reinforce data integrity and protection - 939, 941, !3118, !3198;
- New protocol prefix `ht://` - uniform access to "vanilla" HTTP(S) based datasets - 882, 889;
- Terraform integration - easy and automated deployment via Terraform - there's a separate [repository](https://github.com/NVIDIA/ais-k8s) (of scripts, charts, and documentation) that we use for production deployments;
- Intra-cluster communications - the transport we use to rebalance user data, transfer erasure-coded slices, copy and transform datasets - a major upgrade !2860, !2895, !2984, !3053, !3055, !3066, !3084, !3085, !3097, !3112, !3181, !3183, !3184, !3187, !3189, !3201, !3265, !3268, !3274, !3286, !3303, !3356, !3357, !3396, !3403, !3409, !3415, !3417.
And also:
- performance optimizations, CLI usability improvements, refactoring, cleanup, and stability fixes across the board.
Multi-Cloud
A new protocol prefix `ht://` (in addition to `s3://`, `gs://`, and `azure://`) for seamless integration and uniform access to "vanilla" HTTP(S) based datasets.
Multi-Cloud via a single deployed runtime. Improved access to public Cloud buckets (from different Cloud providers). Bucket copying and transformations (see **ETL** below) extended to support Cloud buckets.
- New HTTP provider (`ht://`) - 882, 889
- Multi-Cloud - added runtime support for bucket management of multiple Cloud providers - !2736, !2737
- Support multiple regions for AWS buckets - 778, !2804
- Improve Google provider error handling - !2792, !2793
- Public GCP buckets can be used without setting `PROJECT_ID` - !2723
- Remove default Cloud provider option (provider now must be set explicitly) - 2748
- Support Cloud-based source/destination in a bucket copy operation - !2975
- Prefetch performance improvement: keep cached object properties longer - 969
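Because the provider must now be given explicitly, every bucket reference carries one of the protocol prefixes listed above. The following is a minimal sketch of how such a reference could be split into provider, bucket, and object name. `parseBucketURI` and `knownProviders` are hypothetical names used for illustration only; they are not part of the AIStore API, and the real parsing rules may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// knownProviders holds the protocol prefixes named in these notes.
// Hypothetical table, for illustration only.
var knownProviders = map[string]bool{"ht": true, "s3": true, "gs": true, "azure": true}

// parseBucketURI is a hypothetical helper (not an AIStore function).
// It splits "provider://bucket/object" and rejects references that
// omit the provider, mirroring the "no default provider" rule.
func parseBucketURI(uri string) (provider, bucket, object string, err error) {
	scheme, rest, ok := strings.Cut(uri, "://")
	if !ok {
		return "", "", "", fmt.Errorf("missing provider prefix in %q", uri)
	}
	if !knownProviders[scheme] {
		return "", "", "", fmt.Errorf("unknown provider %q", scheme)
	}
	bucket, object, _ = strings.Cut(rest, "/")
	if bucket == "" {
		return "", "", "", fmt.Errorf("empty bucket name in %q", uri)
	}
	return scheme, bucket, object, nil
}

func main() {
	// Example names are made up.
	for _, uri := range []string{"s3://imagenet/train/0001.tar", "gs://public-data", "imagenet"} {
		p, b, o, err := parseBucketURI(uri)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("provider=%s bucket=%s object=%q\n", p, b, o)
	}
}
```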
Core
Improve cluster stability in the presence of exceptional events, optimize cluster operation under heavy workloads, introduce `maintenance mode`, support permanent `decommissioning` of nodes from the cluster, improve the reliability of bucket `destroy` operation, optimize and further stabilize cluster rebalancing logic.
- Node `maintenance` feature - 947, !2935, !2990, !3094
- Improved out-of-space (out of capacity) handling - 822
- `Backend` buckets vs bucket initialization - !2841
- Improve cluster stability while it is in transition (when the primary changes) - 945, 968, 960
- If the cluster restarts during rebalancing, the rebalance now resumes - 913
- Optimize `copy-bucket` and other bucket-traversing workloads - 917
- Make promote consistent with other object operations - !2763, !2765
- Add transfer statistics for `resilvering` - !2926
- Configuration option `Rebalance`. Enabled now; affects only automatic rebalance (a manual one can always be started) - !2915
- Reduce resource usage by `StatsD` (Grafana, Graphite) client - !3240
- New CLI option `--daemon-id` to join a node with user-predefined ID - !3255
- Fix `object rename` operation to work across different `mountpaths` - !3329
- Make `destroy bucket` operation transactional - !3315
- Volume metadata (`VMD`) - persistent information about a node and its storage configuration, used on startup when running node integrity checks - 939, 941, !3118, !3198
- No `metasync` when shutting down - !2844
- Not ignoring errors when listing multiple Cloud providers - !2845
- Refactor `reb` (rebalance) package - !2857
- Refactor target handlers and fix transactions' housekeeping logic - !2869
- Refactor `copy-object` interface - !2879
- Revise and refactor `PROMOTE` (command and API) - !2880
- Refactor target `copy-object` and `put-remote` interfaces - !2881
- Use data mover to copy buckets - !2893
- `LOM`: fix `CopyObject` - !2908
- `cmn.JoinWords` and friends - !2913
- Always allow manual rebalance (even if automatic one is disabled) - !2915
- `Mountpath resilvering` now counts moved objects and their total size - !2926
- Copy buckets to return correct total size of copied content - !2919
- Revise and optimize intra-cluster broadcasting - !2943
- Improve `HrwTargetList` performance - !2945
- Fix zero-size objects scenario - !3531
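For readers unfamiliar with the `HrwTargetList` item above: these notes do not describe it, so the sketch below assumes that HRW refers to highest-random-weight (rendezvous) hashing and shows the general technique only. `weight` and `hrwTargets` are hypothetical names, and the FNV hash is an arbitrary choice made for this example; none of this is taken from the AIStore code.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// weight returns a pseudo-random weight for an (object, target) pair.
// FNV-1a is an illustrative choice, not necessarily what AIStore uses.
func weight(objName, targetID string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(targetID))
	h.Write([]byte{0}) // separator, so ("ab","c") and ("a","bc") differ
	h.Write([]byte(objName))
	return h.Sum64()
}

// hrwTargets is a hypothetical helper: it returns up to n target IDs
// ordered by descending weight for the given object name.
func hrwTargets(objName string, targets []string, n int) []string {
	sorted := append([]string(nil), targets...) // do not mutate the input
	sort.Slice(sorted, func(i, j int) bool {
		wi, wj := weight(objName, sorted[i]), weight(objName, sorted[j])
		if wi != wj {
			return wi > wj
		}
		return sorted[i] < sorted[j] // deterministic tie-break
	})
	if n > len(sorted) {
		n = len(sorted)
	}
	return sorted[:n]
}

func main() {
	targets := []string{"t1", "t2", "t3", "t4"} // made-up target IDs
	fmt.Println(hrwTargets("bucket/object-42", targets, 2))
}
```

A useful property of this scheme is that removing a target which did not hold the highest weight for an object leaves that object's placement unchanged.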
ETL
Multiple improvements and enhancements to the capability (introduced first with v3.2) to easily run user-defined custom dataset transformations - and scale the performance linearly with each added storage server. This release adds *offline* (dataset-to-dataset) transformation.
For ETL documentation (that now also includes animated presentations), please refer to [docs/etl.md](https://github.com/NVIDIA/aistore/blob/master/docs/etl.md) and [etl/README.md](https://github.com/NVIDIA/aistore/blob/master/etl/README.md).
- Add offline, local and cloud, bucket transformation - !2827, !2854, !2898, !3445
- ETL for objects in the Cloud - !3399
- ETL `build` operation - easy initialization based on the function definition - !2873, !2884, !2918, !3369
- Remove `kubectl` (shell) calls, use K8s `client-go` instead - !2896, !2907
- Support retrieving ETL logs - !2947
- Stability and performance improvements, bug fixes - !2955, !2977, !3330, !3369, !3374, !3411
- Add and improve labels in Pods and Services - !3445
- Improve waiting for the Pod/Service to be ready - !3332, !3397
- Add extension, prefix, and suffix flags for offline ETL - !2846
- Support aborting offline ETL - !2850
- Add dry run option for offline ETL - !2854
- Simplify flow to initialize ETL - !2853
- Consistent naming of API constants - !2861
- ETL build: remove unnecessary annotations - !2871
- Update *skeleton* docker images used to run custom Python-based transforms - !2870
- Install dependencies in `initContainer` - !2873
- POD spec: add volume mount - !2883
- Unify offline ETL with `copy-bucket` - !2898, !2933
- Improve waiting for POD-ready - !2912
- Add `dry-run` capability - !2939
- K8s client: pod namespace & refactoring - !2948
- The capability to throttle ETL (transforms) depending on disk utilizations - !2998
Terraform integration
Dramatically simplified deployment of AIStore cluster on the Cloud via Terraform. This release delivers GKE but can be easily extended to support any Cloud that provides Kubernetes (service). It is now possible to start a fully functional AIStore cluster with a single command - for details, please refer to [AIStore Kubernetes repository](https://github.com/NVIDIA/ais-k8s/blob/master/terraform/README.md).
- Add scripts for easy deployment and shutdown of the AIStore cluster on the cloud - !16, !56-!68, 14, 17
- Add `admin` container image - !3079, !3195, !3359
- Remove requirement for `K8S_HOST_NAME` environment variable - !3451
Information Center (IC)
More reliable extended action (`xaction`) status management and reporting, automatic cluster-wide `xaction` abort, `xaction` progress notifications (**new**). In AIS, `xaction` is a long-lived asynchronous operation, a job.
- Notify all participating nodes when any one of them aborts `xaction` - !2928
- Improve `IC` status reporting by polling `xaction` status from targets that have not reported `xaction` status yet - !2953
- Fix `xaction` registration for newly added targets - !2924
- Support both transactional and non-transactional `xactions` - !2734
- Replace target polling with notifications when waiting for `xaction` to complete - !2868
- `xactions` to return user-friendly status - !2865
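The move from polling to notifications, and the cluster-wide abort, can be pictured with a small sketch. It assumes nothing about the actual `IC` implementation: `job`, `finish`, and `wait` are hypothetical names, and goroutines stand in for participating nodes. The point is only that waiters block until they are notified instead of repeatedly asking for status.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// job is a hypothetical stand-in for a long-lived asynchronous operation.
type job struct {
	once sync.Once
	done chan struct{} // closed exactly once, on completion or abort
	err  error
}

func newJob() *job { return &job{done: make(chan struct{})} }

// finish records the outcome and notifies every waiter by closing done.
// Calls after the first one are ignored.
func (j *job) finish(err error) {
	j.once.Do(func() {
		j.err = err
		close(j.done)
	})
}

// wait blocks until the job is finished or aborted; there is no polling loop.
func (j *job) wait() error {
	<-j.done
	return j.err
}

var errAborted = errors.New("aborted")

func main() {
	j := newJob()
	var wg sync.WaitGroup
	for node := 1; node <= 3; node++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			fmt.Printf("node %d observed: %v\n", id, j.wait())
		}(node)
	}
	j.finish(errAborted) // one participant aborts; all waiters are notified
	wg.Wait()
}
```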
Downloader
Integration with `IC`, more robust downloader job handling.
- Downloader naming; fix `mountpath` register/unregister - !2842
- Better job aborting; improved completion mechanisms - 902, !2960
- Progress Bar: report periodic status and stats to `IC` (see above) - !2911
Distributed Shuffle (`dSort`)
Performance improvements, resource usage optimizations.
- Performance: decrease resource usage - 938
- Better data transport streams handling - 936, !3307
Erasure Coding (EC)
Resource usage optimizations, better slice checksum handling.
- Fix checksum when sending constructed slices to other targets - !3073, !3132
- Improve operation over data transport streams - 916, !3311
- Fix receiving object slices when the bucket is being destroyed - 887
- Add support for nodes in maintenance mode - !3404
Intra-cluster communications
The transport that we use to rebalance user data (e.g., when adding/removing nodes), transfer erasure-coded slices, and copy and transform datasets has undergone a major upgrade:
* Add data mover layer - !2860, !2895, !2899
* Support for short messages and message streams - !2984, !3055, !3066, !3084, !3085, !3097, !3112, !3181, !3183, !3184, !3187, !3189, !3201, !3265, !3268, !3274, !3303
* Revise and optimize transport stream multiplexing - !3141
* When done transmitting, wait for data mover quiescence - !2903
* Support streaming *unsized* objects - objects of unknown size - the functionality in particular useful when ETL-transforming objects on the fly (that is, *inline*) - !3356, !3357, !3396, !3403, !3409, !3415, !3417
* Optimize memory management and debug unlikely races: !3053, !3189, !3286, !3298, !3309, !3314, !3319
* `Data mover`: is-open vs quiescent - !2941
CLI (tool)
New command `ais show mountpath`, new option `--keep` for `PROMOTE` operation, allow running certain commands without accessing a cluster, redesigned `ais rm node` command, automatic progress indicator for long `ais ls <bucket>` operations, many fixes for various `show` commands.
- Display EC `xaction` extra information for `ais show xaction` command - 823
- Improve user experience: commands that do not need a cluster no longer require the cluster to be running - 878, !2914
- Listing bucket objects with the flag `--all` displays all objects (including temporarily misplaced) - 964
- Command `ais cat` now prints only object content, trailing object size information line is removed - !2729
- Cloud bucket can be downloaded without setting backend bucket - !2803
- Added progress indicator when listing a huge bucket - 884, !2786
- Unify `--all` sub-option for all commands - !2843, !3264
- New option for `PROMOTE` command: `--keep` original files after promoting them to objects - !2880
- New command `ais show mountpath` to display target `mountpath` info - !2900, !3387
- Fix displaying rebalance statistics - !3264
- Fix `ais show xaction rebalance` to show the last `xaction` - !3250
- Fix `ais show cluster smap` - !3243
- Revise `ais rm node` command: add mandatory option `--mode` (to choose between node decommission and putting node in maintenance), and optional `--no-rebalance` (to skip rebalance and execute removal immediately) - !2965
- An option to remove all finished download jobs - !2849
- Wait option (flag) - !2876
- New command `ais show mountpath` - !2900
- Fix 'show rebalance' showing rebalance stats - !2954
- Refactor CLI `cat`/`get` top-level commands - !2972
Other
- `aisloader` (benchmark): add progress indicator when listing very large buckets - !2821
- `aisfs`: `APPEND` operation is now checksum-protected - 780
- `build`: use custom image for faster CI, enable more linters, switch to Go 1.15, add memory and CPU profiling options via `make`, upgrade third-party packages - !3235, !3121, !2949, !2916, !2993, !3050
- `CI/CD`: fix k8s development scripts, run many more tests in `minikube` CI, add terraform GCP playground - !2851, !2858, !2980.
- `S3 compatibility`: support AIS buckets with Cloud backend - !3532, 67, 68