AIS **v3.9** is a substantial productization and performance-improving upgrade. Much of the codebase has been refactored for consistency, with numerous micro-optimization and stabilization fixes across the board.
Highlights
* [promote](https://aiatscale.org/blog/2022/03/17/promote): redefine to handle remote file shares; collaborate when promoting via entire cluster; add usability options; productize;
* [xmeta](/cmd/xmeta/README.md): extend to also dump in a human-readable format: a) erasure-coded metadata and b) object metadata;
* memory usage and fragmentation: consistently use mem-pooling (via `sync.Pool`) for all control structures in the datapath;
* optimistic concurrency when running batch `prefetch` jobs; refactor and productize;
* optimize PUT datapath;
* core logic to deconflict running concurrent `xactions` (asynchronous jobs): bucket rename vs bucket copy, put a node into maintenance mode vs offline ETL, and similar;
* extend and reinforce resilvering logic to withstand simultaneous disk losses/attachments - at runtime and with no downtime;
* stabilize global rebalance to successfully pass multiple hours of random node "kills" and restarts - *node-left* and *node-joined* events - in the presence of stressful data traffic;
* self-healing: object metadata cache to support recovery upon `mountpath` events (e.g., drive failures);
* error handling: phase out generic `fmt.Errorf` and consistently use assorted error types instead;
* additional options to speed up listing of very large buckets ([list-objects](/docs/bucket.md#list-objects));
* numerous micro-optimizing improvements: fast datapath query (`DPQ`) and many more.
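The mem-pooling highlight above relies on Go's standard `sync.Pool`. As a minimal sketch of the general pattern only (the pooled type, `bufPool`, and `render` are hypothetical names for illustration and are not taken from the AIS codebase), reusing a short-lived structure might look like this:

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool is a hypothetical pool of reusable buffers. AIS pools its own
// datapath control structures, which are not shown here.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

// render formats a key/value pair using a pooled buffer instead of
// allocating a fresh one on every call.
func render(key string, value int) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()            // a pooled buffer may still hold stale contents
	defer bufPool.Put(buf) // hand the buffer back for reuse
	fmt.Fprintf(buf, "%s=%d", key, value)
	return buf.String() // String copies the bytes, so returning it is safe
}

func main() {
	fmt.Println(render("objects", 42))
}
```

The point of the pattern is that hot-path allocations are amortized across calls, which tends to reduce garbage-collector pressure and heap fragmentation; the pool may still drop idle items at any time, so `New` must always be able to produce a fresh one.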
Promote files and directories
- refactor as a 2-phase transaction and auto-detect file share (initial) !4929
- auto-detect file share and distribute the work between target nodes !4945
- add test; add [target node => IC](/docs/ic.md) notifications !4975
- extend test coverage; reinforce global UUID when promoting via entire cluster !4976
- rename [api.Promote](/api/object.go#L518); add test permutations and checks !4985
- remove redundant control structures; cleanup !4987
- add API options `delete-src` (delete source) and `overwrite-dst` (overwrite destination) !4988
- fix extra-copy optimization with full refactoring !4989
- revise/optimize `HEAD(object)` implementation and utilize it when promoting with `overwrite-dst=false` (major) !4991
- extend [object write transaction (OWT)](/cmn/owt.go#L19) to support the flow !4992
- support in (i.e., transmitted), out (i.e., received) and locally-promoted stats counters - files/objects and bytes !4993
- introduce *confirmed file share*; add user option _not_ to auto-detect file share !5019
- CLI: add `overwrite-dst` and `delete-src` E2E tests !5024
- consolidate control, eliminate ambiguity !5045
- increase test coverage !5047, !5063
- add all test permutations to cover (`ais` | `cloud` | `remote-ais`) bucket vs. (non-redundant | `EC` | `n-way mirror`) !5068
ETL
- add CLI to view stored ETL code and specification !4925
- handle target-down; test !4933
- redefine and improve ETL API (!4947, !4966, !5022), including:
  - manage (CRUD wise) persistently stored ETL definitions
  - eliminate redundant URL path parameters
  - enforce uniqueness of the user-provided ETL name
  - remove (obsolete) embedded ETL-specific annotations from the `init` spec (pod template)
- support stopping and restarting ETLs !5005, !5056
- update ETL docs and fix minor bugs !4984, !5022
Global Rebalance
- global rebalance status: always respond with total (cumulative) stats counters !4905
- generic `fs.Walk` for global rebalance (refactoring) !4889, !4930
- get-status & health !4934
- global rebalance status: reimplement to optimize !4936
- `devtools`: merge `WaitForRebalanceToStart` and `WaitForAllResilver` !4937
- tweak/optimize receive logic !4994
- abort via stage notifications from other target nodes (major) !5015
- transport streams vs receive errors; assorted fixes !5040
- tweak preemption logic (when rebalance triggering events arrive back to back) !5057
- assorted fixes: global rebalance vs n-way mirroring & resilvering !5071, !5072
- consistent renames and continuous refactoring !5075
Resilvering (in the presence of drive failures and attachments)
- tools and stats: `wait-for-all-resilvers`, [multi-snap API](/api/xaction.go#L122) !4888
- resilvering vs copy management (major) !4865, !4866, !4867
- resilvering: tweak is-active time interval !4882
- support losing multiple disks (mountpaths) simultaneously !4884
- multiple overlapping add/remove disk operations: fixes !4894
- resilvering as scrubbing: recover objects to their expected (default) locations !4900
- resilvering: interval-of-inactivity multiplier !4974
- resilvering under stress in presence of lost mountpath(s) !5058
Asynchronous Jobs (aka `xactions`)
- when aborting and propagating abort to the control-plane caller, make sure _not_ to lose the original cause for the abort !4886
- fix `put-xaction` finishing logic !4887
- aborting jobs: propagate the original cause through channels and APIs !4890, !4891
- revise lookup by `only-running` and/or by UUID !4897
- move `xaction` and `xreg` packages with refactoring !4898
- clarify *running* vs *not finished* `xaction` !4908
- [registry](/xact/xreg/xreg.go#L103): fix matching logic, remove redundant code !4911, !4912
- registry: amend housekeeping !4913
- registry: continued refactoring and cleanup !4914
- "limited coexistence" between running and about-to-run services (new) !4915, !4916, !4917
- `xact` package: revise and optimize abort-checking concurrency !4923
- registry: continued simplification and cleanup !4940
- reinforce global UUID for all cluster-wide `xactions` !4978
- [IC](/docs/ic.md) notifications vs transactional `xactions`: same rules for all !4982
- more stateful info: propagate xaction reference all the way into local PUT flow (major) !4995
- `copy-object` -- `xaction` -- `promote`: continued refactoring !5002
- registry: micro-optimizations and cleanup !5028
CLI
- PUT: add an option _not_ to load (skip loading) object metadata; amend docs; refactor and cleanup !4859
- add a command to view ETL code/spec !4925
- fix: do not add `--help` flag to the subcommands of subcommands !4926
- amend 'show config' to include CLI config (in addition to cluster and local configs); fix `cluster-unreachable` error !4983
- revise 'flag-is-set' for Boolean flags !5021, !5023, !5030
- copy bucket: add `--force` option !5042
- add `start etl` command !5056
- `ais show cluster`: add support for `refresh=<time-interval>` and `count` options !5076
- update CLI docs !5078
- enable 'ais show storage' and friends to run continuously and refresh periodically !5079
Testing
- test fixes to align with changes in the core !4861
- add `ensure-num-mountpaths` helper, and reinforce !4892
- use `api.WaitForXaction` instead of `tutils.WaitForXactionByID` !4893
- re-enable one `fs-checker` test, allow more time for mountpaths !4903
- add more checks when downloading object !4910
- extend CLI e2e promote test !4932
- `WaitFor` follow-up !4943
- fix e2e `AuthN` messages !4952
- retry upon failure to recover a damaged erasure-coded object !4986
- amend and extend EC tests !4990
- revise and enable `bucket-rename-and-copy` test !5060
CI/CD (continuous integration)
- add CI job that runs on multiple cloud buckets !5027
- add 1.18-rc1 version to build check !5044
- add `test-short-minimal` to test a single-node cluster !5046
- make `AWS_REGION` global env variable !5050
- update `test-long` stage !5059
Bug fixes, performance improvements; continuous refactoring
- `LOM` load to return distinct types: `syscall-error` and `corrupted-error` !4849
- `LOM` vs n-way mirroring: fix and revise caching of the metadata !4850
- `list-objects`: add fast mode `--only-names` !4851
- introduce permission to overwrite disconnected backend !4852
- `api`: refactor PUT API; fix `devtools` !4853
- optimize PUT latency by allowing _not_ to load object metadata !4854
- `aisloader` bench: do not run goroutine per each `PUT` request !4855
- reinforce access time `atime` (major) !4856
- reintroduce `no-metadata` error; fix n-way stress; refactor !4857
- build: fix deprecation warning on MacOS !4858
- when copying objects differentiate between copying == mirroring and all other scenarios !4860
- simplify `LOM` `from-fs` logic !4862
- general: don't use regex to validate names and UUIDs !4863, !4864
- assorted fixes !4868
- preserve `atime` across `LOM` caches !4869
- storage cleanup: leftover copies, corrupted and missing metadata !4870
- refactor `cmn` and `api` packages !4871
- `list-objects`: `use-cache` option !4872
- consistently use HTTP status 507 throughout; assorted fixes !4875
- eliminate redundant mirroring !4876
- get-cold (aka cold-GET) follow-up !4877
- object write transaction (`OWT`) fusion !4878
- prefetch: support optimistic concurrency (major) !4879
- name locker: fit two structures into 24 bytes !4880, !4881
- disable/detach mountpath: graceful (admin request) and immediate (FSHC) !4883
- move `health` package under `fs/health` !4885
- bucket summary and `obj-list` query: move, refactor, and simplify !4895, !4896
- control-plane: always free `call-results` back to pool !4899
- `api`: eliminate code duplication !4901
- general: deprecate and remove query objects !4902
- refactor `ais` package !4904
- control-plane transactions: refactor, reduce code !4906
- initial ETL get API implementation !4907
- control-plane transactions: follow-up !4909
- introduce read-only (but still configurable) timeouts: `cplane-operation` and `max-keepalive` !4918
- intra-cluster transport streams: tweak termination logic !4919
- slab allocator: amend pooling of the SGL control structures !4920
- transport streams: tweak termination logic !4921
- transport streams: revise transmit <=> terminate concurrency, optimize !4922
- refactor transaction types (minor) !4927
- `list-objects`: name-only option (follow-up) !4928
- `fs`: non-recursive `walk` in lexical order !4931
- `fs`: refactor bucket-traversing logic, eliminate nested closures !4935
- `list-objects` and `bucket-summary`: refactor target side !4938
- LRU & storage cleanup: clarify when-previous-is-running !4939
- `transport`: redefine client callback to return error !4941
- `transport`: consistent drain-and-free cleanup on the receive side !4942
- lint: `gofumpt` & `gocritic` !4944
- `list-objects` buffering, caching: refactor, optimize !4946
- etl: intuitive RESTful URL paths - Part 1 !4947
- feature flags: refactor, enforce intra-cluster requests via API endpoints !4948
- API-level JSON messaging: uniformity & consistency !4949
- core: fast URL query parsing (major) !4950, !4953, !4954, !4955, !4956, !4957, !4958
- tools: add support for EC metafile to xmeta !4959
- bench: revamp `lstat` !4960
- `api` package: memory pooling !4961
- follow-up: `DPQ`; `t.Helper` !4962
- `api` package: memory pooling !4963
- `api` package: memory pooling !4964
- `fs` and `cos` packages: alternative slightly faster `fstat` to check existence !4965
- etl: change `init-code` and `init-spec`; intuitive RESTful APIs - part 2 !4966
- fix the logic to attach remote cluster during early startup !4967
- `api` package: memory pooling !4968
- `api/object`, `aisloader`: continued refactoring !4969
- core/cluster: rewrite max-version decision logic !4970
- core: call-args control structure !4971, !4972, !4973
- tools: `xmeta` support for `LOM` !4977
- IC notifications: eliminate redundant aborts !4980
- etl: implement ETL Stop & Delete APIs; intuitive RESTful API part 3 !4981
- docs: updates to reflect all ETL API changes !4984
- alloc/free `put-obj`; mem-pool !4994
- `cos` package: inline usage of one-time constants !4996
- ais target: refactor `copy-object`, `copy-reader` logic !4997
- double transactions begin timeout; send-remote and `OWT` !4998
- target-to-target copy object: remove local-only option !4999
- URL query parameter (ref) !5000
- go-vet `xmeta` !5003
- intra-cluster PUT vs user PUT: further differentiate and account for !5004
- etl: add API to start ETL !5005
- error processing: wrap errors to retain the types (major) !5006, !5007
- tools: fix linting when `golangci-lint` is installed differently !5008
- `cmn` package: remove `LogLevel` and `Vmodule` fields from config !5009
- simplify error reporting and attribution (major ref) !5010
- lint-1.44.2 !5011
- error processing and attribution !5012
- keep-alive cluster nodes: major refactoring, cleanup, code reduction !5013
- `transport`: transmit/receive *unsized* objects resulting from streaming transformation !5017
- etl: delete operation must be controlled by primary gateway !5022
- `apc`: new package for API constants (major ref) !5026
- copy/transform bucket: add 'force' option !5029
- `apc` package: move `metadata-write` and refactor bucket-props validators !5032
- `apc` package: move access control (ref) !5033
- `apc` package: move action message (ref) !5034
- `apc` package: move `list-objects` control message !5035
- `apc` package: move copy/transform bucket message (ref) !5036
- common bucket structure (major) !5037, !5038
- fix: do not use `jogger-group` bucket inside individual jogger's `visit-object` callbacks !5048
- core: two distinct methods to initialize `LOM` (major) !5049
- backend bucket: fix returned AWS error to include details !5051
- assorted fixes !5052, !5053
- cloning `LOM` (metadata): follow-up !5065
- self-healing: object recovery in the GET path !5069
- assorted fixes: object recovery vs runtime metadata extension !5070
- avoid duplicate FQN parsing when traversing buckets and *visiting* content types !5073
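A recurring theme in this list is replacing generic `fmt.Errorf` strings with distinct error types and wrapping errors so that the type survives. As a minimal sketch of that general Go technique (the `ErrCorrupted` type and the `load`, `loadWithContext`, and `isCorrupted` helpers are hypothetical and do not mirror the actual AIS error types):

```go
package main

import (
	"errors"
	"fmt"
)

// ErrCorrupted is a hypothetical typed error carrying the object name.
type ErrCorrupted struct{ Object string }

func (e *ErrCorrupted) Error() string { return "corrupted metadata: " + e.Object }

// load is a stand-in for a metadata load that can fail in distinct ways.
func load(name string) error {
	switch name {
	case "":
		return errors.New("empty object name")
	case "bad":
		return &ErrCorrupted{Object: name}
	}
	return nil
}

// loadWithContext adds context with %w, which retains the wrapped type.
func loadWithContext(name string) error {
	if err := load(name); err != nil {
		return fmt.Errorf("load %q: %w", name, err)
	}
	return nil
}

// isCorrupted reports whether err is, or wraps, an *ErrCorrupted.
func isCorrupted(err error) bool {
	var ce *ErrCorrupted
	return errors.As(err, &ce)
}

func main() {
	fmt.Println(isCorrupted(loadWithContext("bad")))
	fmt.Println(isCorrupted(loadWithContext("")))
	fmt.Println(loadWithContext("ok") == nil)
}
```

Callers can then branch on the error type (for example, to trigger recovery only for corrupted metadata) instead of matching on message text.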