Metaflow

Latest version: v2.13


Page 15 of 28

2.9.0

- Features
- [Introduce support for composing multiple interrelated workflows through external events](#features.1)

Features
<a id='features.1'></a>Introduce support for composing multiple interrelated workflows through external events
With this release, Metaflow users can architect sequences of workflows that conduct data across teams, all the way from ETL and data warehouse to final ML outputs. Detailed documentation and a blog post to follow very shortly! Keep watching this space.

In case you need any assistance or have feedback for us, ping us at [chat.metaflow.org](http://chat.metaflow.org) or open a GitHub issue.

---

What's Changed
* feature: add argo events environment variables to `metaflow configure kubernetes` by saikonen in https://github.com/Netflix/metaflow/pull/1405
* handle whitespaces in argo events parameters by savingoyal in https://github.com/Netflix/metaflow/pull/1408
* Add back comment for argo workflows by savingoyal in https://github.com/Netflix/metaflow/pull/1409
* Support ArgoEvent object with kubernetes by savingoyal in https://github.com/Netflix/metaflow/pull/1410
* Print workflow template location as part of argo-workflows create by savingoyal in https://github.com/Netflix/metaflow/pull/1411

**Full Changelog**: https://github.com/Netflix/metaflow/compare/2.8.6...2.9.0

2.8.6

- Features
- [Introduce support for persistent volume claims for executions on Kubernetes](#features.1)

Features
<a id='features.1'></a>Introduce support for persistent volume claims for executions on Kubernetes
With this release, Metaflow users can attach existing [persistent volume claims](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#persistentvolumeclaims) to Metaflow tasks running on a Kubernetes cluster.

To use this functionality, list each persistent volume claim and its mount point in the _persistent_volume_claims_ argument of the _@kubernetes_ decorator: `@kubernetes(persistent_volume_claims={"pvc-claim-name": "mount-point", "another-pvc-claim-name": "another-mount-point"})`.

Here is an example:

    from metaflow import FlowSpec, step, kubernetes, current
    import os

    class MountPVCFlow(FlowSpec):

        # Mount the existing claim "test-pvc-feature-claim" at /mnt/testvol
        @kubernetes(persistent_volume_claims={"test-pvc-feature-claim": "/mnt/testvol"})
        @step
        def start(self):
            print('testing PVC')
            mount = "/mnt/testvol"
            file = f"zeros_run_{current.run_id}"
            # Write 50 zero bytes to a run-specific file on the mounted volume
            with open(os.path.join(mount, file), "w+") as f:
                f.write("\0" * 50)
                f.flush()

            print(f"mount folder contents: {os.listdir(mount)}")
            self.next(self.end)

        @step
        def end(self):
            print("finished")

    if __name__ == "__main__":
        MountPVCFlow()

In case you need any assistance or have feedback for us, ping us at [chat.metaflow.org](http://chat.metaflow.org) or open a GitHub issue.

What's Changed
* handle bools properly for argo-workflows task runtime cli by savingoyal in https://github.com/Netflix/metaflow/pull/1395
* fix: migrate R support to use importlib by saikonen in https://github.com/Netflix/metaflow/pull/1396
* Add configuration of username from metaflow_config.py by tfurmston in https://github.com/Netflix/metaflow/pull/1400
* feature: add Kubernetes support for PVC mounts by saikonen in https://github.com/Netflix/metaflow/pull/1402
* Update version to 2.8.6 by savingoyal in https://github.com/Netflix/metaflow/pull/1404

**Full Changelog**: https://github.com/Netflix/metaflow/compare/2.8.5...2.8.6

2.8.5

- Improvements
- [Make pickled Metaflow client objects accessible across namespaces](#improvements.1)

Improvements
<a id='improvements.1'></a>Make pickled Metaflow client objects accessible across namespaces
The [previous release](https://github.com/Netflix/metaflow/releases/tag/2.8.4) resulted in disabling a sequence of user operations that worked previously:
1. Pickle a Metaflow object
2. Access this Metaflow object in a different namespace
3. Access a child or parent object of this object

This release restores the previous behavior.

In case you need any assistance or have feedback for us, ping us at [chat.metaflow.org](http://chat.metaflow.org) or open a GitHub issue.

What's Changed
* feature: add sanitization for batch tags by saikonen in https://github.com/Netflix/metaflow/pull/1376
* fix: make metaflow config aware of profile environment variable by saikonen in https://github.com/Netflix/metaflow/pull/1391
* Fix an issue introduced in 2.8.4 that prevented pickled MetaflowObjec… by romain-intel in https://github.com/Netflix/metaflow/pull/1392
* Updating version to 2.8.5 by pjoshi30 in https://github.com/Netflix/metaflow/pull/1393

**Full Changelog**: https://github.com/Netflix/metaflow/compare/2.8.4...2.8.5

2.8.4

New Contributors
* dhpollack made their first contribution in https://github.com/Netflix/metaflow/pull/1236
* felipeGarciaDiaz made their first contribution in https://github.com/Netflix/metaflow/pull/1383

**Full Changelog**: https://github.com/Netflix/metaflow/compare/2.8.3...2.8.4

2.8.3

- Features
- [Introduce support for tmpfs for executions on AWS Batch](#features.1)
- [Introduce auto-completion support for metaflow client in ipython notebooks](#features.2)

- Improvements
- [Reduce metadata service network calls for faster execution of flows](#improvements.1)
- [Handle unsupported data types for pandas.DataFrame gracefully for Metaflow's _default_ card](#improvements.2)

Features
<a id='features.1'></a>Introduce support for tmpfs for executions on AWS Batch
It is typical for the user code in a Metaflow step to download assets from an object store, e.g. S3. Examples include serialized models and raw input data, such as unstructured media or structured Parquet files. The amount of data loaded in a task is typically 10-100GB, allowing even terabytes to be handled in a [_foreach_](https://docs.metaflow.org/metaflow/basics#foreach).

To reduce IO bottlenecks in such tasks, we provide an optimized client for S3, [_metaflow.S3_](https://docs.metaflow.org/scaling/data#data-in-s3-metaflows3), which makes it possible to download data using all available network bandwidth. Notably, in a modern instance the available network bandwidth can be higher than the local disk bandwidth. Consider: SATA 3.0 provides 6Gbit/s whereas a large instance can have 20Gbit/s network throughput. Even Gen3 NVMe provides just 16Gbit/s. To benefit from the full network bandwidth, local disk IO must be bypassed. The _metaflow.S3_ client accomplishes this by relying on the page cache: nominally, files are downloaded to a temporary directory on disk, but in practice all data stays in the page cache. This assumes that the downloaded data fits in memory, which can be ensured by a high enough _@resources(memory=)_ setting.
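As a rough back-of-the-envelope illustration of these figures, the sketch below estimates how long moving 100GB would take at each quoted bandwidth. It assumes ideal sustained throughput and decimal units (1GB = 8Gbit), which real systems rarely reach; the function and dictionary names are made up for this example.

```python
# Back-of-the-envelope transfer times for the bandwidth figures quoted above.
# Assumes ideal sustained throughput and decimal units (1 GB = 8 Gbit).

def transfer_seconds(size_gb: float, bandwidth_gbit_s: float) -> float:
    """Seconds needed to move size_gb gigabytes at bandwidth_gbit_s Gbit/s."""
    return size_gb * 8 / bandwidth_gbit_s


# Figures as quoted in the text, in Gbit/s.
BANDWIDTHS_GBIT_S = {
    "SATA 3.0 disk": 6,
    "Gen3 NVMe disk": 16,
    "large-instance network": 20,
}

if __name__ == "__main__":
    for name, bandwidth in BANDWIDTHS_GBIT_S.items():
        seconds = transfer_seconds(100, bandwidth)
        print(f"{name}: about {seconds:.0f} s for 100 GB")
```

Under these assumptions the network path finishes in well under a third of the time of a SATA-bound download, which is the gap the page-cache approach is meant to exploit.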

The above setup, which can provide excellent IO performance in general, has a small gotcha: the instance needs to have enough local disk space to back all the data, although no data actually hits the disk. Increasingly, instances may have more memory than local disk space available, so this superfluous requirement becomes a problem. The issue is further amplified by the fact that, as of today, it is impossible to add ephemeral volumes on the fly on AWS Batch. This puts users in a strange situation: the instance has enough RAM to hold all the data in memory, and there are ways to download it quickly from S3, but the lack of local disk space (which is not even needed) makes it impossible to access the data.

AWS Batch supports [mounting a tmpfs filesystem](https://docs.aws.amazon.com/batch/latest/APIReference/API_Tmpfs.html) on the fly. Using this feature, the user can create a memory-backed file system that serves as temporary space for downloaded data. This removes the need to deal with any local disks. One can simply use a minimal root filesystem, which greatly simplifies the infrastructure setup.

With this release, we introduce a new config option, _METAFLOW_TEMPDIR_, which, if defined, is used as the default _metaflow.S3(tmproot)_. If _METAFLOW_TEMPDIR_ is not defined, _tmproot='.'_ as before. In addition, a few new attributes are introduced for the _@batch_ decorator:

| Attribute (default) | Default behavior | Override semantics |
| :--- | :--- | :--- |
| _use_tmpfs=False_ | _tmpfs_ disabled | _use_tmpfs=True_ enables _tmpfs_ |
| _tmpfs_tempdir=True_ | sets _METAFLOW_TEMPDIR=tmpfs_path_ | _tmpfs_tempdir=False_ doesn't set _METAFLOW_TEMPDIR_ |
| _tmpfs_size=None_ | sets _tmpfs_ size to 50% of _resources(memory)_ | _tmpfs_ size in megabytes |
| _tmpfs_path=None_ | use _/metaflow_temp_ as _tmpfs_path_ | custom mount point |
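The _METAFLOW_TEMPDIR_ fallback described before the table can be pictured with a minimal sketch. It assumes the option is read from an environment variable of the same name; `default_tmproot` is a hypothetical helper written for illustration, not a Metaflow function.

```python
import os


def default_tmproot(environ=None) -> str:
    """Return METAFLOW_TEMPDIR if it is defined, otherwise '.' as before."""
    # Accept an explicit mapping so the behavior is easy to check in isolation;
    # fall back to the real process environment when none is given.
    env = os.environ if environ is None else environ
    return env.get("METAFLOW_TEMPDIR", ".")


if __name__ == "__main__":
    print(default_tmproot({}))
    print(default_tmproot({"METAFLOW_TEMPDIR": "/metaflow_temp"}))
```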

Examples

Handle large amounts of data in-memory with Batch:

    @batch(memory=100000, use_tmpfs=True)

In this case, at most 50GB is available for tmpfs and it is used by _metaflow.S3_ by default. Note that tmpfs only consumes the amount of memory corresponding to the data stored, so there is no downside in setting a large size by default.

Increase tmpfs size:

    @batch(memory=100000, tmpfs_size=100000)

This lets tmpfs use all available memory. Note that _use_tmpfs=True_ doesn't have to be specified redundantly.

Custom tmpfs use case:

    @batch(memory=100000, tmpfs_size=10000, tmpfs_path='/data', tmpfs_tempdir=False)

This gives full control over the settings; _metaflow.S3_ doesn't use the tmpfs volume in this case.
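To make the attribute semantics concrete, here is a small sketch that resolves the effective tmpfs settings from the decorator arguments, following the table and the three examples. `resolve_tmpfs` and the shape of its return value are hypothetical; this mirrors the documented behavior only and is not Metaflow's implementation.

```python
def resolve_tmpfs(memory_mb, use_tmpfs=False, tmpfs_tempdir=True,
                  tmpfs_size=None, tmpfs_path=None):
    """Resolve effective tmpfs settings (illustrative, not Metaflow code)."""
    # Per the examples, giving tmpfs_size enables tmpfs without use_tmpfs=True.
    enabled = use_tmpfs or tmpfs_size is not None
    if not enabled:
        return {"enabled": False}

    # Default size is 50% of the requested memory, expressed in megabytes.
    size_mb = tmpfs_size if tmpfs_size is not None else memory_mb // 2
    # Default mount point is /metaflow_temp.
    path = tmpfs_path if tmpfs_path is not None else "/metaflow_temp"
    # METAFLOW_TEMPDIR points at the mount unless tmpfs_tempdir=False.
    env = {"METAFLOW_TEMPDIR": path} if tmpfs_tempdir else {}
    return {"enabled": True, "size_mb": size_mb, "path": path, "env": env}


if __name__ == "__main__":
    print(resolve_tmpfs(100000, use_tmpfs=True))
    print(resolve_tmpfs(100000, tmpfs_size=100000))
    print(resolve_tmpfs(100000, tmpfs_size=10000, tmpfs_path="/data",
                        tmpfs_tempdir=False))
```

The three calls correspond to the three examples: the default 50GB volume, a volume sized to all requested memory, and a custom mount that _metaflow.S3_ does not use.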

Besides _metaflow.S3_, the user may want to use the tmpfs volume for their own use cases. In particular, many modern ML libraries require a local cache. To support these use cases, tmpfs_path is exposed through the current object, as _current.tempdir_.
This allows the user to leverage the volume straightforwardly:

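    from metaflow import current
    from transformers import AutoModelForSeq2SeqLM

    # model_path is assumed to be defined elsewhere in the step
    AutoModelForSeq2SeqLM.from_pretrained(
        model_path,
        cache_dir=current.tempdir,  # cache model files on the tmpfs volume
        device_map='auto',
        load_in_8bit=True,
    )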

<a id='features.2'></a>Introduce auto-completion support for metaflow client in ipython notebooks
With this release, Metaflow client objects support autocomplete in IPython notebooks:

    from metaflow import Flow, Metaflow

    Metaflow().flows
    >>> [Flow('HelloFlow'), Flow('MovieStatsFlow')]

    flow = Flow('HelloFlow')  # No autocomplete here
    flow._ipython_key_completions_()
    >>>
    ['1680815181013681',
     '1680815178214737',
     '1680432265121345',
     '1680430310127401']

    run = flow["1680815178214737"]
    run._ipython_key_completions_()
    >>> ['end', 'hello', 'start']

    step = run["hello"]
    step._ipython_key_completions_()
    >>> ['2']

    task = step["2"]
    task._ipython_key_completions_()
    >>> ['name']
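For readers unfamiliar with the hook used above, `_ipython_key_completions_` is the method IPython looks for when completing `obj["<TAB>`. The sketch below shows the protocol on a stand-alone container; `RunIndex` is a hypothetical class for illustration and not part of the Metaflow client.

```python
class RunIndex:
    """Hypothetical dict-like container (not a Metaflow class)."""

    def __init__(self, children):
        self._children = dict(children)

    def __getitem__(self, key):
        # Subscript access, e.g. index["hello"]
        return self._children[key]

    def _ipython_key_completions_(self):
        # IPython calls this hook to suggest keys when completing index["<TAB>
        return sorted(self._children)


if __name__ == "__main__":
    index = RunIndex({"start": 1, "hello": 2, "end": 3})
    print(index._ipython_key_completions_())
```

Any object that defines `__getitem__` together with this hook gets key completion in IPython and Jupyter without further configuration.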

Improvements
<a id='improvements.1'></a>Reduce metadata service network calls for faster execution of flows
With this release, Metaflow flows should execute slightly faster, since a few network calls to Metaflow's metadata service are now cached. Expect further improvements in flow execution times over the next few releases.
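The general idea of caching repeated service lookups can be sketched as follows. This is only an illustration of the technique using `functools.lru_cache`; the release note does not say how Metaflow implements its cache, and `_fetch_from_service` is a hypothetical stand-in for a network round trip.

```python
from functools import lru_cache

# Counts how many simulated round trips were actually made.
CALLS = {"count": 0}


def _fetch_from_service(key: str) -> str:
    """Stand-in for a network call to a metadata service (hypothetical)."""
    CALLS["count"] += 1
    return f"value-for-{key}"


@lru_cache(maxsize=None)
def cached_lookup(key: str) -> str:
    """Return the value for key, hitting the service only on the first request."""
    return _fetch_from_service(key)


if __name__ == "__main__":
    cached_lookup("version")
    cached_lookup("version")  # served from the cache, no second round trip
    print(CALLS["count"])
```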

<a id='improvements.2'></a>Handle unsupported data types for pandas.DataFrame gracefully for Metaflow's _default_ card
With this release, Metaflow card creation will handle non-JSON parseable types gracefully by replacing the column values with `UnsupportedType : <TYPENAME>`.
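A minimal sketch of this kind of graceful fallback, using only the standard library, is shown below. The helper names are hypothetical and the check (attempting `json.dumps`) is an assumption about what "non-JSON parseable" means here; only the placeholder format is taken from the note above.

```python
import json


def json_safe(value):
    """Return value if it is JSON-serializable, else a placeholder string."""
    try:
        json.dumps(value)
        return value
    except (TypeError, ValueError):
        # Placeholder format taken from the release note.
        return f"UnsupportedType : {type(value).__name__}"


def sanitize_column(values):
    """Apply json_safe to every cell of a column given as a list."""
    return [json_safe(v) for v in values]


if __name__ == "__main__":
    print(sanitize_column([1, "a", {1, 2}, complex(1, 2)]))
```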

In case you need any assistance or have feedback for us, ping us at [chat.metaflow.org](http://chat.metaflow.org) or open a GitHub issue.

What's Changed
* Introduce codeql by savingoyal in https://github.com/Netflix/metaflow/pull/1272
* fix: GitHub Workflow security recommendations by saikonen in https://github.com/Netflix/metaflow/pull/1334
* Add docstring style to contribution code style guide by jimbudarz in https://github.com/Netflix/metaflow/pull/1328
* remove METAFLOW_DATATOOLS_SYSROOT_S3 from configuration command by tfurmston in https://github.com/Netflix/metaflow/pull/1312
* Fix #1326 and strips ext_info from blobs passed to schedulers by romain-intel in https://github.com/Netflix/metaflow/pull/1329
* Namespace check skip feature from #1271 by romain-intel in https://github.com/Netflix/metaflow/pull/1341
* Introduce tmpfs config options for batch by savingoyal in https://github.com/Netflix/metaflow/pull/1287
* fix: kubernetes ec2 instance metadata timeout by saikonen in https://github.com/Netflix/metaflow/pull/1335
* Make the contact information displayed by the Metaflow command configurable by romain-intel in https://github.com/Netflix/metaflow/pull/1340
* Safely parse `pandas.DataFrame` for `default` card by valayDave in https://github.com/Netflix/metaflow/pull/1344
* Reduce multiple metadata service rtts using cached version. by shrinandj in https://github.com/Netflix/metaflow/pull/1347
* Kubernetes running job cancellation to fallback to patching parallelism by jackie-ob in https://github.com/Netflix/metaflow/pull/1353
* Remove encoding for JSON.loads by wangchy27 in https://github.com/Netflix/metaflow/pull/1352
* Prep for 2.8.3 release by savingoyal in https://github.com/Netflix/metaflow/pull/1354

New Contributors
* wangchy27 made their first contribution in https://github.com/Netflix/metaflow/pull/1352

**Full Changelog**: https://github.com/Netflix/metaflow/compare/2.8.2...2.8.3

2.8.2

- Features
- [Introduce support for Metaflow sandboxes for Metaflow tutorials](#features.1)
- [Display Metaflow UI URL on the terminal when a flow is executed via `step-functions trigger` or `argo-workflows trigger`](#features.2)

Features
<a id='features.1'></a>Introduce support for Metaflow sandboxes for Metaflow tutorials
With this release, the Metaflow tutorials can now be executed within the [Metaflow sandboxes](https://outerbounds.com/sandbox/), making it trivial to evaluate whether Metaflow is a good fit for your organization without committing to deploying the necessary cloud infrastructure upfront.

<a id='features.2'></a>Display Metaflow UI URL on the terminal when a flow is executed via `step-functions trigger` or `argo-workflows trigger`
With this release, if the Metaflow config (in `~/.metaflowconfig`) includes a reference to the deployed Metaflow UI (assigned to `METAFLOW_UI_URL`), the user-facing logs in the terminal will include a direct link to the relevant `run view` in the Metaflow UI.

![image (6)](https://user-images.githubusercontent.com/763451/228386064-e0c28fe5-06d0-436d-9044-ea1eef4a7c76.png)
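As an illustration of how such a link might be assembled, the sketch below builds a run URL from `METAFLOW_UI_URL`. The `<base>/<flow>/<run_id>` path layout and reading the setting from the environment are assumptions made for this example, and `run_view_url` is a hypothetical helper, not a Metaflow function.

```python
import os


def run_view_url(flow_name, run_id, environ=None):
    """Build a link to a run view, or return None if no UI URL is configured."""
    env = os.environ if environ is None else environ
    base = env.get("METAFLOW_UI_URL")
    if not base:
        # No UI configured: nothing to print.
        return None
    # Assumed path layout: <base>/<flow name>/<run id>
    return f"{base.rstrip('/')}/{flow_name}/{run_id}"


if __name__ == "__main__":
    config = {"METAFLOW_UI_URL": "https://ui.example.com/"}
    print(run_view_url("HelloFlow", "123", config))
```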

In case you need any assistance or have feedback for us, ping us at [chat.metaflow.org](http://chat.metaflow.org) or open a GitHub issue.

What's Changed
* Add a way to create aliases to other parts of metaflow by romain-intel in https://github.com/Netflix/metaflow/pull/1304
* feature: emit UI url for argo workflows and step-functions by saikonen in https://github.com/Netflix/metaflow/pull/1311
* fix: update cards dependencies by saikonen in https://github.com/Netflix/metaflow/pull/1314
* Sync tutorials for Outerbounds sandbox by emattia in https://github.com/Netflix/metaflow/pull/1299
* Fix the `logs` command in cases where the step/task hasn't finished by romain-intel in https://github.com/Netflix/metaflow/pull/1315
* Update version to 2.8.2 by savingoyal in https://github.com/Netflix/metaflow/pull/1325

New Contributors
* emattia made their first contribution in https://github.com/Netflix/metaflow/pull/1299

**Full Changelog**: https://github.com/Netflix/metaflow/compare/2.8.1...2.8.2
