Dagster

Latest version: v1.10.8

Page 48 of 54

0.8.5

Not secure

**Breaking Changes**

- Python 3.5 is no longer under test.
- `Engine` and `ExecutorConfig` have been deleted in favor of `Executor`. The `executor` decorator should now decorate a function that returns an `Executor` rather than an `ExecutorConfig`.

**New**

- The Python built-in `dict` can be used as an alias for `Permissive()` within a config schema declaration.
- Use `StringSource` in the `S3ComputeLogManager` configuration schema to support using environment variables in the configuration (Thanks mrdrprofuroboros!)
- Improve Backfill CLI help text
- Add options to spark_df_output_schema (Thanks DavidKatz-il!)
- Helm: Added support for overriding the PostgreSQL image/version used in the init container checks.
- Update celery k8s helm chart to include liveness checks for celery workers and flower
- Support step level retries to celery k8s executor

**Bugfixes**

- Improve error message shown when a RepositoryDefinition returns objects that are not one of the allowed definition types (Thanks sd2k!)
- Show error message when `$DAGSTER_HOME` environment variable is not an absolute path (Thanks AndersonReyes!)
- Update default value for `staging_prefix` in the `DatabricksPySparkStepLauncher` configuration to be an absolute path (Thanks sd2k!)
- Improve error message shown when Databricks logs can't be retrieved (Thanks sd2k!)
- Fix errors in documentation for `input_hydration_config` (Thanks joeyfreund!)
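
The `$DAGSTER_HOME` check mentioned in the bugfix list above can be approximated in a few lines. This is a minimal sketch, not Dagster's actual implementation: the function name `validate_dagster_home` and the error wording are hypothetical.

```python
import os


def validate_dagster_home(environ=None):
    """Return the DAGSTER_HOME path, or raise if it is unset or relative.

    Hypothetical helper for illustration; Dagster's own check may differ.
    """
    environ = os.environ if environ is None else environ
    path = environ.get("DAGSTER_HOME")
    if not path:
        raise ValueError("DAGSTER_HOME is not set")
    if not os.path.isabs(path):
        raise ValueError(f"DAGSTER_HOME must be an absolute path, got {path!r}")
    return path
```

Calling it once at startup surfaces a misconfigured environment before any instance files are read or written.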

0.8.4

Not secure

**Bugfix**

- Reverted a change in 0.8.3 that caused an error during run launch in certain circumstances
- Updated partition graphs on schedule page to select most recent run
- Forced reload of partitions for partition sets to ensure not serving stale data

**New**

- Added reload button to dagit to reload current repository
- Added option to wipe a single asset key by using `dagster asset wipe <asset_key>`
- Simplified schedule page, removing ticks table, adding tags for last tick attempt
- Better debugging tools for launch errors

0.8.3

Not secure

**Breaking Changes**

- Previously, the `gcs_resource` returned a `GCSResource` wrapper which had a single `client` property that returned a `google.cloud.storage.client.Client`. Now, the `gcs_resource` returns the client directly.

To update solids that use the `gcs_resource`, change:

    context.resources.gcs.client

To:

    context.resources.gcs
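
Code that must run against both resource shapes during a migration could use a small shim like the one below. This is only a sketch: `get_gcs_client` and the two stand-in classes are hypothetical, and it assumes the bare client object has no attribute named `client`.

```python
def get_gcs_client(gcs_resource):
    """Return the storage client from either resource shape.

    Before 0.8.3 the resource was a wrapper exposing `.client`; from 0.8.3
    the resource is the client itself. Assumes the bare client has no
    attribute named `client`.
    """
    return getattr(gcs_resource, "client", gcs_resource)


class FakeClient:
    """Stand-in for google.cloud.storage.client.Client (illustration only)."""


class FakeWrapper:
    """Stand-in for the pre-0.8.3 GCSResource wrapper (illustration only)."""

    def __init__(self, client):
        self.client = client
```

Inside a solid, `get_gcs_client(context.resources.gcs)` would then return the client on either side of the upgrade.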

**New**

- Introduced a new Python API `reexecute_pipeline` to reexecute an existing pipeline run.
- Performance improvements in Pipeline Overview and other pages.
- Long metadata entries in the asset details view are now scrollable.
- Added a `project` field to the `gcs_resource` in `dagster_gcp`.
- Added new CLI command `dagster asset wipe` to remove all existing asset keys.

**Bugfix**

- Several Dagit bugfixes and performance improvements
- Fixes pipeline execution issue with custom run launchers that call `executeRunInProcess`.
- Updates `dagster schedule up` output to be repository location scoped

0.8.2

Not secure

**Bugfix**

- Fixes issues with `dagster instance migrate`.
- Fixes bug in `launch_scheduled_execution` that would mask configuration errors.
- Fixes bug in dagit where schedule related errors were not shown.
- Fixes JSON-serialization error in `dagster-k8s` when specifying per-step resources.

**New**

- Makes `label` optional parameter for materializations with `asset_key` specified.
- Changes `Assets` page to have a typeahead selector and hierarchical views based on asset_key path.
- _dagster-ssh_: Adds SFTP get and put functions to `SSHResource`, replacing `sftp_solid`.

**Docs**

- Various docs corrections

0.8.1

Not secure

**Bugfix**

- Fixed a file descriptor leak that caused `OSError: [Errno 24] Too many open files` when enough
temporary files were created.
- Fixed an issue where an empty config in the Playground would unexpectedly be marked as invalid
YAML.
- Removed "config" deprecation warnings for dask and celery executors.

**New**

- Improved performance of the Assets page.

0.8.0

Not secure

* _Workspace, host and user process separation, and repository definition_ Dagit and other tools no
longer load a single repository containing user definitions such as pipelines into the same
process as the framework code. Instead, they load a "workspace" that can contain multiple
repositories sourced from a variety of different external locations (e.g., Python modules and
Python virtualenvs, with containers and source control repositories soon to come).

The repositories in a workspace are loaded into their own "user" processes distinct from the
"host" framework process. Dagit and other tools now communicate with user code over an IPC
mechanism. This architectural change has a couple of advantages:

- Dagit no longer needs to be restarted when there is an update to user code.
- Users can use repositories to organize their pipelines, but still work on all of their
repositories using a single running Dagit.
- The Dagit process can now run in a separate Python environment from user code so pipeline
dependencies do not need to be installed into the Dagit environment.
- Each repository can be sourced from a separate Python virtualenv, so teams can manage their
dependencies (or even their own Python versions) separately.

We have introduced a new file format, `workspace.yaml`, in order to support this new architecture.
The workspace yaml encodes what repositories to load and their location, and supersedes the
`repository.yaml` file and associated machinery.

As a consequence, Dagster internals are now stricter about how pipelines are loaded. If you have
written scripts or tests in which a pipeline is defined and then passed across a process boundary
(e.g., using the `multiprocess_executor` or dagstermill), you may now need to wrap the pipeline
in the `reconstructable` utility function for it to be reconstructed across the process boundary.

In addition, rather than instantiate the `RepositoryDefinition` class directly, users should now
prefer the `repository` decorator. As part of this change, the `scheduler` and
`repository_partitions` decorators have been removed, and their functionality subsumed under
`repository`.

* _Dagit organization_ The Dagit interface has changed substantially and is now oriented around
pipelines. Within the context of each pipeline in an environment, the previous "Pipelines" and
"Solids" tabs have been collapsed into the "Definition" tab; a new "Overview" tab provides
summary information about the pipeline, its schedules, its assets, and recent runs; the previous
"Playground" tab has been moved within the context of an individual pipeline. Related runs (e.g.,
runs created by re-executing subsets of previous runs) are now grouped together in the Playground
for easy reference. Dagit also now includes more advanced support for display of scheduled runs
that may not have executed ("schedule ticks"), as well as longitudinal views over scheduled runs,
and asset-oriented views of historical pipeline runs.

* _Assets_ Assets are named materializations that can be generated by your pipeline solids, which
support specialized views in Dagit. For example, if we represent a database table with an asset
key, we can now index all of the pipelines and pipeline runs that materialize that table, and
view them in a single place. To use the asset system, you must enable an asset-aware storage such
as Postgres.

* _Run launchers_ The distinction between "starting" and "launching" a run has been effaced. All
pipeline runs instigated through Dagit now make use of the `RunLauncher` configured on the
Dagster instance, if one is configured. Additionally, run launchers can now support termination of
previously launched runs. If you have written your own run launcher, you may want to update it to
support termination. Note also that as of 0.7.9, the semantics of `RunLauncher.launch_run` have
changed; this method now takes the `run_id` of an existing run and should no longer attempt to
create the run in the instance.

* _Flexible reexecution_ Pipeline re-execution from Dagit is now fully flexible. You may
re-execute arbitrary subsets of a pipeline's execution steps, and the re-execution now appears
in the interface as a child run of the original execution.

* _Support for historical runs_ Snapshots of pipelines and other Dagster objects are now persisted
along with pipeline runs, so that historical runs can be loaded for review with the correct
execution plans even when pipeline code has changed. This prepares the system to be able to diff
pipeline runs and other objects against each other.

* _Step launchers and expanded support for PySpark on EMR and Databricks_ We've introduced a new
`StepLauncher` abstraction that uses the resource system to allow individual execution steps to
be run in separate processes (and thus on separate execution substrates). This has made extensive
improvements to our PySpark support possible, including the option to execute individual PySpark
steps on EMR using the `EmrPySparkStepLauncher` and on Databricks using the
`DatabricksPySparkStepLauncher`. The `emr_pyspark` example demonstrates how to use a step launcher.

* _Clearer names_ What was previously known as the environment dictionary is now called the
`run_config`, and the previous `environment_dict` argument to APIs such as `execute_pipeline` is
now deprecated. We renamed this argument to focus attention on the configuration of the run
being launched or executed, rather than on an ambiguous "environment". We've also renamed the
`config` argument to all use definitions to be `config_schema`, which should reduce ambiguity
between the configuration schema and the value being passed in some particular case. We've also
consolidated and improved documentation of the valid types for a config schema.
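
If your own code wraps APIs such as `execute_pipeline`, the rename can be absorbed in one place. The helper below is a hedged sketch of that idea: `resolve_run_config` is a hypothetical name, and this is not how Dagster itself handles the deprecated argument.

```python
import warnings


def resolve_run_config(run_config=None, environment_dict=None):
    """Accept the new `run_config` name while tolerating `environment_dict`.

    Hypothetical wrapper-side helper; returns the config dict to pass on.
    """
    if environment_dict is not None:
        if run_config is not None:
            raise ValueError("Pass run_config or environment_dict, not both")
        warnings.warn(
            "environment_dict is deprecated; use run_config",
            DeprecationWarning,
            stacklevel=2,
        )
        return environment_dict
    return {} if run_config is None else run_config
```

Callers that still pass `environment_dict` keep working but see a `DeprecationWarning`, which makes remaining call sites easy to find.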

* _Lakehouse_ We're pleased to introduce Lakehouse, an experimental, alternative programming model
for data applications, built on top of Dagster core. Lakehouse allows developers to define data
applications in terms of data assets, such as database tables or ML models, rather than in terms
of the computations that produce those assets. The `simple_lakehouse` example gives a taste of
what it's like to program in Lakehouse. We'd love feedback on whether this model is helpful!

* _Airflow ingest_ We've expanded the tooling available to teams with existing Airflow installations
that are interested in incrementally adopting Dagster. Previously, we provided only injection
tools that allowed developers to write Dagster pipelines and then compile them into Airflow DAGs
for execution. We've now added ingestion tools that allow teams to move to Dagster for execution
without having to rewrite all of their legacy pipelines in Dagster. In this approach, Airflow
DAGs are kept in their own container/environment, compiled into Dagster pipelines, and run via
the Dagster orchestrator. See the `airflow_ingest` example for details!

**Breaking Changes**

- _dagster_

- The `scheduler` and `repository_partitions` decorators have been removed. Instances of
`ScheduleDefinition` and `PartitionSetDefinition` belonging to a repository should be specified
using the `repository` decorator instead.
- Support for the Dagster solid selection DSL, previously introduced in Dagit, is now uniform
throughout the Python codebase, with the previous `solid_subset` arguments (`--solid-subset` in
the CLI) being replaced by `solid_selection` (`--solid-selection`). In addition to the names of
individual solids, this argument now supports selection queries like `*solid_name++` (i.e.,
`solid_name`, all of its ancestors, its immediate descendants, and their immediate descendants).
- The built-in Dagster type `Path` has been removed.
- `PartitionSetDefinition` names, including those defined by a `PartitionScheduleDefinition`,
must now be unique within a single repository.
- Asset keys are now sanitized for non-alphanumeric characters. All characters besides
alphanumerics and `_` are treated as path delimiters. Asset keys can also be specified using
`AssetKey`, which accepts a list of strings as an explicit path. If you are running 0.7.10 or
later and using assets, you may need to migrate your historical event log data for asset keys
from previous runs to be attributed correctly. This `event_log` data migration can be invoked
as follows:

    from dagster.core.storage.event_log.migration import migrate_event_log_data
    from dagster import DagsterInstance

    # Migrate historical asset-key events for the current instance.
    migrate_event_log_data(instance=DagsterInstance.get())

- The interface of the `Scheduler` base class has changed substantially. If you've written a
custom scheduler, please get in touch!
- The partitioned schedule decorators now generate `PartitionSetDefinition` names using
the schedule name, suffixed with `_partitions`.
- The `repository` property on `ScheduleExecutionContext` is no longer available. If you were
using this property to pass to `Scheduler` instance methods, this interface has changed
significantly. Please see the `Scheduler` class documentation for details.
- The CLI option `--celery-base-priority` is no longer available for the command
`dagster pipeline backfill`. Use the tags option to specify the Celery priority (e.g.,
`dagster pipeline backfill my_pipeline --tags '{ "dagster-celery/run_priority": 3 }'`).
- The `execute_partition_set` API has been removed.
- The deprecated `is_optional` parameter to `Field` and `OutputDefinition` has been removed.
Use `is_required` instead.
- The deprecated `runtime_type` property on `InputDefinition` and `OutputDefinition` has been
removed. Use `dagster_type` instead.
- The deprecated `has_runtime_type`, `runtime_type_named`, and `all_runtime_types` methods on
`PipelineDefinition` have been removed. Use `has_dagster_type`, `dagster_type_named`, and
`all_dagster_types` instead.
- The deprecated `all_runtime_types` method on `SolidDefinition` and `CompositeSolidDefinition`
has been removed. Use `all_dagster_types` instead.
- The deprecated `metadata` argument to `SolidDefinition` and `solid` has been removed. Use
`tags` instead.
- The graphviz-based DAG visualization in Dagster core has been removed. Please use Dagit!
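
The selection query described in this list (`*solid_name++`) can be illustrated with a small graph walk. This is a sketch under stated assumptions, not Dagster's parser: `select_solids` is a hypothetical function, the graph is a plain dict, and treating a trailing `*` and leading `+` symmetrically to the one example given is an assumption.

```python
import re


def select_solids(query, downstream):
    """Resolve one selection clause such as "*solid_name++".

    `downstream` maps each solid name to the names that consume its output.
    A leading "*" selects all ancestors and each leading "+" one more level
    of ancestors; trailing "*" and "+" do the same for descendants.
    """
    match = re.fullmatch(r"(\*|\+*)(\w+)(\*|\+*)", query)
    if match is None:
        raise ValueError(f"unsupported selection clause: {query!r}")
    up, name, down = match.groups()

    # Invert the edge map so ancestors can be walked the same way.
    upstream = {node: [] for node in downstream}
    for node, children in downstream.items():
        for child in children:
            upstream.setdefault(child, []).append(node)
    if name not in upstream:
        raise ValueError(f"unknown solid: {name!r}")

    def walk(edges, marker):
        depth = None if marker == "*" else len(marker)  # None = unlimited
        seen, frontier, level = set(), {name}, 0
        while frontier and (depth is None or level < depth):
            frontier = {n for f in frontier for n in edges.get(f, [])} - seen
            seen |= frontier
            level += 1
        return seen

    return {name} | walk(upstream, up) | walk(downstream, down)
```

For a linear chain a -> b -> c -> d -> e -> f, the clause `*c++` selects a, b, c, d, and e, but not f, which is three levels below c.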

- _dagit_

- `dagit-cli` has been removed, and `dagit` is now the only console entrypoint.

- _dagster-aws_

- The AWS CLI has been removed.
- `dagster_aws.EmrRunJobFlowSolidDefinition` has been removed.

- _dagster-bash_

- This package has been renamed to dagster-shell. The `bash_command_solid` and `bash_script_solid`
solid factory functions have been renamed to `create_shell_command_solid` and
`create_shell_script_solid`.

- _dagster-celery_

- The CLI option `--celery-base-priority` is no longer available for the command
`dagster pipeline backfill`. Use the tags option to specify the Celery priority (e.g.,
`dagster pipeline backfill my_pipeline --tags '{ "dagster-celery/run_priority": 3 }'`).

- _dagster-dask_

- The config schema for the `dagster_dask.dask_executor` has changed. The previous config should
now be nested under the key `local`.

- _dagster-gcp_

- The `BigQueryClient` has been removed. Use `bigquery_resource` instead.

- _dagster-dbt_

- The dagster-dbt package has been removed. This was inadequate as a reference integration, and
will be replaced in 0.8.x.

- _dagster-spark_

- `dagster_spark.SparkSolidDefinition` has been removed - use `create_spark_solid` instead.
- The `SparkRDD` Dagster type, which only worked with an in-memory engine, has been removed.

- _dagster-twilio_

- The `TwilioClient` has been removed. Use `twilio_resource` instead.

**New**

- _dagster_

- You may now set `asset_key` on any `Materialization` to use the new asset system. You will also
need to configure an asset-aware storage, such as Postgres. The `longitudinal_pipeline` example
demonstrates this system.
- The partitioned schedule decorators now support an optional `end_time`.
- Opt-in telemetry now reports the Python version being used.
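
As the breaking-changes notes for this release state, every character in a string asset key other than alphanumerics and `_` is treated as a path delimiter. The sketch below illustrates that rule only. `asset_key_path` is a hypothetical name, "alphanumerics" is assumed to mean ASCII letters and digits, and Dagster's own handling of edge cases may differ.

```python
import re

# Anything other than ASCII letters, digits, and "_" acts as a path delimiter.
_DELIMITERS = re.compile(r"[^A-Za-z0-9_]+")


def asset_key_path(raw_key):
    """Split a string asset key into its path segments, dropping empty ones."""
    return [part for part in _DELIMITERS.split(raw_key) if part]
```

Under this rule a key such as `warehouse.sales/daily_totals` becomes the three-segment path `warehouse`, `sales`, `daily_totals`, which is why keys written before the change may need the event log migration described above.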

- _dagit_

- Dagit's GraphQL playground is now available at `/graphiql` as well as at `/graphql`.

- _dagster-aws_

- The `dagster_aws.S3ComputeLogManager` may now be configured to override the S3 endpoint and
associated SSL settings.
- Config string and integer values in the S3 tooling may now be set using either environment
variables or literals.

- _dagster-azure_

- We've added the dagster-azure package, with support for Azure Data Lake Storage Gen2; you can
use the `adls2_system_storage` or, for direct access, the `adls2_resource` resource. (Thanks
sd2k!)

- _dagster-dask_

- Dask clusters are now supported by `dagster_dask.dask_executor`. For full support, you will need
to install extras with `pip install "dagster-dask[yarn,pbs,kube]"`. (Thanks DavidKatz-il!)

- _dagster-databricks_

- We've added the dagster-databricks package, with support for running PySpark steps on Databricks
clusters through the `databricks_pyspark_step_launcher`. (Thanks sd2k!)

- _dagster-gcp_

- Config string and integer values in the BigQuery, Dataproc, and GCS tooling may now be set
using either environment variables or literals.

- _dagster-k8s_

- Added the `CeleryK8sRunLauncher` to submit execution plan steps to Celery task queues for
execution as k8s Jobs.
- Added the ability to specify resource limits on a per-pipeline and per-step basis for k8s Jobs.
- Many improvements and bug fixes to the dagster-k8s Helm chart.

- _dagster-pandas_

- Config string and integer values in the dagster-pandas input and output schemas may now be set
using either environment variables or literals.

- _dagster-papertrail_

- Config string and integer values in the `papertrail_logger` may now be set using either
environment variables or literals.

- _dagster-pyspark_

- PySpark solids can now run on EMR, using the `emr_pyspark_step_launcher`, or on Databricks using
the new dagster-databricks package. The `emr_pyspark` example demonstrates how to use a step
launcher.

- _dagster-snowflake_

- Config string and integer values in the `snowflake_resource` may now be set using either
environment variables or literals.

- _dagster-spark_

- `dagster_spark.create_spark_solid` now accepts a `required_resource_keys` argument, which
enables setting up a step launcher for Spark solids, like the `emr_pyspark_step_launcher`.

**Bugfix**

- `dagster pipeline execute` now sets a non-zero exit code when pipeline execution fails.
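
With this fix, scripts and CI jobs can rely on the exit code of `dagster pipeline execute`. The wrapper below is a generic sketch of that pattern. `command_succeeded` is a hypothetical helper, and the arguments you pass after `dagster pipeline execute` depend on your own setup.

```python
import subprocess


def command_succeeded(command):
    """Run a command (a list of arguments) and report whether it exited with 0."""
    completed = subprocess.run(command)
    return completed.returncode == 0
```

A CI step could call this with a list starting `["dagster", "pipeline", "execute", ...]` and fail the build when it returns `False`.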
