apache-beam

Latest version: v2.56.0

Page 2 of 7

2.52.0

Not secure
Highlights

* Previously deprecated Avro-dependent code (Beam Release 2.46.0) has finally been removed from the Java SDK "core" package.
Please use `beam-sdks-java-extensions-avro` instead. This allows easily updating the Avro version in user code without
potential breaking changes in Beam "core", since the Beam Avro extension already supports the latest Avro versions and
should handle this. ([25252](https://github.com/apache/beam/issues/25252)).
* Publishing Java 21 SDK container images is now supported as part of the Apache Beam release process. ([28120](https://github.com/apache/beam/issues/28120))
* The Direct Runner and Dataflow Runner support running pipelines on Java 21 (experimental until tests are fully set up). For other runners (Flink, Spark, Samza, etc.), support status depends on the runner projects.

New Features / Improvements

* Added a `UseDataStreamForBatch` pipeline option to the Flink runner. When it is set to true, the Flink runner will run batch
jobs using the DataStream API. By default the option is set to false, so batch jobs are still executed
using the DataSet API.
* `upload_graph` as one of the Experiments options for DataflowRunner is no longer required when the graph is larger than 10MB for Java SDK ([PR28621](https://github.com/apache/beam/pull/28621)).
* Introduced a pipeline option `--max_cache_memory_usage_mb` to configure the state and side-input cache size. The cache is enabled with a default size of 100 MB. Use `--max_cache_memory_usage_mb=X` to set the cache size for the user state API and side inputs. ([28770](https://github.com/apache/beam/issues/28770)).
* Beam YAML stable release. Beam pipelines can now be written using YAML, leveraging the Beam YAML framework, which includes a preliminary set of IOs and turnkey transforms. More information can be found in the YAML root folder and in the [README](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/yaml/README.md).
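As a hedged sketch of the new YAML authoring style (transform types and config keys follow the linked README; the file paths and filter expression here are hypothetical), a minimal pipeline might look like:

```yaml
# Minimal Beam YAML pipeline sketch: read a CSV, keep matching rows, write JSON.
# Paths and the filter expression are placeholders.
pipeline:
  transforms:
    - type: ReadFromCsv
      config:
        path: /path/to/input.csv
    - type: Filter
      config:
        language: python
        keep: "score > 100"
      input: ReadFromCsv
    - type: WriteToJson
      config:
        path: /path/to/output.json
      input: Filter
```

Such a file can then be run through the Beam YAML entry point in the Python SDK, as described in the README.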


Breaking Changes

* `org.apache.beam.sdk.io.CountingSource.CounterMark` uses a custom `CounterMarkCoder` as its default coder, since all Avro-dependent
classes have finally moved to `extensions/avro`. If it is still required to use `AvroCoder` for `CounterMark`, then,
as a workaround, a copy of the "old" `CountingSource` class should be placed into the project code and used directly
([25252](https://github.com/apache/beam/issues/25252)).
* Renamed `host` to `firestoreHost` in `FirestoreOptions` to avoid potential conflict of command line arguments (Java) ([29201](https://github.com/apache/beam/pull/29201)).

Bugfixes

* Fixed "Desired bundle size 0 bytes must be greater than 0" in Java SDK's BigtableIO.BigtableSource when you have more cores than bytes to read (Java) [28793](https://github.com/apache/beam/issues/28793).
* The `watch_file_pattern` arg of [RunInference](https://github.com/apache/beam/blob/104c10b3ee536a9a3ea52b4dbf62d86b669da5d9/sdks/python/apache_beam/ml/inference/base.py#L997) had no effect prior to 2.52.0. To get the intended behavior of `watch_file_pattern`, follow the documentation at https://beam.apache.org/documentation/ml/side-input-updates/ and use the `WatchFilePattern` PTransform as a side input. ([#28948](https://github.com/apache/beam/pulls/28948))
* `MLTransform` no longer outputs artifacts such as min, max, and quantiles. Instead, `MLTransform` will add a feature to output these artifacts in a human-readable format - [29017](https://github.com/apache/beam/issues/29017). For now, to use artifacts such as min and max that were produced by an earlier `MLTransform`, use `read_artifact_location` of `MLTransform`, which reads artifacts that were produced earlier in a different `MLTransform` ([#29016](https://github.com/apache/beam/pull/29016/))
* Fixed a memory leak, which affected some long-running Python pipelines: [28246](https://github.com/apache/beam/issues/28246).

Security Fixes
* Fixed [CVE-2023-39325](https://www.cve.org/CVERecord?id=CVE-2023-39325) (Java/Python/Go) ([#29118](https://github.com/apache/beam/issues/29118)).
* Mitigated [CVE-2023-47248](https://nvd.nist.gov/vuln/detail/CVE-2023-47248) (Python) [#29392](https://github.com/apache/beam/issues/29392).

Known issues

* MLTransform drops identical elements in the output PCollection: for any duplicate elements, only a single element is emitted downstream. ([29600](https://github.com/apache/beam/issues/29600)).
* Some Python pipelines that run with 2.52.0-2.54.0 SDKs and use large materialized side inputs might be affected by a performance regression. To restore the prior behavior on these SDK versions, supply the `--max_cache_memory_usage_mb=0` pipeline option. (Python) ([30360](https://github.com/apache/beam/issues/30360)).
* Users who launch Python pipelines in an environment without internet access and use the `--setup_file` pipeline option might experience an increase in pipeline submission time. This has been fixed in 2.56.0 ([31070](https://github.com/apache/beam/pull/31070)).

2.51.0

Not secure
New Features / Improvements

* In Python, [RunInference](https://beam.apache.org/documentation/sdks/python-machine-learning/#why-use-the-runinference-api) now supports loading many models in the same transform using a [KeyedModelHandler](https://beam.apache.org/documentation/sdks/python-machine-learning/#use-a-keyed-modelhandler) ([27628](https://github.com/apache/beam/issues/27628)).
* In Python, the [VertexAIModelHandlerJSON](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.inference.vertex_ai_inference.html#apache_beam.ml.inference.vertex_ai_inference.VertexAIModelHandlerJSON) now supports passing in inference_args. These will be passed through to the Vertex endpoint as parameters.
* Added support to run `mypy` on user pipelines ([27906](https://github.com/apache/beam/issues/27906))
* Python SDK worker start-up logs and crash logs are now captured by a buffer and logged at the appropriate levels via the Beam logging API. Dataflow Runner users might observe that most `worker-startup` log content is now captured by the `worker` logger. Users who relied on `print()` statements for logging might notice that some logs don't flush before the pipeline completes; we strongly advise using the `logging` package instead of `print()` statements. ([28317](https://github.com/apache/beam/pull/28317))
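A minimal sketch of the logging recommendation above (the function and message are illustrative, not Beam API): emit records through the standard `logging` module so the Beam logging API can capture and route them at the right level, rather than writing to stdout with `print()`.

```python
import logging

# Module-level logger: records emitted this way are captured by Beam's
# logging integration and routed at the configured level, unlike print(),
# whose stdout output may not flush before the pipeline completes.
logger = logging.getLogger(__name__)

def process_element(element):
    # Logged at INFO level; surfaces in the runner's worker logs.
    logger.info("processing element: %s", element)
    return element
```

The same pattern applies inside DoFns: create the logger at module scope and call its level-specific methods from `process`.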


Breaking Changes

* Removed the fastjson library dependency for Beam SQL. The table property is now based on the Jackson ObjectNode (Java) ([24154](https://github.com/apache/beam/issues/24154)).
* Removed TensorFlow from Beam Python container images [PR](https://github.com/apache/beam/pull/28424). If you have been negatively affected by this change, please comment on [#20605](https://github.com/apache/beam/issues/20605).
* Removed the parameter `t reflect.Type` from `parquetio.Write`. The element type is derived from the input PCollection (Go) ([28490](https://github.com/apache/beam/issues/28490))
* Refactored `BeamSqlSeekableTable.setUp`, adding a `joinSubsetType` parameter. ([28283](https://github.com/apache/beam/issues/28283))


Bugfixes

* Fixed exception chaining issue in GCS connector (Python) ([26769](https://github.com/apache/beam/issues/26769#issuecomment-1700422615)).
* Fixed streaming inserts exception handling, GoogleAPICallErrors are now retried according to retry strategy and routed to failed rows where appropriate rather than causing a pipeline error (Python) ([21080](https://github.com/apache/beam/issues/21080)).
* Fixed a bug in Python SDK's cross-language Bigtable sink that mishandled records that don't have an explicit timestamp set: [28632](https://github.com/apache/beam/issues/28632).


Security Fixes
* Python containers updated, fixing [CVE-2021-30474](https://nvd.nist.gov/vuln/detail/CVE-2021-30474), [CVE-2021-30475](https://nvd.nist.gov/vuln/detail/CVE-2021-30475), [CVE-2021-30473](https://nvd.nist.gov/vuln/detail/CVE-2021-30473), [CVE-2020-36133](https://nvd.nist.gov/vuln/detail/CVE-2020-36133), [CVE-2020-36131](https://nvd.nist.gov/vuln/detail/CVE-2020-36131), [CVE-2020-36130](https://nvd.nist.gov/vuln/detail/CVE-2020-36130), and [CVE-2020-36135](https://nvd.nist.gov/vuln/detail/CVE-2020-36135)
* Used go 1.21.1 to build, fixing [CVE-2023-39320](https://security-tracker.debian.org/tracker/CVE-2023-39320)

Known Issues

* Long-running Python pipelines might experience a memory leak: [28246](https://github.com/apache/beam/issues/28246).
* Python pipelines using BigQuery Storage Read API might need to pin `fastavro`
dependency to 1.8.3 or earlier on some runners that don't use Beam Docker containers: [28811](https://github.com/apache/beam/issues/28811)
* MLTransform drops identical elements in the output PCollection: for any duplicate elements, only a single element is emitted downstream. ([29600](https://github.com/apache/beam/issues/29600)).

2.50.0

Not secure
Highlights

* Spark 3.2.2 is now the default version for the Spark runner ([23804](https://github.com/apache/beam/issues/23804)).
* The Go SDK has a new default local runner, called Prism ([24789](https://github.com/apache/beam/issues/24789)).
* All Beam released container images are now [multi-arch images](https://cloud.google.com/kubernetes-engine/docs/how-to/build-multi-arch-for-arm#what_is_a_multi-arch_image) that support both x86 and ARM CPU architectures.

I/Os

* Java KafkaIO now supports picking up topics via topicPattern ([26948](https://github.com/apache/beam/pull/26948))
* Support for read from Cosmos DB Core SQL API ([23604](https://github.com/apache/beam/issues/23604))
* Upgraded to HBase 2.5.5 for HBaseIO. (Java) ([27711](https://github.com/apache/beam/issues/19554))
* Added support for GoogleAdsIO source (Java) ([27681](https://github.com/apache/beam/pull/27681)).

New Features / Improvements

* The Go SDK now requires Go 1.20 to build. ([27558](https://github.com/apache/beam/issues/27558))
* The Go SDK has a new default local runner, Prism. ([24789](https://github.com/apache/beam/issues/24789)).
* Prism is a portable runner that executes each transform independently, ensuring coders are exercised.
* At this point it supersedes the Go direct runner in functionality. The Go direct runner is now deprecated.
* See https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/runners/prism/README.md for the goals and features of Prism.
* Hugging Face Model Handler for RunInference added to Python SDK. ([26632](https://github.com/apache/beam/pull/26632))
* Hugging Face Pipelines support for RunInference added to Python SDK. ([27399](https://github.com/apache/beam/pull/27399))
* Vertex AI Model Handler for RunInference now supports private endpoints ([27696](https://github.com/apache/beam/pull/27696))
* MLTransform transform added with support for common ML pre/postprocessing operations ([26795](https://github.com/apache/beam/pull/26795))
* Upgraded the Kryo extension for the Java SDK to Kryo 5.5.0. This brings in bug fixes, performance improvements, and serialization of Java 14 records. ([27635](https://github.com/apache/beam/issues/27635))
* All Beam released container images are now [multi-arch images](https://cloud.google.com/kubernetes-engine/docs/how-to/build-multi-arch-for-arm#what_is_a_multi-arch_image) that support both x86 and ARM CPU architectures. ([27674](https://github.com/apache/beam/issues/27674)). The multi-arch container images include:
* All versions of Go, Python, Java and Typescript SDK containers.
* All versions of Flink job server containers.
* Java and Python expansion service containers.
* Transform service controller container.
* Spark3 job server container.
* Added support for batched writes to AWS SQS for improved throughput (Java, AWS 2). ([21429](https://github.com/apache/beam/issues/21429))

Breaking Changes

* Python SDK: Legacy runner support removed from Dataflow; all pipelines must use runner v2.
* Python SDK: Dataflow Runner will no longer stage the Beam SDK from PyPI in the `--staging_location` at pipeline submission. Custom container images that are not based on Beam's default image must include an Apache Beam installation. ([26996](https://github.com/apache/beam/issues/26996))

Deprecations

* The Go Direct Runner is now deprecated. It remains available to reduce migration churn.
* Tests can be set back to the direct runner by overriding TestMain: `func TestMain(m *testing.M) { ptest.MainWithDefault(m, "direct") }`
* It's recommended to fix issues seen in tests using Prism, as they can also happen on any portable runner.
* Use the generic register package for your pipeline DoFns to ensure pipelines function on portable runners, like Prism.
* Do not rely on closures or package globals for DoFn configuration; they don't function on portable runners.

Bugfixes

* Fixed a DirectRunner bug in the Python SDK where GroupByKey gets an empty PCollection and fails when the pipeline option `direct_num_workers!=1` is set. ([27373](https://github.com/apache/beam/pull/27373))
* Fixed BigQuery I/O bug when estimating size on queries that utilize row-level security ([27474](https://github.com/apache/beam/pull/27474))

Known Issues

* Long-running Python pipelines might experience a memory leak: [28246](https://github.com/apache/beam/issues/28246).
* Python Pipelines using BigQuery IO or `orjson` dependency might experience segmentation faults or get stuck: [28318](https://github.com/apache/beam/issues/28318).
* Beam Python containers rely on a version of Debian/aom that has several security vulnerabilities: [CVE-2021-30474](https://nvd.nist.gov/vuln/detail/CVE-2021-30474), [CVE-2021-30475](https://nvd.nist.gov/vuln/detail/CVE-2021-30475), [CVE-2021-30473](https://nvd.nist.gov/vuln/detail/CVE-2021-30473), [CVE-2020-36133](https://nvd.nist.gov/vuln/detail/CVE-2020-36133), [CVE-2020-36131](https://nvd.nist.gov/vuln/detail/CVE-2020-36131), [CVE-2020-36130](https://nvd.nist.gov/vuln/detail/CVE-2020-36130), and [CVE-2020-36135](https://nvd.nist.gov/vuln/detail/CVE-2020-36135)
* Python SDK's cross-language Bigtable sink mishandles records that don't have an explicit timestamp set: [28632](https://github.com/apache/beam/issues/28632). To avoid this issue, set explicit timestamps for all records before writing to Bigtable.
* Python SDK worker start-up logs that are not logged at warning level or higher, particularly PIP dependency installation logs, are suppressed. This suppression is reverted in 2.51.0.
* MLTransform drops identical elements in the output PCollection: for any duplicate elements, only a single element is emitted downstream. ([29600](https://github.com/apache/beam/issues/29600)).

2.49.0

Not secure
I/Os

* Support for Bigtable Change Streams added in Java `BigtableIO.ReadChangeStream` ([27183](https://github.com/apache/beam/issues/27183))

New Features / Improvements

* Allow prebuilding large images when using `--prebuild_sdk_container_engine=cloud_build`, like images depending on `tensorflow` or `torch` ([27023](https://github.com/apache/beam/pull/27023)).
* Disabled `pip` cache when installing packages on the workers. This reduces the size of prebuilt Python container images ([27035](https://github.com/apache/beam/pull/27035)).
* Select dedicated avro datum reader and writer (Java) ([18874](https://github.com/apache/beam/issues/18874)).
* Timer API for the Go SDK (Go) ([22737](https://github.com/apache/beam/issues/22737)).

Deprecations

* Removed Python 3.7 support. ([26447](https://github.com/apache/beam/issues/26447))

Bugfixes

* Fixed KinesisIO `NullPointerException` when a progress check is made before the reader is started (IO) ([23868](https://github.com/apache/beam/issues/23868))

Known Issues

* Long-running Python pipelines might experience a memory leak: [28246](https://github.com/apache/beam/issues/28246).

2.48.0

Not secure
Highlights

* "Experimental" annotation cleanup: the annotation and concept have been removed from Beam to avoid
the misperception of code as "not ready". Any proposed breaking changes will be subject to
case-by-case pro/con decision making (and generally avoided) rather than relying on the "Experimental"
annotation to allow them.

I/Os

* Added rename for GCS and copy for local filesystem (Go) ([25779](https://github.com/apache/beam/issues/26064)).
* Added support for enhanced fan-out in KinesisIO.Read (Java) ([19967](https://github.com/apache/beam/issues/19967)).
* This change is not compatible with Flink savepoints created by Beam 2.46.0 applications which had KinesisIO sources.
* Added textio.ReadWithFilename transform (Go) ([25812](https://github.com/apache/beam/issues/25812)).
* Added fileio.MatchContinuously transform (Go) ([26186](https://github.com/apache/beam/issues/26186)).

New Features / Improvements

* Allow passing service name for google-cloud-profiler (Python) ([26280](https://github.com/apache/beam/issues/26280)).
* Dead letter queue support added to RunInference in Python ([24209](https://github.com/apache/beam/issues/24209)).
* Support added for defining pre/postprocessing operations on the RunInference transform ([26308](https://github.com/apache/beam/issues/26308))
* Adds a Docker Compose based transform service that can be used to discover and use portable Beam transforms ([26023](https://github.com/apache/beam/pull/26023)).

Breaking Changes

* Passing a tag into MultiProcessShared is now required in the Python SDK ([26168](https://github.com/apache/beam/issues/26168)).
* CloudDebuggerOptions is removed (deprecated in Beam v2.47.0) for Dataflow runner as the Google Cloud Debugger service is [shutting down](https://cloud.google.com/debugger/docs/deprecations). (Java) ([#25959](https://github.com/apache/beam/issues/25959)).
* AWS 2 client providers (deprecated in Beam [v2.38.0](2380---2022-04-20)) are finally removed ([26681](https://github.com/apache/beam/issues/26681)).
* AWS 2 SnsIO.writeAsync (deprecated in Beam v2.37.0 due to risk of data loss) was finally removed ([26710](https://github.com/apache/beam/issues/26710)).
* AWS 2 coders (deprecated in Beam v2.43.0 when adding Schema support for AWS Sdk Pojos) are finally removed ([23315](https://github.com/apache/beam/issues/23315)).

Deprecations


Bugfixes

* Fixed the Java bootloader failing with "Too Long Args" errors due to long classpaths, by using a pathing jar. (Java) ([25582](https://github.com/apache/beam/issues/25582)).

Known Issues

* PubsubIO writes will throw *SizeLimitExceededException* for any message above 100 bytes, when used in batch (bounded) mode. (Java) ([27000](https://github.com/apache/beam/issues/27000)).
* Long-running Python pipelines might experience a memory leak: [28246](https://github.com/apache/beam/issues/28246).
* Python SDK's cross-language Bigtable sink mishandles records that don't have an explicit timestamp set: [28632](https://github.com/apache/beam/issues/28632). To avoid this issue, set explicit timestamps for all records before writing to Bigtable.

2.47.0

Not secure
Highlights

* Apache Beam adds Python 3.11 support ([23848](https://github.com/apache/beam/issues/23848)).

I/Os

* BigQuery Storage Write API is now available in Python SDK via cross-language ([21961](https://github.com/apache/beam/issues/21961)).
* Added HbaseIO support for writing RowMutations (ordered by rowkey) to Hbase (Java) ([25830](https://github.com/apache/beam/issues/25830)).
* Added fileio transforms MatchFiles, MatchAll and ReadMatches (Go) ([25779](https://github.com/apache/beam/issues/25779)).
* Add integration test for JmsIO + fix issue with multiple connections (Java) ([25887](https://github.com/apache/beam/issues/25887)).

New Features / Improvements

* The Flink runner now supports Flink 1.16.x ([25046](https://github.com/apache/beam/issues/25046)).
* Schema'd PTransforms can now be directly applied to Beam dataframes just like PCollections.
(Note that when doing multiple operations, it may be more efficient to explicitly chain the operations
like `df | (Transform1 | Transform2 | ...)` to avoid excessive conversions.)
* The Go SDK adds new transforms periodic.Impulse and periodic.Sequence that extend support
for slowly updating side input patterns. ([23106](https://github.com/apache/beam/issues/23106))
* Several Google client libraries in Python SDK dependency chain were updated to latest available major versions. ([24599](https://github.com/apache/beam/pull/24599))

Breaking Changes

* If a main session fails to load, the pipeline will now fail at worker startup. ([25401](https://github.com/apache/beam/issues/25401)).
* Python pipeline options will now ignore unparsed command line flags prefixed with a single dash. ([25943](https://github.com/apache/beam/issues/25943)).
* The SmallestPerKey combiner now requires keyword-only arguments for specifying optional parameters, such as `key` and `reverse`. ([25888](https://github.com/apache/beam/issues/25888)).

Deprecations

* Cloud Debugger support and its pipeline options are deprecated and will be removed in the next Beam version,
in response to the Google Cloud Debugger service [turning down](https://cloud.google.com/debugger/docs/deprecations). (Java) ([#25959](https://github.com/apache/beam/issues/25959)).

Bugfixes

* BigQuery sink in STORAGE_WRITE_API mode in batch pipelines could result in data consistency issues during the handling of other unrelated transient errors for Beam SDKs 2.35.0 - 2.46.0 (inclusive). For more details see: https://github.com/apache/beam/issues/26521

Known Issues

* The google-cloud-profiler dependency was accidentally removed from Beam's Python Docker
Image [26998](https://github.com/apache/beam/issues/26698). [Dataflow Docker images](https://cloud.google.com/dataflow/docs/concepts/sdk-worker-dependencies) still preinstall this dependency.
* Long-running Python pipelines might experience a memory leak: [28246](https://github.com/apache/beam/issues/28246).


© 2024 Safety CLI Cybersecurity Inc. All Rights Reserved.