Apache Beam


2.65.0

Bugfixes

* Fixed a bug where Beam rows read from a cross-language transform (for example, ReadFromJdbc) incorrectly decoded negative 32-bit integers as large positive integers ([34089](https://github.com/apache/beam/issues/34089)).
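The failure mode in that row-decoding fix is the classic signed-vs-unsigned mix-up. A minimal illustration in plain Python using `struct` (independent of Beam's actual row coders):

```python
import struct

# Encode -5 as a big-endian signed 32-bit integer, as a row coder might.
raw = struct.pack(">i", -5)

# Decoding the same bytes as *unsigned* yields a huge positive value --
# the same class of bug as the cross-language row decoding issue.
as_unsigned = struct.unpack(">I", raw)[0]
as_signed = struct.unpack(">i", raw)[0]

print(as_unsigned)  # 4294967291 (i.e. 2**32 - 5)
print(as_signed)    # -5
```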


2.64.0

Highlights

* Managed API for [Java](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/managed/Managed.html) and [Python](https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.managed.html#module-apache_beam.transforms.managed) supports [key I/O connectors](https://beam.apache.org/documentation/io/connectors/) Iceberg, Kafka, and BigQuery.

I/Os

* [Java] Use API compatible with both com.google.cloud.bigdataoss:util 2.x and 3.x in BatchLoads ([34105](https://github.com/apache/beam/pull/34105))
* [IcebergIO] Added new CDC source for batch and streaming, available as `Managed.ICEBERG_CDC` ([33504](https://github.com/apache/beam/pull/33504))
* [IcebergIO] Address edge case where bundle retry following a successful data commit results in data duplication ([34264](https://github.com/apache/beam/pull/34264))

New Features / Improvements

* [Python] Support custom coders in Reshuffle ([29908](https://github.com/apache/beam/issues/29908), [#33356](https://github.com/apache/beam/issues/33356)).
* [Java] Upgrade SLF4J to 2.0.16. Update default Spark version to 3.5.0. ([33574](https://github.com/apache/beam/pull/33574))
* [Java] Support for `--add-modules` JVM option is added through a new pipeline option `JdkAddRootModules`. This allows extending the module graph with optional modules such as SDK incubator modules. Sample usage: `<pipeline invocation> --jdkAddRootModules=jdk.incubator.vector` ([30281](https://github.com/apache/beam/issues/30281)).
* Prism now supports event time triggers for most common cases. ([31438](https://github.com/apache/beam/issues/31438))
  * Prism does not yet support triggered side inputs, or triggers on merging windows (such as session windows).

Breaking Changes

* [Python] Reshuffle now correctly respects user-specified type hints, fixing a previous bug where it might use FastPrimitivesCoder wrongly. This change could break pipelines with incorrect type hints in Reshuffle. If you have issues after upgrading, temporarily set update_compatibility_version to a previous Beam version to use the old behavior. The recommended solution is to fix the type hints in your code. ([33932](https://github.com/apache/beam/pull/33932))
* [Java] SparkReceiver 2 has been moved to SparkReceiver 3 that supports Spark 3.x. ([33574](https://github.com/apache/beam/pull/33574))
* [Python] Correct parsing of `collections.abc.Sequence` type hints was added, which can lead to pipelines failing type hint checks that were previously passing erroneously. These issues will be most commonly seen trying to consume a PCollection with a `Sequence` type hint after a GroupByKey or a CoGroupByKey. ([33999](https://github.com/apache/beam/pull/33999)).
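The `Sequence` hint point above comes down to what the hint actually promises. A plain-Python check (no Beam required) of why lazily grouped values fail it:

```python
from collections.abc import Sequence

# A Sequence promises random access and len(); lists qualify.
assert isinstance([1, 2, 3], Sequence)

# A plain iterator -- like the lazily grouped values produced after a
# GroupByKey -- does not, which is why a Sequence type hint on grouped
# output can now fail checks that previously passed erroneously.
assert not isinstance(iter([1, 2, 3]), Sequence)
```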

Bugfixes

* (Python) Fixed an issue causing pipelines to occasionally get stuck, affecting Python 3.11 users ([33966](https://github.com/apache/beam/issues/33966)).
* (Java) Fixed TIME field encodings for BigQuery Storage API writes on GenericRecords ([34059](https://github.com/apache/beam/pull/34059)).
* (Java) Fixed a race condition in JdbcIO which could cause hangs trying to acquire a connection ([34058](https://github.com/apache/beam/pull/34058)).
* (Java) Fix BigQuery Storage Write compatibility with Avro 1.8 ([34281](https://github.com/apache/beam/pull/34281)).
* Fixed checkpoint recovery and streaming behavior in Spark Classic and Portable runner's Flatten transform by replacing queueStream with SingleEmitInputDStream ([34080](https://github.com/apache/beam/pull/34080), [#18144](https://github.com/apache/beam/issues/18144), [#20426](https://github.com/apache/beam/issues/20426))
* (Java) Fixed Read caching of UnboundedReader objects to effectively cache across multiple DoFns and avoid checkpointing unstarted readers ([34146](https://github.com/apache/beam/pull/34146), [#33901](https://github.com/apache/beam/pull/33901)).

2.63.0

I/Os

* Support gcs-connector 3.x+ in GcsUtil ([33368](https://github.com/apache/beam/pull/33368))
* Introduced `--groupFilesFileLoad` pipeline option to mitigate side-input related issues in BigQueryIO
batch FILE_LOAD on certain runners (including Dataflow Runner V2) (Java) ([33587](https://github.com/apache/beam/pull/33587)).

New Features / Improvements

* Add BigQuery vector/embedding ingestion and enrichment components to apache_beam.ml.rag (Python) ([33413](https://github.com/apache/beam/pull/33413)).
* Upgraded to protobuf 4 (Java) ([33192](https://github.com/apache/beam/issues/33192)).
* [GCSIO] Added retry logic to each batch method of the GCS IO (Python) ([33539](https://github.com/apache/beam/pull/33539))
* [GCSIO] Enable recursive deletion for GCSFileSystem Paths (Python) ([33611](https://github.com/apache/beam/pull/33611)).
* External, Process based Worker Pool support added to the Go SDK container. ([33572](https://github.com/apache/beam/pull/33572))
  * This is used to enable sidecar containers to run SDK workers for some runners.
  * See https://beam.apache.org/documentation/runtime/sdk-harness-config/ for details.
* Support the Process Environment for execution in the Go SDK. ([33651](https://github.com/apache/beam/pull/33651))
* Prism
  * Prism now uses the same single port for both pipeline submission and execution on workers. Requests are differentiated by worker-id. ([33438](https://github.com/apache/beam/pull/33438))
    * This avoids port starvation and provides clarity on port use when running Prism in non-local environments.
  * Support for RequiresTimeSortedInputs added. ([33513](https://github.com/apache/beam/issues/33513))
  * Initial support for AllowedLateness added. ([33542](https://github.com/apache/beam/pull/33542))
  * The Go SDK's in-process Prism runner (AKA the Go SDK default runner) now supports non-loopback mode environment types. ([33572](https://github.com/apache/beam/pull/33572))
  * Support the Process Environment for execution in Prism ([33651](https://github.com/apache/beam/pull/33651))
  * Support the AnyOf Environment for execution in Prism ([33705](https://github.com/apache/beam/pull/33705))
    * This improves support for developing Xlang pipelines, when using a compatible cross language service.
* Partitions are now configurable for the DaskRunner in the Python SDK ([33805](https://github.com/apache/beam/pull/33805)).
* [Dataflow Streaming] Enable Windmill GetWork Response Batching by default ([33847](https://github.com/apache/beam/pull/33847)).
  * With this change user workers will request batched GetWork responses from backend and backend will send multiple WorkItems in the same response proto.
  * The feature can be disabled by passing `--windmillRequestBatchedGetWorkResponse=false`.
* Added support for staging arbitrary files via the `--files_to_stage` flag (Python) ([34208](https://github.com/apache/beam/pull/34208))

Breaking Changes

* AWS V1 I/Os have been removed (Java). As part of this, x-lang Python Kinesis I/O has been updated to consume the V2 IO and it also no longer supports setting producer_properties ([33430](https://github.com/apache/beam/issues/33430)).
* Upgraded to protobuf 4 (Java) ([33192](https://github.com/apache/beam/issues/33192)), but forced Debezium IO to use protobuf 3 ([#33541](https://github.com/apache/beam/issues/33541)) because Debezium clients are not protobuf 4 compatible. This may cause conflicts when using clients which are only compatible with protobuf 4.
* Minimum Go version for Beam Go updated to 1.22.10 ([33609](https://github.com/apache/beam/pull/33609))

Bugfixes

* Fix data loss issues when reading gzipped files with TextIO (Python) ([18390](https://github.com/apache/beam/issues/18390), [#31040](https://github.com/apache/beam/issues/31040)).
* [BigQueryIO] Fixed an issue where Storage Write API sometimes doesn't pick up auto-schema updates ([33231](https://github.com/apache/beam/pull/33231))
* Prism
  * Fixed an edge case where Bundle Finalization might not become enabled. ([33493](https://github.com/apache/beam/issues/33493)).
  * Fixed session window aggregation, which wasn't being performed per-key. ([33542](https://github.com/apache/beam/issues/33542)).
* [Dataflow Streaming Appliance] Fixed commits failing with KeyCommitTooLargeException when a key outputs >180MB of results ([33588](https://github.com/apache/beam/issues/33588)).
* Fixed a Dataflow template creation issue that ignores template file creation errors (Java) ([33636](https://github.com/apache/beam/issues/33636))
* Correctly documented Pane Encodings in the portability protocols ([33840](https://github.com/apache/beam/issues/33840)).
* Fixed the user mailing list address ([26013](https://github.com/apache/beam/issues/26013)).
* Fixed the contributing prerequisites link ([33903](https://github.com/apache/beam/issues/33903)).

2.62.0

I/Os

* gcs-connector config options can be set via GcsOptions (Java) ([32769](https://github.com/apache/beam/pull/32769)).
* [Managed Iceberg] Support partitioning by time (year, month, day, hour) for types `date`, `time`, `timestamp`, and `timestamp(tz)` ([32939](https://github.com/apache/beam/pull/32939))
* Upgraded the default version of Hadoop dependencies to 3.4.1. Hadoop 2.10.2 is still supported (Java) ([33011](https://github.com/apache/beam/issues/33011)).
* [BigQueryIO] Create managed BigLake tables dynamically ([33125](https://github.com/apache/beam/pull/33125))

New Features / Improvements

* Added support for stateful processing in Spark Runner for streaming pipelines. Timer functionality is not yet supported and will be implemented in a future release ([33237](https://github.com/apache/beam/issues/33237)).
* The datetime module is now available for use in Jinja templatization for YAML.
* Improved batch performance of SparkRunner's GroupByKey ([20943](https://github.com/apache/beam/pull/20943)).
* Support OnWindowExpiration in Prism ([32211](https://github.com/apache/beam/issues/32211)).
  * This enables initial Java GroupIntoBatches support.
* Support OrderedListState in Prism ([32929](https://github.com/apache/beam/issues/32929)).
* Add apache_beam.ml.rag package with RAG types, base chunking, LangChain chunking and HuggingFace embedding components (Python) ([33364](https://github.com/apache/beam/pull/33364)).

Breaking Changes

* Upgraded ZetaSQL to 2024.11.1 ([32902](https://github.com/apache/beam/pull/32902)). Java11+ is now needed if Beam's ZetaSQL component is used.

Bugfixes

* Fixed EventTimeTimer ordering in Prism. ([32222](https://github.com/apache/beam/issues/32222)).
* [Managed Iceberg] Fixed a bug where DataFile metadata was assigned incorrect partition values ([33549](https://github.com/apache/beam/pull/33549)).

Security Fixes

* Fixed [CVE-2024-47561](https://www.cve.org/CVERecord?id=CVE-2024-47561) (Java) by upgrading the Avro version to 1.11.4.

Known Issues

* [Python] If you are using the official Apache Beam Python containers for version 2.62.0, be aware that they include NumPy version 1.26.4. It is strongly recommended that you explicitly specify numpy==1.26.4 in your project's dependency list. ([33639](https://github.com/apache/beam/issues/33639)).
* [Dataflow Streaming Appliance] Commits fail with KeyCommitTooLargeException when a key outputs >180MB of results. This bug affects versions 2.60.0 through 2.62.0; the fix will be released in 2.63.0 ([33588](https://github.com/apache/beam/issues/33588)).
  * To resolve this issue, downgrade to 2.59.0, upgrade to 2.63.0, or enable [Streaming Engine](https://cloud.google.com/dataflow/docs/streaming-engine#use).
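For the NumPy note above, a minimal pin in a project's `requirements.txt` might look like the following (a sketch; the file name and surrounding entries are your project's own):

```text
apache-beam==2.62.0
numpy==1.26.4
```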

2.61.0

Highlights

* [Python] Introduce Managed Transforms API ([31495](https://github.com/apache/beam/pull/31495))
* Flink 1.19 support added ([32648](https://github.com/apache/beam/pull/32648))

I/Os

* [Managed Iceberg] Support creating tables if needed ([32686](https://github.com/apache/beam/pull/32686))
* [Managed Iceberg] Now available in Python SDK ([31495](https://github.com/apache/beam/pull/31495))
* [Managed Iceberg] Add support for TIMESTAMP, TIME, and DATE types ([32688](https://github.com/apache/beam/pull/32688))
* BigQuery CDC writes are now available in Python SDK, only supported when using StorageWrite API at least once mode ([32527](https://github.com/apache/beam/issues/32527))
* [Managed Iceberg] Allow updating table partition specs during pipeline runtime ([32879](https://github.com/apache/beam/pull/32879))
* Added BigQueryIO as a Managed IO ([31486](https://github.com/apache/beam/pull/31486))
* Support for writing to [Solace message queues](https://solace.com/) (`SolaceIO.Write`) added (Java) ([#31905](https://github.com/apache/beam/issues/31905)).

New Features / Improvements

* Added support for read with metadata in MqttIO (Java) ([32195](https://github.com/apache/beam/issues/32195))
* Added support for processing events which use a global sequence to "ordered" extension (Java) ([32540](https://github.com/apache/beam/pull/32540))
* Added new meta-transforms FlattenWith and Tee that allow one to introduce branching without breaking the linear/chaining style of pipeline construction.
* Use Prism as a fallback to the Python Portable runner when running a pipeline with the Python Direct runner ([32876](https://github.com/apache/beam/pull/32876))

Deprecations

* Removed support for Flink 1.15 and 1.16
* Removed support for Python 3.8

Bugfixes

* (Java) Fixed tearDown not invoked when DoFn throws on Portable Runners ([18592](https://github.com/apache/beam/issues/18592), [#31381](https://github.com/apache/beam/issues/31381)).
* (Java) Fixed protobuf error with MapState.remove() in Dataflow Streaming Java Legacy Runner without Streaming Engine ([32892](https://github.com/apache/beam/issues/32892)).
* Added a flag to support conditionally disabling auto-commit in JdbcIO ReadFn ([31111](https://github.com/apache/beam/issues/31111))
* (Python) Fixed BigQuery Enrichment bug that can lead to multiple conditions returning duplicate rows, batching returning incorrect results and conditions not scoped by row during batching ([32780](https://github.com/apache/beam/pull/32780)).

Known Issues

* [Managed Iceberg] DataFile metadata is assigned incorrect partition values ([33497](https://github.com/apache/beam/issues/33497)).
  * Fixed in 2.62.0.
* [Python] If you are using the official Apache Beam Python containers for version 2.61.0, be aware that they include NumPy version 1.26.4. It is strongly recommended that you explicitly specify numpy==1.26.4 in your project's dependency list. ([33639](https://github.com/apache/beam/issues/33639)).
* [Dataflow Streaming Appliance] Commits fail with KeyCommitTooLargeException when a key outputs >180MB of results. This bug affects versions 2.60.0 through 2.62.0; the fix will be released in 2.63.0 ([33588](https://github.com/apache/beam/issues/33588)).
  * To resolve this issue, downgrade to 2.59.0, upgrade to 2.63.0, or enable [Streaming Engine](https://cloud.google.com/dataflow/docs/streaming-engine#use).

2.60.0

Highlights

* Added support for using vLLM in the RunInference transform (Python) ([32528](https://github.com/apache/beam/issues/32528))
* [Managed Iceberg] Added support for streaming writes ([32451](https://github.com/apache/beam/pull/32451))
* [Managed Iceberg] Added auto-sharding for streaming writes ([32612](https://github.com/apache/beam/pull/32612))
* [Managed Iceberg] Added support for writing to dynamic destinations ([32565](https://github.com/apache/beam/pull/32565))

I/Os

* PubsubIO can validate that the Pub/Sub topic exists before running the Read/Write pipeline (Java) ([32465](https://github.com/apache/beam/pull/32465))

New Features / Improvements

* Dataflow worker can install packages from Google Artifact Registry Python repositories (Python) ([32123](https://github.com/apache/beam/issues/32123)).
* Added support for Zstd codec in SerializableAvroCodecFactory (Java) ([32349](https://github.com/apache/beam/issues/32349))
* Added support for using vLLM in the RunInference transform (Python) ([32528](https://github.com/apache/beam/issues/32528))
* Prism release binaries and container bootloaders are now being built with the latest Go 1.23 patch. ([32575](https://github.com/apache/beam/pull/32575))
* Prism
  * Prism now supports Bundle Finalization. ([32425](https://github.com/apache/beam/pull/32425))
* Significantly improved performance of Kafka IO reads that enable [commitOffsetsInFinalize](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/kafka/KafkaIO.Read.html#commitOffsetsInFinalize--) by removing the data reshuffle from SDF implementation. ([31682](https://github.com/apache/beam/pull/31682)).
* Added support for dynamic writing in MqttIO (Java) ([19376](https://github.com/apache/beam/issues/19376))
* Optimized Spark Runner parDo transform evaluator (Java) ([32537](https://github.com/apache/beam/issues/32537))
* [Managed Iceberg] More efficient manifest file writes/commits ([32666](https://github.com/apache/beam/issues/32666))

Breaking Changes

* In Python, assert_that now throws if it is not in a pipeline context instead of silently succeeding ([30771](https://github.com/apache/beam/pull/30771))
* In Python and YAML, ReadFromJson now overrides the dtype from None to
an explicit False. Most notably, string values like `"123"` are preserved
as strings rather than silently coerced (and possibly truncated) to numeric
values. To retain the old behavior, pass `dtype=True` (or any other value
accepted by `pandas.read_json`).
* Users of KafkaIO Read transform that enable [commitOffsetsInFinalize](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/kafka/KafkaIO.Read.html#commitOffsetsInFinalize--) might encounter pipeline graph compatibility issues when updating the pipeline. To mitigate, set the `updateCompatibilityVersion` option to the SDK version used for the original pipeline, for example `--updateCompatibilityVersion=2.58.1`.
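The ReadFromJson dtype change above mirrors the `dtype` parameter of `pandas.read_json`; the string-preservation difference can be seen directly in pandas (no Beam required):

```python
import io

import pandas as pd

payload = '[{"id": "00123"}]'

# dtype=False (the new default behavior) keeps string values as strings.
kept = pd.read_json(io.StringIO(payload), dtype=False)

# dtype=True (the old inferred behavior) coerces "00123" to the
# integer 123, silently dropping the leading zeros.
coerced = pd.read_json(io.StringIO(payload), dtype=True)

print(kept["id"].iloc[0])     # "00123"
print(coerced["id"].iloc[0])  # 123
```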

Deprecations

* Python 3.8 is reaching EOL and support is being removed in Beam 2.61.0. The 2.60.0 release will warn users
when running on 3.8. ([31192](https://github.com/apache/beam/issues/31192))

Bugfixes

* (Java) Fixed custom delimiter issues in TextIO ([32249](https://github.com/apache/beam/issues/32249), [#32251](https://github.com/apache/beam/issues/32251)).
* (Java, Python, Go) Fixed PeriodicSequence backlog bytes reporting, which was preventing Dataflow Runner autoscaling from functioning properly ([32506](https://github.com/apache/beam/issues/32506)).
* (Java) Fixed improper decoding of rows with schemas containing nullable fields when encoded with a schema with equal encoding positions but modified field order. ([32388](https://github.com/apache/beam/issues/32388)).
* (Java) Skip close on bundles in BigtableIO.Read ([32661](https://github.com/apache/beam/pull/32661), [#32759](https://github.com/apache/beam/pull/32759)).

Known Issues

* BigQuery Enrichment (Python): The following issues are present when using the BigQuery enrichment transform ([32780](https://github.com/apache/beam/pull/32780)):
  * Duplicate Rows: Multiple conditions may be applied incorrectly, leading to the duplication of rows in the output.
  * Incorrect Results with Batched Requests: Conditions may not be correctly scoped to individual rows within the batch, potentially causing inaccurate results.
  * Fixed in 2.61.0.
* [Managed Iceberg] DataFile metadata is assigned incorrect partition values ([33497](https://github.com/apache/beam/issues/33497)).
  * Fixed in 2.62.0.
* [Dataflow Streaming Appliance] Commits fail with KeyCommitTooLargeException when a key outputs >180MB of results. This bug affects versions 2.60.0 through 2.62.0; the fix will be released in 2.63.0 ([33588](https://github.com/apache/beam/issues/33588)).
  * To resolve this issue, downgrade to 2.59.0, upgrade to 2.63.0, or enable [Streaming Engine](https://cloud.google.com/dataflow/docs/streaming-engine#use).
