apache-beam

Latest version: v2.56.0

2.46.0

Highlights

* Java SDK containers migrated to [Eclipse Temurin](https://hub.docker.com/_/eclipse-temurin)
as a base. This change migrates away from the deprecated [OpenJDK](https://hub.docker.com/_/openjdk)
container. Eclipse Temurin is currently based upon Ubuntu 22.04 while the OpenJDK
container was based upon Debian 11.
* RunInference PTransform will accept model paths as SideInputs in Python SDK. ([24042](https://github.com/apache/beam/issues/24042))
* RunInference supports ONNX runtime in Python SDK ([22972](https://github.com/apache/beam/issues/22972))
* Tensorflow Model Handler for RunInference in Python SDK ([25366](https://github.com/apache/beam/issues/25366))
* Java SDK modules migrated to use `:sdks:java:extensions:avro` ([24748](https://github.com/apache/beam/issues/24748))

I/Os

* Added a retry policy for failed publications in JmsIO (Java) ([24971](https://github.com/apache/beam/issues/24971)).
* Support for `LZMA` compression/decompression of text files added to the Python SDK ([25316](https://github.com/apache/beam/issues/25316)).
* Added ReadFrom/WriteTo Csv/Json as top-level transforms to the Python SDK (this and the LZMA support are sketched after this list).
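
A minimal sketch of the two Python additions above, assuming `ReadFromCsv`/`WriteToCsv` are exposed under `apache_beam.io` and that `CompressionTypes` gained an `LZMA` member per [25316](https://github.com/apache/beam/issues/25316); paths are hypothetical:

```python
import apache_beam as beam
from apache_beam.io.filesystem import CompressionTypes

with beam.Pipeline() as p:
    # New top-level CSV transforms (paths are placeholders).
    rows = p | "ReadCsv" >> beam.io.ReadFromCsv("gs://my-bucket/in*.csv")
    rows | "WriteCsv" >> beam.io.WriteToCsv("gs://my-bucket/out")

    # LZMA-compressed text files can now be read directly.
    lines = p | "ReadLzma" >> beam.io.ReadFromText(
        "gs://my-bucket/data.txt.lzma",
        compression_type=CompressionTypes.LZMA)
```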

New Features / Improvements

* Add UDF metrics support for Samza portable mode.
* Option for SparkRunner to avoid the need of SDF output to fit in memory ([23852](https://github.com/apache/beam/issues/23852)).
This helps e.g. with ParquetIO reads. Turn the feature on by adding experiment `use_bounded_concurrent_output_for_sdf`.
* Add `WatchFilePattern` transform, which can be used as a side input to the RunInference PTransform to watch for model updates using a file pattern. ([24042](https://github.com/apache/beam/issues/24042))
* Add support for loading TorchScript models with `PytorchModelHandler`. The TorchScript model path can be
passed to PytorchModelHandler using `torch_script_model_path=<path_to_model>` (see the sketch after this list). ([25321](https://github.com/apache/beam/pull/25321))
* The Go SDK now requires Go 1.19 to build. ([25545](https://github.com/apache/beam/pull/25545))
* The Go SDK now has an initial native Go implementation of a portable Beam Runner called Prism. ([24789](https://github.com/apache/beam/pull/24789))
* For more details and current state see https://github.com/apache/beam/tree/master/sdks/go/pkg/beam/runners/prism.
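
A hedged sketch combining the TorchScript and `WatchFilePattern` items above. The `model_metadata_pcoll` parameter name, the `WatchFilePattern` arguments, and its import location are assumptions based on the linked issues, and the paths are hypothetical:

```python
import apache_beam as beam
from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.pytorch_inference import PytorchModelHandlerTensor
from apache_beam.ml.inference.utils import WatchFilePattern  # assumed location

# Load a TorchScript model directly, per the changelog entry above.
model_handler = PytorchModelHandlerTensor(
    torch_script_model_path="gs://my-bucket/model.pt")

with beam.Pipeline() as p:
    # Assumed usage: emit model updates whenever new files match the pattern,
    # then feed them into RunInference as a side input.
    model_updates = p | WatchFilePattern(
        file_pattern="gs://my-bucket/models/*.pt", interval=600)
    # examples = p | ...  # a PCollection of input tensors
    # predictions = examples | RunInference(
    #     model_handler, model_metadata_pcoll=model_updates)
```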

Breaking Changes

* The deprecated SparkRunner for Spark 2 (see 2.41.0 below) was removed ([25263](https://github.com/apache/beam/pull/25263)).
* Python's BatchElements performs more aggressive batching in some cases,
capping at 10-second rather than 1-second batches by default and excluding
fixed cost from this computation to better handle cases where the fixed cost
is larger than a single second. To get the old behavior, pass
`target_batch_duration_secs_including_fixed_cost=1` to BatchElements (see the sketch after this list).
* Dataflow runner enables sibling SDK protocol for Python pipelines using custom containers on Beam 2.46.0 and newer SDKs.
If your Python pipeline starts to stall after you switch to 2.46.0 and you use a custom container, please verify
that your custom container does not include artifacts from older Beam SDK releases. In particular, check in your `Dockerfile`
that the Beam container entrypoint and/or Beam base image version match the Beam SDK version used at job submission.
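
A minimal sketch of restoring the pre-2.46.0 batching behavior described above; the parameter name is taken directly from the changelog entry:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    batches = (
        p
        | beam.Create(range(100))
        # Restore the old 1-second cap that includes fixed cost.
        | beam.BatchElements(
            target_batch_duration_secs_including_fixed_cost=1))
```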

Deprecations

* Avro-related classes are deprecated in the module `beam-sdks-java-core` and will eventually be removed. Please migrate to the new module `beam-sdks-java-extensions-avro` by importing the classes from the `org.apache.beam.sdk.extensions.avro` package.
For the sake of migration simplicity, the relative package path and the whole class hierarchy of the Avro-related classes in the new module are preserved as they were before.
For example, import the `org.apache.beam.sdk.extensions.avro.coders.AvroCoder` class instead of `org.apache.beam.sdk.coders.AvroCoder`. ([24749](https://github.com/apache/beam/issues/24749)).

2.45.0

I/Os

* MongoDB IO connector added (Go) ([24575](https://github.com/apache/beam/issues/24575)).

New Features / Improvements

* RunInference Wrapper with Sklearn Model Handler support added in Go SDK ([24497](https://github.com/apache/beam/issues/23382)).
* Added an override of allowed TLS algorithms (Java), now maintaining the disabled/legacy algorithms
present in 2.43.0 (up to 1.8.0_342, 11.0.16, 17.0.2 for the respective Java versions). This is accompanied
by an explicit re-enabling of TLSv1 and TLSv1.1 for Java 8 and Java 11.
* Add UDF metrics support for Samza portable mode.

Breaking Changes

* Portable Java pipelines, Go pipelines, Python streaming pipelines, and portable Python batch
pipelines on Dataflow are required to use Runner V2. The `disable_runner_v2`,
`disable_runner_v2_until_2023`, `disable_prime_runner_v2` experiments will raise an error during
pipeline construction. You can no longer specify the Dataflow worker jar override. Note that
non-portable Java jobs and non-portable Python batch jobs are not impacted. ([24515](https://github.com/apache/beam/issues/24515)).
* Beam now requires `pyarrow>=3` and `pandas>=1.4.3` since older versions are not compatible with `numpy==1.24.0`.

Bugfixes

* Fixed a Cassandra syntax error when a user-defined query has no WHERE clause in it (Java) ([24829](https://github.com/apache/beam/issues/24829)).
* Fixed JDBC connection failures (Java) during handshake due to the deprecated TLSv1(.1) protocol for the JDK. ([24623](https://github.com/apache/beam/issues/24623))
* Fixed a bug where Python BigQuery batch load writes could truncate valid data when the write disposition is set to WRITE_TRUNCATE and the incoming data is large (Python) ([24535](https://github.com/apache/beam/issues/24535)).

2.44.0

I/Os

* Support for Bigtable sink (Write and WriteBatch) added (Go) ([23324](https://github.com/apache/beam/issues/23324)).
* S3 implementation of the Beam filesystem (Go) ([23991](https://github.com/apache/beam/issues/23991)).
* Support for SingleStoreDB source and sink added (Java) ([22617](https://github.com/apache/beam/issues/22617)).
* Added support for DefaultAzureCredential authentication in Azure Filesystem (Python) ([24210](https://github.com/apache/beam/issues/24210)).

New Features / Improvements

* Beam now provides a portable "runner" that can render pipeline graphs with
graphviz. See `python -m apache_beam.runners.render --help` for more details.
* Local packages can now be used as dependencies in the requirements.txt file, rather
than requiring them to be passed separately via the `--extra_package` option
(Python) ([23684](https://github.com/apache/beam/pull/23684)); see the sketch after this list.
* Pipeline Resource Hints now supported via `--resource_hints` flag (Go) ([23990](https://github.com/apache/beam/pull/23990)).
* Make Python SDK containers reusable on portable runners by installing dependencies to temporary venvs ([BEAM-12792](https://issues.apache.org/jira/browse/BEAM-12792), [#16658](https://github.com/apache/beam/pull/16658)).
* RunInference model handlers now support the specification of a custom inference function in Python ([22572](https://github.com/apache/beam/issues/22572)).
* Support for `map_windows` urn added to Go SDK ([24307](https://github.com/apache/beam/pull/24307)).
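
A short sketch of the requirements.txt change above. The `--requirements_file` flag is existing Beam behavior; the file contents and package names are hypothetical:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# requirements.txt may now reference a local package directly, e.g.:
#   ./libs/my_local_pkg
#   requests==2.31.0
options = PipelineOptions(["--requirements_file=requirements.txt"])
with beam.Pipeline(options=options) as p:
    p | beam.Create([1, 2, 3])  # placeholder pipeline
```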

Breaking Changes

* `ParquetIO.withSplit` was removed since splittable reading has been the default behavior since 2.35.0. The effect of
this change is to drop support for non-splittable reading (Java) ([23832](https://github.com/apache/beam/issues/23832)).
* `beam-sdks-java-extensions-google-cloud-platform-core` is no longer a
dependency of the Java SDK Harness. Some users of a portable runner (such as Dataflow Runner v2)
may have an undeclared dependency on this package (for example using GCS with
TextIO) and will now need to declare the dependency.
* `beam-sdks-java-core` is no longer a dependency of the Java SDK Harness. Users of a portable
runner (such as Dataflow Runner v2) will need to provide this package and its dependencies.
* Slices now use the Beam Iterable Coder. This enables cross-language use, but breaks pipeline updates
if a Slice type is used as a PCollection element or State API element. (Go) ([24339](https://github.com/apache/beam/issues/24339))
* If you activated a virtual environment in your custom container image, this environment might no longer be activated, since a new environment will be created (see the note about [BEAM-12792](https://issues.apache.org/jira/browse/BEAM-12792) above).
To work around this, install dependencies into the default (global) Python environment. When using poetry, you may need to run `poetry config virtualenvs.create false` before installing dependencies; see an example in [25085](https://github.com/apache/beam/issues/25085).
If you were negatively impacted by this change and cannot find a workaround, feel free to chime in on [16658](https://github.com/apache/beam/pull/16658).
To disable this behavior, you can upgrade to Beam 2.48.0 and add
`ENV RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT=1` to your Dockerfile.

Bugfixes

* Fixed JmsIO acknowledgment issue (Java) ([20814](https://github.com/apache/beam/issues/20814))
* Fixed a bug where Beam SQL CalciteUtils (Java) and cross-language JdbcIO (Python) did not support JDBC CHAR/VARCHAR and BINARY/VARBINARY logical types ([23747](https://github.com/apache/beam/issues/23747), [#23526](https://github.com/apache/beam/issues/23526)).
* Ensure iterated and emitted types used with the generic register package are registered with the type and schema registries. (Go) ([23889](https://github.com/apache/beam/pull/23889))

2.43.0

Highlights

* Python 3.10 support in Apache Beam ([21458](https://github.com/apache/beam/issues/21458)).
* An initial implementation of a runner that allows us to run Beam pipelines on Dask. Try it out and give us feedback! (Python) ([18962](https://github.com/apache/beam/issues/18962)).

I/Os

* Decreased TextSource CPU utilization by 2.3x (Java) ([23193](https://github.com/apache/beam/issues/23193)).
* Fixed bug when using SpannerIO with RuntimeValueProvider options (Java) ([22146](https://github.com/apache/beam/issues/22146)).
* Fixed a unicode rendering issue in WriteToBigQuery ([22312](https://github.com/apache/beam/issues/22312)).
* Removed obsolete variants of BigQuery Read and Write, always using the Beam-native variant
([23564](https://github.com/apache/beam/issues/23564) and [#23559](https://github.com/apache/beam/issues/23559)).
* Bumped google-cloud-spanner dependency version to 3.x for Python SDK ([21198](https://github.com/apache/beam/issues/21198)).

New Features / Improvements

* Dataframe wrapper added in Go SDK via Cross-Language (with automatic expansion service). (Go) ([23384](https://github.com/apache/beam/issues/23384)).
* Name all Java threads to aid in debugging ([23049](https://github.com/apache/beam/issues/23049)).
* An initial implementation of a runner that allows us to run Beam pipelines on Dask. (Python) ([18962](https://github.com/apache/beam/issues/18962)).
* Allow configuring GCP OAuth scopes via pipeline options. This unblocks usages of Beam IOs that require additional scopes.
For example, this feature makes it possible to access Google Drive backed tables in BigQuery ([23290](https://github.com/apache/beam/issues/23290)).
* Added an example of using Python RunInference from Java ([23619](https://github.com/apache/beam/pull/23619)).
* Data can now be read from BigQuery and directly plumbed into a DeferredDataframe in the Dataframe API. Users no longer have to re-specify the schema in this case (sketched below) ([22907](https://github.com/apache/beam/pull/22907)).
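
A hedged sketch of the BigQuery-to-DataFrame path above, assuming the `read_gbq` entry point in `apache_beam.dataframe.io`; the table and project names are hypothetical:

```python
import apache_beam as beam
from apache_beam.dataframe.io import read_gbq

with beam.Pipeline() as p:
    # The deferred DataFrame picks up the table's schema automatically;
    # no need to re-specify it.
    df = p | read_gbq(table="my_dataset.my_table", project_id="my-project")
```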

Breaking Changes

* The CoGroupByKey transform in the Python SDK has changed its output typehint. The typehint component representing grouped values changed from List to Iterable,
which more accurately reflects the nature of the arbitrarily large output collection ([21556](https://github.com/apache/beam/issues/21556)). Beam users may see an error on transforms downstream from CoGroupByKey. Users must change methods expecting a List to expect an Iterable going forward (see the sketch after this list). See [document](https://docs.google.com/document/d/1RIzm8-g-0CyVsPb6yasjwokJQFoKHG4NjRUcKHKINu0) for information and fixes.
* The PortableRunner for Spark assumes Spark 3 as default Spark major version unless configured otherwise using `--spark_version`.
Spark 2 support is deprecated and will be removed soon ([23728](https://github.com/apache/beam/issues/23728)).
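
A minimal sketch of the required change: downstream code should consume grouped values as an Iterable rather than a List. The names and data here are hypothetical:

```python
from typing import Dict, Iterable, Tuple

import apache_beam as beam

def count_joined(kv: Tuple[str, Dict[str, Iterable[int]]]) -> Tuple[str, int]:
    key, grouped = kv
    # Iterate rather than index; len() or slicing may no longer be safe.
    return key, sum(1 for values in grouped.values() for _ in values)

with beam.Pipeline() as p:
    emails = p | "emails" >> beam.Create([("alice", 1), ("bob", 2)])
    scores = p | "scores" >> beam.Create([("alice", 3)])
    ({"emails": emails, "scores": scores}
     | beam.CoGroupByKey()
     | beam.Map(count_joined)
     | beam.Map(print))
```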

Bugfixes

* Fixed a bug where the Python cross-language JDBC IO connector could not read or write rows containing Numeric/Decimal type values ([19817](https://github.com/apache/beam/issues/19817)).

2.42.0

Highlights

* Added support for stateful DoFns to the Go SDK.
* Added support for [Batched
DoFns](https://beam.apache.org/documentation/programming-guide/#batched-dofns)
to the Python SDK (see the sketch below).
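
A sketch of a Batched DoFn following the programming-guide example linked above, using numpy arrays as the batch type:

```python
from typing import Iterator

import numpy as np
import apache_beam as beam

class MultiplyByTwo(beam.DoFn):
    # Beam infers the batch conversion from these type annotations.
    def process_batch(self, batch: np.ndarray) -> Iterator[np.ndarray]:
        yield batch * 2

with beam.Pipeline() as p:
    (p
     | beam.Create(np.array([1, 2, 3], dtype=np.int64))
     | beam.ParDo(MultiplyByTwo())
     | beam.Map(print))
```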

New Features / Improvements

* Added support for Zstd compression to the Python SDK.
* Added support for Google Cloud Profiler to the Go SDK.
* Added support for stateful DoFns to the Go SDK.

Breaking Changes

* The Go SDK's Row Coder now uses a different single-precision float encoding for float32 types to match Java's behavior ([22629](https://github.com/apache/beam/issues/22629)).

Bugfixes

* Fixed a bug where the Python cross-language JDBC IO connector could not read or write rows containing Timestamp type values ([19817](https://github.com/apache/beam/issues/19817)).
* Fixed `AfterProcessingTime` behavior in Python's `DirectRunner` to match Java ([23071](https://github.com/apache/beam/issues/23071))

Known Issues

* Go SDK doesn't yet support Slowly Changing Side Input pattern ([23106](https://github.com/apache/beam/issues/23106))

2.41.0

I/Os

* Projection Pushdown optimizer is now on by default for streaming, matching the behavior of batch pipelines since 2.38.0. If you encounter a bug with the optimizer, please file an issue and disable the optimizer using pipeline option `--experiments=disable_projection_pushdown`.

New Features / Improvements

* The Python SDK now supports per-module logging level overrides, previously available only in the Java SDK ([18222](https://github.com/apache/beam/issues/18222)).
* Added support for accessing GCP PubSub Message ordering keys (Java) ([BEAM-13592](https://issues.apache.org/jira/browse/BEAM-13592))

Breaking Changes

* Projection Pushdown optimizer may break Dataflow upgrade compatibility for optimized pipelines when it removes unused fields. If you need to upgrade and encounter a compatibility issue, disable the optimizer using pipeline option `--experiments=disable_projection_pushdown`.

Deprecations

* Support for Spark 2.4.x is deprecated and will be dropped with the release of Beam 2.44.0 or soon after (Spark runner) ([22094](https://github.com/apache/beam/issues/22094)).
* The modules [amazon-web-services](https://github.com/apache/beam/tree/master/sdks/java/io/amazon-web-services) and
[kinesis](https://github.com/apache/beam/tree/master/sdks/java/io/kinesis) for AWS Java SDK v1 are deprecated
in favor of [amazon-web-services2](https://github.com/apache/beam/tree/master/sdks/java/io/amazon-web-services2)
and will be eventually removed after a few Beam releases (Java) ([21249](https://github.com/apache/beam/issues/21249)).

Bugfixes

* Fixed a condition where retrying queries would yield an incorrect cursor in the Java SDK Firestore Connector ([22089](https://github.com/apache/beam/issues/22089)).
* Fixed plumbing of allowed lateness in the Go SDK; the user-set value was previously ignored and always set to 0. ([22474](https://github.com/apache/beam/issues/22474)).
