apache-beam

Latest version: v2.56.0

Page 2 of 7

2.52.0

Not secure
Highlights

* Previously deprecated Avro-dependent code (Beam Release 2.46.0) has finally been removed from the Java SDK "core" package.
Please use `beam-sdks-java-extensions-avro` instead. This allows easily updating the Avro version in user code without
potential breaking changes in Beam "core", since the Beam Avro extension already supports the latest Avro versions and
should handle this. ([25252](https://github.com/apache/beam/issues/25252)).
* Publishing Java 21 SDK container images is now supported as part of the Apache Beam release process. ([28120](https://github.com/apache/beam/issues/28120))
* The Direct Runner and Dataflow Runner support running pipelines on Java 21 (experimental until tests are fully set up). For other runners (Flink, Spark, Samza, etc.), support status depends on the runner projects.

New Features / Improvements

* Added a `UseDataStreamForBatch` pipeline option to the Flink runner. When it is set to true, the Flink runner will run batch
jobs using the DataStream API. By default the option is set to false, so batch jobs are still executed
using the DataSet API.
* `upload_graph` as one of the Experiments options for DataflowRunner is no longer required when the graph is larger than 10MB for Java SDK ([PR28621](https://github.com/apache/beam/pull/28621)).
* Introduced a pipeline option `--max_cache_memory_usage_mb` to configure the state and side-input cache size. The cache is enabled with a default size of 100 MB. Use `--max_cache_memory_usage_mb=X` to set the cache size for the user state API and side inputs. ([28770](https://github.com/apache/beam/issues/28770)).
* Beam YAML stable release. Beam pipelines can now be written using YAML, leveraging the Beam YAML framework, which includes a preliminary set of IOs and turnkey transforms. More information can be found in the YAML root folder and in the [README](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/yaml/README.md).
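As a hedged sketch of the new YAML authoring style (transform types and config keys follow the linked README; the file paths and filter expression here are hypothetical), a minimal pipeline might look like:

```yaml
# Minimal Beam YAML pipeline sketch: read a CSV, keep matching rows, write JSON.
# Paths and the filter expression are placeholders.
pipeline:
  transforms:
    - type: ReadFromCsv
      config:
        path: /path/to/input.csv
    - type: Filter
      config:
        language: python
        keep: "score > 100"
      input: ReadFromCsv
    - type: WriteToJson
      config:
        path: /path/to/output.json
      input: Filter
```

Such a file can then be run through the Beam YAML entry point in the Python SDK, as described in the README.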


Breaking Changes

* `org.apache.beam.sdk.io.CountingSource.CounterMark` uses a custom `CounterMarkCoder` as its default coder, since all Avro-dependent
classes have finally moved to `extensions/avro`. If it is still required to use `AvroCoder` for `CounterMark`, then,
as a workaround, a copy of the "old" `CountingSource` class should be placed into the project code and used directly
([25252](https://github.com/apache/beam/issues/25252)).
* Renamed `host` to `firestoreHost` in `FirestoreOptions` to avoid potential conflict of command line arguments (Java) ([29201](https://github.com/apache/beam/pull/29201)).

Bugfixes

* Fixed "Desired bundle size 0 bytes must be greater than 0" in Java SDK's BigtableIO.BigtableSource when you have more cores than bytes to read (Java) [28793](https://github.com/apache/beam/issues/28793).
* The `watch_file_pattern` arg of [RunInference](https://github.com/apache/beam/blob/104c10b3ee536a9a3ea52b4dbf62d86b669da5d9/sdks/python/apache_beam/ml/inference/base.py#L997) had no effect prior to 2.52.0. To get the intended behavior of `watch_file_pattern`, follow the documentation at https://beam.apache.org/documentation/ml/side-input-updates/ and use the `WatchFilePattern` PTransform as a side input. ([#28948](https://github.com/apache/beam/pulls/28948))
* `MLTransform` no longer outputs artifacts such as min, max, and quantiles. Instead, `MLTransform` will add a feature to output these artifacts in a human-readable format - [29017](https://github.com/apache/beam/issues/29017). For now, to use artifacts such as min and max that were produced by an earlier `MLTransform`, use `read_artifact_location` of `MLTransform`, which reads artifacts that were produced earlier in a different `MLTransform` ([#29016](https://github.com/apache/beam/pull/29016/))
* Fixed a memory leak, which affected some long-running Python pipelines: [28246](https://github.com/apache/beam/issues/28246).

Security Fixes
* Fixed [CVE-2023-39325](https://www.cve.org/CVERecord?id=CVE-2023-39325) (Java/Python/Go) ([#29118](https://github.com/apache/beam/issues/29118)).
* Mitigated [CVE-2023-47248](https://nvd.nist.gov/vuln/detail/CVE-2023-47248) (Python) [#29392](https://github.com/apache/beam/issues/29392).

Known issues

* MLTransform drops identical elements in the output PCollection: for any duplicate elements, only a single element is emitted downstream. ([29600](https://github.com/apache/beam/issues/29600)).
* Some Python pipelines that run with 2.52.0-2.54.0 SDKs and use large materialized side inputs might be affected by a performance regression. To restore the prior behavior on these SDK versions, supply the `--max_cache_memory_usage_mb=0` pipeline option. (Python) ([30360](https://github.com/apache/beam/issues/30360)).
* Users who launch Python pipelines in an environment without internet access and use the `--setup_file` pipeline option might experience an increase in pipeline submission time. This has been fixed in 2.56.0 ([31070](https://github.com/apache/beam/pull/31070)).

2.51.0

Not secure
New Features / Improvements

* In Python, [RunInference](https://beam.apache.org/documentation/sdks/python-machine-learning/#why-use-the-runinference-api) now supports loading many models in the same transform using a [KeyedModelHandler](https://beam.apache.org/documentation/sdks/python-machine-learning/#use-a-keyed-modelhandler) ([27628](https://github.com/apache/beam/issues/27628)).
* In Python, the [VertexAIModelHandlerJSON](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.inference.vertex_ai_inference.html#apache_beam.ml.inference.vertex_ai_inference.VertexAIModelHandlerJSON) now supports passing in inference_args. These will be passed through to the Vertex endpoint as parameters.
* Added support to run `mypy` on user pipelines ([27906](https://github.com/apache/beam/issues/27906))
* Python SDK worker start-up logs and crash logs are now captured by a buffer and logged at the appropriate levels via the Beam logging API. Dataflow Runner users might observe that most `worker-startup` log content is now captured by the `worker` logger. Users who relied on `print()` statements for logging might notice that some logs don't flush before the pipeline completes; we strongly advise using the `logging` package instead of `print()` statements. ([28317](https://github.com/apache/beam/pull/28317))
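A minimal sketch of the logging recommendation above (the function and message are illustrative, not Beam API): emit records through the standard `logging` module so the Beam logging API can capture and route them at the right level, rather than writing to stdout with `print()`.

```python
import logging

# Module-level logger: records emitted this way are captured by Beam's
# logging integration and routed at the configured level, unlike print(),
# whose stdout output may not flush before the pipeline completes.
logger = logging.getLogger(__name__)

def process_element(element):
    # Logged at INFO level; surfaces in the runner's worker logs.
    logger.info("processing element: %s", element)
    return element
```

The same pattern applies inside DoFns: create the logger at module scope and call its level-specific methods from `process`.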


Breaking Changes

* Removed the fastjson library dependency for Beam SQL. The table property is now based on the Jackson ObjectNode (Java) ([24154](https://github.com/apache/beam/issues/24154)).
* Removed TensorFlow from Beam Python container images [PR](https://github.com/apache/beam/pull/28424). If you have been negatively affected by this change, please comment on [#20605](https://github.com/apache/beam/issues/20605).
* Removed the parameter `t reflect.Type` from `parquetio.Write`. The element type is derived from the input PCollection (Go) ([28490](https://github.com/apache/beam/issues/28490))
* Refactored `BeamSqlSeekableTable.setUp`, adding a `joinSubsetType` parameter. ([28283](https://github.com/apache/beam/issues/28283))


Bugfixes

* Fixed exception chaining issue in GCS connector (Python) ([26769](https://github.com/apache/beam/issues/26769#issuecomment-1700422615)).
* Fixed streaming inserts exception handling, GoogleAPICallErrors are now retried according to retry strategy and routed to failed rows where appropriate rather than causing a pipeline error (Python) ([21080](https://github.com/apache/beam/issues/21080)).
* Fixed a bug in Python SDK's cross-language Bigtable sink that mishandled records that don't have an explicit timestamp set: [28632](https://github.com/apache/beam/issues/28632).


Security Fixes
* Python containers updated, fixing [CVE-2021-30474](https://nvd.nist.gov/vuln/detail/CVE-2021-30474), [CVE-2021-30475](https://nvd.nist.gov/vuln/detail/CVE-2021-30475), [CVE-2021-30473](https://nvd.nist.gov/vuln/detail/CVE-2021-30473), [CVE-2020-36133](https://nvd.nist.gov/vuln/detail/CVE-2020-36133), [CVE-2020-36131](https://nvd.nist.gov/vuln/detail/CVE-2020-36131), [CVE-2020-36130](https://nvd.nist.gov/vuln/detail/CVE-2020-36130), and [CVE-2020-36135](https://nvd.nist.gov/vuln/detail/CVE-2020-36135)
* Used go 1.21.1 to build, fixing [CVE-2023-39320](https://security-tracker.debian.org/tracker/CVE-2023-39320)

Known Issues

* Long-running Python pipelines might experience a memory leak: [28246](https://github.com/apache/beam/issues/28246).
* Python pipelines using BigQuery Storage Read API might need to pin `fastavro`
dependency to 1.8.3 or earlier on some runners that don't use Beam Docker containers: [28811](https://github.com/apache/beam/issues/28811)
* MLTransform drops identical elements in the output PCollection: for any duplicate elements, only a single element is emitted downstream. ([29600](https://github.com/apache/beam/issues/29600)).

2.50.0

Not secure
Highlights

* Spark 3.2.2 is now the default version for the Spark runner ([23804](https://github.com/apache/beam/issues/23804)).
* The Go SDK has a new default local runner, called Prism ([24789](https://github.com/apache/beam/issues/24789)).
* All Beam released container images are now [multi-arch images](https://cloud.google.com/kubernetes-engine/docs/how-to/build-multi-arch-for-arm#what_is_a_multi-arch_image) that support both x86 and ARM CPU architectures.

I/Os

* Java KafkaIO now supports picking up topics via topicPattern ([26948](https://github.com/apache/beam/pull/26948))
* Support for read from Cosmos DB Core SQL API ([23604](https://github.com/apache/beam/issues/23604))
* Upgraded to HBase 2.5.5 for HBaseIO. (Java) ([27711](https://github.com/apache/beam/issues/19554))
* Added support for GoogleAdsIO source (Java) ([27681](https://github.com/apache/beam/pull/27681)).

New Features / Improvements

* The Go SDK now requires Go 1.20 to build. ([27558](https://github.com/apache/beam/issues/27558))
* The Go SDK has a new default local runner, Prism. ([24789](https://github.com/apache/beam/issues/24789)).
* Prism is a portable runner that executes each transform independently, ensuring coders are exercised.
* At this point it supersedes the Go direct runner in functionality. The Go direct runner is now deprecated.
* See https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/runners/prism/README.md for the goals and features of Prism.
* Hugging Face Model Handler for RunInference added to Python SDK. ([26632](https://github.com/apache/beam/pull/26632))
* Hugging Face Pipelines support for RunInference added to Python SDK. ([27399](https://github.com/apache/beam/pull/27399))
* Vertex AI Model Handler for RunInference now supports private endpoints ([27696](https://github.com/apache/beam/pull/27696))
* MLTransform transform added with support for common ML pre/postprocessing operations ([26795](https://github.com/apache/beam/pull/26795))
* Upgraded the Kryo extension for the Java SDK to Kryo 5.5.0. This brings in bug fixes, performance improvements, and serialization of Java 14 records. ([27635](https://github.com/apache/beam/issues/27635))
* All Beam released container images are now [multi-arch images](https://cloud.google.com/kubernetes-engine/docs/how-to/build-multi-arch-for-arm#what_is_a_multi-arch_image) that support both x86 and ARM CPU architectures. ([27674](https://github.com/apache/beam/issues/27674)). The multi-arch container images include:
* All versions of Go, Python, Java and Typescript SDK containers.
* All versions of Flink job server containers.
* Java and Python expansion service containers.
* Transform service controller container.
* Spark3 job server container.
* Added support for batched writes to AWS SQS for improved throughput (Java, AWS 2). ([21429](https://github.com/apache/beam/issues/21429))

Breaking Changes

* Python SDK: Legacy runner support removed from Dataflow; all pipelines must use runner v2.
* Python SDK: Dataflow Runner will no longer stage the Beam SDK from PyPI in the `--staging_location` at pipeline submission. Custom container images that are not based on Beam's default image must include an Apache Beam installation. ([26996](https://github.com/apache/beam/issues/26996))

Deprecations

* The Go Direct Runner is now deprecated. It remains available to reduce migration churn.
* Tests can be set back to the direct runner by overriding TestMain: `func TestMain(m *testing.M) { ptest.MainWithDefault(m, "direct") }`
* It's recommended to fix issues seen in tests using Prism, as they can also happen on any portable runner.
* Use the generic register package for your pipeline DoFns to ensure pipelines function on portable runners, like Prism.
* Do not rely on closures or package globals for DoFn configuration; they don't function on portable runners.

Bugfixes

* Fixed a DirectRunner bug in the Python SDK where GroupByKey gets an empty PCollection and fails when the pipeline option `direct_num_workers!=1` is set. ([27373](https://github.com/apache/beam/pull/27373))
* Fixed BigQuery I/O bug when estimating size on queries that utilize row-level security ([27474](https://github.com/apache/beam/pull/27474))

Known Issues

* Long-running Python pipelines might experience a memory leak: [28246](https://github.com/apache/beam/issues/28246).
* Python Pipelines using BigQuery IO or `orjson` dependency might experience segmentation faults or get stuck: [28318](https://github.com/apache/beam/issues/28318).
* Beam Python containers rely on a version of Debian/aom that has several security vulnerabilities: [CVE-2021-30474](https://nvd.nist.gov/vuln/detail/CVE-2021-30474), [CVE-2021-30475](https://nvd.nist.gov/vuln/detail/CVE-2021-30475), [CVE-2021-30473](https://nvd.nist.gov/vuln/detail/CVE-2021-30473), [CVE-2020-36133](https://nvd.nist.gov/vuln/detail/CVE-2020-36133), [CVE-2020-36131](https://nvd.nist.gov/vuln/detail/CVE-2020-36131), [CVE-2020-36130](https://nvd.nist.gov/vuln/detail/CVE-2020-36130), and [CVE-2020-36135](https://nvd.nist.gov/vuln/detail/CVE-2020-36135)
* Python SDK's cross-language Bigtable sink mishandles records that don't have an explicit timestamp set: [28632](https://github.com/apache/beam/issues/28632). To avoid this issue, set explicit timestamps for all records before writing to Bigtable.
* Python SDK worker start-up logs that are not logged at warning level or higher, particularly PIP dependency installation logs, are suppressed. This suppression is reverted in 2.51.0.
* MLTransform drops identical elements in the output PCollection: for any duplicate elements, only a single element is emitted downstream. ([29600](https://github.com/apache/beam/issues/29600)).

2.49.0

Not secure
I/Os

* Support for Bigtable Change Streams added in Java `BigtableIO.ReadChangeStream` ([27183](https://github.com/apache/beam/issues/27183))

New Features / Improvements

* Allow prebuilding large images when using `--prebuild_sdk_container_engine=cloud_build`, like images depending on `tensorflow` or `torch` ([27023](https://github.com/apache/beam/pull/27023)).
* Disabled `pip` cache when installing packages on the workers. This reduces the size of prebuilt Python container images ([27035](https://github.com/apache/beam/pull/27035)).
* Select dedicated avro datum reader and writer (Java) ([18874](https://github.com/apache/beam/issues/18874)).
* Timer API for the Go SDK (Go) ([22737](https://github.com/apache/beam/issues/22737)).

Deprecations

* Removed Python 3.7 support. ([26447](https://github.com/apache/beam/issues/26447))

Bugfixes

* Fixed KinesisIO `NullPointerException` when a progress check is made before the reader is started (IO) ([23868](https://github.com/apache/beam/issues/23868))

Known Issues

* Long-running Python pipelines might experience a memory leak: [28246](https://github.com/apache/beam/issues/28246).

2.48.0

Not secure
Highlights

* "Experimental" annotation cleanup: the annotation and concept have been removed from Beam to avoid
the misperception of code as "not ready". Any proposed breaking changes will be subject to
case-by-case pro/con decision making (and generally avoided) rather than relying on the "Experimental"
annotation to allow them.

I/Os

* Added rename for GCS and copy for local filesystem (Go) ([25779](https://github.com/apache/beam/issues/26064)).
* Added support for enhanced fan-out in KinesisIO.Read (Java) ([19967](https://github.com/apache/beam/issues/19967)).
* This change is not compatible with Flink savepoints created by Beam 2.46.0 applications which had KinesisIO sources.
* Added textio.ReadWithFilename transform (Go) ([25812](https://github.com/apache/beam/issues/25812)).
* Added fileio.MatchContinuously transform (Go) ([26186](https://github.com/apache/beam/issues/26186)).

New Features / Improvements

* Allow passing service name for google-cloud-profiler (Python) ([26280](https://github.com/apache/beam/issues/26280)).
* Dead letter queue support added to RunInference in Python ([24209](https://github.com/apache/beam/issues/24209)).
* Support added for defining pre/postprocessing operations on the RunInference transform ([26308](https://github.com/apache/beam/issues/26308))
* Adds a Docker Compose based transform service that can be used to discover and use portable Beam transforms ([26023](https://github.com/apache/beam/pull/26023)).

Breaking Changes

* Passing a tag into MultiProcessShared is now required in the Python SDK ([26168](https://github.com/apache/beam/issues/26168)).
* CloudDebuggerOptions is removed (deprecated in Beam v2.47.0) for Dataflow runner as the Google Cloud Debugger service is [shutting down](https://cloud.google.com/debugger/docs/deprecations). (Java) ([#25959](https://github.com/apache/beam/issues/25959)).
* AWS 2 client providers (deprecated in Beam [v2.38.0](2380---2022-04-20)) are finally removed ([26681](https://github.com/apache/beam/issues/26681)).
* AWS 2 SnsIO.writeAsync (deprecated in Beam v2.37.0 due to risk of data loss) was finally removed ([26710](https://github.com/apache/beam/issues/26710)).
* AWS 2 coders (deprecated in Beam v2.43.0 when adding Schema support for AWS Sdk Pojos) are finally removed ([23315](https://github.com/apache/beam/issues/23315)).

Deprecations


Bugfixes

* Fixed the Java bootloader failing with "Too Long Args" errors due to long classpaths, by using a pathing jar. (Java) ([25582](https://github.com/apache/beam/issues/25582)).

Known Issues

* PubsubIO writes will throw *SizeLimitExceededException* for any message above 100 bytes, when used in batch (bounded) mode. (Java) ([27000](https://github.com/apache/beam/issues/27000)).
* Long-running Python pipelines might experience a memory leak: [28246](https://github.com/apache/beam/issues/28246).
* Python SDK's cross-language Bigtable sink mishandles records that don't have an explicit timestamp set: [28632](https://github.com/apache/beam/issues/28632). To avoid this issue, set explicit timestamps for all records before writing to Bigtable.

2.47.0

Not secure
Highlights

* Apache Beam adds Python 3.11 support ([23848](https://github.com/apache/beam/issues/23848)).

I/Os

* BigQuery Storage Write API is now available in Python SDK via cross-language ([21961](https://github.com/apache/beam/issues/21961)).
* Added HbaseIO support for writing RowMutations (ordered by rowkey) to Hbase (Java) ([25830](https://github.com/apache/beam/issues/25830)).
* Added fileio transforms MatchFiles, MatchAll and ReadMatches (Go) ([25779](https://github.com/apache/beam/issues/25779)).
* Add integration test for JmsIO + fix issue with multiple connections (Java) ([25887](https://github.com/apache/beam/issues/25887)).

New Features / Improvements

* The Flink runner now supports Flink 1.16.x ([25046](https://github.com/apache/beam/issues/25046)).
* Schema'd PTransforms can now be directly applied to Beam dataframes just like PCollections.
(Note that when doing multiple operations, it may be more efficient to explicitly chain the operations
like `df | (Transform1 | Transform2 | ...)` to avoid excessive conversions.)
* The Go SDK adds new transforms periodic.Impulse and periodic.Sequence that extend support
for slowly updating side input patterns. ([23106](https://github.com/apache/beam/issues/23106))
* Several Google client libraries in Python SDK dependency chain were updated to latest available major versions. ([24599](https://github.com/apache/beam/pull/24599))

Breaking Changes

* If a main session fails to load, the pipeline will now fail at worker startup. ([25401](https://github.com/apache/beam/issues/25401)).
* Python pipeline options will now ignore unparsed command line flags prefixed with a single dash. ([25943](https://github.com/apache/beam/issues/25943)).
* The SmallestPerKey combiner now requires keyword-only arguments for specifying optional parameters, such as `key` and `reverse`. ([25888](https://github.com/apache/beam/issues/25888)).

Deprecations

* Cloud Debugger support and its pipeline options are deprecated and will be removed in the next Beam version,
in response to the Google Cloud Debugger service [turning down](https://cloud.google.com/debugger/docs/deprecations). (Java) ([#25959](https://github.com/apache/beam/issues/25959)).

Bugfixes

* BigQuery sink in STORAGE_WRITE_API mode in batch pipelines could result in data consistency issues during the handling of other unrelated transient errors for Beam SDKs 2.35.0 - 2.46.0 (inclusive). For more details see: https://github.com/apache/beam/issues/26521

Known Issues

* The google-cloud-profiler dependency was accidentally removed from Beam's Python Docker
Image [26998](https://github.com/apache/beam/issues/26698). [Dataflow Docker images](https://cloud.google.com/dataflow/docs/concepts/sdk-worker-dependencies) still preinstall this dependency.
* Long-running Python pipelines might experience a memory leak: [28246](https://github.com/apache/beam/issues/28246).


© 2024 Safety CLI Cybersecurity Inc. All Rights Reserved.