* updated library dependencies (2019, 2025)
* google-cloud-dataproc>=0.3.0
* google-cloud-logging>=1.9.0
* google-cloud-storage>=1.13.1
* PyYAML>=3.10
* jobs:
* MRJobs are now Spark-serializable (without calling sandbox())
* spark() can pass job methods to rdd.map() etc. (2039)
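A minimal sketch of what this enables (the word-count logic and get_words() are illustrative, not from this release):

    from mrjob.job import MRJob

    class MRSparkWordCount(MRJob):

        def spark(self, input_path, output_path):
            from pyspark import SparkContext
            sc = SparkContext(appName='mrjob-wordcount')

            # self.get_words is a bound job method; handing it to
            # flatMap() works because MRJobs are now Spark-serializable,
            # no sandbox() call required
            counts = (sc.textFile(input_path)
                        .flatMap(self.get_words)
                        .map(lambda word: (word, 1))
                        .reduceByKey(lambda a, b: a + b))

            counts.saveAsTextFile(output_path)
            sc.stop()

        def get_words(self, line):
            return line.split()

    if __name__ == '__main__':
        MRSparkWordCount.run()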
* all runners:
* inline runner runs Spark jobs through PySpark (1965)
* local runner runs Spark jobs on local-cluster master (1361)
* cat_output() now ignores files and subdirs starting with "." too (1337)
* this includes Spark checksum files (e.g. .part-00000.crc); see the sketch below
* empty *_bin options mean use the default, not a no-args command (1926)
* affected gcloud_bin, hadoop_bin, sh_bin, ssh_bin
* *python_bin options already worked this way
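To see where the cat_output() change matters, here's a standard driver (the job class is from mrjob.examples; the input path is made up):

    from mrjob.examples.mr_word_freq_count import MRWordFreqCount

    job = MRWordFreqCount(['-r', 'inline', 'input.txt'])
    with job.make_runner() as runner:
        runner.run()
        # hidden files such as .part-00000.crc in the output dir
        # are now skipped, so only real output records come back
        for word, count in job.parse_output(runner.cat_output()):
            print(word, count)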
* improved Spark support
* full support for setup scripts (was just YARN) (2048)
* fully supports uploading files to Spark working dir (1922)
* including renaming files (2017)
* uploading archives/dirs is still unsupported except on YARN
* spark.yarn.appMasterEnv.* now only set on YARN (1919)
* add_file_arg() works on Spark (sketched after this list)
* even on local[*] master (2031)
* uses file:// as appropriate when running locally (1985)
* won't hang if Hadoop or Spark binary can't be run (2024)
* spark master/deploy mode can't be overridden by jobconf (2032)
* can search for spark-submit binary in pyspark installation (1984)
* (Dataproc runner does not yet support Spark)
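A sketch of a file arg that now works under Spark (the --lookup option and the filtering logic are hypothetical):

    from mrjob.job import MRJob

    class MRLookup(MRJob):

        def configure_args(self):
            super(MRLookup, self).configure_args()
            # the file passed via --lookup is uploaded to the Spark
            # working dir, even with a local[*] master
            self.add_file_arg('--lookup')

        def mapper_init(self):
            # open the uploaded copy by the path mrjob rewrote
            with open(self.options.lookup) as f:
                self.lookup = set(line.strip() for line in f)

        def mapper(self, _, line):
            if line.strip() in self.lookup:
                yield line.strip(), 1

    if __name__ == '__main__':
        MRLookup.run()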
* EMR runner:
* fixed fs bug that prevented running with non-default temp bucket (2015)
* fewer API calls when a job retries joining a pooled cluster (1990)
* extra_cluster_params can set nested sub-params (1934); see the conf sketch below
* e.g. Instances.EmrManagedMasterSecurityGroup
* --subnet '' un-sets subnet set in mrjob.conf (1931)
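For example, a dotted key in mrjob.conf reaches into nested EMR API params (a sketch; the security group ID is made up):

    runners:
      emr:
        extra_cluster_params:
          Instances.EmrManagedMasterSecurityGroup: sg-0123456789abcdef0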
* added Spark runner (1940)
* runs jobs entirely on Spark, uses `hadoop fs` for HDFS only
* can use any fs mrjob supports (HDFS, EMR, Dataproc, local)
* can run "classic" MRJobs normally run on Hadoop streaming (1972)
* supports mappers, combiners, and reducers, including *_init() and *_final()
* makes efficient use of combiners, if available (1946)
* supports Hadoop input/output format set in job (1944)
* can run consecutive MRSteps in a single Spark step (1950)
* respects SORT_VALUES (1945)
* emulates Hadoop output compression (1943)
* set the same jobconf variables you would in Hadoop
* can control number of output files
* set Hadoop jobconf variables to control the number of reducers (1953)
* or use --max-output-files (2040)
* can simulate counters with accumulators (1955)
* can handle jobs that load file args in their constructor (2044)
* does not support commands (e.g. mapper_cmd(), mapper_pre_filter())
* (Spark runner does not yet parse logs for probable cause of error)
* Spark harness renamed to mrjob/spark/harness.py, no need to run directly
* `mrjob spark-submit` now defaults to spark runner
* works on emr, hadoop, and local runners as well (1975)
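A sketch of running a classic streaming job on the new runner (the input path and file count are illustrative; assumes spark-submit is installed):

    from mrjob.examples.mr_word_freq_count import MRWordFreqCount

    # -r spark selects the Spark runner; --max-output-files
    # coalesces the job's output down to a single file
    job = MRWordFreqCount(
        ['-r', 'spark', '--max-output-files', '1', 'input.txt'])

    with job.make_runner() as runner:
        runner.run()
        # counters are simulated with Spark accumulators
        print(runner.counters())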
* runner filesystems:
* added put() method to all filesystems (1977); see the sketch at the end of this list
* part size for uploads is now set at fs init time
* CompositeFilesystem can give up on an un-configured filesystem (1974)
* used by the Spark runner when GCS/S3 aren't set up
* mkdir() can now create buckets (2014)
* fs-specific methods now accessed through fs.<name>
* e.g. runner.fs.s3.make_s3_client()
* deprecated useless local_tmp_dir arg to GCSFilesystem (1961)
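A sketch of the reworked fs surface (bucket names are made up; assumes S3 credentials are configured and that put() takes a local source path and a destination URI):

    from mrjob.emr import EMRJobRunner

    fs = EMRJobRunner().fs

    # upload a local file; the part size for multipart uploads
    # was fixed when the fs was initialized
    fs.put('data.csv', 's3://my-bucket/data.csv')

    # mkdir() can now create the bucket itself if needed
    fs.mkdir('s3://my-new-bucket/tmp/')

    # fs-specific methods now hang off fs.<name>
    s3_client = fs.s3.make_s3_client()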
* missing mrjob.examples support files now installed