mrjob Changelog

0.4.1

* jobs:
  * SORT_VALUES: Secondary sort by value (240)
    * see mrjob/examples/
  * can now override jobconf() again (656)
  * renamed mrjob.compat.get_jobconf_value() to jobconf_from_env()
  * examples:
    * bash_wrap/ (mapper/reducer_cmd() example)
    * mr_most_used_word.py (two-step job)
    * mr_next_word_stats.py (SORT_VALUES example)
* runners:
  * All runners:
    * single setup option works but is not yet documented (206)
    * setup now uses sh rather than python internally
  * EMR runner:
    * max_hours_idle: self-terminating idle job flows (628)
      * mins_to_end_of_hour option gives finer control over self-termination.
    * Can reuse pooled job flows where previous job failed (633)
    * Throws IOError if output path already exists (634)
    * Gracefully handles SSL cert issues (621, 706)
    * Automatically infers EMR/S3 endpoints from region (658)
    * ls() supports s3n:// scheme (672)
    * Fixed log parsing crash on JarSteps (645)
    * visible_to_all_users works with boto <2.8.0 (701)
    * must use --interpreter with non-Python scripts (683)
    * cat() can decompress gzipped data (601)
  * Hadoop runner:
    * check_input_paths: can disable input path checking (583)
    * cat() can decompress gzipped data (601)
  * Inline/Local runners:
    * Fixed counter parsing for multi-step jobs in inline mode
    * Supports per-step jobconf (616)
* Documentation revamp
* mrjob.parse.urlparse() works consistently across Python versions (686)
* deprecated:
  * many constants in mrjob.emr replaced with functions in mrjob.aws
* removed deprecated features:
  * old conf locations (~/.mrjob and in PYTHONPATH) (747)
  * built-in protocols must be instances (488)
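
As a rough illustration of what "secondary sort by value" means for SORT_VALUES, the sketch below mimics the effect on a small in-memory list. It is not mrjob's implementation: the group_sorted() helper and the sample pairs are hypothetical, and a real job relies on Hadoop's shuffle rather than an in-process sort.

```python
from itertools import groupby
from operator import itemgetter


def group_sorted(pairs):
    """Group (key, value) pairs by key, with each key's values sorted.

    Hypothetical helper for illustration only; it imitates what a reducer
    would see when values arrive in sorted order.
    """
    # Sorting whole (key, value) tuples orders by key first, then by value.
    ordered = sorted(pairs)
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]


# Sample mapper output, deliberately out of order.
mapper_output = [("b", 3), ("a", 2), ("b", 1), ("a", 1)]
result = list(group_sorted(mapper_output))
```

Without the value ordering, a reducer would only be guaranteed to receive all values for a key together, in no particular order.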

0.4.0

* Changes:
  * 'mrjob' command (225)
  * Changed default runner from 'local' to 'inline' (423)
  * Local runner no longer adds working directory to PYTHONPATH of
    subprocesses; use inline runner instead (424)
  * Requires boto 2.2.0 or later
  * Filesystem functionality moved out of MRJobRunner into 'fs' objects
    but forwarded from runners for backward compatibility
  * Changed exception hierarchy of mrjob.ssh (which is private but
    important)
  * Inline and local runners now inherit from the SimMRJobRunner class and
    thus share most of their implementation
  * Internal data structure for representing a step is much richer, allowing
    many cool future features (479)
  * mrjob detects Hadoop version from EMR based on API responses instead of
    what's in the config (611)
* New features:
  * Support for non-Hadoop Streaming jar steps (499)
  * Support for arbitrary commands as Hadoop Streaming
    mappers/combiners/reducers
  * mapper_pre_filter, combiner_pre_filter, and reducer_pre_filter allow
    running of a UNIX command in front of tasks to filter input outside of
    the interpreter
  * Hadoop runner uses PTY to print output from the Hadoop subprocess to the
    console (580)
  * mrjob knows how to terminate the job on cleanup (Ctrl+C closes the job)
    (353)
  * Allow use of multiple -c flags on the command line (420)
* Bug fixes:
  * Silenced some incorrect warnings about ignored options in 'inline' runner
  * terminate_idle_job_flows uses the default configuration to terminate idle
    jobs (559)
* Removed deprecated functionality:
  * --hadoop-*-format
  * --*-protocol switches
  * MRJob.DEFAULT_*_PROTOCOL
  * MRJob.get_default_opts()
  * MRJob.protocols()
  * PROTOCOL_DICT
  * IF_SUCCESSFUL
  * DEFAULT_CLEANUP
  * S3Filesystem.get_s3_folder_keys()
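
The idea behind the *_pre_filter options listed above (a UNIX command trims the input before the Python task sees it) can be sketched with the standard library alone. This is only an approximation of the concept, not how mrjob wires it up: pre_filter() and mapper() are hypothetical names, and the sketch assumes grep is available on the PATH.

```python
import subprocess


def pre_filter(lines, command):
    """Pipe input lines through a UNIX command and return what survives.

    Hypothetical helper; a stand-in for running a filter ahead of a task.
    """
    proc = subprocess.run(
        command,
        input="".join(lines),
        capture_output=True,
        text=True,
        check=False,  # grep exits with status 1 when nothing matches
    )
    return proc.stdout.splitlines(keepends=True)


def mapper(line):
    # Toy mapper: emit each word with a count of 1.
    for word in line.split():
        yield word, 1


raw_input = ["kitty cat\n", "dog\n", "hello kitty\n"]
# Assumption: grep is installed; only lines containing "kitty" pass through.
filtered = pre_filter(raw_input, ["grep", "kitty"])
pairs = [pair for line in filtered for pair in mapper(line)]
```

The benefit is that discarded lines never reach the interpreter, which can matter when most of the input is irrelevant to the job.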

0.3.5

* EMR:
  * --pool-wait-minutes option lets you wait up to X minutes before creating a
    job flow (455)
  * Job flow ID included in error messages on failure (452)
  * JOB and JOB_FLOW cleanup options (485, 455)
* EMR and Hadoop:
  * Compatibility fixes related to deprecated options and Hadoop's bizarre
    non-sequential version numbers (489, 534)
* Other:
  * Warn when *_PROTOCOL is not a class (490)
* Bug fixes:
  * Unicode strings can be used when specifying interpreters (431)
  * --enable-emr-logging no longer causes the wrong counters/logs to be parsed
    (446)
  * TMP_DIR inserted into 'sort' environment variables (477)
  * Setting hadoop_home in mrjob.conf works again
  * Gzipped input files work when specified with relative paths (494)
  * Passthrough options are not re-ordered when sent to Hadoop Streaming
    (509)

v0.3.4.1, 2012-06-12 -- The test suite doesn't catch everything...
* Local mode doesn't try to send multiple mappers to the same output file
  when using multiple compressed files as input

v0.3.4, 2012-06-11 -- We are friendly people.
* Experimental support for IronPython in the local and inline runners
* set_status() and increment_counter() will encode messages/names of type
  'unicode' as UTF-8 when writing to Hadoop Streaming
* EMR and Hadoop counter parsing is more correct
* mrjob.tools.emr.fetch_logs fetches logs from S3 when asked instead of
  incorrectly refusing to do so
* jobconf values can be booleans in mrjob.conf as well as 'true' and 'false'
  strings
* hadoop_version can be a float in mrjob.conf, but a warning is printed to the
  console
* Command line help is split across several --help-* commands
* Local runner sorts output consistently
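
To make the set_status()/increment_counter() entry concrete: Hadoop Streaming reads status and counter updates as "reporter:" lines on a task's stderr. The sketch below builds such lines as UTF-8 bytes. It is written for Python 3 and is not mrjob's code; counter_line() and status_line() are hypothetical helper names, and the sample group, counter and message are made up.

```python
def counter_line(group, counter, amount=1):
    """Build a Hadoop Streaming counter update as UTF-8 bytes.

    Hypothetical helper; the text form is reporter:counter:group,counter,amount.
    """
    line = "reporter:counter:%s,%s,%d\n" % (group, counter, amount)
    return line.encode("utf-8")


def status_line(message):
    """Build a Hadoop Streaming status update as UTF-8 bytes."""
    line = "reporter:status:%s\n" % message
    return line.encode("utf-8")


# Non-ASCII names and messages become multi-byte UTF-8 sequences.
counter_bytes = counter_line("words", "caf\u00e9", 2)
status_bytes = status_line("na\u00efve pass done")
# In a real task these bytes would go to stderr, for example via
# sys.stderr.buffer.write(counter_bytes) on Python 3.
```

Encoding explicitly avoids depending on whatever default encoding the task's stderr happens to use.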

0.3.3.2

* Option parsing no longer dies when -- is used as an argument (435)
* Fixed race condition where two jobs can join same job flow thinking it is
  idle, delaying one of the jobs (438)
* Better error message when a config file contains no data for the current
  runner (433)

0.3.3.1

* Fixed S3 locking mechanism parsing of last modified time to work around an
  inconsistency in the EMR API

0.3.3

* EMR:
  * Error detection code follows symlinks in Hadoop logs (396)
  * terminate_idle_job_flows locks job flows before terminating them (391)
  * terminate_idle_job_flows -qq silences all output (380)
* Other fixes:
  * mr_tower_of_powers test no longer requires Testify (395)
  * Various runner du() implementations no longer broken (393, 394)
  * Hadoop counter parser regex handles long lines better (388)
  * Hadoop counter parser regex is more correct (305)
  * Better error when trying to parse YAML without PyYAML (348)
