* Configuration:
  * Saner mrjob.conf locations (Issue 97):
    * ~/.mrjob is deprecated in favor of ~/.mrjob.conf
    * searching in PYTHONPATH is deprecated
    * MRJOB_CONF environment variable for custom paths
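
The lookup described above could be sketched roughly as follows. This is
not mrjob's actual lookup code: the function names are hypothetical, and the
precedence (MRJOB_CONF first, then ~/.mrjob.conf, then the deprecated
~/.mrjob) is an assumption made for illustration. The deprecated PYTHONPATH
search is left out.

```python
import os


def candidate_conf_paths(environ=None):
    """Return config paths in the order this sketch would try them.

    Hypothetical helper; the ordering is an assumption, not mrjob's code.
    """
    environ = os.environ if environ is None else environ
    # Assumption: an explicit MRJOB_CONF is the only candidate when set.
    custom = environ.get("MRJOB_CONF")
    if custom:
        return [custom]
    home = environ.get("HOME", os.path.expanduser("~"))
    return [
        os.path.join(home, ".mrjob.conf"),  # preferred location
        os.path.join(home, ".mrjob"),       # deprecated location
    ]


def find_conf(environ=None, exists=os.path.exists):
    """Return the first candidate path that exists, or None."""
    for path in candidate_conf_paths(environ):
        if exists(path):
            return path
    return None
```

Passing `environ` and `exists` explicitly keeps the sketch easy to exercise
without touching the real home directory.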
* Defining Jobs (MRJob):
  * Combiner support (Issue 74)
  * *_init() and *_final() methods for mappers, combiners, and reducers
    (Issue 124)
  * mapper/combiner/reducer methods no longer need to contain a yield
    statement if they emit no data
  * Protocols:
    * Protocols can be anything with read() and write() methods, and are
      instances by default (Issue 229)
    * Set protocols with the *_PROTOCOL attributes or by re-defining the
      *_protocol() methods
    * Built-in protocol classes cache the encoded and decoded value of the
      last key for faster decoding during reducing (Issue 230)
    * --*protocol switches and aliases are deprecated (Issue 106)
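
As a rough illustration of the protocol notes above, here is a minimal
sketch of an object with read() and write() methods that also caches the
last key. The class name CachedKeyJSONProtocol and the tab-separated JSON
line format are assumptions chosen for the example, not mrjob's built-in
classes, and the read(line) / write(key, value) signatures are assumed as
well.

```python
import json


class CachedKeyJSONProtocol(object):
    """Hypothetical protocol: JSON key, a tab, then JSON value."""

    def __init__(self):
        # Encoded and decoded forms of the most recently read key.
        self._last_key_encoded = None
        self._last_key_decoded = None

    def read(self, line):
        """Decode one line into a (key, value) pair."""
        raw_key, raw_value = line.split("\t", 1)
        # Reducer input is grouped by key, so consecutive lines usually
        # repeat the same encoded key; decode it only when it changes.
        if raw_key != self._last_key_encoded:
            self._last_key_encoded = raw_key
            self._last_key_decoded = json.loads(raw_key)
        # Note: a cached mutable key is returned as the same object.
        return self._last_key_decoded, json.loads(raw_value)

    def write(self, key, value):
        """Encode a (key, value) pair as one line (no trailing newline)."""
        return "%s\t%s" % (json.dumps(key), json.dumps(value))


protocol = CachedKeyJSONProtocol()
line = protocol.write("apple", 3)
key, value = protocol.read(line)
```

Splitting on the first tab is safe here because JSON escapes tab
characters inside strings, so a raw tab can only be the separator.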
  * Set Hadoop formats with HADOOP_*_FORMAT attributes or the
    hadoop_*_format() methods (Issue 241)
    * --hadoop-*-format switches are deprecated
    * Hadoop formats can no longer be set from mrjob.conf
  * Set jobconf with the JOBCONF attribute or the jobconf() method (in
    addition to --jobconf)
  * Set Hadoop partitioner class with --partitioner, PARTITIONER, or
    partitioner() (Issue 6)
  * Custom option parsing (Issue 172)
  * Use mrjob.compat.get_jobconf_value() to get jobconf values from the
    environment
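
For context on reading jobconf values from the environment: Hadoop
Streaming exposes jobconf variables to tasks as environment variables, with
dots in the name replaced by underscores. The sketch below is a stand-in
that shows only that lookup. jobconf_from_environ is a hypothetical name,
and this is not the implementation of mrjob.compat.get_jobconf_value().

```python
import os


def jobconf_from_environ(name, default=None, environ=None):
    """Look up a jobconf variable the way a streaming task would see it.

    Hypothetical helper for illustration; only handles the dot-to-underscore
    translation.
    """
    environ = os.environ if environ is None else environ
    # e.g. "mapred.task.id" is exported as "mapred_task_id"
    return environ.get(name.replace(".", "_"), default)


# Example with a fake environment rather than a real Hadoop task:
fake_env = {"mapred_task_id": "attempt_001"}
task_id = jobconf_from_environ("mapred.task.id", environ=fake_env)
```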
* Running jobs:
  * All modes:
    * All runners are Hadoop-version aware and use the correct jobconf and
      combiner invocation styles (Issue 111)
    * All types of URIs can be passed through to Hadoop (Issue 53)
    * Speed up steps with no mapper by using cat (Issue 5)
    * Stream compressed files with the cat() method (Issue 17)
    * hadoop_bin, python_bin, and ssh_bin can now all take switches (Issue 96)
    * Removed the deprecated job_name_prefix option
    * Better cleanup (Issue 10):
      * Separate cleanup_on_failure option
      * More granular cleanup options
    * Cleaner handling of passthrough options (Issue 32)
  * emr mode:
    * Job flow pooling (Issue 26)
    * Vastly improved log fetching via SSH (Issue 2)
      * New tool: mrjob.tools.emr.fetch_logs
    * Default Hadoop version on EMR is 0.20 (was 0.18)
    * ec2_instance_type option now only sets instance type for slave nodes
      when there are multiple EC2 instances (Issue 66)
    * New tool: mrjob.tools.emr.mrboss for running commands on all nodes and
      saving output locally
  * inline mode:
    * Supports cmdenv (Issue 136)
    * Passthrough options can now affect the steps list (Issue 301)
  * local mode:
    * Runs 2 mappers and 2 reducers in parallel by default (Issue 228)
    * Preliminary Hadoop simulation for some jobconf variables (Issue 86)
* Misc:
  * boto 2.0+ is now required (Issue 92)
  * Removed Debian packaging (should be handled separately)