CHANGES.txt

v0.3.2, 2012-02-22 -- AMI versions, spot instances, and more
 * Docs:
   * 'Testing with mrjob' section in docs (includes #321)
   * MRJobRunner.counters() included in docs (#321)
   * terminate_idle_job_flows is spelled correctly in docs (#339)
 * Running jobs:
   * local mode:
     * Allow non-string jobconf values again (this changed in v0.3.0)
     * Don't split *.gz files (#333)
   * emr mode:
     * Spot instance support via ec2_*_instance_bid_price and renamed instance
       type/number options (#219)
     * ami_version option to allow switching between EMR AMIs (#306)
     * 'Error while reading from input file' displays correct file (#358)
     * python_bin used for bootstrap_python_packages instead of just 'python'
       (#355)
     * Pooling works with bootstrap_mrjob=False (#347)
     * Pooling makes sure a job flow has space for the new job before joining
       it (#324)
 * EMR tools:
   * create_job_flow no longer tries to use an option that does not exist
     (#349)
   * report_long_jobs tool alerts on jobs that have run for more than X hours
     (#345)
   * mrboss no longer spells stderr 'stsderr'
   * terminate_idle_job_flows counts jobs with pending (but not running)
     steps as idle (#365)
   * terminate_idle_job_flows can terminate job flows near the end of a
     billable hour (#319)
   * audit_usage breaks down job flows by pool (#239)
   * Various tools (e.g. audit_usage) get list of job flows correctly (#346)

v0.3.1, 2011-12-20 -- Nooooo there were bugs!
 * Instance-type command-line arguments always override mrjob.conf (Issue #311)
 * Fixed crash in mrjob.tools.emr.audit_usage (Issue #315)
 * Tests now use unittest; python setup.py test now works (Issue #292)

v0.3.0, 2011-12-07 -- Worth the wait
 * Configuration:
   * Saner mrjob.conf locations (Issue #97):
     * ~/.mrjob is deprecated in favor of ~/.mrjob.conf
     * searching in PYTHONPATH is deprecated
     * MRJOB_CONF environment variable for custom paths
 * Defining Jobs (MRJob):
   * Combiner support (Issue #74)
   * *_init() and *_final() methods for mappers, combiners, and reducers
     (Issue #124)
   * mapper/combiner/reducer methods no longer need to contain a yield
     statement if they emit no data
   * Protocols:
     * Protocols can be anything with read() and write() methods, and are
       instances by default (Issue #229)
     * Set protocols with the *_PROTOCOL attributes or by re-defining the
       *_protocol() methods
     * Built-in protocol classes cache the encoded and decoded value of the
       last key for faster decoding during reducing (Issue #230)
     * --*protocol switches and aliases are deprecated (Issue #106)
   * Set Hadoop formats with HADOOP_*_FORMAT attributes or the hadoop_*_format()
     methods (Issue #241)
     * --hadoop-*-format switches are deprecated
     * Hadoop formats can no longer be set from mrjob.conf
   * Set jobconf with JOBCONF attribute or the jobconf() method (in addition
     to --jobconf)
   * Set Hadoop partitioner class with --partitioner, PARTITIONER, or
     partitioner() (Issue #6)
   * Custom option parsing (Issue #172)
   * Use mrjob.compat.get_jobconf_value() to get jobconf values from environment
 * Running jobs:
   * All modes:
     * All runners are Hadoop-version aware and use the correct jobconf and
       combiner invocation styles (Issue #111)
     * All types of URIs can be passed through to Hadoop (Issue #53)
     * Speed up steps with no mapper by using cat (Issue #5)
     * Stream compressed files with cat() method (Issue #17)
     * hadoop_bin, python_bin, and ssh_bin can now all take switches (Issue #96)
     * job_name_prefix option is gone (was deprecated)
     * Better cleanup (Issue #10):
       * Separate cleanup_on_failure option
       * More granular cleanup options
     * Cleaner handling of passthrough options (Issue #32)
   * emr mode:
     * job flow pooling (Issue #26)
     * vastly improved log fetching via SSH (Issue #2)
       * New tool: mrjob.tools.emr.fetch_logs
     * default Hadoop version on EMR is 0.20 (was 0.18)
     * ec2_instance_type option now only sets instance type for slave nodes
       when there are multiple EC2 instances (Issue #66)
     * New tool: mrjob.tools.emr.mrboss for running commands on all nodes and
       saving output locally
   * inline mode:
     * Supports cmdenv (Issue #136)
     * Passthrough options can now affect steps list (Issue #301)
   * local mode:
     * Runs 2 mappers and 2 reducers in parallel by default (Issue #228)
     * Preliminary Hadoop simulation for some jobconf variables (Issue #86)
 * Misc:
   * boto 2.0+ is now required (Issue #92)
   * Removed debian packaging (should be handled separately)

v0.2.8, 2011-09-07 -- Bugfixes and betas
 * Fix log parsing crash dealing with timeout errors
 * Make mr_travelling_salesman.py work with simplejson
 * Add emr_additional_info option, to support EMR beta features
 * Remove debian packaging (should be handled separately)
 * Fix crash when creating tmp bucket for job in us-east-1

v0.2.7, 2011-07-12 -- Hooray for interns!
 * All runner options can be set from the command line (Issue #121)
   * Including for mrjob.tools.emr.create_job_flow (Issue #142)
 * New EMR options:
   * availability_zone (Issue #72)
   * bootstrap_actions (Issue #69)
   * enable_emr_debugging (Issue #133)
 * Read counters from EMR log files (Issue #134)
 * Clean old files out of S3 with mrjob.tools.emr.s3_tmpwatch (Issue #9)
 * EMR parses and reports job failure due to steps timing out (Issue #15)
 * EMR boostrap files are no longer made public on S3 (Issue #70)
 * mrjob.tools.emr.terminate_idle_job_flows handles custom hadoop streaming
   jars correctly (Issue #116)
 * LocalMRJobRunner separates out counters by step (Issue #28)
 * bootstrap_python_packages works regardless of tarball name (Issue #49)
 * mrjob always creates temp buckets in the correct AWS region (Issue #64)
 * Catch abuse of __main__ in jobs (Issue #78)
 * Added mr_travelling_salesman example

v0.2.6, 2011-05-24 -- Hadoop 0.20 in EMR, inline runner, and more
* Set Hadoop to run on EMR with --hadoop-version (Issue #71).
   * Default is still 0.18, but will change to 0.20 in mrjob v0.3.0.
 * New inline runner, for testing locally with a debugger
 * New --strict-protocols option, to catch unencodable data (Issue #76)
 * Added steps_python_bin option (for use with virtualenv)
 * mrjob no longer chokes when asked to run on an EMR job flow running 
   Hadoop 0.20 (Issue #110)
 * mrjob no longer chokes on job flows with no LogUri (Issue #112)

v0.2.5, 2011-04-29 -- Hadoop input and output formats
 * Added hadoop_input/output_format options
 * You can now specify a custom Hadoop streaming jar (hadoop_streaming_jar)
 * extra args to hadoop now come before -mapper/-reducer on EMR, so
   that e.g. -libjar will work (worked in hadoop mode since v0.2.2)
 * hadoop mode now supports s3n:// URIs (Issue #53)

v0.2.4, 2011-03-09 -- fix bootstrapping mrjob
 * Fix bootstrapping of mrjob in hadoop and local mode (Issue #89)
 * SSH tunnels try to use the same port for the same job flow (Issue #67)
 * Added mr_postfix_bounce and mr_pegasos_svm to examples.
 * Retry on spurious 505s from EMR API

v0.2.3, 2011-02-24 -- boto compatibility
 * Fix incompatibility with boto 2.0b4 (Issue #91)

v0.2.2, 2011-02-15 -- GET/POST EMR issue
 * Use POST requests for most EMR queries (EMR was choking on large GETs)
 * find_probable_cause_of_failure() ignores transient errors (Issue #31)
 * --hadoop-arg now actually works (Issue #79)
   * on Hadoop, extra args are added first, so you can set e.g. -libjar
 * S3 buckets may now have . in their names
 * MRJob scripts now respect --quiet (Issue #84)
 * added --no-output option for MRJob scripts (Issue #81)
 * added --python-bin option (Issue #54)

v0.2.1, 2010-11-17 -- laststatechangereason bugfix
 * Don't assume EMR sets laststatechangereason

v0.2.0, 2010-11-15 -- Many bugfixes, Windows support
 * New Features/Changes:
   * EMRJobRunner now prints % of mappers and reducers completed when you 
     enable the SSH tunnel.
   * Added mr_page_rank example
   * Added mrjob.tools.emr.audit_usage script (Issue #21)
   * You can specify alternate job owners with the "owner" option. Useful for 
     auditing usage. (Issue #59)
   * The job_name_prefix option has been renamed to label (the old name still 
     works but is deprecated)
   * bootstrap_cmds and bootstrap_scripts no longer automatically invoke sudo
 * Bugs Fixed/Cleanup:
   * bootstrap files no longer get uploaded to S3 twice (Issue #8)
   * When using add_file_option(), show_steps() can now see the local version 
     of the file (Issue #45)
   * Now works on Windows (Issue #46)
   * No longer requires external jar, tar, or zip binaries (Issue #47)
   * mrjob-* scratch bucket is only created as needed (Issue #50)
   * Can now specify us-east-1 region explicitly (Issue #58)
   * mrjob.tools.emr.terminate_idle_job_flows leaves Hive jobs alone (Issue #60)

v0.1.0, 2010-10-28 -- Same code, better version. It's official!

v0.1.0-pre3, 2010-10-27 -- Pre-release to run Yelp code against
 * Added debian packaging
 * mrjob bootstrapping can now deal with symlinks in site-packages/mrjob
 * MRJobRunner.stream_output() can now be called multiple times

v0.1.0-pre2, 2010-10-25 -- Second pre-release after testing
 * Fixed small bugs that broke Python 2.5.1 and Python 2.7
 * Fixed reading mrjob.conf without yaml installed
 * Fix tests to work with modern simplejson and pipes.quote()
 * Auto-create temp bucket on S3 if we don't have one (Issue #16)
 * Auto-infer AWS region from bucket (Issue #7)
 * --steps now passes in all extra args (e.g. --protocol) (Issue #4)
 * Better docs

v0.1.0-pre1, 2010-10-21 -- Initial pre-release. YMMV!