forked from Yelp/mrjob
-
Notifications
You must be signed in to change notification settings - Fork 0
/
CHANGES.txt
202 lines (187 loc) · 9.55 KB
/
CHANGES.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
v0.3.2, 2012-02-22 -- AMI versions, spot instances, and more
* Docs:
* 'Testing with mrjob' section in docs (includes #321)
* MRJobRunner.counters() included in docs (#321)
* terminate_idle_job_flows is spelled correctly in docs (#339)
* Running jobs:
* local mode:
* Allow non-string jobconf values again (this changed in v0.3.0)
* Don't split *.gz files (#333)
* emr mode:
* Spot instance support via ec2_*_instance_bid_price and renamed instance
type/number options (#219)
* ami_version option to allow switching between EMR AMIs (#306)
* 'Error while reading from input file' displays correct file (#358)
* python_bin used for bootstrap_python_packages instead of just 'python'
(#355)
* Pooling works with bootstrap_mrjob=False (#347)
* Pooling makes sure a job flow has space for the new job before joining
it (#324)
* EMR tools:
* create_job_flow no longer tries to use an option that does not exist
(#349)
* report_long_jobs tool alerts on jobs that have run for more than X hours
(#345)
* mrboss no longer spells stderr 'stsderr'
* terminate_idle_job_flows counts jobs with pending (but not running)
steps as idle (#365)
* terminate_idle_job_flows can terminate job flows near the end of a
billable hour (#319)
* audit_usage breaks down job flows by pool (#239)
* Various tools (e.g. audit_usage) get list of job flows correctly (#346)
v0.3.1, 2011-12-20 -- Nooooo there were bugs!
* Instance-type command-line arguments always override mrjob.conf (Issue #311)
* Fixed crash in mrjob.tools.emr.audit_usage (Issue #315)
* Tests now use unittest; python setup.py test now works (Issue #292)
v0.3.0, 2011-12-07 -- Worth the wait
* Configuration:
* Saner mrjob.conf locations (Issue #97):
* ~/.mrjob is deprecated in favor of ~/.mrjob.conf
* searching in PYTHONPATH is deprecated
* MRJOB_CONF environment variable for custom paths
* Defining Jobs (MRJob):
* Combiner support (Issue #74)
* *_init() and *_final() methods for mappers, combiners, and reducers
(Issue #124)
* mapper/combiner/reducer methods no longer need to contain a yield
statement if they emit no data
* Protocols:
* Protocols can be anything with read() and write() methods, and are
instances by default (Issue #229)
* Set protocols with the *_PROTOCOL attributes or by re-defining the
*_protocol() methods
* Built-in protocol classes cache the encoded and decoded value of the
last key for faster decoding during reducing (Issue #230)
* --*protocol switches and aliases are deprecated (Issue #106)
* Set Hadoop formats with HADOOP_*_FORMAT attributes or the hadoop_*_format()
methods (Issue #241)
* --hadoop-*-format switches are deprecated
* Hadoop formats can no longer be set from mrjob.conf
* Set jobconf with JOBCONF attribute or the jobconf() method (in addition
to --jobconf)
* Set Hadoop partitioner class with --partitioner, PARTITIONER, or
partitioner() (Issue #6)
* Custom option parsing (Issue #172)
* Use mrjob.compat.get_jobconf_value() to get jobconf values from environment
* Running jobs:
* All modes:
* All runners are Hadoop-version aware and use the correct jobconf and
combiner invocation styles (Issue #111)
* All types of URIs can be passed through to Hadoop (Issue #53)
* Speed up steps with no mapper by using cat (Issue #5)
* Stream compressed files with cat() method (Issue #17)
* hadoop_bin, python_bin, and ssh_bin can now all take switches (Issue #96)
* job_name_prefix option is gone (was deprecated)
* Better cleanup (Issue #10):
* Separate cleanup_on_failure option
* More granular cleanup options
* Cleaner handling of passthrough options (Issue #32)
* emr mode:
* job flow pooling (Issue #26)
* vastly improved log fetching via SSH (Issue #2)
* New tool: mrjob.tools.emr.fetch_logs
* default Hadoop version on EMR is 0.20 (was 0.18)
* ec2_instance_type option now only sets instance type for slave nodes
when there are multiple EC2 instances (Issue #66)
* New tool: mrjob.tools.emr.mrboss for running commands on all nodes and
saving output locally
* inline mode:
* Supports cmdenv (Issue #136)
* Passthrough options can now affect steps list (Issue #301)
* local mode:
* Runs 2 mappers and 2 reducers in parallel by default (Issue #228)
* Preliminary Hadoop simulation for some jobconf variables (Issue #86)
* Misc:
* boto 2.0+ is now required (Issue #92)
* Removed debian packaging (should be handled separately)
v0.2.8, 2011-09-07 -- Bugfixes and betas
* Fix log parsing crash dealing with timeout errors
* Make mr_travelling_salesman.py work with simplejson
* Add emr_additional_info option, to support EMR beta features
* Remove debian packaging (should be handled separately)
* Fix crash when creating tmp bucket for job in us-east-1
v0.2.7, 2011-07-12 -- Hooray for interns!
* All runner options can be set from the command line (Issue #121)
* Including for mrjob.tools.emr.create_job_flow (Issue #142)
* New EMR options:
* availability_zone (Issue #72)
* bootstrap_actions (Issue #69)
* enable_emr_debugging (Issue #133)
* Read counters from EMR log files (Issue #134)
* Clean old files out of S3 with mrjob.tools.emr.s3_tmpwatch (Issue #9)
* EMR parses and reports job failure due to steps timing out (Issue #15)
* EMR boostrap files are no longer made public on S3 (Issue #70)
* mrjob.tools.emr.terminate_idle_job_flows handles custom hadoop streaming
jars correctly (Issue #116)
* LocalMRJobRunner separates out counters by step (Issue #28)
* bootstrap_python_packages works regardless of tarball name (Issue #49)
* mrjob always creates temp buckets in the correct AWS region (Issue #64)
* Catch abuse of __main__ in jobs (Issue #78)
* Added mr_travelling_salesman example
v0.2.6, 2011-05-24 -- Hadoop 0.20 in EMR, inline runner, and more
* Set Hadoop to run on EMR with --hadoop-version (Issue #71).
* Default is still 0.18, but will change to 0.20 in mrjob v0.3.0.
* New inline runner, for testing locally with a debugger
* New --strict-protocols option, to catch unencodable data (Issue #76)
* Added steps_python_bin option (for use with virtualenv)
* mrjob no longer chokes when asked to run on an EMR job flow running
Hadoop 0.20 (Issue #110)
* mrjob no longer chokes on job flows with no LogUri (Issue #112)
v0.2.5, 2011-04-29 -- Hadoop input and output formats
* Added hadoop_input/output_format options
* You can now specify a custom Hadoop streaming jar (hadoop_streaming_jar)
* extra args to hadoop now come before -mapper/-reducer on EMR, so
that e.g. -libjar will work (worked in hadoop mode since v0.2.2)
* hadoop mode now supports s3n:// URIs (Issue #53)
v0.2.4, 2011-03-09 -- fix bootstrapping mrjob
* Fix bootstrapping of mrjob in hadoop and local mode (Issue #89)
* SSH tunnels try to use the same port for the same job flow (Issue #67)
* Added mr_postfix_bounce and mr_pegasos_svm to examples.
* Retry on spurious 505s from EMR API
v0.2.3, 2011-02-24 -- boto compatibility
* Fix incompatibility with boto 2.0b4 (Issue #91)
v0.2.2, 2011-02-15 -- GET/POST EMR issue
* Use POST requests for most EMR queries (EMR was choking on large GETs)
* find_probable_cause_of_failure() ignores transient errors (Issue #31)
* --hadoop-arg now actually works (Issue #79)
* on Hadoop, extra args are added first, so you can set e.g. -libjar
* S3 buckets may now have . in their names
* MRJob scripts now respect --quiet (Issue #84)
* added --no-output option for MRJob scripts (Issue #81)
* added --python-bin option (Issue #54)
v0.2.1, 2010-11-17 -- laststatechangereason bugfix
* Don't assume EMR sets laststatechangereason
v0.2.0, 2010-11-15 -- Many bugfixes, Windows support
* New Features/Changes:
* EMRJobRunner now prints % of mappers and reducers completed when you
enable the SSH tunnel.
* Added mr_page_rank example
* Added mrjob.tools.emr.audit_usage script (Issue #21)
* You can specify alternate job owners with the "owner" option. Useful for
auditing usage. (Issue #59)
* The job_name_prefix option has been renamed to label (the old name still
works but is deprecated)
* bootstrap_cmds and bootstrap_scripts no longer automatically invoke sudo
* Bugs Fixed/Cleanup:
* bootstrap files no longer get uploaded to S3 twice (Issue #8)
* When using add_file_option(), show_steps() can now see the local version
of the file (Issue #45)
* Now works on Windows (Issue #46)
* No longer requires external jar, tar, or zip binaries (Issue #47)
* mrjob-* scratch bucket is only created as needed (Issue #50)
* Can now specify us-east-1 region explicitly (Issue #58)
* mrjob.tools.emr.terminate_idle_job_flows leaves Hive jobs alone (Issue #60)
v0.1.0, 2010-10-28 -- Same code, better version. It's official!
v0.1.0-pre3, 2010-10-27 -- Pre-release to run Yelp code against
* Added debian packaging
* mrjob bootstrapping can now deal with symlinks in site-packages/mrjob
* MRJobRunner.stream_output() can now be called multiple times
v0.1.0-pre2, 2010-10-25 -- Second pre-release after testing
* Fixed small bugs that broke Python 2.5.1 and Python 2.7
* Fixed reading mrjob.conf without yaml installed
* Fix tests to work with modern simplejson and pipes.quote()
* Auto-create temp bucket on S3 if we don't have one (Issue #16)
* Auto-infer AWS region from bucket (Issue #7)
* --steps now passes in all extra args (e.g. --protocol) (Issue #4)
* Better docs
v0.1.0-pre1, 2010-10-21 -- Initial pre-release. YMMV!