parquet-compression for lgbm #15
base: main
Conversation
(benchmark 4547923027 / attempt 1)
Current state of performance:

Timer unit: 1e-06 s
Total time: 4.75943 s
File: /Users/ytatar/projects/2302-pickle-compression/slim_trees/lgbm_booster.py
Function: _decompress_booster_handle at line 178
Line # Hits Time Per Hit % Time Line Contents
==============================================================
178 # @profile
179 def _decompress_booster_handle(
180 compressed_state: Tuple[str, bytes, bytes, bytes, str]
181 ) -> str:
182 10 4.0 0.4 0.0 (
183 10 0.0 0.0 0.0 front_str,
184 10 0.0 0.0 0.0 trees_df_bytes,
185 10 0.0 0.0 0.0 nodes_df_bytes,
186 10 0.0 0.0 0.0 leaf_value_bytes,
187 10 0.0 0.0 0.0 back_str,
188 10 2.0 0.2 0.0 ) = compressed_state
189 10 4.0 0.4 0.0 assert type(front_str) == str
190 10 4.0 0.4 0.0 assert type(back_str) == str
191
192 10 34970.0 3497.0 0.7 trees_df = pq_bytes_to_df(trees_df_bytes)
193 10 2020509.0 202050.9 42.5 nodes_df = pq_bytes_to_df(nodes_df_bytes).groupby("tree_idx").agg(lambda x: list(x))
194 10 2.0 0.2 0.0 leaf_values_df = (
195 10 357533.0 35753.3 7.5 pq_bytes_to_df(leaf_value_bytes).groupby("tree_idx")["leaf_value"].apply(list)
196 )
197 # merge trees_df, nodes_df, and leaf_values_df on tree_idx
198 10 11741.0 1174.1 0.2 trees_df = trees_df.merge(nodes_df, on="tree_idx")
199 10 9704.0 970.4 0.2 trees_df = trees_df.merge(leaf_values_df, on="tree_idx")
200
201 # handle = front_str
202
203 10 3.0 0.3 0.0 tree_strings = [front_str]
204
205 # TODO: directly go over trees and nodes
206 20000 620494.0 31.0 13.0 for i, tree in trees_df.iterrows():
222
223 20000 79986.0 4.0 1.7 num_leaves = int(tree["num_leaves"])
224 20000 2578.0 0.1 0.1 num_nodes = num_leaves - 1
225
226 20000 26030.0 1.3 0.5 tree_strings.append(f"""Tree={i}
227 20000 61575.0 3.1 1.3 num_leaves={int(tree["num_leaves"])}
228 20000 59082.0 3.0 1.2 num_cat={tree['num_cat']}
229 20000 148784.0 7.4 3.1 split_feature={' '.join([str(x) for x in tree["split_feature"]])}
230 20000 4304.0 0.2 0.1 split_gain={("0" * num_nodes)[:-1]}
231 20000 357808.0 17.9 7.5 threshold={' '.join([str(x) for x in tree['threshold']])}
232 20000 149148.0 7.5 3.1 decision_type={' '.join([str(x) for x in tree["decision_type"]])}
233 20000 151402.0 7.6 3.2 left_child={" ".join([str(x) for x in tree["left_child"]])}
234 20000 151562.0 7.6 3.2 right_child={" ".join([str(x) for x in tree["right_child"]])}
235 20000 369298.0 18.5 7.8 leaf_value={" ".join([str(x) for x in tree["leaf_value"]])}
236 20000 4519.0 0.2 0.1 leaf_weight={("0 " * num_leaves)[:-1]}
237 20000 4008.0 0.2 0.1 leaf_count={("0 " * num_leaves)[:-1]}
238 20000 3483.0 0.2 0.1 internal_value={("0 " * num_nodes)[:-1]}
239 20000 3363.0 0.2 0.1 internal_weight={("0 " * num_nodes)[:-1]}
240 20000 3289.0 0.2 0.1 internal_count={("0 " * num_nodes)[:-1]}
241 20000 61646.0 3.1 1.3 is_linear={tree['is_linear']}
242 20000 57860.0 2.9 1.2 shrinkage={tree['shrinkage']}
243
244
276
277 10 1.0 0.1 0.0 tree_strings.append(back_str)
278
279 # handle += back_str
280        10       4732.0    473.2      0.1      return "".join(tree_strings)
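For context, a minimal sketch of the df_to_pq_bytes / pq_bytes_to_df round trip this function relies on, assuming pyarrow-based helpers along the lines of the snippets quoted further down in this review (the exact signatures in the PR may differ):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def df_to_pq_bytes(df: pd.DataFrame, compression: str = "lz4") -> bytes:
    # Serialize a dataframe to an in-memory Parquet buffer.
    table = pa.Table.from_pandas(df)
    stream = pa.BufferOutputStream()
    pq.write_table(table, stream, compression=compression)
    return stream.getvalue().to_pybytes()

def pq_bytes_to_df(data: bytes) -> pd.DataFrame:
    # Read the Parquet bytes back into a dataframe.
    return pq.read_table(pa.BufferReader(data)).to_pandas()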
I seem to be missing some values that are required for inference, hence all the zero placeholders.
Looking good! Here are a few comments.
Don't forget to delete benchmark.py.lprof before merging.
LGBM gbdt large: 3.05x -> 7.15x. While this is a great performance improvement without using LZMA at the end, the improvement is not that big when only using LZMA. Maybe it makes sense to add an option to use the old implementation instead of the pyarrow path when pyarrow is not installed? But I suggest moving this discussion to when the PR is almost ready to merge.
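A rough sketch of what such a fallback could look like — compress_df is an illustrative name, pickle merely stands in for the old non-Parquet implementation, and df_to_pq_bytes is the helper sketched above:

import pickle

try:
    import pyarrow  # noqa: F401
    HAS_PYARROW = True
except ImportError:
    HAS_PYARROW = False

def compress_df(df):
    if HAS_PYARROW:
        # Parquet-based path.
        return df_to_pq_bytes(df)
    # Placeholder for the pre-Parquet implementation.
    return pickle.dumps(df)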
The table suggests that the compression factor improves especially for larger models. Unfortunately, we have lost some performance again; I seem to remember it was up to 11x at times.
stream = pa.BufferOutputStream()
pq.write_table(
    table, stream, compression="lz4"
)  # TODO: investigate different effects of compression
Let's also test some compression methods other than lz4. During my sklearn tests I noticed that no compression at all also had very nice performance when using lzma afterwards.
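A hedged benchmark sketch along those lines — the codec names are pyarrow's standard ones (assuming they are available in your build), and the dataframe is a dummy stand-in for the real trees_df:

import lzma
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

trees_df = pd.DataFrame({"tree_idx": range(10_000)})  # dummy stand-in

for codec in ["none", "snappy", "lz4", "zstd", "gzip"]:
    stream = pa.BufferOutputStream()
    pq.write_table(pa.Table.from_pandas(trees_df), stream, compression=codec)
    raw = stream.getvalue().to_pybytes()
    # Also measure the size after the additional lzma pass applied at the end.
    print(f"{codec:>6}: parquet={len(raw)} B, +lzma={len(lzma.compress(raw))} B")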
I checked out zstd, and it randomly performed better here. 🤷
Old:
New:
As discussed with @pavelzw today, how about splitting the non-Parquet parts into a separate PR? It's annoying that Parquet doesn't improve the compression ratio despite the effort you have put into this, @YYYasin19. But let's make the best of it by keeping the other parts!
""" | ||
stream = pa.BufferOutputStream() | ||
pq.write_table( | ||
table, stream, compression="zstd", compression_level=8 |
One idea for the future might be to try out different values here. I just checked and found that disabling zstd here allows lzma to compress more later, resulting in a slight overall improvement of around 5% for the trees_df; this might generalize to the other dataframes as well.
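To make that concrete, a small sketch for sweeping the level (the values are just examples) and comparing against no Parquet-level compression before the final lzma pass:

import lzma
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def to_pq(df, **codec_opts):
    # In-memory Parquet serialization with the given codec options.
    stream = pa.BufferOutputStream()
    pq.write_table(pa.Table.from_pandas(df), stream, **codec_opts)
    return stream.getvalue().to_pybytes()

df = pd.DataFrame({"tree_idx": range(10_000)})  # dummy stand-in for trees_df

for opts in [{"compression": "none"},
             {"compression": "zstd", "compression_level": 1},
             {"compression": "zstd", "compression_level": 8},
             {"compression": "zstd", "compression_level": 19}]:
    raw = to_pq(df, **opts)
    print(opts, "->", len(raw), "B parquet,", len(lzma.compress(raw)), "B after lzma")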
Initial attempt at a Parquet-based compression implementation.
This is a draft!
Core idea:
Todos: