Hive performance regression between 419 and 463 #24099
Comments
Which JDK are you using? Trino 457 uses native code for decompression: https://trino.io/docs/current/release/release-457.html#hive-connector
Also, 458 switched off the legacy FS library. Did you migrate to the native FS or keep the legacy one? Can you share your catalog configs?
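For reference, switching a Hive catalog to the native S3 filesystem typically looks something like the sketch below; the metastore URI, region, and other values here are placeholders, not taken from this issue.

```properties
connector.name=hive
hive.metastore.uri=thrift://example-metastore:9083
# Native S3 filesystem (replaces the legacy Hadoop-based implementation)
fs.native-s3.enabled=true
s3.region=us-east-1
```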
@nineinchnick Thank you very much for your quick response!
I do have these attached:
- hive.properties
- coordinator config.properties
- worker config.properties
Can you try disabling native compression?
Do you mean by following https://trino.io/docs/current/admin/properties-general.html#file-compression-and-decompression? I just tried that and it did not seem to make a difference in performance: pretty much the same CPU time and execution time. Here's my jvm.config:
So we need more performance data to analyze this case. Can you share operator stats for that query?
Is the query plan the same for both versions? Or different?
I've updated the issue title as it doesn't seem to be related to zstd at all.
Can you add the output of EXPLAIN ANALYZE and the queryinfo JSON from both versions?
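For anyone following along, capturing the plan with per-operator statistics looks something like this; the table and column names below are placeholders, not the actual query from this issue.

```sql
-- Run from the Trino CLI: executes the query and prints the plan
-- annotated with per-operator CPU, wall time, and row counts
EXPLAIN ANALYZE
SELECT some_key, count(*)
FROM hive.example_schema.example_table
GROUP BY some_key;
```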
Here are the query plans.
And here are the JSONs.
In 463 the table scan is:
@wendigo Just following up here. Do you need any more info from me? Or do you see something incorrect with my setup that has caused the increase in CPU?
@raunaqmorarka do you have time and resources to triage this?
Given that this is a CSV file, we're likely dealing with some missing optimization in the new native CSV reader.
@raunaqmorarka this is an upstream data source that I don't have control over. We use Trino to aggregate this CSV and transform it into ORC for downstream consumers.
You can also notice a significant memory usage increase. I just tried a migration from Trino 419 to 464 and ran into excessive memory usage using just Iceberg catalogs. I can't share the SQL query, but it uses a LEFT JOIN and a GROUP BY. Here are some statistics from Trino 419:
vs Trino 464:
So it doesn't appear to be Hive specific.
@benrifkind are you able to share JFRs from the old and new versions?
@raunaqmorarka Can you explain how I would do that? Do you want recordings from the coordinator and the workers?
A profile from any worker that reads the CSV files would be sufficient: https://www.baeldung.com/java-flight-recorder-monitoring
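For reference, one way to capture such a recording on a running worker is with jcmd from the worker's JDK; this is a sketch where the PID and output path are placeholders.

```sh
# List local JVMs to find the worker's process ID
jcmd

# Start a two-minute profiling recording and write it to a file
jcmd <worker-pid> JFR.start name=trino-profile settings=profile duration=120s filename=/tmp/trino-worker.jfr

# Check whether the recording is still running; the file is complete once it finishes
jcmd <worker-pid> JFR.check
```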
@raunaqmorarka Does this work? I had to rename it to .txt because GitHub doesn't allow .jfr files.
@raunaqmorarka this is the Java decompressor, not a native one.
Yes, I see that. I was wondering why the native decompressor is not used and why it's slower than in previous releases. Was the legacy reader using native zstd from Hadoop or the same Java implementation?
@raunaqmorarka the V3 Java implementation is the same as V2/V1, so there should be no performance penalty at all.
Here is the Trino 419 JFR file: trino-419.jfr.txt
I am trying to understand a performance degradation that has happened on upgrading from Trino 419 to Trino 463. Querying Hive tables with zstd-compressed data in S3 seems to run significantly slower in Trino 463 than in Trino 419.
I have a symlink Hive table built on top of zstd-compressed data in S3. Querying this table is relatively fast in Trino 419; however, when I tried to upgrade to the most recent version of Trino, I saw a significant decrease in execution speed and a spike in CPU.
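For context, a symlink table of this kind is usually defined with Hive DDL along these lines; the schema, columns, and location below are invented for illustration and are not the actual table.

```sql
-- Each file under LOCATION is a manifest listing the S3 paths
-- of the actual zstd-compressed CSV data files
CREATE TABLE example_schema.example_events (
  event_time string,
  user_id    string,
  payload    string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://example-bucket/manifests/example_events/';
```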
It is a simple GROUP BY query, along the lines of the sketch below.
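The actual query was attached to the issue and is not reproduced here; this stand-in uses invented names purely to show its shape.

```sql
SELECT some_key, count(*) AS row_count
FROM hive.example_schema.example_events
GROUP BY some_key;
```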
These are the query stats in Trino 419:
And these are the query stats in Trino 463:
The same query now takes double the amount of time and the CPU usage is much higher.
I tried to pinpoint where in the upgrade path this performance degradation occurred, but when I tried running on Trino 430, for example, I got errors like:
I am running this self-hosted on AWS EC2 instances. The coordinator is of type r7g.4xlarge and there are 5 workers of type r7g.8xlarge.
I don't think the fact that this is a symlink table has anything to do with the performance issue. The reason it's a symlink table is that the data is stored in S3 in a funky way that does not lend itself to the Hive partitioning scheme.
This is some info about one of the zstd files:
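For anyone trying to reproduce, per-file details like this can be pulled with the zstd CLI; the S3 path below is a placeholder.

```sh
# Copy one data file locally, then list its zstd frame metadata
# (frame count, compressed/uncompressed sizes, ratio, checksum type)
aws s3 cp s3://example-bucket/path/to/data-file.csv.zst /tmp/sample.csv.zst
zstd -l /tmp/sample.csv.zst
```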