Slow processings #207
Some work on producing performance tests has been started in the following branch: https://github.com/mezuro/kalibro_processor/tree/aggregation_performance
I've added tests, but the results are still nowhere near that slow: the mean run time is 0.55s. Insights are welcome @mezuro/core 😄
We'd probably need to test with a high concurrency level to see if the database itself is the bottleneck. I think profiling the queries through PostgreSQL would be the easiest way to understand the issues: we can look at the query plans and query times using something like pgBadger.
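As a lighter-weight complement to pgBadger, individual query plans can also be inspected straight from the Rails console with ActiveRecord's `explain`. A minimal sketch, where the relation is only a placeholder for whatever query turns out to be hot (not a query taken from the code):

```ruby
# Minimal sketch: print PostgreSQL's plan for a (hypothetical) hot query.
# The relation below is an illustration only.
puts MetricResult.where(module_result_id: 42).explain
```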
On mezuro.org there is a single worker generating load on the database, so I don't believe we'll replicate the issue through concurrency. As for pgBadger, it looks like a nice tool to use once we can produce a test that replicates the issues found at mezuro.org.
I thought we had four workers for kalibro_processor, my mistake!
I've fixed the performance test; it was fast because I missed proper data initialization. Now I have results for a tree with height 10 and 2 children per node, i.e. 512 leaves. Running with 8 metrics still did not produce the 100% PostgreSQL CPU usage seen at mezuro.org, but the plot of run time versus the number of metrics looks quadratic.
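For reference, a minimal sketch of the kind of fixture such a test needs: a full binary ModuleResult tree with 10 levels, hence 2^9 = 512 leaves. The model and attribute names are assumptions about the schema, not the actual test code:

```ruby
# Build a full binary tree of ModuleResults with `levels` levels.
# With levels = 10 this creates 2^10 - 1 = 1023 nodes, 512 of them leaves.
def build_subtree(processing, parent, levels)
  node = ModuleResult.create!(processing: processing, parent: parent)
  2.times { build_subtree(processing, node, levels - 1) } if levels > 1
  node
end

root = build_subtree(processing, nil, 10)  # `processing` assumed to already exist
```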
Possible cause: https://github.com/mezuro/kalibro_processor/blob/master/lib/processor/aggregator.rb#L23, which @danielkza pointed out.
The database schema was lacking many integrity checks and indexes. Correct it by first applying a migration that removes all old/stale data, then creating those indexes. The driving reason for this is the very slow performance of processing (especially aggregation) on the new mezuro.org servers. It will hopefully remove (or at least heavily reduce) the superlinear slowdown when the number of metrics rises, as observed in #207. Additionally, remove the timestamp columns from kalibro_modules, module_results and metric_results: they are not used in any way, and there are millions of rows containing them. It's possible and probably desirable to just look at the timestamps in the processing.
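For illustration, a minimal sketch of what such a migration could look like, using only columns mentioned in this thread; the class name, the exact index set and the data-cleanup step of the real migration are assumptions, not the actual code:

```ruby
# Hypothetical sketch of the indexing/cleanup migration; not the actual one.
class AddAggregationIndexes < ActiveRecord::Migration
  def change
    add_index :module_results, :parent_id
    add_index :metric_results, :module_result_id
    add_index :metric_results, :metric_configuration_id

    # Drop the unused timestamp columns mentioned above (type given so the
    # migration stays reversible).
    [:kalibro_modules, :module_results, :metric_results].each do |table|
      remove_column table, :created_at, :datetime
      remove_column table, :updated_at, :datetime
    end
  end
end
```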
I've run the same tests as @rafamanzo and got the following results:
Then I created indexes on the columns involved and got:
From these results, I don't think the lack of indexes is the main cause of such sluggish processings.

Another thing: could the metrics used affect the results? I commented out the metrics I wouldn't use in the test from bottom to top (given the order in the performance script). Is this how you did it too, @rafamanzo? It would also be good for someone else to corroborate these results; I've committed the changes and pushed them to a new branch.

Finally, if the results are corroborated, I think the reason the indexes do not affect performance is that we make too many insertions into MetricResult's table. Note that we are using a b-tree as the index structure: searches are greatly improved (roughly from linear to logarithmic), but every insertion also has to update the index. What do you think?
From that branch, starting at 4 metrics one CPU becomes pinned at 100% usage, but not by PostgreSQL: the aggregation itself was using 100% of one CPU. It looks like this test is stressing some other performance issue, which we need to address before setting up the indexes. But I believe the indexes will play an important role afterwards on the large database we have at mezuro.org; they actually seem to have eased some load on my machine starting at 4 metrics.
Adding the following snippet at the end of the performance script:

```ruby
File.open('profile_pt.html', 'w') { |file| RubyProf::CallStackPrinter.new(@results['Process Time'].first).print(file) }
```

I was able to narrow 51% of the process time down to `descendant_values`:

```ruby
def descendant_values
  module_result.children.map { |child|
    metric_result = child.tree_metric_result_for(self.metric)
    metric_result.value if metric_result
  }.compact
end
```

which looks really convoluted and inefficient. I then replaced it with:

```ruby
def descendant_values
  self.class.
    where(module_result: module_result.children,
          metric_configuration_id: metric_configuration_id).
    select(:value).map(&:value)
end
```

But the execution time for the test increased 😭 If someone else gets the same result, I'd say we also have to write performance tests for this method. Can someone validate those results? What do you think about a test focusing on this?
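For anyone reproducing this, a rough sketch of how such a profile can be captured with ruby-prof. The measure mode, the entry point being profiled, and the way `@results` is populated in the actual perf script are assumptions here:

```ruby
require 'ruby-prof'

# Hypothetical sketch: profile one aggregation run and dump a call stack
# report like the profile_pt.html mentioned above.
RubyProf.measure_mode = RubyProf::PROCESS_TIME
profile = RubyProf.profile do
  run_aggregation  # stand-in for the code under test, not an existing method
end

File.open('profile_pt.html', 'w') do |file|
  RubyProf::CallStackPrinter.new(profile).print(file)
end
```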
@rafamanzo Did you measure wall time as well? If we're removing database waits, the process time might not change that much, but the wall time might.
The wall time increased as well 😞
Looking at the same report format for wall time:

```ruby
File.open('profile_wt.html', 'w') { |file| RubyProf::CallStackPrinter.new(@results['Wall Time'].first).print(file) }
```
Little improvement, but still slower than the baseline. I'm heading out now, but if anyone is in the mood, I'd give https://github.com/zdennis/activerecord-import a try. If we iterate over the module tree one level at a time, we could turn 1022 insertions (for the 2-metric case) into 10 insertions.
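To make the idea concrete, a hedged sketch of what level-by-level bulk insertion with activerecord-import could look like; `levels_bottom_up` and `aggregate_value_for` are assumed helpers, not existing project code:

```ruby
require 'activerecord-import'

# Hypothetical sketch: one bulk INSERT per tree level instead of one
# INSERT per (module result, metric) pair.
levels_bottom_up.each do |module_results|          # assumed: deepest level first
  rows = module_results.flat_map do |module_result|
    metric_configurations.map do |configuration|
      TreeMetricResult.new(
        module_result: module_result,
        metric_configuration_id: configuration.id,
        value: aggregate_value_for(module_result, configuration)
      )
    end
  end
  TreeMetricResult.import(rows)  # single multi-row INSERT for this level
end
```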
Can you try this version of `descendant_values`?

```ruby
def descendant_values
  self.class.
    joins(:module_result).
    where('module_results.parent_id' => module_result_id).
    pluck(:value)
end
```
I get some good speedup with my version (and database indexes). Old:
New:
Edit: got it to 12s now by walking the tree in whole "levels" (instead of just children of one parent) and doing bulk inserts.
@danielkza I've tested your version of `descendant_values` here, which is a slight slowdown from the previous 14s (now, with no load on the machine running the tests, I get 12s). I believe you are running the 8-metric case, right? It is a small setback for the smaller cases and a huge improvement both in run time for large metric sets and in code readability. Thanks for the help.

If you are up to turning those changes into a PR, I'd ask you in advance to split them into one PR per modification and be careful with the commit sizes. Otherwise I'll get to it, probably on Friday afternoon, and I'd be happy if you'd review my PR then 😄 Indeed a nice job, congratulations.
I'll probably make 2 or 3 PRs. I have some improvements for the perf script, the indexes, refactoring of the aggregation, and possibly one for the compound metrics calculation (I noticed it was walking the metrics tree regardless of whether there are any actual compound metrics being used).
Closing, as the aggregation performance has been successfully addressed and further investigation of other performance improvements has been extracted into separate issues.
Running the following on the mezuro.org console gives us a hint of why some processings take so long.
It means that, on average, each Processing spends approximately 30 minutes aggregating.
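The original console snippet is not reproduced above. Purely as an illustration of the kind of query meant here, and assuming a ProcessTime model that records how long each processing state took (an assumption about the schema):

```ruby
# Hypothetical sketch: average time (in seconds) spent in the aggregating state.
times = ProcessTime.where(state: 'aggregating').pluck(:time)
puts times.sum / times.size.to_f
```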
Looking at the database machine's processor, we can see it at 100% the whole time during aggregation.
I suggest:

- profiling with RubyProf::CallStackPrinter;
- aggregating TreeMetricResults bottom-up;
- indexes on the metric_results table may be worth the extra insertion cost (we need to investigate that once the bulk insertions are up);
- looking into TreeMetricResult#descendant_values performance.