Slow processings #207

Closed · 5 tasks done
rafamanzo opened this issue Jul 8, 2016 · 20 comments

@rafamanzo
Member

rafamanzo commented Jul 8, 2016

Running the following on the mezuro.org console gives us a hint of why some processings take so long:

=> ProcessTime.all.group('state').average('time').map { |k,v| {k => v.to_f/60.0} }

=> [
  {"PREPARING"=>0.0021892476280869332},
  {"CHECKING"=>2.1039809739626834},
  {"AGGREGATING"=>29.408375646260332},
  {"BUILDING"=>0.919068727626195},
  {"INTERPRETING"=>1.7958658936723333},
  {"DOWNLOADING"=>0.20159135590306834},
  {"CALCULATING"=>3.2441646428687334},
  {"COLLECTING"=>4.56774077964555}
]

This means that, on average, each Processing spends approximately 30 minutes in the AGGREGATING state.

Looking at the database machine's processor, we can see it pinned at 100% for the whole duration of aggregation.

I suggest:

@rafamanzo
Member Author

rafamanzo commented Jul 8, 2016

Some work towards performance tests has been started on the following branch: https://github.com/mezuro/kalibro_processor/tree/aggregation_performance

@rafamanzo
Member Author

I've added tests for the Aggregator with complexity similar to the one found at mezuro.org: ced0eee

But the results are still nowhere near that slow, with a mean run time of 0.55 s.

Insights are welcome @mezuro/core 😄

@danielkza
Contributor

We'd probably need to test with a high concurrency level to tell whether the database itself is the bottleneck. I think profiling the queries through PostgreSQL would be the easier way to understand the issues: we can look at the query plans and query times using something like pgBadger.

@rafamanzo
Member Author

rafamanzo commented Jul 15, 2016

On mezuro.org there is a single worker generating load on the database, so I don't believe we'll replicate the issue through concurrency. But I've only tested with a single MetricResult. Perhaps it is worth increasing its count in the test, since the mezuro.org issues happen with configurations holding at least half a dozen metrics.

As for pgBadger, it looks like a nice tool to use once we can produce a test that replicates the issues found at mezuro.org.

@danielkza
Contributor

I thought we had four workers for kalibro_processor, my mistake!

@rafamanzo
Member Author

rafamanzo commented Jul 15, 2016

I've fixed the performance test. It was fast because I missed proper data initialization.

Now, for a tree with height 10 and 2 children per node (so 512 leaves), I've produced the following:

| Metric Count | Wall Time (s) | Process Time (s) |
| --- | --- | --- |
| 1 | 8.672761964797974 | 7.54611889 |
| 2 | 14.596535205841064 | 24.3851091414 |
| 4 | 77.55368666648864 | 127.275106022 |
| 8 | 256.8674069881439 | 556.1682938326 |

Running with 8 metrics still did not produce the 100% PostgreSQL CPU usage seen at mezuro.org. The growth looks quadratic.

@rafamanzo
Member Author

Possible cause: https://github.com/mezuro/kalibro_processor/blob/master/lib/processor/aggregator.rb#L23. @danielkza pointed out that the metric_results table has no indexes, which makes repeated executions of that query really slow.
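
For illustration, the fix could be a migration roughly like this (a sketch only; the migration name is made up, and the final index set should follow whatever the aggregation queries actually need):

class AddIndexesToMetricResults < ActiveRecord::Migration
  def change
    # Index the columns used when looking up a child's metric result during
    # aggregation, so the lookup no longer requires a sequential scan.
    add_index :metric_results, :module_result_id
    add_index :metric_results, :metric_configuration_id
  end
end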

danielkza added a commit that referenced this issue Jul 16, 2016
The database schema was lacking many integrity checks and indexes.
Correct it by first applying a migration that removes all old/stale
data, then creating those indexes.

The driving reason for this is the very slow performance of processing
(especially aggregation) on the new mezuro.org servers. It will hopefully
remove (or at least greatly reduce) the superlinear slowdown when the
number of metrics rises, as observed in #207.

Additionally, remove the timestamp columns from kalibro_modules,
module_results and metric_results: they are not used in any way, and
there are millions of rows containing them. It's possible and probably
desirable to just look at the timestamps in the processing.
@diegoamc
Contributor

diegoamc commented Jul 16, 2016

I've run the same tests as @rafamanzo and got the following results:

| Metric Count | Wall Time (s) | Process Time (s) |
| --- | --- | --- |
| 1 | 6.992499828338623 | 7.272484427600001 |
| 2 | 13.529412364959716 | 12.9735503228 |
| 4 | 26.20373945236206 | 28.432685047400003 |
| 8 | 62.70585794448853 | 64.16384417580001 |

Then I created indexes on the module_result_id and metric_configuration_id columns of the metric_results table and reran the same tests. These are the results:

| Metric Count | Wall Time (s) | Process Time (s) |
| --- | --- | --- |
| 1 | 6.903397846221924 | 7.131537662599999 |
| 2 | 13.142084169387818 | 13.2359803324 |
| 4 | 28.782897806167604 | 28.8164754306 |
| 8 | 64.86161379814148 | 56.7507790644 |

From these results, I don't think the lack of indexes is the main problem behind such sluggish processings.
However, I also have some doubts about my tests. First, my results are much faster than @rafamanzo's. What could be the reason for that? Just processing power?

Another thing: could the metrics used affect the results? I commented out the metrics I wasn't using, from bottom to top (given the order in the performance script). Is that how you did it too, @rafamanzo?

Also, it would be good for someone else to corroborate these results. I've committed the changes and pushed a new branch called metric_result_indexes. The commit includes the modified database schema, so a single rollback is enough to return the database to its previous state (without the indexes).

Finally, if the results are corroborated, I think the reason the indexes do not affect performance is that we make too many insertions into the metric_results table. Note that we are using a B-tree as the index structure. On such structures, searches are greatly improved: a search goes from O(n) to O(t·log n), where t is the number of elements in each B-tree node. However, insertions become slower, going from O(1) to O(log n). Therefore, making a lot of insertions into the table, which is our case (in the script setup, with 8 metrics, we create 8184 metric results), is not recommended: the time gained by faster searches may be offset by slower insertions. In my opinion, that is what the results are showing. We can only conclude that the sluggish performance of the aggregation state is not due to the lack of indexes on the metric_results table. The reference I used for the B-tree algorithms and time complexities is here.
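
As a purely illustrative back-of-the-envelope check of those figures (assuming roughly one B-tree index update of about log2(n) comparisons per insert, and ignoring node splits and I/O):

n = 8184                                     # metric results created in the 8-metric setup
comparisons_per_insert = Math.log2(n).ceil   # ~13 comparisons per indexed insert, versus an O(1) append
puts comparisons_per_insert
puts n * comparisons_per_insert              # ~106k comparisons across all inserts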

What do you think?

@rafamanzo
Member Author

From the metric_result_indexes branch I've got the following, on the same hardware as the previous tests:

| Metric Count | Wall Time (s) | Process Time (s) |
| --- | --- | --- |
| 1 | 7.591419649124146 | 12.0535053682 |
| 2 | 31.928905773162843 | 39.1385960514 |
| 4 | 74.23858404159546 | 140.50534778839997 |
| 8 | 247.94011964797974 | 535.5620568106 |

Starting from 4 metrics, one CPU became pinned at 100% usage, but not by Postgres: the aggregation itself was using 100% of one CPU.

It looks like this test is stressing some other performance issue, rather than the missing indexes, which we need to address before setting those up. But I believe the indexes will play an important role afterwards on the large database we have at mezuro.org; they actually seem to have eased some load on my machine starting at 4 metrics.

@rafamanzo
Member Author

rafamanzo commented Jul 17, 2016

Adding the following snippet at the end of Performance::Base#print:

File.open('profile_pt.html', 'w') { |file| RubyProf::CallStackPrinter.new(@results['Process Time'].first).print(file) }

I was able to attribute 51% of the process time to TreeMetricResult#descendant_values, which has the following implementation:

def descendant_values
  module_result.children.map { |child|
    metric_result = child.tree_metric_result_for(self.metric)
    metric_result.value if metric_result
  }.compact
end

This looks really convoluted and inefficient, so I replaced it with:

def descendant_values
  self.class.
    where(module_result: module_result.children,
          metric_configuration_id: metric_configuration_id).
    select(:value).map(&:value)
end

But the execution time for the test increased 😭

If someone else gets the same result, I'd say we need to write performance tests for this method as well. Can someone validate these results? What do you think about a test focusing on this?
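
For example, such a test could look roughly like the sketch below (a sketch only; Benchmark comes from Ruby's standard library, and it assumes the module tree and metric results have already been populated by the existing performance setup):

require 'benchmark'

# Hypothetical focused benchmark: time only descendant_values across every
# TreeMetricResult in the previously populated tree.
elapsed = Benchmark.measure do
  TreeMetricResult.find_each { |metric_result| metric_result.descendant_values }
end
puts "descendant_values: #{elapsed.real.round(2)}s wall time, #{elapsed.total.round(2)}s process time"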

@danielkza
Contributor

@rafamanzo Did you measure wall time also? If we're removing database waits the process time might not change that much, but the wall time might.

@rafamanzo
Member Author

The wall time increased as well 😞

@rafamanzo
Member Author

Looking at the same report format for wall time:

File.open('profile_wt.html', 'w') { |file| RubyProf::CallStackPrinter.new(@results['Wall Time'].first).print(file) }

For the code with the new descendant_values implementation, ActiveRecord::Transactions#save takes 54% of the time. I'm experimenting with ActiveRecord::Base.transaction to see if we can speed up the saves by using bigger transactions.
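
A minimal sketch of that experiment, assuming the aggregated results are saved in a loop somewhere in the aggregator (the collection name below is hypothetical):

# Commit all the saves in one transaction instead of letting ActiveRecord open
# one implicit transaction per save.
ActiveRecord::Base.transaction do
  aggregated_metric_results.each(&:save!)  # hypothetical collection of new TreeMetricResults
end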

@rafamanzo
Member Author

Little improvement, but still slower than the baseline. I'm stepping away for now, but if anyone gets in the mood, I'd give https://github.com/zdennis/activerecord-import a try.

If we can iterate over the module tree by height, we could turn 1022 insertions (for the 2-metric case) into 10 insertions.
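
The idea would be roughly the following (a sketch only; module_results_at_level and aggregated_value_for are hypothetical placeholders, and import is activerecord-import's bulk-insert entry point):

require 'activerecord-import'

# Build all aggregated results for one tree level in memory, then write them
# with a single multi-row INSERT instead of one INSERT per record.
records = module_results_at_level.map do |module_result|
  TreeMetricResult.new(module_result: module_result,
                       metric_configuration_id: metric_configuration_id,
                       value: aggregated_value_for(module_result))  # hypothetical helper
end
TreeMetricResult.import(records)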

@danielkza
Contributor

danielkza commented Jul 17, 2016

Can you try this version of descendant_values? It should avoid the extra query that fetches the children module_results at the start.

  def descendant_values
    self.class.
      joins(:module_result).
      where('module_results.parent_id' => module_result_id).
      pluck(:value)
  end

@danielkza
Contributor

danielkza commented Jul 17, 2016

I get some good speedup with my version (and database indexes). Old:

WALL_TIME:
  198.24422693252563
  205.54687118530273
  223.26766705513
  Average: 209.01958839098612

New:

WALL_TIME:
  93.48759293556213
  98.39741396903992
  101.10500907897949
  Average: 97.66333866119385

edit: got it to 12s now by walking the tree in whole "levels" (instead of just children of one parent) and doing bulk inserts.
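
Roughly, the level-by-level version looks like this (a sketch only, not the exact code in the upcoming PR; it assumes module results can be fetched per tree height, and aggregate stands in for whatever aggregation function applies to each metric):

  # Process the tree bottom-up, one whole level at a time: for each level and
  # metric, fetch all children values with a single query grouped by parent,
  # then persist every new parent result with a single bulk insert.
  (max_height - 1).downto(0) do |height|
    parents = processing.module_results.where(height: height).to_a  # association and `height` column assumed
    parent_ids = parents.map(&:id)

    new_results = metric_configuration_ids.flat_map do |config_id|
      values_by_parent = TreeMetricResult.joins(:module_result)
                                         .where('module_results.parent_id' => parent_ids,
                                                metric_configuration_id: config_id)
                                         .pluck('module_results.parent_id', :value)
                                         .group_by(&:first)

      parents.map do |parent|
        values = (values_by_parent[parent.id] || []).map(&:last)
        TreeMetricResult.new(module_result: parent,
                             metric_configuration_id: config_id,
                             value: aggregate(values))  # stand-in for the real aggregation
      end
    end

    TreeMetricResult.import(new_results)  # activerecord-import: one INSERT per level
  end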

@rafamanzo
Member Author

@danielkza I've tested your version of descendant_values here (with no indexes), on the same hardware as before, and got the following for the 2-metric case:

Performance::Aggregation
----------
* Process Time: 17.7611858672
* Wall Time: 17.437276744842528

That is a slight slowdown from the previous 14 s (and now, with no load on the machine running the tests, I've got 12 s). I believe you are running the 8-metric case, right? It is a small setback for the smaller cases and a huge improvement both in run time for large metric sets and in code readability.

Thanks for the help. If you are up to turning those changes into a PR, I'd ask you in advance to split them into one PR per modification and to be careful with commit sizes. Otherwise I'll probably get to it on Friday afternoon, and I'd be happy if you'd review my PR then 😄

Indeed a nice job. Congratulations.

@danielkza
Contributor

I'll probably make 2 or 3 PRs. I have some improvements for the perf script, the indexes, refactoring of the aggregation, and possibly one for the compound metrics calculation (I noticed it was walking the metrics tree regardless of whether there are any actual compound metrics being used).

@rafamanzo
Member Author

Those following this issue, have a look at #214 and #215.

@rafamanzo
Member Author

Closing, as the aggregation performance has been successfully addressed and further investigation into other performance improvements has been extracted into separate issues.
