Slow processings #207
Some work on producing performance tests has been started in the following branch: https://github.com/mezuro/kalibro_processor/tree/aggregation_performance
I've added tests, but the results are still nowhere near that slow: the mean run time is 0.55s. Insights are welcome @mezuro/core 😄
We'd probably need to test with a high concurrency level to see if the database itself is the bottleneck. I think profiling the queries through PostgreSQL would be the easiest way to understand the issues: we can look at the query plans and query times using something like pgBadger.
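As a lighter-weight complement to pgBadger, individual query plans can also be inspected straight from the Rails console with ActiveRecord's `explain`. A minimal sketch, where the relation is only a placeholder for whatever query turns out to be hot (not a query taken from the code):

```ruby
# Minimal sketch: print PostgreSQL's plan for a (hypothetical) hot query.
# The relation below is an illustration only.
puts MetricResult.where(module_result_id: 42).explain
```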
On mezuro.org there is a single worker generating load on the database, so I don't believe we'll replicate the issue through concurrency. As for pgBadger, it looks like a nice tool to use once we can produce a test that replicates the issues found at mezuro.org.
I thought we had four workers for kalibro_processor, my mistake!
I've fixed the performance test; it was fast because I missed proper data initialization. Now I have results for a tree with height 10 and 2 children per node, i.e. 512 leaves. Running with 8 metrics still did not produce the 100% PostgreSQL CPU usage seen at mezuro.org, but the plot of run time versus the number of metrics looks quadratic.
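For reference, a minimal sketch of the kind of fixture such a test needs: a full binary ModuleResult tree with 10 levels, hence 2^9 = 512 leaves. The model and attribute names are assumptions about the schema, not the actual test code:

```ruby
# Build a full binary tree of ModuleResults with `levels` levels.
# With levels = 10 this creates 2^10 - 1 = 1023 nodes, 512 of them leaves.
def build_subtree(processing, parent, levels)
  node = ModuleResult.create!(processing: processing, parent: parent)
  2.times { build_subtree(processing, node, levels - 1) } if levels > 1
  node
end

root = build_subtree(processing, nil, 10)  # `processing` assumed to already exist
```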
Possible cause: https://github.com/mezuro/kalibro_processor/blob/master/lib/processor/aggregator.rb#L23, which @danielkza pointed out.
The database schema was lacking many integrity checks and indexes. Correct it by first applying a migration that removes all old/stale data, then creating those indexes. The driving reason for this is the very slow performance of processing (especially aggregation) on the new mezuro.org servers. It will hopefully remove (or at least heavily reduce) the superlinear slowdown when the number of metrics rises, as observed in #207. Additionally, remove the timestamp columns from kalibro_modules, module_results and metric_results: they are not used in any way, and there are millions of rows containing them. It's possible and probably desirable to just look at the timestamps in the processing.
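For illustration, a minimal sketch of what such a migration could look like, using only columns mentioned in this thread; the class name, the exact index set and the data-cleanup step of the real migration are assumptions, not the actual code:

```ruby
# Hypothetical sketch of the indexing/cleanup migration; not the actual one.
class AddAggregationIndexes < ActiveRecord::Migration
  def change
    add_index :module_results, :parent_id
    add_index :metric_results, :module_result_id
    add_index :metric_results, :metric_configuration_id

    # Drop the unused timestamp columns mentioned above (type given so the
    # migration stays reversible).
    [:kalibro_modules, :module_results, :metric_results].each do |table|
      remove_column table, :created_at, :datetime
      remove_column table, :updated_at, :datetime
    end
  end
end
```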
I've run the same tests as @rafamanzo and got the following results:
Then I created indexes on the columns involved and got:
From these results, I don't think the lack of indexes is the main cause of such sluggish processings.

Another thing: could the metrics used affect the results? I commented out the metrics I wouldn't use in the test from bottom to top (given the order in the performance script). Is this how you did it too, @rafamanzo? It would also be good for someone else to corroborate these results; I've committed the changes and pushed them to a new branch.

Finally, if the results are corroborated, I think the reason the indexes do not affect performance is that we make too many insertions into MetricResult's table. Note that we are using a b-tree as the index structure: searches are greatly improved (roughly from linear to logarithmic), but every insertion also has to update the index. What do you think?
From that branch, starting at 4 metrics one CPU becomes pinned at 100% usage, but not by PostgreSQL: the aggregation itself was using 100% of one CPU. It looks like this test is stressing some other performance issue, which we need to address before setting up the indexes. But I believe the indexes will play an important role afterwards on the large database we have at mezuro.org; they actually seem to have eased some load on my machine starting at 4 metrics.
Adding the following snippet at the end of the performance script:

```ruby
File.open('profile_pt.html', 'w') { |file| RubyProf::CallStackPrinter.new(@results['Process Time'].first).print(file) }
```

I was able to narrow 51% of the process time down to `descendant_values`:

```ruby
def descendant_values
  module_result.children.map { |child|
    metric_result = child.tree_metric_result_for(self.metric)
    metric_result.value if metric_result
  }.compact
end
```

which looks really convoluted and inefficient. I then replaced it with:

```ruby
def descendant_values
  self.class.
    where(module_result: module_result.children,
          metric_configuration_id: metric_configuration_id).
    select(:value).map(&:value)
end
```

But the execution time for the test increased 😭 If someone else gets the same result, I'd say we also have to write performance tests for this method. Can someone validate those results? What do you think about a test focusing on this?
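For anyone reproducing this, a rough sketch of how such a profile can be captured with ruby-prof. The measure mode, the entry point being profiled, and the way `@results` is populated in the actual perf script are assumptions here:

```ruby
require 'ruby-prof'

# Hypothetical sketch: profile one aggregation run and dump a call stack
# report like the profile_pt.html mentioned above.
RubyProf.measure_mode = RubyProf::PROCESS_TIME
profile = RubyProf.profile do
  run_aggregation  # stand-in for the code under test, not an existing method
end

File.open('profile_pt.html', 'w') do |file|
  RubyProf::CallStackPrinter.new(profile).print(file)
end
```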
@rafamanzo Did you measure wall time as well? If we're removing database waits, the process time might not change that much, but the wall time might.
The wall time increased as well 😞
Looking at the same report format for wall time:

```ruby
File.open('profile_wt.html', 'w') { |file| RubyProf::CallStackPrinter.new(@results['Wall Time'].first).print(file) }
```
Little improvement, but still slower than the baseline. I'm heading out now, but if anyone is in the mood, I'd give https://github.com/zdennis/activerecord-import a try. If we iterate over the module tree one level at a time, we could turn 1022 insertions (for the 2-metric case) into 10 insertions.
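To make the idea concrete, a hedged sketch of what level-by-level bulk insertion with activerecord-import could look like; `levels_bottom_up` and `aggregate_value_for` are assumed helpers, not existing project code:

```ruby
require 'activerecord-import'

# Hypothetical sketch: one bulk INSERT per tree level instead of one
# INSERT per (module result, metric) pair.
levels_bottom_up.each do |module_results|          # assumed: deepest level first
  rows = module_results.flat_map do |module_result|
    metric_configurations.map do |configuration|
      TreeMetricResult.new(
        module_result: module_result,
        metric_configuration_id: configuration.id,
        value: aggregate_value_for(module_result, configuration)
      )
    end
  end
  TreeMetricResult.import(rows)  # single multi-row INSERT for this level
end
```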
Can you try this version of `descendant_values`?

```ruby
def descendant_values
  self.class.
    joins(:module_result).
    where('module_results.parent_id' => module_result_id).
    pluck(:value)
end
```
I get some good speedup with my version (and database indexes). Old:
New:
Edit: got it to 12s now by walking the tree in whole "levels" (instead of just children of one parent) and doing bulk inserts.
@danielkza I've tested your version of `descendant_values` here, which is a slight slowdown from the previous 14s (now, with no load on the machine running the tests, I get 12s). I believe you are running the 8-metric case, right? It is a small setback for the smaller cases and a huge improvement both in run time for large metric sets and in code readability. Thanks for the help.

If you are up to turning those changes into a PR, I'd ask you in advance to split them into one PR per modification and be careful with the commit sizes. Otherwise I'll get to it, probably on Friday afternoon, and I'd be happy if you'd review my PR then 😄 Indeed a nice job, congratulations.
I'll probably make 2 or 3 PRs. I have some improvements for the perf script, the indexes, refactoring of the aggregation, and possibly one for the compound metrics calculation (I noticed it was walking the metrics tree regardless of whether there are any actual compound metrics being used).
Closing, as the aggregation performance has been successfully addressed and further investigation of other performance improvements has been extracted into separate issues.
Running the following on the mezuro.org console gives us a hint of why some processings take so long.
It means that, on average, each Processing spends approximately 30 minutes aggregating.
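The original console snippet is not reproduced above. Purely as an illustration of the kind of query meant here, and assuming a ProcessTime model that records how long each processing state took (an assumption about the schema):

```ruby
# Hypothetical sketch: average time (in seconds) spent in the aggregating state.
times = ProcessTime.where(state: 'aggregating').pluck(:time)
puts times.sum / times.size.to_f
```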
Looking at the database machine's processor, we can see it at 100% the whole time during aggregation.
I suggest:

- profiling with RubyProf::CallStackPrinter;
- aggregating TreeMetricResults bottom-up;
- indexes on the metric_results table may be worth the extra insertion cost (we need to investigate that once the bulk insertions are up);
- looking into TreeMetricResult#descendant_values performance.