
Overhaul database schema #215

Open · wants to merge 7 commits into master from optimize_db
Conversation

@danielkza (Contributor) commented Jul 19, 2016

Overhaul database schema by adding indexes, foreign key constraints, cleaning up stale data, and fixing tests that relied on hardcoded/non-existing IDs.

The changes are a bit long but relatively straightforward. I ended up changing the runner steps layout a lot because it was very hard to understand what was going on and how models were related, which proved necessary to remove the hardcoding issues.

The constraints also showed that running tests using the transaction DatabaseCleaner strategy was prone to failure - the processing jobs don't run in the same transaction as Cucumber and won't see much of the data created. Switching to the truncation strategy fixes it.
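For illustration, a minimal sketch of the kind of DatabaseCleaner setup being described (the file path, hook layout, and Around block are assumptions, not necessarily this PR's actual code):

# features/support/database_cleaner.rb (hypothetical path)
require 'database_cleaner'

# Truncation instead of :transaction, so data committed by the Cucumber
# process is visible to the processing jobs running on other connections.
DatabaseCleaner.strategy = :truncation

Around do |_scenario, block|
  DatabaseCleaner.cleaning { block.call }
end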

I tested the data cleanup migration quite extensively, using the database dump from production at mezuro.org. It would be wise to perform a backup anyway and be ready to restore it if anything goes wrong.

@danielkza danielkza changed the title from "Overhault database schema" to "Overhaul database schema" on Jul 19, 2016
@danielkza danielkza force-pushed the optimize_db branch 4 times, most recently from 980194e to 90a710d on July 19, 2016 at 13:19
@rafamanzo (Member)

Have you tested deletion instead of truncation? Which one is faster? What is the performance drop for leaving the transaction strategy? (https://github.com/DatabaseCleaner/database_cleaner#additional-activerecord-options-for-truncation)

Could you pull 4d0982e out into a new PR?

It'd be nice because I'm thinking about creating a new branch (v2.0.0?) where we can place these breaking changes, while master keeps following the v1 series until we finish such adjustments. What do you think?

@rafamanzo (Member) commented Jul 19, 2016

Looking at the changes, I have to ask you the same about each of these:

  • Database clean up (targeted to v2)
  • Foreign keys (targeted to v2)
  • Indexes (targeted to v1 or v2? Looks like they are not unique, so we could add them to v1?)
  • Date fields removal (v2)

@rafamanzo rafamanzo mentioned this pull request Jul 19, 2016
@rafamanzo (Member)

Does it fix #212?

@danielkza (Contributor, Author) commented Jul 19, 2016

Have you tested deletion instead of truncation? Which one is faster? What is the performance drop for leaving the transaction strategy? (https://github.com/DatabaseCleaner/database_cleaner#additional-activerecord-options-for-truncation)

I don't think :transaction is even a choice, since it causes incorrect results. As I mentioned in the commit message, changes from the Cucumber transaction are not visible to the processing job - the default isolation level in Postgres (READ COMMITTED) does not allow it.
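To illustrate the isolation point (model and attribute names here are assumptions, purely for the example): under READ COMMITTED, rows created inside the Cucumber transaction only become visible to other connections after COMMIT, so a worker using its own connection sees nothing until then.

ActiveRecord::Base.transaction do
  Repository.create!(name: 'example')      # only visible inside this transaction
  # A worker on a *different* connection at this point:
  #   Repository.find_by(name: 'example')  # => nil until the transaction commits
end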

We can use a tag to mark the scenarios that require processing and set their cleaner strategy, but they are actually a majority, so I just switched the default.

I haven't tested :deletion at all, but I'd guess we don't create enough records for it to matter too much.

Could you pull 4d0982e out into a new PR?

Sure.

It'd be nice because I'm thinking about creating a new branch (v2.0.0?) where we can place these breaking changes, while master keeps following the v1 series until we finish such adjustments. What do you think?

AFAIK none of the changes should be breaking. I made sure that any invalid data is cleaned up in the migration, and that data couldn't be accessed anyway, either because it was improperly referenced, or because it contained duplicates where the code assumes only one record exists (and acts accordingly).

I would prefer to fork the current master into a v1 branch instead of making the dev branch the fork - otherwise we'll make outdated code the most prominent version.

@danielkza (Contributor, Author)

Does it fix #212?

It does, AFAIK. I actually had to fix a bunch of integrity violations, some of which were dependent on the order used by

@rafamanzo (Member)

I'm being conservative in calling those changes breaking ones because of their side effects on current installations, not the API. There is a small risk of data loss, or of updates that require human action.

I agree about leaving the up-to-date code on master and forking v1 from it.

About #212: when testing the separate PRs I'll try to test it too, to figure out at which point it got fixed.

Great work and suggestions. Thanks.

@danielkza (Contributor, Author) commented Jul 19, 2016

About splitting this up more: the cleaning, foreign key and index changes are somewhat tied together. The cleaning migration only exists to allow the foreign keys to be created. Some of the indexes are in fact unique, and others exist to provide the opposite side of some foreign key.

It will also be quite complicated to ensure the tests are reliable in each separate step: fixing them all was quite a bit of work in itself, and I never tested it with any schema other than the final version.

Also, considering we will need to do tests with the production DB backups, doing it piecemeal seems like more work than it's really worth.
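For illustration, the kind of index and foreign-key migration being discussed here might look roughly like this (Rails 4.2+ syntax; the table and column names are assumptions, not the PR's actual schema):

class AddExampleConstraints < ActiveRecord::Migration
  def change
    # Plain index on the referencing column, so cascading deletes and lookups
    # do not have to scan the whole table.
    add_index :tree_metric_results, :module_result_id

    # A unique index expressing "one result per metric per module result".
    add_index :tree_metric_results, [:module_result_id, :metric_configuration_id],
              unique: true

    # The foreign key that the data cleanup makes it possible to add.
    add_foreign_key :tree_metric_results, :module_results, on_delete: :cascade
  end
end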

@rafamanzo (Member)

OK, makes sense mostly.

Can we at least extract the timestamp removal?

As one final try: is it possible to extract the new indexes related to the performance issues? I'd like to test them separately.

@rafamanzo (Member)

Also, giving the changes a second look: would you be so kind as to move the tree-walking method modifications into a separate PR as well? Sorry about the annoyance.

@rafamanzo (Member)

Finally, this branch now conflicts with master.

@danielkza (Contributor, Author)

Can we at least extract the timestamp removal?

Should be fine.

As one final try: is it possible to extract the new indexes related to the performance issues? I'd like to test them separately.

I think the easiest way to do that is probably to just edit the migrations and comment out the foreign keys. The TreeMetricResult unique index can also be commented out, since, considering only performance, it should be covered by the individual indexes on the columns.

Also, giving the changes a second look: would you be so kind as to move the tree-walking method modifications into a separate PR as well? Sorry about the annoyance.

These actually were not meant to be there; I think I accidentally added them by being careless with git stash. Very observant, thanks! 👀

Finally, this branch now conflicts with master.

I'll rebase and make the other changes in one go.

@danielkza danielkza force-pushed the optimize_db branch 2 times, most recently from 4f1cc05 to c0f4936 on July 20, 2016 at 22:41
@danielkza (Contributor, Author) commented Jul 20, 2016

Rebased to master, updated with split of timestamp changes (#219) and removal of the accidental module result change.

I tested the :deletion strategy, and unfortunately it isn't reliable: it disables constraints before cleaning, but that will not work if the user is not a superuser in PostgreSQL, and the cleaning will fail with integrity violations. :truncation works because tables are cleaned in a cascading manner if references exist.

@rafamanzo (Member) commented Jul 21, 2016

I have cherry-picked b95d15f into master.

The transaction time is: rake cucumber 75.08s user 6.12s system 46% cpu 2:52.88 total
The truncation time is: rake cucumber 95.90s user 6.67s system 37% cpu 4:36.68 total

I'm still unsure if that is significant given how often we run acceptance tests and that KalibroProcessor does not seem to be growing much in the short/midterm.

How hard would it be to make this truncation strategy into a tag used only where needed? That would ensure minimal impact on test run time, and that the application can grow, if necessary, without this being an issue. And probably make that another PR :/

What do you think?

@danielkza (Contributor, Author)

Using a tag to select the strategy seems like a good idea, but I think splitting up the strategy change may not be necessary - it doesn't matter until the integrity constraints are added.

@danielkza (Contributor, Author)

I implemented the selective usage of the truncation strategy. Any further comments @mezuro/core?
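A sketch of what that selective usage could look like with tagged Cucumber hooks (the @truncation tag name and hook placement are assumptions, not necessarily what was actually implemented):

# Default: keep the fast transaction strategy for ordinary scenarios.
Before do
  DatabaseCleaner.strategy = :transaction
end

# Scenarios that run processing jobs need their data visible outside the
# test transaction, so they opt into truncation via a tag. Hooks run in
# definition order, so this one wins for tagged scenarios.
Before('@truncation') do
  DatabaseCleaner.strategy = :truncation
end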

@danielkza danielkza force-pushed the optimize_db branch 2 times, most recently from 02a9fc3 to 402e0cb on July 26, 2016 at 19:06
def self.up
# Unset project reference for repositories with non-existing projects
execute <<-SQL
UPDATE repositories AS r
Contributor

Could this generate inconsistencies in ownerships of Prezento? Should we do something similar there?

@danielkza (Contributor, Author) commented Jul 27, 2016

I don't see how it could matter considering this only unsets the references to records that do not exist. If Prezento had those same references they would already be broken if anyone tried to actually use them for anything.

IMO we should have proper database constraints everywhere, but adding them after a long time is quite a bit harder. I don't know whether it's worth the extra work. I won't have the free time to do the same for the other services.
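For readers following along, the statement excerpted in the diff above presumably continues along these lines (a reconstruction based on the description, not the migration's exact SQL):

execute <<-SQL
  UPDATE repositories AS r
  SET project_id = NULL
  WHERE r.project_id IS NOT NULL
    AND NOT EXISTS (SELECT 1 FROM projects p WHERE p.id = r.project_id)
SQL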

@@ -0,0 +1,73 @@
class CleanInconsistencies < ActiveRecord::Migration
Contributor

Should this script be a migration? We are not changing the database structure, just removing records.

Contributor (Author)

We are semantically changing the database structure, from allowing inconsistencies to not allowing them. We could run this manually, but that would mean anyone using the software has to do the same; otherwise the next migrations won't work.

Contributor

I don't think this script is preventing us from creating inconsistencies. It is merely cleaning things up so other scripts (migrations) change the structure (by adding indexes and foreign keys, for example).

Contributor (Author)

Yes, the next scripts add the constraints to prevent the inconsistencies, but they would fail if the inconsistencies already exist. If you can suggest a better method to handle it I'll be happy to use it :)

Contributor

Maybe a rake task could achieve similar goals, with warning messages for those running the migrations. Skipping the migrations if they would fail would be good too. That way, we don't force anyone to erase data without warning them first.
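For concreteness, the rake-task alternative suggested here might look something like this (the task name, messages, and table names are hypothetical):

namespace :db do
  desc 'Remove records that violate referential integrity, with a warning first'
  task clean_inconsistencies: :environment do
    puts 'WARNING: this will delete orphaned records (e.g. process times whose'
    puts 'processing no longer exists). Press Ctrl-C within 10 seconds to abort.'
    sleep 10

    # Example cleanup statement; the real task would cover every affected table.
    ActiveRecord::Base.connection.execute(<<-SQL)
      DELETE FROM process_times
      WHERE processing_id NOT IN (SELECT id FROM processings)
    SQL
  end
end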

Contributor

Any data that the migration erases was already unusable and/or unreachable anyway. It existed purely due to bugs and poor validations, and can't even be retrieved by the APIs (such as multiple metric results with the same metric in the same module result).

That's true from KalibroClient's point of view. If we were mining our database for statistics or some kind of study, suddenly losing data would make us desperate! xD
Notice that in those cases it may make sense not to erase a processing just because it is not associated with a repository, for example.

I think that because we are changing data, not structure, this should not be a migration. Adding the extra step is surely more bureaucratic, but I think it's a good practice to at least warn anyone who uses this software before erasing anything. Maybe a rake task is not the best option, but I don't think a migration is either (even though the Rails guides say migrations can be made to add or modify data).

Anyway, if I'm the only one not happy with this, we can move on. 😄

Contributor (Author)

That's true from KalibroClient's point of view. If we were mining our database for statistics or some kind of study, suddenly losing data would make us desperate! xD

It's true from the point of view of the model that the database is an expression of. If we had made a more faithful conversion, we would have foreign keys from the start, and the data being deleted now would never exist. The only reason it does exist is because we failed to keep garbage out. The fact that it got in doesn't make it not garbage.

What meaningful statistics can you gather from data that violates the semantic constraints of your model, and that should never exist? In the particular cases of things this migration deletes, you can't even reference the context that would allow you to make any sense of them.

What worth are process times if you can't even know what was processed? What worth is a kalibro module with no associated module result? What worth is a metric result that would never be looked at, and that is possibly complete garbage because it's associated with the wrong module result, due to a wrongly generated name?

I think that because we are changing data, not structure, this should not be a migration.

I don't see why that would be the case. Migrations are a mechanism to allow evolving a database progressively without throwing it away. Nothing makes them specific to structure. Also, changing structure often entails changing data to be able to do it properly - this is just another case of that, and it's ugly because the situation is ugly.

Maybe a rake task is not the best option, but I don't think a migration is either

The migration is the least bad option, as it is at least run in a controlled environment, and is part of the known upgrade steps that any admin must run. How would someone upgrading find out that they need to run the rake task, and how would they run it? Could we make it work in an automated process without requiring manual intervention?

Contributor

If you processed a repository and then deleted it, all the relevant data would still be there. Suppose we had 500 repositories processed and then all of them were deleted. If you looked at their process times and saw that, say, the aggregation time was surprisingly high, would you not consider the data just because you can't see the associated repository?

I agree that changing structure often entails changing data. But I think it's just wrong to delete data without even notifying the users. I think it's not a good policy because the data is not yours, even though you may think it's garbage. In my opinion, we have to make users aware of it even if it means the process is not entirely automatic. I would be happier if we could at least make the migration require some user input like "This migration is going to delete this kind of data, OK?".

I've come to the conclusion that the migration is not a bad option. However, deleting data without informing the users, in my opinion, definitely is.

@danielkza (Contributor, Author) commented Jul 28, 2016

@diegoamc I understand your concerns, but I am confident that everything this migration deletes is data that could not be accessed at all, or that doesn't even make sense. In your particular example, it's hard to judge whether a processing took long when the repository used is unknown. The fact that a processing does not make sense without its repository is encoded in the model: these changes just enforce what was already supposed to happen at the database level.

If there is any data being deleted in a way that is not specified in the model, we should absolutely fix the migration or the model (whichever is wrong).

But I think it's just wrong to delete data without even notifying the users. I think it's not a good policy because the data is not yours

We have to keep in mind that we had the responsibility to create valid data and failed at it. We have the knowledge to separate what is garbage (according to the model) from what is not, such that the remaining data can be trusted from now into the future. I'm not proposing to delete old data or anything like that, but data that should never have existed at all. No value judgements can be made about its contents because they can be absolute nonsense.

I would be happier if we could at least make the migration require some user input like "This migration is going to delete this kind of data, OK?".

I don't know of a good way to do it, since we have no guarantee that the migration is interactive - it can be running in an automated way.

Contributor

I understood your point too, but I can't accept the PR. I wish I had the time right now to sit for a while, study and propose another solution.
If nobody else is unhappy about it, please accept it :)

@diegoamc (Contributor)

I don't understand why you closed this. There is no need for a consensus to accept a PR. As I said, if I'm the only one unhappy, please accept it!

@danielkza (Contributor, Author)

I guess sometimes we need a little pushback to spur creativity. I changed the migration to take backups of all the deleted data. Tell me what you think @diegoamc.
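A rough sketch of what "backups of all the deleted data" could mean in migration form (hypothetical names; the real migration may back up and delete different tables and use different conditions):

class BackupAndCleanOrphans < ActiveRecord::Migration
  def up
    # Copy the rows that are about to be removed into a side table...
    execute <<-SQL
      CREATE TABLE removed_process_times AS
        SELECT * FROM process_times
        WHERE processing_id NOT IN (SELECT id FROM processings)
    SQL

    # ...and only then delete them from the live table.
    execute <<-SQL
      DELETE FROM process_times
      WHERE processing_id NOT IN (SELECT id FROM processings)
    SQL
  end

  def down
    raise ActiveRecord::IrreversibleMigration
  end
end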

The database schema was lacking many integrity checks and indexes. Correct it by first applying a migration that removes all old/stale data, then creating those indexes.

The driving reason for this is the very slow performance of processing (especially aggregation) on the new mezuro.org servers. It will hopefully remove (or at least heavily improve) the superlinear slowdown when the number of metrics rises, as observed in #207.

Hardcoded IDs only worked because we had no referential integrity constraints. Remove them and clean up the whole model creation code.

The transaction strategy was only working because we didn't have referential integrity checks. The job workers are not guaranteed to stay in the same transaction as the tests - the PHPMD tests, for example, spawned a new thread - and therefore we must use truncation instead, so all the models become visible to the workers in time.

PostgreSQL does not create them automatically, but we want them when deletions cascade to other tables - otherwise we'll have to scan the whole table at that time. Some of them are also used for searching, such as the processing_id in process_times, or the project_id in repositories.

A Processing is created unconditionally, and therefore always needs to be mocked, even when an error will immediately be thrown.
@danielkza (Contributor, Author)

Forgot to rebase to master, done now.

@rafamanzo (Member)

I've tried to add just the indexes on the db_indexes branch. There, my aggregation performance test times increased from ~35s to ~58s (process times with 8 metrics). That is small enough that we can live with the extra time, but we still need to investigate further and show, probably by creating new tests (#229), that there are real performance improvements when reading data.

I think we should hold this until we have had the proper discussions on data consistency restrictions (#228) and the performance gains from indexes (#229).
