
Overhaul database schema #215

Open · wants to merge 7 commits into master from optimize_db
Conversation

@danielkza (Contributor) commented Jul 19, 2016

Overhaul database schema by adding indexes, foreign key constraints, cleaning up stale data, and fixing tests that relied on hardcoded/non-existing IDs.

The changes are a bit long but relatively straightforward. I ended up changing the runner steps layout a lot because it was very hard to understand what was going on and how models were related, which proved necessary to remove the hardcoding issues.

The constraints also showed that running tests using the transaction DatabaseCleaner strategy was prone to failure - the processing jobs don't run in the same transaction as Cucumber and won't see much of the data created. Switching to the truncation strategy fixes it.
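For illustration, a minimal sketch of the kind of DatabaseCleaner setup being described (the file path, hook layout, and Around block are assumptions, not necessarily this PR's actual code):

# features/support/database_cleaner.rb (hypothetical path)
require 'database_cleaner'

# Truncation instead of :transaction, so data committed by the Cucumber
# process is visible to the processing jobs running on other connections.
DatabaseCleaner.strategy = :truncation

Around do |_scenario, block|
  DatabaseCleaner.cleaning { block.call }
end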

I tested the data cleanup migration quite extensively, using the database dump from production at mezuro.org. It would be wise to perform a backup anyway and be ready to restore it if anything goes wrong.

@danielkza danielkza changed the title from "Overhault database schema" to "Overhaul database schema" on Jul 19, 2016
@danielkza danielkza force-pushed the optimize_db branch 4 times, most recently from 980194e to 90a710d on July 19, 2016 at 13:19
@rafamanzo (Member)

Have you tested deletion instead of truncation? Which one is faster? What is the performance drop for leaving the transaction strategy? (https://github.com/DatabaseCleaner/database_cleaner#additional-activerecord-options-for-truncation)

Could you pull 4d0982e out into a new PR?

It'd be nice because I'm thinking about creating a new branch (v2.0.0?) where we can place these breaking changes, while master keeps following the v1 series until we finish such adjustments. What do you think?

@rafamanzo (Member) commented Jul 19, 2016

Looking at the changes, I have to ask you the same about each of these:

  • Database clean up (targeted to v2)
  • Foreign keys (targeted to v2)
  • Indexes (targeted to v1 or v2? Looks like they are not unique, so we could add them to v1?)
  • Date fields removal (v2)

@rafamanzo rafamanzo mentioned this pull request Jul 19, 2016
@rafamanzo (Member)

Does it fix #212?

@danielkza (Contributor, Author) commented Jul 19, 2016

Have you tested deletion instead of truncation? Which one is faster? What is the performance drop for leaving the transaction strategy? (https://github.com/DatabaseCleaner/database_cleaner#additional-activerecord-options-for-truncation)

I don't think :transaction is even a choice, since it causes incorrect results. As I mentioned in the commit message, changes from the Cucumber transaction are not visible to the processing job - the default isolation level in Postgres (READ COMMITTED) does not allow it.
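To illustrate the isolation point (model and attribute names here are assumptions, purely for the example): under READ COMMITTED, rows created inside the Cucumber transaction only become visible to other connections after COMMIT, so a worker using its own connection sees nothing until then.

ActiveRecord::Base.transaction do
  Repository.create!(name: 'example')      # only visible inside this transaction
  # A worker on a *different* connection at this point:
  #   Repository.find_by(name: 'example')  # => nil until the transaction commits
end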

We can use a tag to mark the scenarios that require processing and set their cleaner strategy, but they are actually a majority, so I just switched the default.

I haven't tested :deletion at all, but I'd guess we don't create enough records for it to matter too much.

Could you pull 4d0982e out into a new PR?

Sure.

It'd be nice because I'm thinking about creating a new branch (v2.0.0?) where we can place these breaking changes, while master keeps following the v1 series until we finish such adjustments. What do you think?

AFAIK none of the changes should be breaking. I made sure that any invalid data is cleaned up in the migration, and that data couldn't be accessed anyway, either because it was improperly referenced, or because it contained duplicates where the code assumes only one record exists (and acts accordingly).

I would prefer to fork the current master into a v1 branch instead of making the dev branch the fork - otherwise we'll make outdated code the most prominent version.

@danielkza (Contributor, Author)

Does it fix #212?

It does, AFAIK. I actually had to fix a bunch of integrity violations, some of which were dependent on the order used by

@rafamanzo (Member)

I'm being conservative in calling those changes breaking ones because of their side effects on current installations, not the API. There is a small risk of data loss, or of updates that require human action.

I agree about leaving the up-to-date code on master and forking v1 from it.

About #212: when testing the separate PRs I'll try to test it too, to figure out at which point it got fixed.

Great work and suggestions. Thanks.

@danielkza (Contributor, Author) commented Jul 19, 2016

About splitting this up more: the cleaning, foreign key and index changes are somewhat tied together. The cleaning migration only exists to allow the foreign keys to be created. Some of the indexes are in fact unique, and others exist to provide the opposite side of some foreign key.

It will also be quite complicated to ensure the tests are reliable in each separate step: fixing them all was quite a bit of work in itself, and I never tested it with any schema other than the final version.

Also, considering we will need to do tests with the production DB backups, doing it piecemeal seems like more work than it's really worth.
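For illustration, the kind of index and foreign-key migration being discussed here might look roughly like this (Rails 4.2+ syntax; the table and column names are assumptions, not the PR's actual schema):

class AddExampleConstraints < ActiveRecord::Migration
  def change
    # Plain index on the referencing column, so cascading deletes and lookups
    # do not have to scan the whole table.
    add_index :tree_metric_results, :module_result_id

    # A unique index expressing "one result per metric per module result".
    add_index :tree_metric_results, [:module_result_id, :metric_configuration_id],
              unique: true

    # The foreign key that the data cleanup makes it possible to add.
    add_foreign_key :tree_metric_results, :module_results, on_delete: :cascade
  end
end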

@rafamanzo (Member)

OK, makes sense mostly.

Can we at least extract the timestamp removal?

As one final try: is it possible to extract the new indexes related to the performance issues? I'd like to test them separately.

@rafamanzo (Member)

Also, giving the changes a second look: would you be so kind as to move the tree-walking method modifications into a separate PR as well? Sorry about the annoyance.

@rafamanzo (Member)

Finally, this branch now conflicts with master.

@danielkza (Contributor, Author)

Can we at least extract the timestamp removal?

Should be fine.

As one final try: is it possible to extract the new indexes related to the performance issues? I'd like to test them separately.

I think the easiest way to do that is probably to just edit the migrations and comment out the foreign keys. The TreeMetricResult unique index can also be commented out, since, considering only performance, it should be covered by the individual indexes on the columns.

Also, giving the changes a second look: would you be so kind as to move the tree-walking method modifications into a separate PR as well? Sorry about the annoyance.

These actually were not meant to be there; I think I accidentally added them by being careless with git stash. Very observant, thanks! 👀

Finally, this branch now conflicts with master.

I'll rebase and make the other changes in one go.

@danielkza danielkza force-pushed the optimize_db branch 2 times, most recently from 4f1cc05 to c0f4936 on July 20, 2016 at 22:41
@danielkza (Contributor, Author) commented Jul 20, 2016

Rebased to master, updated with split of timestamp changes (#219) and removal of the accidental module result change.

I tested the :deletion strategy, and unfortunately it isn't reliable: it disables constraints before cleaning, but that will not work if the user is not a superuser in PostgreSQL, and the cleaning will fail with integrity violations. :truncation works because tables are cleaned in a cascading manner if references exist.

@rafamanzo (Member) commented Jul 21, 2016

I have cherry-picked b95d15f into master.

The transaction time is: rake cucumber 75.08s user 6.12s system 46% cpu 2:52.88 total
The truncation time is: rake cucumber 95.90s user 6.67s system 37% cpu 4:36.68 total

I'm still unsure if that is significant given how often we run acceptance tests and that KalibroProcessor does not seem to be growing much in the short/midterm.

How hard would it be to make this truncation strategy into a tag used only where needed? That would ensure minimal impact on test run time, and that the application can grow, if necessary, without this being an issue. And probably make that another PR :/

What do you think?

@danielkza (Contributor, Author)

Using a tag to select the strategy seems like a good idea, but I think splitting up the strategy change may not be necessary - it doesn't matter until the integrity constraints are added.

@danielkza (Contributor, Author)

I implemented the selective usage of the truncation strategy. Any further comments @mezuro/core?
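A sketch of what that selective usage could look like with tagged Cucumber hooks (the @truncation tag name and hook placement are assumptions, not necessarily what was actually implemented):

# Default: keep the fast transaction strategy for ordinary scenarios.
Before do
  DatabaseCleaner.strategy = :transaction
end

# Scenarios that run processing jobs need their data visible outside the
# test transaction, so they opt into truncation via a tag. Hooks run in
# definition order, so this one wins for tagged scenarios.
Before('@truncation') do
  DatabaseCleaner.strategy = :truncation
end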

@danielkza danielkza force-pushed the optimize_db branch 2 times, most recently from 02a9fc3 to 402e0cb on July 26, 2016 at 19:06
def self.up
# Unset project reference for repositories with non-existing projects
execute <<-SQL
UPDATE repositories AS r
Contributor

Could this generate inconsistencies in ownerships of Prezento? Should we do something similar there?

@danielkza (Contributor, Author) commented Jul 27, 2016

I don't see how it could matter considering this only unsets the references to records that do not exist. If Prezento had those same references they would already be broken if anyone tried to actually use them for anything.

IMO we should have proper database constraints everywhere, but adding them after a long time is quite a bit harder. I don't know whether it's worth the extra work. I won't have the free time to do the same for the other services.
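For readers following along, the statement excerpted in the diff above presumably continues along these lines (a reconstruction based on the description, not the migration's exact SQL):

execute <<-SQL
  UPDATE repositories AS r
  SET project_id = NULL
  WHERE r.project_id IS NOT NULL
    AND NOT EXISTS (SELECT 1 FROM projects p WHERE p.id = r.project_id)
SQL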

@@ -0,0 +1,73 @@
class CleanInconsistencies < ActiveRecord::Migration
Contributor

Should this script be a migration? We are not changing the database structure, just removing records.

Contributor (Author)

We are semantically changing the database structure, from allowing inconsistencies to not allowing them. We could run this manually, but that would mean anyone using the software has to do the same; otherwise the next migrations won't work.

Contributor

I don't think this script is preventing us from creating inconsistencies. It is merely cleaning things up so other scripts (migrations) change the structure (by adding indexes and foreign keys, for example).

Contributor (Author)

Yes, the next scripts add the constraints to prevent the inconsistencies, but they would fail if the inconsistencies already exist. If you can suggest a better method to handle it I'll be happy to use it :)

Contributor

Maybe a rake task could achieve similar goals, with warning messages for those running the migrations. Skipping the migrations if they would fail would be good too. That way, we don't force anyone to erase data without warning them first.
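For concreteness, the rake-task alternative suggested here might look something like this (the task name, messages, and table names are hypothetical):

namespace :db do
  desc 'Remove records that violate referential integrity, with a warning first'
  task clean_inconsistencies: :environment do
    puts 'WARNING: this will delete orphaned records (e.g. process times whose'
    puts 'processing no longer exists). Press Ctrl-C within 10 seconds to abort.'
    sleep 10

    # Example cleanup statement; the real task would cover every affected table.
    ActiveRecord::Base.connection.execute(<<-SQL)
      DELETE FROM process_times
      WHERE processing_id NOT IN (SELECT id FROM processings)
    SQL
  end
end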

Contributor

Any data that the migration erases was already unusable and/or unreachable anyway. It existed purely due to bugs and poor validations, and can't even be retrieved by the APIs (such as multiple metric results with the same metric in the same module result).

That's true from KalibroClient's point of view. If we were mining our database for statistics or some kind of study, suddenly losing data would make us desperate! xD
Notice that in those cases it may make sense not to erase a processing just because it is not associated with a repository, for example.

I think that because we are changing data, not structure, this should not be a migration. Adding the extra step is surely more bureaucratic, but I think it's a good practice to at least warn anyone who uses this software before erasing anything. Maybe a rake task is not the best option, but I don't think a migration is either (even though the Rails guides say migrations can be made to add or modify data).

Anyway, if I'm the only one not happy with this, we can move on. 😄

Contributor (Author)

That's true from KalibroClient's point of view. If we were mining our database for statistics or some kind of study, suddenly losing data would make us desperate! xD

It's true from the point of view of the model that the database is an expression of. If we had made a more faithful conversion, we would have foreign keys from the start, and the data being deleted now would never exist. The only reason it does exist is because we failed to keep garbage out. The fact that it got in doesn't make it not garbage.

What meaningful statistics can you gather from data that violates the semantic constraints of your model, and that should never exist? In the particular cases of things this migration deletes, you can't even reference the context that would allow you to make any sense of them.

What worth are process times if you can't even know what was processed? What worth is a kalibro module with no associated module result? What worth is a metric result that would never be looked at, and that is possibly complete garbage because it's associated with the wrong module result, due to a wrongly generated name?

I think that because we are changing data, not structure, this should not be a migration.

I don't see why that would be the case. Migrations are a mechanism to allow evolving a database progressively without throwing it away. Nothing makes them specific to structure. Also, changing structure often entails changing data to be able to do it properly - this is just another case of that, and it's ugly because the situation is ugly.

Maybe a rake task is not the best option, but I don't think a migration is either

The migration is the least bad option, as it is at least run in a controlled environment, and is part of the known upgrade steps that any admin must run. How would someone upgrading find out that they need to run the rake task, and how would they run it? Could we make it work in an automated process without requiring manual intervention?

Contributor

If you processed a repository and then deleted it, all the relevant data would still be there. Suppose we had 500 repositories processed and then all of them were deleted. If you looked at their process times and saw that, say, the aggregation time was surprisingly high, would you not consider the data just because you can't see the associated repository?

I agree that changing structure often entails changing data. But I think it's just wrong to delete data without even notifying the users. I think it's not a good policy because the data is not yours, even though you may think it's garbage. In my opinion, we have to make users aware of it even if it means the process is not entirely automatic. I would be happier if we could at least make the migration require some user input like "This migration is going to delete this kind of data, OK?".

I've come to the conclusion that the migration is not a bad option. However, deleting data without informing the users, in my opinion, definitely is.

@danielkza (Contributor, Author) commented Jul 28, 2016

@diegoamc I understand your concerns, but I am confident that everything this migration deletes is data that could not be accessed at all, or that doesn't even make sense. In your particular example, it's hard to judge whether a processing took long when the repository used is unknown. The fact that a processing does not make sense without its repository is encoded in the model: these changes just enforce what was already supposed to happen at the database level.

If there is any data being deleted in a way that is not specified in the model, we should absolutely fix the migration or the model (whichever is wrong).

But I think it's just wrong to delete data without even notifying the users. I think it's not a good policy because the data is not yours

We have to keep in mind that we had the responsibility to create valid data and failed at it. We have the knowledge to separate what is garbage (according to the model) from what is not, such that the remaining data can be trusted from now into the future. I'm not proposing to delete old data or anything like that, but data that should never have existed at all. No value judgements can be made about its contents because they can be absolute nonsense.

I would be happier if we could at least make the migration require some user input like "This migration is going to delete this kind of data, OK?".

I don't know of a good way to do it, since we have no guarantee that the migration is interactive - it can be running in an automated way.

Contributor

I understood your point too, but I can't accept the PR. I wish I had the time right now to sit for a while, study and propose another solution.
If nobody else is unhappy about it, please accept it :)

@diegoamc (Contributor)

I don't understand why you closed this. There is no need for a consensus to accept a PR. As I said, if I'm the only one unhappy, please accept it!

@danielkza (Contributor, Author)

I guess sometimes we need a little pushback to spur creativity. I changed the migration to take backups of all the deleted data. Tell me what you think @diegoamc.
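A rough sketch of what "backups of all the deleted data" could mean in migration form (hypothetical names; the real migration may back up and delete different tables and use different conditions):

class BackupAndCleanOrphans < ActiveRecord::Migration
  def up
    # Copy the rows that are about to be removed into a side table...
    execute <<-SQL
      CREATE TABLE removed_process_times AS
        SELECT * FROM process_times
        WHERE processing_id NOT IN (SELECT id FROM processings)
    SQL

    # ...and only then delete them from the live table.
    execute <<-SQL
      DELETE FROM process_times
      WHERE processing_id NOT IN (SELECT id FROM processings)
    SQL
  end

  def down
    raise ActiveRecord::IrreversibleMigration
  end
end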

The database schema was lacking many integrity checks and indexes. Correct it by first applying a migration that removes all old/stale data, then creating those indexes.

The driving reason for this is the very slow performance of processing (especially aggregation) on the new mezuro.org servers. It will hopefully remove (or at least heavily improve) the superlinear slowdown when the number of metrics rises, as observed in #207.

Hardcoded IDs only worked because we had no referential integrity constraints. Remove them and clean up the whole model creation code.

The transaction strategy was only working because we didn't have referential integrity checks. The job workers are not guaranteed to stay in the same transaction as the tests - the PHPMD tests, for example, spawned a new thread - and therefore we must use truncation instead, so all the models become visible to the workers in time.

PostgreSQL does not create them automatically, but we want them when deletions cascade to other tables - otherwise we'll have to scan the whole table at that time. Some of them are also used for searching, such as the processing_id in process_times, or the project_id in repositories.

A Processing is created unconditionally, and therefore always needs to be mocked, even when an error will immediately be thrown.
@danielkza (Contributor, Author)

Forgot to rebase to master, done now.

@rafamanzo (Member)

I've tried to add just the indexes on the db_indexes branch. There, my aggregation performance test times increased from ~35s to ~58s (process times with 8 metrics). That is small enough that we can live with the extra time, but we still need to investigate further and show, probably by creating new tests (#229), that there are real performance improvements when reading data.

I think we should hold this until we have had the proper discussions on data consistency restrictions (#228) and the performance gains from indexes (#229).
