Add more advanced (poach_session based) conflict resolution for poached data #40

Open
bitspook opened this issue Nov 13, 2021 · 0 comments
Labels
debt Technical Debt

bitspook commented Nov 13, 2021

Context:

Entropy (the app) has "poachers": scrapers that go out and gather (poach)
content from different sources. E.g. they go to meetup.com to collect the
events being organized in Chandigarh, or they check the local filesystem
(e.g. the `./events/` directory) to see whether new events have been added
there.

Problem:

Performing an update operation on poached data is not possible because we
can't tell which records have been deleted at the source.

Example:

- We collect 4 blog posts from a directory =posts= using the local poacher and
  keep them in the database.
- By the next time the poacher is run, the user has
  - changed the content of 2 blog posts
  - deleted 1
  - left 1 intact
  - added 1 new post
- If we perform an =upsert= operation, we can update the existing posts and
  create the new one, but we can't tell which post was deleted (or whether any
  was).
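
To make the example concrete, here is a minimal sketch of the upsert-only behavior. It assumes SQLite and a hypothetical `posts` table keyed by `slug`; neither the table layout nor the `upsert_posts` helper comes from Entropy itself.

```python
import sqlite3


def upsert_posts(conn, posts):
    """Insert new posts and overwrite the body of posts that already exist."""
    conn.executemany(
        "INSERT INTO posts (slug, body) VALUES (?, ?) "
        "ON CONFLICT(slug) DO UPDATE SET body = excluded.body",
        posts.items(),
    )


# Hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (slug TEXT PRIMARY KEY, body TEXT)")

# First poach: 4 posts.
upsert_posts(conn, {"a": "a1", "b": "b1", "c": "c1", "d": "d1"})

# Second poach: a and b changed, c deleted at the source, d intact, e is new.
upsert_posts(conn, {"a": "a2", "b": "b2", "d": "d1", "e": "e1"})

# The upsert never sees "c", so the stale row is still in the database.
slugs_after_upsert = [row[0] for row in conn.execute("SELECT slug FROM posts ORDER BY slug")]
```

After the second poach the table holds five rows, including the post that no longer exists at the source.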

Right now, for the sake of simplicity (of implementation), we delete all
previously poached data when we re-poach. E.g. all the groups/events collected
from local (i.e. fs) are deleted before the next poach is performed.

This ticket proposes a better behavior:

At its root, the problem we are trying to solve is identifying the deleted data.

  1. Keep a poach_session column in all the tables which store poached data
    • poach_session is a separate sequence per source, i.e. local data has its
      own poach_session sequence, the meetup poacher has its own, and so on
  2. On the next poach, increment the source's poach_session by 1 (hereafter
    referred to as current_poach_session)
  3. On conflict, update the row with the new data and set its poach_session to
    current_poach_session
  4. After the poacher is done, all the rows of that source still at
    current_poach_session - 1 were not seen in this poach, so they have been
    deleted at the original source. These can be safely deleted.
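
The four steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not Entropy's implementation: it uses SQLite, a hypothetical `posts` table with a `source` column, and a hypothetical `poach` helper. It also derives the next session number from the highest one stored for the source, where a real implementation might keep a dedicated per-source counter.

```python
import sqlite3


def poach(conn, source, items):
    """Upsert `items` for `source`, then delete rows this poach did not touch.

    Returns the number of rows removed as deleted-at-source.
    """
    # Step 2: next session for this source (1 if it was never poached).
    (previous,) = conn.execute(
        "SELECT COALESCE(MAX(poach_session), 0) FROM posts WHERE source = ?",
        (source,),
    ).fetchone()
    current = previous + 1

    # Step 3: on conflict, update the data and stamp the current session.
    conn.executemany(
        "INSERT INTO posts (source, slug, body, poach_session) VALUES (?, ?, ?, ?) "
        "ON CONFLICT(source, slug) DO UPDATE SET "
        "body = excluded.body, poach_session = excluded.poach_session",
        [(source, slug, body, current) for slug, body in items.items()],
    )

    # Step 4: rows still on an older session were not seen in this poach.
    removed = conn.execute(
        "DELETE FROM posts WHERE source = ? AND poach_session < ?",
        (source, current),
    ).rowcount
    conn.commit()
    return removed


# Hypothetical schema (step 1): every poached table carries poach_session.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (source TEXT, slug TEXT, body TEXT, "
    "poach_session INTEGER, PRIMARY KEY (source, slug))"
)

poach(conn, "meetup", {"m1": "event"})
removed_first = poach(conn, "local", {"a": "a1", "b": "b1", "c": "c1", "d": "d1"})
removed_second = poach(conn, "local", {"a": "a2", "b": "b2", "d": "d1", "e": "e1"})
local_slugs = [
    row[0]
    for row in conn.execute("SELECT slug FROM posts WHERE source = 'local' ORDER BY slug")
]
```

The cleanup compares with `<` rather than matching exactly one session back. For a run that always completes the two are equivalent; the looser comparison also sweeps up rows left behind if an earlier poach was interrupted before its cleanup. Because every query filters on `source`, poaching local data leaves the meetup rows alone.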
@bitspook bitspook added the debt Technical Debt label Nov 13, 2021