Entropy (the app) has a concept called "poachers": scrapers that go out
and gather/poach content from different sources. E.g. they go to meetup.com to
collect which events are being organized in Chandigarh, they check the local
filesystem (e.g. the `./events/` directory) to see if new events have been added
there, etc.
Problem:
Performing an update operation on poached data is not possible because we
can't tell which records have been deleted at the source.
Example:
- We collect 4 blog posts from a directory =posts= using the local poacher and
  keep them in the database.
- By the next time the poacher is run, the user has
  - changed the content of 2 blog posts
  - deleted 1
  - left 1 intact
  - added 1 new post
- If we perform an =upsert= operation, we can update the existing posts and
  create the new one, but we can't tell which post (if any) was deleted.
Right now, for the sake of simplicity (of implementation), we delete all
previously poached data when we re-poach. E.g. all the groups/events collected
from local (i.e. the fs) are deleted before the next poach is performed.
This ticket proposes a better behavior:
At its root, the problem we are trying to solve is identifying deleted data.
- Keep a poach_session column in all the tables which store poached data.
- poach_session is a new sequence per source, i.e. local data has one
  poach_session sequence going, the meetup poacher has a different one, and so on.
- On the next poach, increment the session by 1 (hereby referred to as
  current_poach_session).
- On conflict, update the row with the new data.
- After the poacher is done, all rows still carrying a session older than
  current_poach_session have been deleted at the original source. These can be
  safely deleted from the database.
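The proposed mechanism can be sketched with SQLite. This is a minimal, illustrative example, not Entropy's actual schema: the table name, columns, and the slug-based conflict key are all assumptions made for the demo.

```python
import sqlite3

# Hypothetical "posts" table; poach_session tags each row with the
# session in which it was last seen at the source.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE posts (
        slug TEXT PRIMARY KEY,
        body TEXT,
        poach_session INTEGER
    )
""")

def poach(rows, session):
    """Upsert poached rows under the current session, then delete rows
    the source no longer has (those left with an older session)."""
    conn.executemany(
        """INSERT INTO posts (slug, body, poach_session) VALUES (?, ?, ?)
           ON CONFLICT(slug) DO UPDATE
           SET body = excluded.body, poach_session = excluded.poach_session""",
        [(slug, body, session) for slug, body in rows],
    )
    # Any row not touched this session was deleted at the original source.
    conn.execute("DELETE FROM posts WHERE poach_session < ?", (session,))
    conn.commit()

# Session 1: four posts exist at the source.
poach([("a", "A"), ("b", "B"), ("c", "C"), ("d", "D")], session=1)
# Session 2: two edited, one intact, "d" deleted at the source, one new.
poach([("a", "A2"), ("b", "B2"), ("c", "C"), ("e", "E")], session=2)

remaining = sorted(s for (s,) in conn.execute("SELECT slug FROM posts"))
print(remaining)  # ['a', 'b', 'c', 'e']: "d" was detected as deleted
```

Note that the delete step only removes rows for the source being re-poached once each source keeps its own session sequence, so one poacher's run never clobbers another's data.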