Use event_id as ID where possible #52

Open
hut8 opened this issue Feb 1, 2016 · 3 comments

Comments

@hut8
Member

hut8 commented Feb 1, 2016

This just occurred to me. The pre-2015 events (in the timeline directory) don't have event_id attributes. However, the new ones all do. Maybe I could replace the MongoDB _id attribute with event_id for the post-2015 events. Dropping the now-redundant event_id index would likely result in a huge increase in insert performance, which we really need. Right now there are 4 indexes on that collection, and not being able to fit them in memory is what really slows things to a crawl.

Thoughts, @joshjordan ?
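
To make the proposal above concrete, here is a minimal sketch of how an event could be reshaped before insertion. The function name `to_mongo_doc` and the plain-dict event shape are assumptions for illustration, not code from this project. Events without an event_id are left without an _id, in which case MongoDB drivers generate an ObjectId as usual.

```python
from typing import Any, Dict


def to_mongo_doc(event: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `event` that reuses event_id as MongoDB's _id.

    Hypothetical helper: the field name `event_id` follows the wording of
    this issue; the real field name in the archive may differ.
    """
    doc = dict(event)  # shallow copy so the caller's dict is not mutated
    event_id = doc.pop("event_id", None)
    if event_id is not None:
        # Post-2015 event: the built-in unique _id index now also
        # covers event_id, so no separate index is needed for it.
        doc["_id"] = event_id
    # Pre-2015 (timeline) event: no _id is set here, so the driver
    # will assign an ObjectId on insert.
    return doc
```

For example, `to_mongo_doc({"event_id": 123, "type": "PushEvent"})` yields a document whose `_id` is 123, while a timeline event passes through without an `_id`.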

@joshjordan
Member

I think that is definitely worthwhile. I didn't realize Mongo was trying to
keep 4 indexes in memory. Is it also possible to specify which indexes
should be on disk vs in memory?


@s2t2

s2t2 commented Mar 17, 2016

I just came across a few event objects missing an _event_id attribute and was wondering what was going on. Regardless of how you decide to handle this in mongo on the back-end, as an API consumer of these events, it would be confusing to expect an integer _event_id and instead get a string representation of the _id attribute.

@hut8
Member Author

hut8 commented Mar 18, 2016

The _event_id attribute is only present in events that were from the "Event API", which includes "events" from January 1, 2015 on. Prior to that, the GitHub Archive was using the Timeline API, which didn't have an "Event ID" per se. The main reason I'm actually using an index on the _event_id field (or dealing with that field at all) is to work around the fact that you can't atomically load thousands of documents in MongoDB, so a unique index on it guarantees duplicates aren't inserted. I should probably document that better 😄
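
The duplicate-guarding approach described above might look roughly like the sketch below, assuming pymongo. The helper names `unexpected_errors` and `load_batch` are hypothetical and not taken from this project. The idea is to insert each batch unordered and treat duplicate-key rejections (MongoDB error code 11000) as harmless, so a batch that is re-run after a partial failure does not create duplicates.

```python
from typing import Any, Dict, Iterable, List

DUPLICATE_KEY = 11000  # MongoDB's error code for a unique-index violation


def unexpected_errors(write_errors: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only the write errors that are NOT duplicate-key rejections."""
    return [err for err in write_errors if err.get("code") != DUPLICATE_KEY]


def load_batch(collection: Any, docs: Iterable[Dict[str, Any]]) -> int:
    """Insert `docs`, skipping duplicates; return the number inserted.

    Assumes `collection` is a pymongo Collection that already has a unique
    index on the ID field, e.g. (sparse so that documents lacking the
    field are not indexed and cannot collide with each other):
        collection.create_index("_event_id", unique=True, sparse=True)
    """
    batch = list(docs)
    if not batch:
        return 0  # insert_many rejects an empty list

    from pymongo.errors import BulkWriteError  # imported lazily

    try:
        # ordered=False keeps going past individual failures instead of
        # stopping at the first rejected document.
        result = collection.insert_many(batch, ordered=False)
        return len(result.inserted_ids)
    except BulkWriteError as exc:
        if unexpected_errors(exc.details.get("writeErrors", [])):
            raise  # something other than a duplicate went wrong
        return exc.details.get("nInserted", 0)
```

This does not make the load atomic; it only makes it safe to repeat, which is the property the unique index is being used for here.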
