Why are the tweets hydrated less than read? #67

Closed
santoshbs opened this issue Nov 3, 2020 · 11 comments

Comments

@santoshbs

I just started using Hydrator. After the first run, I see that the number of tweets hydrated (~20,000 tweets) is far lower than the number of tweet IDs read by the Hydrator app (~5 million tweet IDs). I'm not sure why this is happening. Am I doing anything wrong?

@gavinrozzi

You are not doing anything wrong. This is the expected behavior. I am currently using Hydrator for a project and about 50% of the tweet IDs cannot be hydrated. There could be a variety of reasons: the user could have deleted the tweet, the account could have been suspended, Twitter could have taken the tweet down, etc.

@edsu
Member

edsu commented Dec 2, 2020

50% is high though. Delete rates I've seen are usually less than 20%. But I guess it is dataset dependent. One thing to make sure of is that you haven't corrupted the IDs by opening them with Excel and saving the file. Excel can't deal with integers this large: it keeps only 15 significant digits, so the trailing digits of each ID become zeros. So take a look at your IDs and make sure they don't all end in zero.
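The check for IDs that all end in zero can be automated. Below is a minimal sketch, not part of Hydrator: the function name `looks_excel_corrupted`, the three-zero suffix, the 90% threshold, and the sample IDs are all assumptions chosen for illustration.

```python
def looks_excel_corrupted(ids, zeros=3, threshold=0.9):
    """Heuristic: return True if most IDs end in `zeros` or more zero digits.

    The suffix length and threshold are illustrative assumptions,
    not values taken from Hydrator or Excel documentation.
    """
    ids = [i.strip() for i in ids if i.strip()]
    if not ids:
        return False
    suffix = "0" * zeros
    hits = sum(1 for i in ids if i.endswith(suffix))
    return hits / len(ids) >= threshold


# Made-up IDs for illustration only.
good = ["1323456789012345678", "1323456789012345999", "1323456789012346001"]
bad = ["1323456789012340000", "1323456789012350000", "1323456789012340000"]

print(looks_excel_corrupted(good))  # False
print(looks_excel_corrupted(bad))   # True
```

To run it against a real ID file, pass the open file object, since iterating over a file yields one line at a time.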

@gavinrozzi
Copy link

Thanks Ed. I think for me it's my specific dataset: I'm working with COVID tweets, and others doing the same have noticed a rather low hydration rate due to misinformation being deleted, etc. I do most of my work in R / Python, so I processed the IDs there to avoid Excel causing any weirdness. That doesn't apply to my case, but maybe it could be a factor in @santoshbs's low hydration rate.

@edsu
Member

edsu commented Dec 3, 2020

Oh, that's possible yeah. Which dataset are you working with? I can test to make sure things look ok.

@AsmaZbt

AsmaZbt commented Dec 28, 2020

Hello @edsu,
I have an issue that I don't understand. Can you help me, please?
I read the tweet IDs from a CSV file into a data frame (Python), then created a '.txt' file containing one tweet ID per line.
I checked the IDs in the TXT file: they are correct and don't end in zero.
But after using the Hydrator, I got 33,125 tweets from a total of 55,000 IDs.
And after checking the resulting CSV file, all the IDs end with four or five zeros.

Is that normal? What should I do?
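For the CSV-to-text step described above, one way to keep the IDs intact is to treat them as strings from start to finish. The following is a sketch using only the standard library; the column name `tweet_id` and the sample rows are hypothetical, and in-memory buffers stand in for real files. With pandas, passing `dtype=str` to `read_csv` should serve the same purpose.

```python
import csv
import io


def extract_ids(csv_file, column="tweet_id"):
    """Read one column from a CSV file, keeping every value as a string."""
    reader = csv.DictReader(csv_file)
    return [row[column].strip() for row in reader if row[column].strip()]


# Hypothetical input; a real script would use open("data.csv", newline="").
sample = io.StringIO(
    "tweet_id,label\n"
    "1323456789012345678,a\n"
    "1323456789012345679,b\n"
)
ids = extract_ids(sample)

# Write one ID per line, the layout the thread describes for the '.txt' file.
id_file = io.StringIO()
id_file.write("\n".join(ids) + "\n")
print(id_file.getvalue())
```

Because the `csv` module never converts values to numbers, the IDs cannot be rounded on the way through.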

@edsu
Member

edsu commented Dec 28, 2020

@AsmaZbt does the id_str property in your hydrated data also have zeros at the end? JavaScript does not handle long integers well (see #25), so the .id value will often be incorrect.

It is normal for the hydrated number to be less than the number of tweet IDs read. The discrepancy reflects the number of tweets that have been deleted or protected since the dataset was created.
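The difference between the numeric ID and its string form can be reproduced outside JavaScript. This sketch emulates a JavaScript Number by round-tripping the ID through a 64-bit float in Python; the payload and its ID value are made up for illustration.

```python
import json

# Hypothetical tweet payload; the ID value is made up for illustration.
raw = '{"id": 1323456789012345678, "id_str": "1323456789012345678"}'
tweet = json.loads(raw)

# Python integers are exact, so emulate a JavaScript Number by
# round-tripping the ID through a 64-bit float.
as_js_number = int(float(tweet["id"]))

# JavaScript's Number.MAX_SAFE_INTEGER; integers above it
# may not be represented exactly as doubles.
MAX_SAFE_INTEGER = 2**53 - 1

print(tweet["id"] > MAX_SAFE_INTEGER)       # True
print(as_js_number == tweet["id"])          # False
print(tweet["id_str"] == str(tweet["id"]))  # True
```

The string field survives unchanged, which is why it is the safer one to use when matching tweets back to other data.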

@AsmaZbt

AsmaZbt commented Dec 28, 2020

@edsu Thank you so much for the quick reply.
I'm not sure what you mean by id_str, but in my hydrated data I found "in_reply_to_status_id", and yes, its values end with 0000.

Getting less data is not a problem for me. When I noticed that the hydrated number is less than the number of tweet IDs read, I understood that the missing tweets are deleted or protected. However, I absolutely need the correct IDs to match them with my data.

Please, is there any solution?
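One possible approach to the matching problem is to compare IDs as strings and split the requested list into those that came back and those that did not. This is only a sketch: `split_by_hydration` is a hypothetical helper, and the IDs are made up.

```python
def split_by_hydration(requested_ids, hydrated_ids):
    """Split requested IDs into (found, missing), comparing them as strings."""
    requested = [i.strip() for i in requested_ids if i.strip()]
    hydrated = {i.strip() for i in hydrated_ids}
    found = [i for i in requested if i in hydrated]
    missing = [i for i in requested if i not in hydrated]
    return found, missing


# Made-up IDs for illustration only.
requested = ["1323456789012345678", "1323456789012345679", "1323456789012345680"]
hydrated = ["1323456789012345680", "1323456789012345678"]

found, missing = split_by_hydration(requested, hydrated)
print(found)    # ['1323456789012345678', '1323456789012345680']
print(missing)  # ['1323456789012345679']
```

Running the same comparison on the output of two separate hydration runs would also show which IDs appeared only in the later run.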

@AsmaZbt

AsmaZbt commented Dec 28, 2020

@edsu I checked the values in the hydrated '.csv' file with pandas, and the surprise is that the IDs do not end with 0000: they are the real IDs. So I think the problem is in Excel, because when I open the file with Excel I don't see the same values as in the dataframe (that's strange).

However, after running the Hydrator several times, I realized that each time I run it, I get more tweets.
On the first run, I got 33,125,
and on the fifth run, I got 33,168. What does that mean?

Thank you so much for sharing this beautiful work.

@edsu
Member

edsu commented Dec 29, 2020

Does the number always go up? Excel does overflow the tweet ids so do be careful how you use it!

@AsmaZbt

AsmaZbt commented Dec 29, 2020

Hi @edsu, yes, I checked the IDs in the hydrated file and they match the correct IDs perfectly. Thank you very much.

Yes, the number always goes up. I don't understand why!

best regards

@edsu
Member

edsu commented Dec 29, 2020

Can you share the tweet id dataset? I can take a look to see what might be happening.

@edsu closed this as completed on Sep 20, 2021