Random Sample for Search #566
I need to retrieve tweets from four 2-week periods in 2018. I have academic access. I only want a random sample of 10k tweets per day. All the solutions I see sample by user ID; I want to sample by tweet count per day, so that I get tweets from random times during each day. Is this possible? Thank you!
Yes, maybe - my current prototype implementation for this is still forthcoming, unfortunately.
I see. In the meantime, I am using a loop that iterates over days, hours, and minutes and requests a 5-second chunk each time, in case that's helpful to other folks trying to get a sample of tweets. I look forward to seeing your implementation!
Hi @digi686. Would you happen to have the code available for your "randomising" loop? I too have academic access and I'm looking to take a sample based on a hashtag search over a 10-year period for sentiment analysis.
@igorbrigadir I wonder if it makes sense to release your prototype as a plugin while it is in development?
Hi @troyneilson, sure! Here's my loop. I put it inside the main function, before defining my query. Since my last comment, I opted for chunks of 2 seconds to reduce the volume of tweets. Hope it helps.
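A minimal sketch of the kind of loop described above, not the commenter's original code. It only generates the time windows. The function name `sample_windows`, the start date, and the choice to place each 2-second chunk at the start of every minute are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def sample_windows(first_day, num_days, chunk_seconds=2):
    """Yield (start, end) datetimes: one short chunk per minute of each day.

    Assumption: the chunk sits at the start of each minute.
    """
    chunk = timedelta(seconds=chunk_seconds)
    for day in range(num_days):
        for hour in range(24):
            for minute in range(60):
                start = first_day + timedelta(days=day, hours=hour, minutes=minute)
                yield start, start + chunk


# Hypothetical two-week period starting on 1 January 2018 (UTC).
first_day = datetime(2018, 1, 1, tzinfo=timezone.utc)
windows = list(sample_windows(first_day, num_days=14))

print(len(windows))  # one window per minute over 14 days
print(windows[0])
```

Each `(start, end)` pair could then be used as the start and end time of one search request, so the collected tweets are spread across every minute of the day rather than clustered at one time.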
Thanks heaps for that, really appreciated.
For #453, second attempt at #459
Twitter "sample" stream is based on selecting tweets with ids where the millisecond timestamp matches a defined range.
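To make the millisecond timestamp inside a tweet id concrete, here is a small sketch of the snowflake layout: the id shifted right by 22 bits is the number of milliseconds since the snowflake epoch. The helper names are hypothetical, and the 100-109 ms window is an arbitrary illustration, not the range the sample stream actually uses.

```python
TWITTER_EPOCH_MS = 1288834974657  # snowflake epoch, in Unix milliseconds
TIMESTAMP_SHIFT = 22              # low 22 bits: machine id and sequence number


def snowflake_to_ms(tweet_id: int) -> int:
    """Unix timestamp in milliseconds encoded in a snowflake tweet id."""
    return (tweet_id >> TIMESTAMP_SHIFT) + TWITTER_EPOCH_MS


def ms_to_snowflake(unix_ms: int) -> int:
    """Smallest snowflake id that could have been created at unix_ms."""
    return (unix_ms - TWITTER_EPOCH_MS) << TIMESTAMP_SHIFT


def ms_within_second(tweet_id: int) -> int:
    """Millisecond part (0-999) of the id's creation time."""
    return snowflake_to_ms(tweet_id) % 1000


def id_bounds(second_start_ms: int, first_ms: int, last_ms: int) -> tuple:
    """Exclusive (lower, upper) id bounds for ids created in
    milliseconds first_ms..last_ms (inclusive) of the given second."""
    lower = ms_to_snowflake(second_start_ms + first_ms) - 1
    upper = ms_to_snowflake(second_start_ms + last_ms + 1)
    return lower, upper


second = 1514764800000  # 2018-01-01T00:00:00Z in Unix milliseconds
example_id = ms_to_snowflake(second + 123)
print(ms_within_second(example_id))  # 123

# Arbitrary illustrative window: milliseconds 100..109 of that second.
lower, upper = id_bounds(second, 100, 109)
print(lower < ms_to_snowflake(second + 100) < upper)  # True
```

Because ids sort by creation time, any id strictly between `lower` and `upper` was created inside the chosen millisecond window, which is what makes id-based range queries usable for time-based sampling.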
Use `since_id` and `until_id` parameters and snowflake id tricks to simulate a `sample:` operator that samples tweets based on millisecond time windows.
The `--sample` command line option can apply to any endpoint that has a since / until id option.
The idea is to accept an integer between 1 and 100 to get a sample of n% of tweets, or `--sample gardenhose` or `--sample spritzer` or `--sample v1` (alias for `--sample 1` and `--sample spritzer`) or `--sample v2`, which is also a 1% sample but with different sampling windows as far as I can tell.
I still have to make sure my assumptions are correct - but so far the millisecond ranges are like this: