Random Sample for Search #566

Draft
wants to merge 4 commits into
base: main

igorbrigadir
Contributor

For #453, second attempt at #459

Twitter's "sample" stream is based on selecting tweets whose ids encode a millisecond timestamp that falls in a defined range.

Use since_id and until_id parameters and snowflake id tricks to simulate a sample: operator that samples tweets based on millisecond time windows.
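
A minimal sketch of the snowflake arithmetic this relies on, assuming the commonly documented id layout in which the bits above the lowest 22 hold milliseconds since the Twitter epoch (1288834974657). The helper names are hypothetical and not part of twarc, and the code assumes that both since_id and until_id are exclusive bounds.

```python
from typing import Tuple

TWITTER_EPOCH_MS = 1288834974657  # snowflake epoch, in Unix milliseconds
TIMESTAMP_SHIFT = 22              # the low 22 bits hold worker and sequence numbers


def snowflake_to_ms(tweet_id: int) -> int:
    """Return the Unix timestamp in milliseconds encoded in a snowflake id."""
    return (tweet_id >> TIMESTAMP_SHIFT) + TWITTER_EPOCH_MS


def ms_to_snowflake(unix_ms: int) -> int:
    """Return the smallest snowflake id that could have been created at unix_ms."""
    return (unix_ms - TWITTER_EPOCH_MS) << TIMESTAMP_SHIFT


def id_window(start_ms: int, end_ms: int) -> Tuple[int, int]:
    """Return (since_id, until_id) covering ids created in [start_ms, end_ms].

    Assumption: since_id and until_id are both exclusive bounds.
    """
    since_id = ms_to_snowflake(start_ms) - 1
    until_id = ms_to_snowflake(end_ms + 1)
    return since_id, until_id
```

A sampled search would then issue one request per window, passing the returned pair as since_id and until_id.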

The --sample command line option can apply to any endpoint that has a since / until id option.

The idea is to accept an integer between 1 and 100 to get a sample of n% of tweets, or --sample gardenhose, or --sample spritzer, or --sample v1 (an alias for --sample 1 and --sample spritzer), or --sample v2, which is also a 1% sample but, as far as I can tell, with different sampling windows.
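
One way the accepted values could be normalized, as a rough sketch only. `parse_sample` and `SAMPLE_ALIASES` are hypothetical names, not the prototype's code, and treating a bare integer as a v1-style sample is an assumption drawn from the alias description above.

```python
from typing import Tuple

# Hypothetical alias table: option value -> (windowing scheme, percent).
SAMPLE_ALIASES = {
    "spritzer": ("v1", 1),
    "gardenhose": ("v1", 10),
    "v1": ("v1", 1),
    "v2": ("v2", 1),
}


def parse_sample(value: str) -> Tuple[str, int]:
    """Normalize a --sample value to (scheme, percent)."""
    key = value.strip().lower()
    if key in SAMPLE_ALIASES:
        return SAMPLE_ALIASES[key]
    try:
        percent = int(key)
    except ValueError:
        raise ValueError(f"unknown --sample value: {value!r}") from None
    if not 1 <= percent <= 100:
        raise ValueError("--sample percentage must be between 1 and 100")
    # Assumption: a bare integer selects the v1 windowing scheme.
    return ("v1", percent)
```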

I still have to make sure my assumptions are correct - but so far the millisecond ranges are like this:

        _1% v1 "Spritzer" Sample   [657-666]
        10% v1 "Gardenhose" Sample [657-756]
        10% v1 "Enterprise" Sample [*0*]
        _1% v2 Sample           [*0*]
        _N% v2 Sample           [?]
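
As a sanity check on the two v1 ranges, an inclusive window of 657-666 spans 10 of the 1000 milliseconds in a second (1%), and 657-756 spans 100 (10%). The sketch below tests whether an id falls in such a window. It assumes the usual snowflake layout (timestamp in the bits above the lowest 22, epoch 1288834974657), the function names are hypothetical, and the ranges themselves are the unverified ones listed above. The entries marked [*0*] and [?] are left out.

```python
from typing import Tuple

TWITTER_EPOCH_MS = 1288834974657  # snowflake epoch, in Unix milliseconds

# Unverified v1 windows from the table above (inclusive millisecond bounds).
V1_WINDOWS = {
    "spritzer": (657, 666),
    "gardenhose": (657, 756),
}


def ms_within_second(tweet_id: int) -> int:
    """Return the millisecond part (0-999) of the id's creation time."""
    return ((tweet_id >> 22) + TWITTER_EPOCH_MS) % 1000


def in_sample(tweet_id: int, window: Tuple[int, int]) -> bool:
    """Return True if the id's millisecond part lies inside the window."""
    low, high = window
    return low <= ms_within_second(tweet_id) <= high


def window_percent(window: Tuple[int, int]) -> float:
    """Return the share of each second covered by the window, in percent."""
    low, high = window
    return (high - low + 1) / 10
```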

@digi686

digi686 commented Jun 4, 2022

I need to retrieve tweets over 2 weeks 4x in 2018. I have the academic access. I only want to get a random sample of 10k tweets per day. All the solutions I see are about sampling based on user ID, I want to sample in terms of tweet count per day so that I get tweets from random times during each day. Is this possible? Thank you!

@igorbrigadir
Contributor Author

Yes, maybe - my current prototype implementation for this is still forthcoming unfortunately.

@digi686

digi686 commented Jun 4, 2022

> Yes, maybe - my current prototype implementation for this is still forthcoming unfortunately.

I see, in the meantime, I am using a loop that iterates over days, hours, and minutes for chunks of 5 seconds - in case that's helpful to other folks trying to get a sample of tweets. I look forward to seeing your implementation!

@troyneilson

troyneilson commented Aug 6, 2022

Hi @digi686. Would you happen to have the code available for your "randomising" loop? I too have academic access and I'm looking to take a sample based on a hashtag search over a 10-year period for sentiment analysis.

@edsu
Member

edsu commented Aug 7, 2022

@igorbrigadir I wonder if it makes sense to release your prototype as a plugin while it is in development?

@msa-digi

msa-digi commented Aug 7, 2022

> Hi @digi686. Would you happen to have the code available for your "randomising" loop? I too have academic access and I'm looking to take a sample based on a hashtag search over a 10-year period for sentiment analysis.

Hi @troyneilson, sure! Here's my loop. I put it inside the main function, before defining my query. Since my last comment, I opted for chunks of 2 seconds to reduce the volume of tweets. Hope it helps.

    import datetime
    from tqdm import tqdm

    for day in tqdm(range(13)):
        for hour in tqdm(range(24)):
            for minute in tqdm(range(60)):
                # Specify the start time in UTC for the time period you want Tweets from
                start_time = datetime.datetime(2018, 11, 1 + day, hour, minute, 0, 0, datetime.timezone.utc)

                # Specify the end time in UTC for the time period you want Tweets from
                end_time = datetime.datetime(2018, 11, 1 + day, hour, minute, 2, 0, datetime.timezone.utc)

                # ... define and run the query for this 2-second window here ...
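
The same windows can also be produced with timedelta arithmetic, which flattens the nested loops and keeps working if the period crosses a month boundary. This is only a sketch of the windowing step: `minute_windows` is a hypothetical helper, and running the search for each pair is left out.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterator, Tuple


def minute_windows(
    first_day: datetime, days: int, seconds: int = 2
) -> Iterator[Tuple[datetime, datetime]]:
    """Yield (start_time, end_time) for the first `seconds` seconds of every minute."""
    for minute_index in range(days * 24 * 60):
        start_time = first_day + timedelta(minutes=minute_index)
        yield start_time, start_time + timedelta(seconds=seconds)


# 13 days starting 2018-11-01 UTC, matching the loop above.
windows = list(minute_windows(datetime(2018, 11, 1, tzinfo=timezone.utc), days=13))
```

Each pair would then be used as the start and end time of one search request.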

@troyneilson

troyneilson commented Aug 8, 2022 via email

5 participants