Unable to scrape all the data of a Twitter account, need help in scraping all of its contents #4618

rarelygoeshere · 2023-10-03T17:20:24Z

rarelygoeshere
Oct 3, 2023

Hello everyone, I recently ran into a problem while using gallery-dl on Twitter.
I'm trying to scrape the data on this account, but the run stopped, seemingly arbitrarily, at a little over 3k images, even though the account has many more than that, as you can see below.

Regarding the config I am using for Twitter: I am using the one from #4534 with, as far as I know, no alterations to it yet. ~~However, if anyone wishes to review my config.json in its entirety, I have placed it at the bottom of this issue.~~ Edit: I have deleted most of the irrelevant parts of the config and kept only the parts I feel are relevant to the question.

I hope someone is willing to help me with this. Thanks 👍

the relevant config.json parts

{
    "extractor":
    {
        "base-directory": "./gallery-dl/",
        "parent-directory": false,
        "postprocessors": null,
        "archive": null,
        "cookies": null,
        "cookies-update": true,
        "proxy": null,
        "skip": true,

        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:115.0) Gecko/20100101 Firefox/115.0",
        "retries": 4,
        "timeout": 30.0,
        "verify": true,
        "fallback": true,

        "sleep": 0,
        "sleep-request": 0,
        "sleep-extractor": 0,

        "path-restrict": "auto",
        "path-replace": "_",
        "path-remove": "\\u0000-\\u001f\\u007f",
        "path-strip": "auto",
        "path-extended": true,

        "extension-map": {
            "jpeg": "jpg",
            "jpe" : "jpg",
            "jfif": "jpg",
            "jif" : "jpg",
            "jfi" : "jpg"
        },

        "twitter":
        {
            "username": "null",
            "password": "null",
            "filename": "{author[name]}-{author[id]}({author[date]:%Y%m%d_%H%M%S})-{tweet_id}({date:%Y%m%d_%H%M%S}).{extension}",
            "cards": false,
            "conversations": false,
            "pinned": false,
            "quoted": false,
            "replies": true,
            "retweets": false,
            "strategy": null,
            "text-tweets": false,
            "twitpic": false,
            "unique": true,
            "users": "timeline",
            "videos": true
        }
    },

    "downloader":
    {
        "filesize-min": null,
        "filesize-max": null,
        "mtime": true,
        "part": true,
        "part-directory": null,
        "progress": 3.0,
        "rate": null,
        "retries": 4,
        "timeout": 30.0,
        "verify": true,

        "http":
        {
            "adjust-extensions": true,
            "chunk-size": 32768,
            "headers": null,
            "validate": true
        },

        "ytdl":
        {
            "format": null,
            "forward-cookies": false,
            "logging": true,
            "module": null,
            "outtmpl": null,
            "raw-options": null
        }
    },

    "output":
    {
        "mode": "auto",
        "progress": true,
        "shorten": true,
        "ansi": false,
        "colors": {
            "success": "1;32",
            "skip"   : "2"
        },
        "skip": true,
        "log": "[{name}][{levelname}] {message}",
        "logfile": null,
        "unsupportedfile": null
    },

    "netrc": false
}

anthonyj-codestuff · 2023-10-04T18:55:01Z

anthonyj-codestuff
Oct 4, 2023

Friend, you may want to double-check those credentials before you publish your username and password on a public forum. Hope you've got 2FA turned on.

3 replies

rarelygoeshere Oct 6, 2023
Author

Oh right, I forgot to delete that. That's my mistake, thank you for notifying me. I'll change that now. It's only a throwaway, so thankfully it's not that important.

anthonyj-codestuff Oct 6, 2023

You already know this I'm sure, but for anyone reading this, you should also change passwords on any other accounts that had this same password. But of course you don't need to because you never re-use passwords :)

rarelygoeshere Oct 6, 2023
Author

Good advice indeed, haha XD. You should also use a password manager to keep track of them all and save yourself the trouble of remembering them.

rarelygoeshere · 2023-10-29T10:14:04Z

rarelygoeshere
Oct 29, 2023
Author

I'm not sure what the etiquette is for questions that haven't been answered in a couple of weeks, but I'll post again to ask for help.

I recently updated gallery-dl to the latest version (I realized I hadn't updated it in a while) and retried scraping the same account as above. However, it still has the same problem, although this time it stopped downloading when it reached 3215 items. I hope that someone with better insight into gallery-dl might be able to assist me with this.

Thank you.

2 replies

mikf Oct 29, 2023
Maintainer

TLDR: Twitter search bad, but the only option

As far as I know, it is not really possible to get more items from a Twitter user than gallery-dl already gets, unless maybe with a more precise search query.

The current strategy is to use the /media timeline for the first ~1000 items, followed by a search for from:USERNAME max_id:LAST_MEDIA_ID. The problem with this is that 1) there is no way to get older Tweets other than searching, and 2) search has been severely crippled and rate limited, and is inaccurate (i.e. it misses Tweets).

Your config settings shouldn't matter all that much, or at all, as long as you run gallery-dl with twitter.com/USERNAME as the input URL.
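
A minimal sketch of what "a more precise search query" could look like, in case it helps anyone reading along. It splits a date range into monthly windows and builds one search URL per window using Twitter's `since:` and `until:` search operators. This is only an illustration, not something gallery-dl does on its own: the helper names `month_windows` and `search_urls` are hypothetical, `USERNAME` is a placeholder, and it assumes the resulting `twitter.com/search?q=...` URLs are passed to gallery-dl as input. Since search itself misses Tweets, there is no guarantee this finds more than the default strategy.

```python
from datetime import date
from urllib.parse import quote


def month_windows(start: date, end: date):
    """Yield (since, until) date pairs covering [start, end) one month at a time."""
    current = date(start.year, start.month, 1)
    while current < end:
        # First day of the following month.
        if current.month == 12:
            nxt = date(current.year + 1, 1, 1)
        else:
            nxt = date(current.year, current.month + 1, 1)
        # Clamp the first and last window to the requested range.
        yield max(current, start), min(nxt, end)
        current = nxt


def search_urls(username: str, start: date, end: date):
    """Build one Twitter search URL per monthly window (hypothetical helper)."""
    urls = []
    for since, until in month_windows(start, end):
        query = f"from:{username} since:{since.isoformat()} until:{until.isoformat()}"
        urls.append("https://twitter.com/search?q=" + quote(query))
    return urls


if __name__ == "__main__":
    # Print the URLs so they can be reviewed or saved to a text file.
    for url in search_urls("USERNAME", date(2023, 1, 15), date(2023, 3, 10)):
        print(url)
```

Smaller windows mean more requests, so the rate limits mentioned above still apply.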

> I'm not sure what the etiquette is for questions that haven't been answered in a couple of weeks, but I'll post again to ask for help.

Sorry about that, it wasn't my intention to ignore your question. Sometimes I put off answering a question for "later" (basically Soon™) for various reasons. It then moves to page 2 or 3 in my notifications, never to be seen or remembered again.

(It also doesn't help that I've already answered very similar questions far too often)

Answer selected by rarelygoeshere

rarelygoeshere Nov 4, 2023
Author

Hello! Thank you for coming by to answer this question for me! And don't worry about feeling like you ignored my question, I don't mind it.

That being said, what you say is rather unfortunate to hear, as I was able to scrape two accounts with over 10k media tweets with gallery-dl, yet for the above account it caps out at a little over 3k tweets. I was really hoping I could scrape this entire account.
So I suppose I should look into alternatives to tackle this large account myself :( It's tough, but it is what it is.

Thank you for coming by and answering my question, I appreciate it.

FebX23 · 2024-02-15T11:59:16Z

FebX23
Feb 15, 2024

If you're hitting a limit with gallery-dl on Twitter after scraping around 3k images, despite there being more, it sounds like you might be encountering API rate limits or the tool's internal limitations. The config from #4534 should generally work well, but every tool has its constraints, especially with large volumes of data.

To overcome such limitations, Crawlbase could offer a potential solution. While it's not specifically designed for Twitter scraping, its robust handling of requests and data could help manage and possibly bypass the stumbling blocks you're facing. It's particularly useful for large-scale scraping projects where managing requests efficiently and avoiding detection become crucial. You can also try Bright Data or ScrapingBee as alternatives.

1 reply

rarelygoeshere Feb 15, 2024
Author

Thank you for your suggestions. However, upon investigation, I've found that they are all paid services, so I don't think I can afford to use them for my needs.

Emmeline-Hawthorne · 2025-02-08T12:38:42Z

Emmeline-Hawthorne
Feb 8, 2025

If you're struggling to scrape Twitter data, try the Crawlbase API for proxy rotation and CAPTCHA handling. Selenium/Playwright can help with dynamic content. Also, use IP rotation and rate limiting to avoid blocks. Let me know if you need more help! FYI: I have tried the same approach myself.

0 replies

docholllidae · 2025-02-14T00:37:09Z

docholllidae
Feb 14, 2025

hi guys, i have not ever had a problem with downloading everything from a twitter profile, including both text posts & images/videos.
for example one user i've downloaded has almost 6,300 media according to twitter, my download folder for them has over 7,100 media (i assume the discrepancy is related in part to deleted posts, and i'm not sure how twitter counts media when a single post has multiple media in it)
this user also has media going back to late 2019. my first download of them was in early 2023

here is my config (i assume the relevant part for getting everything is the include setting, combined with the fact that my first download of a user is always .\gallery-dl.exe https://twitter.com/USERNAME --abort 6969 )

{
    "extractor": {
        "base-directory": "X:/My Drive/!pr0n/scrapers/",
        "#archive": "%appdata%/gallery-dl/archive.sqlite3",
        "path-restrict": "^A-Za-z0-9_.~!-",
        "format-date": "%Y-%m-%d_%H-%M-%S",
        "skip": "abort:15",
        "parent-skip": "abort:15",
        "keywords-default": "",
        "parent-directory": "true",

        "twitter": {
            "archive": "X:/My Drive/!pr0n/scrapers/zzTwitter/archive.twitter.sqlite3",
            "skip": "abort:4",

            "#cookies": "X:/My Drive/!pr0n/scrapers/zzTwitter/cookies.twitter.user1.txt",
            "cookies": "X:/My Drive/!pr0n/scrapers/zzTwitter/cookies.twitter.user2.txt",

            "#sleep": [9.9, 24.2],
            "sleep": [14.9, 29.2],
            "#sleep-request": [3.8, 12.6],
            "sleep-request": [13.8, 33.6],

            "image-filter": "author is user",
            "logout": 1,
            "pinned": 1,
            "syndication": 1,
            "text-tweets": 1,
            "include": ["avatar","background","timeline", "media"],

            "directory": {
                "count ==0":        ["zzTwitter","downloads","{author[id]}.{author[name]}","text_tweets"],
                "":                 ["zzTwitter","downloads","{author[id]}.{author[name]}","media"]
            },
            "filename": "{date}~_~{tweet_id}-{num}.{author[name]}~_~{content[0:69]}~_~{filename}.{extension}",
            "avatar": {
                "archive": "",
                "directory": ["zzTwitter","downloads","{author[id]}.{author[name]}","media","avatar"],
                "filename": "{date}_avatar_{author[id]}.{author[name]}~_~{filename}.{extension}"
            },
            "background": {
                "archive": "",
                "directory": ["zzTwitter","downloads","{author[id]}.{author[name]}","media","background"],
                "filename": "background_{date}~_~{filename}.{extension}"
            },

            "metadata": 1,
            "postprocessors":[{
                "name": "metadata",
                "event": "post",
                "directory": "metadata",
                "filename": "{date}~_~{tweet_id}.{author[name]}~_~{content[0:69]}.json"
            }]
        }
    },

    "downloader": {
        "rate": "10M",
        "progress": 2.0
    },

    "output": {
        "ansi": true,
        "mode": "color"
    }
}

there are a couple of duplicated lines such as sleep (one copy is disabled with a # prefix); this lets me quickly swap in and test different values
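
for anyone who wants to try the same first-download approach without editing their config, here is a rough sketch that assembles the equivalent command line. it assumes gallery-dl is on your PATH, uses the `-o KEY=VALUE` option to set include for this run only, and passes `--abort` the way the command above does. the function name `build_command` is made up for this example, and USERNAME is a placeholder.

```python
import shlex


def build_command(username: str, abort_after: int = 6969) -> list[str]:
    """Assemble a gallery-dl command line for a first full download of one profile."""
    return [
        "gallery-dl",
        # Mirror the "include" setting from the config above, minus avatar/background.
        "-o", "include=timeline,media",
        # Stop after this many consecutive skipped files (the post uses 6969).
        "--abort", str(abort_after),
        f"https://twitter.com/{username}",
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(build_command("USERNAME")))
```

the script only prints the command; paste it into a terminal (or hand the list to subprocess.run) once it looks right.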

0 replies
