Unable to scrape all the data of a Twitter account, need help in scraping all of its contents #4618
-
Hello everyone, I recently ran into a problem while using gallery-dl on Twitter. I am using the Twitter config from #4534, with, afaik, no alterations to it yet. I hope someone is willing to assist me on this matter. Thanks 👍 the relevant config.json parts
Replies: 5 comments 6 replies
-
Friend, you may want to double-check those credentials before you publish your username and password on a public forum. Hope you've got 2FA turned on.
-
I'm not sure what the etiquette is for questions that haven't been answered in a couple of weeks, but I'll ask for help again. I recently updated my g-dl to the latest version (I realized I hadn't updated it in a while) and retried scraping the same account as above. However, it still has the same problem, although this time it stopped downloading when it reached 3215 items. I hope that someone with better insight into g-dl might be able to assist me in this matter. Thank you.
-
If you're hitting a limit with gallery-dl on Twitter after scraping around 3k images, despite there being more, it sounds like you might be encountering API rate limits or the tool's internal limitations. The config from #4534 should generally work well, but every tool has its constraints, especially with large volumes of data. For overcoming such limitations, Crawlbase could offer a potential solution. While it's not specifically designed for Twitter scraping, its robust handling of requests and data could help manage and possibly bypass the stumbling blocks you're facing. It's particularly useful for large-scale scraping projects where managing requests efficiently and avoiding detection become crucial. You can also try Bright Data or ScrapingBee as alternatives.
-
If you're struggling to scrape Twitter data, try Crawlbase API for proxy rotation and CAPTCHA handling. Selenium/Playwright can help with dynamic content. Also, use IP rotation & rate limits to avoid blocks. Let me know if you need more help! FYI: I also tried same |
-
hi guys, i have never had a problem with downloading everything from a twitter profile, including both text posts & images/videos. here is my config (i assume the relevant part for getting everything is the
{
    "extractor": {
        "base-directory": "X:/My Drive/!pr0n/scrapers/",
        "#archive": "%appdata%/gallery-dl/archive.sqlite3",
        "path-restrict": "^A-Za-z0-9_.~!-",
        "format-date": "%Y-%m-%d_%H-%M-%S",
        "skip": "abort:15",
        "parent-skip": "abort:15",
        "keywords-default": "",
        "parent-directory": "true",
        "twitter": {
            "archive": "X:/My Drive/!pr0n/scrapers/zzTwitter/archive.twitter.sqlite3",
            "skip": "abort:4",
            "#cookies": "X:/My Drive/!pr0n/scrapers/zzTwitter/cookies.twitter.user1.txt",
            "cookies": "X:/My Drive/!pr0n/scrapers/zzTwitter/cookies.twitter.user2.txt",
            "#sleep": [9.9, 24.2],
            "sleep": [14.9, 29.2],
            "#sleep-request": [3.8, 12.6],
            "sleep-request": [13.8, 33.6],
            "image-filter": "author is user",
            "logout": 1,
            "pinned": 1,
            "syndication": 1,
            "text-tweets": 1,
            "include": ["avatar", "background", "timeline", "media"],
            "directory": {
                "count ==0": ["zzTwitter", "downloads", "{author[id]}.{author[name]}", "text_tweets"],
                "": ["zzTwitter", "downloads", "{author[id]}.{author[name]}", "media"]
            },
            "filename": "{date}~_~{tweet_id}-{num}.{author[name]}~_~{content[0:69]}~_~{filename}.{extension}",
            "avatar": {
                "archive": "",
                "directory": ["zzTwitter", "downloads", "{author[id]}.{author[name]}", "media", "avatar"],
                "filename": "{date}_avatar_{author[id]}.{author[name]}~_~{filename}.{extension}"
            },
            "background": {
                "archive": "",
                "directory": ["zzTwitter", "downloads", "{author[id]}.{author[name]}", "media", "background"],
                "filename": "background_{date}~_~{filename}.{extension}"
            },
            "metadata": 1,
            "postprocessors": [{
                "name": "metadata",
                "event": "post",
                "directory": "metadata",
                "filename": "{date}~_~{tweet_id}.{author[name]}~_~{content[0:69]}.json"
            }]
        }
    },
    "downloader": {
        "rate": "10M",
        "progress": 2.0
    },
    "output": {
        "ansi": true,
        "mode": "color"
    }
}
there are a couple of similar lines such as
TLDR: Twitter search bad, but the only option
As far as I know, it is not really possible to get more items from a Twitter user than gallery-dl already gets, unless maybe with a more precise search query.
The current strategy is to use the /media timeline for the first ~1000 items, followed by a search for from:USERNAME max_id:LAST_MEDIA_ID. The problem with this is that 1) there is no way to get older Tweets other than searching and 2) search got severely crippled, is rate limited, and is inaccurate (i.e. it misses Tweets).
Your config settings shouldn't matter all that much, or at all, as long as you run gallery-dl with twitter.com/USERNAME as the input URL.
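To make the "more precise search query" idea a bit more concrete, here is a minimal Python sketch that only builds the query string described above and turns it into a search URL. The helper names (build_search_query, build_search_url) are made up for illustration, the since/until date operators are an assumption that does not come from this thread, and whether narrower queries actually recover missed Tweets is untested here.

```python
from urllib.parse import quote


def build_search_query(username, max_id=None, since=None, until=None):
    """Build a Twitter search query like 'from:USERNAME max_id:LAST_MEDIA_ID'.

    since/until (YYYY-MM-DD strings) are optional extras assumed for
    illustration; they are not part of the strategy described above.
    """
    parts = [f"from:{username}"]
    if max_id is not None:
        parts.append(f"max_id:{max_id}")
    if since:
        parts.append(f"since:{since}")
    if until:
        parts.append(f"until:{until}")
    return " ".join(parts)


def build_search_url(query):
    """Percent-encode the query and attach it to a twitter.com search URL."""
    return "https://twitter.com/search?q=" + quote(query, safe="")


if __name__ == "__main__":
    # USERNAME and the ID below are placeholders, not real values.
    query = build_search_query("USERNAME", max_id=1234567890)
    print(query)
    print(build_search_url(query))
```

The printed URL could then be tried as an input URL in place of twitter.com/USERNAME, for example one date window at a time, but the same search limitations (rate limits, missed Tweets) would still apply.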