Bug report: scrape_filing_docs.R #100

Open
jingyujzhang opened this issue Aug 24, 2021 · 0 comments
Comments

jingyujzhang (Contributor) commented Aug 24, 2021

It seems that this script is intended to perform an incremental update of edgar.filing_docs.

  • Def14_a is never updated inside the while loop, so the code rescrapes the same set of files endlessly.

  • The SEC rate-limits traffic. Running in parallel with 8 cores gets the IP blocked. In my tests, 2 cores plus a 0.5 s sleep works (at least on my server). The key is to stay under 10 requests per second; otherwise the SEC blocks the IP for 10 minutes.
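A minimal sketch of both fixes, with hypothetical stand-ins for the script's actual functions (`docs_to_scrape` and `get_filing_docs` are assumptions, not names from the repository):

```r
library(parallel)

# Hypothetical helpers standing in for the script's own logic:
#   docs_to_scrape()       - queries edgar.filing_docs for filings not yet scraped
#   get_filing_docs(f)     - scrapes one filing index and writes its rows to the table

repeat {
  # Re-query on every pass so the work set shrinks and the loop can terminate,
  # instead of reusing a def14_a data frame computed once before the loop.
  def14_a <- docs_to_scrape()
  if (nrow(def14_a) == 0) break

  # 2 cores plus a 0.5 s sleep per request keeps the rate safely
  # under the SEC's ~10 requests/second limit.
  mclapply(def14_a$file_name,
           function(f) {
             Sys.sleep(0.5)
             get_filing_docs(f)
           },
           mc.cores = 2)
}
```

The same throttle could instead be enforced with a token-bucket limiter, but re-querying the remaining set each iteration is what actually fixes the infinite loop.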

jingyujzhang added a commit to jingyujzhang/edgar that referenced this issue Aug 24, 2021