Weekly library data scraping #9
Now that the dev deployment is complete, we should make this the next thing to put in place. @pdelong42, would you like to take a crack at this? The requirements are basically to run `python scrape.py` for each of the Elasticsearch indexes from crontab. I think the only modification required for each scraper script is to check for an existing index first, remove it if it exists, and then run the rest of the script as normal (I can add that code to the existing scripts; you just need to set up the crontab).
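A minimal sketch of that modification, assuming the scripts use the elasticsearch-py client (the helper name and the way the client is passed in are assumptions, not the actual code in the repo):

```python
def reset_index(es, name):
    """Delete index `name` if it already exists, so re-scraping starts clean.

    `es` is assumed to be an elasticsearch-py client (or any object exposing
    `indices.exists` / `indices.delete` with an `index` keyword argument).
    """
    if es.indices.exists(index=name):
        es.indices.delete(index=name)
        return True   # an old index was removed
    return False      # nothing to remove; the scrape proceeds either way
```

Calling this at the top of each `scrape.py` would make the weekly cron run idempotent.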
@mik3caprio Sure, just give me the path to the scraping script and the way it ought to be called, and I'll drop it into a crontab.
So there are four sets of two scripts, one set for each Library system. The path is `/home/apiproject/API-Portal/scrape/`, and the directories containing the Python scripts are `dspace`, `omeka`, `sierra`, and `xeac`. Each directory contains a `scrape.py` and a `search.py`. You would just need to run `python scrape.py` and `python search.py` for each Library system, weekly.

The only open question is how we delete the indices from Elasticsearch before scraping and re-indexing. I'm assuming the cron job would run another CLI command first to remove the Elasticsearch indices for each system. In other words: [ES CLIs to delete dspace* indices] and so on. I think we could/should set this cron up but not turn it on just yet.
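A sketch of what that crontab could look like for the apiproject user. The schedule, the Elasticsearch URL, and the curl-based deletion are all assumptions; the actual deletion commands (the bracketed placeholder above) still need to be confirmed:

```shell
# Hypothetical user crontab (crontab -e as apiproject).
# Runs early Saturday morning; entries are staggered so jobs don't overlap.
# To stage this without turning it on yet, leave each line commented out.
# min hour dom mon dow  command
0  3 * * 6  curl -s -XDELETE 'http://localhost:9200/dspace*' ; cd /home/apiproject/API-Portal/scrape/dspace && python scrape.py && python search.py
20 3 * * 6  curl -s -XDELETE 'http://localhost:9200/omeka*'  ; cd /home/apiproject/API-Portal/scrape/omeka  && python scrape.py && python search.py
40 3 * * 6  curl -s -XDELETE 'http://localhost:9200/sierra*' ; cd /home/apiproject/API-Portal/scrape/sierra && python scrape.py && python search.py
0  4 * * 6  curl -s -XDELETE 'http://localhost:9200/xeac*'   ; cd /home/apiproject/API-Portal/scrape/xeac   && python scrape.py && python search.py
```

Note that wildcard index deletion depends on the cluster allowing destructive wildcard actions, and a `;` (rather than `&&`) after the delete lets the scrape run even when the index doesn't exist yet.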
Hey @pdelong42, just confirm with me that you've got this set up and I'll close out this ticket.
@mik3caprio, I tried running those scripts while logged in as the `apiproject` user, but they threw some errors about missing Python modules. Try it in dev to see what I mean. Are these the same scripts that were used to populate the initial data set into Elasticsearch in the first place?
Sorry, I closed it by mistake. Wrong button, oops... |
Ah yes, excellent point... I ran them originally from my mcaprio account, so I must have only installed the modules there to run them! I'll get that sorted out.

Mike
Okay, but let's install as many of these Python modules as RPMs whenever they're available, and only grab from pip as needed. Let me know the names of the missing modules, and I'll make my best effort to find and install RPM packages of them from reputable sources.
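That preference can be sketched as a small shell helper: try the distro RPM first and fall back to pip only when no package exists. The `python-<module>` naming convention and the module names in the example are assumptions; run as root or via sudo:

```shell
# Prefer the distro RPM (commonly named python-<module> on RHEL/CentOS);
# fall back to pip only when yum finds nothing. Names are assumptions.
install_py_module() {
    yum -y install "python-$1" || pip install "$1"
}

# Hypothetical usage, once the missing module names are known:
# for mod in requests elasticsearch; do install_py_module "$mod"; done
```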
Set up cron in dev for scraping content - Scripts should fire off WEEKLY on weekends