Rework notebooks to use the static self-hosted fake job board #350

Open
wants to merge 4 commits into
base: master

Conversation

martin-martin
Contributor

indeed.com has tightened its bot protection against web scraping, so the requests to the site that this course describes now return 403 Forbidden status codes.

I've attempted to circumvent this using fake headers (something that would be explainable in an intro course) but no luck, 403 prevails.
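For readers unfamiliar with the "fake headers" approach, here is a minimal sketch of the idea. It uses the standard library's `urllib` rather than whichever HTTP client the notebooks use, and the `BROWSER_USER_AGENT` string and both helper names are hypothetical, chosen only for illustration. Nothing here implies that this gets past any particular site's bot protection; as noted above, it did not help with indeed.com.

```python
import urllib.error
import urllib.request

# Hypothetical browser-like User-Agent string, for illustration only.
BROWSER_USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0"


def build_request(url, user_agent=BROWSER_USER_AGENT):
    """Create a GET request that announces itself with a custom User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_status(request, timeout=10.0):
    """Send the request and return the HTTP status code.

    urllib raises HTTPError for 4xx/5xx responses, so a 403 Forbidden
    is reported through the exception's ``code`` attribute.
    """
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code
```

Calling `fetch_status(build_request(url))` returns the status code as an integer, so a blocked request shows up as `403` instead of raising an exception.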

I've previously reworked the written tutorial to use a self-hosted fake job board that I set up just for the purpose of the tutorial.

As a quick fix for the video course, I added an explanatory lesson to the video course and reworked the Jupyter notebooks.

The information and processes that I explain in the rest of the course are still valid and a good introduction for how to approach scraping a static website.
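To illustrate what scraping a static page boils down to, here is a small sketch that pulls job titles out of already-downloaded HTML. It deliberately uses the standard library's `html.parser` instead of Beautiful Soup so that it runs without extra dependencies. The `SAMPLE_PAGE` markup, the job titles in it, and the assumption that titles live in `<h2>` elements carrying a `title` class are all hypothetical stand-ins, not a description of the actual job board.

```python
from html.parser import HTMLParser


class JobTitleParser(HTMLParser):
    """Collect the text of every <h2> element whose class list contains 'title'."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # The class attribute may be missing or hold several class names.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h2" and "title" in classes:
            self._in_title = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_title:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._in_title:
            self.titles.append("".join(self._buffer).strip())
            self._in_title = False


def parse_job_titles(html):
    """Return all job titles found in the given HTML string."""
    parser = JobTitleParser()
    parser.feed(html)
    parser.close()
    return parser.titles


# Hypothetical markup standing in for a static job listing page.
SAMPLE_PAGE = """
<html><body>
  <div class="card"><h2 class="title is-5">Example Job One</h2></div>
  <div class="card"><h2 class="title is-5">Example Job Two</h2></div>
</body></html>
"""
```

Because the page is static, the same function would work on the HTML of a real response body once it has been downloaded, provided the markup matches the assumptions above.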

Where to put new files:

  • New files should go into a top-level subfolder, named after the article slug. For example: my-awesome-article

How to merge your changes:

  1. Make sure the CI code style tests all pass (+ run the automatic code formatter if necessary).
  2. Find an RP Team member on Slack and ask them to review & approve your PR.
  3. Once the PR has one positive ("approved") review, GitHub lets you merge the PR.
  4. 🎉

martin-martin and others added 3 commits December 21, 2022 14:48
indeed.com has tightened its bot protection against web scraping, so the requests to the site that this course describes now return 403 Forbidden status codes.

I've attempted to circumvent this using fake headers (something that would be explainable in an intro course) but no luck, 403 prevails.

I've previously [reworked the written tutorial](https://realpython.com/beautiful-soup-web-scraper-python/#step-1-inspect-your-data-source) to use a self-hosted [fake job board](https://realpython.github.io/fake-jobs/) that I set up just for the purpose of the tutorial.

As a quick fix for the video course, I added an explanatory lesson to the video course and reworked the Jupyter notebooks.

The information and processes that I explain in the rest of the course are still valid and a good introduction for how to approach scraping a static website.
Contributor

@gahjelle gahjelle left a comment

@martin-martin Great job updating this! I agree with removing the output from the notebooks!

I found one tiny bug (title -> title_element) that's noted as a line comment.

Otherwise, this looks good to me!

We could potentially ask @KateFinegan to take a quick LE glance at the changes as well.

"source": [
"link_text = title_link.text\n",
"link_text"
"title = title.text\n",
Contributor

title is not defined at this point. Should we refer to title_element instead?
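A minimal sketch of the pattern in question: look up the element under one name, then read `.text` from that same name. It uses the standard library's `xml.etree.ElementTree` on a small well-formed snippet instead of Beautiful Soup, so the lookup syntax differs from the notebook. The `CARD_MARKUP` content and the `extract_title` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for one job card.
CARD_MARKUP = """
<div class="card">
  <h2 class="title">Senior Python Developer</h2>
  <h3 class="company">Example Corp</h3>
  <p class="location">Remote</p>
</div>
"""


def extract_title(card_markup):
    """Return the stripped title text of a single job card."""
    card = ET.fromstring(card_markup.strip())
    # The element is bound to title_element, so that is the name
    # whose .text must be read. Reading title.text before title is
    # assigned would raise a NameError.
    title_element = card.find("./h2[@class='title']")
    if title_element is None or title_element.text is None:
        raise ValueError("no title element found in card")
    title = title_element.text.strip()
    return title
```

Calling `extract_title(CARD_MARKUP)` returns the text of the `<h2>` element from the sample card.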

Suggested change
- "title = title.text\n",
+ "title = title_element.text\n",

Co-authored-by: gahjelle <[email protected]>

2 participants