Rework notebooks to use the static self-hosted fake job board #350

Open
wants to merge 4 commits into
base: master
Changes from 3 commits
1,735 changes: 22 additions & 1,713 deletions build-a-web-scraper/01_inspect.ipynb

Large diffs are not rendered by default.

59 changes: 25 additions & 34 deletions build-a-web-scraper/02_scrape.ipynb

Large diffs are not rendered by default.

2,121 changes: 57 additions & 2,064 deletions build-a-web-scraper/03_parse.ipynb

Large diffs are not rendered by default.

34 changes: 24 additions & 10 deletions build-a-web-scraper/04_pipeline.ipynb
@@ -12,15 +12,28 @@
"- Target & Save Specific Information You Want"
]
},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"## ⚠️ Durability Warning ⚠️\n",
+"\n",
+"As [mentioned in the course](https://realpython.com/lessons/challenge-of-durability/), websites frequently change. Unfortunately, the job board that you'll see in the course, indeed.com, has started blocking scrapers since the course was recorded.\n",
+"\n",
+"Just like in the associated written tutorial on [web scraping with Beautiful Soup](https://realpython.com/beautiful-soup-web-scraper-python/#scrape-the-fake-python-job-site), you can instead use [Real Python's fake jobs site](https://realpython.github.io/fake-jobs/) to practice scraping a static website.\n",
+"\n",
+"All the concepts discussed in the course lessons are still accurate. Translating what you see onto a different website will be a good learning opportunity where you'll have to synthesize the information and apply it practically."
+]
+},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Your Tasks:\n",
"\n",
-"- Scrape the first 100 available search results\n",
+"- Scrape all 100 available job postings\n",
"- Generalize your code to allow searching for different locations/jobs\n",
-"- Pick out information about the URL, job title, and job location\n",
+"- Pick out information about the apply URL, job title, and job location\n",
"- Save the results to a file"
]
},
@@ -40,8 +53,7 @@
"source": [
"### Part 1: Inspect\n",
"\n",
-"- How do the URLs change when you navigate to the next results page?\n",
-"- How do the URLs change when you use a different location and/or job title search?\n",
+"- How do the URLs change when you navigate to a job detail?\n",
"- Which HTML elements contain the link, title, and location of each job?"
]
},
@@ -58,8 +70,9 @@
"source": [
"### Part 2: Scrape\n",
"\n",
-"- Build the code to fetch the first 100 search results. This means you will need to automatically navigate to multiple results pages\n",
-"- Write functions that allow you to specify the job title, location, and amount of results as arguments"
+"- Build the code to fetch all 100 available job postings.\n",
+"- Write functions that allow you to specify the job title, location, and amount of results as arguments\n",
+"- Also fetch the information provided on each job details page. For this, you'll need to automatically follow URLs that you've fetched when getting the job postings."
]
},
{
@@ -75,8 +88,9 @@
"source": [
"### Part 3: Parse\n",
"\n",
-"- Sieve through your HTML soup to pick out only the job title, link, and location\n",
-"- Format the results in a readable format (e.g. JSON)\n",
+"- Sieve through your HTML soup to pick out only the job title, link, and location from the main page\n",
+"- Sieve through the HTML of each details page to get the job description and combine it with the other information\n",
+"- Format the results in a readable format (e.g. JSON, TXT, TOML, ...)\n",
"- Save the results to a file"
]
},
@@ -90,7 +104,7 @@
],
"metadata": {
"kernelspec": {
-"display_name": "Python 3",
+"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
@@ -104,7 +118,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
-"version": "3.8.0"
+"version": "3.11.0"
}
},
"nbformat": 4,
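For reference, the reworked Part 3 tasks in `04_pipeline.ipynb` (pick out the job title, apply URL, and location from the main page, then save the results to a file) could be sketched roughly as follows. This is only an illustration, not code from this PR: it uses the standard library's `html.parser` instead of the Beautiful Soup calls taught in the course so that it runs without extra dependencies. The `card-content`, `title`, and `location` class names and the "Apply" link text are assumptions about the fake job board's markup, and `SAMPLE_HTML`, `JobCardParser`, `parse_jobs`, and `save_jobs` are hypothetical names.

```python
import json
from html.parser import HTMLParser

# Hypothetical stand-in for the fetched main page; the class names and the
# "Apply" link text are assumptions about the fake job board's markup.
SAMPLE_HTML = """
<div id="ResultsContainer">
  <div class="card-content">
    <h2 class="title">Senior Python Developer</h2>
    <p class="location">Stewartbury, AA</p>
    <a href="https://www.realpython.com">Learn</a>
    <a href="https://example.com/jobs/senior-python-developer-0.html">Apply</a>
  </div>
  <div class="card-content">
    <h2 class="title">Energy engineer</h2>
    <p class="location">Christopherville, AA</p>
    <a href="https://example.com/jobs/energy-engineer-1.html">Apply</a>
  </div>
</div>
"""


class JobCardParser(HTMLParser):
    """Collect title, location, and apply URL for each job card."""

    def __init__(self):
        super().__init__()
        self.jobs = []
        self._field = None  # text field ("title"/"location") currently open
        self._href = None   # href of the <a> element currently open

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div" and "card-content" in classes:
            # A new job card starts: add an empty record for it.
            self.jobs.append({"title": "", "location": "", "apply_url": None})
        elif not self.jobs:
            return  # ignore everything before the first card
        elif tag == "h2" and "title" in classes:
            self._field = "title"
        elif tag == "p" and "location" in classes:
            self._field = "location"
        elif tag == "a":
            self._href = attrs.get("href")

    def handle_data(self, data):
        if not self.jobs:
            return
        job = self.jobs[-1]
        if self._field:
            job[self._field] += data.strip()
        elif self._href and data.strip() == "Apply":
            job["apply_url"] = self._href

    def handle_endtag(self, tag):
        if tag in ("h2", "p"):
            self._field = None
        elif tag == "a":
            self._href = None


def parse_jobs(html):
    """Return a list of job dicts parsed from the main page HTML."""
    parser = JobCardParser()
    parser.feed(html)
    parser.close()
    return parser.jobs


def save_jobs(jobs, path):
    """Write the parsed jobs to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(jobs, f, indent=2)


if __name__ == "__main__":
    print(json.dumps(parse_jobs(SAMPLE_HTML), indent=2))
```

In the notebook itself, the same fields would more naturally come from Beautiful Soup lookups on the fetched page, and each `apply_url` would then be followed to collect the job description from the details page.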