# TABLED: Review a few articles reorganized by ChatGPT #44

Draft · renoirb wants to merge 1 commit into `2020` from `renoirb-patch-1`

@@ -28,135 +28,101 @@ keywords:

- static site
- convert from cms

---

I've been asked twice now to convert websites running on a CMS into static
file-based websites. This practice is useful for preserving site content
without maintaining the CMS database and backend services. The goal is to
create an exact copy of the CMS-generated site, but as simple HTML files.

It's been two times now that I've been asked to take a website that was running
on a CMS and make it static.

## Benefits of Static Sites

Converting to static sites helps with migration, as sites that won't receive
new content become folders of HTML files. The challenge is to maintain the
original structure, allowing the web server to find and serve these files.

This is a useful practice if you want to keep the site content for posterity
without having to maintain the underlying CMS. It makes it easier to migrate
sites, since the sites you know you won't add content to anymore become simply
a bunch of HTML files in a folder.

## Process

Here are the steps I took to achieve this. Keep in mind that your experience
may vary, but these steps worked for me with WordPress and ExpressionEngine.

My end goal was to make an EXACT copy of what the site is like when generated
by the CMS, BUT now stored as simple HTML files. When I say EXACT, I mean it,
even down to keeping documents at their original location among the new static
files. It means that each link had to keep the same value, BUT a file will
exist and the web server will find it. For example, if a link points to `/foo`,
the link in the page remains as-is; even though it is now a static file at
`/foo.html`, the web server will serve `/foo.html` anyway.

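That last hop is a web-server concern rather than a scraping one: the server
needs a fallback rule that maps an extensionless URL onto its `.html` file
(with nginx, for example, something along the lines of
`try_files $uri $uri.html $uri/ =404;`). Once the static copy exists (see the
steps below), a rough check like the following sketch, assuming GNU grep and
root-relative links such as `/foo`, lists links that would not find a matching
file:

```bash
# Run inside the exported site root. For every extensionless internal link
# found in the HTML, check that a matching .html file (or index.html) exists.
grep -rhoE 'href="/[^"?#.]+"' . --include='*.html' \
  | sed -e 's/^href="//' -e 's/"$//' \
  | sort -u \
  | while read -r path; do
      [ -f ".${path}.html" ] || [ -f ".${path}/index.html" ] \
        || echo "no file for: ${path}"
    done
```
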
### 1. Gathering Page URLs

- Use the browser's network inspector to capture all document requests.
- Export the list using "Save as HAR."
- Extract URLs from the HAR file using `underscore-cli`.
- Clean the list to have one URL per line in `list.txt`.

Here are a few steps I took to achieve just that. Notice that your mileage may
vary; I've done those steps and they worked for me.

I've done this procedure a few times with WordPress blogs, along with the
[**webat25.org** website, now hosted as **w3.org/webat25/**][0], which was
running on ExpressionEngine.

## Steps

## 1\. Browse and get all pages you think could be lost in scraping

We want a simple file with one web page per line, with its full address. This
will help the crawler not to forget pages.

- Use a web browser's developer tools Network inspector, and keep it open with
  "preserve log".
- Once you have browsed the site a bit, list all documents in the network
  inspector and export them using the "Save as HAR" feature.
- Extract URLs from the HAR file using `underscore-cli`:

      npm install underscore-cli
      cat site.har | underscore select '.entries .request .url' > workfile.txt

- Remove the first and last lines (it's a JSON array and we want one document
  per line)

  > **Review comment:** Keep those notes. Humans like to get human-friendly
  > explanations and/or rewording, in context, of what the docs say.

- Remove the hostname from each line (i.e. each line should start with
  `/path`); in vim you can do `%s/http:\/\/www\.example.org//g`
- Remove `"` and `",` from each line; in vim you can do `%s/",$//g`
- At the last line, make sure the `"` is removed too, because the last regex
  missed it
- Remove duplicate lines; in vim you can do `:sort u`
- Save this file as `list.txt` for the next step (a scripted version of these
  manual edits is sketched just below).

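If you prefer to script those manual edits instead of doing them in vim, here
is a minimal sketch. It assumes GNU `grep`/`sed`/`sort` and the same
placeholder hostname (`www.example.org`) used throughout this article:

```bash
# Turn workfile.txt (a JSON array dumped one string per line) into list.txt:
# drop the array brackets, strip quotes and trailing commas, remove the
# scheme and hostname, then de-duplicate.
grep -v -E '^[[:space:]]*[][],?[[:space:]]*$' workfile.txt \
  | sed -E -e 's/^[[:space:]]*"//' -e 's/",?[[:space:]]*$//' \
           -e 's#^https?://www\.example\.org##' \
  | sort -u > list.txt
```
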
## 2\. Let's scrape it all

We'll do two scrapes. The first one is to get all the assets it can, then we'll
go again with different options.

The following are the commands I ran on the last successful attempt to
replicate the site I was working on. This is not a statement that this method
is the most efficient technique. Please feel free to improve the document as
you see fit.

First, a quick TL;DR of the `wget` options:

> **Review comment:** Keep those notes.

- `-m` is the same as `--mirror`
- `-k` is the same as `--convert-links`
- `-K` is the same as `--backup-converted`, which creates `.orig` files
- `-p` is the same as `--page-requisites`, which makes a page get ALL of its
  requirements
- `-nc` ensures we don't download the same file twice and end up with
  duplicates (e.g. `file.html` AND `file.1.html`)
- `--cut-dirs` would prevent creating directories and mix things around; do not
  use it.

Notice that we're sending headers as if we were a web browser. It's up to you.

### 2. Scraping

Perform two scrapes: one for assets and one for content.

Commands for `wget`:

```bash
export UA='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.94 Safari/537.36'
export ACCEPTL='Accept-Language: fr-FR,fr;q=0.8,fr-CA;q=0.6,en-US;q=0.4,en;q=0.2'
export ACCEPTT='Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'

# First pass
wget -i list.txt \
  -nc \
  --random-wait \
  --mirror \
  -e robots=off \
  --no-cache \
  -k \
  -E \
  --page-requisites \
  --user-agent="$UA" \
  --header="$ACCEPTT" \
  http://www.example.org/

# Second pass
wget -i list.txt \
  --mirror \
  -e robots=off \
  -k \
  -K \
  -E \
  --no-cache \
  --no-parent \
  --user-agent="$UA" \
  --header="$ACCEPTL" \
  --header="$ACCEPTT" \
  http://www.example.org/
```

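Between the passes, it can help to glance at what actually landed on disk. A
small sketch, assuming `wget` created its default `www.example.org/` host
directory (no `-nH` flag was used above):

```bash
# How many HTML documents and .orig backups did we get, and how big is the
# mirror so far?
find www.example.org -type f -name '*.html' | wc -l
find www.example.org -type f -name '*.orig' | wc -l
du -sh www.example.org
```
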
### 3. Cleanup

Commands to clean up the fetched files:

```bash
# Strip carriage returns from the .orig files (they are the copies we keep)
find . \
  -type f \
  -regextype posix-egrep \
  -regex '.*\.orig$' \
  -exec sed -i 's/\r//' {} \;

# Rename .orig files to .html
find . -name '*orig' \
  | sed -e "p;s/orig/html/" \
  | xargs -n2 mv

# Collapse duplicated .html.html in filenames
find . -type f -name '*\.html\.html' \
  | sed -e "p;s/\.html//" \
  | xargs -n2 mv

# Move dir/index.html to dir.html (simplify folders that only hold index.html)
find . -type f -name 'index.html' \
  | sed -e "p;s/\/index\.html/.html/" \
  | xargs -n2 mv

# Remove numbered duplicates such as file.1.html
find . -type f -name '*\.1\.*' \
  -exec rm -rf {} \;
```

## 3\. Do some cleanup on the fetched files

Here are a few commands I ran to clean the files a bit.

- Strip the carriage returns from every `.orig` file. They're the ones we'll
  use in the end, after all.

```bash
find . -type f -regextype posix-egrep -regex '.*\.orig$' -exec sed -i 's/\r//' {} \;
```

- Rename the `.orig` files into `.html`, then collapse any doubled
  `.html.html` names.

```bash
find . -name '*orig' | sed -e "p;s/orig/html/" | xargs -n2 mv

find . -type f -name '*\.html\.html' | sed -e "p;s/\.html//" | xargs -n2 mv
```

- Many folders might have only an `index.html` file in them. Let's just make
  them a file without a directory.

```bash
find . -type f -name 'index.html' | sed -e "p;s/\/index\.html/.html/" | xargs -n2 mv
```

- Remove files that have a `.1` (or any number) in them; they are most likely
  duplicates anyway.

```bash
find . -type f -name '*\.1\.*' -exec rm -rf {} \;
```

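One more check that can be worth running after the cleanup (an addition, not
part of the original steps): look for pages that still point at the live
hostname, which usually means `--convert-links` missed some links. The domain
below is the same placeholder used in the `wget` commands.

```bash
# List HTML files that still reference the original host.
grep -rIl 'http://www\.example\.org' . --include='*.html'
```
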
[0]: https://www.w3.org/webat25/

Please note that the `wget` commands and cleanup steps are based on my
experience and may require adjustments for your specific case.

> **Review comment:** There's probably something better than underscore-cli
> today.

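For example, `jq` can pull the same list of URLs straight out of the HAR file;
a minimal sketch, assuming the standard HAR layout
(`.log.entries[].request.url`):

```bash
# Equivalent of the underscore-cli step, using jq.
jq -r '.log.entries[].request.url' site.har > workfile.txt
```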