Basic
WebScrapBook provides many options and can be used in several ways. The three principal approaches are described below:
The first approach (saving to `File`) mimics the native browser saving functionality and saves each captured page as an independent file, which can then be found and viewed from the file manager. A save path can be chosen for each captured page if the browser is configured to ask where to save each downloaded file.
- Set `Save captured data to:` to `File`.
- Set `Save captured data as:` to the desired saving format. `Single HTML` is generally most convenient. Additional configuration is required to open archive files directly from the file manager if set to `HTZ package` or `MAFF package`.
- Optionally set `Filename to save:` to a desired default filename for the captured page, such as `%title%`.
The second approach (saving to `Scrapbook folder`) stores every captured page under a specified directory, where it can then be found and viewed from the file manager. It supports saving the captured page as a folder and can be configured to generate sub-directories. However, due to browser security restrictions, only a sub-directory of `<default download folder>` can be specified as the target.
- Set `Save captured data to:` to `Scrapbook folder`.
- Set `Save captured data as:` to the desired saving format. `Folder` supports the most features and `Single HTML` is more convenient. Additional configuration is required to open archive files directly from the file manager if set to `HTZ package` or `MAFF package`.

  For Google Chrome or some Chromium-based browsers with this option set to `Folder`, it's recommended to uncheck the `Ask where to save each file before downloading` browser option (located at `chrome://settings/downloads`), or every file to be saved will trigger a prompt (see Known issues for details).
- Optionally set `Scrapbook folder:` to put the captured pages under another directory under `<default download folder>`.
- Optionally set `Filename to save:` to a desired default filename for the captured page, such as `%title%`, or `%create-Y%/%create-m%/%title%` if organizing files by time is desired.
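The time-based filename pattern can be pictured with a small Python sketch. It only illustrates how placeholders such as `%create-Y%`, `%create-m%`, and `%title%` might expand. The `expand_filename` helper is hypothetical, and the assumption that these placeholders mean a four-digit year and a two-digit month is not confirmed by the text above. WebScrapBook's real expansion may handle more placeholders and sanitize filenames differently.

```python
import re
from datetime import datetime


def expand_filename(template: str, title: str, created: datetime) -> str:
    """Expand a few %name% placeholders (hypothetical helper for illustration)."""
    values = {
        "title": title,
        "create-Y": f"{created.year:04d}",   # assumed: four-digit year
        "create-m": f"{created.month:02d}",  # assumed: two-digit month
    }
    # Replace each %name% with its value; unknown placeholders are left as-is.
    return re.sub(
        r"%([\w-]+)%",
        lambda m: values.get(m.group(1), m.group(0)),
        template,
    )


print(expand_filename("%create-Y%/%create-m%/%title%", "Example Page", datetime(2020, 1, 2)))
# prints: 2020/01/Example Page
```

Under these assumptions, each captured page lands in a year and month sub-directory, which is what makes the pattern useful for organizing files by time.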
The third approach (saving to `Backend server`) requires setting up a backend server. Captured files are saved directly to the backend server and can be accessed through the sidebar (`toolbar button > Open scrapbook`) in a browser with the WebScrapBook extension installed.
- Install the PyWebScrapBook package.
- Follow the instructions to set up the backend server.

  For example, to host `C:\Users\MyUserName\WebScrapBook` as a scrapbook, change the working directory to it in the command prompt and run `wsb config -ba` to generate config files. Then run `C:\Users\MyUserName\WebScrapBook\.wsb\serve.py` to start the server (do not close the window that opens unless you intend to shut down the server).

  For more advanced configuration of the backend server, see here.
- Enter WebScrapBook options and set `Address`, `User`, and `Password` of the `Backend server`. (These default to `http://localhost:8080/` and a blank user and password if the backend server is not otherwise configured.)
- Set `Save captured data to:` to `Backend server`.
- Set `Save captured data as:` to the desired saving format. `Folder` is generally most recommended, and `HTZ package` or `MAFF package` can be used to save space and reduce the number of files. This setting only affects newly captured web pages; existing files are not affected, and data in different saving formats can coexist without problems.

  Although `Single HTML` is supported, it takes more space, cannot preserve certain complicated information, and does not support certain advanced features, and thus is generally not recommended.

  PyWebScrapBook provides a `wsb convert` utility that can convert the file format on demand.
- Optionally set `Filename to save:` to a desired value, though the default `%ID%` is most recommended to prevent potential compatibility issues.
- Start the backend server before usage, and then capture a web page or access captured data via the sidebar.

  To start the backend server automatically when the device is booted or when the user logs in, add the starting file `.wsb/serve.py` to a system service or to something like the startup folder on Windows.

  To hide the command prompt of the backend server on Windows, rename `.wsb/serve.py` to `.wsb/serve.pyw`. To shut down a backend server with a hidden command prompt, use the task manager.
- Optionally generate a site index on the backend server (`toolbar button > Options > Run indexer`) for clients without WebScrapBook to browse the captured pages.
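Since the backend server must be running before captures or sidebar access can work, a quick reachability check can help when troubleshooting. The sketch below only tests whether something is listening on the host and port of the configured address. The `is_backend_listening` helper is hypothetical and does not verify that the listener is actually a PyWebScrapBook server; the default address `http://localhost:8080/` is the one mentioned in the steps above.

```python
import socket
from urllib.parse import urlsplit


def is_backend_listening(address: str = "http://localhost:8080/", timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the address's host and port succeeds."""
    parts = urlsplit(address)
    host = parts.hostname or "localhost"
    # Fall back to the scheme's usual port when the URL gives none.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host not resolvable.
        return False
```

Calling `is_backend_listening()` with no arguments checks the default address; pass the value of the `Address` option if the server is configured differently.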
- Open a web page and wait until it's completely loaded and ready for capture.

  NOTE: A web page may load resources dynamically using scripts, so it may be necessary to wait for a while, scroll down the page, or perform some interactive operations to ensure that the wanted contents and resources are loaded.
- Click the `toolbar button` (also called the `browser action button`), and select `capture tab` from the dropdown list. A capture dialog will appear and the capture will start. After the capture succeeds, data will be saved in the previously configured way (see the section above for details).
- If there's a selection in the web page, `capture tab` will capture only the selected range. Some browsers (such as Firefox) support selecting multiple ranges by holding Ctrl, and all of them will be captured.
- `Capture tab` captures the content currently shown on the screen (and the tab cannot be closed before the capture completes). `Capture tab (source)` captures the original HTML content of the web page before it is processed by scripts. `Capture tab (bookmark)` captures the page as a bookmark file (or a bookmark item in a scrapbook), which can then be opened to visit the source web page. `Capture as...` opens a dialog for customization before a capture.
- Use `Edit tab` to start annotating or editing the web page. The edited web page can be captured afterwards.
- Various web page elements can be captured individually through the context menu. For example, right-click on a link to capture the linked web page, right-click on an image or media to capture it, right-click on a frame to capture the frame page, right-click in a page or a frame page with something selected to capture only the selected range, etc.
- The keyboard shortcuts for various WebScrapBook functions can be customized using the built-in browser shortcuts manager.
- With a backend server configured, a capture with positioning can be invoked by dragging the `capture tab` (or similar) command button onto a desired position in the sidebar.
- Similarly, a capture with positioning can be invoked by dragging a link, image, etc., from a page.
- When clicking or dragging a toolbar button command like `capture tab`, hold `Shift` to toggle between capturing the tab and capturing its source, hold `Alt` to capture as a bookmark, or hold `Ctrl` to open a dialog for customization.
A batch capture, which captures multiple web pages one by one, can be invoked in the following ways:
- Hold Ctrl or Shift to select multiple tabs, then invoke `capture tab` (or a similar command) from the toolbar button.
- Invoke `batch capture all tabs` from the toolbar button. All tabs will be pre-filled in the dialog that opens, where the list can be edited before capturing.
- Invoke `batch capture selected links` from the toolbar button or context menu. All links, or all selected links, in the page will be pre-filled in the dialog that opens, where the list can be edited before capturing.
To capture images, audio, or other resource files that a web page links to, go to the `capture - capture links` options, set `download linked files` to `match URL file extension` or `match HTTP header and file extension`, and set appropriate conditions in `included file types for downloading linked files`. WebScrapBook will then save those linked files along with the web page when capturing it.
To capture linked web pages along with a web page, go to `capture - capture links`, set `depth to capture linked pages` to a positive integer, and configure the filter rules in `included URLs for capturing linked pages`. WebScrapBook will then capture the linked web pages that match the rules, rebuild the links between the captured pages, and generate a resource map. An item captured in depth is marked as a "site" type.
For example, for a web page at `http://example.com/foo` that has a hyperlink targeting `http://example.com/bar`, use the filter `http://example.com/bar` to additionally capture that page, or use the filter `/^http://example\.com//` to additionally capture the linked web pages under the same domain.
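The effect of the two filters in this example can be approximated in Python. This is a sketch only: it assumes that a plain-text filter must equal the URL exactly and that a `/.../` filter behaves like a regular expression searched against the URL. WebScrapBook's real matching may differ in details, and the `matches` helper is hypothetical.

```python
import re


def matches(url: str, rule: str) -> bool:
    """Approximate one filter rule: '/pattern/flags' is a regex, anything else is literal."""
    m = re.fullmatch(r"/(.*)/([a-z]*)", rule)
    if m:
        flags = re.IGNORECASE if "i" in m.group(2) else 0
        return re.search(m.group(1), url, flags) is not None
    return url == rule


links = [
    "http://example.com/bar",
    "http://example.com/baz",
    "http://other.example.org/",
]

# Literal filter: only the exact page is selected.
print([u for u in links if matches(u, "http://example.com/bar")])
# prints: ['http://example.com/bar']

# Regex filter: every linked page under http://example.com/ is selected.
print([u for u in links if matches(u, r"/^http://example\.com//")])
# prints: ['http://example.com/bar', 'http://example.com/baz']
```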
Hint: `depth to capture linked pages` can also be set to `0` to generate a resource map while capturing only the current page. A "merge-capture", described later, can then be performed to add other web pages to the item.
NOTE: It's recommended to set `Save captured data as` to `Folder` when performing an in-depth capture. Although `HTZ package` and `MAFF package` are also permitted, capturing too many pages into them may cause a capture failure due to memory exhaustion, or poor performance when annotating a page in a large archive file, and merge-capture is not available for them.
If the backend server is configured, a re-capture can be invoked from the context menu of a web page, site, bookmark, or similar item. Alternatively, in the `capture tab as...` dialog, set `capture type` to `re-capture` and select a suitable `target item`.
A re-capture replaces the content of the original item, updates its item type, index, modified time, source URL, favicon, title, and comment, and attempts to copy annotations from the original web page (though this may fail if the two versions differ significantly). After the re-capture succeeds, the original web page is moved to the backup directory (`.wsb/backup/` by default). The automatic backup can be disabled through the `backup when capturing again` option.
If the backend server is configured, a merge-capture can be performed to capture a web page and merge it into a previously captured item. To do this, in the `capture tab as...` dialog, set `capture type` to `merge-capture` and select a suitable `target item`.
The target item for a merge-capture requires a resource map (`index.json`), which is generated only for a site, i.e. when `depth to capture linked pages` was set to 0 or more at capture time. Additionally, its captured data must have been saved as a folder.
Except for the main page, a merge-capture determines whether a resource (page or file) already exists via the resource map. An existing resource will not be downloaded again, even if it has changed on the original site. It's generally not recommended to perform a merge-capture too long after the original capture to avoid an inconsistency (e.g. a new version page referencing an old version resource).
To update resources, edit the resource map file, delete their entries from the `files` property, and then perform a merge-capture for the referencing page.
A merge-capture may miss redirects. For example, a merge-capture on the page `http://example.com/redirected` cannot rewrite a hyperlink `http://example.com/link` that redirects to it in the already captured page `http://example.com/main`. To fix this issue, edit the resource map, add `["http://example.com/link", "http://example.com/redirected"]` to the `redirects` property, and perform a merge-capture on `http://example.com/main` (or another page) to trigger link rebuilding.
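Both manual edits described above (dropping `files` entries so that resources are fetched again, and adding a `redirects` pair) could be scripted. The following sketch assumes that `index.json` is a JSON object whose `redirects` property is a list of `[from, to]` pairs, as the example pair suggests, and whose `files` property is a list of objects carrying a `url` key. The `url` key, the sample data, and the `edit_resource_map` helper are assumptions, so inspect a real resource map and back it up before adapting this.

```python
import json


def edit_resource_map(data: dict, stale_urls: set, new_redirects: list) -> dict:
    """Return a copy of the resource map with stale files removed and redirects added."""
    result = dict(data)
    # Drop entries for resources that should be downloaded again (assumed "url" key).
    result["files"] = [
        entry for entry in data.get("files", [])
        if entry.get("url") not in stale_urls
    ]
    # Append redirect pairs that are not already recorded.
    redirects = [list(pair) for pair in data.get("redirects", [])]
    for pair in new_redirects:
        if list(pair) not in redirects:
            redirects.append(list(pair))
    result["redirects"] = redirects
    return result


# Made-up sample data for illustration only.
sample = {
    "files": [
        {"url": "http://example.com/main"},
        {"url": "http://example.com/style.css"},
    ],
    "redirects": [],
}
updated = edit_resource_map(
    sample,
    stale_urls={"http://example.com/style.css"},
    new_redirects=[["http://example.com/link", "http://example.com/redirected"]],
)
print(json.dumps(updated, indent=2))
```

After writing the edited map back to the item's folder, a merge-capture on the referencing page is still needed to download the dropped resources and rebuild links.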
Capture helpers allow customization for specific sites. Check `enable capture helpers` in the options and set an adequate JSON config to make it work. Below are some usage examples:
[
  {
    "name": "DeferredImageFixer",
    "description": "Save deferred images defined by data-*",
    "commands": [
      ["attr", {"css": "img[data-src]"}, "src", ["get_attr", null, "data-src"]],
      ["attr", {"css": "img[data-srcset]"}, "srcset", ["get_attr", null, "data-srcset"]]
    ]
  }
]
[
  {
    "description": "Don't capture images on this site",
    "pattern": "/^https?://example\\.com//i",
    "options": {
      "capture.image": "blank",
      "capture.imageBackground": "blank"
    }
  }
]
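Because the helper config is JSON, backslashes in a `pattern` must be doubled, as in `example\\.com` above. One way to avoid escaping mistakes is to generate the config with a script. The sketch below rebuilds the second example with Python's standard `json` module; the variable names are illustrative, and only keys shown in the examples above are used.

```python
import json

# In Python source a raw string keeps the single backslash of the regex;
# json.dumps then doubles it in the generated JSON text.
pattern = r"/^https?://example\.com//i"

helpers = [
    {
        "description": "Don't capture images on this site",
        "pattern": pattern,
        "options": {
            "capture.image": "blank",
            "capture.imageBackground": "blank",
        },
    }
]

config_text = json.dumps(helpers, indent=2)
print(config_text)

# Round-trip check: parsing the text gives back the original pattern.
assert json.loads(config_text)[0]["pattern"] == pattern
```

The printed text can be pasted into the capture helpers option as-is.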
Check `enable auto-capture` in the options and set an adequate JSON config to make it work. Below are some usage examples:
[{}]
[{"pattern": "/^https?://example\\.com//"}]
[{"delay": 10000}]
[{"repeat": 60000}]
[{"taskInfo": {"mode": "bookmark"}}]
- `save captured data to` set to `scrapbook folder`.
- `scrapbook folder` set to `WebScrapBook/data`.
[{"taskInfo": {"options": {"capture.saveFolder": "WebScrapBook/data/autocaptures"}}}]
- `save captured data to` set to `backend server`.
[{"taskInfo": {"parentId": "20200101020304567"}}]
- `save captured data to` set to `backend server`.
[{"taskInfo": {"index": 0}}]
- `save captured data to` set to `backend server`.
[{"eachTaskInfo": {"comment": "#autocapture"}}]
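Before pasting an auto-capture config into the options, it can be sanity-checked with a short script. The sketch below is a hypothetical checker, not part of WebScrapBook: it only knows the keys that appear in the examples above (the real option may accept others), expects a JSON array as in those examples, and uses Python's regex engine as a rough stand-in for the browser's when testing that a `/.../` pattern compiles.

```python
import json
import re

# Keys that appear in the examples above; the real option may accept others.
KNOWN_KEYS = {"pattern", "delay", "repeat", "taskInfo", "eachTaskInfo"}


def check_auto_capture_config(text: str) -> list:
    """Return a list of human-readable problems found in the config text."""
    try:
        rules = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(rules, list):
        return ["top level must be a JSON array"]
    problems = []
    for i, rule in enumerate(rules):
        if not isinstance(rule, dict):
            problems.append(f"rule {i}: must be a JSON object")
            continue
        for key in sorted(set(rule) - KNOWN_KEYS):
            problems.append(f"rule {i}: key {key!r} does not appear in the examples")
        pattern = rule.get("pattern")
        m = re.fullmatch(r"/(.*)/([a-z]*)", pattern) if isinstance(pattern, str) else None
        if m:
            try:
                re.compile(m.group(1))  # Python syntax, only an approximation
            except re.error as exc:
                problems.append(f"rule {i}: pattern does not compile: {exc}")
    return problems


print(check_auto_capture_config(r'[{"pattern": "/^https?://example\\.com//", "delay": 10000}]'))
# prints: []
print(check_auto_capture_config('[{"patern": "/x/"}]'))
# prints: ["rule 0: key 'patern' does not appear in the examples"]
```

An empty list means the checker found nothing suspicious; it does not guarantee that WebScrapBook will accept or act on the config.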