A yellow message is a warning rather than an error, and means that some information in the web page cannot be handled correctly. The capture is generally considered successful even if there is a warning.
The cause of a warning can be determined from its source URL and error message. Common causes are:
- Contents not following the specification in the source page: such bad content may be written by the source page or by another extension.
- Streamed video: presents as a `<video>` element with a URL starting with `blob:`. WebScrapBook cannot capture such content.
- Revoked dynamic object: mostly because the source page revoked a dynamic object URL after creation. Such a URL usually starts with `blob:`. WebScrapBook cannot capture such content.
- Extension specific content: such a URL usually starts with `chrome-extension:` or `moz-extension:`. WebScrapBook cannot capture content from another browser extension.
- Capture options related: which can be either abnormal or normal (such as resource size limit).
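When reading a capture log, the URL prefixes named in the list above can be checked mechanically. The following is a minimal Python sketch and not part of WebScrapBook; the function name `uncapturable_reason` and the explanation strings are hypothetical and chosen only for illustration.

```python
from typing import Optional
from urllib.parse import urlsplit

# URL schemes named in the list above, mapped to a short explanation.
# The explanation strings are illustrative, not WebScrapBook messages.
UNCAPTURABLE_SCHEMES = {
    "blob": "streamed video or revoked dynamic object URL",
    "chrome-extension": "content owned by another browser extension",
    "moz-extension": "content owned by another browser extension",
}


def uncapturable_reason(url: str) -> Optional[str]:
    """Return why a warning's source URL likely cannot be captured, or None."""
    scheme = urlsplit(url).scheme.lower()
    return UNCAPTURABLE_SCHEMES.get(scheme)
```

A `None` result only means the URL scheme is not one of the three listed; the warning may still come from non-conforming content or from a capture option.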
When a warning is shown, first check whether the captured web page is correct, and simply ignore the warning if no real problem is seen. If something is really wrong, try disabling other extensions (a refresh of the source web page may be required afterwards) or changing the capture options to fix it. If the problem persists, raise an issue with the complete log for investigation.
The browser has a built-in blacklist to restrict extensions from accessing the extension store or related domains to avoid a security hazard by a malicious extension. For example, Firefox forbids accessing Firefox Add-ons and some Mozilla related sites, Google Chrome forbids Chrome Web Store, and Opera forbids Opera Addons.
Since WebScrapBook cannot access the built-in blacklist, it can only show a general error message when it fails to access the content, such as "the extensions gallery cannot be scripted", "missing host permission for the tab", or more generally, "network request failed".
To work around the problem, the easiest way is to use another browser. Alternatively, some browsers allow custom configuration to remove the restriction (at your own risk):
- Firefox: open `about:config` and modify the following entries:
This could be because the web page uses images in WebP format, a newer image format that improves the data compression ratio and transmission speed but is only supported by Chromium-based browsers and Firefox ≥ 65. As a result, WebP images captured by Google Chrome are not visible in other browsers.
Opening the captured page with a browser that supports WebP solves the issue. To get saved images viewable by more browsers, capture the web page with a browser that does NOT support WebP, as the source page is likely to provide alternative formats, such as PNG or JPEG, for browsers not supporting WebP.
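To confirm that WebP is the cause, one option is to scan the captured folder for WebP files. The Python sketch below is not a WebScrapBook feature; the function names are hypothetical. It checks file content rather than file extension, on the assumption that the extension of a captured file may not reflect its actual format, and relies on the WebP container layout (`RIFF`, a 4-byte size, then `WEBP`).

```python
from pathlib import Path
from typing import List


def is_webp(header: bytes) -> bool:
    """WebP files are RIFF containers: 'RIFF', a 4-byte size, then 'WEBP'."""
    return len(header) >= 12 and header[:4] == b"RIFF" and header[8:12] == b"WEBP"


def find_webp_files(folder: str) -> List[Path]:
    """List files under `folder` whose first 12 bytes look like WebP."""
    found = []
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            with path.open("rb") as fh:
                if is_webp(fh.read(12)):
                    found.append(path)
    return found
```

If the list is non-empty, the captured page contains WebP images and may display incompletely in a browser without WebP support.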
Some web pages draw images using canvas, and some canvases clear the related information after the image is drawn, so the browser cannot retrieve it and the saved image is entirely blank. Capturing such a canvas requires rewriting the original drawing code, and since that code differs on every site, writing a customized tool is the only way to go. A simpler workaround is to use a screenshot instead.
The saved scripts (primarily JavaScript) and plugins (such as Java applets or Flash) may not work normally, as the captured page has left the original server environment and has had part of its content rewritten. In certain rare cases there could even be a security hazard.
In some cases the captured scripts cannot possibly work; for example, we cannot expect a captured Gmail page to actually send and receive email. In some cases the failure is caused by the rewriting of the captured page, and changing the capture options may fix it: it is generally recommended to set capture options to "save" and the "canvas" option to "save initial status" to avoid a potential DOM error in the captured scripts. In some cases a slight modification can get the captured scripts to work, but this depends heavily on how they were written, and there is no clear rule for it.
Some web pages load images only after the screen is scrolled down. Is there a way to capture the complete page without scrolling the screen?
As the deferred loading technique differs among sites, there's no simple way that works universally. Capture helper can be used for some simple cases.
For example, most web pages record the real image source URL in the `data-src` (or `data-srcset`) attribute, and set the `src` (or `srcset`) attribute to that value when scrolling is detected. A capture helper can be configured this way to automatically rewrite the `src` (or `srcset`) attributes:
[
  {
    "description": "save deferred images defined by data-*",
    "commands": [
      ["attr", {"css": "img[data-src]"}, "src", ["get_attr", null, "data-src"]],
      ["attr", {"css": "img[data-srcset]"}, "srcset", ["get_attr", null, "data-srcset"]]
    ]
  }
]
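To make the effect of this rewrite concrete, the sketch below emulates it offline in Python on a small snippet. It is only an illustration of the attribute copy described above, not how WebScrapBook implements capture helpers; the function name `promote_deferred_sources` is hypothetical, and the sample markup is assumed to be well-formed XML so that the standard library parser can read it.

```python
import xml.etree.ElementTree as ET

# (deferred attribute, attribute the browser actually reads)
DEFERRED_ATTRS = (("data-src", "src"), ("data-srcset", "srcset"))


def promote_deferred_sources(markup: str) -> str:
    """Copy data-src/data-srcset onto src/srcset for every <img> element."""
    root = ET.fromstring(markup)  # assumes well-formed markup
    for img in root.iter("img"):
        for deferred, real in DEFERRED_ATTRS:
            value = img.get(deferred)
            if value is not None:
                img.set(real, value)
    return ET.tostring(root, encoding="unicode")


sample = (
    '<div>'
    '<img src="placeholder.gif" data-src="photo.jpg" />'
    '<img src="logo.png" />'
    '</div>'
)
result = promote_deferred_sources(sample)
```

After the call, the first image points at `photo.jpg` instead of the placeholder, while the second image, which has no `data-src`, is left unchanged.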
For Chromium-based browsers, checking the "Allow access to file URLs" option for WebScrapBook does the trick. Firefox, on the other hand, does not allow extensions to access file URLs due to security concerns.
Use the built-in archive page viewer, install PyWebScrapBook or another assistant tool, or open the archive after unzipping it. The approach with the fewest dependencies is a Chromium-based browser with the "Allow access to file URLs" option checked, while the best supported approach is PyWebScrapBook. See the View Archive Files page for more details.
Note that each approach has its own limitations. For example, for the built-in archive page viewer:
- A very large zip archive file (around 2 GiB) cannot be read by the browser.
- A large file in the zip archive (around 400~500 MiB) may exhaust the memory and crash the extension.
- Scripts cannot run in the archive page viewer due to a security concern and the rule of the browser extension stores.
- An archive page in a frame cannot be viewed via the archive page viewer directly due to an unfixed bug.
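The size limits in the list above can be checked before opening an archive in the viewer. The following Python sketch is an assumption-laden helper, not part of WebScrapBook or PyWebScrapBook: the function name `viewer_warnings` is hypothetical, and the thresholds are the approximate figures quoted above, which in practice vary by browser and machine.

```python
import zipfile
from pathlib import Path
from typing import List

GIB = 1024 ** 3
MIB = 1024 ** 2


def viewer_warnings(
    archive_path: str,
    max_archive_bytes: int = 2 * GIB,   # "around 2 GiB" from the list above
    max_entry_bytes: int = 400 * MIB,   # lower end of "400~500 MiB"
) -> List[str]:
    """Return warnings for a ZIP-based archive (HTZ or MAFF) that may be too large."""
    warnings = []
    if Path(archive_path).stat().st_size >= max_archive_bytes:
        warnings.append("archive file may be too large for the browser to read")
    with zipfile.ZipFile(archive_path) as zf:
        for info in zf.infolist():
            # file_size is the uncompressed size of the entry
            if info.file_size >= max_entry_bytes:
                warnings.append(
                    f"large entry: {info.filename} ({info.file_size // MIB} MiB)"
                )
    return warnings
```

An empty list means neither size threshold was reached; the script and frame limitations in the list still apply.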
It depends: every save format has its strengths and weaknesses, which is why they all exist. The tooltip of the related capture option describes the characteristics of each save format, and here is a quick summary:
Feature | Folder | HTZ | MAFF | Single HTML |
---|---|---|---|---|
feasibility for capture[1] | high | moderate | moderate | low |
performance for capture[2] | low | highest | high | moderate |
size[3] | moderate | smallest | small | large |
loading speed | fast | moderate | moderate | slow |
convenience for viewing[4] | moderate | low | lowest | high |
cross-platform compatibility[5] | highest | high | high | low |
feasibility for editing[6] | high | moderate | moderate | low |
feasibility for format conversion[7] | highest | high | moderate | low |
dispatchability as a static site[8] | yes | no | no | yes |
compatibility with a VCS[9] | high | low | low | moderate |
[1]: Folder mimics the original web site structure, is the most reliable, and supports the most use cases, while the other formats are single-file, don't work if the total size is too large, and do not support merge capture. Single HTML has a size limitation for every embedded file (even stricter if it must be loadable by older browsers) due to the data URL technique used; it also cannot retain certain complicated circular-referencing data structures in the source page, and does not support in-depth capture.
[2]: The "Folder" format is slow to capture because of the large number of file entries the browser has to deal with. When saving the captured data to the scrapbook folder in a Chromium-based browser, checking "Ask where to save each file before downloading" causes a prompt for every single file; in addition, several file formats (such as .js or .exe) are blocked by the browser, and the user has to select "Keep" manually for each of them to download them. This issue can be bypassed by saving to the backend server rather than the scrapbook folder, though.
[3]: File size is reduced in HTZ and MAFF as they both use compression. In single HTML, every embedded file grows to about 4/3 of its original size due to the data URI technique used. Additionally, a single HTML file can bloat if the source page refers to the same large resource repeatedly, as no good deduplication technique is available.
[4]: A single HTML file can be easily viewed with a supported browser. Folder requires the user to locate the index page in the folder first and is less convenient. HTZ and MAFF require an assistant tool, which is not available for all platforms, and are thus worse; MAFF is worse still, as its schema is more complicated than HTZ and an assistant tool supporting it is harder to write. If the sidebar and a backend server are used, all of them are equally easy to view, though.
[5]: In the worst case, a ZIP-based archive can always be decompressed and viewed as plain web pages. A single HTML file requires the browser to support certain techniques such as data URLs, the srcdoc attribute of iframe, and CSS variables, and thus has poor cross-browser compatibility, especially with old browsers.
[6]: The backend server doesn't perform well when saving changes to a ZIP-based archive file, due to the need to decompress and recompress, or to a single HTML file, due to its large size. For manual editing of the source code, on the other hand, a single HTML file is usually more difficult to edit due to the long embedded data URL strings, and the text editor may perform poorly or even crash when dealing with them.
[7]: Folder and HTZ are essentially interchangeable, but the latter additionally requires support for ZIP manipulation. MAFF has a more complicated internal structure and is harder for a tool to deal with. A single HTML file is much more complicated and cannot retain certain information in the source page, so format conversion is generally very difficult for an application to support.
[8]: For HTZ or MAFF, the server must run PyWebScrapBook or another specially designed application, or the user has to install WebScrapBook in the browser or install another assistant tool to view the captured page directly.
[9]: Most version control systems (VCS) like Git and Mercurial are text-based. For a ZIP-based archive, an extension is required to support diffing, and merging is generally not supported. For a single HTML file, the VCS should be able to handle the HTML code text, but may have a problem handling the large embedded data URL strings.
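The 4/3 growth mentioned in note [3] can be reproduced with a few lines of Python. This sketch assumes the embedded files are Base64-encoded inside the data URL, which is where the 4/3 factor comes from; the function name `data_url_size` and the random payload standing in for an image are illustrative only.

```python
import base64
import os


def data_url_size(payload: bytes, mime: str = "image/png") -> int:
    """Length in characters of a Base64 data URL that embeds `payload`."""
    encoded = base64.b64encode(payload).decode("ascii")
    return len(f"data:{mime};base64,{encoded}")


payload = os.urandom(300_000)  # stand-in for a 300 kB image file
ratio = data_url_size(payload) / len(payload)
print(f"{ratio:.3f}")  # about 1.333, i.e. 4/3 of the original size
```

Because every reference to a resource carries its own copy of the encoded data, a page that refers to the same large file several times multiplies this cost, which is the bloat described in the same note.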
Due to the browser's security restrictions, WebScrapBook can automatically download a file only to a subfolder of the browser's default download folder. Here are some workarounds:
If you simply want to capture as a single file, you can set `Save captured data to` to "File", and configure the browser to ask for the saving location for each download. This way you'll be able to pick any path (usually defaulting to the previously picked one) to save the capture.
To save captures automatically, you can create a symbolic link (or directory junction in Windows) to achieve this indirectly.
For example, to save to `D:\ScrapBooks` when `<default download folder>` is `C:\User\Test\Downloads` in Windows, set `Scrapbook folder:` to `WebScrapBook` and create a junction using the command below in CMD:
MKLINK /J "C:\User\Test\Downloads\WebScrapBook" "D:\ScrapBooks"
Replace `/J` with `/D` to create a symbolic link instead, with similar effect. However, creating a symbolic link requires administrator privilege.
In Linux, use this command instead:
ln -s /path/to/desired/directory ~/Downloads/WebScrapBook
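For a cross-platform alternative to the two commands above, the same link can be created from Python. This is a minimal sketch under stated assumptions: the function name `link_scrapbook_folder` is hypothetical, the paths are supplied by the caller, and on Windows `symlink_to` creates a symbolic link (not a junction), so the privilege requirement mentioned above still applies.

```python
from pathlib import Path


def link_scrapbook_folder(
    download_dir: Path, target_dir: Path, name: str = "WebScrapBook"
) -> Path:
    """Create <download_dir>/<name> as a symbolic link pointing at target_dir."""
    link = download_dir / name
    # exists() follows links, so also check is_symlink() to catch broken links
    if link.is_symlink() or link.exists():
        raise FileExistsError(f"{link} already exists")
    link.symlink_to(target_dir, target_is_directory=True)
    return link


# Example call (paths are placeholders):
# link_scrapbook_folder(Path.home() / "Downloads", Path("/path/to/desired/directory"))
```

The function refuses to overwrite an existing entry, so an existing `WebScrapBook` folder in the download directory has to be moved away first.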
Simply set up a backend server at the desired directory, and you'll be able to save captured data there. See the basic page for setup details.
WebScrapBook by default does its best to preserve complete information of the source page, and thus takes more space. If size is a concern, however, several options can be tweaked to eliminate minor information.
For example, setting `images`, `audios`, and `videos` to "save current", setting `rewrite styles` to "advanced tidy", and/or setting `remove hidden elements` to "remove undisplayed elements" can significantly reduce size in most cases, and these are actually the default behavior of many other web page capture tools. Be sure to read the tooltip of the related option before making such a decision, though, as there may be side effects.
Also consider the data format: a ZIP-based archive format usually results in a smaller size than the other ones. Check the related topic for more details.
Yes. Put the related web page files under a folder, install PyWebScrapBook, and then run this command in the CLI:
wsb convert file2wsb /path/to/data-folder /path/to/webscrapbook
to import them into a newly generated scrapbook. You can then transfer items from the new scrapbook to another scrapbook.
How do I add web pages captured to the local disk to the scrapbook of PyWebScrapBook backend server?
Copy the physical files of these web pages to the configured `data` directory of the scrapbook, start the backend server, go to the options page, check `Automatically generate metadata for detected new pages and add them to root TOC`, and run the checker.
Alternatively, enter the root directory of the scrapbook in the CLI and run:
wsb check --resolve-unindexed-files
Or import them to a new scrapbook and then transfer its items to another scrapbook using the above method.