Commit

Merge branch 'r/1.2.0'
bitdruid committed Jun 8, 2024
1 parent 9ac5c53 commit f0966d0
Showing 14 changed files with 12,601 additions and 253 deletions.
64 changes: 42 additions & 22 deletions README.md
@@ -1,4 +1,4 @@
# archive wayback downloader
# python wayback machine downloader

[![PyPI](https://img.shields.io/pypi/v/pywaybackup)](https://pypi.org/project/pywaybackup/)
[![PyPI - Downloads](https://img.shields.io/pypi/dm/pywaybackup)](https://pypi.org/project/pywaybackup/)
@@ -7,15 +7,17 @@

Downloading archived web pages from the [Wayback Machine](https://archive.org/web/).

Internet-archive is a nice source for several OSINT-information. This script is a work in progress to query and fetch archived web pages.
The Internet Archive is a valuable source of OSINT information. This tool is a work in progress for querying and fetching archived web pages.

This tool allows you to download content from the Wayback Machine (archive.org). You can use it to download either the latest version or all versions of web page snapshots within a specified range.

## Installation

### Pip

1. Install the package <br>
```pip install pywaybackup```
2. Run the script <br>
2. Run the tool <br>
```waybackup -h```

### Manual
@@ -26,30 +28,25 @@ Internet-archive is a nice source for several OSINT-information. This script is
```pip install .```
- in a virtual env or use `--break-system-package`

## Usage

This script allows you to download content from the Wayback Machine (archive.org). You can use it to download either the latest version or all versions of web page snapshots within a specified range.

### Arguments
## Arguments

- `-h`, `--help`: Show the help message and exit.
- `-a`, `--about`: Show information about the script and exit.
- `-a`, `--about`: Show information about the tool and exit.

#### Required Arguments
### Required

- `-u`, `--url`: The URL of the web page to download. This argument is required.

#### Mode Selection (Choose One)

- `-c`, `--current`: Download the latest version of each file snapshot. You will get a rebuild of the current website with all available files (but not any original state because new and old versions are mixed).
- `-f`, `--full`: Download snapshots of all timestamps. You will get a folder per timestamp with the files available at that time.
- `-s`, `--save`: Save a page to the Wayback Machine. (beta)

#### Optional Arguments
### Optional query parameters

- `-l`, `--list`: Only print the snapshots available within the specified range. Does not download the snapshots.
- `-e`, `--explicit`: Only download the explicit given url. No wildcard subdomains or paths. Use e.g. to get root-only snapshots.
- `-o`, `--output`: The folder where downloaded files will be saved.
- `-o`, `--output`: Defaults to `waybackup_snapshots` in the current directory. The folder where downloaded files will be saved.

- **Range Selection:**<br>
Specify the range as years or as specific timestamps for start, end, or both. If you specify the `range` argument, the `start` and `end` arguments are ignored. Timestamp format: YYYYMMDDhhmmss. You can give only a year, or increase specificity by adding digits from the left.<br>
@@ -58,13 +55,36 @@ Specify the range in years or a specific timestamp either start, end or both. If
- `--start`: Timestamp to start searching.
- `--end`: Timestamp to end searching.
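The left-anchored timestamp behavior described above can be sketched in Python. `pad_timestamp` is a hypothetical helper for illustration, not part of pywaybackup (the CDX server accepts partial timestamps directly):

```python
# Hypothetical sketch: expanding a partial timestamp (YYYYMMDDhhmmss).
# A value such as "2019" is padded to a full 14-digit timestamp; the start
# bound pads with the earliest digits, the end bound with the latest.

def pad_timestamp(ts: str, end: bool = False) -> str:
    """Pad a left-anchored partial timestamp to 14 digits."""
    template = "99991231235959" if end else "00000101000000"
    return ts + template[len(ts):]

print(pad_timestamp("2019"))            # earliest moment of 2019
print(pad_timestamp("2019", end=True))  # latest moment of 2019
```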

#### Additional

- `--csv`: Save a csv file with the list of snapshots inside the output folder or a specified folder. If you set `--list` the csv will contain the cdx list of snapshots. If you set either `--current` or `--full` the csv will contain the downloaded files.
- `--no-redirect`: Do not follow redirects of snapshots. Archive.org sometimes redirects to a different snapshot for several reasons. Downloading redirects may lead to timestamp-folders which contain some files with a different timestamp. This does not matter if you only want to download the latest version (`-c`).
- `--verbosity`: Set the verbosity: json (print json response), progress (show progress bar).
- `--retry`: Retry failed downloads. You can specify the number of retry attempts as an integer.
- `--workers`: The number of workers to use for downloading (simultaneous downloads). Default is 1. A safe spot is about 10 workers. Beware: Using too many workers will lead into refused connections from the Wayback Machine. Duration about 1.5 minutes.
### Additional behavior manipulation

- **`--csv`** `<path>`:<br>
Path defaults to the output directory. Saves a CSV file with the JSON response for successful downloads. If `--list` is set, the CSV contains the CDX list of snapshots. If `--current` or `--full` is set, the CSV contains the downloaded files. Named `waybackup_<sanitized_url>.csv`.

- **`--skip`** `<path>`:<br>
Path defaults to output-dir. Checks for an existing `waybackup_<domain>.csv` for URLs to skip downloading. Useful for interrupted downloads. Files are checked by their root-domain, ensuring consistency across queries. This means that if you download `http://example.com/subdir1/` and later `http://example.com`, the second query will skip the first path.
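The root-domain behavior described above might be approximated as follows. The naming scheme and CSV layout are assumptions for illustration, not the tool's actual code:

```python
# Sketch: a skip list shared per root domain, so queries for different
# paths of the same domain consult the same file (assumed layout: one
# downloaded URL per CSV row, first column).
import csv
import os
from urllib.parse import urlparse

def skip_file_for(url: str, folder: str) -> str:
    """Skip lists are keyed by root domain (assumed naming scheme)."""
    domain = urlparse(url).netloc or url.split("/")[0]
    return os.path.join(folder, f"waybackup_{domain}.csv")

def load_skip_urls(path: str) -> set:
    """Collect already-downloaded URLs from a previous run's CSV."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="") as f:
        return {row[0] for row in csv.reader(f) if row}
```

Because both `http://example.com/subdir1/` and `http://example.com` resolve to the same root domain, they map to the same skip file, which is why the second query skips files from the first.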

- **`--no-redirect`**:<br>
Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.

- **`--verbosity`** `<level>`:<br>
Sets verbosity level. Options are `json` (prints JSON response) or `progress` (shows progress bar).

- **`--retry`** `<attempts>`:<br>
Specifies number of retry attempts for failed downloads.

- **`--workers`** `<count>`:<br>
Sets the number of simultaneous download workers. Default is 1, safe range is about 10. Be cautious as too many workers may lead to refused connections from the Wayback Machine.
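Bounded concurrency of the kind described above can be sketched with the standard library. This is an illustration of the worker-pool idea, not pywaybackup's implementation; `download` is a placeholder:

```python
# Sketch: simultaneous downloads with a bounded worker pool.
from concurrent.futures import ThreadPoolExecutor, as_completed

def download(snapshot_url: str) -> str:
    # placeholder for the real HTTP fetch of one snapshot
    return f"saved {snapshot_url}"

urls = [f"https://web.archive.org/web/2024/{i}" for i in range(5)]

# ~10 workers is a safe upper bound; more risks refused connections.
with ThreadPoolExecutor(max_workers=10) as pool:
    futures = [pool.submit(download, u) for u in urls]
    for fut in as_completed(futures):
        print(fut.result())
```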

**CDX Query Handling:**
- **`--cdxbackup`** `<path>`:<br>
Path defaults to the output directory. Saves the result of the CDX query as a file, named `waybackup_<sanitized_url>.cdx`. Useful for downloading snapshots later and for working around connections refused by the CDX server after too many queries.

- **`--cdxinject`** `<filepath>`:<br>
Injects a CDX query file to download snapshots. Ensure the query matches the previous `--url` for correct folder structure.

### Debug

- `--debug`: If set, full traceback will be printed in case of an error. The full exception will be written into `waybackup_error.log`.

### Examples

@@ -169,5 +189,5 @@ The csv contains the json response in a table format.

## Contributing

I'm always happy for some feature requests to improve the usability of this script.
Feel free to give suggestions and report issues. Project is still far from being perfect.
I'm always happy to receive feature requests that improve the usability of this tool.
Feel free to give suggestions and report issues. The project is still far from perfect.
7 changes: 2 additions & 5 deletions dev/pip_build.sh
@@ -4,11 +4,8 @@
SCRIPT_PATH="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
TARGET_PATH="$SCRIPT_PATH/.."

# check if venv is activated
if [ -z "$VIRTUAL_ENV" ]; then
echo "Please activate your virtual environment"
exit 1
fi
# install dependencies
pip install twine wheel setuptools

# build
python $TARGET_PATH/setup.py sdist bdist_wheel --verbose
73 changes: 73 additions & 0 deletions pywaybackup/Exception.py
@@ -0,0 +1,73 @@

import sys
import os
from datetime import datetime
import linecache
import traceback

class Exception:

new_debug = True
debug = False
output = None
command = None

@classmethod
def init(cls, debug=False, output=None, command=None):
sys.excepthook = cls.exception_handler # set custom exception handler (uncaught exceptions)
cls.output = output
cls.command = command
cls.debug = True if debug else False

@classmethod
def exception(cls, message: str, e: Exception, tb=None):
custom_tb = sys.exc_info()[-1] if tb is None else tb
original_tb = "".join(traceback.format_exception(type(e), e, e.__traceback__))
exception_message = (
"-------------------------\n"
f"!-- Exception: {message}\n"
)
if custom_tb is not None:
while custom_tb.tb_next: # loop to last traceback frame
custom_tb = custom_tb.tb_next
tb_frame = custom_tb.tb_frame
tb_line = custom_tb.tb_lineno
func_name = tb_frame.f_code.co_name
filename = tb_frame.f_code.co_filename
codeline = linecache.getline(filename, tb_line).strip()
exception_message += (
f"!-- File: {filename}\n"
f"!-- Function: {func_name}\n"
f"!-- Line: {tb_line}\n"
f"!-- Segment: {codeline}\n"
)
else:
exception_message += "!-- Traceback is None\n"
exception_message += (
f"!-- Description: {e}\n"
"-------------------------")
print(exception_message)
if cls.debug:
debug_file = os.path.join(cls.output, "waybackup_error.log")
print(f"Exception log: {debug_file}")
print("-------------------------")
print(f"Full traceback:\n{original_tb}")
if cls.new_debug: # new run, overwrite file
cls.new_debug = False
f = open(debug_file, "w")
f.write("-------------------------\n")
f.write(f"Command: {cls.command}\n")
f.write("-------------------------\n\n")
else: # current run, append to file
f = open(debug_file, "a")
f.write(datetime.now().strftime("%Y-%m-%d %H:%M:%S") + "\n")
f.write(exception_message + "\n")
f.write(original_tb + "\n")

@staticmethod
def exception_handler(exception_type, exception, traceback):
if issubclass(exception_type, KeyboardInterrupt):
sys.__excepthook__(exception_type, exception, traceback)
return
Exception.exception("UNCAUGHT EXCEPTION", exception, traceback) # uncaught exceptions also with custom scheme

75 changes: 12 additions & 63 deletions pywaybackup/SnapshotCollection.py
@@ -1,10 +1,10 @@
from urllib.parse import urlparse
from pywaybackup.helper import url_split
import os

class SnapshotCollection:

SNAPSHOT_COLLECTION = []
MODE_CURRENT = 0
MODE_CURRENT = 0

@classmethod
def create_list(cls, cdxResult, mode):
@@ -15,7 +15,7 @@ def create_list(cls, cdxResult, mode):
- mode `current`: Only the latest snapshot of each file is included.
"""
# creates a list of dictionaries for each snapshot entry
cls.SNAPSHOT_COLLECTION = sorted([{"timestamp": snapshot[0], "digest": snapshot[1], "mimetype": snapshot[2], "status": snapshot[3], "url": snapshot[4]} for snapshot in cdxResult.json()[1:]], key=lambda k: k['timestamp'], reverse=True)
cls.SNAPSHOT_COLLECTION = sorted([{"timestamp": snapshot[0], "digest": snapshot[1], "mimetype": snapshot[2], "status": snapshot[3], "url": snapshot[4]} for snapshot in cdxResult[1:]], key=lambda k: k['timestamp'], reverse=True)
if mode == "current":
cls.MODE_CURRENT = 1
cdxResult_list_filtered = []
@@ -29,21 +29,23 @@ def create_list(cls, cdxResult, mode):
# writes the index for each snapshot entry
cls.SNAPSHOT_COLLECTION = [{"id": idx, **entry} for idx, entry in enumerate(cls.SNAPSHOT_COLLECTION)]
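The `current`-mode filtering that `create_list` performs (keep only the newest snapshot per URL from a timestamp-descending list) can be sketched standalone:

```python
# Sketch of create_list's "current" mode: the collection is sorted by
# timestamp descending, so the first occurrence of each URL is its
# newest snapshot. Sample data is illustrative.
snapshots = [
    {"timestamp": "20240105", "url": "http://example.com/a"},
    {"timestamp": "20230101", "url": "http://example.com/a"},
    {"timestamp": "20220101", "url": "http://example.com/b"},
]
snapshots.sort(key=lambda s: s["timestamp"], reverse=True)

seen, latest = set(), []
for snap in snapshots:
    if snap["url"] not in seen:  # first hit is the newest snapshot
        seen.add(snap["url"])
        latest.append(snap)

# latest holds one entry per URL: the 2024 /a and the 2022 /b
```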


@classmethod
def count_list(cls):
return len(cls.SNAPSHOT_COLLECTION)


@classmethod
def create_collection(cls):
new_collection = []
for idx, cdx_entry in enumerate(cls.SNAPSHOT_COLLECTION):
timestamp, url = cdx_entry["timestamp"], cdx_entry["url"]
url_archive = f"http://web.archive.org/web/{timestamp}{cls._url_get_filetype(url)}/{url}"
timestamp, url_origin = cdx_entry["timestamp"], cdx_entry["url"]
url_archive = f"https://web.archive.org/web/{timestamp}id_/{url_origin}"
collection_entry = {
"id": idx,
"timestamp": timestamp,
"url_archive": url_archive,
"url_origin": url,
"url_origin": url_origin,
"redirect_url": False,
"redirect_timestamp": False,
"response": False,
@@ -52,27 +54,18 @@ def create_collection(cls):
new_collection.append(collection_entry)
cls.SNAPSHOT_COLLECTION = new_collection

@classmethod
def snapshot_entry_create_output(cls, collection_entry: dict, output: str) -> str:
"""
Create the output path for a snapshot entry of the collection according to the mode.
Input:
- collection_entry: A single snapshot entry of the collection (dict).
- output: The output directory (str).

Output:
- download_file: The output path for the snapshot entry (str) with filename.
"""
timestamp, url = collection_entry["timestamp"], collection_entry["url_origin"]
domain, subdir, filename = cls.url_split(url, index=True)
@classmethod
def create_output(cls, url: str, timestamp: str, output: str):
domain, subdir, filename = url_split(url.split("id_/")[1], index=True)
if cls.MODE_CURRENT:
download_dir = os.path.join(output, domain, subdir)
else:
download_dir = os.path.join(output, domain, timestamp, subdir)
download_file = os.path.abspath(os.path.join(download_dir, filename))
return download_file


@classmethod
def snapshot_entry_modify(cls, collection_entry: dict, key: str, value: str):
"""
@@ -82,47 +75,3 @@ def snapshot_entry_modify(cls, collection_entry: dict, key: str, value: str):
- Modify an existing key-value pair if the key exists.
"""
collection_entry[key] = value

@classmethod
def url_get_timestamp(cls, url):
"""
Extract the timestamp from a wayback machine URL.
"""
timestamp = url.split("web.archive.org/web/")[1].split("/")[0]
timestamp = ''.join([char for char in timestamp if char.isdigit()])
return timestamp

@classmethod
def _url_get_filetype(cls, url):
file_extension = os.path.splitext(url)[1][1:]
urltype_mapping = {
"jpg": "im_",
"jpeg": "im_",
"png": "im_",
"gif": "im_",
"svg": "im_",
"ico": "im_",
"css": "cs_"
#"js": "js_"
}
urltype = urltype_mapping.get(file_extension, "id_")
return urltype

@classmethod
def url_split(cls, url, index=False):
"""
Split a URL into domain, subdir and filename.
"""
if not urlparse(url).scheme:
url = "http://" + url
parsed_url = urlparse(url)
domain = parsed_url.netloc.split("@")[-1].split(":")[0] # split mailto: and port
path_parts = parsed_url.path.split("/")
if not url.endswith("/") or "." in path_parts[-1]:
filename = path_parts[-1]
subdir = "/".join(path_parts[:-1]).strip("/")
else:
filename = "index.html" if index else ""
subdir = "/".join(path_parts).strip("/")
filename = filename.replace("%20", " ") # replace url encoded spaces
return domain, subdir, filename
28 changes: 15 additions & 13 deletions pywaybackup/Verbosity.py
@@ -2,44 +2,46 @@
import json
from pywaybackup.SnapshotCollection import SnapshotCollection as sc


class Verbosity:

mode = None
args = None
pbar = None

new_debug = True
debug = False
output = None
command = None

@classmethod
def open(cls, args: list):
cls.args = args
def init(cls, v_args: list, debug=False, output=None, command=None):
cls.args = v_args
cls.output = output
cls.command = command
if cls.args == "progress":
cls.mode = "progress"
elif cls.args == "json":
cls.mode = "json"
else:
cls.mode = "standard"
cls.debug = True if debug else False

@classmethod
def close(cls):
def fini(cls):
if cls.mode == "progress":
if cls.pbar is not None: cls.pbar.close()
if cls.mode == "progress" or cls.mode == "standard":
successed = len([snapshot for snapshot in sc.SNAPSHOT_COLLECTION if "file" in snapshot and snapshot["file"]])
failed = len([snapshot for snapshot in sc.SNAPSHOT_COLLECTION if "file" in snapshot and not snapshot["file"]])
print(f"\nFiles downloaded: {successed}")
print(f"Files missing: {failed}")
print("")
if cls.mode == "json":
print(json.dumps(sc.SNAPSHOT_COLLECTION, indent=4, sort_keys=True))

@classmethod
def write(cls, message: str = None, progress: int = None):
if cls.mode == "progress":
if progress == 0:
print("")
if cls.pbar is None and progress == 0:
maxval = sc.count_list()
cls.pbar = tqdm.tqdm(total=maxval, desc="Downloading", unit=" snapshot", ascii="░▒█")
elif cls.pbar is not None and progress == 1:
cls.pbar.update(1)
if cls.pbar is not None and progress is not None and progress > 0 :
cls.pbar.update(progress)
cls.pbar.refresh()
elif cls.mode == "json":
pass
2 changes: 1 addition & 1 deletion pywaybackup/__version__.py
@@ -1 +1 @@
__version__ = "1.0.3"
__version__ = "1.2.0"
