Commit

Merge branch 'r/1.2.0'
bitdruid committed Jun 8, 2024
1 parent 9ac5c53 commit f0966d0
Showing 14 changed files with 12,601 additions and 253 deletions.
64 changes: 42 additions & 22 deletions README.md
@@ -1,4 +1,4 @@
# archive wayback downloader
# python wayback machine downloader

[![PyPI](https://img.shields.io/pypi/v/pywaybackup)](https://pypi.org/project/pywaybackup/)
[![PyPI - Downloads](https://img.shields.io/pypi/dm/pywaybackup)](https://pypi.org/project/pywaybackup/)
@@ -7,15 +7,17 @@

Downloading archived web pages from the [Wayback Machine](https://archive.org/web/).

Internet-archive is a nice source for several OSINT-information. This script is a work in progress to query and fetch archived web pages.
The Internet Archive is a valuable source of OSINT information. This tool is a work in progress for querying and fetching archived web pages.

This tool allows you to download content from the Wayback Machine (archive.org). You can use it to download either the latest version or all versions of web page snapshots within a specified range.

## Installation

### Pip

1. Install the package <br>
```pip install pywaybackup```
2. Run the script <br>
2. Run the tool <br>
```waybackup -h```

### Manual
@@ -26,30 +28,25 @@ Internet-archive is a nice source for several OSINT-information. This script is
```pip install .```
- in a virtual env or use `--break-system-package`

## Usage

This script allows you to download content from the Wayback Machine (archive.org). You can use it to download either the latest version or all versions of web page snapshots within a specified range.

### Arguments
## Arguments

- `-h`, `--help`: Show the help message and exit.
- `-a`, `--about`: Show information about the script and exit.
- `-a`, `--about`: Show information about the tool and exit.

#### Required Arguments
### Required

- `-u`, `--url`: The URL of the web page to download. This argument is required.

#### Mode Selection (Choose One)

- `-c`, `--current`: Download the latest version of each file snapshot. You will get a rebuild of the current website with all available files (but not any original state because new and old versions are mixed).
- `-f`, `--full`: Download snapshots of all timestamps. You will get a folder per timestamp with the files available at that time.
- `-s`, `--save`: Save a page to the Wayback Machine. (beta)

#### Optional Arguments
### Optional query parameters

- `-l`, `--list`: Only print the snapshots available within the specified range. Does not download the snapshots.
- `-e`, `--explicit`: Only download the explicit given url. No wildcard subdomains or paths. Use e.g. to get root-only snapshots.
- `-o`, `--output`: The folder where downloaded files will be saved.
- `-o`, `--output`: Defaults to `waybackup_snapshots` in the current directory. The folder where downloaded files will be saved.

- **Range Selection:**<br>
Specify the range as years or as specific timestamps for start, end, or both. If you specify the `range` argument, the `start` and `end` arguments are ignored. Timestamp format: YYYYMMDDhhmmss. You can give only a year, or increase specificity by adding digits from the left.<br>
@@ -58,13 +55,36 @@ Specify the range in years or a specific timestamp either start, end or both. If
- `--start`: Timestamp to start searching.
- `--end`: Timestamp to end searching.
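The left-anchored timestamp behavior described above can be sketched in Python. `pad_timestamp` is a hypothetical helper for illustration, not part of pywaybackup (the CDX server accepts partial timestamps directly):

```python
# Hypothetical sketch: expanding a partial timestamp (YYYYMMDDhhmmss).
# A value such as "2019" is padded to a full 14-digit timestamp; the start
# bound pads with the earliest digits, the end bound with the latest.

def pad_timestamp(ts: str, end: bool = False) -> str:
    """Pad a left-anchored partial timestamp to 14 digits."""
    template = "99991231235959" if end else "00000101000000"
    return ts + template[len(ts):]

print(pad_timestamp("2019"))            # earliest moment of 2019
print(pad_timestamp("2019", end=True))  # latest moment of 2019
```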

#### Additional

- `--csv`: Save a csv file with the list of snapshots inside the output folder or a specified folder. If you set `--list` the csv will contain the cdx list of snapshots. If you set either `--current` or `--full` the csv will contain the downloaded files.
- `--no-redirect`: Do not follow redirects of snapshots. Archive.org sometimes redirects to a different snapshot for several reasons. Downloading redirects may lead to timestamp-folders which contain some files with a different timestamp. This does not matter if you only want to download the latest version (`-c`).
- `--verbosity`: Set the verbosity: json (print json response), progress (show progress bar).
- `--retry`: Retry failed downloads. You can specify the number of retry attempts as an integer.
- `--workers`: The number of workers to use for downloading (simultaneous downloads). Default is 1. A safe spot is about 10 workers. Beware: Using too many workers will lead into refused connections from the Wayback Machine. Duration about 1.5 minutes.
### Additional behavior manipulation

- **`--csv`** `<path>`:<br>
Path defaults to the output directory. Saves a CSV file with the JSON response for successful downloads. If `--list` is set, the CSV contains the CDX list of snapshots. If `--current` or `--full` is set, the CSV contains the downloaded files. Named `waybackup_<sanitized_url>.csv`.

- **`--skip`** `<path>`:<br>
Path defaults to output-dir. Checks for an existing `waybackup_<domain>.csv` for URLs to skip downloading. Useful for interrupted downloads. Files are checked by their root-domain, ensuring consistency across queries. This means that if you download `http://example.com/subdir1/` and later `http://example.com`, the second query will skip the first path.
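The root-domain behavior described above might be approximated as follows. The naming scheme and CSV layout are assumptions for illustration, not the tool's actual code:

```python
# Sketch: a skip list shared per root domain, so queries for different
# paths of the same domain consult the same file (assumed layout: one
# downloaded URL per CSV row, first column).
import csv
import os
from urllib.parse import urlparse

def skip_file_for(url: str, folder: str) -> str:
    """Skip lists are keyed by root domain (assumed naming scheme)."""
    domain = urlparse(url).netloc or url.split("/")[0]
    return os.path.join(folder, f"waybackup_{domain}.csv")

def load_skip_urls(path: str) -> set:
    """Collect already-downloaded URLs from a previous run's CSV."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="") as f:
        return {row[0] for row in csv.reader(f) if row}
```

Because both `http://example.com/subdir1/` and `http://example.com` resolve to the same root domain, they map to the same skip file, which is why the second query skips files from the first.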

- **`--no-redirect`**:<br>
Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.

- **`--verbosity`** `<level>`:<br>
Sets verbosity level. Options are `json` (prints JSON response) or `progress` (shows progress bar).

- **`--retry`** `<attempts>`:<br>
Specifies number of retry attempts for failed downloads.

- **`--workers`** `<count>`:<br>
Sets the number of simultaneous download workers. Default is 1, safe range is about 10. Be cautious as too many workers may lead to refused connections from the Wayback Machine.
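Bounded concurrency of the kind described above can be sketched with the standard library. This is an illustration of the worker-pool idea, not pywaybackup's implementation; `download` is a placeholder:

```python
# Sketch: simultaneous downloads with a bounded worker pool.
from concurrent.futures import ThreadPoolExecutor, as_completed

def download(snapshot_url: str) -> str:
    # placeholder for the real HTTP fetch of one snapshot
    return f"saved {snapshot_url}"

urls = [f"https://web.archive.org/web/2024/{i}" for i in range(5)]

# ~10 workers is a safe upper bound; more risks refused connections.
with ThreadPoolExecutor(max_workers=10) as pool:
    futures = [pool.submit(download, u) for u in urls]
    for fut in as_completed(futures):
        print(fut.result())
```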

**CDX Query Handling:**
- **`--cdxbackup`** `<path>`:<br>
Path defaults to the output directory. Saves the result of the CDX query as a file, named `waybackup_<sanitized_url>.cdx`. Useful for downloading snapshots later and for working around connections refused by the CDX server after too many queries.

- **`--cdxinject`** `<filepath>`:<br>
Injects a CDX query file to download snapshots. Ensure the query matches the previous `--url` for correct folder structure.

### Debug

- `--debug`: If set, full traceback will be printed in case of an error. The full exception will be written into `waybackup_error.log`.

### Examples

@@ -169,5 +189,5 @@ The csv contains the json response in a table format.

## Contributing

I'm always happy for some feature requests to improve the usability of this script.
Feel free to give suggestions and report issues. Project is still far from being perfect.
I'm always happy to receive feature requests that improve the usability of this tool.
Feel free to give suggestions and report issues. The project is still far from perfect.
7 changes: 2 additions & 5 deletions dev/pip_build.sh
@@ -4,11 +4,8 @@
SCRIPT_PATH="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
TARGET_PATH="$SCRIPT_PATH/.."

# check if venv is activated
if [ -z "$VIRTUAL_ENV" ]; then
echo "Please activate your virtual environment"
exit 1
fi
# install dependencies
pip install twine wheel setuptools

# build
python $TARGET_PATH/setup.py sdist bdist_wheel --verbose
73 changes: 73 additions & 0 deletions pywaybackup/Exception.py
@@ -0,0 +1,73 @@

import sys
import os
from datetime import datetime
import linecache
import traceback

class Exception:

new_debug = True
debug = False
output = None
command = None

@classmethod
def init(cls, debug=False, output=None, command=None):
sys.excepthook = cls.exception_handler # set custom exception handler (uncaught exceptions)
cls.output = output
cls.command = command
cls.debug = True if debug else False

@classmethod
def exception(cls, message: str, e: Exception, tb=None):
custom_tb = sys.exc_info()[-1] if tb is None else tb
original_tb = "".join(traceback.format_exception(type(e), e, e.__traceback__))
exception_message = (
"-------------------------\n"
f"!-- Exception: {message}\n"
)
if custom_tb is not None:
while custom_tb.tb_next: # loop to last traceback frame
custom_tb = custom_tb.tb_next
tb_frame = custom_tb.tb_frame
tb_line = custom_tb.tb_lineno
func_name = tb_frame.f_code.co_name
filename = tb_frame.f_code.co_filename
codeline = linecache.getline(filename, tb_line).strip()
exception_message += (
f"!-- File: {filename}\n"
f"!-- Function: {func_name}\n"
f"!-- Line: {tb_line}\n"
f"!-- Segment: {codeline}\n"
)
else:
exception_message += "!-- Traceback is None\n"
exception_message += (
f"!-- Description: {e}\n"
"-------------------------")
print(exception_message)
if cls.debug:
debug_file = os.path.join(cls.output, "waybackup_error.log")
print(f"Exception log: {debug_file}")
print("-------------------------")
print(f"Full traceback:\n{original_tb}")
if cls.new_debug: # new run, overwrite file
cls.new_debug = False
f = open(debug_file, "w")
f.write("-------------------------\n")
f.write(f"Command: {cls.command}\n")
f.write("-------------------------\n\n")
else: # current run, append to file
f = open(debug_file, "a")
f.write(datetime.now().strftime("%Y-%m-%d %H:%M:%S") + "\n")
f.write(exception_message + "\n")
f.write(original_tb + "\n")

@staticmethod
def exception_handler(exception_type, exception, traceback):
if issubclass(exception_type, KeyboardInterrupt):
sys.__excepthook__(exception_type, exception, traceback)
return
Exception.exception("UNCAUGHT EXCEPTION", exception, traceback) # uncaught exceptions also with custom scheme

75 changes: 12 additions & 63 deletions pywaybackup/SnapshotCollection.py
@@ -1,10 +1,10 @@
from urllib.parse import urlparse
from pywaybackup.helper import url_split
import os

class SnapshotCollection:

SNAPSHOT_COLLECTION = []
MODE_CURRENT = 0
MODE_CURRENT = 0

@classmethod
def create_list(cls, cdxResult, mode):
@@ -15,7 +15,7 @@ def create_list(cls, cdxResult, mode):
- mode `current`: Only the latest snapshot of each file is included.
"""
# creates a list of dictionaries for each snapshot entry
cls.SNAPSHOT_COLLECTION = sorted([{"timestamp": snapshot[0], "digest": snapshot[1], "mimetype": snapshot[2], "status": snapshot[3], "url": snapshot[4]} for snapshot in cdxResult.json()[1:]], key=lambda k: k['timestamp'], reverse=True)
cls.SNAPSHOT_COLLECTION = sorted([{"timestamp": snapshot[0], "digest": snapshot[1], "mimetype": snapshot[2], "status": snapshot[3], "url": snapshot[4]} for snapshot in cdxResult[1:]], key=lambda k: k['timestamp'], reverse=True)
if mode == "current":
cls.MODE_CURRENT = 1
cdxResult_list_filtered = []
@@ -29,21 +29,23 @@ def create_list(cls, cdxResult, mode):
# writes the index for each snapshot entry
cls.SNAPSHOT_COLLECTION = [{"id": idx, **entry} for idx, entry in enumerate(cls.SNAPSHOT_COLLECTION)]
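The `current`-mode filtering that `create_list` performs (keep only the newest snapshot per URL from a timestamp-descending list) can be sketched standalone:

```python
# Sketch of create_list's "current" mode: the collection is sorted by
# timestamp descending, so the first occurrence of each URL is its
# newest snapshot. Sample data is illustrative.
snapshots = [
    {"timestamp": "20240105", "url": "http://example.com/a"},
    {"timestamp": "20230101", "url": "http://example.com/a"},
    {"timestamp": "20220101", "url": "http://example.com/b"},
]
snapshots.sort(key=lambda s: s["timestamp"], reverse=True)

seen, latest = set(), []
for snap in snapshots:
    if snap["url"] not in seen:  # first hit is the newest snapshot
        seen.add(snap["url"])
        latest.append(snap)

# latest holds one entry per URL: the 2024 /a and the 2022 /b
```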


@classmethod
def count_list(cls):
return len(cls.SNAPSHOT_COLLECTION)


@classmethod
def create_collection(cls):
new_collection = []
for idx, cdx_entry in enumerate(cls.SNAPSHOT_COLLECTION):
timestamp, url = cdx_entry["timestamp"], cdx_entry["url"]
url_archive = f"http://web.archive.org/web/{timestamp}{cls._url_get_filetype(url)}/{url}"
timestamp, url_origin = cdx_entry["timestamp"], cdx_entry["url"]
url_archive = f"https://web.archive.org/web/{timestamp}id_/{url_origin}"
collection_entry = {
"id": idx,
"timestamp": timestamp,
"url_archive": url_archive,
"url_origin": url,
"url_origin": url_origin,
"redirect_url": False,
"redirect_timestamp": False,
"response": False,
@@ -52,27 +54,18 @@ def create_collection(cls):
new_collection.append(collection_entry)
cls.SNAPSHOT_COLLECTION = new_collection

@classmethod
def snapshot_entry_create_output(cls, collection_entry: dict, output: str) -> str:
"""
Create the output path for a snapshot entry of the collection according to the mode.
Input:
- collection_entry: A single snapshot entry of the collection (dict).
- output: The output directory (str).

Output:
- download_file: The output path for the snapshot entry (str) with filename.
"""
timestamp, url = collection_entry["timestamp"], collection_entry["url_origin"]
domain, subdir, filename = cls.url_split(url, index=True)
@classmethod
def create_output(cls, url: str, timestamp: str, output: str):
domain, subdir, filename = url_split(url.split("id_/")[1], index=True)
if cls.MODE_CURRENT:
download_dir = os.path.join(output, domain, subdir)
else:
download_dir = os.path.join(output, domain, timestamp, subdir)
download_file = os.path.abspath(os.path.join(download_dir, filename))
return download_file


@classmethod
def snapshot_entry_modify(cls, collection_entry: dict, key: str, value: str):
"""
@@ -82,47 +75,3 @@ def snapshot_entry_modify(cls, collection_entry: dict, key: str, value: str):
- Modify an existing key-value pair if the key exists.
"""
collection_entry[key] = value

@classmethod
def url_get_timestamp(cls, url):
"""
Extract the timestamp from a wayback machine URL.
"""
timestamp = url.split("web.archive.org/web/")[1].split("/")[0]
timestamp = ''.join([char for char in timestamp if char.isdigit()])
return timestamp

@classmethod
def _url_get_filetype(cls, url):
file_extension = os.path.splitext(url)[1][1:]
urltype_mapping = {
"jpg": "im_",
"jpeg": "im_",
"png": "im_",
"gif": "im_",
"svg": "im_",
"ico": "im_",
"css": "cs_"
#"js": "js_"
}
urltype = urltype_mapping.get(file_extension, "id_")
return urltype

@classmethod
def url_split(cls, url, index=False):
"""
Split a URL into domain, subdir and filename.
"""
if not urlparse(url).scheme:
url = "http://" + url
parsed_url = urlparse(url)
domain = parsed_url.netloc.split("@")[-1].split(":")[0] # split mailto: and port
path_parts = parsed_url.path.split("/")
if not url.endswith("/") or "." in path_parts[-1]:
filename = path_parts[-1]
subdir = "/".join(path_parts[:-1]).strip("/")
else:
filename = "index.html" if index else ""
subdir = "/".join(path_parts).strip("/")
filename = filename.replace("%20", " ") # replace url encoded spaces
return domain, subdir, filename
28 changes: 15 additions & 13 deletions pywaybackup/Verbosity.py
@@ -2,44 +2,46 @@
import json
from pywaybackup.SnapshotCollection import SnapshotCollection as sc


class Verbosity:

mode = None
args = None
pbar = None

new_debug = True
debug = False
output = None
command = None

@classmethod
def open(cls, args: list):
cls.args = args
def init(cls, v_args: list, debug=False, output=None, command=None):
cls.args = v_args
cls.output = output
cls.command = command
if cls.args == "progress":
cls.mode = "progress"
elif cls.args == "json":
cls.mode = "json"
else:
cls.mode = "standard"
cls.debug = True if debug else False

@classmethod
def close(cls):
def fini(cls):
if cls.mode == "progress":
if cls.pbar is not None: cls.pbar.close()
if cls.mode == "progress" or cls.mode == "standard":
successed = len([snapshot for snapshot in sc.SNAPSHOT_COLLECTION if "file" in snapshot and snapshot["file"]])
failed = len([snapshot for snapshot in sc.SNAPSHOT_COLLECTION if "file" in snapshot and not snapshot["file"]])
print(f"\nFiles downloaded: {successed}")
print(f"Files missing: {failed}")
print("")
if cls.mode == "json":
print(json.dumps(sc.SNAPSHOT_COLLECTION, indent=4, sort_keys=True))

@classmethod
def write(cls, message: str = None, progress: int = None):
if cls.mode == "progress":
if progress == 0:
print("")
if cls.pbar is None and progress == 0:
maxval = sc.count_list()
cls.pbar = tqdm.tqdm(total=maxval, desc="Downloading", unit=" snapshot", ascii="░▒█")
elif cls.pbar is not None and progress == 1:
cls.pbar.update(1)
if cls.pbar is not None and progress is not None and progress > 0 :
cls.pbar.update(progress)
cls.pbar.refresh()
elif cls.mode == "json":
pass
2 changes: 1 addition & 1 deletion pywaybackup/__version__.py
@@ -1 +1 @@
__version__ = "1.0.3"
__version__ = "1.2.0"
