Releases: bitdruid/python-wayback-machine-downloader

2.0.0

19 Oct 12:42

Last version with dictionary-logic and no db-functionality: v1.5.7

This release makes old queries incompatible

Full Changelog: 1.5.0...2.0.0

Removed the dictionary and replaced it with sqlite. This goes back to #20, where I realised that, given the amount of information this tool processes and provides, a very large query would crash because the system runs out of memory.

Main changes from 1.5.0

  • replaced the dictionary with an sqlite db to handle large jobs, making system memory less of a limit
  • removed --list because you can simply cat the cdx-file
  • removed --json because for large jobs you have to parse the csv anyway
  • removed --skip because an existing job is always resumed until it is finished; if needed, use --reset
  • removed --cdxbackup because the cdx-file is of no use after a job has finished; if needed, use --keep
  • removed --cdxinject because an existing job always tries to inject an existing cdx-file; if needed, use --reset
  • removed --auto because skip functionality is now the default and --cdxbackup and --cdxinject were removed
  • removed --csv because it is now the default output of this tool
  • removed --verbosity because the progressbar is now --progress and --json was removed
  • removed --debug because an error-log is always produced
  • added --keep to prevent deletion of .cdx and .db after the job finished
  • added --reset to reset an interrupted query instead of resuming
  • added --progress to show progressbars instead of cli-output
  • added --filetype to filter snapshots by specific filetype (.html not working for now)
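
The switch from a dictionary to sqlite can be pictured roughly as follows. This is a minimal sketch and not the tool's actual schema: the table name `snapshots`, its columns, and the helper names `open_db`, `insert_snapshots` and `pending` are assumptions made for illustration. The point is that rows are written in batches and read back one at a time, so memory use no longer grows with the size of the query.

```python
import sqlite3
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[str, str]  # (timestamp, url): hypothetical minimal snapshot record


def open_db(path: str) -> sqlite3.Connection:
    """Open (or create) a job database with a hypothetical schema."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS snapshots ("
        " timestamp TEXT NOT NULL,"
        " url TEXT NOT NULL,"
        " done INTEGER NOT NULL DEFAULT 0,"
        " PRIMARY KEY (timestamp, url))"
    )
    return conn


def _flush(conn: sqlite3.Connection, batch: List[Row]) -> int:
    # INSERT OR IGNORE leaves rows from an earlier, interrupted run untouched.
    before = conn.total_changes
    conn.executemany(
        "INSERT OR IGNORE INTO snapshots (timestamp, url) VALUES (?, ?)", batch
    )
    conn.commit()
    return conn.total_changes - before


def insert_snapshots(
    conn: sqlite3.Connection, rows: Iterable[Row], batch_size: int = 1000
) -> int:
    """Insert rows in batches so only one batch is held in memory at a time."""
    inserted = 0
    batch: List[Row] = []
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            inserted += _flush(conn, batch)
            batch = []
    if batch:
        inserted += _flush(conn, batch)
    return inserted


def pending(conn: sqlite3.Connection) -> Iterator[Row]:
    """Yield snapshots that still need downloading, one row at a time."""
    yield from conn.execute(
        "SELECT timestamp, url FROM snapshots WHERE done = 0 ORDER BY timestamp, url"
    )
```

Because the primary key rejects duplicates, running the same insert again after an interruption adds nothing, which is one simple way to make a job resumable.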

Behavior changes

  • the cdx-results are now streamed into the cdxfile instead of written from a variable to reduce memory-load
  • all optional paths for commands have been removed; files now default to the output dir
  • a query is now identified by its required and optional query-parameters. If an existing query is identified, the download will resume (with a short info-message and the last status of the query)
  • parameters that only manipulate behavior do not affect that logic
  • the tool will give you a calculation of the snapshots to utilize, based on filtered/resumed/handled/skipped snapshots
  • there are 3 progress-bars to help you track the status of very large jobs:
    • download of the cdx-results
    • insertion of the cdx-results into the db
    • download of the archived pages
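
Identifying a query by its parameters could look like the sketch below. The hashing scheme and the function name `query_id` are assumptions for illustration, not the tool's actual implementation. Only query-parameters (for example the url, `--limit` or `--filetype`) would be passed in; behavior flags such as `--progress` or `--keep` are left out, so they do not change the identity of a job.

```python
import hashlib
import json
from typing import Optional, Union

Value = Optional[Union[str, int]]


def query_id(url: str, **params: Value) -> str:
    """Derive a stable identifier from the query parameters (hypothetical scheme)."""
    # Drop unset options and sort keys so the order of arguments does not matter.
    relevant = {key: value for key, value in params.items() if value is not None}
    payload = json.dumps({"url": url, **relevant}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


if __name__ == "__main__":
    first = query_id("example.com", limit=10, filetype="jpg")
    second = query_id("example.com", filetype="jpg", limit=10)
    print(first == second)  # same parameters in a different order
```

If a db file named after this identifier already exists in the output dir, the job would be treated as an existing query and resumed.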

Ideas for the future

  • merge all jobs into one db file instead of one db per query
  • restructure of the output dir to the logic waybackup_snapshots/<query>/domains+subdomains+queryfiles/... instead of waybackup_snapshots/domains+subdomains+queryfiles/... to split queries into exclusive folders

This is the first release with db. As there have been a lot of changes, there are bound to be some bugs. Due to my final thesis at university I have not had much time to find them all, but hopefully most of them. Do not hesitate to open an issue. Please report bugs or improvements in any case!

1.5.0

24 Aug 21:51

#20 made me aware of an issue with very large queries to the cdx server. The snapshots received can easily cause the system to run out of memory, resulting in a crash.

So besides some other changes, work is in progress to reduce memory usage (in exchange for some I/O)

  • Results from the cdx server are now streamed into a .cdx file instead of system memory
  • Added a warning (abortable) if the number of snapshots is large
  • Added a progress indication for the download status of the cdx-query
  • Sometimes the json results of the cdx server were not transferred completely and an exception was thrown. This faulty json data should now be handled appropriately (also #20)
  • Added command --limit to specify the maximum number of snapshots to query
  • Removed command --debug; error-log will be always written
  • #19 fixed an error with timestamp extraction from urls
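
Handling an incompletely transferred json result can be sketched as below. This is an illustration only, not the tool's code: it assumes the cdx server returns a json array of arrays with one row per line, and the function name `parse_cdx_lines` is hypothetical. Each line is parsed on its own, so a truncated last line is skipped instead of raising an exception for the whole result.

```python
import json
from typing import Iterable, Iterator, List


def parse_cdx_lines(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one row per line, skipping lines that are not valid json."""
    for line in lines:
        text = line.strip()
        # Remove the brackets of the outer array and the trailing comma.
        if text.startswith("[["):
            text = text[1:]
        if text.endswith("]]"):
            text = text[:-1]
        text = text.rstrip(",")
        if not text:
            continue
        try:
            row = json.loads(text)
        except json.JSONDecodeError:
            # Truncated or otherwise faulty line: ignore it and keep going.
            continue
        if isinstance(row, list) and row:
            yield row
```

Since the function accepts any iterable of lines, it can read directly from an open .cdx file and never needs the full result in memory.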

Feel free to submit bug reports at any time!

1.4.0

25 Jul 15:38
  • #17 --delay command can now be used to set a delay between GET requests for each worker
  • --log command can now be used to write a logfile (also works with --verbosity progress)
  • changes in the logic of --verbosity for a future loglevel setting to manage more/less output
  • more changes in argument-handling
  • minor fixes / cleanups

1.3.0

29 Jun 11:41
  • fixed requirements for win
  • fixed csv for win
  • added "auto" mode
  • fixed some minor bugs

1.2.0

08 Jun 14:57
  • fixed errors if snapshots collide path<->file #3 #4
  • fixed errors where a picture was stored as index.html #7
  • added url-encoding #4
  • prevent redirect loops #4
  • fixed SIGINT / KeyboardInterrupt preventing the csv-file from being generated #8
  • added custom exception handler
  • added --debug to log exceptions into an error-log and print out full traceback instead of shortened
  • replaced batch-lists with queue for workers #9
  • added some cdx-queries from example.com to test
  • added --cdxbackup and --cdxinject to either store a cdx query for later use or use a backup
  • added --skip -> an existing csv-file will be used to check for already downloaded snapshots
  • changed user-agent to give archive.org the possibility to know who is scraping #11

1.0.3

03 Jun 18:27

fixes #3

v1.0.2

31 May 07:44
  • fixed paths for win #2 #1:
    • stripping ports from domain (:80 :443 ...) to prevent WinError
    • stripping mailto-prefixes to prevent WinError
    • changed url-parsing to prevent the case where subdir==filename caused WinError
  • url-encoded spaces in filenames are now decoded #1
  • clarified current-path structure in readme - changes may come in the future #1
  • optimized the parsing of cdx-query to keep inside a requested path
  • increased performance of collection-creation for very large requests
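
The path fixes above can be illustrated with a small sketch. It is not the tool's implementation: the function name `local_path` and the exact rules are assumptions, and the input is assumed to be an absolute url with a scheme. It drops the port from the domain, decodes url-encoded characters such as `%20`, and removes a `mailto:` prefix from path segments, since a colon is not allowed in Windows file names.

```python
from urllib.parse import unquote, urlsplit


def local_path(url: str) -> str:
    """Map an absolute url to a relative path without ports or mailto-prefixes."""
    parts = urlsplit(url)
    # .hostname excludes the port (":80", ":443", ...) and any userinfo.
    host = parts.hostname or ""
    segments = []
    for raw in parts.path.split("/"):
        segment = unquote(raw)  # "%20" becomes a space
        if segment.lower().startswith("mailto:"):
            segment = segment[len("mailto:"):]
        if segment:
            segments.append(segment)
    return "/".join([host, *segments])


if __name__ == "__main__":
    print(local_path("http://example.com:80/some%20file.txt"))
```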

first release

22 Apr 07:10

Changes to beta:

  • --worker changed to --workers.
  • --csv appends requested url to filename to prevent overwriting
  • cleanup README
  • cleanup HELP