Releases: bitdruid/python-wayback-machine-downloader
2.0.0
Last version with dictionary-logic and no db-functionality: v1.5.7
This release makes old queries incompatible
Full Changelog: 1.5.0...2.0.0
Removed the use of a dictionary and replaced it with sqlite. This was in reference to #20, where I realised that, due to the amount of information this tool processes and provides, a very large query would crash because the system ran out of memory.
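As a rough illustration of why this helps, here is a minimal sketch of the dict-to-sqlite idea (the table and column names are assumptions, not the tool's actual schema): rows go straight into an on-disk db, so memory stays flat no matter how large the query gets.

```python
import sqlite3

# Minimal sketch of the dict -> sqlite change; table and column names
# are hypothetical, not the tool's actual schema.
def store_snapshots(rows, db_path="query.db"):
    """rows: any iterable of (timestamp, url) tuples, e.g. parsed cdx lines."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS snapshots (timestamp TEXT, url TEXT)")
    # executemany consumes the iterable one row at a time, so millions of
    # results never have to sit in memory at once (unlike a dict of them all)
    con.executemany("INSERT INTO snapshots VALUES (?, ?)", rows)
    con.commit()
    con.close()
```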
Main changes from 1.5.0
- removed the dictionary and replaced it with a sqlite db to handle large jobs - making system memory less important
- removed `--list` because you can simply `cat` the cdx-file
- removed `--json` because for large jobs you have to parse the csv anyways
- removed `--skip` because an existing job will always be resumed until it is finished; or, just in case, use `--reset`
- removed `--cdxbackup` because it has no use after a job is finished; or, just in case, use `--keep`
- removed `--cdxinject` because an existing job will always try to inject an existing cdx-file; or, just in case, use `--reset`
- removed `--auto` because the skip functionality is now the default and `--cdxbackup` and `--cdxinject` were removed
- removed `--csv` because it is now the default output of this tool
- removed `--verbosity` because the progress bar is now `--progress` and `--json` was removed
- removed `--debug` because an error-log is always produced
- added `--keep` to prevent deletion of the `.cdx` and `.db` files after the job has finished
- added `--reset` to reset an interrupted query instead of resuming it
- added `--progress` to show progress bars instead of cli-output
- added `--filetype` to filter snapshots by a specific filetype (`.html` not working for now; see the sketch after this list)
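As an illustration of what a filetype filter does, a minimal sketch matching on the URL path's extension (the function name and matching rule are assumptions, not the tool's implementation):

```python
from urllib.parse import urlparse

# Hypothetical sketch of filetype filtering; not the tool's actual code.
def matches_filetype(snapshot_url: str, filetype: str) -> bool:
    path = urlparse(snapshot_url).path
    return path.lower().endswith("." + filetype.lower().lstrip("."))

urls = ["http://example.com/a.pdf", "http://example.com/b.jpg"]
pdfs = [u for u in urls if matches_filetype(u, "pdf")]  # keeps only a.pdf
```

One plausible reason `.html` is hard: many HTML pages are served from extensionless urls, which a suffix match like this one would miss.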
Behavior changes
- the cdx-results are now streamed into the cdx-file instead of being written out from a variable, to reduce memory-load
- all optional `paths` for commands have been removed - defaults to the output dir
- a query is now identified by its `required` and `optional query` parameters. If an existing query is identified, the download will resume (with a short info-message and the last status of the query). The `manipulate behavior` parameters do not affect that logic (see the sketch after this list)
- the tool will give you a calculation of the snapshots to utilize, based on filtered/resumed/handled/skipped snapshots
- there are 3 progress-bars to help you track the status of very large jobs:
  - download of the cdx-results
  - insertion of the cdx-results into the db
  - download of the archived pages
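To make the resume logic concrete, here is a minimal sketch of how a query identity could be derived from its parameters (the parameter names, hashing scheme and filename are assumptions for illustration, not the tool's actual logic):

```python
import hashlib
import json

# Hypothetical sketch: derive a stable id from the query-defining parameters
# so a re-run with the same parameters resumes the existing job.
def query_id(url: str, query_params: dict) -> str:
    # sort keys so identical parameters always hash to the same id;
    # flags that only tweak behavior (e.g. a progress flag) are excluded
    canonical = json.dumps({"url": url, **query_params}, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

job = query_id("http://example.com", {"filetype": "pdf"})
db_file = f"waybackup.{job}.db"  # resume if this db already exists
```

Deriving the id only from the query-defining parameters is what would let the behavior flags change between runs without breaking the resume.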
Ideas for the future
- merge all jobs into one db file instead of one db per query
- restructure the output dir to the logic `waybackup_snapshots/<query>/domains+subdomains+queryfiles/...` instead of `waybackup_snapshots/domains+subdomains+queryfiles/...` to split queries into exclusive folders
This is the first release with a db. As there have been a lot of changes, there are bound to be some bugs; due to my final thesis at university I have not had much time to find them all, but hopefully most of them are fixed. Do not hesitate to open an issue and please report any bugs or improvements!
1.5.0
#20 made me aware of an issue with very large queries to the cdx server. The snapshots received can easily cause the system to run out of memory, resulting in a crash.
So besides some other changes, there is work in progress to reduce memory-usage (but in exchange for some I/O):
- Results from the cdx server are now streamed into a .cdx file instead of system memory (see the sketch after this list)
- Added a warning (abortable) if the amount of snapshots is on the larger scale
- Added a progress indication for the download status of the cdx-query
- Sometimes the json results of the cdx server were not transferred completely and an exception was thrown. This faulty json data should now be handled appropriately (also #20)
- Added command `--limit` to specify the maximum amount of snapshots to query
- Removed command `--debug`; the error-log will always be written
- #19 fixed an error with timestamp extraction from urls
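A minimal sketch of the streaming idea, assuming `requests` (the query params shown are illustrative, not the tool's full cdx query): the response body goes to disk chunk by chunk, so the snapshot list never has to fit in memory.

```python
import requests

# Sketch of streaming a cdx query to disk instead of holding it in memory.
url = "https://web.archive.org/cdx/search/cdx"
params = {"url": "example.com/*", "output": "json"}

with requests.get(url, params=params, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    with open("query.cdx", "wb") as fh:
        for chunk in resp.iter_content(chunk_size=64 * 1024):
            fh.write(chunk)  # constant memory, regardless of result size
```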
Feel free to submit bug reports at any time!
1.4.0
- #17 `--delay` command can now be used to set a delay between GET requests for each worker
- `--log` command can now be used to write a logfile (also works with `--verbosity progress`)
- changes in the logic of `--verbosity` for a future loglevel setting to manage more/less output
- more changes in argument-handling
- minor fixes / cleanups
1.3.0
1.2.0
- fixed errors if snapshots collide path<->file #3 #4
- fixed errors where a picture was stored as index.html #7
- added url-encoding #4
- prevent redirect loops #4
- fixed SIGINT KeyboardInterrupt prevents csv-file from generating #8
- added custom exception handler
- added `--debug` to log exceptions into an error-log and print out the full traceback instead of a shortened one
- replaced batch-lists with a queue for workers #9 (see the sketch after this list)
- added some cdx-queries from example.com to test
- added `--cdxbackup` and `--cdxinject` to either store a cdx query for later use or use a backup
- added `--skip` -> an existing csv-file will be used to check for already downloaded snapshots
- changed user-agent to give archive.org the possibility to know who is scraping #11
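For context on the batch-list to queue change, a minimal sketch of the pattern (worker count, names and the `download` stub are illustrative, not the tool's actual code): idle workers pull the next snapshot from a shared queue instead of being stuck with a fixed pre-sliced batch.

```python
import queue
import threading

def download(snapshot) -> None:
    print("downloading", snapshot)  # stand-in for the real per-snapshot work

def worker(q: queue.Queue) -> None:
    while True:
        snapshot = q.get()
        if snapshot is None:  # sentinel: no more work
            break
        download(snapshot)
        q.task_done()

q = queue.Queue()
threads = [threading.Thread(target=worker, args=(q,)) for _ in range(4)]
for t in threads:
    t.start()
for snap in ["s1", "s2", "s3"]:
    q.put(snap)
for _ in threads:
    q.put(None)  # one sentinel per worker
for t in threads:
    t.join()
```

With pre-sliced batches, one slow worker can hold its whole remaining slice while the others sit idle; a shared queue balances the load automatically.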
1.0.3
v1.0.2
- fixed paths for win #2 #1 (see the sketch after this list):
  - stripping ports from domain (:80 :443 ...) to prevent WinError
  - stripping mailto-prefixes to prevent WinError
  - changed url-parsing to prevent the case where subdir==filename caused WinError
- url-encoded spaces in filenames are now decoded #1
- clarified current-path structure in readme; changes may come in the future #1
- optimized the parsing of cdx-query to keep inside a requested path
- increased performance of collection-creation for very large requests
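A hedged sketch of the kind of sanitizing these Windows fixes describe (the function name and regexes are assumptions for illustration, not the tool's actual code):

```python
import re
from urllib.parse import unquote

# Hypothetical sketch of the Windows path fixes listed above.
def sanitize_for_windows(url_part: str) -> str:
    url_part = re.sub(r"^mailto:", "", url_part)  # strip mailto-prefix
    url_part = re.sub(r":\d+$", "", url_part)     # strip port (:80, :443, ...)
    return unquote(url_part)                      # decode %20 etc. in filenames

print(sanitize_for_windows("example.com:443"))    # -> example.com
print(sanitize_for_windows("file%20name.html"))   # -> file name.html
```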
first release
Changes to beta:
- `--worker` changed to `--workers`
- `--csv` appends the requested url to the filename to prevent overwriting
- cleanup README
- cleanup HELP