Releases: bitdruid/python-wayback-machine-downloader
2.0.0
Last version with dictionary-logic and no db-functionality: v1.5.7
This release makes old queries incompatible
Full Changelog: 1.5.0...2.0.0
Removed the use of a dictionary and replaced it with sqlite. This was in reference to #20, where I realised that, due to the amount of information this tool processes and provides, a very large query would crash because the system ran out of memory.
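As a rough illustration of why this helps, here is a minimal sketch of the dict-to-sqlite idea (the table and column names are assumptions, not the tool's actual schema): rows go straight into an on-disk db, so memory stays flat no matter how large the query gets.

```python
import sqlite3

# Minimal sketch of the dict -> sqlite change; table and column names
# are hypothetical, not the tool's actual schema.
def store_snapshots(rows, db_path="query.db"):
    """rows: any iterable of (timestamp, url) tuples, e.g. parsed cdx lines."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS snapshots (timestamp TEXT, url TEXT)")
    # executemany consumes the iterable one row at a time, so millions of
    # results never have to sit in memory at once (unlike a dict of them all)
    con.executemany("INSERT INTO snapshots VALUES (?, ?)", rows)
    con.commit()
    con.close()
```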
Main changes from 1.5.0
- removed the dictionary and replaced it with a sqlite db to handle large jobs - making system memory less important
- removed `--list` because you can simply `cat` the cdx-file
- removed `--json` because for large jobs you have to parse the csv anyways
- removed `--skip` because an existing job will always be resumed until it is finished; or, just in case, use `--reset`
- removed `--cdxbackup` because it has no use after a job is finished; or, just in case, use `--keep`
- removed `--cdxinject` because an existing job will always try to inject an existing cdx-file; or, just in case, use `--reset`
- removed `--auto` because the skip functionality is now the default and `--cdxbackup` and `--cdxinject` were removed
- removed `--csv` because it is now the default output of this tool
- removed `--verbosity` because the progress bar is now `--progress` and `--json` was removed
- removed `--debug` because an error-log is always produced
- added `--keep` to prevent deletion of the `.cdx` and `.db` files after the job has finished
- added `--reset` to reset an interrupted query instead of resuming it
- added `--progress` to show progress bars instead of cli-output
- added `--filetype` to filter snapshots by a specific filetype (`.html` not working for now; see the sketch after this list)
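As an illustration of what a filetype filter does, a minimal sketch matching on the URL path's extension (the function name and matching rule are assumptions, not the tool's implementation):

```python
from urllib.parse import urlparse

# Hypothetical sketch of filetype filtering; not the tool's actual code.
def matches_filetype(snapshot_url: str, filetype: str) -> bool:
    path = urlparse(snapshot_url).path
    return path.lower().endswith("." + filetype.lower().lstrip("."))

urls = ["http://example.com/a.pdf", "http://example.com/b.jpg"]
pdfs = [u for u in urls if matches_filetype(u, "pdf")]  # keeps only a.pdf
```

One plausible reason `.html` is hard: many HTML pages are served from extensionless urls, which a suffix match like this one would miss.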
Behavior changes
- the cdx-results are now streamed into the cdx-file instead of being written out from a variable, to reduce memory-load
- all optional `paths` for commands have been removed - defaults to the output dir
- a query is now identified by its `required` and `optional query` parameters. If an existing query is identified, the download will resume (with a short info-message and the last status of the query). The `manipulate behavior` parameters do not affect that logic (see the sketch after this list)
- the tool will give you a calculation of the snapshots to utilize, based on filtered/resumed/handled/skipped snapshots
- there are 3 progress-bars to help you track the status of very large jobs:
  - download of the cdx-results
  - insertion of the cdx-results into the db
  - download of the archived pages
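To make the resume logic concrete, here is a minimal sketch of how a query identity could be derived from its parameters (the parameter names, hashing scheme and filename are assumptions for illustration, not the tool's actual logic):

```python
import hashlib
import json

# Hypothetical sketch: derive a stable id from the query-defining parameters
# so a re-run with the same parameters resumes the existing job.
def query_id(url: str, query_params: dict) -> str:
    # sort keys so identical parameters always hash to the same id;
    # flags that only tweak behavior (e.g. a progress flag) are excluded
    canonical = json.dumps({"url": url, **query_params}, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

job = query_id("http://example.com", {"filetype": "pdf"})
db_file = f"waybackup.{job}.db"  # resume if this db already exists
```

Deriving the id only from the query-defining parameters is what would let the behavior flags change between runs without breaking the resume.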
Ideas for the future
- merge all jobs into one db file instead of one db per query
- restructure the output dir to the logic `waybackup_snapshots/<query>/domains+subdomains+queryfiles/...` instead of `waybackup_snapshots/domains+subdomains+queryfiles/...` to split queries into exclusive folders
This is the first release with a db. As there have been a lot of changes, there are bound to be some bugs; due to my final thesis at university I have not had much time to find them all, but hopefully most of them are fixed. Do not hesitate to open an issue and please report any bugs or improvements!
1.5.0
#20 made me aware of an issue with very large queries to the cdx server. The snapshots received can easily cause the system to run out of memory, resulting in a crash.
So besides some other changes, there is work in progress to reduce memory-usage (but in exchange for some I/O):
- Results from the cdx server are now streamed into a .cdx file instead of system memory (see the sketch after this list)
- Added a warning (abortable) if the amount of snapshots is on the larger scale
- Added a progress indication for the download status of the cdx-query
- Sometimes the json results of the cdx server were not transferred completely and an exception was thrown. This faulty json data should now be handled appropriately (also #20)
- Added command `--limit` to specify the maximum amount of snapshots to query
- Removed command `--debug`; the error-log will always be written
- #19 fixed an error with timestamp extraction from urls
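A minimal sketch of the streaming idea, assuming `requests` (the query params shown are illustrative, not the tool's full cdx query): the response body goes to disk chunk by chunk, so the snapshot list never has to fit in memory.

```python
import requests

# Sketch of streaming a cdx query to disk instead of holding it in memory.
url = "https://web.archive.org/cdx/search/cdx"
params = {"url": "example.com/*", "output": "json"}

with requests.get(url, params=params, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    with open("query.cdx", "wb") as fh:
        for chunk in resp.iter_content(chunk_size=64 * 1024):
            fh.write(chunk)  # constant memory, regardless of result size
```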
Feel free to submit bug reports at any time!
1.4.0
- #17 `--delay` command can now be used to set a delay between GET requests for each worker
- `--log` command can now be used to write a logfile (also works with `--verbosity progress`)
- changes in the logic of `--verbosity` for a future loglevel setting to manage more/less output
- more changes in argument-handling
- minor fixes / cleanups
1.3.0
1.2.0
- fixed errors if snapshots collide path<->file #3 #4
- fixed errors where a picture was stored as index.html #7
- added url-encoding #4
- prevent redirect loops #4
- fixed SIGINT KeyboardInterrupt prevents csv-file from generating #8
- added custom exception handler
- added `--debug` to log exceptions into an error-log and print out the full traceback instead of a shortened one
- replaced batch-lists with a queue for workers #9 (see the sketch after this list)
- added some cdx-queries from example.com to test
- added `--cdxbackup` and `--cdxinject` to either store a cdx query for later use or use a backup
- added `--skip` -> an existing csv-file will be used to check for already downloaded snapshots
- changed user-agent to give archive.org the possibility to know who is scraping #11
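For context on the batch-list to queue change, a minimal sketch of the pattern (worker count, names and the `download` stub are illustrative, not the tool's actual code): idle workers pull the next snapshot from a shared queue instead of being stuck with a fixed pre-sliced batch.

```python
import queue
import threading

def download(snapshot) -> None:
    print("downloading", snapshot)  # stand-in for the real per-snapshot work

def worker(q: queue.Queue) -> None:
    while True:
        snapshot = q.get()
        if snapshot is None:  # sentinel: no more work
            break
        download(snapshot)
        q.task_done()

q = queue.Queue()
threads = [threading.Thread(target=worker, args=(q,)) for _ in range(4)]
for t in threads:
    t.start()
for snap in ["s1", "s2", "s3"]:
    q.put(snap)
for _ in threads:
    q.put(None)  # one sentinel per worker
for t in threads:
    t.join()
```

With pre-sliced batches, one slow worker can hold its whole remaining slice while the others sit idle; a shared queue balances the load automatically.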
1.0.3
v1.0.2
- fixed paths for win #2 #1 (see the sketch after this list):
  - stripping ports from domain (:80 :443 ...) to prevent WinError
  - stripping mailto-prefixes to prevent WinError
  - changed url-parsing to prevent the case where subdir==filename caused WinError
- url-encoded spaces in filenames are now decoded #1
- clarified current-path structure in readme; changes may come in the future #1
- optimized the parsing of cdx-query to keep inside a requested path
- increased performance of collection-creation for very large requests
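A hedged sketch of the kind of sanitizing these Windows fixes describe (the function name and regexes are assumptions for illustration, not the tool's actual code):

```python
import re
from urllib.parse import unquote

# Hypothetical sketch of the Windows path fixes listed above.
def sanitize_for_windows(url_part: str) -> str:
    url_part = re.sub(r"^mailto:", "", url_part)  # strip mailto-prefix
    url_part = re.sub(r":\d+$", "", url_part)     # strip port (:80, :443, ...)
    return unquote(url_part)                      # decode %20 etc. in filenames

print(sanitize_for_windows("example.com:443"))    # -> example.com
print(sanitize_for_windows("file%20name.html"))   # -> file name.html
```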
first release
Changes to beta:
- `--worker` changed to `--workers`
- `--csv` appends the requested url to the filename to prevent overwriting
- cleanup README
- cleanup HELP