Releases: HobnobMancer/cazy_webscraper
v2.3.0.4
Update URL and fix retrieval of sequences from NCBI-GenBank
v2.3.0.3
Patch for incomplete NCBI reads
As flagged in issue #120, if the connection to NCBI is interrupted or terminated early, an incomplete or corrupted read error is raised. The try/except blocks were updated to catch these incomplete read errors, and cazy_webscraper now retries the connection until either a connection succeeds or the maximum number of reattempts is reached (whichever comes first).
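The retry behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: the names fetch_with_retries, max_retries, wait_seconds and make_flaky_fetch are hypothetical, and max_retries is treated here as the total number of attempts.

```python
import time
from http.client import IncompleteRead


def fetch_with_retries(fetch, max_retries=3, wait_seconds=0.0):
    """Call fetch() until it succeeds or max_retries attempts have failed.

    `fetch` is any zero-argument callable standing in for a request to NCBI.
    """
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return fetch()
        except IncompleteRead as err:
            # The connection was cut short: remember the error and try again.
            last_error = err
            if attempt < max_retries:
                time.sleep(wait_seconds)
    raise ConnectionError(f"gave up after {max_retries} attempts") from last_error


def make_flaky_fetch(failures):
    """Return a stand-in fetch that raises IncompleteRead `failures` times, then succeeds."""
    state = {"calls": 0}

    def fetch():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise IncompleteRead(b"partial")
        return "complete record"

    return fetch
```

For example, fetch_with_retries(make_flaky_fetch(2), max_retries=3) returns the record on the third attempt, while a fetch that keeps failing raises ConnectionError once the attempts are used up.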
What's Changed
- Issue 120 ncbi by @HobnobMancer in #126
Full Changelog: v2.3.0.2...v2.3.0.3
v2.3.0.2
Minor patch
Bug Fix
Fixes crashing when retrieving the latest taxonomy data from NCBI for CAZyme records that are associated with multiple taxa in CAZy.
- Catches and handles RunTime, NotXML and IncompleteRead errors
What's Changed
- Doc update by @HobnobMancer in #116
- Update config.yml by @HobnobMancer in #117
- Catch incomplete read error by @HobnobMancer in #121
Full Changelog: v2.3.0...v2.3.0.2
v2.3.0
What's Changed
- Issue 111 + 112 uniprot by @HobnobMancer in #115
Full Changelog: v2.2.8...v2.3.0
New in version 2.3.0
- Downloading protein data from UniProt is several orders of magnitude faster than before, and should have fewer issues when using older versions of bioservices
  - Uses bioservices mapping to map directly from NCBI protein version accessions to UniProt
  - cw_get_uniprot_data no longer calls NCBI and thus no longer requires an email address as a positional argument
- Updated database schema: changed Genbanks 1--* Uniprots to Genbanks *--1 Uniprots. Uniprots.uniprot_id is now listed in the Genbanks table, instead of listing Genbanks.genbank_id in the Uniprots table
- Retrieve taxonomic classifications from UniProt
  - Use the --taxonomy/-t flag to retrieve the scientific name (genus and species) for proteins of interest
  - Adds downloaded taxonomic information to the UniprotsTaxs table
- Clarified the deletion of old records when using cw_get_uniprot_data
  - Separate arguments delete Genbanks-EC number and Genbanks-PDB accession relationships that are no longer listed in UniProt, for those proteins in the local CAZyme database whose data is downloaded from UniProt
  - New args:
    - --delete_old_ec_relationships = deletes Genbank(protein)-EC number relationships no longer in UniProt
    - --delete_old_ecs = deletes EC numbers in the local db not linked to any proteins
    - --delete_old_pdb_relationships = deletes Genbank(protein)-PDB relationships no longer in UniProt
    - --delete_old_pdbs = deletes PDB accessions in the local db not linked to any proteins
- Retrieve the local db schema
  - New command cw_get_db_schema added
  - Retrieves the SQLite schema of a local CAZyme database and prints it to the terminal
- Added option to skip retrieving the latest taxonomic classifications from NCBI
  - By default, when retrieving data from CAZy, cazy_webscraper retrieves the latest taxonomic classifications for proteins listed under multiple taxa
  - To reduce scraping time and the burden on the NCBI-Entrez server, this step can be skipped with the new --skip_ncbi_tax flag if this data is not needed (e.g. GTDB taxonomies will be used)
  - When skipping retrieval of the latest taxonomic classifications from NCBI, cazy_webscraper adds the first taxon retrieved from CAZy for those proteins listed under multiple taxa
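Two of the changes listed above, the Genbanks *--1 Uniprots relationship and printing a database's SQLite schema, can be illustrated together with Python's built-in sqlite3 module. This is only a sketch under stated assumptions: the table names and the uniprot_id and genbank_id columns come from these notes, while the accession columns and the function names are hypothetical. Reading sqlite_master is a standard way to inspect a SQLite schema, but whether cw_get_db_schema works this way is an assumption.

```python
import sqlite3


def build_example_db():
    """Create an in-memory database using the many-to-one layout.

    Each Genbanks row points at one Uniprots row through uniprot_id.
    The accession columns are hypothetical and only there for illustration.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Uniprots (
            uniprot_id INTEGER PRIMARY KEY,
            uniprot_accession TEXT UNIQUE
        );
        CREATE TABLE Genbanks (
            genbank_id INTEGER PRIMARY KEY,
            genbank_accession TEXT UNIQUE,
            uniprot_id INTEGER REFERENCES Uniprots (uniprot_id)
        );
        """
    )
    return conn


def get_schema(conn):
    """Return the CREATE TABLE statements stored in SQLite's catalogue."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]
```

With this layout several Genbanks rows can share one uniprot_id, and calling print on each statement returned by get_schema(conn) writes the schema to the terminal.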
v2.2.8
Bugs and improvements
- Addresses issue of incomplete retrieval of taxonomy data from NCBI
- Process of retrieving taxonomy data is faster
- PR #113
What's Changed
- add not on cw_get_uniprot before cw_get_pdb by @HobnobMancer in #112
- Add batch ncbi tax by @HobnobMancer in #113
Full Changelog: v2.2.7...v2.2.8
v2.2.7
Fixing bugs when downloading seqs from NCBI:
- Issue #109
- Adding missing args to func calls
- Accept UniProt-style accessions and the non-standard accession formats used by NCBI
- Combine cached seqs with newly downloaded seqs, so multiple caches no longer need to be combined manually if the download is interrupted multiple times
- Remove unused args from func returns
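Combining cached and newly downloaded sequences, as mentioned in the list above, could look something like the sketch below. It uses plain Python rather than the project's own code, and parse_fasta, merge_sequences and to_fasta are hypothetical helper names.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict of accession -> sequence."""
    records = {}
    accession = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header line as the accession.
            accession = line[1:].split()[0]
            records[accession] = ""
        elif accession is not None:
            records[accession] += line
    return records


def merge_sequences(cached, downloaded):
    """Combine cached and new records; a new record replaces a cached one."""
    merged = dict(cached)
    merged.update(downloaded)
    return merged


def to_fasta(records):
    """Serialise accession -> sequence pairs back to FASTA text."""
    return "".join(f">{acc}\n{seq}\n" for acc, seq in records.items())
```

After an interrupted download, the cache file can be parsed, merged with the next batch of results, and written back out as a single FASTA file.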
What's Changed
- Issue 109 by @HobnobMancer in #110
Full Changelog: v2.2.6...v2.2.7
v2.2.6
Fix cazy_webscraper crashing from missing arguments to function calls when retrieving the latest taxonomic classifications for proteins, when a batch of protein IDs contains an invalid ID:
Traceback (most recent call last):
File "...bin/cazy_webscraper", line 33, in <module>
sys.exit(load_entry_point('cazy-webscraper', 'console_scripts', 'cazy_webscraper')())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".../cazy_webscraper/cazy_scraper.py", line 246, in main
get_cazy_data(
File "...//cazy_webscraper/cazy_scraper.py", line 355, in get_cazy_data
cazy_data, successful_replacement = replace_multiple_tax(
^^^^^^^^^^^^^^^^^^^^^
File ".../cazy_webscraper/ncbi/taxonomy/multiple_taxa.py", line 135, in replace_multiple_tax
cazy_data, success = replace_multiple_tax_with_invalid_ids(cazy_data, args)
What's Changed
- Fix tax invalid ids by @HobnobMancer in #108
Full Changelog: v2.2.5...v2.2.6
v2.2.5
Fix an import error, introduced in version 2.2.4, that occurred when retrieving protein sequences from NCBI
What's Changed
- Update imports by @HobnobMancer in #107
Full Changelog: v2.2.4...v2.2.5
v2.2.4
What's Changed
- Issue 99 by @HobnobMancer in #102
Fix Issue #99: Improve handling of errors encountered when retrieving data from NCBI
- Separate invalid IDs from IDs affected by failed connections
- Parse batches containing invalid IDs separately from, and before, batches with failed connections
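The separation described in the two points above might be sketched as follows. This is an illustration under assumptions, not the project's implementation: InvalidIdError, sort_batches, isolate_invalid_ids and fake_fetch are all hypothetical names, and fake_fetch merely stands in for a batch query to NCBI.

```python
class InvalidIdError(Exception):
    """Hypothetical error: the batch contained an ID the server rejects."""


def sort_batches(batches, fetch):
    """Run each batch once, sorting failures by their cause."""
    results, invalid_id_batches, failed_connection_batches = {}, [], []
    for batch in batches:
        try:
            results.update(fetch(batch))
        except InvalidIdError:
            invalid_id_batches.append(batch)
        except ConnectionError:
            failed_connection_batches.append(batch)
    return results, invalid_id_batches, failed_connection_batches


def isolate_invalid_ids(batch, fetch):
    """Query IDs one at a time so a single bad ID no longer sinks the batch."""
    results, invalid_ids = {}, []
    for protein_id in batch:
        try:
            results.update(fetch([protein_id]))
        except InvalidIdError:
            invalid_ids.append(protein_id)
    return results, invalid_ids


def fake_fetch(batch):
    """Stand-in for a remote call: 'BAD' is invalid, 'DOWN' drops the connection."""
    if "BAD" in batch:
        raise InvalidIdError("invalid ID in batch")
    if "DOWN" in batch:
        raise ConnectionError("connection failed")
    return {protein_id: f"seq_{protein_id}" for protein_id in batch}
```

Batches returned in invalid_id_batches would be passed through isolate_invalid_ids first, and only then would the batches in failed_connection_batches be retried.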
Downloaded protein sequences are cached to a FASTA file.
Updated information in the docs on caching.
Full Changelog: v2.2.3...v2.2.4
v2.2.3
Addresses issue #100: failure to retrieve data from UniProt, owing to changes to the UniProt API.
Also alters the methods for mapping UniProt accessions to GenBank accessions, including a more robust method for assigning data from UniProt to the correct protein in the local CAZyme database.
What's Changed
- Issue 100 uniprot by @HobnobMancer in #103
Full Changelog: v2.2.2...v2.2.3