
Releases: HobnobMancer/cazy_webscraper

v2.3.0.4

19 Nov 16:43

Update URL and fix retrieval of sequences from NCBI-GenBank

v2.3.0.3

25 Mar 08:37
9d50e5c

Patch for incomplete NCBI reads

As flagged in issue #120, if the connection to NCBI is interrupted or terminated early, an incomplete or corrupted read error is raised. The try/except blocks were updated to catch these incomplete read errors, and cazy_webscraper now retries the connection until either a successful connection is made or the maximum number of reattempts is reached (whichever comes first).
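
The retry behaviour described above can be sketched roughly as follows. This is an illustration only, not the cazy_webscraper implementation: `fetch_with_retries`, its `fetch` callable and the `retries`/`delay` parameters are hypothetical names, and it assumes the interrupted read surfaces as `http.client.IncompleteRead` from the Python standard library.

```python
import http.client
import time


def fetch_with_retries(fetch, retries=3, delay=0.0):
    """Call fetch() until it succeeds or the retry limit is reached.

    `fetch` is any zero-argument callable that performs the network request
    (hypothetical; stands in for a call to NCBI).
    Returns the result, or None if every attempt raised IncompleteRead.
    """
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except http.client.IncompleteRead:
            # The connection dropped mid-response: pause, then try again
            if attempt < retries:
                time.sleep(delay)
    return None
```

Returning None once the limit is reached lets the caller decide whether to log the failure and skip the batch or stop altogether.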

What's Changed

Full Changelog: v2.3.0.2...v2.3.0.3

v2.3.0.2

23 Jan 09:13
c8b07ef

Minor patch

Bug Fix
Fixes crashing when retrieving the latest taxonomy data from NCBI for CAZyme records that are associated with multiple taxa in CAZy.

  • Catches and handles RunTime, NotXML and IncompleteRead errors

What's Changed

Full Changelog: v2.3.0...v2.3.0.2

v2.3.0

23 May 14:30
b720c81

What's Changed

Full Changelog: v2.2.8...v2.3.0

New in version 2.3.0

  • Downloading protein data from UniProt is several orders of magnitude faster than before, and should have fewer issues when using older versions of bioservices

    • Uses bioservices mapping to map directly from NCBI protein version accession to UniProt
    • cw_get_uniprot_data no longer calls NCBI and thus no longer requires an email address as a positional argument
  • Updated database schema: Changed Genbanks 1--* Uniprots to Genbanks *--1 Uniprots. Uniprots.uniprot_id is now listed in the Genbanks table, instead of listing Genbanks.genbank_id in the Uniprots table

  • Retrieve taxonomic classifications from UniProt

    • Use the --taxonomy/-t flag to retrieve the scientific name (genus and species) for proteins of interest
    • Adds downloaded taxonomic information to the UniprotsTaxs table
  • Clearer options for deleting old records when using cw_get_uniprot_data

    • Separate arguments to delete Genbanks-EC number and Genbanks-PDB accession relationships that are no longer listed in UniProt. These apply to proteins in the local CAZyme database for which data is downloaded from UniProt
    • New args:
      • --delete_old_ec_relationships = deletes Genbank(protein)-EC number relationships no longer in UniProt
      • --delete_old_ecs = deletes EC numbers in the local db not linked to any proteins
      • --delete_old_pdb_relationships = deletes Genbank(protein)-PDB relationships no longer in UniProt
      • --delete_old_pdbs = deletes PDB accessions in the local db not linked to any proteins
  • Retrieve the local db schema

    • New command cw_get_db_schema added.
    • Retrieves the SQLite schema of a local CAZyme database and prints it to the terminal
  • Added option to skip retrieving the latest taxonomic classifications from NCBI

    • By default, when retrieving data from CAZy, cazy_webscraper retrieves the latest taxonomic classifications for proteins listed under multiple taxa
    • To reduce scraping time, and to reduce the burden on the NCBI-Entrez server, this step can be skipped with the new --skip_ncbi_tax flag if the data is not needed (e.g. GTDB taxonomies will be used).
    • When skipping retrieval of the latest taxonomic classifications from NCBI, cazy_webscraper adds the first taxon retrieved from CAZy for those proteins listed under multiple taxa
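
The schema change listed above (Genbanks *--1 Uniprots, with Uniprots.uniprot_id stored in the Genbanks table) can be pictured with a small in-memory SQLite sketch. Only the table names and the genbank_id/uniprot_id columns come from these notes; the accession columns and the example accessions are made up for illustration, and the real schema contains more tables and columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Uniprots (
        uniprot_id INTEGER PRIMARY KEY,
        uniprot_accession TEXT  -- hypothetical column name
    );
    CREATE TABLE Genbanks (
        genbank_id INTEGER PRIMARY KEY,
        genbank_accession TEXT,  -- hypothetical column name
        -- many Genbanks rows may point at one Uniprots row
        uniprot_id INTEGER REFERENCES Uniprots (uniprot_id)
    );
    """
)

# Made-up accessions: two GenBank proteins share one UniProt record
conn.execute("INSERT INTO Uniprots VALUES (1, 'P00001')")
conn.executemany(
    "INSERT INTO Genbanks VALUES (?, ?, ?)",
    [(1, "AAA00001.1", 1), (2, "AAA00002.1", 1), (3, "AAA00003.1", None)],
)

# Look up the UniProt record (if any) for every GenBank protein
rows = conn.execute(
    """
    SELECT g.genbank_accession, u.uniprot_accession
    FROM Genbanks AS g
    LEFT JOIN Uniprots AS u ON g.uniprot_id = u.uniprot_id
    ORDER BY g.genbank_id
    """
).fetchall()
```

With the foreign key on the Genbanks side, queries written against the older layout (Genbanks.genbank_id stored in Uniprots) need their join condition reversed.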

v2.2.8

26 Apr 13:09
5602f88

Bugs and improvements

  • Addresses issue of incomplete retrieval of taxonomy data from NCBI
  • Process of retrieving taxonomy data is faster
  • PR #113

What's Changed

Full Changelog: v2.2.7...v2.2.8

v2.2.7

16 Jan 12:12
5bbdab8

Fixing bugs when downloading seqs from NCBI:

  • Issue #109
  • Adding missing args to func calls
  • Accept UniProt-style accessions and non-standard NCBI accession formats that are used by NCBI
  • Combine cached seqs with newly downloaded seqs, so multiple caches no longer need to be combined manually if the download is interrupted several times
  • Remove unused args from func returns
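
As a rough picture of combining cached and newly downloaded sequences, the sketch below merges two FASTA texts keyed by sequence ID. It is a minimal standard-library illustration, not the code used by cazy_webscraper; `parse_fasta` and `merge_caches` are hypothetical helper names, and letting the newer record win on a duplicate ID is an assumption.

```python
def parse_fasta(text):
    """Return {sequence ID: sequence} parsed from FASTA-formatted text."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The ID is the first whitespace-delimited token of the header
            fields = line[1:].split()
            current_id = fields[0] if fields else None
            if current_id is not None:
                records[current_id] = ""
        elif current_id is not None:
            records[current_id] += line
    return records


def merge_caches(cached_text, new_text):
    """Combine cached and new records; a new record replaces a cached one."""
    merged = parse_fasta(cached_text)
    merged.update(parse_fasta(new_text))
    return merged
```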

What's Changed

Full Changelog: v2.2.6...v2.2.7

v2.2.6

12 Jan 15:20
0672fab

Fix cazy_webscraper crashing, owing to missing arguments in function calls, when retrieving the latest taxonomic classifications for proteins and a batch of protein IDs contains an invalid ID:

Traceback (most recent call last):
  File "...bin/cazy_webscraper", line 33, in <module>
    sys.exit(load_entry_point('cazy-webscraper', 'console_scripts', 'cazy_webscraper')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../cazy_webscraper/cazy_scraper.py", line 246, in main
    get_cazy_data(
  File "...//cazy_webscraper/cazy_scraper.py", line 355, in get_cazy_data
    cazy_data, successful_replacement = replace_multiple_tax(
                                        ^^^^^^^^^^^^^^^^^^^^^
  File ".../cazy_webscraper/ncbi/taxonomy/multiple_taxa.py", line 135, in replace_multiple_tax
    cazy_data, success = replace_multiple_tax_with_invalid_ids(cazy_data, args)

What's Changed

Full Changelog: v2.2.5...v2.2.6

v2.2.5

12 Jan 14:59
d7f773a

Fix the import error, introduced in version 2.2.4, that occurred when retrieving protein sequences from NCBI

What's Changed

Full Changelog: v2.2.4...v2.2.5

v2.2.4

11 Jan 16:09
d8b34b8

What's Changed

Fix Issue #99: Improve handling when incurring errors when retrieving data from NCBI

  1. Separate invalid IDs from IDs that suffered failed connections
  2. Parse batches containing invalid IDs separately from, and before, batches with failed connections
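
The two steps above might look roughly like the sketch below. It is a simplified illustration under stated assumptions rather than the actual cazy_webscraper logic: `InvalidIdError`, `query_batches`, `retrieve` and the `query` callable are hypothetical names, and a failed connection is assumed to raise Python's built-in `ConnectionError`.

```python
class InvalidIdError(Exception):
    """Hypothetical error: the server rejected a batch containing a bad ID."""


def query_batches(batches, query):
    """Run query(batch) on each batch and sort the failures by cause.

    Returns (results, invalid_id_batches, failed_connection_batches).
    """
    results, invalid, failed = [], [], []
    for batch in batches:
        try:
            results.extend(query(batch))
        except InvalidIdError:
            invalid.append(batch)
        except ConnectionError:
            failed.append(batch)
    return results, invalid, failed


def retrieve(batches, query):
    """Return (results, invalid IDs, batches that still failed)."""
    results, invalid, failed = query_batches(batches, query)

    # 1. Batches with invalid IDs first: query one ID at a time to isolate
    #    the bad IDs while keeping data for the valid ones
    bad_ids = []
    for batch in invalid:
        for protein_id in batch:
            try:
                results.extend(query([protein_id]))
            except InvalidIdError:
                bad_ids.append(protein_id)
            except ConnectionError:
                failed.append([protein_id])

    # 2. Only then retry the batches whose connection failed (once, here)
    still_failed = []
    for batch in failed:
        try:
            results.extend(query(batch))
        except (InvalidIdError, ConnectionError):
            still_failed.append(batch)
    return results, bad_ids, still_failed
```

Handling the invalid-ID batches first means a single bad accession is isolated once, so it cannot keep failing whole batches during the connection retries.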

Downloaded protein sequences are cached to a FASTA file.

Updated information in the docs on caching.

Full Changelog: v2.2.3...v2.2.4

v2.2.3

09 Dec 15:44
47af82a

Address issue #100 with failing to retrieve data from UniProt, owing to changes to the UniProt API.

Also alters the methods for mapping UniProt accessions to GenBank accessions, including a more robust method for assigning data from UniProt to the correct protein in the local CAZyme database.

What's Changed

Full Changelog: v2.2.2...v2.2.3