Simpler download of databases and more robust COG2KO conversion

@iquasere iquasere released this 28 Dec 11:47
· 17 commits to master since this release

Much simpler download of databases

Previously, reCOGnizer relied on the --download-resources and --skip-downloaded parameters to set up its databases.

--download-resources instructed reCOGnizer to download the files required for its execution, and --skip-downloaded instructed it to skip files that had already been downloaded, which helped when a single file had been removed by mistake.

Now, reCOGnizer relies on the recognizer_dwnl.timestamp file to check whether the databases have already been downloaded. If the file exists, it skips installation. If the file doesn't exist, reCOGnizer removes all available files and downloads everything again.
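The timestamp check described above can be sketched roughly as follows. This is only an illustration, not reCOGnizer's actual code: the function name `ensure_databases`, the `download` callback and the choice to wipe the whole resources directory are assumptions made for the example; only the recognizer_dwnl.timestamp file name comes from the release.

```python
import shutil
from pathlib import Path


def ensure_databases(resources_dir, download):
    """Download databases only when the timestamp file is missing.

    `download` is a hypothetical callback that fetches every database
    into the given directory. Returns True if a download was performed,
    False if it was skipped.
    """
    resources = Path(resources_dir)
    timestamp = resources / "recognizer_dwnl.timestamp"
    if timestamp.exists():
        return False  # databases were already set up, nothing to do
    # No timestamp: assume the directory is incomplete and start over.
    # (Removing the whole directory is an assumption of this sketch.)
    if resources.exists():
        shutil.rmtree(resources)
    resources.mkdir(parents=True)
    download(resources)
    # Written last, so an interrupted download is retried on the next run.
    timestamp.write_text("downloaded\n")
    return True
```

Writing the timestamp only after the download finishes means a partially downloaded directory is never mistaken for a complete one.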

COG2KO conversion more reliable

Previously, reCOGnizer built the cog2ko conversion as a collection of all KOs available for each protein mapping to the specific COG.

Now, reCOGnizer uses an approach similar to the cog2ec conversion: it only assigns a KO to a COG when over half of the instances of that COG have that particular KO.

This yields a more reliable COG2KO conversion, while keeping KOs for a considerable number of COGs.

This release also removes the intermediate ssv files written during construction of the cog2ko database.
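As a rough illustration of the "over half of instances" rule, the sketch below builds a COG to KO mapping from per-protein annotations. The function name `build_cog2ko`, the input shape (one `(cog, kos)` pair per protein) and the identifiers in the usage note are assumptions made for this example, not reCOGnizer's internal data structures.

```python
from collections import Counter, defaultdict


def build_cog2ko(protein_annotations):
    """Keep a KO for a COG only if more than half of that COG's proteins carry it.

    `protein_annotations` is an iterable of (cog, kos) pairs, one pair per
    protein, where `kos` is a collection of KO identifiers (possibly empty).
    """
    instances = Counter()             # number of proteins seen per COG
    ko_counts = defaultdict(Counter)  # per COG: number of proteins carrying each KO
    for cog, kos in protein_annotations:
        instances[cog] += 1
        for ko in set(kos):           # count each KO at most once per protein
            ko_counts[cog][ko] += 1

    cog2ko = {}
    for cog, counts in ko_counts.items():
        majority = sorted(ko for ko, n in counts.items() if n > instances[cog] / 2)
        if majority:                  # COGs without a majority KO get no entry
            cog2ko[cog] = majority
    return cog2ko
```

With made-up input where two of three COG0001 proteins carry K01845 and the two COG0002 proteins disagree, only COG0001 receives a KO; the old "collect everything" approach would have kept all four KOs.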

New parameters --test-run and --output-rpsbproc-columns will usually not be needed

The --test-run parameter had to be implemented as a consequence of the simpler database download. When set, reCOGnizer runs in a non-standard mode, which is required for the tests at GitHub: it moves the cdd.tar.gz file available in the repo and uses it as a valid cdd.tar.gz file.

--output-rpsbproc-columns outputs the Superfamilies, Sites and Motifs columns, which are empty for almost all annotations.

Removed some unnecessary files

recognizer.log was produced in the working directory. It only included rpsblast outputs, mainly for error assessment. Users can obtain that information by running reCOGnizer with the --debug parameter and manually running the faulty commands.

taxonomy.rdf was obtained as part of building taxonomy.tsv. Now, reCOGnizer removes it once it is no longer needed.

Some fixes

reCOGnizer was not reporting the download of files when the --quiet flag was set, except when the files had already been downloaded and it removed them.

Also updated regexes to the raw string format, r'regex'.
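For context, the r'regex' form is Python's raw string literal: backslashes reach the regex engine unchanged, and escapes such as \d no longer trigger the invalid escape sequence warning that newer Python versions emit for plain strings. The pattern and function below are hypothetical, chosen only to illustrate the syntax, and are not taken from reCOGnizer's code.

```python
import re

# Hypothetical pattern for illustration: COG identifiers such as "COG0001".
# The r prefix keeps the backslash in \d intact for the regex engine.
COG_ID = re.compile(r'COG\d{4}')


def find_cogs(text):
    """Return every COG-like identifier found in `text`."""
    return COG_ID.findall(text)
```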