
Use curies package for default implementation #23

Merged
sierra-moxon merged 8 commits into prefixcommons:master from externalize-impl on Aug 12, 2022

Conversation

cthoyt
Contributor

@cthoyt cthoyt commented Aug 10, 2022

Closes #5

This PR uses the curies package to provide a much faster default implementation of CURIE expansion and URI contraction, based on a trie. It does not cover the case where users bring their own custom prefix maps, since the trie has to be built ahead of time.
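For readers unfamiliar with the approach, here is a minimal sketch of trie-based contraction. It is a hand-rolled illustration, not the curies implementation: the `PrefixTrie` class, the `contract` helper, and the two-entry prefix map are all hypothetical names chosen for this example.

```python
from typing import Dict, Optional, Tuple

_END = ""  # marker key: a URI prefix ends at this node


class PrefixTrie:
    """Character trie over URI prefixes (illustrative, not the curies internals)."""

    def __init__(self, prefix_map: Dict[str, str]) -> None:
        # The trie is built once, up front; this is the "pre-built" cost.
        self.root: dict = {}
        for curie_prefix, uri_prefix in prefix_map.items():
            node = self.root
            for char in uri_prefix:
                node = node.setdefault(char, {})
            node[_END] = curie_prefix

    def longest_match(self, uri: str) -> Optional[Tuple[str, int]]:
        """Return (CURIE prefix, matched length) for the longest URI prefix."""
        node = self.root
        best: Optional[Tuple[str, int]] = None
        for depth, char in enumerate(uri):
            if _END in node:
                best = (node[_END], depth)  # remember the deepest match so far
            if char not in node:
                return best
            node = node[char]
        if _END in node:
            best = (node[_END], len(uri))
        return best


def contract(uri: str, trie: PrefixTrie) -> Optional[str]:
    """Contract a URI to a CURIE, or return None if no prefix matches."""
    match = trie.longest_match(uri)
    if match is None:
        return None
    curie_prefix, length = match
    return f"{curie_prefix}:{uri[length:]}"


# Hypothetical prefix map with two overlapping URI prefixes.
PREFIX_MAP = {
    "OBO": "http://purl.obolibrary.org/obo/",
    "GO": "http://purl.obolibrary.org/obo/GO_",
}
TRIE = PrefixTrie(PREFIX_MAP)

print(contract("http://purl.obolibrary.org/obo/GO_0008150", TRIE))  # GO:0008150
```

A lookup walks the URI once, so its cost depends on the URI length rather than on the number of prefixes, which is where the speed-up over scanning every prefix map comes from.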

@cthoyt cthoyt marked this pull request as ready for review August 10, 2022 13:46
@cthoyt cthoyt marked this pull request as draft August 10, 2022 13:49
@cthoyt cthoyt marked this pull request as ready for review August 10, 2022 22:04
@codecov-commenter

codecov-commenter commented Aug 10, 2022

Codecov Report

Merging #23 (4987c55) into master (9d6808a) will decrease coverage by 0.39%.
The diff coverage is 66.66%.

@@            Coverage Diff             @@
##           master      #23      +/-   ##
==========================================
- Coverage   82.20%   81.81%   -0.40%     
==========================================
  Files           5        5              
  Lines         163      176      +13     
==========================================
+ Hits          134      144      +10     
- Misses         29       32       +3     
| Impacted Files | Coverage Δ |
|---|---|
| prefixcommons/curie_util.py | 71.42% <66.66%> (+0.84%) ⬆️ |


@cthoyt
Contributor Author

cthoyt commented Aug 10, 2022

@sierra-moxon @kshefchek this PR now replaces the default implementation with a fast one based on tries. It doesn't change behavior when custom dictionaries are passed. Ideally, those should be pre-processed into the trie structure to take advantage of its speed-up.

def get_prefixes(cmaps: Optional[List[PREFIX_MAP]] = None) -> List[str]:
    if cmaps is None:
        cmaps = default_curie_maps
        return sorted(default_converter.get_prefixes())
Collaborator

Hi @cthoyt - does this change the order of the prefix maps from what they were before?

Contributor Author

It appeared that the existing order was random (or was based on implicit assumptions about Python data structures), so I don't think there's a satisfying answer to your question.

Contributor Author

@cthoyt cthoyt Aug 11, 2022

Additionally, prefixes could have been duplicated, since the previous implementation's logic was to just extend a list. That doesn't make a lot of sense to me either.

Collaborator

Agree; we can definitely make the code better. Is there a problem you are trying to solve for the sorted addition?

Contributor Author

@cthoyt cthoyt Aug 11, 2022

The curies package returns a set, which seems more meaningful since prefixes don't have an inherent ordering. However, this package expects a list, so sorting the set to deterministically turn it into a list seemed reasonable.
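To make the ordering options in this thread concrete, here is a small sketch. The two prefix maps and the `https://example.org/...` URI are made up for illustration, and the `extend` loop is only an assumption about what "extend a list" looked like in the previous implementation.

```python
from typing import Dict, List

# Hypothetical prefix maps that share the CHEBI prefix.
cmaps: List[Dict[str, str]] = [
    {
        "GO": "http://purl.obolibrary.org/obo/GO_",
        "CHEBI": "http://purl.obolibrary.org/obo/CHEBI_",
    },
    {
        "CHEBI": "https://example.org/chebi/",
        "MONDO": "http://purl.obolibrary.org/obo/MONDO_",
    },
]

# Extending a list keeps the map order but repeats shared prefixes.
extended: List[str] = []
for cmap in cmaps:
    extended.extend(cmap)  # iterating a dict yields its keys
print(extended)  # ['GO', 'CHEBI', 'CHEBI', 'MONDO']

unique = set(extended)

# sorted() gives a deterministic, alphabetical list.
as_sorted = sorted(unique)
print(as_sorted)  # ['CHEBI', 'GO', 'MONDO']

# list() on a set has the same elements, but the order is not guaranteed.
as_list = list(unique)

# De-duplicating via dict keys keeps the first-seen order instead.
first_seen = list(dict.fromkeys(extended))
print(first_seen)  # ['GO', 'CHEBI', 'MONDO']
```

Which of these is right depends on whether downstream code relies on any particular order, which is the question raised in this thread.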

Collaborator

Makes good sense. However, if we leave off the sorted, we might have a better chance of preserving the serendipitous effects here? My worry is downstream dependencies that rely on a specific ordering: I've had cases where I tried to reorder the many different contexts to take advantage of one or the other's more complete or updated maps, only to find that it broke project assumptions about prefixes, etc. Tangential to this PR perhaps, but it would be good to know our strategy for aligning this codebase with bioregistry. #24

Contributor Author

So rather than sorted, just do list? That's fine with me if you think it will work.

Contributor Author

I made this update in 4987c55

@@ -136,7 +143,15 @@ def contract_uri(

    """
    if cmaps is None:
        cmaps = default_curie_maps
        # TODO warn if not shortest?
Contributor Author

i.e., this implementation is not compatible with the argument shortest=False, since a trie lookup only returns the longest matching URI prefix.
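A rough sketch of the difference, under the assumption that shortest=False means "return every candidate CURIE" rather than only the best one. The helper names `contract_all` and `contract_longest` and the prefix map are hypothetical and are not part of prefixcommons or curies.

```python
from typing import Dict, List, Optional

# Hypothetical prefix map with two overlapping URI prefixes.
PREFIX_MAP: Dict[str, str] = {
    "GO": "http://purl.obolibrary.org/obo/GO_",
    "OBO": "http://purl.obolibrary.org/obo/",
}


def contract_all(uri: str, prefix_map: Dict[str, str]) -> List[str]:
    """Linear scan: one candidate CURIE per URI prefix that matches."""
    return [
        f"{prefix}:{uri[len(uri_prefix):]}"
        for prefix, uri_prefix in prefix_map.items()
        if uri.startswith(uri_prefix)
    ]


def contract_longest(uri: str, prefix_map: Dict[str, str]) -> Optional[str]:
    """Keep only the longest matching URI prefix, like a trie lookup would."""
    matches = [
        (len(uri_prefix), prefix)
        for prefix, uri_prefix in prefix_map.items()
        if uri.startswith(uri_prefix)
    ]
    if not matches:
        return None
    length, prefix = max(matches)
    return f"{prefix}:{uri[length:]}"


uri = "http://purl.obolibrary.org/obo/GO_0008150"
print(contract_all(uri, PREFIX_MAP))      # ['GO:0008150', 'OBO:GO_0008150']
print(contract_longest(uri, PREFIX_MAP))  # GO:0008150
```

The longest URI prefix yields the shortest CURIE, so a longest-match lookup can stand in for the shortest case but has no way to produce the other candidates.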

@sierra-moxon sierra-moxon self-requested a review August 12, 2022 14:41
@sierra-moxon sierra-moxon merged commit d4dbc51 into prefixcommons:master Aug 12, 2022
@cthoyt cthoyt deleted the externalize-impl branch August 12, 2022 15:07
Development

Successfully merging this pull request may close these issues.

Use trie for iri to curie conversion
3 participants