-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature validate blocking schemas #140
Conversation
Codecov Report
@@ Coverage Diff @@
## master #140 +/- ##
==========================================
- Coverage 98.33% 97.63% -0.70%
==========================================
Files 14 18 +4
Lines 539 634 +95
==========================================
+ Hits 530 619 +89
- Misses 9 15 +6 |
6a1f0ee
to
0372092
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pydantic
looks really nice.
Given that we define and validate user-provided configurations, what's best practice to provide documentation for the users? Is there some tooling we can use?
@@ -72,14 +71,14 @@ def generate_candidate_blocks(data: Sequence[Tuple[str, ...]], signature_config: | |||
|
|||
""" | |||
# validate config of blocking | |||
validate_signature_config(signature_config) | |||
blocking_schema = validate_signature_config(signature_config) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this line is confusing. Is signature_config
a blocking schema? Why don't we name it that then? Also, the function is called valdiate
but returns the schema as well? How about naming it read_and_validate_blocking_schema
, or something along those lines?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree the names could be better.
generate_candidate_blocks
currently takes a Python dict version of the blocking schema - assumed to have been sourced from a JSON file. We validate that schema getting back a BlockingSchemaModel
which is an instance of a Pydantic model, very similar to a dataclass.
read_and_validate_blocking_schema
sounds like it reads a file to me, I'll rename to validate_blocking_schema
for now.
I'll change the argument to blocking_schema
instead of signature_config
, and make it more explicit with blocking_model
for the pydantic model.
Changed in e87349b
I'd actually like to make generate_candidate_blocks
take a Union[BlockingSchemaModel, Dict]
- thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find it a bit smelly to use the Union[BlockingSchemaModel, Dict]
as an argument for generate_candidate_blocks
. Instead, I would only accept BlockingSchemaModel
and make sure that BlockingSchemaModel
has appropriate helper functions to read the model from Dict
.
In fact, I would make all the internal functions only accept the pydantic models. If I am not mistaken, then pydantic offers helper functions to generate models from dicts/json.
It comes down to a separation of concerns thing.
Yes, that breaks the current api a little, but we would end up with something a lot cleaner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact, I would make all the internal functions only accept the pydantic models.
I'd be very much in favor of that. But surely generate_candidate_blocks
isn't an internal function? Anonlink client uses it directly for example.
Happy to implement what ever you want here though, break the external api and force blocklib
users to pass blocklib's pydantic models for configuration? Please confirm before I do that.
If I am not mistaken, then pydantic offers helper functions to generate models from dicts/json.
Yes, of course:Model.parse_obj(dictionary_data)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if we outsourced the magic conversion to https://pydantic-docs.helpmanual.io/usage/validation_decorator/
That (provisional api) looks like it might validate and convert for us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know, the magic converter looks nice, but, it's still beta and also feels a bit like a hack.
For the generate_candidate_blocks
and anonlink-client. You can see that it loads the config first from json, then does some acrobatics to extract a config value, loads the whole dataset into memory, and then calls blocklib. If it would have properly parsed the config, then the access would have been cleaner and it would fail early if the config is wrong.
I'm still leaning towards only allowing the pydantic models, however, before we make braking changes, we should review the whole code base and make it consistent. This seems more like a version 1.0 kind of activity. :) Thus for now, do your non-breaking change.
@@ -17,27 +17,26 @@ class PPRLIndexLambdaFold(PPRLIndex): | |||
This class includes an implementation of Lambda-fold redundant blocking method. | |||
""" | |||
|
|||
def __init__(self, config: Dict): | |||
def __init__(self, config: Union[LambdaConfig, Dict]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it be cleaner if we only allow LambdaConfig
things? I'm not a big fan of the following instance check with casting. That smells a bit fishy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That wouldn't be compatible with the old API so I'd suggest a deprecation period to continue accepting dict config for a version on two.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parse_obj
is nice in that the errors are really on point.
from blocklib.validation import LambdaConfig
LambdaConfig.parse_obj({})
Traceback (most recent call last):
File "/home/brian/.cache/pypoetry/virtualenvs/blocklib-CJeb9hJb-py3.9/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3343, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-3-a08a6d544657>", line 1, in <module>
LambdaConfig.parse_obj({})
File "pydantic/main.py", line 572, in pydantic.main.BaseModel.parse_obj
File "pydantic/main.py", line 400, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 7 validation errors for LambdaConfig
blocking-features
field required (type=value_error.missing)
Lambda
field required (type=value_error.missing)
bf-len
field required (type=value_error.missing)
num-hash-funcs
field required (type=value_error.missing)
K
field required (type=value_error.missing)
input-clks
field required (type=value_error.missing)
random_state
field required (type=value_error.missing)
Or closer:
LambdaConfig.parse_obj({'blocking-features': [], 'Lambda':2, 'bf-len': 'invalid', 'K': 2, 'input-clks': False, 'random_state': 0, 'num-hash-funcs': 20})
Traceback (most recent call last):
File "/home/brian/.cache/pypoetry/virtualenvs/blocklib-CJeb9hJb-py3.9/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3343, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-5-9dd1572df494>", line 1, in <module>
LambdaConfig.parse_obj({'blocking-features': [], 'Lambda':2, 'bf-len': 'invalid', 'K': 2, 'input-clks': False, 'random_state': 0, 'num-hash-funcs': 20})
File "pydantic/main.py", line 572, in pydantic.main.BaseModel.parse_obj
File "pydantic/main.py", line 400, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for LambdaConfig
bf-len
value is not a valid integer (type=type_error.integer)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is anyone actually using this outside of blocklib? Let's make a version 1.. and break things. Viva la revolution!
# Validate blocking schema with pydantic | ||
# Note we already know the config contains a type so we could | ||
# directly create a PSig or LambdaFold type | ||
return BlockingSchemaModel.parse_obj(config) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do you parse it twice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question, I thought it was justified but on investigation I no longer think it is.
My logic was something like this:
The first one is using a BlockingSchemaBaseModel
which applies the same level of validation that was already in the library (does it contain the expected top level keys) giving a decent error if the version is missing or if the type is unsupported by this version of blocklib
:
BlockingSchemaBaseModel.parse_obj({'version': 1, 'config': {}, 'type': 'PSIG'})
Traceback (most recent call last):
File "/home/brian/.cache/pypoetry/virtualenvs/blocklib-CJeb9hJb-py3.9/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3343, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-11-e40ecbadccd2>", line 1, in <module>
BlockingSchemaBaseModel.parse_obj({'version': 1, 'config': {}, 'type': 'PSIG'})
File "pydantic/main.py", line 572, in pydantic.main.BaseModel.parse_obj
File "pydantic/main.py", line 400, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for BlockingSchemaBaseModel
type
value is not a valid enumeration member; permitted: 'p-sig', 'lambda-fold' (type=type_error.enum; enum_values=[<BlockingSchemaTypes.psig: 'p-sig'>, <BlockingSchemaTypes.lambdafold: 'lambda-fold'>])
I had thought that the BlockingSchemaModel
gave lots of validation errors that might be confusing if the top level was wrong, I think behaviour has improved with the config_gen "hack". The current output from validating a poorly configured blocking_schema with the full pydantic model indeed seems acceptable:
BlockingSchemaModel.parse_obj({'version': 1, 'config': {}, 'type': 'PSIG'})
Traceback (most recent call last):
File "/home/brian/.cache/pypoetry/virtualenvs/blocklib-CJeb9hJb-py3.9/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3343, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-13-036a72b586b9>", line 1, in <module>
BlockingSchemaModel.parse_obj({'version': 1, 'config': {}, 'type': 'PSIG'})
File "pydantic/main.py", line 572, in pydantic.main.BaseModel.parse_obj
File "pydantic/main.py", line 400, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 2 validation errors for BlockingSchemaModel
type
value is not a valid enumeration member; permitted: 'p-sig', 'lambda-fold' (type=type_error.enum; enum_values=[<BlockingSchemaTypes.psig: 'p-sig'>, <BlockingSchemaTypes.lambdafold: 'lambda-fold'>])
config
Unsupported blocking type (type=value_error)
So there are two errors where one would be sufficient, but I'll go ahead and remove the base model.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in b365386
description='Input data is CLK rather than PII') | ||
random_state: int | ||
|
||
#random_state: Optional[int] = Field(None, alias='random-state') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need this line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's just say the blocking schema isn't the most consistent schema I've ever come across. I had this to map the hypenated random-state
to the more Python friendly underscore separated random_state
but it turns out the schema mixed and matched:
I'm happy to include it (or change them all to use underscores), but note it would be a breaking change. So thought I'd just leave it for now, I'll open an issue - #142
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cough signatureSpecs
camelCase is in there too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we are not the only ones mixing the styles...
We should avoid kebab-case though, as it cannot be accessed easily by JavaScript.
I guess we should defined a new version of the schema that unifies the formatting. There is also a mix of upper and lower case.
If I read that correctly, then pydantic would let us build in backwards compatibility with the alias
thingy? Sweet.
Yeah the tour de force of pydantic was supposed to me proudly showing that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check the test coverage.
""" | ||
:param data: list of tuples E.g. ('0', 'Kenneth Bain', '1964/06/17', 'M') | ||
:param signature_config: | ||
:param blocking_schema: | ||
A description of how the signatures should be generated. | ||
Schema for the signature config is found in | ||
``docs/schema/signature-config-schema.json`` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's not a valid link.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turns out the docstrings are not rendered anywhere in sphinx... I'll put together a very basic python api page for now.
d4a96d1
to
820eda8
Compare
Includes updating the blocking tutorial. Out with the old, in with the new. Replaced setup.py and requirements.txt with pyproject.toml Updates azure pipeline to use poetry.
f1035ab
to
f7b6336
Compare
Use PPRLIndexConfig type Simplify validation by removing BlockingSchemaBaseModel
Update deps before merge
d3a3669
to
b53477d
Compare
Adds pydantic models for blocking schemas and carries out much stricter validation of passed in blocking schemas loaded from JSON.
Probably the most dubious part is the
BlockingSchemaModel
which uses a pydanticValidator
onconfig
to discriminate which blocking type gets used based on thetype
value. This will get cleaner after pydantic/pydantic#619 is resolved. The error messages for thePSigSignatureModel
should also improve once pydantic is told how to discriminate between the Union of possible types.Closes #12
Closes #137