
Feature validate blocking schemas #140

Merged
merged 7 commits into master on Mar 18, 2021

Conversation

@hardbyte (Collaborator) commented Mar 10, 2021

Adds pydantic models for blocking schemas and carries out much stricter validation of passed-in blocking schemas loaded from JSON.

Probably the most dubious part is the BlockingSchemaModel, which uses a pydantic validator on config to discriminate which blocking type gets used based on the type value. This will get cleaner after pydantic/pydantic#619 is resolved. The error messages for the PSigSignatureModel should also improve once pydantic is told how to discriminate between the Union of possible types.
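The discrimination that validator performs can be sketched in plain Python (pydantic is left out for brevity; the config classes here are hypothetical stand-ins, and only the 'p-sig' / 'lambda-fold' type values come from the real schema):

```python
from enum import Enum
from typing import Any, Dict


class BlockingSchemaTypes(str, Enum):
    psig = "p-sig"
    lambdafold = "lambda-fold"


# Hypothetical stand-ins for the concrete pydantic config models.
class PSigConfig:
    def __init__(self, raw: Dict[str, Any]):
        self.raw = raw


class LambdaConfig:
    def __init__(self, raw: Dict[str, Any]):
        self.raw = raw


CONFIG_BY_TYPE = {
    BlockingSchemaTypes.psig: PSigConfig,
    BlockingSchemaTypes.lambdafold: LambdaConfig,
}


def discriminate_config(schema_type: str, config: Dict[str, Any]):
    """Pick the concrete config class based on the schema's `type` value."""
    try:
        blocking_type = BlockingSchemaTypes(schema_type)
    except ValueError:
        raise ValueError(f"Unsupported blocking type: {schema_type!r}")
    return CONFIG_BY_TYPE[blocking_type](config)
```

In the real model this dispatch happens inside a pydantic validator on the config field, so the caller only ever sees a fully validated model.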

Closes #12
Closes #137

@hardbyte hardbyte requested a review from wilko77 March 10, 2021 11:20
@codecov bot commented Mar 10, 2021

Codecov Report

Merging #140 (b53477d) into master (a100906) will decrease coverage by 0.69%.
The diff coverage is 98.80%.

@@            Coverage Diff             @@
##           master     #140      +/-   ##
==========================================
- Coverage   98.33%   97.63%   -0.70%     
==========================================
  Files          14       18       +4     
  Lines         539      634      +95     
==========================================
+ Hits          530      619      +89     
- Misses          9       15       +6     

@hardbyte hardbyte force-pushed the feature-validate-blocking-schemas branch from 6a1f0ee to 0372092 Compare March 14, 2021 21:24
@wilko77 (Collaborator) left a comment


pydantic looks really nice.
Given that we define and validate user-provided configurations, what's best practice to provide documentation for the users? Is there some tooling we can use?

@@ -72,14 +71,14 @@ def generate_candidate_blocks(data: Sequence[Tuple[str, ...]], signature_config:

"""
# validate config of blocking
-validate_signature_config(signature_config)
+blocking_schema = validate_signature_config(signature_config)
Collaborator

This line is confusing. Is signature_config a blocking schema? Why don't we name it that then? Also, the function is called validate but it returns the schema as well? How about naming it read_and_validate_blocking_schema, or something along those lines?

@hardbyte (Collaborator, Author) commented Mar 15, 2021

Agree the names could be better.

generate_candidate_blocks currently takes a Python dict version of the blocking schema, assumed to have been sourced from a JSON file. We validate that schema, getting back a BlockingSchemaModel, which is an instance of a Pydantic model, very similar to a dataclass.

read_and_validate_blocking_schema sounds like it reads a file to me, I'll rename to validate_blocking_schema for now.

I'll change the argument to blocking_schema instead of signature_config, and make it more explicit with blocking_model for the pydantic model.

Changed in e87349b

I'd actually like to make generate_candidate_blocks take a Union[BlockingSchemaModel, Dict] - thoughts?

Collaborator

I find it a bit smelly to use the Union[BlockingSchemaModel, Dict] as an argument for generate_candidate_blocks. Instead, I would only accept BlockingSchemaModel and make sure that BlockingSchemaModel has appropriate helper functions to read the model from Dict.
In fact, I would make all the internal functions only accept the pydantic models. If I am not mistaken, then pydantic offers helper functions to generate models from dicts/json.
It comes down to a separation of concerns thing.
Yes, that breaks the current api a little, but we would end up with something a lot cleaner.
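A minimal sketch of that separation of concerns, assuming a from_dict helper (pydantic's Model.parse_obj would play this role for the real models; all names below are hypothetical stand-ins):

```python
from typing import Any, Dict, List, Tuple


class BlockingSchemaModel:
    """Hypothetical stand-in for the real pydantic model."""

    def __init__(self, version: int, schema_type: str, config: Dict[str, Any]):
        self.version = version
        self.type = schema_type
        self.config = config

    @classmethod
    def from_dict(cls, raw: Dict[str, Any]) -> "BlockingSchemaModel":
        # pydantic's Model.parse_obj(raw) plays this role for the real model.
        missing = {"version", "type", "config"} - raw.keys()
        if missing:
            raise ValueError(f"Missing top-level keys: {sorted(missing)}")
        return cls(raw["version"], raw["type"], raw["config"])


def generate_candidate_blocks(data: List[Tuple[str, ...]],
                              schema: BlockingSchemaModel) -> None:
    # Internal code only ever sees a validated model, never a raw dict;
    # callers convert up front with BlockingSchemaModel.from_dict(...).
    assert isinstance(schema, BlockingSchemaModel)
```

The design win is that a malformed schema fails at the conversion boundary, before any data is loaded or processed.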

Collaborator Author

In fact, I would make all the internal functions only accept the pydantic models.

I'd be very much in favor of that. But surely generate_candidate_blocks isn't an internal function? anonlink-client uses it directly, for example.
Happy to implement whatever you want here, though: break the external API and force blocklib users to pass blocklib's pydantic models for configuration? Please confirm before I do that.

If I am not mistaken, then pydantic offers helper functions to generate models from dicts/json.
Yes, of course: Model.parse_obj(dictionary_data)

Collaborator Author

What if we outsourced the magic conversion to https://pydantic-docs.helpmanual.io/usage/validation_decorator/

That (provisional api) looks like it might validate and convert for us.

Collaborator

I don't know; the magic converter looks nice, but it's still beta and also feels a bit like a hack.
As for generate_candidate_blocks and anonlink-client: you can see that it loads the config first from JSON, then does some acrobatics to extract a config value, loads the whole dataset into memory, and then calls blocklib. If it had properly parsed the config, the access would have been cleaner and it would fail early if the config is wrong.

I'm still leaning towards only allowing the pydantic models; however, before we make breaking changes, we should review the whole code base and make it consistent. This seems more like a version 1.0 kind of activity. :) Thus for now, do your non-breaking change.

blocklib/pprlindex.py (outdated review thread, resolved)
blocklib/pprlindex.py (outdated review thread, resolved)
@@ -17,27 +17,26 @@ class PPRLIndexLambdaFold(PPRLIndex):
This class includes an implementation of Lambda-fold redundant blocking method.
"""

-def __init__(self, config: Dict):
+def __init__(self, config: Union[LambdaConfig, Dict]):
Collaborator

Wouldn't it be cleaner if we only allowed LambdaConfig things? I'm not a big fan of the instance check with casting that follows. That smells a bit fishy.

Collaborator Author

That wouldn't be compatible with the old API, so I'd suggest a deprecation period: continue accepting dict config for a version or two.
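Such a deprecation period could look like this sketch (pydantic-free; LambdaConfig here is a hypothetical stand-in, and the real code would convert the dict with LambdaConfig.parse_obj):

```python
import warnings
from typing import Any, Dict, Union


class LambdaConfig:
    """Hypothetical stand-in for the pydantic LambdaConfig model."""

    def __init__(self, raw: Dict[str, Any]):
        self.raw = raw


class PPRLIndexLambdaFold:
    def __init__(self, config: Union[LambdaConfig, Dict[str, Any]]):
        if isinstance(config, dict):
            # Deprecated path: keep accepting plain dicts for a version or two,
            # but nudge callers toward the typed config.
            warnings.warn(
                "Passing a dict is deprecated; pass a LambdaConfig instead.",
                DeprecationWarning,
            )
            config = LambdaConfig(config)  # real code: LambdaConfig.parse_obj(config)
        self.config = config
```

Internally everything then works against the typed config, and the dict branch can simply be deleted when the deprecation window closes.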

Collaborator Author

The parse_obj is nice in that the errors are really on point.

from blocklib.validation import LambdaConfig
LambdaConfig.parse_obj({})
Traceback (most recent call last):
  File "/home/brian/.cache/pypoetry/virtualenvs/blocklib-CJeb9hJb-py3.9/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3343, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-3-a08a6d544657>", line 1, in <module>
    LambdaConfig.parse_obj({})
  File "pydantic/main.py", line 572, in pydantic.main.BaseModel.parse_obj
  File "pydantic/main.py", line 400, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 7 validation errors for LambdaConfig
blocking-features
  field required (type=value_error.missing)
Lambda
  field required (type=value_error.missing)
bf-len
  field required (type=value_error.missing)
num-hash-funcs
  field required (type=value_error.missing)
K
  field required (type=value_error.missing)
input-clks
  field required (type=value_error.missing)
random_state
  field required (type=value_error.missing)

Or closer:

LambdaConfig.parse_obj({'blocking-features': [], 'Lambda':2, 'bf-len': 'invalid', 'K': 2, 'input-clks': False, 'random_state': 0, 'num-hash-funcs': 20})
Traceback (most recent call last):
  File "/home/brian/.cache/pypoetry/virtualenvs/blocklib-CJeb9hJb-py3.9/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3343, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-5-9dd1572df494>", line 1, in <module>
    LambdaConfig.parse_obj({'blocking-features': [], 'Lambda':2, 'bf-len': 'invalid', 'K': 2, 'input-clks': False, 'random_state': 0, 'num-hash-funcs': 20})
  File "pydantic/main.py", line 572, in pydantic.main.BaseModel.parse_obj
  File "pydantic/main.py", line 400, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for LambdaConfig
bf-len
  value is not a valid integer (type=type_error.integer)

Collaborator

Is anyone actually using this outside of blocklib? Let's make a version 1.0 and break things. Viva la revolution!

blocklib/signature_generator.py (outdated review thread, resolved)
# Validate blocking schema with pydantic
# Note we already know the config contains a type so we could
# directly create a PSig or LambdaFold type
return BlockingSchemaModel.parse_obj(config)
Collaborator

Why do you parse it twice?

Collaborator Author

Good question, I thought it was justified but on investigation I no longer think it is.

My logic was something like this:

The first one uses a BlockingSchemaBaseModel, which applies the same level of validation that was already in the library (does it contain the expected top-level keys?), giving a decent error if the version is missing or if the type is unsupported by this version of blocklib:

BlockingSchemaBaseModel.parse_obj({'version': 1, 'config': {}, 'type': 'PSIG'})
Traceback (most recent call last):
  File "/home/brian/.cache/pypoetry/virtualenvs/blocklib-CJeb9hJb-py3.9/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3343, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-11-e40ecbadccd2>", line 1, in <module>
    BlockingSchemaBaseModel.parse_obj({'version': 1, 'config': {}, 'type': 'PSIG'})
  File "pydantic/main.py", line 572, in pydantic.main.BaseModel.parse_obj
  File "pydantic/main.py", line 400, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for BlockingSchemaBaseModel
type
  value is not a valid enumeration member; permitted: 'p-sig', 'lambda-fold' (type=type_error.enum; enum_values=[<BlockingSchemaTypes.psig: 'p-sig'>, <BlockingSchemaTypes.lambdafold: 'lambda-fold'>])

I had thought that the BlockingSchemaModel gave lots of validation errors that might be confusing if the top level was wrong, I think behaviour has improved with the config_gen "hack". The current output from validating a poorly configured blocking_schema with the full pydantic model indeed seems acceptable:

BlockingSchemaModel.parse_obj({'version': 1, 'config': {}, 'type': 'PSIG'})
Traceback (most recent call last):
  File "/home/brian/.cache/pypoetry/virtualenvs/blocklib-CJeb9hJb-py3.9/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3343, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-13-036a72b586b9>", line 1, in <module>
    BlockingSchemaModel.parse_obj({'version': 1, 'config': {}, 'type': 'PSIG'})
  File "pydantic/main.py", line 572, in pydantic.main.BaseModel.parse_obj
  File "pydantic/main.py", line 400, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 2 validation errors for BlockingSchemaModel
type
  value is not a valid enumeration member; permitted: 'p-sig', 'lambda-fold' (type=type_error.enum; enum_values=[<BlockingSchemaTypes.psig: 'p-sig'>, <BlockingSchemaTypes.lambdafold: 'lambda-fold'>])
config
  Unsupported blocking type (type=value_error)

So there are two errors where one would be sufficient, but I'll go ahead and remove the base model.

Collaborator Author

Done in b365386

blocklib/validation/__init__.py (outdated review thread, resolved)
description='Input data is CLK rather than PII')
random_state: int

#random_state: Optional[int] = Field(None, alias='random-state')
Collaborator

Do we need this line?

Collaborator Author

Let's just say the blocking schema isn't the most consistent schema I've ever come across. I had this to map the hyphenated random-state to the more Python-friendly underscore-separated random_state, but it turns out the schema mixed and matched:

[screenshot: blocking schema examples mixing hyphenated and underscore-separated key names]

I'm happy to include it (or change them all to use underscores), but note it would be a breaking change. So thought I'd just leave it for now, I'll open an issue - #142
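The alias mechanism under discussion amounts to mapping the schema's key spellings onto Python attribute names; a pydantic-free sketch of that normalisation (the alias table below is illustrative, not the library's actual mapping):

```python
from typing import Any, Dict

# Illustrative alias table (schema spelling -> Python attribute name);
# not the library's actual mapping.
ALIASES = {
    "random-state": "random_state",
    "blocking-features": "blocking_features",
    "signatureSpecs": "signature_specs",
}


def normalise_keys(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Rewrite known aliased keys to their snake_case equivalents."""
    return {ALIASES.get(key, key): value for key, value in raw.items()}
```

With pydantic the same effect is achieved declaratively, e.g. Field(alias='random-state'), which keeps the JSON spelling accepted while exposing a snake_case attribute.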

Collaborator Author

Cough signatureSpecs camelCase is in there too

Collaborator

We are not the only ones mixing the styles...
We should avoid kebab-case though, as it cannot be accessed easily in JavaScript.
I guess we should define a new version of the schema that unifies the formatting. There is also a mix of upper and lower case.
If I read that correctly, then pydantic would let us build in backwards compatibility with the alias thingy? Sweet.

pyproject.toml (outdated review thread, resolved)
@hardbyte hardbyte requested a review from wilko77 March 15, 2021 02:31
@hardbyte (Collaborator, Author) commented

Given that we define and validate user-provided configurations, what's best practice to provide documentation for the users? Is there some tooling we can use?

Yeah, the tour de force of pydantic was supposed to be me proudly showing that PSigConfig.schema_json() or even BlockingSchemaModel.schema_json() would produce fully valid JSON schemas 🥳. Unfortunately there is something not working with the mix of Literal and Enum types, so that doesn't work right now. 👎🏼
I suspect that there will be a minor refactoring required after Pydantic officially supports discriminated union types, and then we will get full json schema support. The library can then expose that as documentation.

@wilko77 (Collaborator) left a comment

Please check the test coverage.

"""
:param data: list of tuples E.g. ('0', 'Kenneth Bain', '1964/06/17', 'M')
-:param signature_config:
+:param blocking_schema:
A description of how the signatures should be generated.
Schema for the signature config is found in
``docs/schema/signature-config-schema.json``
Collaborator

That's not a valid link.

Collaborator Author

Turns out the docstrings are not rendered anywhere in Sphinx... I'll put together a very basic Python API page for now.

@hardbyte hardbyte force-pushed the feature-validate-blocking-schemas branch 3 times, most recently from d4a96d1 to 820eda8 Compare March 18, 2021 04:24
hardbyte and others added 2 commits March 18, 2021 17:33
Add test exporting blocking schemas objects back to json

Closes #12
Closes #137
Includes updating the blocking tutorial.

Out with the old, in with the new.

Replaced setup.py and requirements.txt with pyproject.toml
Updates azure pipeline to use poetry.
@hardbyte hardbyte force-pushed the feature-validate-blocking-schemas branch 4 times, most recently from f1035ab to f7b6336 Compare March 18, 2021 05:23
@hardbyte hardbyte force-pushed the feature-validate-blocking-schemas branch from d3a3669 to b53477d Compare March 18, 2021 05:37
@hardbyte hardbyte merged commit d7914df into master Mar 18, 2021
@hardbyte hardbyte deleted the feature-validate-blocking-schemas branch March 18, 2021 06:10
Successfully merging this pull request may close these issues:
Serialize to a blocking schema
Build validation schema file for each blocking method