Feature validate blocking schemas #140

hardbyte · 2021-03-10T11:20:19Z

Adds pydantic models for blocking schemas and carries out much stricter validation of passed in blocking schemas loaded from JSON.

Probably the most dubious part is the BlockingSchemaModel which uses a pydantic Validator on config to discriminate which blocking type gets used based on the type value. This will get cleaner after pydantic/pydantic#619 is resolved. The error messages for the PSigSignatureModel should also improve once pydantic is told how to discriminate between the Union of possible types.
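To make the "discriminate on the type value" idea concrete, here is a minimal sketch in plain Python. It uses dataclasses rather than pydantic so it runs without dependencies, and it is not the validator from this PR. The class names `PSigConfigSketch` and `LambdaFoldConfigSketch`, the mapping `CONFIG_BY_TYPE`, and the helper `build_config` are hypothetical; only the type values `p-sig` and `lambda-fold` come from this PR.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class PSigConfigSketch:
    """Hypothetical stand-in for the p-sig config model."""
    raw: Dict[str, Any]


@dataclass
class LambdaFoldConfigSketch:
    """Hypothetical stand-in for the lambda-fold config model."""
    raw: Dict[str, Any]


# Maps the schema's `type` value to the class that should hold `config`.
CONFIG_BY_TYPE: Dict[str, Callable[[Dict[str, Any]], Any]] = {
    "p-sig": PSigConfigSketch,
    "lambda-fold": LambdaFoldConfigSketch,
}


def build_config(blocking_schema: Dict[str, Any]) -> Any:
    """Pick the config class based on the top-level `type` value."""
    schema_type = blocking_schema.get("type")
    try:
        config_cls = CONFIG_BY_TYPE[schema_type]
    except KeyError:
        raise ValueError(f"Unsupported blocking type: {schema_type!r}") from None
    return config_cls(blocking_schema.get("config", {}))
```

A pydantic validator on `config` would do the same lookup, but inside the model and with field-level validation of the chosen config class.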

Closes #12
Closes #137

codecov · 2021-03-10T11:26:17Z

Codecov Report

Merging #140 (b53477d) into master (a100906) will decrease coverage by 0.69%.
The diff coverage is 98.80%.

@@            Coverage Diff             @@
##           master     #140      +/-   ##
==========================================
- Coverage   98.33%   97.63%   -0.70%     
==========================================
  Files          14       18       +4     
  Lines         539      634      +95     
==========================================
+ Hits          530      619      +89     
- Misses          9       15       +6

wilko77

pydantic looks really nice.
Given that we define and validate user-provided configurations, what's best practice to provide documentation for the users? Is there some tooling we can use?

wilko77 · 2021-03-15T00:44:54Z

blocklib/candidate_blocks_generator.py

@@ -72,14 +71,14 @@ def generate_candidate_blocks(data: Sequence[Tuple[str, ...]], signature_config:

    """
    # validate config of blocking
-    validate_signature_config(signature_config)
+    blocking_schema = validate_signature_config(signature_config)


This line is confusing. Is signature_config a blocking schema? Why don't we name it that, then? Also, the function is called validate but returns the schema as well? How about naming it read_and_validate_blocking_schema, or something along those lines?

Agree the names could be better.

generate_candidate_blocks currently takes a Python dict version of the blocking schema, assumed to have been sourced from a JSON file. We validate that schema, getting back a BlockingSchemaModel, which is an instance of a pydantic model, very similar to a dataclass.

read_and_validate_blocking_schema sounds like it reads a file to me, I'll rename to validate_blocking_schema for now.

I'll change the argument to blocking_schema instead of signature_config, and make it more explicit with blocking_model for the pydantic model.

Changed in e87349b

I'd actually like to make generate_candidate_blocks take a Union[BlockingSchemaModel, Dict] - thoughts?

I find it a bit smelly to use the Union[BlockingSchemaModel, Dict] as an argument for generate_candidate_blocks. Instead, I would only accept BlockingSchemaModel and make sure that BlockingSchemaModel has appropriate helper functions to read the model from Dict.
In fact, I would make all the internal functions only accept the pydantic models. If I am not mistaken, then pydantic offers helper functions to generate models from dicts/json.
It comes down to a separation of concerns thing.
Yes, that breaks the current api a little, but we would end up with something a lot cleaner.
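The separation of concerns suggested above could look roughly like the following. This is a plain-Python sketch under stated assumptions, not blocklib code: `BlockingSchemaSketch`, its `from_dict` helper, and `generate_candidate_blocks_sketch` are hypothetical names, and the only details taken from this PR are the top-level keys `version`, `type` and `config`.

```python
from dataclasses import dataclass
from typing import Any, Dict, Sequence, Tuple


@dataclass(frozen=True)
class BlockingSchemaSketch:
    """Hypothetical stand-in for BlockingSchemaModel."""
    version: int
    type: str
    config: Dict[str, Any]

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "BlockingSchemaSketch":
        """Parse once, at the boundary; fail early on missing keys."""
        missing = [key for key in ("version", "type", "config") if key not in data]
        if missing:
            raise ValueError(f"blocking schema is missing keys: {missing}")
        return cls(version=data["version"], type=data["type"], config=data["config"])


def generate_candidate_blocks_sketch(
    data: Sequence[Tuple[str, ...]], blocking_schema: BlockingSchemaSketch
) -> str:
    """Internal code only ever sees the parsed model, never a raw dict."""
    return f"{len(data)} records, {blocking_schema.type} blocking"
```

Callers holding a dict would call `BlockingSchemaSketch.from_dict(...)` themselves, so a bad config fails before any data is loaded.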

> In fact, I would make all the internal functions only accept the pydantic models.

I'd be very much in favor of that. But surely generate_candidate_blocks isn't an internal function? Anonlink client uses it directly for example.
Happy to implement whatever you want here though: break the external API and force blocklib users to pass blocklib's pydantic models for configuration? Please confirm before I do that.

> If I am not mistaken, then pydantic offers helper functions to generate models from dicts/json.

Yes, of course: Model.parse_obj(dictionary_data)

What if we outsourced the magic conversion to https://pydantic-docs.helpmanual.io/usage/validation_decorator/

That (provisional api) looks like it might validate and convert for us.

I don't know. The magic converter looks nice, but it's still beta and also feels a bit like a hack.
As for generate_candidate_blocks and anonlink-client: you can see that it first loads the config from JSON, then does some acrobatics to extract a config value, loads the whole dataset into memory, and only then calls blocklib. If it had properly parsed the config up front, the access would have been cleaner and it would fail early if the config is wrong.

I'm still leaning towards only allowing the pydantic models. However, before we make breaking changes, we should review the whole code base and make it consistent. This seems more like a version 1.0 kind of activity. :) Thus, for now, do your non-breaking change.

blocklib/pprlindex.py

wilko77 · 2021-03-15T00:53:26Z

blocklib/pprllambdafold.py

@@ -17,27 +17,26 @@ class PPRLIndexLambdaFold(PPRLIndex):
        This class includes an implementation of Lambda-fold redundant blocking method.
    """

-    def __init__(self, config: Dict):
+    def __init__(self, config: Union[LambdaConfig, Dict]):


Wouldn't it be cleaner if we only allow LambdaConfig things? I'm not a big fan of the following instance check with casting. That smells a bit fishy.

That wouldn't be compatible with the old API, so I'd suggest a deprecation period that continues to accept a dict config for a version or two.
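A deprecation period of that kind is commonly implemented with the standard library's `warnings` module. The following is only a sketch: `LambdaConfigSketch` and `coerce_config` are hypothetical names, and the stand-in class stores raw values without the validation the real LambdaConfig performs.

```python
import warnings
from dataclasses import dataclass
from typing import Any, Dict, Union


@dataclass
class LambdaConfigSketch:
    """Hypothetical stand-in for LambdaConfig; holds the raw values only."""
    values: Dict[str, Any]


def coerce_config(config: Union[LambdaConfigSketch, Dict[str, Any]]) -> LambdaConfigSketch:
    """Accept a dict for now, but warn callers that this will go away."""
    if isinstance(config, LambdaConfigSketch):
        return config
    warnings.warn(
        "Passing a dict config is deprecated; pass a config model instead.",
        DeprecationWarning,
        stacklevel=2,
    )
    return LambdaConfigSketch(values=dict(config))
```

With `stacklevel=2` the warning points at the caller's line rather than at `coerce_config` itself, which makes the deprecated call sites easier to find.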

The parse_obj is nice in that the errors are really on point.

from blocklib.validation import LambdaConfig
LambdaConfig.parse_obj({})
Traceback (most recent call last):
  File "/home/brian/.cache/pypoetry/virtualenvs/blocklib-CJeb9hJb-py3.9/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3343, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-3-a08a6d544657>", line 1, in <module>
    LambdaConfig.parse_obj({})
  File "pydantic/main.py", line 572, in pydantic.main.BaseModel.parse_obj
  File "pydantic/main.py", line 400, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 7 validation errors for LambdaConfig
blocking-features
  field required (type=value_error.missing)
Lambda
  field required (type=value_error.missing)
bf-len
  field required (type=value_error.missing)
num-hash-funcs
  field required (type=value_error.missing)
K
  field required (type=value_error.missing)
input-clks
  field required (type=value_error.missing)
random_state
  field required (type=value_error.missing)

Or closer:

LambdaConfig.parse_obj({'blocking-features': [], 'Lambda':2, 'bf-len': 'invalid', 'K': 2, 'input-clks': False, 'random_state': 0, 'num-hash-funcs': 20})
Traceback (most recent call last):
  File "/home/brian/.cache/pypoetry/virtualenvs/blocklib-CJeb9hJb-py3.9/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3343, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-5-9dd1572df494>", line 1, in <module>
    LambdaConfig.parse_obj({'blocking-features': [], 'Lambda':2, 'bf-len': 'invalid', 'K': 2, 'input-clks': False, 'random_state': 0, 'num-hash-funcs': 20})
  File "pydantic/main.py", line 572, in pydantic.main.BaseModel.parse_obj
  File "pydantic/main.py", line 400, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for LambdaConfig
bf-len
  value is not a valid integer (type=type_error.integer)

Is anyone actually using this outside of blocklib? Let's make a version 1.. and break things. Viva la revolution!

blocklib/signature_generator.py

wilko77 · 2021-03-15T01:08:47Z

blocklib/validation/__init__.py

+    # Validate blocking schema with pydantic
+    # Note we already know the config contains a type so we could
+    # directly create a PSig or LambdaFold type
+    return BlockingSchemaModel.parse_obj(config)


why do you parse it twice?

Good question, I thought it was justified but on investigation I no longer think it is.

My logic was something like this:

The first one is using a BlockingSchemaBaseModel which applies the same level of validation that was already in the library (does it contain the expected top level keys) giving a decent error if the version is missing or if the type is unsupported by this version of blocklib:

BlockingSchemaBaseModel.parse_obj({'version': 1, 'config': {}, 'type': 'PSIG'})
Traceback (most recent call last):
  File "/home/brian/.cache/pypoetry/virtualenvs/blocklib-CJeb9hJb-py3.9/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3343, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-11-e40ecbadccd2>", line 1, in <module>
    BlockingSchemaBaseModel.parse_obj({'version': 1, 'config': {}, 'type': 'PSIG'})
  File "pydantic/main.py", line 572, in pydantic.main.BaseModel.parse_obj
  File "pydantic/main.py", line 400, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for BlockingSchemaBaseModel
type
  value is not a valid enumeration member; permitted: 'p-sig', 'lambda-fold' (type=type_error.enum; enum_values=[<BlockingSchemaTypes.psig: 'p-sig'>, <BlockingSchemaTypes.lambdafold: 'lambda-fold'>])

I had thought that the BlockingSchemaModel gave lots of validation errors that might be confusing if the top level was wrong, but I think the behaviour has improved with the config_gen "hack". The current output from validating a poorly configured blocking_schema with the full pydantic model does seem acceptable:

BlockingSchemaModel.parse_obj({'version': 1, 'config': {}, 'type': 'PSIG'})
Traceback (most recent call last):
  File "/home/brian/.cache/pypoetry/virtualenvs/blocklib-CJeb9hJb-py3.9/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3343, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-13-036a72b586b9>", line 1, in <module>
    BlockingSchemaModel.parse_obj({'version': 1, 'config': {}, 'type': 'PSIG'})
  File "pydantic/main.py", line 572, in pydantic.main.BaseModel.parse_obj
  File "pydantic/main.py", line 400, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 2 validation errors for BlockingSchemaModel
type
  value is not a valid enumeration member; permitted: 'p-sig', 'lambda-fold' (type=type_error.enum; enum_values=[<BlockingSchemaTypes.psig: 'p-sig'>, <BlockingSchemaTypes.lambdafold: 'lambda-fold'>])
config
  Unsupported blocking type (type=value_error)

So there are two errors where one would be sufficient, but I'll go ahead and remove the base model.

Done in b365386
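For reference, the "one error would be sufficient" behaviour for the `type` field can be illustrated with the standard library's `Enum` alone. This is a sketch, not the code in b365386: `BlockingSchemaTypesSketch` and `parse_type` are hypothetical names, while the member names and values mirror those printed in the tracebacks in this thread.

```python
from enum import Enum


class BlockingSchemaTypesSketch(Enum):
    # Member names and values mirror those printed in the tracebacks.
    psig = "p-sig"
    lambdafold = "lambda-fold"


def parse_type(value: str) -> BlockingSchemaTypesSketch:
    """Return the enum member for `value`, or raise a single clear error."""
    try:
        return BlockingSchemaTypesSketch(value)
    except ValueError:
        permitted = ", ".join(repr(member.value) for member in BlockingSchemaTypesSketch)
        raise ValueError(f"{value!r} is not a valid type; permitted: {permitted}") from None
```

Lookup is by value and case-sensitive, which is why `'PSIG'` is rejected while `'p-sig'` is accepted.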

blocklib/validation/__init__.py

wilko77 · 2021-03-15T01:09:41Z

blocklib/validation/lambda_fold_validation.py

+                                  description='Input data is CLK rather than PII')
+    random_state: int
+
+    #random_state: Optional[int] = Field(None, alias='random-state')


do we need this line?

Let's just say the blocking schema isn't the most consistent schema I've ever come across. I had this to map the hyphenated random-state to the more Python-friendly, underscore-separated random_state, but it turns out the schema mixes and matches styles.

I'm happy to include it (or change them all to use underscores), but note it would be a breaking change. So thought I'd just leave it for now, I'll open an issue - #142

Cough signatureSpecs camelCase is in there too

we are not the only ones mixing the styles...
We should avoid kebab-case though, as it cannot be accessed easily by JavaScript.
I guess we should define a new version of the schema that unifies the formatting. There is also a mix of upper and lower case.
If I read that correctly, then pydantic would let us build in backwards compatibility with the alias thingy? Sweet.
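As a rough illustration of what unifying the key styles would involve, here is a plain-Python sketch that normalises kebab-case and camelCase keys to snake_case. It is not pydantic's alias mechanism, and `to_snake_case` and `normalise_keys` are hypothetical helpers; the example keys are the ones that appear in this thread.

```python
import re
from typing import Any, Dict


def to_snake_case(key: str) -> str:
    """Convert kebab-case or camelCase keys to snake_case."""
    key = key.replace("-", "_")
    # Insert an underscore at each lower-to-upper boundary, e.g. signatureSpecs.
    key = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", key)
    return key.lower()


def normalise_keys(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `config` with top-level keys in snake_case."""
    return {to_snake_case(key): value for key, value in config.items()}
```

One wrinkle worth noting: lowercasing `Lambda` yields `lambda`, a reserved word in Python, so a unified schema would probably need a different name for that field on the Python side.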

pyproject.toml

hardbyte · 2021-03-15T02:38:08Z

> Given that we define and validate user-provided configurations, what's best practice to provide documentation for the users? Is there some tooling we can use?

Yeah, the tour de force of pydantic was supposed to be me proudly showing that PSigConfig.schema_json() or even BlockingSchemaModel.schema_json() would produce fully valid JSON schemas 🥳. Unfortunately there is something not working with the mix of Literal and Enum types, so that doesn't work right now. 👎🏼
I suspect that there will be a minor refactoring required after Pydantic officially supports discriminated union types, and then we will get full json schema support. The library can then expose that as documentation.

wilko77

Please check the test coverage.

wilko77 · 2021-03-16T02:22:42Z

blocklib/candidate_blocks_generator.py

    """
    :param data: list of tuples E.g. ('0', 'Kenneth Bain', '1964/06/17', 'M')
-    :param signature_config:
+    :param blocking_schema:
        A description of how the signatures should be generated.
        Schema for the signature config is found in
        ``docs/schema/signature-config-schema.json``


that's not a valid link.

Turns out the docstrings are not rendered anywhere in sphinx... I'll put together a very basic python api page for now.

hardbyte requested a review from wilko77 March 10, 2021 11:20

hardbyte force-pushed the feature-validate-blocking-schemas branch from 6a1f0ee to 0372092 Compare March 14, 2021 21:24

wilko77 requested changes Mar 15, 2021

hardbyte requested a review from wilko77 March 15, 2021 02:31

wilko77 approved these changes Mar 16, 2021

hardbyte force-pushed the feature-validate-blocking-schemas branch 3 times, most recently from d4a96d1 to 820eda8 Compare March 18, 2021 04:24

hardbyte and others added 2 commits March 18, 2021 17:33

Adds pydantic models for blocking schemas.

f03c070

Add test exporting blocking schemas objects back to json Closes #12 Closes #137

Update docs and version bump

6154dc4

Includes updating the blocking tutorial. Out with the old, in with the new. Replaced setup.py and requirements.txt with pyproject.toml Updates azure pipeline to use poetry.

hardbyte force-pushed the feature-validate-blocking-schemas branch 4 times, most recently from f1035ab to f7b6336 Compare March 18, 2021 05:23

hardbyte and others added 5 commits March 18, 2021 18:37

Rename validate_signature_config -> validate_blocking_schema

813cff1

Use PPRLIndexConfig type Simplify validation by removing BlockingSchemaBaseModel

Set version to 0.1.8-dev

4ca3836

Rename validate_blocking_schema in notebook

ec785f2

Include docs generated from code

cf15b44

Include MacOS in github tests

b53477d

Update deps before merge

hardbyte force-pushed the feature-validate-blocking-schemas branch from d3a3669 to b53477d Compare March 18, 2021 05:37

hardbyte merged commit d7914df into master Mar 18, 2021

hardbyte deleted the feature-validate-blocking-schemas branch March 18, 2021 06:10
