Skip to content
This repository has been archived by the owner on Sep 3, 2024. It is now read-only.

Support consuming YAML files directly #93

Open
pinkwah opened this issue Oct 28, 2019 · 3 comments
Open

Support consuming YAML files directly #93

pinkwah opened this issue Oct 28, 2019 · 3 comments
Assignees
Labels
help wanted Extra attention is needed

Comments

@pinkwah
Copy link

pinkwah commented Oct 28, 2019

Currently, in order to read and validate YAML files, we first have to convert a the YAML file to a python dict. We lose a lot of useful information doing it this way. Instead, we should traverse the graph that PyYAML gives us using the yaml.SafeLoader class. This will let us do the following:

  • Check for duplicate keys in a YAML block (dict).
  • Give the position in the file where validation errors occurred.
@pinkwah pinkwah self-assigned this Oct 28, 2019
@pinkwah
Copy link
Author

pinkwah commented Oct 29, 2019

Proposed changes to ConfigSuite

Currently, it's not trivial to integrate popular configuration document formats
(YAML, JSON, XML, etc) with ConfigSuite. Internally, ConfigSuite uses native
Python types (dicts, lists, primitive scalars). This leads us to losing a lot of
useful information that we may want to keep in order to show better diagnostic
information when a document has an incorrect schema.

Solution 1: Non-integrated utility function/class

Proof of concept: #94

Here, we have a separate utility class called YamlProcessor. This class
safeloads a YAML file into a native Python structures, but also:

  • Raises a DuplicateKeyError if a duplicate key is found. (Which isn't possible
    with a simple yaml.safe_load)
  • Records all keys and their location in the document.

The latter is useful for when ConfigSuite has a non-empty list of errors,
since then we can map a key_path to its location in the YAML file.

YamlProcessor can relatively easily be extended to support loading multiple
documents, and also multiple document formats.

While this is simple to implement, this approach has several problems:

  • It requires the user to explicitly opt-in to functionality that arguably
    should be part of ConfigSuite by default. The user might expect
    ConfigSuite to handle errors like having duplicate keys, which will be
    surprising when it doesn't.
  • We still can't store information about whether a subgraph of the config was
    modified by a transform function. If there's a bug in the user's transform,
    it's be nice to show that an node was modified by it too.

Solution 2: Move ConfigSuite to a more abstract system

I propose, instead of using Python's dict, list and scalar primitives to
represent the internal configuration, we move to a graph-based system (currently
inspired by PyYAML's document representation). We define the following classes:

  • Node: generic node object with a sources (more on this later)
    property and the property value.
  • ScalarNode: Has a scalar value, eg. int, str, etc.
  • ListNode: value is a list of Node elements.
  • DictNode: value is a list of (str, Node) 2-tuple elements (or a Python dict of str -> Node).

Each Node object has a sources property, that represents the sources
(layers) that "touched" the node, in order from first to last. A singular Source object contains at least the
following properties: filename, line_no, col_no. Printing a Source
object should result in output similar to this:

    in file <stdin>, line 6, column 5:
  - foo: Hello
    ^

This can then be combined to create an error message like this:

Error: Missing required key 'foo'. Expected to find it in one of the following locations:

    in file 'main_config.yaml', line 156, column 5
  - foo:
    ^
  
    in file 'sub_config.yaml', line 2, column 5
  - foo:
    ^
  
    transformed by Python function '_transform_bar' in file '/usr/lib64/python2.7/site-packages/foobar/schema.py', line 16340

We can implement the functionality for loading Yaml document in several ways:

Way 1: As part of ConfigSuite class

Relatively simple and understandable. But it adds a hard-dependency on yaml
and difficult for the user to extend. Code example:

# ConfigSuite checks schema for validity, nothing else (`self.assert_valid_schema`)
suite = ConfigSuite(schema)

# We add our layers (documents), validating each layer separately
suite.add({ 'foo': 'bar' })

suite.add_yaml('''
foo: baz
natural_numbers:
  - 1
  - 2
  - 3
  - etc
''')

with open('config.yaml') as stream:
    suite.add_yaml(stream)
    
# Finally, merge and validate the entire configuration
if not suite.validate():
    raise suite.errors

# Use the snapshot somehow
return suite.snapshot

Way 2: As a separate layer utility

We can implement a YamlLayer that is responsible for converting an arbitrary
yaml file into the Node-based system described above. We would also have an
abstract Layer type and a default PythonLayer that is used by ConfigSuite
with the following logic:

class ConfigSuite(object):
  def __init__(self, raw_config, schema, ...):
    if not isinstance(raw_config, Layer):
      raw_config = PythonLayer(raw_config)

The way to use the library remains more or less the same:

# At the top
from configsuite import ConfigSuite
from configsuite.layers import YamlLayer

# When loading
main_layer = YamlLayer('main_config.yaml')
sub_layer = YamlLayer('sub_layer.yaml')

# Construct ConfigSuite like we do today
suite = ConfigSuite(main_layer, schema, layers=(sub_layer,))

# Check for errors
if suite.valid:
    print(suite.errors)
    sys.exit(1)

# Use snapshot
return suite.snapshot

@pinkwah pinkwah added the help wanted Extra attention is needed label Oct 29, 2019
@markusdregi
Copy link
Collaborator

Applause for the extensive comment 👏 Unfortunately I have not yet found time to read it properly... Will get back to it as soon as possible!

@markusdregi
Copy link
Collaborator

Again, thank you for the extensive comment 🚀 For the rest of this answer I'll make a clear distinction between the consumer and user, because they have different needs. The consumer is utilizing the API of configsuite and by that exposing some configuration possibilities to a user of it's system again. In other words, the consumer can be assumed to be a competent developer and the user to be nothing but a person sitting in front of a computer trying to use a piece of software for which configsuite is an implementation detail.

While this is simple to implement, this approach has several problems:

  • It requires the user to explicitly opt-in to functionality that arguably
    should be part of ConfigSuite by default. The user might expect
    ConfigSuite to handle errors like having duplicate keys, which will be
    surprising when it doesn't.

We should not be afraid to leave some responsibility to the consumer of our API as long as we are up front about it. In particular, for every assumption we make regarding usage, we are giving less control to consumer. I would rather be clear about something being the consumers responsibility, then to get in their way. Duplicate keys is a bit intricate in yamls is a bit intricate in Python. Most people agree with you that it is strange behavior to at all accept these files. And in most cases we do not want our users to give such files as input. There are many reasons why I don't think it is a big problem that we don't handle this and why I would require it to be possible to opt-out of this behavior if implemented.

  1. It is not our responsibility. Many in the Python community regard this as a bug in the PyYaml module. And if so, this is where it should be fixed, not in some random configuration library on top of that. I can only speculate on why this has not been done and I suspect backwards compatibility to be the reason.

  2. It is relatively easy to implement as you already displayed in your POC. Remember, we are assuming that our users is a competent programmer ;)

  3. Usage assumptions. Who are we to decide that in some particular use case this does not make sense? Maybe someone went into a stable release two years ago and did allow duplicate keys at that point. They might even agree that it would be a feature to halt on duplicate keys, but not consider it a bug. In that case, they should be empowered to take that decision.

  • We still can't store information about whether a subgraph of the config was
    modified by a transform function. If there's a bug in the user's transform,
    it's be nice to show that an node was modified by it too.

We might never do this. Because the transformation belongs to the consumer, while the error belongs to the user. If validation fails, whether that is due to a transformation or not is impossible to tell. In particular, a user will not even know what a transformation is. I'm happy to continue this discussion, but I'm not convinced this is feasible to do in a consistent manner. And if we can't do it consistently, we are not implementing it.

Intended consumption
Our consumers is sitting on the context of how this is used and I would expect most of them to extend the ConfigSuite object in the following manner:

class MyConfig(ConfigSuite):
    def __init__(self, data_source):
        super.__init__(self, data_source, layers=_get_default_values())
    ....

So if someone wants to add yaml information to errors that would be contained within the MyConfig no matter whether it is part of the code code or something you can slap on top of existing results.

In addition I think all errors should implement the same interface, as otherwise you would have to treat an error from a yaml-file different than other data formats. And that quickly gets ugly. We could for instance support a tag-field and then dump the yaml specific information there.

Anyway, my vote goes for 1 or 2.2. I don't want yaml to be a first class citizen in the core of configsuite. I think we should discuss a bit further how this should actually function, but I think we are on the right track here 😎

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

2 participants