Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow reusing anchors in YAML #121

Merged
merged 2 commits into from
Jul 4, 2022

Conversation

innovate-invent
Copy link
Contributor

YAML Spec permits anchor name reuse, but ruaml.yaml by default errors for them.
https://stackoverflow.com/questions/39013993/parse-a-yaml-with-duplicate-anchors-in-python

Stacktrace
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.10.5/x64/bin/check-jsonschema", line 8, in <module>
    sys.exit(main())
  File "/opt/hostedtoolcache/Python/3.10.5/x64/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.10.5/x64/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/opt/hostedtoolcache/Python/3.10.5/x64/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/hostedtoolcache/Python/3.10.5/x64/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.10.5/x64/lib/python3.10/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.10.5/x64/lib/python3.10/site-packages/check_jsonschema/cli.py", line 280, in main
    execute(args)
  File "/opt/hostedtoolcache/Python/3.10.5/x64/lib/python3.10/site-packages/check_jsonschema/cli.py", line 327, in execute
    ret = checker.run()
  File "/opt/hostedtoolcache/Python/3.10.5/x64/lib/python3.10/site-packages/check_jsonschema/checker.py", line 88, in run
    self._run()
  File "/opt/hostedtoolcache/Python/3.10.5/x64/lib/python3.10/site-packages/check_jsonschema/checker.py", line 74, in _run
    errors = self._build_error_map()
  File "/opt/hostedtoolcache/Python/3.10.5/x64/lib/python3.10/site-packages/check_jsonschema/checker.py", line 64, in _build_error_map
    for filename, doc in self._instance_loader.iter_files():
  File "/opt/hostedtoolcache/Python/3.10.5/x64/lib/python3.10/site-packages/check_jsonschema/loaders/instance/__init__.py", line 61, in iter_files
    data = loadfunc(fp)
  File "/opt/hostedtoolcache/Python/3.10.5/x64/lib/python3.10/site-packages/check_jsonschema/loaders/instance/yaml.py", line 32, in load
    data = _yaml.load(stream)
  File "/opt/hostedtoolcache/Python/3.10.5/x64/lib/python3.10/site-packages/ruamel/yaml/main.py", line 434, in load
    return constructor.get_single_data()
  File "/opt/hostedtoolcache/Python/3.10.5/x64/lib/python3.10/site-packages/ruamel/yaml/constructor.py", line 119, in get_single_data
    node = self.composer.get_single_node()
  File "_ruamel_yaml.pyx", line 706, in _ruamel_yaml.CParser.get_single_node
  File "_ruamel_yaml.pyx", line 724, in _ruamel_yaml.CParser._compose_document
  File "_ruamel_yaml.pyx", line 775, in _ruamel_yaml.CParser._compose_node
  File "_ruamel_yaml.pyx", line 889, in _ruamel_yaml.CParser._compose_mapping_node
  File "_ruamel_yaml.pyx", line 775, in _ruamel_yaml.CParser._compose_node
  File "_ruamel_yaml.pyx", line 889, in _ruamel_yaml.CParser._compose_mapping_node
  File "_ruamel_yaml.pyx", line 767, in _ruamel_yaml.CParser._compose_node
ruamel.yaml.composer.ComposerError: found duplicate anchor; first occurrence
  in "grammar/basic.yml", line 2, column 12
second occurrence
  in "grammar/basic.yml", line 39, column 12

@innovate-invent innovate-invent marked this pull request as draft July 1, 2022 21:18
YAML Spec permits anchor name reuse, but ruaml.yaml by default errors for them.
https://stackoverflow.com/questions/39013993/parse-a-yaml-with-duplicate-anchors-in-python
@sirosen
Copy link
Member

sirosen commented Jul 1, 2022

Thanks for the PR and putting me onto this issue! I'm happy to look into this in a bit -- more on this below.

Three quick things:

  • I would prefer a dedicated test or pair of tests for this. I can easily imagine the behavior regressing if other changes are made, so rather than a slew of non-specific failures, it would be great to get the testsuite to print the result.
  • I believe that this needs pure=True on the yaml parser. I bet that will be a performance regression, but I don't know how serious. Ideally it's fine, but either I need to see that when I test perf or we'll need to make this opt-in.
  • I'm currently in the middle of making changes to the way that the ruamel.yaml.YAML object gets built in order to support gitlab !reference tags (!reference tag of .gitlab-ci.yml not supported? #112). If you wait for that change, it will likely have impact here and perhaps even make this easier.

@innovate-invent
Copy link
Contributor Author

Hi @sirosen, thanks for the feedback. This does indeed require pure=True to bypass yaml/pyyaml#394

I was hacking around your test suite trying to figure it out but I couldn't get it to run locally and I am not sure how to cleanly add tests. Can you point me in the right direction? Do I just duplicate the test functions I edited and they will be automatically detected?

How much of a delay are you proposing to get this into a release? My downstream project is a bit stalled with this issue.

@innovate-invent
Copy link
Contributor Author

For reference, this is a workaround in the meantime:

python -c 'import check_jsonschema; from check_jsonschema.loaders import instance; import ruamel.yaml; instance.LOAD_FUNC_BY_TAG["yaml"] = ruamel.yaml.YAML(typ="safe", pure=True).load; check_jsonschema.main()'  --schemafile grammar_schema.yml grammar/basic.yml

@sirosen
Copy link
Member

sirosen commented Jul 1, 2022

How much of a delay are you proposing to get this into a release? My downstream project is a bit stalled with this issue.

That's good to know. Generally I assume things are non-urgent unless someone speaks up. We can release as soon as we get things sorted out.

I was hacking around your test suite trying to figure it out but I couldn't get it to run locally and I am not sure how to cleanly add tests. Can you point me in the right direction? Do I just duplicate the test functions I edited and they will be automatically detected?

The CONTRIBUTING doc is meant to be sufficient -- if you can install tox, then tox should take care of most/all testing needs.
Let me share a little bit of what my setup is in case it helps:

  • pipx as a way of installing python CLI tools ( https://pypa.github.io/pipx/ )
  • pipx install tox to install tox
  • I use pyenv to install multiple python versions (very much optional)
  • tox runs "all tests", but for fast iterations I recommend tox -e py -- -x. That runs most of the tests using the current python version, and passes the -x flag to pytest, meaning "stop on the first failure".

The testsuite itself is written in pytest. Any function named test_* in a file named test_*.py in the test directory will get picked up. I fully endorse copy-paste-replace as a way of writing new tests -- it's efficient and should help you avoid getting bogged down too much in details.


Regarding pure=True performance, I see that this is about 15-to-20 times slower on my machine when parsing. That said, both parsers are plenty fast. I need to think more about whether or not this is appropriate as the default/only behavior.

@sirosen
Copy link
Member

sirosen commented Jul 4, 2022

I've made a bunch of changes in main which conflict with this, so I'm working through a merge of this right now. Thanks for getting the process started!

Just to share, the primary change I'm making is to use the pure=True parser as a failover. Given that the libyaml parsing is at least 10x faster, I think this is a good balance of speed and correctness. The new code will first try to parse with YAML(typ="safe") and if that fails, it will retry the parse with YAML(typ="safe", pure=True). It's still fast enough when pure=True is needed, but much faster when it is not.

I'll have this merged soon, and can release in v0.17.0 (hopefully today).

@sirosen sirosen merged commit 8a03d5c into python-jsonschema:main Jul 4, 2022
@sirosen
Copy link
Member

sirosen commented Jul 4, 2022

v0.17.0 is now released. Please let me know if you see any issues with it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants