Add functionality to guess the intended input file format #1058

Open
jamesaoverton opened this issue Oct 1, 2022 · 4 comments

Comments

@jamesaoverton
Member

Building on #1038.

PR #1056 is already very useful, but it made me wonder: Why don't we have a utility that would determine the file format, or at least make a good guess at what was intended? It could be added to #1056 as a --input-format detect or --input-format auto option, and maybe used whenever parsing fails.

The root problem is that a .owl extension is used for several different formats supported by OWLAPI. For most of the OBO use cases we expect RDF/XML, but it could be OWL/XML or Manchester or OWL Functional or Turtle. The OWLAPI will try a dozen different parsers until one of them works, and if it successfully loads then we can ask with OWLOntologyManager.getOntologyFormat(). The interesting case is #1038 where the ontology fails to load but we should still be able to guess the intended format, and then report the most useful parsing error.

When I'm not sure about the format, I just look at the first few lines of the file. It shouldn't be hard to write code for some crude heuristics. This would be useful even if it misses some weird edge cases.

  • RDF/XML: look for the <rdf:RDF> (skip XML DTD stuff)
  • Turtle: look for @prefix
  • OWL Functional: look for Prefix(

We could also have useful error messages for common failure modes:

  • HTML, e.g. a 404 response
  • empty file
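
For illustration, the heuristics and failure modes listed above could be sketched roughly like this. It is written in Python for brevity rather than ROBOT's Java, and the function name `guess_format`, the returned labels, and the 64 KB read limit are all assumptions made for the example, not part of ROBOT or OWLAPI.

```python
import re


def guess_format(text: str, limit: int = 65536) -> str:
    """Crude guess at the intended format from the start of a file.

    Hypothetical helper: the labels returned here are illustrative only.
    """
    head = text[:limit]
    # Ignore a byte-order mark and leading whitespace.
    stripped = head.lstrip("\ufeff \t\r\n")
    if not stripped:
        return "empty"
    # An HTML page (for example a 404 response) instead of an ontology.
    if re.match(r"<!doctype html|<html\b", stripped.lower()):
        return "html"
    # Searching the whole head skips over the XML declaration and any DTD.
    if "<rdf:RDF" in head:
        return "rdfxml"
    if re.search(r"^\s*@prefix\s", head, re.MULTILINE):
        return "turtle"
    if re.search(r"^\s*Prefix\(", head, re.MULTILINE):
        return "owl-functional"
    return "unknown"
```

Anything that matches none of the checks comes back as `"unknown"`, so a caller can still fall back to the existing behaviour.
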
@beckyjackson
Contributor

beckyjackson commented Oct 2, 2022

For some other formats (after playing around with loading edge cases):

  • JSON: {
  • Manchester: Prefix:
    • Edge case where there aren't prefixes, we'd need to look for Ontology:, Class:, etc...
  • OBO: format-version: or ontology:
    • Edge cases, maybe look for [Term] or [Typedef]?
    • ontology: could conflict with Manchester (omn): the Manchester keyword should be capitalized (Ontology:), but ROBOT will still load a file that uses lowercase
  • OWL/XML: <Ontology (after RDF/XML)

Also for TTL: this is a full, valid file

<http://purl.obolibrary.org/obo/EX_1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Class> .

So we would need an edge case for that, which looks for a leading < only after checking whether the file is RDF/XML or OWL/XML.
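
Because several of these markers overlap, the order of the checks matters. A rough sketch of an ordered rule table covering the formats in this comment follows. It is Python rather than ROBOT's Java; `RULES`, `guess_format_extended`, and the labels are hypothetical names, and the bare-IRI pattern for Turtle is one possible assumption about how to narrow the "looks for <" check.

```python
import re

# Ordered (label, pattern) pairs; the first match wins. Labels are illustrative.
RULES = [
    ("json", re.compile(r"\A\s*\{")),
    # XML-based formats come before the bare "<" check used for Turtle.
    ("rdfxml", re.compile(r"<rdf:RDF\b")),
    ("owlxml", re.compile(r"<Ontology\b")),
    # Markers that only OBO uses.
    ("obo", re.compile(r"^(format-version:|\[Term\]|\[Typedef\])", re.M)),
    # Manchester keywords, capitalized as expected.
    ("manchester", re.compile(r"^(Prefix:|Ontology:|Class:)", re.M)),
    # Lowercase "ontology:" alone could be OBO or a lowercase Manchester file.
    ("obo-or-manchester", re.compile(r"^ontology:", re.M)),
    # Turtle: an @prefix line, or a line starting with a full <scheme:...> IRI.
    ("turtle", re.compile(r"^\s*(@prefix\b|<[^<>\s]*:[^<>\s]*>\s)", re.M)),
]


def guess_format_extended(text: str) -> str:
    """Return the label of the first rule that matches, else 'unknown'."""
    for label, pattern in RULES:
        if pattern.search(text):
            return label
    return "unknown"
```

With this ordering, the single-triple Turtle file above is reported as `turtle`, while a file whose only marker is lowercase `ontology:` is flagged as ambiguous instead of being forced into one format.
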

As I'm listing out the edge cases, I'm worried that we will miss something and this could cause a breaking change. It's a good idea to report a clean error message, but maybe we should only do it if all parsers fail? It seems inefficient, but I don't know how we can be 100% sure we get all the edge cases. And then suddenly somebody can't load their ontology (I guess they could override with --input-format...).
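
The "only if all parsers fail" idea could look something like the sketch below: loading works exactly as before, and the guess is consulted only to decide which parser's error to show. This is Python, not ROBOT's Java, and `LoadError`, `load_with_diagnosis`, the `parsers` mapping, and the `guess` callable are all hypothetical stand-ins, not OWLAPI or ROBOT APIs.

```python
from typing import Any, Callable, Dict


class LoadError(Exception):
    """Raised when no parser accepts the input (hypothetical)."""


def load_with_diagnosis(
    text: str,
    parsers: Dict[str, Callable[[str], Any]],
    guess: Callable[[str], str],
) -> Any:
    """Try every parser in order; guess the format only after all have failed."""
    errors: Dict[str, str] = {}
    for label, parse in parsers.items():
        try:
            return parse(text)  # first success wins, so default behaviour is unchanged
        except Exception as exc:
            errors[label] = str(exc)
    # Every parser failed: use the guess to pick the most relevant error.
    likely = guess(text)
    if likely in errors:
        raise LoadError(
            f"Input looks like {likely}, but parsing failed: {errors[likely]}"
        )
    summary = "; ".join(f"{label}: {msg}" for label, msg in errors.items())
    raise LoadError(f"No parser accepted the input ({summary})")
```

A wrong guess can then never stop a loadable ontology from loading; at worst it selects a less helpful error message.
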

@matentzn
Contributor

matentzn commented Oct 2, 2022

You would 100% not want to change the default behaviour (no guessing); just add a new option, --input detect, which would try to guess the format.

@beckyjackson
Contributor

If you're setting the flag anyway, wouldn't you know your format? So why add auto-detect when you could just specify the format the same way?

Can you see a use-case where you don't know the format in advance, but want clear error messages if it fails? Maybe dashboard stuff...?

@matentzn
Contributor

matentzn commented Oct 2, 2022

I agree with @beckyjackson. A possible use case is to say: we will do a better job guessing, rather than OWLAPI cycling through all possible parsers. But yeah, I still agree with you.
