Add functionality to guess the intended input file format #1058

jamesaoverton · 2022-10-01T15:18:01Z

Building on #1038.

PR #1056 is already very useful, but it made me wonder: Why don't we have a utility that would determine the file format, or at least make a good guess at what was intended? It could be added to #1056 as a --input-format detect or --input-format auto option, and maybe used whenever parsing fails.

The root problem is that a .owl extension is used for several different formats supported by OWLAPI. For most of the OBO use cases we expect RDF/XML, but it could be OWL/XML or Manchester or OWL Functional or Turtle. The OWLAPI will try a dozen different parsers until one of them works, and if it successfully loads then we can ask with OWLOntologyManager.getOntologyFormat(). The interesting case is #1038 where the ontology fails to load but we should still be able to guess the intended format, and then report the most useful parsing error.

When I'm not sure about the format, I just look at the first few lines of the file. It shouldn't be hard to write code for some crude heuristics. This would be useful even if it misses some weird edge cases.

RDF/XML: look for the <rdf:RDF> (skip XML DTD stuff)
Turtle: look for @prefix
OWL Functional: look for Prefix(
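
As a rough illustration of the three heuristics above, here is a minimal Python sketch. The function name `guess_format`, the short format labels, and the 50-line window are assumptions made for this example; none of them come from ROBOT or OWLAPI.

```python
import re

def guess_format(text, max_lines=50):
    """Return 'rdfxml', 'turtle', 'ofn', or None, based on the first lines."""
    # Only inspect the head of the file; the window size is an arbitrary choice.
    head = "\n".join(text.splitlines()[:max_lines])
    # RDF/XML: searching anywhere in the head skips the XML declaration and DTD.
    if re.search(r"<rdf:RDF[\s>]", head):
        return "rdfxml"
    # Turtle: a line that starts with an @prefix declaration.
    if re.search(r"^\s*@prefix\s", head, re.MULTILINE):
        return "turtle"
    # OWL Functional: a line that starts with Prefix( .
    if re.search(r"^\s*Prefix\(", head, re.MULTILINE):
        return "ofn"
    return None
```

A file with a very long DTD block could push `<rdf:RDF` past the window, so the limit would probably need tuning against real ontologies.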

We could also have useful error messages for common failure modes:

HTML, e.g. a 404 response
empty file
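
The two failure modes above could be caught before any parser runs. This is only a sketch: `describe_bad_input` and the message wording are hypothetical, and the HTML test is deliberately crude.

```python
def describe_bad_input(text):
    """Return a message for a recognisable non-ontology input, else None."""
    stripped = text.strip()
    # Empty or whitespace-only file.
    if not stripped:
        return "The input file is empty."
    # HTML pages usually open with a doctype or an <html> tag.
    start = stripped[:200].lower()
    if start.startswith("<!doctype html") or start.startswith("<html"):
        return "The input looks like an HTML page (for example a 404 response), not an ontology."
    return None
```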

beckyjackson · 2022-10-02T13:54:56Z

For some other formats (after playing around with loading edge cases):

JSON: {
Manchester: Prefix:
- Edge case where there aren't prefixes, we'd need to look for Ontology:, Class:, etc...
OBO: format-version: or ontology:
- Edge cases, maybe look for [Term] or [Typedef]?
- ontology: could conflict with Manchester (omn). The keyword should be capitalized there, but ROBOT will still load a file that uses lowercase
OWL/XML: <Ontology (after RDF/XML)

Also for TTL: this is a full, valid file

<http://purl.obolibrary.org/obo/EX_1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Class> .

So we would need an edge case for that which looks for < after checking if it's RDF/XML or OWL/XML
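
The ordering matters here, so a sketch of these checks as an ordered list may help. Everything in it is an assumption for illustration: the names `CHECKS` and `guess_format`, the format labels, the regular expressions, and the choice to treat lowercase `ontology:` as OBO rather than Manchester.

```python
import re

# (format label, pattern) pairs, tried in order; the first match wins.
CHECKS = [
    ("rdfxml", r"<rdf:RDF[\s>]"),
    ("owlxml", r"<Ontology[\s>]"),  # only reached when <rdf:RDF is absent
    ("json", r"\A\s*\{"),
    ("turtle", r"^\s*@prefix\s"),
    ("ofn", r"^\s*Prefix\("),
    ("omn", r"^\s*(Prefix|Ontology|Class):"),  # case-sensitive on purpose
    ("obo", r"^\s*(format-version:|ontology:|\[Term\]|\[Typedef\])"),
    # Last resort: a line starting with a bracketed IRI suggests
    # Turtle written without any prefix declarations.
    ("turtle", r"^\s*<[^<>\s]+>\s"),
]

def guess_format(text, max_lines=50):
    """Return the label of the first matching check, or None."""
    head = "\n".join(text.splitlines()[:max_lines])
    for name, pattern in CHECKS:
        if re.search(pattern, head, re.MULTILINE):
            return name
    return None
```

With this ordering, the single-triple file above falls through to the last check and is reported as Turtle. A Manchester file that uses lowercase `ontology:` and no prefixes would be misreported as OBO, which is exactly the conflict noted above.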

As I'm listing out the edge cases, I'm worried that we will miss something and this could cause a breaking change. It's a good idea to report a clean error message, but maybe we should only do it if all parsers fail? It seems inefficient, but I don't know how we can be 100% sure we get all the edge cases. And then suddenly somebody can't load their ontology (I guess they could override with --input-format...).
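
One way to picture the "only if all parsers fail" option is the sketch below. It is not how OWLAPI or ROBOT work internally: `load_or_explain`, the `parsers` mapping, and the `guess` callable are all hypothetical stand-ins.

```python
def load_or_explain(text, parsers, guess):
    """Try every parser; if all fail, raise the error from the guessed format.

    parsers: dict mapping a format label to a callable that parses or raises.
    guess: callable mapping text to a format label, or None if unsure.
    """
    errors = {}
    for name, parse in parsers.items():
        try:
            return name, parse(text)
        except Exception as exc:  # each parser may raise its own error type
            errors[name] = exc
    # Guessing happens only after every parser has failed, so files that
    # load today keep loading exactly as before.
    guessed = guess(text)
    if guessed in errors:
        raise ValueError(
            f"Could not parse input as {guessed}: {errors[guessed]}"
        ) from errors[guessed]
    raise ValueError("Could not parse input in any supported format.")
```

Because the guess only selects which error to report, a wrong guess produces a less helpful message but never blocks a file that some parser can read.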

matentzn · 2022-10-02T14:18:27Z

You would 100% not want to change the default behaviour (no guessing). Just add a new option, --input detect, which would try to guess the format.

beckyjackson · 2022-10-02T14:29:40Z

If you're setting the flag anyway, wouldn't you know your format? So why add auto-detect when you could just specify the format the same way?

Can you see a use-case where you don't know the format in advance, but want clear error messages if it fails? Maybe dashboard stuff...?

matentzn · 2022-10-02T15:15:32Z

I agree with @beckyjackson. A possible use case is to say: we will do a better job guessing, rather than OWLAPI cycling through all possible parsers. But yeah, I still agree with you.

beckyjackson added the under-review label Oct 2, 2022

beckyjackson added enhancement and removed under-review labels Oct 2, 2022
