TG2 - Parameterized #178
Comments
I have to admit that the thought of a field for parameters occurred to me in passing as well. I think it would help make things clear. The field could contain a good descriptive name for the parameter(s) and the default value(s). |
OK. This is what I was trying to get at with my comment on #63 about the correlation of vocabs and parameterized. |
OK, in checking the first few of Parameterized with the long-suffering @ArthurChapman, there are syntax and content issues we need to standardize before I feel comfortable about making more changes (42 all up at the moment). So far, Parameter(s) edited in table as examples: #163 Specified source authority, default = http://rs.gbif.org/vocabulary/gbif/rank.xml This raises
-- |
There has been some discussion around default values for parameterized tests
|
To answer @Tasilee: should we use a link to a web address, a name ("The Getty Thesaurus of Geographic Names" or "TGN") or a link to an API? In the parameter field, I think it should be an API if possible for the default. Then, in the References, a full name and web link to the vocabulary.
|
Thanks @ArthurChapman. I agree that we should supply a default even if it is 'best guess' as that will be helpful for implementers as a starting position. Regarding default minimum year, I think you mean '1753' and not '1953'? With my limited taxonomic experience, '1600' would seem a reasonable 'flag-raising' point but my reservation is that I tend to err toward false positives rather than false negatives. Meaning, I would rather raise a flag for those below 1753 than to not flag those between 1600 and 1753. The 'XXX-YYY' was to cover terms in the 'Expected response' such as 'NOT_EMPTY', 'NOT_COMPLIANT', 'NO_REPORT' etc. I will check that these are in the vocab, as they have grown with the implementation of the 'Expected responses'. I agree that the Parameter defaults should ideally point to an API, but a) some don't exist, b) some exist but may not be tightly coupled to a 'standard' and c) some are hard to find. My note about References in Parameters means that in some cases, we use the references as a link to defaults. In other cases, I have taken info from the 'Expected response', for example if there is a mention of 'authority'. |
Yes 1753. There is no logical reason for selecting 1753 for collections - there is for taxonomy. I am not sure where we got 1700 and what the logic was for that. 1600 predates the years of major scientific exploration (Spanish, Portuguese, British and French). |
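To make the "parameter with a default" idea concrete, here is a minimal sketch of an earliest-valid-year check. It assumes the default of 1753 floated above (1600 was also suggested, and nothing here settles which is right); the function and constant names are hypothetical and not part of any test specification.

```python
from typing import Optional

# Assumption: 1753 is the default under discussion in this thread, not an agreed value.
DEFAULT_EARLIEST_YEAR = 1753


def year_is_suspiciously_early(
    year: Optional[int],
    earliest_valid_year: int = DEFAULT_EARLIEST_YEAR,
) -> Optional[bool]:
    """Return True if `year` falls before the earliest valid year.

    Returns None when there is no year to evaluate.
    """
    if year is None:
        return None
    return year < earliest_valid_year
```

With the default, a record dated 1700 is flagged; a data set whose users know it reaches back further could pass `earliest_valid_year=1600` and the same record would not be flagged.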
Tests should only be parameterized when we have identified user stories in the areas that TG3 examined that clearly have different parts of the community wishing to use different parameters. The only two valid cases that come to my mind right off are application of a particular national taxonomic authority for tests involving scientific names and specifications of the earliest valid date for identifications or eventDates, where particular data sets are known by their users to have earliest valid dates. Parameters must not point to hypothetical resources that are not available to implementors. |
@ArthurChapman, yes, if we specify that a test is parameterized, we must specify a default value. I suspect that the identifiers (guids) for tests should only apply to implementations of those tests that use the default parameter values, and that implementations which take other values should use different guids to allow for machine comparison of results, but as the intent of parameters is to change the test behavior at runtime that might significantly complicate implementation. One alternative (thinking in terms of annotated Java methods à la the filtered push implementations), would be to have one identifier refer to a test with the default parameter, and another identifier refer to the same test, but with any other value for the parameter (Java implementation on the order of
), where the first method uses the guid currently specified for the test, and the second method uses a guid that we would need to specify for parameterized implementations. |
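The two-identifier idea can be sketched roughly as follows. This is an illustration in Python rather than Java, and it is not the code that was elided from the comment above: the identifiers are placeholders, not real test guids, and the `provides` decorator, the registry, and the year check are all hypothetical.

```python
from typing import Callable, Dict

# Placeholder identifiers for illustration only; NOT real test guids.
GUID_DEFAULT = "urn:example:test-with-default-parameter"
GUID_PARAMETERIZED = "urn:example:test-with-other-parameter"

# Assumption: the default value discussed earlier in this thread.
DEFAULT_EARLIEST_YEAR = 1753

# Maps an identifier to the function that implements it.
REGISTRY: Dict[str, Callable[..., bool]] = {}


def provides(guid: str):
    """Register the decorated function under the given identifier."""
    def decorator(func):
        REGISTRY[guid] = func
        return func
    return decorator


@provides(GUID_PARAMETERIZED)
def year_in_range(year: int, earliest_valid_year: int) -> bool:
    """Same test logic, but the caller supplies the parameter value."""
    return year >= earliest_valid_year


@provides(GUID_DEFAULT)
def year_in_range_default(year: int) -> bool:
    """The test exactly as specified, using the default parameter value."""
    return year_in_range(year, DEFAULT_EARLIEST_YEAR)
```

Results reported under `GUID_DEFAULT` stay comparable across implementations, while anything run with a non-default value is reported under the second identifier.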
@ArthurChapman and I have been discussing 'Needs work' tagged tests and resolved a few, but there are three remaining. Also, a question to the rest of you about the Expected Response regarding specified source authority. Should we
|
@Tasilee the updates to make the parameter values structured and consistent are great. |
Significant remaining problem: A very large number of the tests which take parameters should not be parameterized. I've noted this on #20: only tests for which we have use cases where different user communities will expect the tests to behave in different ways should be parameterized (such as a country wishing to validate scientific names against a national list rather than a global one). We must not specify parameters that point implementors to a resource from which the controlled vocabulary for a particular test can be found, that is something for the notes. When the specification says, e.g. compliant if matching ISO vocabulary x, then the implementor must use that vocabulary, and where they get it and how they get it is an implementation detail, not a parameter. All of the tests that have parameters need careful review to see if there is a clear use case for different users to expect different behaviors of the test for different uses, not whether or not there are multiple possible sources that could be used for some vocabulary. |
We have 41 tests that specify parameters. It looks to me like only 18 of those are actually candidates for parameterization, and each of these needs careful consideration and identification of the use cases that require the test to be parameterized.
|
The following tests have parameters and look to me like they very unambiguously must not be parameterized. The resources mentioned should be moved either into the specification or the notes, and not specified as a parameter.
|
@chicoreus I will look at this in detail when I get back home (away at the moment), but the Geodetic Datum (#102, #59, #60) ones should be Parameterized as different jurisdictions use different defaults (some by legislation, e.g. Brazil) and WGS84 may not always be the best default. In Brazil, for example, if no datum is specified, you can be nearly certain that the default is either SAD69(96) or SIRGAS2000 (depending on the date). Also many jurisdictions are using Coordinate Reference Systems (CRS) rather than datums as these are more often than not what is being given on GPS units. I will check their wording later. Like you, I think we have unnecessarily made too many tests Parameterized. @tucotuco may have good reasons for some of these, but I think we need to justify each test. Perhaps there are comments with justifications under the individual tests - I will check later. |
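As an illustration of why a jurisdiction might want a different default datum, here is a minimal sketch. The function name and the returned tuple are hypothetical; the only things taken from the discussion are WGS84 as the general default and SIRGAS2000 as a plausible Brazilian one. It deliberately does not model the date-dependent choice between SAD69(96) and SIRGAS2000, since no cut-off date is given here.

```python
from typing import Optional, Tuple


def resolve_datum(
    geodetic_datum: Optional[str],
    default_datum: str = "WGS84",
) -> Tuple[str, bool]:
    """Return (datum, assumed), where `assumed` is True if the default was used.

    `default_datum` is the parameter: a jurisdiction could override it
    instead of accepting WGS84.
    """
    if geodetic_datum is None or not geodetic_datum.strip():
        return default_datum, True
    return geodetic_datum.strip(), False
```

A Brazilian aggregator might call `resolve_datum(value, default_datum="SIRGAS2000")`, while a global one would leave the default alone. Returning the `assumed` flag keeps it visible that the datum was filled in rather than supplied.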
@ArthurChapman looks like #102 should be parameterized, while #59 and #60 should not. Added notes in those issues. |
I've updated the tables in the comments above accordingly, moving #102 into should be parameterized. |
Having looked at your list @chicoreus for tests that "shouldn't" be Parameterized - I have the following comments.
|
Agreed @chicoreus re #102, #59 and #60. #102 Parameterized, #59 and #60 not - with bdq:sourceAuthority=http://epsg.io/ |
Copied from #102 as comment applicable to more than just that test |
#133 and #38 I think should be Parameterized - see my comments in the table above i.e. "Problem I see here is that we are following dcterms:license - which could be broader than just Creative Commons. Do we wish to restrict to Creative Commons, or allow other license conditions to be valid? and thus allow someone to choose different vocabulary?" I am also concerned that some jurisdictions may legislate the licences they can use within that jurisdiction and they may not be Creative Commons |
Thanks @chicoreus and @ArthurChapman. Reading through the table and your comments Arthur, here is my take on it. Maybe after a Pinot Noir or two, I would think differently. #106 - Parameterised @tucotuco : We would value your discerning eye (or two) on this lot. I'll hold off edits for a response. I hope all is ok over there. |
I think you missed a few @Tasilee Not Parameterized @tuco might particularly like to comment on (see my table and comments above) #51, #115, #116, #73, #50, #118, #139, #21, #95 |
@ArthurChapman: I was using the table only..so will add missing into here. And BTW, you also missed #39 (Parameterised), #79 isn't parameterised: #20 - Not parameterised |
I am presuming for the Not parameterised above, we move any reference to a default source authority to the References section? That is, the Parameter field is EMPTY. |
@Tasilee I guess that would make sense, however it doesn't distinguish the default or target source Authority from any other reference. Perhaps we should put them in the Reference but as "bdq:sourceAuthority=xxxxxxx" and then the other references |
@ArthurChapman - that seems like a good strategy. I'll tackle the updates on Monday to give @tucotuco and @pzermoglio a chance to comment. |
Sorry folks, though I think there are a couple of good catches in this discussion, I am afraid that some of it will take us into circular reasoning. I think most of the tests that were tagged to be parametrized were correctly so. A big part of my stance on this is hidden in a comment to issue #63 (#63 (comment)). Basically, Darwin Core is not a source authority for values. But that is only part of the issue. The other is that we can't make standardizations without a thesaurus (or at least a simple lookup table) - controlled vocabularies are not enough. This is the reason we brought TG4 into existence, recognizing this fundamental need to develop the tests in tandem with the vocabularies that allow them to actually function. Some specific comments... I would like to challenge this statement by @chicoreus: Why? Can't it be evident aside from the work in TG3? Are the results of TG3 exhaustive for all time? I would also like to propose an amendment to the statement by @chicoreus: "Parameters must not point to hypothetical resources that are not available to implementors." Instead of "Parameters", this should be "Default sources". @Tasilee asked "Should we
I vote for bdq:sourceAuthority. For example, change "using a specified source authority service" to "using the bdq:sourceAuthority". I would like to challenge this statement by @chicoreus: "We must not specify parameters that point implementors to a resource from which the controlled vocabulary for a particular test can be found, that is something for the notes. When the specification says, e.g. compliant if matching ISO vocabulary x, then the implementor must use that vocabulary, and where they get it and how they get it is an implementation detail, not a parameter." I agree for VALIDATION tests where the vocabulary is written in stone. This is not true of most Darwin Core terms, which make recommendations, not requirements. The philosophy has always been to decouple requirements from definitions wherever possible. All of the AMENDMENT_ tests need a parameter to point to a source for the lookups. If we only used controlled vocabularies, we couldn't do any standardization, because only the standard values would be found, not the values from which the standard values would be determined. I do agree that there is a subset of tests that we currently have as parametrized that need not be. To me, these are only #20 (TG2-VALIDATION_COUNTRYCODE_NOTSTANDARD), #21 (TG2-VALIDATION_COUNTRY_NOTSTANDARD), #59 (TG2-VALIDATION_GEODETICDATUM_NOTSTANDARD), #79 (TG2-VALIDATION_DECIMALLATITUDE_OUTOFRANGE), #162 (TG2-VALIDATION_TAXONRANK_NOTSTANDARD). #21 and #59 will need to be explicit about the expectations. For example, for #21, it must be explicit whether the preferred name is the standard name, or if any of the names or codes are acceptable standard names. For #59, it will need to be made explicit whether the EPSG code is the only standard (because it's the only thing that is unambiguous), or if any of the names in Geodetic CRS, Datum, or Ellipsoid are also acceptable. Again, sorry, especially that it took this long to respond, but it was unavoidable. |
One issue that @tucotuco's comments bring up is the urgent need for Vocabularies of Values to be created for all the current Darwin Core terms that are currently referred to in the tests. Perhaps TG4 (at Leiden?) needs to establish a working group under the TG with the remit to create as many Vocabularies of Values for those terms as are possible in the short term (especially beginning with the easy ones). Some, I think, only have a limited number of terms, but we will need to formalise them under the format that TG4 is proposing to develop. I guess a first step is to make a list, with an assessment of what is required, and a work program. @pzermoglio something for the agenda in Leiden - perhaps discuss informally on the Sunday. |
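Thanks @tucotuco. Good to have your insights again, but I am struggling. I will repeat a comment I made somewhere among the tests. We have two scenarios for Parameterised:
1. Genuine options for bdq:sourceAuthority (e.g., #28) and
2. Options for a default value (e.g., #133)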
Your comment "we can't make standardizations without a thesaurus (or at least a simple lookup table) - controlled vocabularies are not enough" focuses on the second scenario. But surely we can't anticipate every possible misspelling or incorrectly interpreted 'value' to lookup? I guess I am assuming in at least some of the AMENDMENTS, that we are using pattern matching in the test code to have a stab at interpreting a potential target. Take the example in #133 dc:license="CCZero" becomes dc:license="https://creativecommons.org/publicdomain/zero/1.0/", following the Creative Commons vocabulary. @tucotuco: You are implying that we have a thesaurus that contains "CCZero"? As usual, I am probably missing something. Also, I have to bow to your Darwin Core philosophy: "Darwin Core is not a source authority for values". Our tests are Darwin Core based (and hence scenario 1 above is not applicable), but scenario 2 is. We are indeed stuffed in terms of vocabs (let alone thesauri), hence TG4, but we need to grab onto any straw we currently have, and DwC 'values' are a 'port in a storm'? |
@Tasilee I think we do need vocabularies/thesauri. License is a difficult one - but CCZero could = CC0 (1.0) or CC0 (1.0) Universal, etc. and then link to https://creativecommons.org/publicdomain/zero/1.0/. Also with many of the earlier Creative Commons there were many Ports (versions in different languages - see for example, https://creativecommons.org/tag/porting/). Version 4.0 is supposed to be a Universal set without the need for Porting, and that is encouraged for all new uses. A thesaurus would hopefully list these and (maybe) synonymise many. @tucotuco has extracted the licensing records from GBIF. Many (majority) are in the form of "ex coll. " These aren't very helpful as they just refer back to the original institution, etc. I am looking through the list to see if we can extract a basic set of options - especially with CC, but in addition there are various country licenses (e.g. http://open.canada.ca/en/open-government-licence-canada) and there are ODC licenses (Open Data Commons) - e.g. Open Data Commons Attribution License: http://www.opendatacommons.org/licenses/by/1.0/. I will see what I can come up with when I get time. |
I am saying explicitly, not implying, that we have thesauri for
vocabularies of terms that need to be cleaned. So yes, a license lookup
that says 'CCzero' is a synonym of the unequivocally preferred term '
https://creativecommons.org/publicdomain/zero/1.0/'.
My point is that values alone don't help us do any lookups - whether taken
from the examples given in Darwin Core (examples are no longer even
canonical) or elsewhere. Pattern matching is an implementation solution,
not a community data-driven one, which means we would rely on tech people
to make the mappings, not on the people who know (and are even responsible
for) the state of the domain.
I do not see two scenarios. Both examples need a source authority and we
decided that all tests that take a parameter should have a default value
for that parameter. To me it is best to be able to specify the source
authority when there isn't a single definitive option. This is in order to
decouple the test and the data used for the test, so that tests are less
likely to be implementation dependent. Imagine certifying an implementation
- given input A, the implementation is certified if it gives output A'.
Input would be the combination of data to be tested and versioned source
authority. The nature of the source authority may be distinct for distinct
types of tests. For example, AMENDMENT tests would require a lookup (hence
the suggestion of a thesaurus as the generic pattern), whereas VALIDATION
tests would only really need a controlled vocabulary, though they could
also use a thesaurus and only use the preferred values.
We can't effectively anticipate every possible nonsense that might come
along. I agree. We don't need to. But we can certainly create a lookup of
every bit of nonsense that has been seen so far, and we can strive for an
infrastructure that accumulates new nonsense as it arises and lets us
provide the lookups for those as we move forward.
I hope that helps explain where I am coming from.
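To illustrate the distinction drawn above between a thesaurus lookup for AMENDMENT tests and a controlled vocabulary for VALIDATION tests, here is a minimal sketch. The thesaurus is a toy with only the 'CCZero' synonym from this thread plus one illustrative extra; the function names and the lower-casing normalisation are assumptions, not anything specified for the tests.

```python
from typing import Dict, Optional, Set

CC0_URI = "https://creativecommons.org/publicdomain/zero/1.0/"

# Toy thesaurus: observed value (lower-cased) -> preferred value.
# "cczero" comes from the discussion; "cc0" is an illustrative extra entry.
LICENSE_THESAURUS: Dict[str, str] = {
    "cczero": CC0_URI,
    "cc0": CC0_URI,
    CC0_URI: CC0_URI,
}


def preferred_values(thesaurus: Dict[str, str]) -> Set[str]:
    """The controlled vocabulary is just the set of preferred values."""
    return set(thesaurus.values())


def validate(value: str, thesaurus: Dict[str, str]) -> bool:
    """VALIDATION: compliant only if the value is already a preferred value."""
    return value in preferred_values(thesaurus)


def amend(value: Optional[str], thesaurus: Dict[str, str]) -> Optional[str]:
    """AMENDMENT: look the observed value up; None means no proposal."""
    if value is None:
        return None
    return thesaurus.get(value.strip().lower())
```

Here `validate("CCZero", LICENSE_THESAURUS)` fails because only the URI is a preferred value, whereas `amend("CCZero", LICENSE_THESAURUS)` proposes the URI. A new bit of nonsense is handled by adding a row to the thesaurus, not by changing the test code, which is the decoupling argued for above.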
…On Sun, Sep 8, 2019 at 6:46 PM Lee Belbin ***@***.***> wrote:
Thanks @tucotuco <https://github.com/tucotuco>. Good to have your
insights again, but I am struggling. I will repeat a comment I made
somewhere among the tests. We have two scenarios for *Parameterised*
1. Genuine options for bdq:sourceAuthority (e.g., #28) and
2. Options for a default value (e.g., #133)
Your comment "we can't make standardizations without a thesaurus (or at
least a simple lookup table) - controlled vocabularies are not enough"
focuses on the second scenario. But surely we can't anticipate every
possible misspelling or incorrectly interpreted 'value' to lookup? I guess
I am assuming in at least some of the AMENDMENTS, that we are using pattern
matching in the test code to have a stab at interpreting a potential
target. Take the example in #133 <#133>
dc:license="CCZero" becomes dc:license="
https://creativecommons.org/publicdomain/zero/1.0/", following the
Creative Commons vocabulary.
@tucotuco <https://github.com/tucotuco>: You are implying that we have a
thesaurus that contains "CCZero"?
As usual, I am probably missing something.
Also, I have to bow to your Darwin Core philosophy: "Darwin Core is not a
source authority for values". Our tests are Darwin Core based (and hence
scenario 1 above is not applicable), but scenario 2 is. We are indeed
stuffed in terms of vocabs (let alone thesauri), hence TG4, but we need to
grab onto any straw we currently have, and DwC 'values' are a 'port in a
storm'?
|
@tucotuco - "Pattern matching is an implementation solution". I agree. I was unaware of the extent to which thesauri apply to our issues - which is a more 'standard' solution that is openly accessible and hopefully understandable. This reminds me of the eureka moment aeons ago in TDWG (TIP days) when I realized that we needed an effective environment for the creation and management of ontologies. We needed an environment created by 'programmers' that made it easy to add terms, definitions and relationships. As far as I am aware, such a user (application domain specialist)-centric environment still doesn't exist (but I could be wrong as I have not recently researched it). I think such an environment for biodiversity informatics-related thesauri (term -> preferred standard term, definition, comments and links etc) would be nice. A wiki style of management? A list by itself is a start, but when isolated and without provenance, is less than optimal. Governance is a key issue. If there is an 'authority', grand, but the system still needs to be open to public comment for efficient improvements. |
I totally agree. I think ontology management has progressed well and has
viable environments and tools. Some of our vocabs would be best
accommodated by ontologies, especially basisOfRecord. For the rest, I think
it is high time we dive in and play with what Tim has to offer.
…On Mon, Sep 9, 2019 at 7:39 PM Lee Belbin ***@***.***> wrote:
@tucotuco <https://github.com/tucotuco> - "Pattern matching is an
implementation solution". I agree. I was unaware of the extent on thesauri
to our issues - which is a more 'standard' solution that is openly
accessible and hopefully understandable.
This reminds me of the eureka moment aeons ago in TDWG (TIP days) when I
realized that we needed an effective environment for the creation and
management of ontologies. We needed an environment created by 'programmers'
that made it easy to add terms, definitions and relationships. As far as I
am aware, such a user (application domain specialist)-centric environment
still doesn't exist (but I could be wrong as I have not recently researched
it).
I think such an environment for biodiversity informatics-related thesauri
(term -> preferred standard term, definition, comments and links etc) would
be nice. A wiki style of management? A list by itself is a start, but when
isolated and without provenance, is less than optimal. Governance is a key
issue. If there is an 'authority', grand, but the system still needs to be
open to public comment for efficient improvements.
|
We have a quorum to CLOSE. |
Having a look at the tests, we now seem to have added Parameterized to (virtually) every test where we have a vocabulary - even where (e.g. #62) the Vocabulary is an ISO Standard.
I am not sure that we have thought this through for each case.
I think we need to make it clear (in Notes?) what the Parameter is that needs to be set. It is not clear in some of the tests that we have Parameterized. In most it is specifying the Source Authority, in others an upper or lower limit.
Are we overusing parameterization? Do we need another field explaining what the parameter is?