TG2 - Test Data Framework #189
From @chicoreus
Questions
Question 1 - definitely YES. (Pity the world doesn't have one standard for this!)
Question 2 - is there an ISO standard or some other standard we can cite for this?
These are the VALIDATIONs ordered by Darwin Core Term
Can I suggest @tucotuco makes a start on the SPACE ones, @ArthurChapman on the NAME ones, @chicoreus on the TIME ones, and @Tasilee on OTHER and NOTIFICATIONS? Hopefully a few others will offer some help, at least for checking.
I've updated the table slightly, changing 143.5 to a negative value so that the not-compliant result makes sense, adding a remarks column with notes about the tests, and making more explicit the two tests at the end which have leading and trailing space characters as part of the test value. I've also clarified the explanatory text at the top of the table and added examples of human readable explanations where they were absent.
@ArthurChapman for (2), consider "1 828.8" (without the quotes): 1000 fathoms, in meters, with a period as the decimal separator and a space separating every three digits. That is an expected SI format for publication, but a very unnatural form for electronic Darwin Core data, where "1828.8" or "1828,8", serialized from some floating point representation by software into some form of data exchange document, would be the expected values (with the localization of the software doing the serialization determining the choice of comma or period as the decimal separator, but most software not adding space separators every three digits in serialized data). My tendency would be to say that we can expect to see "1828.8" or "1828,8" in abundance in the wild, but not "1,828.8" or "1 828,8", and that we should either not specify how these cases should be handled, or should say that both are expected to be internal prerequisites not met as a general expectation for all Darwin Core data. For standards, we should probably look for RFCs for serialization of numeric data, rather than ISO or SI representation, as the (numeric or date) values found in data sets will in large part be serializations into uncontrolled string fields of strongly typed database fields (or less strongly typed and variously formatted spreadsheet columns...).
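For illustration, here is a minimal sketch of that distinction (hypothetical code, not part of any test specification; the patterns are assumptions about what plain serializations and grouped forms look like):

```java
import java.util.regex.Pattern;

public class NumericFormatSketch {
    // Plain serializations expected from software: optional sign, digits,
    // a single optional decimal separator (period or comma), more digits.
    private static final Pattern PLAIN =
            Pattern.compile("-?\\d+([.,]\\d+)?");
    // Grouped forms (comma or space thousands separators) that are
    // unlikely to be produced by serialization code.
    private static final Pattern GROUPED = Pattern.compile(
            "-?\\d{1,3}(,\\d{3})+(\\.\\d+)?|-?\\d{1,3}( \\d{3})+([.,]\\d+)?");

    public static boolean isPlainSerialization(String value) {
        return PLAIN.matcher(value.trim()).matches();
    }

    public static boolean isGroupedForm(String value) {
        return GROUPED.matcher(value.trim()).matches();
    }

    public static void main(String[] args) {
        for (String v : new String[] { "1828.8", "1828,8", "1,828.8", "1 828,8" }) {
            System.out.println(v + " plain=" + isPlainSerialization(v)
                    + " grouped=" + isGroupedForm(v));
        }
    }
}
```

The split is only heuristic: a value like "1,828" matches both patterns (thousands grouping, or an integer part with a comma decimal separator), which is itself an argument for leaving these cases unspecified rather than guessing.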
…a for #187. Filename suggests pattern, testdata_{humanreadablenameoftest}.csv for such test data sets
Thanks a million Lee - give me the easy one :-)
Thanks @chicoreus. I agree with what you suggest, although - certainly in Australia - I think "1,828.8" would be common, but I am happy to have it treated as you suggest. BTW, what is the easiest way to open that file as an Excel file? Can you send it to me separately as just csv? Copying and pasting doesn't seem to work.
I accept working on the SPACE test data.
…On Wed, Sep 30, 2020 at 7:57 PM Lee Belbin wrote:
These are the VALIDATIONs ordered by Darwin Core Term
Dimension Term_Action
Other BASISOFRECORD_EMPTY
Other BASISOFRECORD_NOTSTANDARD
Name CLASS_NOTFOUND
Name CLASSIFICATION_AMBIGUOUS
Space COORDINATES_COUNTRYCODE_INCONSISTENT
Space COORDINATES_STATE-PROVINCE_INCONSISTENT
Space COORDINATES_TERRESTRIALMARINE
Space COORDINATES_ZERO
Space COORDINATEUNCERTAINTY_OUTOFRANGE
Space COUNTRY_COUNTRYCODE_INCONSISTENT
Space COUNTRY_EMPTY
Space COUNTRY_NOTSTANDARD
Space COUNTRYCODE_EMPTY
Space COUNTRYCODE_NOTSTANDARD
Time DATEIDENTIFIED_NOTSTANDARD
Time DATEIDENTIFIED_OUTOFRANGE
Time DAY_NOTSTANDARD
Time DAY_OUTOFRANGE
Other DCTYPE_EMPTY
Other DCTYPE_NOTSTANDARD
Space DECIMALLATITUDE_EMPTY
Space DECIMALLATITUDE_OUTOFRANGE
Space DECIMALLONGITUDE_EMPTY
Space DECIMALLONGITUDE_OUTOFRANGE
Time ENDDAYOFYEAR_OUTOFRANGE
Time EVENT_TEMPORAL_EMPTY
Time EVENTDATE_EMPTY
Time EVENTDATE_INCONSISTENT
Time EVENTDATE_NOTSTANDARD
Time EVENTDATE_OUTOFRANGE
Name FAMILY_NOTFOUND
Name GENUS_NOTFOUND
Space GEODETICDATUM_EMPTY
Space GEODETICDATUM_NOTSTANDARD
Space GEOGRAPHY_AMBIGUOUS
Space GEOGRAPHY_NOTSTANDARD
Name KINGDOM_NOTFOUND
Other LICENSE_EMPTY
Other LICENSE_NOTSTANDARD
Space LOCATION_EMPTY
Space MAXDEPTH_OUTOFRANGE
Space MAXELEVATION_OUTOFRANGE
Space MINDEPTH_GREATERTHAN_MAXDEPTH
Space MINDEPTH_OUTOFRANGE
Space MINELEVATION_GREATERTHAN_MAXELEVATION
Space MINELEVATION_OUTOFRANGE
Time MONTH_NOTSTANDARD
Other OCCURRENCEID_EMPTY
Other OCCURRENCEID_NOTSTANDARD
Other OCCURRENCESTATUS_EMPTY
Other OCCURRENCESTATUS_NOTSTANDARD
Name ORDER_NOTFOUND
Name PHYLUM_NOTFOUND
Name POLYNOMIAL_INCONSISTENT
Name SCIENTIFICNAME_EMPTY
Name SCIENTIFICNAME_NOTFOUND
Time STARTDAYOFYEAR_OUTOFRANGE
Name TAXON_AMBIGUOUS
Name TAXON_EMPTY
Name TAXONID_AMBIGUOUS
Name TAXONID_EMPTY
Name TAXONRANK_EMPTY
Name TAXONRANK_NOTSTANDARD
Time YEAR_EMPTY
Time YEAR_OUTOFRANGE
I have data for the time tests in progress, and will take on the rest of the time test data.
…name consistent with case of test label.
@ArthurChapman the best way to obtain the csv files is with the raw link. For example, for https://github.com/tdwg/bdq/blob/master/tg2/core/testdata/testdata_VALIDATION_MAXDEPTH_OUTOFRANGE.csv, to the upper right of the table are the buttons Raw and Blame. Raw takes you to the raw csv file https://raw.githubusercontent.com/tdwg/bdq/master/tg2/core/testdata/testdata_VALIDATION_MAXDEPTH_OUTOFRANGE.csv - which is important in these cases, as the data values may be numbers not in quotes, or numbers in quotes as strings with whitespace padding. I've added the maximum 32-bit signed and 32-bit unsigned integer values, plus those values with 1 added and those values with 2 added, plus the name of the term under test (e.g. dwc:day="day"), to each of the three sets of test data I've got up so far. -1, 0, and the maximum integer values are good test values to add for any term that takes numeric data.
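As a sketch of that boundary-value convention (a hypothetical helper, not code from the repository):

```java
import java.util.ArrayList;
import java.util.List;

public class NumericBoundarySketch {
    /**
     * Boundary strings worth adding as test inputs for any term that
     * takes numeric data: -1, 0, the 32-bit signed and unsigned maxima,
     * those maxima plus 1 and plus 2, and the term name itself.
     */
    public static List<String> boundaryValuesFor(String termName) {
        List<String> values = new ArrayList<>();
        values.add("-1");
        values.add("0");
        long signedMax = Integer.MAX_VALUE; // 2147483647
        long unsignedMax = 4294967295L;     // 2^32 - 1
        for (long base : new long[] { signedMax, unsignedMax }) {
            values.add(Long.toString(base));     // at the limit
            values.add(Long.toString(base + 1)); // just past the limit
            values.add(Long.toString(base + 2));
        }
        values.add(termName); // e.g. dwc:day="day"
        return values;
    }
}
```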
@chicoreus You have an error in testdata_VALIDATION_MAXDEPTH_OUTOFRANGE.csv: in lines 19 and 20, the default for bdq:minimumValidDepthInMeters is negative, but depth can never be a negative number, so 18 has to be NOT_COMPLIANT. Also, lines 23 and 24 appear identical.
@chicoreus in testdata_VALIDATION_DAY_NOTSTANDARD.csv, some lines appear to be duplicates.
@chicoreus in testdata_VALIDATION_MONTH_NOTSTANDARD.csv, lines 4 and 5 appear to be duplicates. Lines 39, 40, 41 should be NOT_COMPLIANT. Should we include "01" etc.?
@ArthurChapman in testdata_VALIDATION_MAXDEPTH_OUTOFRANGE.csv, lines 19 and 20 are both correct. They are testing cases where the provided parameter values are outside the defaults: does the test listen to the provided parameters, or does it treat the defaults as hard limits? For testdata_VALIDATION_DAY_NOTSTANDARD.csv, check the raw csv file; the duplicated lines are probably cases where leading or trailing spaces are present in one line but not another. For testdata_VALIDATION_MONTH_NOTSTANDARD.csv, lines 4 and 5 differ in whitespace in the input: line 4 has the string "1", line 5 the string " 1" with a leading space. Lines 39-41 are indeed in error. Yes, leading zeros make sense to test. I have added them.
…ng leading zeros to tests. Fixing NOT_COMPLIANT out of range month 13.
Thanks @chicoreus. I still think it is misleading for the default depth to be a negative number, as that is not allowed. From Georeferencing Best Practices, DEPTH: "A measurement of the vertical distance below a vertical datum. In this document, we try to modify the term to signify the medium in which the measurement is made. Thus, "water depth" is the vertical distance below an air-water interface in a waterbody (ocean, lake, river, sinkhole, etc.). Compare distance above surface. Depth is always a non-negative number."
If depth is distance from a vertical datum, and depth represents a vertical distance below an air-water interface, then negative values of depth are possible. Consider a vertical datum of mean sea level, and a sample collected in the intertidal, below the surface of the water at a high tide, above the mean sea level vertical datum. Such a sample would be both collected below the air-water interface and at a distance above (thus negative) from the vertical datum. If, however, depth can never be a negative value, then we need to be explicit about that in the specification for VALIDATION_MAXDEPTH_OUTOFRANGE and other depth related tests, such that regardless of the parameterization, zero is the smallest allowed value for depth, and even if a negative value is provided as a parameter, the test must still return not compliant for depths smaller than zero. A sketch of that logic follows.
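A minimal sketch of making that explicit (hypothetical names; the actual tests return structured responses rather than bare strings):

```java
public class MaxDepthOutOfRangeSketch {
    /**
     * Validate a maximum depth against a parameterized range, treating
     * zero as a hard lower bound even when a negative
     * bdq:minimumValidDepthInMeters is supplied as a parameter.
     */
    public static String validate(Double maximumDepthInMeters,
                                  double minimumValidDepth,
                                  double maximumValidDepth) {
        if (maximumDepthInMeters == null) {
            return "INTERNAL_PREREQUISITES_NOT_MET";
        }
        // Zero is a hard limit: ignore any negative parameterized minimum.
        double floor = Math.max(0d, minimumValidDepth);
        return (maximumDepthInMeters >= floor
                && maximumDepthInMeters <= maximumValidDepth)
                ? "COMPLIANT" : "NOT_COMPLIANT";
    }
}
```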
What you are describing, Paul, is distance above surface.
I think we should add Paul's example to the Best Practices and explain how it should be determined.
1 m below the surface of the ocean, stuck to a rock at a 2 m high tide:
Elevation: 2 m; Vertical Datum: EGM1996; Depth: 1 m; Distance above surface: 0 m
…On Fri, Oct 2, 2020 at 2:00 AM Arthur Chapman wrote:
What you are describing Paul is *distance above surface*: "In addition to elevation and depth, a measurement of the vertical distance above a reference point, with a minimum and a maximum distance to cover a range. For surface terrestrial locations, the reference point should be the elevation at ground level. Over a body of water (ocean, sea, lake, river, glacier, etc.), the reference point for aerial locations should be the elevation of the air-water interface, while the reference point for sub-surface benthic locations should be the interface between the water and the substrate. Locations within a water body should use depth rather than a negative distance above surface. Distances above a reference point should be expressed as positive numbers, while those below should be negative. The maximum distance above a surface will always be a number greater than or equal to the minimum distance above the surface. Since distances below a surface are negative numbers, the maximum distance will always be a number less than or equal to the minimum distance. Compare altitude."
…tion, updating status and comment for negative values in test data for #187.
@tucotuco I'm confused. If depth is defined as distance below a vertical datum, and the data, as you specify, are "1 m below the surface of the ocean, stuck to a rock at a 2 m high tide", doesn't this mean that the vertical datum is the datum for both elevation and depth, and that the point is both 2 meters above this datum and one meter below this datum, and at the water surface, all at the same time? Shouldn't the values be Elevation: 1 m; Depth: 1 m? This tells us that the sample was collected 1 meter above mean sea level for that location, and was 1 meter below the surface of the water at that time. For nearshore and intertidal localities, particularly with historical data, vertical position is most likely known based on a local mean low tide, mean tide, or mean high tide datum, which may or may not be translatable from the provided data to a global vertical datum.
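For illustration only, the arithmetic behind this reading of the example (values from the thread, variable names invented):

```java
public class IntertidalArithmetic {
    public static void main(String[] args) {
        double tideAboveMeanSeaLevel = 2.0;  // water surface 2 m above the vertical datum
        double depthBelowWaterSurface = 1.0; // sample 1 m below the air-water interface
        // Height of the sample relative to the vertical datum (mean sea level):
        double elevationAboveDatum = tideAboveMeanSeaLevel - depthBelowWaterSurface;
        System.out.println("Elevation above datum: " + elevationAboveDatum + " m"); // 1.0
        System.out.println("Water depth: " + depthBelowWaterSurface + " m");        // 1.0
    }
}
```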
In accordance with #189, added file testdata_NOTIFICATION_ANNOTATION_NOTEMPTY_#29.csv
I had a chat with @ArthurChapman in which we discussed some of the issues arising, and figure there are at least the following issues to discuss once we have all done our test data.
@Tasilee added some columns to the table above for ticking off test data files that have been checked by each of us.
… changing data values to be consistent with basis of record, adding explicit alternative vocabularies, clarifying human readable messages, adding column to specify source authority, adding cases for all valid vocabulary values, adding a range of cases for problematic values.
I created an Excel file (emailed) with worksheets that support one or more test templates from the test datasets done so far (27 SPACE and TIME missing). In doing so (as anticipated), a number of issues arose. Given the propensity of the 99 tests (plus some 'non-printing character' versions) to diverge from a standard template, can I suggest that we use the worksheets (as CSVs)? Currently there are 7, but a) we aren't done yet, and b) there may be a way of combining some of the test datasets. If we combine tests with the same template into a single worksheet, it is simple to edit. I have organized the data so that it can be sorted easily. The single worksheet makes it easier for me to understand the test data, and the same will be true of all those who will be using them.
…o convert the rows in @Tasilee's data sheet in the spreadsheet of tests into a csv file suitable for input into a test harness. Supporting tdwg/bdq#189 used to generate https://github.com/tdwg/bdq/blob/master/tg2/core/TG2_test_validation_data.csv
For the validation data, see: https://github.com/tdwg/bdq/blob/master/tg2/core/TG2_test_validation_data.csv
These csv files are generated from @Tasilee's spreadsheet, into a form that is more readily consumed by a test validation framework, by code in https://github.com/FilteredPush/bdqtestrunner
These csv files and guidance for their use are being assembled for TDWG standards track submission in:
A Zoom discussion on September 29/30 recommended that we develop unit tests for each of the VALIDATIONs. The main justifications (thanks to @tucotuco) were extensibility and minimal maintenance, considering the evolution of the Darwin Core standard on which the TG2 tests are based.
We have 65 VALIDATIONs and would value any assistance in the creation of the unit tests, based on what @chicoreus has proposed with the following template, using #187 as an example.
Test VALIDATION_MAXDEPTH_OUTOFRANGE
GUID 3f1db29a-bfa5-40db-9fd1-fde020d81939
Column 1 is the INPUT (one column for each InformationElement in the test)
Columns 2-3 are the parameter values (one column for each Parameter in the test)
Columns 4-6 are the expected output; values in columns 4 and 5 must match exactly.
Column 7 is a remark on the row in this table, not part of the expected output.
See https://github.com/tdwg/bdq/blob/master/tg2/core/testdata/testdata_VALIDATION_MAXDEPTH_OUTOFRANGE_%23187.csv for the latest version of this file.
Example input values: "1,828.8" and "1828.8".
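A minimal sketch of how a harness might consume a csv laid out with the columns described above (hypothetical code, not the actual bdqtestrunner implementation):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class TestDataConsumerSketch {
    /**
     * Column 1 is the input, columns 2-3 are parameter values, columns 4-6
     * are the expected output (columns 4 and 5 matched exactly), and
     * column 7 is a remark. Assumes no embedded commas; a real harness
     * would use a proper csv parser so that quoted, whitespace-padded
     * values survive intact.
     */
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] col = line.split(",", -1);
                String input = col[0];           // value under test; spaces significant
                String param1 = col[1];          // e.g. bdq:minimumValidDepthInMeters
                String param2 = col[2];          // e.g. bdq:maximumValidDepthInMeters
                String expectedStatus = col[3];  // must match exactly
                String expectedResult = col[4];  // must match exactly
                String expectedComment = col[5]; // human readable, informative only
                String remark = col[6];          // note on the row, not an expectation
                System.out.printf("%s (params %s,%s) -> expect %s/%s [%s]%n",
                        input, param1, param2, expectedStatus, expectedResult, remark);
            }
        }
    }
}
```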