TG2- Use the word "EMPTY" instead of NULL and provide definition #111

Closed
cgendreau opened this issue Jan 16, 2018 · 14 comments

@cgendreau
Contributor

I don't think the word NULL should be used in test titles and definitions.
I think the word EMPTY should be used instead, and a definition of EMPTY should be added to a glossary.

see #20 (comment)

@ArthurChapman
Collaborator

Thanks Christian - we came to much the same conclusion, with EMPTY to include empty, NULL, /N, -9999 etc. We have made a note to define the term and will bulk change the names. We are changing the descriptions, etc. as we work through them.

@chicoreus
Collaborator

Yes, we need a standard definition for a function isEmpty(informationElement) which returns true or false.

Some values for isEmpty=true are obvious: empty string, null (if the language supports it, e.g. in Java, JavaScript, SQL), undefined (e.g. in JavaScript), a char array of size 0 (e.g. in C).

Other values are probably reasonable as we see them in data as serializations of null out of relational databases. These include "\N", "NULL", "null".

Other values need substantive discussion. One class of these is strings that users put into data to mean an empty value; these include "n/a", "not applicable", "[not applicable]", "[data not available]".
A second class is strings that were historically used to represent empty values in numeric or date fields; these include "0", "9", "99", "999", "", "**", "***--". This second class may well overlap with valid values in some terms.

The scope of what is considered empty by a standard isEmpty() function needs discussion.
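As a starting point for that discussion, here is a minimal Java sketch that keeps the first two tiers above apart. The class name `EmptySketch`, the method names, and the boolean flag are all hypothetical, and the user-entered and historical sentinel values are deliberately left out because their status is not settled.

```java
import java.util.Set;

/** Hypothetical sketch only: not an agreed TG2 definition of EMPTY. */
final class EmptySketch {

    // Assumption: the serializations of null listed above (backslash-N, NULL, null).
    static final Set<String> NULL_SERIALIZATIONS = Set.of("\\N", "NULL", "null");

    /** Tier 1: values that are empty without dispute (null or zero-length string). */
    static boolean isObviouslyEmpty(String value) {
        return value == null || value.isEmpty();
    }

    /** Tier 1, optionally extended with tier 2 (serialized nulls from databases). */
    static boolean isEmpty(String value, boolean treatNullSerializationsAsEmpty) {
        if (isObviouslyEmpty(value)) {
            return true;
        }
        return treatNullSerializationsAsEmpty && NULL_SERIALIZATIONS.contains(value);
    }

    public static void main(String[] args) {
        System.out.println(isEmpty("", false));     // zero-length string
        System.out.println(isEmpty("\\N", true));   // serialized null, tier 2 enabled
        System.out.println(isEmpty("n/a", true));   // user-entered marker, not covered here
    }
}
```

Making the second tier an explicit switch keeps the undisputed cases stable while the scope question remains open.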

@chicoreus
Collaborator

@cgendreau Plan is currently for Alex to update all of the NULL names to EMPTY in bulk after the Gainesville meeting.

@ArthurChapman
Collaborator

See also discussion under #147

@Tasilee
Collaborator

Tasilee commented Mar 28, 2019

I like @tucotuco's rendition and have added it for now in #152:

EMPTY: A field that is needed as input is not present, or, the input field
is present but there is no value in the field, or the field is present and
the value of the field consists entirely of non-printing characters.

@ArthurChapman
Collaborator

Talking to @tucotuco we think we need to combine elements of both versions. We will discuss and come back with a suggestion tomorrow.

@Tasilee
Collaborator

Tasilee commented Apr 23, 2019

In the light of discussions, I have amended the definition of EMPTY in #152 to read "A field that is present but does not contain any characters or values. A field containing non-printing or other invalid characters or values may be separately detected."

The reasoning seems OK, as we already have Expected Responses that state, for example (#162): "INTERNAL_PREREQUISITES_NOT_MET if the field dwc:taxonRank is not present or is EMPTY; COMPLIANT if the value of the field dwc:taxonRank is in the specified source authority; otherwise NOT_COMPLIANT."

In theory, we are then allowing separately for a field that is not present and a field that is EMPTY.
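To illustrate how "not present" and EMPTY stay separate conditions under this wording, here is a rough Java sketch of the Expected Response quoted from #162. It assumes a record is passed as a `Map` from term name to value; the class name `TaxonRankSketch`, the helper names, and the contents of `SOURCE_AUTHORITY` are placeholders for illustration, not the real source authority.

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of the Expected Response logic; not the TG2 implementation. */
final class TaxonRankSketch {

    enum Response { INTERNAL_PREREQUISITES_NOT_MET, COMPLIANT, NOT_COMPLIANT }

    // Placeholder values only; the actual source authority is specified in #162.
    static final Set<String> SOURCE_AUTHORITY = Set.of("family", "genus", "species");

    /** The term does not occur in the record at all. */
    static boolean isNotPresent(Map<String, String> record, String term) {
        return !record.containsKey(term);
    }

    /** The term occurs, but holds no characters or values. */
    static boolean isEmpty(Map<String, String> record, String term) {
        if (!record.containsKey(term)) {
            return false;
        }
        String value = record.get(term);
        return value == null || value.isEmpty();
    }

    static Response validateTaxonRank(Map<String, String> record) {
        String term = "dwc:taxonRank";
        if (isNotPresent(record, term) || isEmpty(record, term)) {
            return Response.INTERNAL_PREREQUISITES_NOT_MET;
        }
        return SOURCE_AUTHORITY.contains(record.get(term))
                ? Response.COMPLIANT
                : Response.NOT_COMPLIANT;
    }
}
```

Both conditions lead to the same response here, but keeping them as two checks matches the way the Expected Response names them separately.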

@chicoreus
Collaborator

I'm putting together test data for #49 VALIDATION_YEAR_EMPTY, and I think we need to revert to the rendition from @tucotuco. We also need to be explicit about spaces: elsewhere we have asserted that whitespace should be trimmed before testing, so a value with only whitespace would be empty. The meaning of invalid characters is very different from that of non-printing characters; the two shouldn't be mixed and should be treated separately. I'd suggest:

EMPTY: A field that is needed as input is not present, or, the input field
is present but there is no value in the field, or the field is present and
the value of the field consists entirely of whitespace or other non-printing
characters.

chicoreus added a commit that referenced this issue Oct 2, 2020
…ble comment in one line and adding a file with examples of non-printing characters (unicode u0000, u0007, and u0020), for discussion of the definition of EMPTY() in #111 and #152.
@Tasilee
Collaborator

Tasilee commented Oct 4, 2020

Thanks @chicoreus. This would seem to cover it but the wording is a little odd. How about

EMPTY: A field that is needed as input is not present, or the input field
is present but either there is no value in the field, or the value of the field
consists entirely of whitespace or other non-printing characters.

We will need to update the TG2 vocabulary @ArthurChapman.

@chicoreus
Collaborator

@Tasilee that works.

@chicoreus
Collaborator

@ArthurChapman the entry in the vocabulary #152 is out of sync with this discussion, though the entry in the vocabulary feels more current than the discussion here.

"EMPTY: A field that is either not present or does not contain any characters or values other than white space. Note: A field containing invalid characters or values (including serializations of NULL values) are NOT_EMPTY and may be separately detected but fields containing only non-printing characters (q.v.) are treated as EMPTY."

It is worth looking at the documentation of String.trim() in Java, which contains the following text: "where space is defined as any character whose codepoint is less than or equal to 'U+0020' (the space character). "

I would suggest that we follow that text and amend the current definition of EMPTY to make use of that definition of space, which covers the other whitespace characters (tab, line feed, carriage return) as well as non-printing characters.

"EMPTY: An information element that is either not present or does not contain any characters or values other than those in the range U+0000 to U+0020. Note: An information element containing invalid characters (e.g. letters in an information element that would be expected to contain integers) or values (including string serializations of the NULL value) are NOT_EMPTY and may be separately detected."

Treating only characters in the range U+0000 to U+0020 as EMPTY also reduces the need to tackle a further problem. At the interface within a mechanism implementing the tests, where data are presented to the test, the mechanism is likely to be unable to distinguish between cases where a term was absent in the original data and cases where the term was present but contained no data. Either case could be handled by the mechanism by presenting a test with a null or an empty string. By starting at U+0000 we are effectively being explicit that null objects are also EMPTY.
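A minimal Java sketch of this proposed definition, relying on the documented behaviour of `String.trim()` quoted above. The class name `EmptyDefinitionSketch` and method name `isEmpty` are hypothetical, and a missing term is assumed to reach the test as a null reference.

```java
/** Hypothetical sketch of the proposed EMPTY definition; not the TG2 implementation. */
final class EmptyDefinitionSketch {

    /**
     * EMPTY if the value is absent (null) or contains nothing but characters
     * in the range U+0000 to U+0020, which is exactly what String.trim() strips.
     */
    static boolean isEmpty(String value) {
        return value == null || value.trim().isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(isEmpty(null));                    // absent term
        System.out.println(isEmpty(" \t\r\n"));               // whitespace only
        System.out.println(isEmpty("\u0000\u0007\u0020"));    // non-printing characters only
        System.out.println(isEmpty("NULL"));                  // serialized null: NOT_EMPTY
    }
}
```

One consequence of following `trim()` is that characters above U+0020, such as the no-break space U+00A0, are not treated as EMPTY under this definition and would need to be detected separately.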

@ArthurChapman
Collaborator

I like it, and have changed the definition in #152.

@Tasilee
Collaborator

Tasilee commented Aug 28, 2024

Can this issue be closed?
