
Add support for collating sequences on indexed files #132

Conversation

ddeclerck
Contributor

This PR adds support for COLLATING SEQUENCE on indexed files, as mentioned in https://sourceforge.net/p/gnucobol/feature-requests/459/.

Note this has only been done for the BDB backend - that's what our client needs.

@ddeclerck
Contributor Author

Hmm, that broke the NIST testsuite.
Investigating further, it seems we should in fact NOT apply the program collating sequence to indexed files that do not have an explicit collating sequence (as opposed to SORT/MERGE files).

See 12.3.5.3 in ISO/IEC 1989:2014:

```
11) The alphanumeric program collating sequence and national program collating sequence are used to determine
    the truth value of any alphanumeric comparisons and national comparisons, respectively, that are:
    a) Explicitly specified in relation conditions.
    b) Explicitly specified in condition-name conditions.
    c) Implicitly specified by the presence of a CONTROL clause in a report description entry.
    When alphabet-name-1 or alphabet-name-2, or both, is associated with a locale, locale category LC_COLLATE
    is used to carry out these comparisons.

12) The alphanumeric program collating sequence and national program collating sequence explicitly or implicitly
    established by the OBJECT-COMPUTER paragraph are effective with the initial state of the runtime modules
    to which they apply. If alphabet-name-1 or alphabet-name-2 references a locale, the associated collating
    sequence is defined by category LC_COLLATE in the specific locale associated with that alphabet-name or, if
    none, in the locale current at the time the collating sequence is used at runtime.

13) The alphanumeric program collating sequence and national program collating sequence are applied to
    alphanumeric and national sort or merge keys, respectively, unless the sort or merge collating sequence has
    been modified by execution of a SET statement or a COLLATING SEQUENCE phrase is specified in the
    respective SORT or MERGE statement.
```

@GitMensch
Collaborator

Found the answer in 12.4.4.3:

> If no COLLATING SEQUENCE clause is specified:
> a) the collating sequence for alphanumeric record keys, both primary and alternate, is the native alphanumeric collating sequence;
> b) the collating sequence for national record keys, both primary and alternate, is the native national collating sequence.

So this is by default "native", not "program collating".
Please re-adjust, sorry if I was misleading here.

BUT: before adjusting the code, please add a minimal version of the failing NIST test (ideally written from scratch) to our internal testsuite; it should fail before your adjustment and pass afterwards.

(And note that this is also one of the fixed file attributes, which in theory should be checked - we don't do that up to 4.x, but don't check it there either...)

@ddeclerck force-pushed the add_indexed_file_colseq branch from ccf423d to f840670 (January 26, 2024 10:19)
@codecov-commenter

codecov-commenter commented Jan 26, 2024

Codecov Report

Attention: 11 lines in your changes are missing coverage. Please review.

Comparison is base (db7db96) 65.78% compared to head (b594e0e) 65.86%.

File              Patch %   Lines
cobc/tree.c       78.94%    2 Missing and 2 partials ⚠️
cobc/cobc.c       71.42%    1 Missing and 1 partial ⚠️
cobc/codegen.c    90.47%    0 Missing and 2 partials ⚠️
libcob/common.c   60.00%    2 Missing ⚠️
libcob/fileio.c   95.45%    0 Missing and 1 partial ⚠️


Additional details and impacted files
@@                  Coverage Diff                  @@
##           gcos4gnucobol-3.x     #132      +/-   ##
=====================================================
+ Coverage              65.78%   65.86%   +0.08%     
=====================================================
  Files                     32       32              
  Lines                  59416    59481      +65     
  Branches               15694    15708      +14     
=====================================================
+ Hits                   39087    39178      +91     
+ Misses                 14311    14283      -28     
- Partials                6018     6020       +2     


@ddeclerck
Contributor Author

I made the necessary changes. I also added a new -fdefault-file-colseq flag that works like -fdefault-colseq but only for files.

As for the tests, I reworked them and included a dummy program collating sequence (taken from the failing NIST test), which was sufficient to make the testsuite fail prior to fixing the bug.

@GitMensch
Collaborator

That sounds all good!

Please check the test coverage of your changes, the codecov bot says:

> Attention: 188 lines in your changes are missing coverage. Please review.

188 sounds a bit much.

@ddeclerck force-pushed the add_indexed_file_colseq branch from f840670 to 2961e35 (January 26, 2024 12:47)
@ddeclerck
Contributor Author

> That sounds all good!
>
> Please check the test coverage of your changes, the codecov bot says:
>
> > Attention: 188 lines in your changes are missing coverage. Please review.
>
> 188 sounds a bit much.

Fixed! Just had to resynchronize with upstream.
The remaining uncovered lines are either benign or were not covered anyway.

Collaborator

@GitMensch left a comment


There are some general changes (should not be too much work) and one change to cob_key that actually reduces the amount of changes (here the important part again):

Note: If you always place the key collation into each cob_key, then you don't need to have the file-collation in cob_file at all (if set, those will never be empty).

I'd also like to have support for key collation - the runtime already handles that (just always specifying the same collation). That is possibly a bit more work; if you feel it's too much, this part can be postponed.
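A rough sketch of the suggested shape (names are illustrative, not the actual libcob definitions): if every key slot carries its own collation pointer, it can be filled from the explicit key collation, else from the file-level COLLATING SEQUENCE, else left native - and the runtime then only ever consults the per-key value.

```c
/* Illustrative only - not the real libcob structures. */
typedef struct {
	void		*field;		/* key field description */
	unsigned int	 offset;	/* key offset within the record */
	const unsigned char *collating_sequence;	/* per-key table, NULL = native */
} sketch_key;
```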

(review comments on cobc/parser.y, cobc/flag.def, libcob/fileio.c, libcob/coblocal.h, libcob/common.h, tests/testsuite.src/run_file.at)
@ddeclerck force-pushed the add_indexed_file_colseq branch 2 times, most recently from d6f4c29 to 623e27a (January 26, 2024 16:12)
@ddeclerck
Contributor Author

Should be good, though I left the key collating sequence for later (although I did add the field that stores the file collating sequence).

@ddeclerck force-pushed the add_indexed_file_colseq branch from 623e27a to 7d4daef (January 26, 2024 16:24)
Collaborator

@GitMensch left a comment


Looks like we're nearly done; I think this can be checked in next week (and maybe there will be a key collation available by then, too).

(review comments on NEWS, libcob/fileio.c)
@GitMensch
Collaborator

Note: I have a bigger refactoring of the BDB code that I want to push at the start of next week. Should I wait for this PR to be finished and merged, or would you perhaps even prefer that I push the refactoring first?
(I just wanted to fix a bug, recognized that for doing so I should backport the status-02 complex, then got from one thing to the next... and ended up doing a refactoring, using the specific f->keys[idx] instead of an index of a subfield, and the same for the contained keydesc. The refactoring got so big that I should push it separately in any case; the full code of course has a bug somewhere and hides from me.)

@ddeclerck
Contributor Author

ddeclerck commented Jan 26, 2024

> Note: I have a bigger refactoring of the BDB code that I want to push at the start of next week. Should I wait for this PR to be finished and merged, or would you perhaps even prefer that I push the refactoring first?

I plan to make the final changes to this PR on Monday (I need to rest this weekend); if that prevents you from merging, I don't mind rebasing my PR on your changes.

@ddeclerck force-pushed the add_indexed_file_colseq branch 2 times, most recently from 560cd05 to 8a6e7f0 (January 29, 2024 11:15)
@ddeclerck
Contributor Author

ddeclerck commented Jan 29, 2024

Made the last changes, though there might be another one to do.
I'm concerned about the memcmp here (fileio.c:5679):

	/* Check record key */
	bdb_setkey (f, 0);
	if (!p->last_key) {
		p->last_key = cob_malloc ((size_t)p->maxkeylen);
	} else
	if (f->access_mode == COB_ACCESS_SEQUENTIAL
	 && memcmp (p->last_key, p->key.data, (size_t)p->key.size) > 0) {
		return COB_STATUS_21_KEY_INVALID;
	}
	memcpy (p->last_key, p->key.data, (size_t)p->key.size);

All the other memcmp's on keys are fine because they just test (in)equality with 0, but this one tests for > 0. I guess I should replace it with a collation-aware compare, right?
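For illustration, a collation-aware ordering compare might look like the following sketch (the 256-entry translate table argument is an assumption, not the exact fileio.c interface):

```c
#include <stddef.h>

/* Sketch: order two same-length keys through a collating sequence;
   `col` maps each byte to its collation weight (256 entries assumed). */
static int
colseq_compare (const unsigned char *k1, const unsigned char *k2,
		size_t size, const unsigned char *col)
{
	size_t	i;
	for (i = 0; i < size; i++) {
		if (col[k1[i]] != col[k2[i]]) {
			return col[k1[i]] < col[k2[i]] ? -1 : 1;
		}
	}
	return 0;
}
```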

@GitMensch
Collaborator

GitMensch commented Jan 29, 2024 via email

@ddeclerck force-pushed the add_indexed_file_colseq branch from 8a6e7f0 to 2969913 (January 29, 2024 14:30)
@ddeclerck
Contributor Author

> I have a strange feeling now... Can you please verify that:
>
> • the testsuite has at least one entry with variable length keys (which test case is that?) for each SEARCH and INDEXED
> • if missing please add it
> • check if we do store the variable length in the key fields

Doesn't seem to be checked - but shouldn't I address this in a different PR?

> I agree that a memcmp should only be used for testing equality or difference; otherwise either a numeric compare (if the key field is numeric - the byte order could be non-native!) or a collation-aware check needs to be used.

Oh, wait... I believe non-display numeric keys in indexed files are already incorrectly ordered. I just tested with a primary key of usage BINARY-LONG, inserting the keys 4000, 4100 and 4200: on sequential read I get 4100, 4200 and 4000...
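That ordering is exactly what a byte-wise compare of native little-endian storage produces; a minimal standalone demonstration (assuming an x86-style little-endian host):

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>

/* On a little-endian host the low-order byte is stored first, so memcmp
   sees 4000 as A0 0F 00 00, 4100 as 04 10 00 00 and 4200 as 68 10 00 00;
   byte-wise that sorts 4100 < 4200 < 4000. */
int
main (void)
{
	uint32_t a = 4000, b = 4100, c = 4200;
	printf ("a vs b: %d, b vs c: %d\n",
		memcmp (&a, &b, sizeof a),	/* > 0: 0xA0 > 0x04 */
		memcmp (&b, &c, sizeof b));	/* < 0: 0x04 < 0x68 */
	return 0;
}
```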

@GitMensch
Collaborator

GitMensch commented Jan 29, 2024 via email

@ddeclerck
Contributor Author

ddeclerck commented Jan 29, 2024

> That was my feeling. Please address the numeric key sorting, if possible.

I see two ways to fix that:

  • either modifying the bdb_bt_compare function - but that would fix it only for BDB...
  • or always storing integer numeric keys in big-endian order (not sure how to deal with IEEE 754...); but then, if I actually translate the keys so they end up stored in the right order, I might as well do that for alphanumeric keys too, making this PR less relevant...

In any case, I believe this requires some thinking before actually doing it, and I'm not sure it belongs in this PR (which did not introduce the problem).

> I think this should work (and hopefully be tested) in SEARCH already.

Tested, seems to work fine for SEARCH.

> As both can use the same comparison logic (just using a different set of collations), I think it's useful to test that.

Could you elaborate?

> Testing variable length keys belong to this PR if previously possibly completely handled by BDB and now in our own search function.

I'm not sure I understand; could you rephrase that?

@GitMensch
Collaborator

> • either modifying the bdb_bt_compare function - but that would fix it only for BDB...

That would be enough for now and is likely "the most portable" solution. But it creates a HUGE problem: if we do that, then the files need to be rebuilt :-/ Note that using the key type and the appropriate COBOL sort function (if the field structure is within `->app_data` [if not done: store a complete `cb_field` structure in there], you only need to duplicate that `cob_field` to set the two separate fields' data) should directly work for all types. This is effectively what is done for `SEARCH` (and the internal comparison during `SORT`).

Otherwise we could have small-enough numbers converted to big-endian binary and bigger ones converted to USAGE DISPLAY SIGN SEPARATE LEADING in bdb_savekey; that would also work - but again leads to all files needing a rebuild. As this repacking/unpacking would be done on every key access, it is likely slower than the approach above.
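As an illustration of that second approach (a hypothetical helper, not the existing bdb_savekey code), storing the value most-significant byte first makes a byte-wise compare agree with numeric order for unsigned keys; signed keys would additionally need the sign bit inverted so negative values sort first:

```c
#include <stdint.h>

/* Hypothetical: repack an unsigned 32-bit binary key big-endian so that
   a plain memcmp() orders it numerically. */
static void
put_u32_be (unsigned char *dst, uint32_t v)
{
	dst[0] = (unsigned char) (v >> 24);
	dst[1] = (unsigned char) (v >> 16);
	dst[2] = (unsigned char) (v >> 8);
	dst[3] = (unsigned char) (v);
}
```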

As LMDB does not provide an external compare function, the second solution seems to be the only one applicable there (but note that LMDB exists only in trunk, and even there it is marked as experimental and may be dropped).

Note: with ISAM there are key types that handle this (including IEEE 754), but we currently only use CHARTYPE, so the same problem seems to exist there. For this part I'll contact Ron.

> In any case, I believe this requires some thinking before actually doing it, and I'm not sure it belongs in this PR (which did not introduce the problem).

Hm. Please add a test with expected failure and a comment about the underlying problem.
We could otherwise leave that general problem out of this PR, but if a key is purely numeric, no "additional" conversion should be applied that makes the current scenario worse.
That brings us to the good part: if you implement key collation on the cobc side (you already have the codegen and the use of it in libcob), then you can place a check there for a numeric type and just skip generating the collation for it.

> As both can use the same comparison logic (just using a different set of collations), I think it's useful to test that.

> Could you elaborate?

SEARCH/SORT already compares "as cob_field" for any numeric type; if the same is applied to the INDEXED comparisons (you need a cob_field in `app_data` for that), then everything "just works".
But as we noted: this may be kept for a separate PR, especially as we have to consider what to do with existing files (we'd likely have to insert a marker into the file itself, probably via an embedded table, recording "that's type 2, use a different search algorithm"). This is definitely useful, but also definitely "quite big" (and yes, should be kept separate).

> Testing variable length keys belong to this PR if previously possibly completely handled by BDB and now in our own search function.

> I'm not sure I understand; could you rephrase that?

BDB compares keys. Those are set up from `cob_field *component` - if this matched the COBOL field, it could be variable-length; but as far as I can see now, those are always fixed-length; so the issue is again "likely important, but likely big and belongs in another PR".

So... do you add key collations in cobc now and skip setting the collation if the field is numeric (potentially also skip if it is NATIONAL)? That should be enough for the original issue.

> Yeah, they should always be of identical length in our context (and I made a quick check by adding an assert and running both NIST and the testsuite with no errors). Anyway, I preferred to handle the case (the comparison function is rather "low-level" and does not really care about our use case).

The new indexed_compare is our function, specialized to our case - and we cannot even support two different sizes, as we don't have them when the specialized function is used.
Therefore: please adjust indexed_compare() to take only a single size, which also removes the need for cob_min_int().

For the runtime warning - I think that should be an error, but we don't expect a user to ever see it, so remove the translation part and add a coverage exclusion, like

			/* LCOV_EXCL_START */
			if (k++ == MAX_MODULE_ITERS) {	/* prevent endless loop in case of broken list */
				/* not translated as highly unexpected */
				cob_runtime_warning ("max module iterations exceeded, possible broken chain");
				break;
			}
			/* LCOV_EXCL_STOP */

(review comment on cobc/cobc.c)
@ddeclerck
Contributor Author

> • either modifying the bdb_bt_compare function - but that would fix it only for BDB...

> That would be enough for now and is likely "the most portable" solution. But it creates a HUGE problem: if we do that, then the files need to be rebuilt :-/ Note that using the key type and the appropriate COBOL sort function (if the field structure is within `->app_data` [if not done: store a complete `cb_field` structure in there], you only need to duplicate that `cob_field` to set the two separate fields' data) should directly work for all types. This is effectively what is done for `SEARCH` (and the internal comparison during `SORT`).

> Otherwise we could have small-enough numbers converted to big-endian binary and bigger ones converted to USAGE DISPLAY SIGN SEPARATE LEADING in bdb_savekey; that would also work - but again leads to all files needing a rebuild.

So, both solutions imply rebuilding the files. What I like about the first one is that it uses the backend's mechanisms intended for this purpose (i.e. a custom key compare function) - but it has to be implemented specifically for each backend, and not all backends expose such a feature. What I like about the second one is that it is (I think?) backend-agnostic.

> As this repacking/unpacking would be done on every key access, it is likely slower than the approach above.

I don't think we would ever have to translate the "encoded" keys backward, right?

> As LMDB does not provide an external compare function, the second solution seems to be the only one applicable there

Isn't mdb_set_compare what we're looking for?
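mdb_set_compare does exist in the LMDB API; a minimal sketch of registering a collation-aware comparator (`collate_cmp` and the collation table are hypothetical, and the comparator must be set again every time the database handle is opened, before any data access):

```c
#include <string.h>
#include <lmdb.h>

static const unsigned char *file_colseq;	/* hypothetical 256-byte table */

/* Collation-aware key comparator with the usual length tie-break. */
static int
collate_cmp (const MDB_val *a, const MDB_val *b)
{
	const unsigned char *pa = a->mv_data;
	const unsigned char *pb = b->mv_data;
	size_t	n = a->mv_size < b->mv_size ? a->mv_size : b->mv_size;
	size_t	i;
	for (i = 0; i < n; i++) {
		if (file_colseq[pa[i]] != file_colseq[pb[i]]) {
			return file_colseq[pa[i]] < file_colseq[pb[i]] ? -1 : 1;
		}
	}
	return (a->mv_size > b->mv_size) - (a->mv_size < b->mv_size);
}

/* usage, after mdb_dbi_open() and before any access:
	mdb_set_compare (txn, dbi, collate_cmp);	*/
```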

> (but note that LMDB exists only in trunk, and even there it is marked as experimental and may be dropped).

Even if slightly off-topic: I really like the idea of an LMDB backend. What would be the reasons for dropping it? Not performing as expected?

> Note: with ISAM there are key types that handle this (including IEEE 754), but we currently only use CHARTYPE, so the same problem seems to exist there. For this part I'll contact Ron.

> In any case, I believe this requires some thinking before actually doing it, and I'm not sure it belongs in this PR (which did not introduce the problem).

> Hm. Please add a test with expected failure and a comment about the underlying problem.

I guess I'll do that.

> We could otherwise leave that general problem out of this PR, but if a key is purely numeric, no "additional" conversion should be applied that makes the current scenario worse. That brings us to the good part: if you implement key collation on the cobc side (you already have the codegen and the use of it in libcob), then you can place a check there for a numeric type and just skip generating the collation for it.

Well, I think this is handled in codegen.c (output_indexed_file_key_colseq):

	int	type = cb_tree_type (key, cb_code_field (key));
	cb_tree	col = NULL;

	/* We only apply a collating sequence if the key is alphanumeric / display */
	if ((type & COB_TYPE_ALNUM) || (type == COB_TYPE_NUMERIC_DISPLAY)) {
		/* TODO: actually handle record keys (using file key for now) */
		col = f->collating_sequence;
	}

We agree that NUMERIC DISPLAY should be stored lexicographically, right?
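For unsigned fixed-width display keys that holds because the digit characters '0'..'9' are contiguous and ascending in both ASCII and EBCDIC, so a byte-wise compare already matches numeric order (an embedded or trailing sign would of course break this); a quick standalone check:

```c
#include <assert.h>
#include <string.h>

/* Fixed-width unsigned zoned-decimal keys compare correctly byte-wise. */
int
main (void)
{
	assert (memcmp ("004000", "004100", 6) < 0);	/* 4000 < 4100 */
	assert (memcmp ("004100", "004200", 6) < 0);	/* 4100 < 4200 */
	return 0;
}
```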

> As both can use the same comparison logic (just using a different set of collations), I think it's useful to test that.

> Could you elaborate?

> SEARCH/SORT already compares "as cob_field" for any numeric type; if the same is applied to the INDEXED comparisons (you need a cob_field in `app_data` for that), then everything "just works". But as we noted: this may be kept for a separate PR, especially as we have to consider what to do with existing files (we'd likely have to insert a marker into the file itself, probably via an embedded table, recording "that's type 2, use a different search algorithm"). This is definitely useful, but also definitely "quite big" (and yes, should be kept separate).

Got it. As an alternative to an embedded "marker", how about an extra file for such metadata?

> Testing variable length keys belong to this PR if previously possibly completely handled by BDB and now in our own search function.

> I'm not sure I understand; could you rephrase that?

> BDB compares keys. Those are set up from `cob_field *component` - if this matched the COBOL field, it could be variable-length; but as far as I can see now, those are always fixed-length; so the issue is again "likely important, but likely big and belongs in another PR".

> So... do you add key collations in cobc now and skip setting the collation if the field is numeric (potentially also skip if it is NATIONAL)? That should be enough for the original issue.

I think that's what my code does now, except that it uses the file collation for every key - but with all the things to do/fix that we discussed here, I'd better implement complete support for (alphanumeric) key collations right now so we don't have to come back to it later.

> Yeah, they should always be of identical length in our context (and I made a quick check by adding an assert and running both NIST and the testsuite with no errors). Anyway, I preferred to handle the case (the comparison function is rather "low-level" and does not really care about our use case).

> The new indexed_compare is our function, specialized to our case - and we cannot even support two different sizes, as we don't have them when the specialized function is used. Therefore: please adjust indexed_compare() to take only a single size, which also removes the need for cob_min_int().

Alright, I'll change that.

> For the runtime warning - I think that should be an error, but we don't expect a user to ever see it, so remove the translation part and add a coverage exclusion, like

			/* LCOV_EXCL_START */
			if (k++ == MAX_MODULE_ITERS) {	/* prevent endless loop in case of broken list */
				/* not translated as highly unexpected */
				cob_runtime_warning ("max module iterations exceeded, possible broken chain");
				break;
			}
			/* LCOV_EXCL_STOP */

Will do.

@ddeclerck force-pushed the add_indexed_file_colseq branch 2 times, most recently from fab88f5 to 6939a31 (February 6, 2024 10:05)
@ddeclerck
Contributor Author

ddeclerck commented Feb 6, 2024

Rebased, added key collations and a test with expected failure regarding the numeric key sorting problem.
@GitMensch would that be sufficient for now?

@ddeclerck requested a review from GitMensch (February 16, 2024 15:00)
@GitMensch
Collaborator

While reviewing: I've just checked in the refactoring.

Collaborator

@GitMensch left a comment


Apart from the minor review comments and the rebase: that's beautiful - please: go, go, go!

(review comments on cobc/codegen.c, cobc/parser.y, libcob/fileio.c)
@ddeclerck
Contributor Author

ddeclerck commented Feb 19, 2024

All changes accounted for - will merge to SVN if it's okay (I see the tag is already there).
(+ rebased)

@ddeclerck force-pushed the add_indexed_file_colseq branch from ba00023 to b594e0e (February 19, 2024 09:21)
@GitMensch
Collaborator

GitMensch commented Feb 19, 2024 via email

@ddeclerck
Contributor Author

Merged on SVN.

@ddeclerck closed this on Feb 20, 2024
@GitMensch
Collaborator

@ddeclerck Thank you for working on this.
In my review I missed that the new functions were defined globally - that broke the !WITH_DB builds; just fixed that in SVN.

@ddeclerck
Contributor Author

> In my review I missed that the new functions were defined globally - that broke the !WITH_DB builds; just fixed that in SVN.

Ah, I overlooked that - sorry. Hopefully this was an easy fix. Thanks.

@ddeclerck deleted the add_indexed_file_colseq branch (October 4, 2024 11:23)