Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

XML Version as File Modified time #431

Merged
merged 4 commits into from
Feb 2, 2022
Merged

Conversation

aodhan-domhnaill
Copy link
Contributor

@aodhan-domhnaill aodhan-domhnaill commented Sep 17, 2021

Description

#395

Code review items

  • Readable: Concise, well named, follows the style guide, documented if necessary.
  • Documented: Configuration and user documentation on cht-docs
  • Tested: Unit and/or integration tests where appropriate
  • Backwards compatible: Works with existing data and configuration. Any breaking changes documented in the release notes.

License

The software is provided under AGPL-3.0. Contributions to this project are accepted under the same license.

@@ -39,6 +39,7 @@ module.exports = async (projectDir, subDirectory, options) => {
}

const xml = fs.read(xformPath);
const xmlVersion = fs.statSync(filePath).mtime;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure Git will preserve this correctly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it looks like it doesn't preserve it. Given this I think we should instead... when building the form in cht-conf, if the output is different than the previous version, store the current time somewhere in a source controlled file.

There are some challenges with this approach... for example

  1. If you repo is out of date then it may create a spurious version but spurious versions aren't the end of the world as long as everyone uses less-than and greater-than comparisons.
  2. If the system clock is out of date the version will jump around. This should be rare especially in a CICD setup.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are okay adding a Git library, then we could use the Git mod time,

$ git log -1 --pretty="format:%ci" test/data/lib/upload-forms/merge-properties/forms/example.xml
2020-04-14 21:54:42 +1200

Copy link
Contributor

@jkuester jkuester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementation looks good, but maybe I am missing something with regards to the actual functionality. How does give us the ability to find the exact version of the xlsx file which generated the form (like Kenn mentioned in his initial description of the issue #395)? AFIK the GitHub hash search only works for commit hashes and not hashes of the actual file contents. I tried searching for the hash of the test form xml (hash:7e3bb121779a8e9f707b6e1db4c1b52aa6e875b5015b41b0a9115efa2d0de1d1) but it did not give me anything useful.

@aodhan-domhnaill
Copy link
Contributor Author

@kennsippell Can you comment about the hash search? If we use Git Commit, it might be better because we could have a timestamp too.

For my usage, I just need to know if the file changed.

@kennsippell
Copy link
Member

kennsippell commented Nov 12, 2021

Is it important to search by hash? - Our old data science team had this problem pretty frequently - looking at data and wanting to map it back to the source. "This user answered potatoes but the question is supposed to be an integer. What was the exact question we asked them?" Many times the answer is simply "we don't know". I do believe there is quite a bit of value in mapping CHT data back to the source...

How to do it

If the hashing algorithm we use is the same as GitHub, this would let users GitHub hash search for the exact version of the form that caused the report.

commit hash vs content hash - that's just my mistake. I think this comment from #395 is false. Even git's "pickaxe" doesn't seem to support searching by content hash. I suspect what you're hashing here can still be used to perform the search if somebody really wants to - just not through a convenient UI that I know of. Sorry to mislead. What you're doing here seems like a good option still.

But note the approach of SHA + MIN(timestamp) query might yield some funk if a form reverts to an exact previous state (A > B > A). I don't have a better suggestion.

@jkuester
Copy link
Contributor

@aodhan-domhnaill @kennsippell Would it make sense to move this hashing code that we have here into convert-forms (maybe as part of the fixXml function) and store the xmlVersion hash maybe as another attribute on the model/instance/form_name element? Then the hash would get saved to the actual xml file (and get committed/version-controlled). Still would not enable GitHub searching, but anyone with the repo checked out locally could just use git log -G<regex> to search the whole revision history of the repo looking for commits containing file changes with the hash text.

@kennsippell
Copy link
Member

kennsippell commented Nov 15, 2021

Makes sense, but I'm not sure we are 100% on committing xml with xlsx changes. Currently with the xml being generated by the CI system, it isn't necessary. If we take this path - it'd be nice for the CI system to also forced the xml commit (either by committing it if different, or failing).

@garethbowen
Copy link
Contributor

@aodhan-domhnaill I have concerns about using the hash alone because I think it'll add significant complexity in the pipeline configuration because the hash won't be sequential. If a form is modified 100 times and a field was added in version 50, you would have to check the hash against all 50 later hashes to work out if the field should be there or not. Furthermore if version 101 was added it'd be easy to forget to update the if-else conditions and the configuration would regress.

This would be much easier with a timestamp so the xml version can be compared with < and >. It sounds like from @kennsippell that the content hash isn't very useful because you can't do a search for the source, but if you'd still like to retain it then I think the xml version should be date + hash so the versions are sequential still.

@aodhan-domhnaill
Copy link
Contributor Author

@garethbowen I would like a timestamp as well, but we can't use the Unix timestamp of the file. Using the timestamp off someones computer seems scary because timezones and local issues could cause a problem. Using the last Git commit for the file would be nice because that comes with a searchable hash and a timestamp. Alternatively, trusted timestamping is an option, but seems excessive.

@kennsippell Will this tool always be run in a Git repo? Is using a Git commit hash/timestamp feasible?

@garethbowen
Copy link
Contributor

Timezones shouldn't be a problem as long as the timestamp is UTC. Local issues (such as computer clock out of sync) could be an issue but that shouldn't be common. I agree that trusted timestamps are overkill. We can't rely on the project using git or any change control at all, but we could try that and fall back to the local timestamp if necessary?

Aidan Macdonald and others added 4 commits December 31, 2021 12:25
$ sha256sum test/data/lib/upload-forms/merge-properties/forms/example.xml
7e3bb121779a8e9f707b6e1db4c1b52aa6e875b5015b41b0a9115efa2d0de1d1  test/data/lib/upload-forms/merge-properties/forms/example.xml
@@ -52,6 +53,8 @@ describe('upload-forms', () => {
const form = await api.db.get('form:example');
expect(form.type).to.equal('form');
expect(form.internalId).to.equal('different');
expect(form.xmlVersion.time).to.equal(123123);
expect(form.xmlVersion.sha256).to.equal('7e3bb121779a8e9f707b6e1db4c1b52aa6e875b5015b41b0a9115efa2d0de1d1');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably worth leaving a comment in here just to call out that if this assert fails (and you just made changes to the example.xml), then this is expected behavior and you can just update the expected has value....

Copy link
Contributor

@jkuester jkuester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

@aodhan-domhnaill aodhan-domhnaill merged commit 9d0b8d7 into master Feb 2, 2022
@aodhan-domhnaill aodhan-domhnaill deleted the 395-xml-form-version branch February 2, 2022 07:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants