Metadata functionality implementation #259

Open: wants to merge 7 commits into base: master

samgregory

With grateful thanks to @FredrikKarlssonSpeech for his permission to use code from the https://github.com/humlab-speech/reindeer re-implementation of emuR, here is a working and tested implementation of metadata for emuR.

These functions meet the requirements outlined in multiple issues, notably #130:

  • Metadata is stored in JSON format at the bundle, session and database level using the suffix+extension _meta.json
  • Bundle metadata takes precedence over session metadata, and session metadata over database metadata
  • Metadata can be added / overwritten at each level by the built-in function (add_metadata) or imported from and exported to a Microsoft Excel Open XML Format Spreadsheet file
  • Recording metadata (e.g. SHA1 checksum) can be added or appended to existing bundle metadata
  • A metadata table can be requested (get_metadata) or metadata can be bound to an existing query seglist (biographize)
  • A helper function (list_metafiles) can list all metadata files found in a database structure
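
To illustrate the precedence rule for reviewers, here is a minimal, language-neutral sketch in Python. It is not the R code in this PR: the function name `resolve_metadata` and the field names (`Language`, `Microphone`, `Speaker`) are hypothetical, and it assumes each level's _meta.json has already been parsed into a flat key/value mapping.

```python
def resolve_metadata(database, session, bundle):
    """Merge three metadata dicts; the more specific level wins on conflicts."""
    merged = dict(database)   # start from database-level values
    merged.update(session)    # session values override database values
    merged.update(bundle)     # bundle values override both
    return merged


# Hypothetical example values, not taken from the ae test database.
database_meta = {"Language": "German", "Microphone": "unknown"}
session_meta = {"Microphone": "headset", "Speaker": "S1"}
bundle_meta = {"Speaker": "S2"}

resolved = resolve_metadata(database_meta, session_meta, bundle_meta)
print(resolved)
```

Only the fields a bundle or session actually sets are overridden; everything else is inherited from the level above.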

There are some major (breaking) changes compared to the reindeer implementation:

  • moving from .meta_json to the more emu-like _meta.json suffix/extension
  • removing metadata from the _DBconfig.json (which broke the DBconfig schema and thus prevented opening a database in emuWebApp)
  • importing from XLSX spreadsheets will calculate and remove duplicate inherited values at the bundle and session level, thus preserving the utility of the precedence scheme
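
One way to read "remove duplicate inherited values" is sketched below in Python. This is an assumption about the intent, not the implementation in this PR: `strip_inherited` and all field names are hypothetical, and each spreadsheet row is assumed to have been turned into a flat mapping already.

```python
def strip_inherited(own, inherited):
    """Keep only entries that are new or differ from the inherited value."""
    return {
        key: value
        for key, value in own.items()
        if key not in inherited or inherited[key] != value
    }


# Hypothetical values as they might appear in a fully expanded spreadsheet.
database_meta = {"Language": "German"}
session_row = {"Language": "German", "Microphone": "headset"}
bundle_row = {"Language": "German", "Microphone": "headset", "Speaker": "S2"}

# A session keeps only what the database does not already provide.
session_own = strip_inherited(session_row, database_meta)

# A bundle is compared against everything it would inherit.
bundle_own = strip_inherited(bundle_row, {**database_meta, **session_own})

print(session_own, bundle_own)
```

Storing only the differing values means a later change at the session or database level still propagates to every bundle that did not explicitly override it.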

Complete documentation exists for all new functionality and new .Rmd files have been supplied in this pull request.
Unit testing of core features (get_metadata, add_digest, import / export) has been implemented with dummy metadata for the ae test database.

I note the addition of the memoise dependency. It was found during testing that calls to get_metadata were very slow for any database of a non-trivial size. emuR resolves the same issue of data distributed throughout _annot.json files by creating and accessing an SQL cache. This method seemed overly complicated for metadata, so memoise was used to cache the results from a call to export_metadata. emuR will only read _meta.json files once per loaded database handle, or again once changes are written to _meta.json files from within emuR. Any changes to these files made outside of emuR require a get_metadata(clearCache = TRUE) call.
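
The caching behaviour described above can be sketched as follows. This is a Python analogue using `functools.lru_cache`, not the memoise-based R code in this PR: `_load_metadata`, `get_metadata_cached` and the `clear_cache` argument are hypothetical stand-ins, and the expensive file walk is replaced by a counter so the effect is visible.

```python
from functools import lru_cache

read_count = 0  # how many times the expensive read has actually run


@lru_cache(maxsize=None)
def _load_metadata(db_path):
    """Stand-in for walking the database and parsing every _meta.json file."""
    global read_count
    read_count += 1
    # Placeholder result; a tuple is immutable, so sharing it via the cache is safe.
    return (("db_path", db_path),)


def get_metadata_cached(db_path, clear_cache=False):
    """Return cached metadata; clear_cache=True forces a fresh read."""
    if clear_cache:
        _load_metadata.cache_clear()
    return _load_metadata(db_path)


get_metadata_cached("ae_emuDB")
get_metadata_cached("ae_emuDB")           # served from the cache
reads_after_two_calls = read_count

get_metadata_cached("ae_emuDB", clear_cache=True)  # e.g. after an external edit
reads_after_clear = read_count

print(reads_after_two_calls, reads_after_clear)
```

The trade-off is the one noted above: the cache cannot see edits made outside the process, so those need an explicit clear.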

openxlsx becomes a suggested package, without which import and export of Excel spreadsheets will fail (gracefully, with a message to load the openxlsx package).

samgregory added 7 commits May 6, 2022 15:45
Complete with documentation and unit testing.
New package dependency: memoise for metadata caching
Package suggestion: openxlsx for metadata <-> Excel import/export