Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add EAD collection import #6297

Merged
merged 5 commits into from
Nov 15, 2024
Merged

Add EAD collection import #6297

merged 5 commits into from
Nov 15, 2024

Conversation

solth
Copy link
Member

@solth solth commented Nov 7, 2024

This pull request adds the option to import EAD collections from uploaded XML files. The import creates one (parent) process for the collection and one (child) for each "file" contained in the XML file, thus allowing complete collections ("Bestände) into individual processes from a single file upload.

To be able to use this EAD import, first a ruleset and an Import configuration of type "File Upload" has to be configured with metadata format EAD:
Bildschirmfoto 2024-11-08 um 00 34 29

Accordingly, the mapping file used in the import configuration has to map EAD to the Kitodo internal metadata format:
Bildschirmfoto 2024-11-08 um 00 39 34

A separate mapping file can be configured to map the collection level of the imported EAD XML file, in the field "Mapping file for parent record":
Bildschirmfoto 2024-11-08 um 00 41 17

This pull request adds example mapping files for the parent and child processes (eadParent2kitodo.xsl and ead2kitodo.xsl) as well as a corresponding ruleset (ruleset_ead.xml) that can be used for this purpose and that of cause can also extended when required.

The user can then use this import configuration to upload an EAD XML file containing the EAD collection. The upload dialog also allows to select the EAD level of the element that is to be transformed into the parent process from COLLECTION, CLASS and SERIES and the EAD level of items that are transformed into child processes from FILE and ITEM:
Bildschirmfoto 2024-11-08 um 00 30 33

Additionally, a limit for the maximum number or child processes that can be processed interactively in the import mask can now be configured in kitodo_config.properties using the parameter maxNumberOfProcessesForImportMask. (this value defaults to 5). If the number of child elements in the uploaded file exceeds this limit, the user is offered with a choice to relocate the import of all processes to a background task:
Bildschirmfoto 2024-11-08 um 00 51 50

If the user has the corresponding permission, the progress of this background import can then be monitored on the "Task manager" page of Kitodo:
Bildschirmfoto 2024-11-08 um 00 53 15

Fixes #5984

@solth solth requested review from thomaslow and BartChris November 8, 2024 12:16
Copy link
Collaborator

@henning-gerhardt henning-gerhardt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some notes from reading the changes.

@@ -108,6 +122,10 @@ public CreateProcessForm() {
this.priorityList = priorityList;
}

public void calculateNumberOfEadElements() throws XMLStreamException {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing JavaDoc

private SearchResult searchResult = null;

/**
* Empty default constructor. Sets default catalog and search field, if configured.
*/
public LazyHitModel() {
public LazyHitModel(CatalogImportDialog dialog) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Javadoc should be adjusted.

if (Objects.isNull(parentID)) {
this.parentTempProcess = null;
return;
public void checkForParent(String parentID, Ruleset ruleset, int projectID) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JavaDoc should be adjusted.

/**
* Set searchTerm.
*
* @param searchTerm as java.lang.String
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

java.lang.String could be simplified to string or the whole wording should be adjusted.

StringConstants.LEVEL, level);
}

public boolean isMaxNumberOfRecordsExceeded(String xmlString, String tagName) throws XMLStreamException {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing JavaDoc.

@@ -409,8 +412,7 @@ public String getParentID(Document document, String higherLevelIdentifier, Strin
*/
public TempProcess createTempProcessFromDocument(ImportConfiguration importConfiguration, Document document,
int templateID, int projectID)
throws ProcessGenerationException, IOException, TransformerException, InvalidMetadataValueException,
NoSuchMetadataFieldException {
throws ProcessGenerationException, IOException, TransformerException {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JavaDoc should be adjusted.

@BartChris
Copy link
Collaborator

BartChris commented Nov 13, 2024

While testing this i first made a dumb configuration error. I associated the wrong XSL-file with the EAD-Mapping-Definition (i associated the pica2ktodo.xsl).
This brought up the following exception message:

KITODO_EAD_ERROR

This error is produced by the fact that the internalRecord does only contain the XML header and no real content at that point.

resultDocument = XMLUtils.parseXMLString((String) internalRecord.getOriginalData());

It is a dumb mistake on my side, but maybe we can assist by also check if we also check if the importDocument is empty (file has no real content) - as we do later with the resultDocument - and the importer has to check the mapping file configuration.

@BartChris
Copy link
Collaborator

BartChris commented Nov 13, 2024

I tested with the given test ead file. A problem appears, when i try to import the file two times. I then get a Null pointer for the createProcessLocation call of the child processes and they are not created.
The problem is that when processTempProcess is called for a child process (with an already existing title)

processTempProcess(tempProcess, rulesetManagement, acquisitionStage, priorityList, null);

the process titles are considered invalid because they do already exist.

if (!ProcessValidator.isProcessTitleCorrect(processTitle)) {
throw new ProcessGenerationException(String.format("Unable to create process (invalid process title '%s')",
processTitle));

This should throw a ProcessGenerationException, but it seems as if this exeption does not stop the execution in the CreateProcessForm. The execution continues and we have processes without an ID.

Then the Nullpointer is thrown in the FileService, because the file location is based on the missing ID.

public URI createProcessLocation(Process process) throws IOException, CommandException {
URI processLocationUri = fileManagementModule.createProcessLocation(process.getId().toString());
createProcessFolders(process, processLocationUri);
return processLocationUri;
}

When processing the list in the background (because the number of imported processes is greater than the configured limit) the process stops with a ProcessGenerationException.

For the background processing the behaviour should probably be more like in the mass import, where items with duplicate titles are skipped. In the form Kitodo should catch invalid titles and force the user to correct them.

* @return number of EAD elements with given level
* @throws XMLStreamException when parsing XML string fails
*/
public static int getNumberOfElements(String xmlString, String eadLevel) throws XMLStreamException {
Copy link
Collaborator

@BartChris BartChris Nov 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public static int getNumberOfElements(String xmlString, String eadLevel) throws XMLStreamException {
public static int getNumberOfEADElements(String xmlString, String eadLevel) throws XMLStreamException {

This can only be used with EAD elements. Maybe a generic function for getting the number of elements is useful as well, but right now it does not work like that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BartChris yes, you are right. This was intended to be used in more general imports, but at the moment only works for the EAD import. I have renamed the method as you suggested.


if (parentElements.size() != 1) {
throw new ProcessGenerationException(
String.format("EAD XML does not contain exactly one element of parent level '%s'!",
Copy link
Collaborator

@BartChris BartChris Nov 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should maybe be translated as it is displayed in the UI.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@BartChris
Copy link
Collaborator

While inspecting the Pull request i noticed a general problem:
#6303

@solth
Copy link
Member Author

solth commented Nov 15, 2024

When processing the list in the background (because the number of imported processes is greater than the configured limit) the process stops with a ProcessGenerationException.

For the background processing the behaviour should probably be more like in the mass import, where items with duplicate titles are skipped. In the form Kitodo should catch invalid titles and force the user to correct them.

@BartChris thanks for your analysis. Actually I implemented it this way deliberately. The goal here was to restore the original state - from before the import started - when an error occurs so that the user can first check the uploaded file for problems, fix those problems and then re-try the upload with the updated file. Therefore, all processes that were already created up to the point where the exception is thrown are "cleaned up", e.g. removed.

I do understand, though, that a differnt behavior could be preferable, where importing collections does not fail with individual problematic EAD data records, but instead skips those processes and proceeds with the next "items" from the EAD file. If there is no consense on which option is better, we could ideally introduce a parameter in kitodo_config.properties to control the error handling that way.

@henning-gerhardt
Copy link
Collaborator

The goal here was to restore the original state - from before the import started - when an error occurs so that the user can first check the uploaded file for problems, fix those problems and then re-try the upload with the updated file. Therefore, all processes that were already created up to the point where the exception is thrown are "cleaned up", e.g. removed.

Maybe nothing for now: but could here help a dry run on importing without a real save on database, index and file system? Deleting processes can even cause additional work in the background.

@solth
Copy link
Member Author

solth commented Nov 15, 2024

Maybe nothing for now: but could here help a dry run on importing without a real save on database, index and file system? Deleting processes can even cause additional work in the background.

I wouldn't do that in this particular case. Uploaded EAD XML files are parsed using a SAX parser in order to be able to handle large XML files without increasing memory requirements. To keep this advantage no lists of complex objects should be retained during the import of an EAD collection (which would grow with XML files containing large collections). Instead each process is completely handled (including saving it to filesystem, index and database) when reaching its corresponding "end event" in the main while loop of ImportEadProcessesThread.

I added a parameter that controls how the system should handle exceptions during EAD import. The default behavior is now as @BartChris described, so that erroneous records are skipped and the import is continued with the following EAD element. In this case, no intermediate processes are removed, so no additional deletion of processes is performed anyway.

@solth solth requested a review from BartChris November 15, 2024 11:47
@BartChris
Copy link
Collaborator

I tested with the given test ead file. A problem appears, when i try to import the file two times. I then get a Null pointer for the createProcessLocation call of the child processes and they are not created. The problem is that when processTempProcess is called for a child process (with an already existing title)

I still have the problem, that when using the standard import way (without background task) i get a NullPointer exception, when i try to import the same EAD collection multiple times.

@solth
Copy link
Member Author

solth commented Nov 15, 2024

I tested with the given test ead file. A problem appears, when i try to import the file two times. I then get a Null pointer for the createProcessLocation call of the child processes and they are not created. The problem is that when processTempProcess is called for a child process (with an already existing title)

I still have the problem, that when using the standard import way (without background task) i get a NullPointer exception, when i try to import the same EAD collection multiple times.

This seems to be a problem in the hierarchical import in general that also occurs in the master branch and is not related to the changes in this pull request. There (in the master branch), I just tried to import a process hierarchy from Kalliope (ID DE-2060-BE-13-C35, import depth = 1, import children = true) twice and on the second try got the same NullPointerException you documented above.

@BartChris If this is easy to resolve I can try to include a fix in this pull request, otherwise I would suggest to handle that issue separately.

@solth
Copy link
Member Author

solth commented Nov 15, 2024

I still have the problem, that when using the standard import way (without background task) i get a NullPointer exception, when i try to import the same EAD collection multiple times.

@BartChris I think this will take a little more time to properly fix. I would therefor suggest to deal with this problem separately. I will create a ticket for this problem.

@BartChris
Copy link
Collaborator

BartChris commented Nov 15, 2024

@solth I agree and approve the Pull request. I think the real usage in the archives has to show, wether we need additional adjustments, but it works good for me.

Some additional ideas for the future:

  • Also allow the background processing for the "normal" mass import
  • Think about a way to provide a report after a background processed mass import (EAD and normal) to report the result of the processing (Skipped processes? Reason?) in a way which is usable for non-admins

@solth
Copy link
Member Author

solth commented Nov 15, 2024

Some additional ideas for the future:

* Also allow the background processing for the "normal" mass import

* Think about a way to provide a report after a background processed mass import (EAD and normal) to report the result of the processing  (Skipped processes? Reason?) in a way which is usable for non-admins

Yes, these are definitely the next two incremental steps that we should try to achieve and that I already had in mind as well. Thanks a lot for the review!

@solth solth merged commit d84f92c into kitodo:master Nov 15, 2024
5 checks passed
@solth solth deleted the ead-import branch November 15, 2024 15:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Import von Archiv-Metadaten aus EAD-DDB-XML um hierarchische Vorgänge automatisch zu erstellen
3 participants