Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Force live object when adding to the index #91

Merged
merged 2 commits into from
Apr 17, 2024
Merged

Conversation

blueo
Copy link
Collaborator

@blueo blueo commented Apr 7, 2024

This change tries to make a consistent way of ensuring a dataobject can only be 'Live' when adding to the index but may be in another state for when it needs to be removed. It does this by using the BEFORE_ADD event to re-fetch a dataobject in "live" mode. This means a non-live dataobject can be given to DataObjectDocument for use when removing or otherwise checking if an object should be indexed - however when adding to an index, it will always use a live object.

Key Changes

  • DataObjectDocument's shouldIndex has been changed to use isPublished instead of isLiveVersion. This means you can pass a DataObject that is not live and checks will still allow it to be indexed if there is a 'live' version available. There is a current edge case where you could have a published object but the 'draft' version is the most recent - in this case a dataobject will be removed from the index even though it has a live version.
  • DataObjectDocument's onAddToSearchIndexes has had logic added to the BEFORE_ADD event to refetch a document in 'live' mode. This ensures only published content will go to an index as this event is called immediately prior to processNode in the indexer.

Notes

Initially I thought this would allow indexing of draft content. Queues are normally run via a "dev task" which is run in the DevelopmentAdmin context. This controller will set the reading mode to DRAFT. The upshot of this is that an Index job will unserialise a DataObject in DRAFT reading mode (saved as a Data Query param) and the subsequent toArray call could get draft content. This is often seen as a link indexed with ?stage=Stage parameters. However because shouldIndex was checking that it was the Live version - this content would always have been the same as the published object.

@blueo blueo force-pushed the pulls/force-live-on-unserialize branch from 5e1579b to 43c744b Compare April 7, 2024 22:46
Copy link
Contributor

@satrun77 satrun77 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not seen this issue, but code look good to me thank ❤️

Copy link

@ssmarco ssmarco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested locally and works as expected 👍

@ssmarco
Copy link

ssmarco commented Apr 9, 2024

@blueo Since we got 2 PRs for this module, I wonder which one should go to the finish line first and that might mean the other one will have to rebase as I also used the tests/Fake/PageFake.php in my PR.

@blueo
Copy link
Collaborator Author

blueo commented Apr 12, 2024

I think I'll update the title for this - after a bit of discussion with @ssmarco it looks like there is a better way to solve this.
It looks as though the logic for passing a live object to the indexer should live with the DataObjectDocument but (unlike changing the _unserialize function) we need the context of if an object is being added or removed from an index. This is to accommodate removing deleted items - as they need to present the 'old' version of a data object where as when adding to the index, we want to ensure an item is live.

A possible fix is via the indexer eg

if ($document instanceof DataObjectDocumentInterface) {
                        // Making sure we get the Live version of the DataObject before indexing the document
                        Versioned::withVersionedMode(static function () use ($document): void {
                            Versioned::set_stage(Versioned::LIVE);
                            $dataObject = $document->getDataObject();
                            $liveDataObject = DataObject::get($dataObject->ClassName)->byID($dataObject->ID);
                            $document->setDataObject($liveDataObject);
                        });
                    }

could be added prior to adding a document in src/Service/Indexer.php

Another is to update the onAddToSearchIndexes function which is called prior to adding a document to the index.

@blueo blueo changed the title Force live object on unserialise Force live object when adding to the index Apr 15, 2024
blueo added 2 commits April 15, 2024 16:49
This corrects behaviour that potentially allows indexing of draft content. Queues are normally run via a "dev task" which is run in the DevelopmentAdmin context. This controller will set the reading mode to DRAFT. The upshot of this is that an Index job will unserialise a DataObject in DRAFT reading mode (saved as a Data Query param) and the subsequent `toArray` call could get draft content. This is often seen as a link indexed with `?stage=Stage` parameters.

The impact is likely limited because an IndexJob is usually queued by a publish action and run shortly afterwards. In these cases it is unlikely that draft changes are present - but is isn't impossible.

I checked the ReIndexJob run and it doesn't appear to have this issue because it sets the reading mode in the setup method. I have instead added this to the unserialise function as it seemed a safer method to ensure the reading mode was as expected.
@blueo blueo force-pushed the pulls/force-live-on-unserialize branch from 6588580 to e95d234 Compare April 15, 2024 04:49
@ssmarco
Copy link

ssmarco commented Apr 15, 2024

@blueo Have retested this again manually and working as expected.

Copy link
Collaborator

@chrispenny chrispenny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm struggling a little bit to replicate the original issue.

Note: I'm using Elasticsearch, not Enterprise Search (just in case that becomes relevant to my difficulties).

  1. Checkout our current master branch
  2. Config applied in project (below), Page includes a link field (which points to the Link() method)
---
Name: app-search
---
SilverStripe\SearchService\Service\IndexConfiguration:
  auto_dependency_tracking: true
  crawl_page_content: false
  common_fields: &common_fields
    ...
    link:
      property: Link
      options:
        type: keyword
  indexes:
    main:
      includeClasses:
        Page:
          fields:
            <<: *common_fields
            ...
  1. Publish a page
  2. Run the Search service adding 1 documents Job
  3. Check index for added document

Expected result: link field contains ?stage=Stage

Actual result: link field does not contain ?stage=Stage


That said, the code changes themselves make a lot of sense to me.

// Dataobject is only in draft
if ($dataObject->hasExtension(Versioned::class) && !$dataObject->isLiveVersion()) {
// DataObject has no published version (or draft changes could cause a doc to be removed)
if ($dataObject->hasExtension(Versioned::class) && !$dataObject->isPublished()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a feeling that your fix below will solve this edge case. The check for isLiveVersion does check the _Live table, so now that you've made sure the DataObject is consistently the LIVE record, I think this would work.

That said, it makes more practical sense to me for us to be checking isPublished(), as I think even if you have a different LIVE version, you would still want to reindex (and you certainly wouldn't want to remove it from the index), so I'm more than happy to make this switch.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change to the Before Add event will actually run after this check. So its possible that you can add a DO that is the latest version, but it is draft (eg you published and then saved some draft changes). In this case the most recent (read draft) DO will be fetched by the _unserialize function because dev tasks run in 'draft' reading mode. Then the isLiveVersion will fail (as its not the live version, it is a version after) and the indexer will remove it.

I don't think we can always make the attached dataobject Live because we need a record for deleted versions (or we'd need to rethink how deletions are done)

@blueo blueo merged commit 4509830 into 3 Apr 17, 2024
18 checks passed
@blueo blueo deleted the pulls/force-live-on-unserialize branch April 17, 2024 03:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants