Query in the frontend does not find proper indexed document #277

Closed
it25fg opened this issue Jul 31, 2023 · 15 comments


it25fg commented Jul 31, 2023

Nextcloud 26.0.4, ElasticSearch 8.6.1

I can manually verify that both indexing (rebuilt after the upgrade to ES 8.6.1) and live indexing work. With a simple HTTP query against Elasticsearch (without further filtering) the document can be found. But occ fulltextsearch:search (and the NC web GUI search) does not find it, neither for the owner nor for any of the sharees.

The only thing that puzzles me is that the owner is not mentioned in the _source/users[] array... is this correct? In the share_names dictionary, the owner is present, and of course in the _source/owner property.

If the above observation is not relevant then there can only be something wrong with the query issued by the Nextcloud app (not sure which of the three components).

I can debug queries and responses: my search daemon is local and I have switched off SSL, so I can simply capture it by tcpdump. How to proceed here?

Author

it25fg commented Aug 11, 2023

In the meantime, the app has been updated twice, current version is 26.0.2 -- I have repeated the procedure of fresh setup and rebuilding the index. Search in the frontend (and using occ fulltextsearch:search) does not find anything, while a pure web request against Elasticsearch finds the expected documents. Somebody out there to look into this?


tuxecert commented Aug 11, 2023

I've encountered a similar issue with both Nextcloud 26.0.4 and 27.0.1, each with the latest version of fulltextsearch. After a fresh full indexing, neither the GUI nor the occ command shows any results. However, I found that the problem only occurs for LDAP accounts. Local Nextcloud users, whether admin or not, are not affected.
It is reproducible for me with fulltextsearch app version 26.0.1 or 27.0.1. I don't have any problems with 26.0.0 or 27.0.0.

Author

it25fg commented Aug 11, 2023

While experimenting with the new versions, I noticed some odd behaviour: occ fulltextsearch:search user string never finds files by content, but if I issue a query that finds a file by name, I sometimes get this exception:

400 Bad Request: {"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"The length [6348384] of field [content] in doc[12188]/index[allfiles] exceeds the [index.highlight.max_analyzed_offset] limit [1000000]. To avoid this error, set the query parameter [max_analyzed_offset] to a value less than index setting [1000000] and this will tolerate long field values by truncating them."}]

But if I look at the document in question, the length (6348384) is the length of the content property, which is the base64-encoded BINARY document! If I enter the same query in the web search, it does not even return (simply the previous results remain forever... seems the exception is swallowed somewhere).
If I (just for testing) increase the index setting by issuing

curl -XPUT http://127.0.0.1:9200/allfiles/_settings -H 'Content-Type: application/json' -d '{"index":{"highlight.max_analyzed_offset":1000000000}}'

the commandline search succeeds, and in the web frontend I get a result (and interestingly, the 2nd line of the result does not show some content snippet, but rather the raw base64 string)
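
For readers who prefer to script this step, here is a minimal Python sketch of the same settings change as the curl call above. The function name `build_highlight_limit_request` is made up for illustration. The sketch only builds the request and prints it. Actually sending it is left commented out and assumes a reachable test instance at the given address.

```python
import json
import urllib.request


def build_highlight_limit_request(base_url: str, index: str, limit: int) -> urllib.request.Request:
    """Build (but do not send) a PUT request that changes the index-level
    highlight.max_analyzed_offset setting, mirroring the curl call above."""
    body = json.dumps({"index": {"highlight.max_analyzed_offset": limit}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


if __name__ == "__main__":
    req = build_highlight_limit_request("http://127.0.0.1:9200", "allfiles", 1_000_000_000)
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To apply it against a reachable test instance (assumption: no auth, no TLS):
    # with urllib.request.urlopen(req) as resp:
    #     print(resp.read().decode("utf-8"))
```

As in the original post, raising the limit this far is meant for testing only.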

So my question is: can it be that the query issued by occ or the web frontend simply invokes a search on the wrong property (content instead of attachment.content) ?

Author

it25fg commented Aug 18, 2023

Further experimenting with NC 26.0.5, ES 8.x ... When I change all occurrences of 'content' to 'attachment.content' in the file apps/fulltextsearch_elasticsearch/lib/Service/SearchMappingService.php:

        // line 270
        $fields = array_merge(['attachment.content', 'title'], $request->getFields());
        // line 420
        $fields = ['attachment.content' => new stdClass()];

then I get two changes:

  • for the first time, a search by content returns correct results at all!
  • the highlight snippet (second line of search result) shows the real text, and not the base64 gibberish!

Please somebody verify. For me, fulltextsearch with Elasticsearch 8.x has never worked.

Author

it25fg commented Aug 27, 2023

Now that I have changed the above file, it turns out that this is not a real solution: even though it makes searching file contents work, it breaks Deck and perhaps other content providers. As soon as I change the search and highlight field names from content to attachment.content, Deck no longer finds cards by content.

So my conclusion for now: the content providers hand their contents to the indexing process differently, and the searcher can search either one form or the other. Could someone knowledgeable (i.e. not me) please step up and change the query generation so that all content can be found, or change the indexing process so that the queries find everything?

PLEASE, is there somebody listening? The current state of fulltextsearch in Nextcloud is that it does not work at all (or somebody prove me wrong). If I should provide logs or anything to diagnose the problem please let me know.

(and YES I know, it's your spare time: it's mine too.)

@XueSheng-GIT

Generally fulltextsearch works for me (NC27.0.2). Do you encounter your issue with specific files only or with all files within your cloud? Did you try to completely rebuild your index?

Author

it25fg commented Aug 28, 2023

To summarize:

  • I'm using Nextcloud 26.0.5, fulltextsearch 26.0.2, Elasticsearch 8.6.1
  • YES I have rebuilt the whole index and there were no errors.
  • When I'm using the http interface of ElasticSearch, I can verify that (1) my documents are there, (2) my documents carry the correct owner/user/share properties and (3) I can find documents by their name and by some random content.
  • When I'm using occ fulltextsearch:search or the search frontend of fulltextsearch in Nextcloud, I do not find any file by its content. Anything else CAN be found correctly: deck cards by their title, deck cards by their content, and files by their name.

So now, I changed SearchMappingService.php and replaced all occurrences of 'content' with 'attachment.content'. Now the situation has partly reversed:

  • searches by name always return correct results (for files and for deck cards too)
  • deck cards CANNOT be found by content anymore
  • files ARE found by content.

From my humble understanding, there must be a basic difference between how documents of different providers are indexed, and how they can be found. And I'm not involved enough to find the right place to correct it.
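
The reversal described above can be illustrated with a toy in-memory model. This is only a sketch: the two documents are hypothetical, and the matcher is a plain case-insensitive substring test rather than Elasticsearch's analysis and scoring. It assumes file documents carry base64 in `content` and text in `attachment.content`, while Deck cards carry plain text in `content`.

```python
from typing import Any, Dict, List, Optional


def get_field(doc: Dict[str, Any], path: str) -> Optional[str]:
    """Resolve a dotted path such as 'attachment.content' in a nested dict."""
    node: Any = doc
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node if isinstance(node, str) else None


def search(docs: List[Dict[str, Any]], field: str, term: str) -> List[str]:
    """Return ids of docs whose field contains term (case-insensitive substring)."""
    hits = []
    for doc in docs:
        value = get_field(doc, field)
        if value is not None and term.lower() in value.lower():
            hits.append(doc["id"])
    return hits


# Hypothetical documents shaped like the two providers described above.
DOCS = [
    {   # file: base64 of "Hello world" in 'content', text in 'attachment.content'
        "id": "files:1",
        "content": "SGVsbG8gd29ybGQ=",
        "attachment": {"content": "Hello world"},
    },
    {   # deck card: plain text in 'content', no attachment
        "id": "deck:1",
        "content": "Hello deck",
    },
]

if __name__ == "__main__":
    print(search(DOCS, "content", "hello"))             # ['deck:1']
    print(search(DOCS, "attachment.content", "hello"))  # ['files:1']
```

Searching a single field can only ever hit one of the two document shapes, which matches the behaviour reported before and after the patch.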

You say it works with Nextcloud 27? I won't upgrade right now, but I can try to compare if the versions of the three fulltextsearch apps carry significant changes between 26.0.2 and 27.0.. Or maybe the Deck app has changed and indexes differently now?


XueSheng-GIT commented Aug 29, 2023

I don't assume this is a general issue because it seems to work on my instances. What does your search query look like?
@R0Wi already provided some guidance on how to get the query (see your parallel issue: #269 (comment)).

Maybe it would also be helpful if you could provide a sample file and specific steps to reproduce this issue so that others can double-check.

Author

it25fg commented Sep 1, 2023

Ok, let's make an example.

  • I create a text file named 'WeirdExample.txt' with the content: This is an example text file that demonstrates what Fulltextsearch does (or more precisely, does not).
  • I wait until the live indexer has indexed it.
  • I do a curl 'http://localhost:9200/allfiles/_search?q=demonstrates' | jq . and the document is found. Some interesting snippets:
  {
    "_index": "allfiles",
    "_id": "files:811155",
    "_score": 9.628709,
    "_source": {
      "owner": "paule",
      "groups": [],
      "circles": [],
      "metatags": [
        "files_local"
      ],
      "source": "files_local",
      "title": "Tests/WeirdExample.txt",
      "users": [],
      "content": "VGhpcyBpcyBhbiBleGFtcGxlIHRleHQgZmlsZSB0aGF0IGRlbW
      "tags": [],
      "attachment": {
        "content_type": "text/plain; charset=ISO-8859-1",
        "language": "en",
        "content": "This is an example text file that demonstrates w
        "content_length": 104
      },
  • As one can see: it shows my username as 'owner', it shows base64 text as 'content', and it shows the plain text as 'attachment.content'.
  • So much for my own query. Now to the app's queries. (Don't be put off by the root prompt and occ without sudo: I've wrapped it.)
root@blackbox:~# occ fulltextsearch:search paule demonstrates
search
> Deck
> Files
root@blackbox:~# occ fulltextsearch:search paule weirdexample
search
> Deck
> Files
 - 811155 score:0
  • As I already wrote: when I search by name it finds the document (files:811155). If I search by content it finds nothing. Now to the app's web interface. It shows exactly the same results: a search by name finds the file, a search by content does not.
  • And as I already wrote: In the file SearchMappingService.php, I change two mentions of 'content' into 'attachment.content', and what do I get:
root@blackbox:~# occ fulltextsearch:search paule demonstrates
search
> Deck
> Files
 - 811155 score:0


Unfortunately, the change I made has only reversed the situation: Now Deck cards cannot be found by their content anymore (which worked without my change).
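
As a side check on the document shown above, the visible prefix of its `content` value can be decoded to confirm that the field really holds the base64-encoded file rather than extracted text. A minimal sketch follows. The helper name `decode_prefix` is made up, and because the value in the post is truncated, only complete 4-character groups are decoded.

```python
import base64

# Visible prefix of the 'content' value from the document above.
# The original value is truncated in the post, so this is not the full string.
CONTENT_PREFIX = "VGhpcyBpcyBhbiBleGFtcGxlIHRleHQgZmlsZSB0aGF0IGRlbW"


def decode_prefix(b64_text: str) -> str:
    """Decode as much of a (possibly truncated) base64 string as forms complete groups."""
    usable = len(b64_text) - (len(b64_text) % 4)
    return base64.b64decode(b64_text[:usable]).decode("utf-8")


if __name__ == "__main__":
    # Prints the beginning of the text that was written to WeirdExample.txt.
    print(decode_prefix(CONTENT_PREFIX))
```

The decoded text starts with "This is an example text file", so a phrase query for "demonstrates" against `content` has only base64 characters to match against.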

If you say 'for me, it works' does it mean you can find documents of different providers (files AND deck) and you can find them by their name as well as by their content?

You wanted to know the app's queries and results. I have stripped the section that queries for sharees and groups I'm in. And I have omitted the query and results for Deck since these don't matter here.

{
  "query": {
    "bool": {
      "must": {
        "bool": {
          "should": [
            {
              "match_phrase_prefix": {
                "content": "demonstrates"
              }
            },
            {
              "match_phrase_prefix": {
                "title": "demonstrates"
              }
            },
            {
              "match_phrase_prefix": {
                "share_names.paule": "demonstrates"
              }
            },
            {
              "wildcard": {
                "title": "*demonstrates*"
              }
            },
            {
              "wildcard": {
                "share_names.paule": "*demonstrates*"
              }
            },
            {
              "query_string": {
                "fields": [
                  "parts.comments"
                ],
                "query": "demonstrates"
              }
            }
          ]
        }
      },
      "filter": [
        {
          "bool": {
            "must": {
              "term": {
                "provider": "files"
              }
            }
          }
        },
        {
          "bool": {
            "should": [
              {
                "term": {
                  "owner": "paule"
                }
              } 
            ]  
          }
        },
        {
          "bool": {
            "should": []
          }
        },
        {
          "bool": {
            "must": []
          }
        },
        {
          "bool": {
            "must": []
          }
        }
      ]
    }
  },
  "highlight": {
    "fields": {
      "content": {},
      "parts.comments": {}
    },
    "pre_tags": [
      ""
    ],
    "post_tags": [
      ""
    ]
  }
}
{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 0,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  }
}

So now: if this all works for you, what is the difference?

@gpgmailencrypt

I'm having -almost- the same problem. The main differences are:

  • I use Nextcloud 27.0.2 with elasticsearch 8.8.2
  • Search in Nextcloud usually works. The error only happens with a few documents, but it is reproducible (to pre-empt the question: yes, I have re-created the index from scratch).

The main part of the error message is:

"Elastic\\Elasticsearch\\Exception\\ClientResponseException","message":"400 Bad Request: {\"error\":{\"root_cause\":[{\"type\":\"illegal_argument_exception\",
\"reason\":\"The length [2054693] of field [content] in doc[69]\/index[nextcloud] exceeds the [index.highlight.max_analyzed_offset] limit [1000000]. 
To avoid this error, set the query parameter [max_analyzed_offset] to a value less than index setting [1000000] and this will tolerate long field values by truncating them.

But I think this error message is misleading. When I change max_analyzed_offset to 99999, I get the same error message, just with the hint that I should set max_analyzed_offset below 99999.

My search word consists of 10 letters. The funny thing is that I can enter the first 3 letters and Nextcloud finds the correct documents, but when I enter the 4th letter the error appears (while in Elasticsearch the whole word works).

This is why I don't think this is just a problem of a few installations; I assume it is a general issue.

@it25fg
Copy link
Author

it25fg commented Sep 1, 2023

> But I think this error message is misleading.

I'd think it can be understood this way: the error message asks you to set the query parameter to less than the index setting, but what you did was to decrease the index setting instead. With the curl statement in #277 (comment) I have done it the other way: I have increased the index setting.

The problem is that Nextcloud's query does not contain a highlight limit at all, so it asks for highlighting in unlimited lengths, and the index refuses it if the given document has a too-long content property which exceeds its own limit.
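
To make the distinction concrete, here is a hedged sketch of what a request-level limit could look like. It assumes, based on the Elasticsearch highlighting documentation, that `max_analyzed_offset` is accepted as a setting inside the `highlight` object of the search body. The helper name `with_highlight_limit` and the value 999999 are illustrative only, and this is not what the Nextcloud app currently sends.

```python
import copy
from typing import Any, Dict


def with_highlight_limit(search_body: Dict[str, Any], limit: int) -> Dict[str, Any]:
    """Return a copy of a search body whose highlight section carries a
    request-level max_analyzed_offset (the original body is left untouched)."""
    body = copy.deepcopy(search_body)
    body.setdefault("highlight", {})["max_analyzed_offset"] = limit
    return body


if __name__ == "__main__":
    original = {
        "query": {"match_phrase_prefix": {"content": "demonstrates"}},
        "highlight": {"fields": {"content": {}}},
    }
    # Illustrative value just below the index limit of 1000000 quoted in the error.
    limited = with_highlight_limit(original, 999_999)
    print(limited["highlight"])
```

With such a limit in the request, highlighting would stop at that offset instead of the index rejecting the whole query.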

Member

R0Wi commented Sep 2, 2023

Hey folks, wanted to give some feedback from my end here.

The max_analyzed_offset parameter

@gpgmailencrypt I think you're right: this app currently does not set the max_analyzed_offset parameter in the query it sends to ES. As also outlined in the ES docs, this can lead to exactly the error you mentioned in some edge cases. So I'd say it would be useful to set a meaningful value for max_analyzed_offset during query construction. @ArtificialOwl might chime in here.

The content field of ES documents

@it25fg I checked your example from #277 (comment) and what I can see is that in my personal ES index the content field of the documents is properly filled with the plaintext values of the files. So a text file in my index looks like this:

 {
        "_index" : "nextcloud_index",
        "_id" : "files:4848622",
        "_score" : 10.761789,
        "_source" : {
          "owner" : "",
          "groups" : [ ],
          "circles" : [ ],
          "metatags" : [
            "files_external"
          ],
          "source" : "files_external",
          "title" : "path/to/file.txt",
          "users" : [
            "__all"
          ],
          "content" : "This is a plaintext non base64",
          "tags" : [ ],
          "attachment" : {
            "content_type" : "text/plain; charset=ISO-8859-1",
            "language" : "ro",
            "content_length" : 174
          },
          "provider" : "files",
          "subtags" : [ ],
          "parts" : {
            "comments" : ""
          },
          "links" : [
            ""
          ],
          "share_names" : [ ],
          "hash" : "0d7b2cd93a9ca0ac89e199a5f4ad2208"
        }
      }

So to me it looks like there's something wrong in the way your documents have been created. As a temporary fix you could try to optionally add attachment.content to your query like this (did not test it):

// ...
"should": [
  {
    "match_phrase_prefix": {
      "content": "demonstrates"
    }
  },
  {
    "match_phrase_prefix": {
      "attachment.content": "demonstrates"
    }
  },
// ...

But you'd need to patch the PHP code for that. Here I'd say that we'd rather need to investigate why your index documents look different from mine.
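
The temporary fix can be tried out on a captured query without touching the PHP code. This is a minimal sketch that assumes the query shape posted earlier in the thread (`query.bool.must.bool.should`). The helper name `add_prefix_clause` is hypothetical, and like the suggestion itself it has not been run against a live index.

```python
import copy
from typing import Any, Dict


def add_prefix_clause(search_body: Dict[str, Any], field: str, term: str) -> Dict[str, Any]:
    """Return a copy of the search body with one more match_phrase_prefix
    clause appended to query.bool.must.bool.should."""
    body = copy.deepcopy(search_body)
    should = body["query"]["bool"]["must"]["bool"]["should"]
    should.append({"match_phrase_prefix": {field: term}})
    return body


if __name__ == "__main__":
    # Reduced version of the query posted earlier in the thread.
    query = {
        "query": {
            "bool": {
                "must": {
                    "bool": {
                        "should": [
                            {"match_phrase_prefix": {"content": "demonstrates"}},
                            {"match_phrase_prefix": {"title": "demonstrates"}},
                        ]
                    }
                }
            }
        }
    }
    patched = add_prefix_clause(query, "attachment.content", "demonstrates")
    for clause in patched["query"]["bool"]["must"]["bool"]["should"]:
        print(clause)
```

The patched body can then be sent to the `_search` endpoint by hand to see whether the extra clause finds the file.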

@XueSheng-GIT

> @it25fg I checked your example from #277 (comment) and what I can see is that in my personal ES index the content field of the documents is properly filled with the plaintext values of the files. So a text file in my index looks like this:

@it25fg I double-checked on Nextcloud 27.0.2 with fulltextsearch-elasticsearch 27.0.2 and Elasticsearch 8.9.1. The index looks like the one shown by @R0Wi (content field filled properly).
Searching with occ finds the files as expected (I followed your steps exactly).
Thus, the main question remains why your index is not created properly. It doesn't seem to be an issue with searching itself.

The highlight.max_analyzed_offset / max_analyzed_offset issue does not seem to be directly related. Maybe these two issues should be separated so that they are easier to follow up on and fix.

Author

it25fg commented Sep 8, 2023

Many thanks @XueSheng-GIT @R0Wi for the enlightenment. So it looks like my index was built wrong? But that's not strictly me who did it: it is the same Fulltextsearch app you all have, and it was the same occ fulltextsearch:index command you all have invoked...

And there, I have it! When I first tried to update ES to 8.x (not knowing that it could not work with NC 25), I ran into the situation that occ fulltextsearch:reset did not create the index and the attachment pipeline, so I wrote a rebuild script that would do it manually:

curl -X PUT 127.0.0.1:9200/${INDEX}
curl -H 'Content-Type: application/json' \
        -X PUT \
        -d '{"description":"Extract attachment information","processors":[{"attachment":{"field":"content","indexed_chars":-1}}]}' \
                127.0.0.1:9200/_ingest/pipeline/attachment

And with the update of NC to 26, I did not remove these requests (because I simply did not know that they weren't necessary anymore). Now I have removed them from my 'rebuild all' script, and the attachment pipeline now is:

# curl -s http://127.0.0.1:9200/_ingest/pipeline/attachment | jq .
{
  "attachment": {
    "description": "attachment",
    "processors": [
      {
        "attachment": {
          "field": "content",
          "indexed_chars": -1
        },
        "convert": {
          "field": "attachment.content",
          "type": "string",
          "target_field": "content",
          "ignore_failure": true
        },
        "remove": {
          "field": "attachment.content",
          "ignore_failure": true
        }
      }
    ]
  }
}

which makes clear that the extracted text is stored back into the 'content' field.
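
The difference between the two pipelines can be simulated in a few lines. This is a simplified sketch and not Elasticsearch code: the real attachment processor extracts text from many file formats, whereas here the payload is assumed to be base64-encoded UTF-8 text, and all function names are invented for illustration.

```python
import base64
from typing import Any, Dict


def attachment_processor(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for the attachment processor: read base64 from 'content'
    and write the extracted text to 'attachment.content'."""
    out = dict(doc)
    text = base64.b64decode(doc["content"]).decode("utf-8")
    out["attachment"] = {"content": text, "content_length": len(text)}
    return out


def convert_and_remove(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for the 'convert' and 'remove' steps: copy attachment.content
    into 'content', then drop attachment.content."""
    out = dict(doc)
    attachment = dict(out.get("attachment", {}))
    if "content" in attachment:
        out["content"] = attachment.pop("content")
    out["attachment"] = attachment
    return out


def manual_pipeline(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Attachment step only, as in the hand-written rebuild script."""
    return attachment_processor(doc)


def app_pipeline(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Attachment, convert and remove steps, as in the pipeline output above."""
    return convert_and_remove(attachment_processor(doc))


if __name__ == "__main__":
    raw = "This is an example text file that demonstrates the difference."
    source = {"content": base64.b64encode(raw.encode("utf-8")).decode("ascii")}
    print(manual_pipeline(source)["content"])  # still base64
    print(app_pipeline(source)["content"])     # extracted plain text
```

In the manual variant `content` keeps the base64 payload, which is what the app's query on `content` was searching. In the app's variant `content` ends up holding the text.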

My index is now being rebuilt. I'll soon be back with new results...

Author

it25fg commented Sep 8, 2023

Many thanks for all the insights and help! Now that occ fulltextsearch:reset correctly creates the index and attachment pipeline config, the index contains the content in the 'content' property (surprise). Closing as resolved.

it25fg closed this as completed Sep 8, 2023