Escaping "=" equal symbols in search using served index #29

Closed
zjgreen opened this issue Jul 29, 2018 · 5 comments

@zjgreen

zjgreen commented Jul 29, 2018

I've found that searches containing = return too many results. My q parameter is name=anything, which gets encoded to name%3Danything when searching via the served index.

Searching via CLI does not have this issue.

Steps to reproduce:

Download tantivy static binary for macOS

https://github.com/tantivy-search/tantivy-cli/releases/download/0.4.2/tantivy-cli-0.4.2-x86_64-apple-darwin.tar.gz

(also reproducible on Linux x86_64)

Create new index

$ ./repro/tantivy new --index repro/index

Creating new index
Let's define it's schema!



New field name  ? text
Text or unsigned 32-bit integer (T/I) ? t
Should the field be stored (Y/N) ? n
Should the field be indexed (Y/N) ? y
Should the field be tokenized (Y/N) ? y
Should the term frequencies (per doc) be in the index (Y/N) ? n
Add another field (Y/N) ? y



New field name  ? name
Text or unsigned 32-bit integer (T/I) ? y
Error: Invalid input. Options are (T/I)
Text or unsigned 32-bit integer (T/I) ? t
Should the field be stored (Y/N) ? y
Should the field be indexed (Y/N) ? n
Add another field (Y/N) ? n

Index our two documents from data.json

# data.json
{"name":"bob","text":"this is text for bob, mainly it contains: name=\"bob\" which we want to find"}
{"name":"mary","text":"this is text for mary, mainly it contains: zame=\"mary\" which we do not want to find"}
$ cat repro/data.json | repro/tantivy index -i repro/index
Commit succeed, docstamp at 2
Waiting for merging threads
Terminated successfully!

Serve index on port 3000

$ ./repro/tantivy serve -i ./repro/index
listening on http://localhost:3000

Search for anything containing an equals symbol

$ curl "http://localhost:3000/api/?q=name%3Danything"

Expected Output

No matches

Actual Output

{
  "q": "name=askljdkfj",
  "num_hits": 1,
  "hits": [
    {
      "doc": {
        "name": [
          "bob"
        ]
      }
    }
  ],
  "timings": {
    "timings": [
      {
        "name": "search",
        "duration": 54,
        "depth": 0
      },
      {
        "name": "fetching docs",
        "duration": 70,
        "depth": 0
      }
    ]
  }
}
@fulmicoton
Collaborator

= signs are treated as whitespace by both the analyzer and the
QueryParser.
Also, by default the QueryParser treats a sequence of tokens as a
disjunction, so your query is logically equivalent to name OR anything.

To get the result you expect you need to add quotation marks around your
query to declare it as a PhraseQuery.

Note that it would then still match a doc that contains a space in place of
the equals sign. This problem cannot be solved without using a custom analyzer.

Finally, if performance is critical for you, there is also plenty of room to
make your use case run much faster.
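To make the behaviour described above concrete, here is a toy model in Python. It is not tantivy's analyzer or QueryParser: `tokenize`, `matches_any` and `matches_phrase` are hypothetical helpers that only assume non-alphanumeric characters act as separators and that unquoted terms are OR-ed together.

```python
import re

def tokenize(text):
    # Toy analyzer (an assumption, not tantivy's tokenizer): lowercase the text
    # and split on anything that is not an ASCII letter or digit, so "=" and '"'
    # behave like whitespace.
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

def matches_any(query, doc):
    # Disjunction: the document matches if it contains at least one query token.
    doc_tokens = set(tokenize(doc))
    return any(t in doc_tokens for t in tokenize(query))

def matches_phrase(query, doc):
    # Phrase: the query tokens must appear consecutively and in the same order.
    q, d = tokenize(query), tokenize(doc)
    return bool(q) and any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))

# The two documents from data.json in the report.
bob = 'this is text for bob, mainly it contains: name="bob" which we want to find'
mary = 'this is text for mary, mainly it contains: zame="mary" which we do not want to find'

print(tokenize("name=anything"))             # the "=" splits the query into two tokens
print(matches_any("name=anything", bob))     # disjunction: "name" alone is enough
print(matches_any("name=anything", mary))
print(matches_phrase("name=anything", bob))  # phrase: "name" must be followed by "anything"
print(matches_phrase("name bob", bob))       # a space in place of "=" still matches
```

Under these assumptions the disjunction matches only the bob document, which is the single hit reported, while the phrase form matches neither document. The last line shows the caveat mentioned above: a phrase query cannot tell `name="bob"` from `name bob`.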

@zjgreen
Author

zjgreen commented Jul 30, 2018

Thanks @fulmicoton. That makes sense. I've found that some queries with special characters work as expected (with phrase matching) in the CLI, but cannot be properly escaped using the serve API.

For now I've just replaced them with a space, and that seems to do the trick.

Thanks!

@zjgreen zjgreen closed this as completed Jul 30, 2018
@fulmicoton
Collaborator

That's surprising. The response payload repeats the query, and in your example the = sign was decoded correctly. If you add double quotation marks around your query, don't you get the result you expect?
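For reference, a minimal sketch of how the quoted form of the query could be percent-encoded for the served index. The endpoint is the one from the report; `search_url` is a hypothetical helper, and only Python's standard `urllib.parse` is used.

```python
from urllib.parse import quote, unquote

BASE_URL = "http://localhost:3000/api/"  # endpoint taken from the report above

def search_url(query):
    # Percent-encode every reserved character, including '"' and '=',
    # so the query reaches the server unchanged.
    return BASE_URL + "?q=" + quote(query, safe="")

phrase_query = '"name=anything"'  # the original query wrapped in double quotes
url = search_url(phrase_query)
print(url)

# Decoding the parameter gives back the exact query, quotes included.
print(unquote(url.split("?q=", 1)[1]) == phrase_query)
```

The resulting URL can be passed to curl as in the original reproduction steps.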

@zjgreen
Author

zjgreen commented Aug 6, 2018

Hi @fulmicoton,

What you say about = being treated the same as a space when I wrap the query in quotes makes sense.

I think I can reproduce what I didn't understand, though. Sorry in advance if it ends up being some elementary encoding issue!

  1. Copy tantivy binary to a directory, and mkdir index

  2. Create new index

$ ./tantivy new -i index/

Creating new index
Let's define it's schema!



New field name  ? text
Text or unsigned 32-bit integer (T/I) ? t
Should the field be stored (Y/N) ? n
Should the field be indexed (Y/N) ? y
Should the field be tokenized (Y/N) ? y
Should the term frequencies (per doc) be in the index (Y/N) ? y
Should the term positions (per doc) be in the index (Y/N) ? y
Add another field (Y/N) ? y



New field name  ? name
Text or unsigned 32-bit integer (T/I) ? t
Should the field be stored (Y/N) ? y
Should the field be indexed (Y/N) ? n
Add another field (Y/N) ? n

[
  {
    "name": "text",
    "type": "text",
    "options": {
      "indexing": "position",
      "stored": false
    }
  },
  {
    "name": "name",
    "type": "text",
    "options": {
      "indexing": "unindexed",
      "stored": true
    }
  }
]
  3. Add a document whose text contains key="value" to the index
$ echo '{"name":"raul","text":"i am phrase=\"value\""}' | ./tantivy index -i index/
Commit succeed, docstamp at 1
Waiting for merging threads
Terminated successfully!
  4. Add another document with other words between "phrase" and "value", so that it should not match the phrase
$ echo '{"name":"paul","text":"i am not a phrase with value"}' | ./tantivy index -i index/
Commit succeed, docstamp at 2
Waiting for merging threads
Terminated successfully!
  5. Serve the index
$ ./tantivy serve -i index/
listening on http://localhost:3000

Attempts to isolate the first document which contains phrase="value"

Attempt 1

$ ./tantivy search -i index/ -q '"phrase=\"value\""'
{"name":["raul"]}
{"name":["paul"]}

Attempt 2

$ ./tantivy search -i index/ -q '"phrase=\\"value\\""'
{"name":["raul"]}
{"name":["paul"]}

Attempt 3

$ ./tantivy search -i index/ -q '"phrase=\\\"value\\\""'
{"name":["raul"]}
{"name":["paul"]}

(pause to laugh at https://xkcd.com/1638/)

Attempt 4

$ curl "http://localhost:3000/api/?q=%22phrase%3D%5C%22value%5C%22%22"
{
  "q": "\"phrase=\\\"value\\\"\"",
  "num_hits": 2,
  "hits": [
    {
      "doc": {
        "name": [
          "raul"
        ]
      }
    },
    {
      "doc": {
        "name": [
          "paul"
        ]
      }
    }
  ],
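One way to rule out an encoding difference between attempt 1 and attempt 4 is to compare the strings involved. The sketch below uses only the Python standard library and the literal values shown above. It assumes a POSIX shell, where backslashes inside single quotes are passed through literally, and it says nothing about how the query parser itself interprets `\"`.

```python
import json
from urllib.parse import unquote

# Argument the shell passes to the CLI in attempt 1
# (inside single quotes, backslashes and double quotes are literal).
cli_query = r'"phrase=\"value\""'

# Query string carried by the URL in attempt 4, after percent-decoding.
http_query = unquote("%22phrase%3D%5C%22value%5C%22%22")

# The "q" value echoed back by the server, copied from the JSON response.
echoed_query = json.loads(r'"\"phrase=\\\"value\\\"\""')

# All three are the same sequence of characters.
print(cli_query == http_query == echoed_query)
```

If the comparison holds, the CLI and the served index received the same query text in attempts 1 and 4, so the extra hit is not introduced by URL encoding.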

Attempt 5 - If I replace = and " with spaces, it returns only the one document

$ ./tantivy search -i index/ -q '"phrase value"'
{"name":["raul"]}


$ curl "http://localhost:3000/api/?q=%22phrase%20value%22"
{
  "q": "\"phrase value\"",
  "num_hits": 1,
  "hits": [
    {
      "doc": {
        "name": [
          "raul"
        ]
      }
    }
  ],
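The workaround from attempt 5 can be sketched as a small client-side helper. `sanitize` and `phrase_search_url` are hypothetical names, and the rule of replacing every non-alphanumeric ASCII character with a space is an assumption that mirrors the manual replacement above; it would also strip accented or other non-ASCII letters.

```python
import re
from urllib.parse import quote

def sanitize(raw):
    # Replace every run of characters other than ASCII letters and digits
    # with a single space, then trim the ends.
    return re.sub(r"[^0-9A-Za-z]+", " ", raw).strip()

def phrase_search_url(raw, base="http://localhost:3000/api/"):
    # Wrap the sanitized text in double quotes so it is sent as a phrase query,
    # then percent-encode it for the q parameter.
    phrase = '"' + sanitize(raw) + '"'
    return base + "?q=" + quote(phrase, safe="")

print(sanitize('phrase="value"'))
print(phrase_search_url('phrase="value"'))
```

For the text `phrase="value"` this produces the same URL as the curl command in attempt 5.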

@fulmicoton
Collaborator

Thanks for the great bug report. I'll have a look at that soonish!
