Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add special encoding of json array elements #3368

Open
wants to merge 49 commits into
base: develop
Choose a base branch
from

Conversation

islamaliev
Copy link
Contributor

Relevant issue(s)

Resolves #3329

Description

Mark array elements in a special way so that it's possible to scope the search only to arrays.

Introduces JSONPath that helps differenciate parts of json path.

Also changes the behavior of _none filter operator so that it's exlude resuls if they are not arrays.

client/json.go Outdated Show resolved Hide resolved
@islamaliev
Copy link
Contributor Author

question: What made you go with approach instead of just checking the type of the value as/when needed? Why did you feel like the encoding needed to change?

when index is reading encoded keys it has how idea what's inside. It has either a prefix or an exact value. We use to encode like this (taking the example from the issue):
For these docs:

{
  "name": "Andy",
  "custom": {
    "numbers": 4,
  }
}
{
  "name": "John",
  "custom": {
    "numbers": [3, 4],
  }
}

Old way to store indexes:

/1/1/custom.number/3/john-doc-id
/1/1/custom.number/4/andy-doc-id
/1/1/custom.number/4/john-doc-id

So when one of these is fetched there is no info if it's array or not.
New way (byte code is a bit different than present here):

/1/1/custom.number.0/3/john-doc-id
/1/1/custom.number.0/4/john-doc-id
/1/1/custom.number/4/andy-doc-id

@AndrewSisley
Copy link
Contributor

AndrewSisley commented Jan 13, 2025

when index is reading encoded keys it has how idea what's inside. It has either a prefix or an exact value. We use to encode like this (taking the example from the issue): For these docs: ...

Excellent explanation, makes sense to me now, thanks Islam!

/1/1/custom.number.0/3/john-doc-id
/1/1/custom.number.0/4/john-doc-id

question: What is the significance of 0 here? Is both of them being 0 a typo? Could this not result in a clash if another doc defines a json-map with a 0 key?:

{
  "name": "John",
  "custom": {
    "numbers": [3, 4],
  }
}
{
  "name": "Andy",
  "custom": {
    "numbers": {
      0: "foo"
    },
  }
}

// Results in keys:

/1/1/custom.number.0/3/john-doc-id
/1/1/custom.number.0/4/john-doc-id
/1/1/custom.number.0/foo/andy-doc-id

@islamaliev
Copy link
Contributor Author

What is the significance of 0 here?

0 is just a random marker I chose for now. We can easily turn this into a real array index:

/1/1/custom.number.0/3/john-doc-id
/1/1/custom.number.1/4/john-doc-id

but it's not necessary for our use case for now. Later we can employ it when we need to process more sophisticated array queries.

"numbers": {
0: "foo"
}

this is illegal json. This would be correct

{
"numbers": {
       "0": "foo"
   }
}

And 0 and "0" are encoded differently, so there is no collision there

client/json.go Outdated Show resolved Hide resolved
client/json.go Outdated Show resolved Hide resolved
}, client.TraverseJSONOnlyLeaves(), client.TraverseJSONVisitArrayElements(false))
},
// we don't want to traverse intermediate nodes, because we encode only values that can be filtered on
client.TraverseJSONOnlyLeaves(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: Why are these functions declared in the client package? Do you think users will want to use them, and if so, for what?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it doesn't have to be in client package. It can be easily moved elsewhere? Do you have an idea where it would fit best?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are only ever called in internal/db/index.go, so my preference would be to make them private funcs in that file. But the most important thing for me is to move them out of the users' scope and to somewhere internal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

turns out it's not that easy to move it out.

I created a ticket.

Copy link
Contributor

@AndrewSisley AndrewSisley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your earlier answers Islam, I have now reviewed the PR - looks good, but I think you are dropping an error, and a little more documentation would be handy.

client/json.go Outdated Show resolved Hide resolved
@@ -59,17 +143,17 @@ type JSON interface {
Marshal(w io.Writer) error

// GetPath returns the path of the JSON value in the JSON tree.
GetPath() []string
GetPath() JSONPath
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo: Please note that this is the index path only, and does not represent the path of the base (datastore-doc) data.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as I said earlier, although at the moment it's used only by the indexer, it is not index-specific, hence doesn't need to mention it.
And there no point in saying anything about storage, as it's irrelevant. It says "in the JSON tree".

I thought we agreed that TODOs are going to be used only if a reviewer considers the code (or absence of it) a blocker. I don't think any request for comment or documentation can ever be considered a blocker. To me it feels like we failed the experiment at it's very start.

Copy link
Contributor

@AndrewSisley AndrewSisley Jan 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think any request for comment or documentation can ever be considered a blocker.

I very strongly disagree with this. Anyone can write greenfield code that gets the machine to do what they want. The hard, and valuable, bit is writing maintainable code that it readable by your peers, including your future self. Code documentation is a valuable tool when it comes to improving code readability, and IMO reviewers, with their fresh, unfocused eyes are far better judges than authors when it comes to guessing where information is missing.

The fact that there is still confusion around what this function and it's types represent suggests very strongly to me that the code is not self describing and requires further documentation in order for maintainers to safely and efficiently maintain it.

This particular doc request

I think perhaps a large contribution to the continued confusion here is a name clash! The existing code/documentation uses the word index a lot, and most of the time it is not referring to secondary indexes (the only place where the code is consumed), but to the index of elements in an array.

It is further confused by the magic 0, that forms part of the encoded result of this path, that has the appearance of an array-location, yet is not. The existence/behaviour of this confuses the example in the documentation for JSONPathPart for myself, although without looking into the codebase/raw-store this problem will not affect users.

Perhaps we can reduce this confusion by:

  1. suggestion: Avoid the use of the word index in the documentation for JSONPathPart, for example perhaps talk about the location in the array instead of array index.
  2. suggestion: Change the example in the documentation for JSONPathPart to avoid talking about the first index, so that we avoid the ambiguity of the hardcoded 0. Perhaps add a second element to the example array and talk about that instead. And/or note the existence of the magic 0 in that documentation and what it represents (that might be out of place though).
  3. suggestion: Add a line to the documentation for GetPath noting that it is currently only used internally for the encoding of secondary indexes (please make sure you include the word secondary, to minimise the whole index vs index confusion).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case I think the current documentation is relevant and precise enough. This is not index specific and not related to datastore keys so I don't think its relevant to state that it does not represent a datastore doc path.

op, ok := key.(*mapper.Operator)
if ok && isArrayCondition(op.Operation) {
if op.Operation == compOpNone {
// if the array condition is _none it doesn't make sense to use index
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo: Please document why it does not make sense to use an index for _none.

Copy link
Contributor Author

@islamaliev islamaliev Jan 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it was obvious. It feel like I'm writing information that has no value.

I have to ask you @AndrewSisley again why you mark this as TODO. Even without the experiment I would disagree with it's TODO status.
It feels like you are not onboard with the experiment and going the opposite direction. I'm a bit frustrated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is obvious to you because you wrote it, and have spent considerably more time than most thinking about it.

It is a blocker, because I want code that is easy to read merged into to the codebase. This review took far longer than it needed to because it was not easy to read, it felt like you dumped 600 lines of undocumented code on us with a 3 line PR description. We are both frustrated. If you want quicker reviews, document your work so that other people can read it.

The experiment is not a 'merge whatever you like whether others can work with it or not' sprint.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I spoke with Fred, he agreed with you that this was not a blocker. He suggested I make it a suggestion, and successfully convinced me that it was suggestion worthy, and that my original comment was information-sparse :)

suggestion: This comment feels incomplete and raises questions in the reader's mind. It would speed up reading if it described why it does not make sense to use an index for _none.

Perhaps you could could change it to something along the lines of:

// if the array condition is _none it doesn't make sense to use index as we need to do a full table scan anyway

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed in person, if we write documentation to make the code more readable, we should make an effort to describe why the code has this behaviour. In this case, the why is that the full table scan that would be required on the index would perform worst than the datastore full table scan so we don't use the index. It makes sense to update it in my opinion.

A todo was maybe a little much in this case but I don't think the comment goes against the experiment as we also discussed in person.

internal/db/fetcher/indexer_iterators.go Outdated Show resolved Hide resolved
internal/db/fetcher/indexer_iterators.go Outdated Show resolved Hide resolved
internal/db/fetcher/indexer_iterators.go Outdated Show resolved Hide resolved
Copy link
Member

@shahzadlone shahzadlone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/query Related to the query component feature New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Sec. index: make json array elements be encoded with depth information
4 participants