Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Mango covering JSON indexes RFC #4410
base: main
Are you sure you want to change the base?
Mango covering JSON indexes RFC #4410
Changes from all commits
06bb7ba
f0d1d06
f368755
fc76a45
c6ea642
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is
mango_cursor_view.erl
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess that is
mango_cursor_view:execute/3
. It might be worth to use fully qualified names to aid readability.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The order of the fields in
include
is not meaningful in any way? Should we add a note highlighting it in the API, just for completeness.As an implementation detail, perhaps we'd just want to normalize it by sorting when creating the design doc and the view signature. That would mean that two indexes with the same details and only the
include
in a different order would be equivalent and "point to" the same view signature.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wonder if there is any value in applying limits, or would that be over-complicating it? We'd have to decide what happens when the limit is reached: crash or ignore and fall back to just loading the doc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Goodness -- great point 🎉
I think that we could validate the length of the list of fields when the ddoc is updated, rather than failing during indexing. We could also limit the depth by counting the
.
characters.One thing we can only validate at index time are things like the length of included strings. I think here that we might want to place a limit on the total size of the values, say 32kb. Even that's quite a few disk pages, though hopefully they are sequential on disk so the kernel's prefilling the page cache ahead of us.
Given it's easier to start with limits and increase them later, perhaps we should think about this more deeply. In a view index we allow ~anything I believe, but here potentially we could be more conservative.
As an example, postgres limits indexes to 32 columns. Its max field size is 1GB; I think we'd like something a little smaller 😬
Are there other limits here?
My thought is that we do limit, and make it configurable, and perhaps start relatively low for the defaults:
mango_json_index_include_fields_max=16
(why 16? Powers of two always sound nice)mango_json_index_include_depth_max=8
mango_json_index_include_size_bytes_max=32768
(32kb)We can enforce
mango_json_index_include_fields_max
andmango_json_index_include_depth_max
in_index
. (We may have to belt-and-braces this as the user can go behind Mango's back to upload views that are the "right shape").mango_json_index_include_size_bytes_max
would need to be checked per document at index time. I worry what the behaviour should be here -- I see options of marking the whole index bad; having rows with "missing" values fields, meaning complexity during query; skipping indexing the document entirely. I lean towards skipping the doc as the least likely to cause unpredictable behaviour, but what's the current behaviour for views if indexing a doc fails?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a section on limits for
include
in 91f99be. I think that the topic is really important and it's a bit facepalm that I skipped it. Too excited about writing Erlang I guess 😬 What do you think?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. Thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure what the behavior for
mango_json_index_include_size_bytes_max
should be either. It's pretty tricky. From the top of my head I don't know what currently happens if we fail to index a document. With mango we don't usually fail, if I had to guess it would be we end up in a crash loop on that document. A user then may index for a day and hit a large field value their index crashes, they'd remove the field, try again (get a new view signature) , index for 2 days and crash, etc.The alternative is to skip the doc but then there is danger of it looking like data loss - user has some very important document (medical record, say allergies to medicine), index skips it, nobody notices until the patient is prescribed the wrong medication (sorry being a bit dramatic here with a silly user story). Maybe that's something we solve outside of the RFC and just that if the value made it past the max_document_size limit then it will be indexed or crash horribly...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if instead of storing plain values in rows we'd allow also storing a marker which would indicate that a value was too large, and just fallback to reading the doc? We can still have a hard-limit (strict) mode perhaps which deals with failures, but this would allows us not to deal with indexing failures so to speak and punt it for later.
There may be an optimal cut-off value where storing it in the index is more wasteful than reading the doc? Or there may be not, as technically I think we can write arbitrarily large values in the index, it will just spread over a lot of b-tree blocks, but we'd still duplicate the data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to avoid doing this if we can -- while nice in some ways, this approach makes it more difficult for a customer to understand the performance of their query. I'd like the performance of queries to be predictable. My feeling is that large fields should require a trip to primary data, and that enforcing smaller value sizes keeps it more likely that indexes can live in RAM, where, I think, they should be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We discussed this a bit in the developer meeting yesterday.
First, we compared this to how this would manifest in M/R views, and the example we used was what if a doc causes a JS exception to be thrown. What we do there is skip the doc and not include any result rows. We do log this in couch.log but it is not surfaced to the HTTP API user.
We went through the motions of whether to add fields inline with the result and also recording a metric for this and concluded from experience that most people do not really care about these things.
However some folks take this seriously, and we want to accommodate those. Imagine this:
_explain
, say:POST /db/_explain?include_result=true
which returns the result like normal, but rows that are missing have an error object in them (or maybe we just show the error rows)So some handwaving to be defined away, but: let’s do this right this time :)