CQL Vector support #1165
See the following report for details: cargo semver-checks output
Force-pushed: 78f489c to cfdf4e5
I'm not sure this is the correct way to split this PR into commits (I'm pretty sure it isn't, as the commits won't compile on their own), but I can't think of a better way.
Force-pushed: cfdf4e5 to 6aee097
This is needed to deserialize vector metadata as it is implemented as a Custom type with VectorType as its class
Force-pushed: 6aee097 to 440d63a
I only reviewed the first commit (introduction of TypeParser).
Some general comments:
- The logic of TypeParser is quite complex. I suggest adding docstrings next to the type definitions and methods. For example, I have no idea what TypeParser::from_hex does. Docstrings will also help a lot in the future when another developer touches this piece of code.
- It's worth adding comments next to the non-intuitive parts of the code. Example:
if name.is_empty() {
    if !self.is_eos() {
        return Err(CqlTypeParseError::AbstractTypeParseError());
    }
    return Ok(ColumnType::Blob);
}
It's not obvious why we return Blob if name is empty. A link to the corresponding part of the original source code would be helpful.
- Please add some unit tests. I saw that there is a small test of TypeParser in a later commit. I think we should add more tests and try to cover as many parsing cases as we can. In addition, I think that in this case the unit tests should be added in the same commit (they help during review: it's easier to reason about complex code when there are use-case examples one can look at).
- This implementation is based on some existing (probably Java) implementation, correct? If so, please provide the link to the source in the commit. Ideally, the link should be placed in the code comments as well.
    InvalidInetLength(u8),
    #[error("UTF8 deserialization failed: {0}")]
    UTF8DeserializationError(#[from] std::str::Utf8Error),
    #[error(transparent)]
    ParseIntError(#[from] std::num::ParseIntError),
I believe this is some remnant from one of your initial implementations. It's not needed anymore AFAIU (I deleted it locally and everything compiled).
    #[error("Failed to parse abstract type")]
    AbstractTypeParseError(),
- What is an abstract type? I thought that TypeParser parses some specific Custom CQL types. Maybe the variant should be called CustomTypeParseError.
- This needs more context: what exactly failed while parsing the custom type? I propose creating a new error type called CustomTypeParseError. It should be an enum with variants corresponding to the possible causes of failure. Then CqlTypeParseError could have a variant like:
#[error("Failed to parse custom CQL type: {0}")]
CustomTypeParseError(#[from] CustomTypeParseError)
cc: @wprzytula
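A minimal sketch of the error split proposed above, using only the standard library. The variant names of CustomTypeParseError (UnexpectedEndOfInput, InvalidInteger, UnknownClass) and the helper functions are hypothetical and not taken from the PR; in the driver the Display and From impls would presumably come from the #[error] and #[from] attributes instead of being written by hand.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical narrow error: each variant names one way parsing can fail.
#[derive(Debug, Clone, PartialEq)]
pub enum CustomTypeParseError {
    UnexpectedEndOfInput,
    InvalidInteger(String),
    UnknownClass(String),
}

impl fmt::Display for CustomTypeParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::UnexpectedEndOfInput => write!(f, "unexpected end of input"),
            Self::InvalidInteger(s) => write!(f, "invalid integer: {}", s),
            Self::UnknownClass(s) => write!(f, "unknown class: {}", s),
        }
    }
}

impl Error for CustomTypeParseError {}

/// Reduced stand-in for the broad error type; only the new variant is shown.
#[derive(Debug, Clone, PartialEq)]
pub enum CqlTypeParseError {
    CustomTypeParseError(CustomTypeParseError),
}

impl fmt::Display for CqlTypeParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::CustomTypeParseError(e) => {
                write!(f, "Failed to parse custom CQL type: {}", e)
            }
        }
    }
}

impl Error for CqlTypeParseError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            Self::CustomTypeParseError(e) => Some(e),
        }
    }
}

// Hand-written equivalent of `#[from]`: lets `?` widen the narrow error.
impl From<CustomTypeParseError> for CqlTypeParseError {
    fn from(err: CustomTypeParseError) -> Self {
        CqlTypeParseError::CustomTypeParseError(err)
    }
}

/// Hypothetical parser-level helper returning the narrow error.
fn parse_dimensions(text: &str) -> Result<u16, CustomTypeParseError> {
    text.trim()
        .parse::<u16>()
        .map_err(|_| CustomTypeParseError::InvalidInteger(text.to_owned()))
}

/// Hypothetical caller that needs the broad error; `?` converts via `From`.
fn parse_outer(text: &str) -> Result<u16, CqlTypeParseError> {
    Ok(parse_dimensions(text)?)
}
```

With this shape the parser module can return the narrow type everywhere, and only the boundary with the rest of the deserialization code widens it.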
type UDTParameters<'result> = (
    Cow<'result, str>,
    Cow<'result, str>,
    Vec<(Cow<'result, str>, ColumnType<'result>)>,
);
nit: I'd prefer to have it as a struct instead of a type alias. The Cow<'result, str> type appears twice, and it's hard to reason about it without explicit field names.
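A sketch of what the struct form could look like. The field names (keyspace, type_name, field_types) are assumptions, since the tuple alias does not say which Cow is which, and ColumnType here is a reduced stand-in for the driver's type.

```rust
use std::borrow::Cow;

/// Reduced stand-in for the driver's `ColumnType`.
#[derive(Debug, Clone, PartialEq)]
enum ColumnType<'a> {
    Int,
    Custom(Cow<'a, str>),
}

/// Named fields replace the positional tuple; the names are hypothetical.
#[derive(Debug, Clone, PartialEq)]
struct UdtParameters<'a> {
    keyspace: Cow<'a, str>,
    type_name: Cow<'a, str>,
    field_types: Vec<(Cow<'a, str>, ColumnType<'a>)>,
}

/// Call sites read `p.keyspace` instead of `p.0`, so the two strings
/// can no longer be swapped silently.
fn describe(p: &UdtParameters<'_>) -> String {
    format!("{}.{} ({} fields)", p.keyspace, p.type_name, p.field_types.len())
}
```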
> nit: I'd prefer to have it as a struct instead of a newtype.
Actually, a struct is called a newtype. type is a type alias, which is not a new type, but just a new name for an existing type.
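The distinction can be illustrated with a small example. The names Meters, Seconds and double are made up for illustration; the newtype is shown in its usual form, a single-field tuple struct.

```rust
// Type alias: only a second name for f64, fully interchangeable with it.
type Meters = f64;

// Newtype: a distinct type that wraps f64.
struct Seconds(f64);

fn double(x: f64) -> f64 {
    x * 2.0
}

fn double_alias(m: Meters) -> f64 {
    // Accepted as-is, because Meters *is* f64.
    double(m)
}

fn double_newtype(s: Seconds) -> f64 {
    // `double(s)` would not compile: Seconds is not f64.
    // The inner value has to be taken out explicitly.
    double(s.0)
}
```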
pub(crate) struct TypeParser<'result> {
    pos: usize,
    str: Cow<'result, str>,
}
nit: This should probably be renamed to CustomTypeParser (or AbstractTypeParser if we decide to stick to the abstract naming convention). The same goes for the name of the module: type_parser.rs is not specific enough IMO.
pub(crate) fn parse(str: Cow<'result, str>) -> Result<ColumnType<'result>, CqlTypeParseError> {
    let mut parser = TypeParser::new(str);
    parser.do_parse()
}
All of the functions/methods in this module unnecessarily return an error type as broad as CqlTypeParseError. We could narrow it; see my other comment about introducing a separate error type for custom type parsing failures.
if !self.is_eos() && self.str.as_bytes()[self.pos] == b':' {
    self.pos += 1;
    let _ = usize::from_str_radix(&name, 16)
        .map_err(|_| CqlTypeParseError::AbstractTypeParseError());
    name = self.read_next_identifier();
}
What does this part do? Is it tested somewhere?
The whole TypeParser logic was taken straight from ScyllaDB's vector implementation. However, as that is still in development and probably won't be merged for a while, it will be hard to link to it directly. IIRC there are a lot of tests for this functionality there, so they can be borrowed as well.
Ok, makes sense. And let's borrow the tests in that case :)
💯 What a great piece of code! Thank you for the contribution!
There are quite a few comments, though.
I think that the new parser module needs many more unit tests.
Also, tests for particular errors upon serialization and deserialization of Vector are missing.
fn new(str: Cow<'result, str>) -> TypeParser<'result> {
    TypeParser { pos: 0, str }
}

pub(crate) fn parse(str: Cow<'result, str>) -> Result<ColumnType<'result>, CqlTypeParseError> {
    let mut parser = TypeParser::new(str);
    parser.do_parse()
}
⛏️ Let's not use str as a name for a variable - it's the name of a type.
fn char_at_pos(&self) -> char {
    self.str.as_bytes()[self.pos] as char
}
🔧 This may panic. Wouldn't it be better to use the checked get(self.pos) method?
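A sketch of the checked variant suggested above. The Parser struct and its input field are simplified stand-ins for the PR's TypeParser; returning Option<char> is one possible signature, not necessarily the one the PR should adopt.

```rust
/// Simplified stand-in for the PR's `TypeParser`.
struct Parser<'a> {
    pos: usize,
    input: &'a str,
}

impl<'a> Parser<'a> {
    /// Returns the byte at the current position as a char,
    /// or None when the position is past the end of the input.
    /// Slice::get never panics, unlike indexing with `[]`.
    fn char_at_pos(&self) -> Option<char> {
        self.input.as_bytes().get(self.pos).map(|&b| b as char)
    }
}
```

Callers then have to handle the end-of-input case explicitly, for example by mapping None to a parse error.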
fn read_next_identifier(&mut self) -> Cow<'result, str> {
    let start = self.pos;
    while !self.is_eos() && TypeParser::is_identifier_char(self.char_at_pos()) {
        self.pos += 1;
    }
    match &self.str {
        Cow::Borrowed(s) => Cow::Borrowed(&s[start..self.pos]),
        Cow::Owned(s) => Cow::Owned(s[start..self.pos].to_owned()),
    }
🔧 This logic requires comments.
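One way the requested comments could read, shown on a free-standing version of the same logic. The function signature and the character set in is_identifier_char are assumptions made for this sketch; the PR's actual predicate is not shown in the hunk.

```rust
use std::borrow::Cow;

/// Hypothetical identifier alphabet: ASCII letters, digits, '.' and '_'.
fn is_identifier_char(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'.' || b == b'_'
}

/// Reads an identifier starting at byte offset `start` (which must not
/// exceed the input length) and returns it with the offset just past it.
fn read_identifier<'a>(input: &Cow<'a, str>, start: usize) -> (Cow<'a, str>, usize) {
    let bytes = input.as_bytes();
    let mut end = start;
    // Advance while the current byte belongs to an identifier. All accepted
    // bytes are ASCII, so `end` always stops on a UTF-8 char boundary.
    while end < bytes.len() && is_identifier_char(bytes[end]) {
        end += 1;
    }
    let ident = match input {
        // Borrowed input: the sub-slice can keep the original lifetime,
        // so no allocation is needed.
        Cow::Borrowed(s) => Cow::Borrowed(&s[start..end]),
        // Owned input: a sub-slice would borrow from `input`, which does not
        // live for 'a, so the identifier has to be copied out.
        Cow::Owned(s) => Cow::Owned(s[start..end].to_owned()),
    };
    (ident, end)
}
```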
pub(crate) struct TypeParser<'result> {
    pos: usize,
    str: Cow<'result, str>,
⛏️ Let's not use str as a name for a field - it's the name of a type.
pub struct CellWriter<'buf> {
    buf: &'buf mut Vec<u8>,
    cell_len: Option<usize>,
    size_as_uvarint: bool,
}
@Lorak-mmk As you know the serialization framework quite well, could you please aid in review of this commit?
 impl<'buf> CellValueBuilder<'buf> {
     #[inline]
-    fn new(buf: &'buf mut Vec<u8>) -> Self {
+    fn new(buf: &'buf mut Vec<u8>, size_as_uvar_int: bool) -> Self {
         // "Length" of a [bytes] frame can either be a non-negative i32,
         // -1 (null) or -1 (not set). Push an invalid value here. It will be
🏕️ This looks like a typo in the comment: -1 is mentioned twice.
According to the CQL specs, not set is represented using -2.
Could you please fix this as a bonus, @smoczy123?
This typo is present in more than one place in this file.
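For reference, a small sketch of the length conventions discussed in this thread: a non-negative i32 length followed by the bytes, -1 for null and -2 for not set. The Cell enum, CellTooLarge and write_cell are hypothetical names for illustration, not the driver's API.

```rust
use std::convert::TryFrom;

/// Hypothetical error for values longer than i32::MAX bytes.
#[derive(Debug, PartialEq)]
struct CellTooLarge;

/// Hypothetical view of the three states a bound value can be in.
enum Cell<'a> {
    Value(&'a [u8]),
    Null,
    Unset,
}

/// Appends the cell to `buf` with a 4-byte big-endian length prefix.
fn write_cell(buf: &mut Vec<u8>, cell: Cell<'_>) -> Result<(), CellTooLarge> {
    match cell {
        // -1 marks null, -2 marks "not set"; neither is followed by bytes.
        Cell::Null => buf.extend_from_slice(&(-1i32).to_be_bytes()),
        Cell::Unset => buf.extend_from_slice(&(-2i32).to_be_bytes()),
        Cell::Value(bytes) => {
            let len = i32::try_from(bytes.len()).map_err(|_| CellTooLarge)?;
            buf.extend_from_slice(&len.to_be_bytes());
            buf.extend_from_slice(bytes);
        }
    }
    Ok(())
}
```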
 pub struct CellValueBuilder<'buf> {
     // Buffer that this value should be serialized to.
-    buf: &'buf mut Vec<u8>,
+    pub(crate) buf: &'buf mut Vec<u8>,

     // Starting position of the value in the buffer.
     starting_pos: usize,

     // If writing to a fixed length type vector, the type length.
     cell_len: Option<usize>,

     // If serializing a variable length vector cell, the size is encoded as a varint.
     is_variable_length: bool,

     // Buffer for variable length vector cell.
     variable_length_buffer: Option<Vec<u8>>,
 }
💭 🔧 I believe that with the new possible fields, CellValueBuilder should be made an enum, with distinct variants for the non-Vector, const Vector and variable Vector cases (though I'm not sure about separating the latter two cases).
The same goes for CellWriter - I think it would benefit from being made an enum.
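A rough sketch of the enum idea, under several assumptions: the variant names, the CellSizeError type, and the rule that a fixed-length vector element is checked against its expected size are all hypothetical, and the vint encoder is passed in as a closure so that the sketch does not depend on the driver's types module.

```rust
use std::convert::TryFrom;

/// Hypothetical error covering both overflow and size mismatch.
#[derive(Debug, PartialEq)]
struct CellSizeError;

/// Hypothetical enum replacing the Option/bool fields of the struct.
enum CellBuilder {
    /// Ordinary cell: a 4-byte length placeholder sits at `starting_pos`.
    Regular { starting_pos: usize },
    /// Element of a vector whose element type has a fixed size.
    FixedVectorElem { starting_pos: usize, elem_len: usize },
    /// Element of a vector with variable-size elements, staged separately
    /// so that its length prefix can be written before the payload.
    VariableVectorElem { staged: Vec<u8> },
}

impl CellBuilder {
    /// Finalizes the cell in `buf`. `encode_len` stands in for an
    /// unsigned-vint encoder.
    fn finish(
        self,
        buf: &mut Vec<u8>,
        encode_len: impl Fn(u64, &mut Vec<u8>),
    ) -> Result<(), CellSizeError> {
        match self {
            CellBuilder::Regular { starting_pos } => {
                // Patch the placeholder with the number of bytes written after it.
                let written = buf.len() - starting_pos - 4;
                let len = i32::try_from(written).map_err(|_| CellSizeError)?;
                buf[starting_pos..starting_pos + 4].copy_from_slice(&len.to_be_bytes());
            }
            CellBuilder::FixedVectorElem { starting_pos, elem_len } => {
                // Assumption: no prefix is written, only the size is verified.
                if buf.len() - starting_pos != elem_len {
                    return Err(CellSizeError);
                }
            }
            CellBuilder::VariableVectorElem { staged } => {
                encode_len(staged.len() as u64, buf);
                buf.extend_from_slice(&staged);
            }
        }
        Ok(())
    }
}
```

Each variant carries only the state it needs, so combinations such as a fixed length together with a staging buffer cannot be constructed.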
if let Some(buffer) = self.variable_length_buffer {
    let value_len = buffer.len();
    let mut len = Vec::new();
    types::unsigned_vint_encode(value_len as u64, &mut len);
    self.buf.extend_from_slice(&len);
    self.buf.extend_from_slice(&buffer);
} else {
    let value_len: i32 = (self.buf.len() - self.starting_pos - 4)
        .try_into()
        .map_err(|_| CellOverflowError)?;
    self.buf[self.starting_pos..self.starting_pos + 4]
        .copy_from_slice(&value_len.to_be_bytes());
}
💭 The logic around Cell* got quite convoluted with these alterations. Let's think about how we can make it more digestible. @Lorak-mmk @muzarski
use std::vec;
❓ ♻️ Is this used anywhere?
@@ -454,6 +454,8 @@ pub enum CqlTypeParseError {
    TupleLengthParseError(LowLevelDeserializationError),
    #[error("CQL Type not yet implemented, id: {0}")]
    TypeNotImplemented(u16),
    #[error("Failed to parse abstract type")]
    AbstractTypeParseError(),
Nit: it's idiomatic to avoid the parentheses if the list of arguments for the variant is empty.
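A small illustration of the nit. ParseError and check are made-up names; the point is only the difference between a unit variant and an empty tuple variant.

```rust
#[derive(Debug, PartialEq)]
enum ParseError {
    // Unit variant: constructed and matched as `ParseError::AbstractTypeParseError`.
    // Declared as `AbstractTypeParseError()`, every use site would need the `()`.
    AbstractTypeParseError,
    TypeNotImplemented(u16),
}

/// Hypothetical check, only here to construct both variants.
fn check(id: u16) -> Result<(), ParseError> {
    if id == 0 {
        Err(ParseError::AbstractTypeParseError)
    } else if id > 0x30 {
        Err(ParseError::TypeNotImplemented(id))
    } else {
        Ok(())
    }
}
```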
    Vec<(Cow<'result, str>, ColumnType<'result>)>,
);

pub(crate) struct TypeParser<'result> {
It's a bit sad that you had to introduce this type from scratch; we already have very similar parsing utilities in the scylla crate (scylla::utils::parse::ParserState).
I learned from @wprzytula that he suggested moving the scylla::utils::parse module to scylla-cql and then reworking your TypeParser to reuse the existing code. I highly suggest that you do that; we would rather avoid maintaining two separate parsers.
@@ -864,17 +913,12 @@ fn deser_type_generic<'frame, 'result, StrT: Into<Cow<'result, str>>>(
        types::read_short(buf).map_err(|err| CqlTypeParseError::TypeIdParseError(err.into()))?;
    Ok(match id {
        0x0000 => {
            // We use types::read_string instead of read_string argument here on purpose.
It's a bit sad that Cassandra folks didn't bother to add proper support for expressing the vector type in the protocol, instead relying on the custom type...
@@ -0,0 +1,26 @@
## Vector (for Cassandra only!)
I'd rather skip "for Cassandra only" here. This piece of text will eventually become outdated when Scylla starts supporting the type, and nobody will remember to remove it (and you can't really remove it in released versions of the driver, I suppose).
        }
    }

    pub fn type_size(&self) -> Option<usize> {
This is a public method with unclear meaning, please add a docstring.
#[test]
fn test_cassandra_type_parser() {
    let type_name =
        "org.apache.cassandra.db.marshal.VectorType(org.apache.cassandra.db.marshal.Int32Type, 5)";
    assert_eq!(
        TypeParser::parse(Cow::Borrowed(type_name)).unwrap(),
        ColumnType::Vector(Box::new(ColumnType::Int), 5)
    )
}
You introduced a beast of a module (type_parser) which is capable of parsing the syntax of any type (be it a primitive type, list, UDT, vector, etc.), but you added only this one short test. Please add more tests for that module (preferably in the commit which introduced it) in order to increase the coverage.
let string_class_name: String;
let class_name: Cow<'result, str>;
if name.contains("org.apache.cassandra.db.marshal.") {
    class_name = name
Nit: semicolon missing at the end of line
@@ -1534,6 +1534,10 @@ mod legacy {
            CqlValue::Map(m) => serialize_map(m.iter().map(|p| (&p.0, &p.1)), m.len(), buf),
            CqlValue::Tuple(t) => serialize_tuple(t.iter(), buf),

            CqlValue::Vector(_) => {
                unimplemented!("Vector serialization is not implemented yet");
You should start by introducing the lower layers of the code (i.e. serialization / deserialization) and only then move on to extending ColumnType/CqlValue. This way you will avoid awkward unimplemented! invocations, which are a bit cumbersome for reviewers to track and to make sure they are removed at the end.
_ => Err(mk_typck_err::<Self>(
    typ,
    BuiltinTypeCheckErrorKind::SetOrListError(
        SetOrListTypeCheckErrorKind::NotSetOrList,
NotSetListOrVector?
Besides, now I realize that the name of this error type is bad: we will basically have to change its name every time we add support for a new type which deserializes to Vec. That may or may not happen again in the future, but it did happen for the vector data type.
impl<'frame, 'metadata, T> Iterator for VariableLengthVectorIterator<'frame, 'metadata, T>
where
    T: DeserializeValue<'frame, 'metadata>,
{
    type Item = Result<T, DeserializationError>;

    fn next(&mut self) -> Option<Self::Item> {
        self.remaining = self.remaining.checked_sub(1)?;
        let size = types::unsigned_vint_decode(self.slice.as_slice_mut()).map_err(|err| {
            mk_deser_err::<Self>(
                self.coll_typ,
                BuiltinDeserializationErrorKind::RawCqlBytesReadError(
                    LowLevelDeserializationError::IoError(Arc::new(err)),
                ),
            )
        });
        let raw = size.and_then(|size| {
            self.slice
                .read_subslice(size.try_into().unwrap())
                .map_err(|err| {
                    mk_deser_err::<Self>(
                        self.coll_typ,
                        BuiltinDeserializationErrorKind::RawCqlBytesReadError(err),
                    )
                })
        });

        Some(raw.and_then(|raw| {
            T::deserialize(self.elem_typ, raw).map_err(|err| {
                mk_deser_err::<Self>(
                    self.coll_typ,
                    VectorDeserializationErrorKind::ElementDeserializationFailed(err),
                )
            })
        }))
    }
}
Is VectorBytesSequenceIterator used anywhere outside of VariableLengthVectorIterator? Does it make sense to keep it separate? Maybe we can inline it?
This PR adds serialization and deserialization of the CQL Vector type (as implemented in Cassandra), thereby achieving compatibility with Cassandra's Vector type.
Fixes #1014