separated_by but keep them joined #576

wyatt-herkamp · 2023-11-25T21:29:45Z


So I am trying to implement a few of the email standards.

Currently trying to parse mailboxes.

Part of the standard is dot-atom-text.
This is a sequence of string segments separated by '.' characters.

You can't have two '.' back to back.
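To make the rule concrete, here is a minimal sketch of the same grammar checked with the standard library only, no parser combinators involved. The function names `is_atext` and `is_dot_atom_text` are made up for illustration and are not part of chumsky or of the code in this thread.

```rust
/// Returns true if `c` is an `atext` character per RFC 2822, section 3.2.4.
fn is_atext(c: char) -> bool {
    c.is_ascii_alphanumeric() || "!#$%&'*+-/=?^_`{|}~".contains(c)
}

/// Checks `dot-atom-text = 1*atext *("." 1*atext)` on a whole string.
fn is_dot_atom_text(s: &str) -> bool {
    // Splitting on '.' yields an empty segment for a leading, trailing,
    // or doubled dot, so every segment must be non-empty and all atext.
    s.split('.')
        .all(|seg| !seg.is_empty() && seg.chars().all(is_atext))
}
```

Under this sketch, an input with a doubled, leading, or trailing dot is rejected, as is the empty string, because each of those produces at least one empty segment.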

Currently I have this:

/// [atext Defined in RFC 2822](https://datatracker.ietf.org/doc/html/rfc2822#section-3.2.4)
pub fn atext<'a>() -> impl Parser<'a, &'a str, char, ErrType<'a>> {
    choice((
        // Instead of having a choice inside of a choice call the parser directly
        one_of('a'..='z'),
        one_of('A'..='Z'),
        one_of('0'..='9'),
        one_of([
            '!', '#', '$', '%', '&', '\'', '*', '+', '-', '/', '=', '?', '^', '_', '`', '{', '|',
            '}', '~',
        ]),
    ))
}
pub fn atext_seg<'a, C>() -> impl Parser<'a, &'a str, C, ErrType<'a>>
where
    C: Container<char>,
{
    atext().repeated().at_least(1).collect::<C>()
}
/// ```ebnf
/// dot-atom-text = 1*atext *("." 1*atext)
/// ```
pub fn dot_atom_text<'a>() -> impl Parser<'a, &'a str, String, ErrType<'a>> {
    atext_seg::<String>()
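        .separated_by(just('.'))
        // The grammar requires at least one segment; this also keeps v[0] below from panicking.
        .at_least(1)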
        .collect::<Vec<_>>()
        .map(|v| {
            if v.len() == 1 {
                return v.into_iter().next().unwrap();
            }
            let mut s = String::with_capacity(v.iter().map(|v| v.len() + 1).sum::<usize>());

            s.push_str(&v[0]);
            for v in v[1..].iter() {
                s.push('.');
                s.push_str(v);
            }
            s
        })
}

I found Vec::join to be slower than what I have, lol.

Anyway, does chumsky have a way of doing a separated_by that keeps the result joined?

The only real reason I am asking is that I am bikeshedding performance for no reason, lol.

Answered by wackbyte

Nov 25, 2023


wackbyte · 2023-11-25T21:43:29Z


To avoid allocation, you could write atext_seg as:

pub fn atext_seg<'a>() -> impl Parser<'a, &'a str, &'a str, ErrType<'a>>
{
    atext().repeated().at_least(1).to_slice()
}

And, if I'm understanding correctly, dot_atom_text can be written in a similar way:

pub fn dot_atom_text<'a>() -> impl Parser<'a, &'a str, &'a str, ErrType<'a>> {
    atext_seg()
        .separated_by(just('.'))
        .at_least(1) // dot-atom-text needs at least one segment
        .to_slice()
}

Right now, you are manually rebuilding the input you parse, char by char, using collect. However, to_slice will return the slice of the input you parsed, for free, without any copying or allocation.
(This is the biggest strength of zero-copy parsing in chumsky 1.0!)
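To illustrate what "returning a slice of the input" means, here is a rough standard-library sketch of the same idea, written by hand rather than with chumsky. The helper `dot_atom_prefix` is hypothetical and only assumes the grammar quoted earlier in the thread; it is not how chumsky implements to_slice.

```rust
/// Hypothetical helper (not a chumsky API): returns the longest prefix of
/// `input` matching `1*atext *("." 1*atext)`, borrowed from `input` itself.
fn dot_atom_prefix(input: &str) -> Option<&str> {
    let is_atext =
        |c: char| c.is_ascii_alphanumeric() || "!#$%&'*+-/=?^_`{|}~".contains(c);
    let mut end = 0; // byte offset just past the last complete segment
    let mut rest = input;
    loop {
        // Length in bytes of the run of atext characters at the start of `rest`.
        let seg = rest.find(|c: char| !is_atext(c)).unwrap_or(rest.len());
        if seg == 0 {
            break; // empty segment: stop before the preceding dot, if any
        }
        end = input.len() - rest.len() + seg;
        rest = &rest[seg..];
        // Only continue if a dot follows the segment.
        match rest.strip_prefix('.') {
            Some(after_dot) => rest = after_dot,
            None => break,
        }
    }
    // No String is built: the result is a view into `input`.
    if end == 0 { None } else { Some(&input[..end]) }
}
```

The returned `&str` points into the original buffer, so comparing `as_ptr()` of the result with `as_ptr()` of the input shows that nothing was copied. The trade-off is that the result cannot outlive the input it borrows from.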

1 reply

wyatt-herkamp Nov 26, 2023
Author

It works! Thank you for the help.
