This is going to be interesting project for me.

I'm gonna build modern parsing library in Rust and I'll share here my experience.

Note - I'm not going to sell you golden solution - I have no final answers. Instead, I'd like to think about these articles as my notes and diary. Hopefully when I finish both you and I will learn something new.

Ok. Let's start from the beginning.

Assumptions

As a base for my design I'm choosing this article.

Therefore, my parser needs to be:

  • Handwritten - I don't want to learn another DSL syntax for another grammar generator, plus it gives to programmer more control.
  • Lossless - it has to handle all whitespace characters, it cannot simply skip them,
  • With good error recovery - it cannot fail after first error. It always must produce some Tree.

Part 1. Lexer

At the beginning we need to think about Lexer. We could use Parsing Combinators library and therefore skip lexing but actually it's pretty handy idea to split Parsing from Token detection - it makes life easier. Trust me, I was stubborn about it and I regretted it.

First of all we need some trivial traits to ease our journey.

This particular one gives us handy information about offset of one sublice in reference to another sublice. Note, we are doing here operation on pointers so second argument has to be subslice. Therefore, it can be pretty unsafe.

mod offset {
    pub trait Offset {
        fn offset(&self, second: &Self) -> usize;
    }

    impl Offset for str {
        fn offset(&self, second: &Self) -> usize {
            let fst = self.as_ptr();
            let snd = second.as_ptr();

            snd as usize - fst as usize
        }
    }

    impl<'a> Offset for &'a str {
        fn offset(&self, second: &Self) -> usize {
            let fst = self.as_ptr();
            let snd = second.as_ptr();

            snd as usize - fst as usize
        }
    }
}

Offset trait I took from fantastic parser combinator library nom which is under MIT license.

Another one is quite important, and I miss it in the standard library. Note the difference between this trait and the Peekable struct. Peekable should implement this trait.

mod peekable {
    pub trait PeekableIterator: Iterator {
        fn peek(&mut self) -> Option<&Self::Item>;
    }

    impl<I: Iterator> PeekableIterator for Peekable<I> {
        fn peek(&mut self) -> Option<&Self::Item> {
            Peekable::peek(self)
        }
    }
}

Now, we can define our first non-generic trait:

pub trait TokenKind: Debug + PartialEq {
    fn is_error(&self) -> bool;
}

We "care" only about error tokens, because I would like to coalesce them together. Other kinds of tokens are not important for the sake of internals of our lexer.

Next I defined our Token which contains both kind and subslice with the str.

#[derive(Debug, PartialEq)]
pub struct Token<'a, K: TokenKind> {
    pub kind: K,
    pub span: &'a str,
}

impl<'a, K> Token<'a, K>
where
    K: TokenKind
{
    pub fn new(span: &'a str, kind: K) -> Self {
        Self { kind, span }
    }
}

Having that prepared we can finally define our lexer trait:

pub trait Lexer<'a, K>
where
    Self: PeekableIterator<Item = Token<'a, K>>,
    K: TokenKind
{
    fn input(&self) -> &'a str;
    fn set_input(&mut self, input: &'a str);
}

Because I would like to create a language with a string interpolation, I decided to create a parser with multiple lexer modes. When we parse normal expression like 2 + 3 we use standard "default" lexer. However, when we parse expression like: "Foo ${ 2 + 3 }" + 3 we cannot simply take all chars until next " character. Instead we parse expression normally and when we detect first " token, we switch to another lexer. Then inside this inner lexer, we detect ${, and switch again to regular lexer until we reach the } token.

In other words we need lexers that we can stack like this:

Input string: 2 + "Foo bar ${4 + 5} baz" + 3

tokenthenstack
NUMBER "2"normal_lex
TRIVIA " "normal_lex
PLUS "+"normal_lex
TRIVIA " "normal_lex
DOUBLEQUOTEpush string_lexnormal_lex, string_lex
TEXT "Foo bar "normal_lex, string_lex
OPEN_INTERP "${"push normal_lexnormal_lex, string_lex, normal_lex
NUMBER "4"normal_lex, string_lex, normal_lex
TRIVIA " "normal_lex, string_lex, normal_lex
PLUS "+"normal_lex, string_lex, normal_lex
TRIVIA " "normal_lex, string_lex, normal_lex
NUMBER "5"normal_lex, string_lex, normal_lex
CLOSE_INTERP "}"pop lexernormal_lex, string_lex
TEXT " baz"normal_lex, string_lex
DOUBLEQUOTEpop lexernormal_lex
TRIVIA " "normal_lex
PLUS "+"normal_lex
TRIVIA " "normal_lex
NUMBER "3"normal_lex

Therefore, we need a way to read left input to lex, and to set input after poping lexer.

Next is our Lexer implementation. I'm gonna use peeked with Option<Option<>> just like in the Peekable struct from std. You may ask - why not just use Peekable. Well... We would lose information about input() as Peekable takes ownership of the iterator, and has no method to access the inner field. That's why I'm going to rewrite part of Peekable myself. Not that hard as you may expect.

pub struct Lex<'a, F, K>
where
    K: TokenKind
{
    input: &'a str,
    f: F,
    #[allow(clippy::option_option)]
    peeked: Option<Option<Token<'a, K>>>,
}

impl<'a, F, K> Lex<'a, F, K>
where
    K: TokenKind,
    F: Fn(&'a str) -> Option<(K, &'a str)>,
{
    pub fn new(input: &'a str, f: F) -> Self {
        Self {
            input,
            f,
            peeked: None,
        }
    }
}

Because every lexer may be different, we provide function F where language designer can actually lex code. It returns TokenKind and rest of the string.

Finally we can implement Iterator for our Lex structure. We need this trait and also PeekableIterator to implement Lexer.

impl<'a, F, K> Iterator for Lex<'a, F, K>
where
    F: Fn(&'a str) -> Option<(K, &'a str)>,
    K: TokenKind,
{
    type Item = Token<'a, K>;

    fn next(&mut self) -> Option<Self::Item> {
        //TODO: return here peeked if exists.
        let (kind, rest) = (self.f)(self.input)?;

        let offset = self.input.offset(rest);

        let span = &self.input[0..offset];

        self.input = rest;

        Some(Token::new(span, kind))
    }
}

For now, I simplified next method by ignoring peeked field, as we don't implement PeekableIterator yet. As you can see we use Offset trait to calculate correct span for the token. Also, every time we call next(), we recalculate self.input. It's equivalent to having a cursor on a slice and incrementing it by the span length.

Peekable iterator is quite simple. First we check if we have anything stored, if not, we run next() to obtain it. After that we have to store obtained token in self.peeked. Note, that after everything we reset self.input - by peeking tokens we should not move our cursor forward.

impl<'a, F, K> PeekableIterator for Lex<'a, F, K>
    where
        F: Fn(&'a str) -> Option<(K, &'a str)>,
        K: TokenKind,
{
    fn peek(&mut self) -> Option<&Self::Item> {
        if self.peeked.is_none() {
            let i = self.input;
            self.peeked = Some(self.next());
            self.input = i;
        }

        self.peeked.as_ref().and_then(|i| i.as_ref())
    }
}

Now we can modify our Iterator implementation. When we have peeked the token, we don't want to lex whole input again. Instead, we simply return peeked element and replace it with None. We also move cursor forward.

    fn next(&mut self) -> Option<Self::Item> {
        if let Some(peeked) = self.peeked.take() {
            if let Some(peeked) = peeked.as_ref() {
                let input = &self.input[peeked.span.len()..];
                self.input = input;
            }

            return peeked;
        }
        //The rest of the method
    }

By having those two traits implemented, we can easily implement our Lexer trait:

impl<'a, F, K> Lexer<'a, K> for Lex<'a, F, K>
    where
        F: Fn(&'a str) -> Option<(K, &'a str)>,
        K: TokenKind,
{
    fn input(&self) -> &'a str {
        self.input
    }

    fn set_input(&mut self, input: &'a str) {
        self.input = input;
    }
}

That's it! We created our generic lexer. Now we can make some simple example how to use it. I'm going to stick to the S-Expression. It's much easier.

First token kind:

#[derive(Debug, PartialEq)]
enum Token {
    Error,
    Atom,
    Trivia, // Whitespace
    OpenParent,
    CloseParent
}

impl TokenKind for Token {
    fn is_error(&self) -> bool {
        match self {
            Self::Error => true,
            _ => false,
        }
    }
}

Then lexer function:

pub fn lexer<'a>(input: &'a str) -> impl Lexer<'a, Token> {
    Lex::new(input, |i: &'a str| {
        Some(match i.chars().next()? {
            c if c.is_whitespace() => {
                let rest = i.char_indices()
                    .take_while(|(_, c)| c.is_whitespace())
                    .last()
                    .map(|(idx, c)| idx + c.len_utf8())
                    .unwrap_or_default();
                (Token::Trivia, &i[rest..])
            }
            c if c.is_alphanumeric() => {
                let rest = i.char_indices()
                    .take_while(|(_, c)| c.is_alphanumeric())
                    .last()
                    .map(|(idx, c)| idx + c.len_utf8())
                    .unwrap_or_default();
                (Token::Atom, &i[rest..])
            }
            '(' => (Token::OpenParent, &i[1..]),
            ')' => (Token::CloseParent, &i[1..]),
            _ => (Token::Error, &i[1..]),
        })
    })
}

I know, that take_while consumes more than normally is acceptable, however, we use it only to calculate offset of character after the last one.


Edit: As /u/asleep_rock noticed, working on chars() may be tricky. Before I suggested code like:

let rest = i.chars()
    .take_while(|c| c.is_alphanumeric())
    .count();
(Token::Atom, &i[rest..])

Which works fine only for 1 byte characters. You have to take into the consideration that char length is in inclusive range from 1 to 4.

Thanks for noticing that!


To be sure we are correct, let's make a test:

#[test]
fn lexer_test() {
    let input = "(add 2 (京 4 5))";
    let mut lexer = lexer(input);
    let res: Vec<_> = lexer.collect();
    assert_eq!(res, vec![
        Token { kind: OpenParent, span: "(", },
        Token { kind: Atom, span: "add", },
        Token { kind: Trivia, span: " ", },
        Token { kind: Atom, span: "2", },
        Token { kind: Trivia, span: " ", },
        Token { kind: OpenParent, span: "(", },
        Token { kind: Atom, span: "", },
        Token { kind: Trivia, span: " ", },
        Token { kind: Atom, span: "4", },
        Token { kind: Trivia, span: " ", },
        Token { kind: Atom, span: "5", },
        Token { kind: CloseParent, span: ")", },
        Token { kind: CloseParent, span: ")", },
    ]);

}

Test passed! We are indeed correct! But what if we encounter an error? I told you about coalescing errors...

Our next tests will reveal the truth:

#[test]
fn lexer_error() {
    let input = "(add 2 (+++ 4 5))";
    let mut lexer = lexer(input);
    let res: Vec<_> = lexer.collect();
    assert_eq!(res, vec![
        Token { kind: OpenParent, span: "(", },
        Token { kind: Atom, span: "add", },
        Token { kind: Trivia, span: " ", },
        Token { kind: Atom, span: "2", },
        Token { kind: Trivia, span: " ", },
        Token { kind: OpenParent, span: "(", },
        Token { kind: Error, span: "+++", },
        Token { kind: Trivia, span: " ", },
        Token { kind: Atom, span: "4", },
        Token { kind: Trivia, span: " ", },
        Token { kind: Atom, span: "5", },
        Token { kind: CloseParent, span: ")", },
        Token { kind: CloseParent, span: ")", },
    ]);
}

It... fails. Instead of returning one Error "+++" it returned three separated errors one for each +. Time to change that :)

First of all lets move content of our next() method into private lex():

impl<'a, F, K> Lex<'a, F, K>
where
    F: Fn(&'a str) -> Option<(K, &'a str)>,
    K: TokenKind,
{
    fn lex(&mut self) -> Option<Token<'a, K>> {
        if let Some(peeked) = self.peeked.take() {
            if let Some(peeked) = peeked.as_ref() {
                let input = &self.input[peeked.span.len()..];
                self.input = input;
            }
            return peeked;
        }

        let (kind, rest) = (self.f)(self.input)?;

        let offset = self.input.offset(rest);

        let span = &self.input[0..offset];

        self.input = rest;
        Some(Token::new(span, kind))
    }
}

Then we have to design simple algorithm in next(). We take first token. If it's not an error - we simply return it. However, if it is an error we create a loop where we peek next token, ensure it is an error (otherwise return first), and extend first token span by the length of the second. We move the cursor and repeat until either we encounter EOF or not-an-error token.

fn next(&mut self) -> Option<Self::Item> {
    let input = self.input;
    let mut first = self.lex()?;

    while first.kind.is_error() {
        match self.peek() {
            Some(token) if token.kind.is_error() => {
                let first_len = first.span.len();
                let second_len = token.span.len();
                let len = first_len + second_len;
                first.span = &input[..len];
                self.lex();
            }
            _ => break,
        }
    }
    Some(first)
}

Now second test is passing. Voila ;)

In the next part we will start parsing our tokens.

In a couple of days I will also share the code with you on Github.

See you soon!