As I mentioned in previous articles, I was investigating the possibility of using the text-size library created by the Rust Analyzer team. Please note two important things:

  • It still lives in the text_unit repository, because the crate was only recently renamed,
  • Its version is 0.99.0-dev.2, which means it may be unstable.

But it still matched my preferences so well that I couldn't simply pass it by.

Another reason I would like to use this library is that it would allow me to ditch lifetimes from Token and Lexer. While they are harmless for now, I see a couple of reasons to remove them.

First of all, lifetimes are contagious. If a structure has a field with a lifetime, then the structure itself has to carry that lifetime. You also can't simply create, for example, self-referential structs, where one field references another.
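To see the contagion in action, here is a minimal sketch (not code from the project):

struct Borrowed<'a> {
    slice: &'a str,
}

// Any struct embedding `Borrowed` must declare and thread the same lifetime...
struct Wrapper<'a> {
    field: Borrowed<'a>,
}

// ...and so must everything embedding `Wrapper`, all the way up the chain.
struct Outer<'a> {
    wrapper: Wrapper<'a>,
}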

Secondly, in the future I would like to use my parser with the wonderful Salsa project; however, it does not support lifetimes. One of the maintainers suggested using (usize, usize) instead, and fortunately text-size does exactly that.

Lastly, I encountered several issues with lifetimes and closures, which I use heavily.

As always, the source code is available on GitHub.

OK, with that in mind, let me introduce the new changes to our little project.


First of all, we have to add the dependency. Unfortunately this version of the crate is still unstable and not available on crates.io, so we have to point Cargo directly at the Git repository.

[dependencies]
text-size = { version = "0.99.0-dev.2", git = "https://github.com/rust-analyzer/text_unit.git", branch = "master" }

First, I would like to re-export TextRange for convenience. Besides the re-export, I want to introduce a new structure: Input.

// In lib.rs
pub use text_size::TextRange;
pub mod input;

This new struct keeps a copy of the original str on the heap, next to a TextSize cursor that tracks where the lexer currently is.

// In input.rs
use text_size::{TextRange, TextSize};

#[derive(Debug, Clone)]
pub struct Input {
    pub(crate) str: Box<str>,
    pub(crate) cursor: TextSize,
}

I would like to create an Input from a slice, and to retrieve the remaining slice for the current cursor:

impl From<&'_ str> for Input {
    fn from(input: &str) -> Self {
        let str: Box<str> = Box::from(input);
        Self { str, cursor: TextSize::zero() }
    }
}

impl AsRef<str> for Input {
    // Returns the not-yet-consumed remainder of the input.
    fn as_ref(&self) -> &str {
        let size = self.str.text_size();
        let range = TextRange(self.cursor, size);
        &self.str[range]
    }
}

Now I can create a method which will "chomp" characters off the stored str, move the cursor forward, and return a TextRange: our span.

use std::convert::TryFrom;

impl Input {
    // Consumes `len` characters (not bytes!) from the remainder, moves the
    // cursor forward, and returns the consumed span. `len` must be at least 1,
    // otherwise `len - 1` underflows.
    pub fn chomp(&mut self, len: usize) -> TextRange {
        let end = TextSize::of(&*self.str);

        // Find the byte length of the first `len` characters of the remainder;
        // if nothing is left, produce an empty range pointing at the end.
        let range = match self.as_ref().char_indices().nth(len - 1)
            .and_then(|(last, c)| TextSize::try_from(last + c.len_utf8()).ok())
        {
            Some(last) => TextRange(self.cursor, self.cursor + last),
            None => TextRange(end, end)
        };
        self.cursor = range.end();

        range
    }
}

As you can see, it already operates on char_indices, which means less repetitive code in the user's lexer function. We must also denote EOF somehow. We do it with TextRange(end, end): this range has len() == 0 and points to the end of the stream.
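Here is a quick sanity check of the intended behaviour (a hypothetical snippet; the input string is made up):

let mut input = Input::from("héllo");
assert_eq!(input.as_ref(), "héllo");           // nothing consumed yet

let span = input.chomp(2);                     // chomps 'h' and the 2-byte 'é'
assert_eq!(span, TextRange(TextSize::zero(), TextSize::of("hé")));
assert_eq!(input.as_ref(), "llo");             // as_ref() returns the remainder

let span = input.chomp(3);                     // chomps the rest
assert_eq!(span.end(), TextSize::of("héllo"));

let eof = input.chomp(1);                      // nothing left: empty range at the end
assert_eq!(eof.len(), TextSize::zero());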

Perfect. Now it's time to use the range in the token.


First of all, I would like to store only the kind and the range in a token, without a slice. But this raises an important question: what about our tests? These tests, after all, expect subslices.

Well, I will show you how :)

First of all, our Token gets a new definition.

#[derive(PartialEq)]
pub struct Token<K: TokenKind> {
    pub kind: K,
    pub span: TextRange,
}

You might notice the missing Debug derive - that's because I would like to implement it myself.

use std::fmt::{Debug, Display, Formatter, Result};

impl<K> Debug for Token<K>
where
    K: TokenKind,
{
    fn fmt(&self, f: &mut Formatter<'_>) -> Result {
        // Print the kind together with its span.
        write!(f, "{:?}@{:?}", self.kind, self.span)
    }
}

Now, what about Display? Instead of implementing it for Token, I want to create a separate, temporary struct, DisplayToken, which contains references to both the kind and the original str. Having those two pieces allows us to create a sub-slice for display purposes.


impl<K> Token<K>
where
    K: TokenKind,
{
    pub fn new(span: TextRange, kind: K) -> Self { Self { kind, span } }

    pub fn display<'k, 's>(&'k self, str: &'s str) -> DisplayToken<'k, 's, K> {
        DisplayToken { str, kind: &self.kind, span: self.span }
    }
}

pub struct DisplayToken<'k, 's, K: TokenKind> {
    str: &'s str,
    kind: &'k K,
    span: TextRange
}

impl<'k, 's, K> Display for DisplayToken<'k, 's, K>
where
    K: TokenKind,
{
    fn fmt(&self, f: &mut Formatter<'_>) -> Result {
        write!(f, "{:?} `{}`", self.kind, &self.str[self.span])
    }
}

I decided this is a good solution because displaying tokens is relatively rare; in most cases you really just want the token kind.
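Here is a short, hypothetical usage snippet (the MyKind enum and the input are made up for illustration):

let input = "foo bar";
let token = Token::new(TextRange(TextSize::zero(), TextSize::of("foo")), MyKind::Atom);

println!("{:?}", token);              // Debug: kind and span only
println!("{}", token.display(input)); // Display: kind plus the `foo` sub-slice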


Having both Input and the modified Token, we can finally refactor our Lexer a bit:

First of all, the trait:

pub trait Lexer<K>
where
    Self: PeekableIterator<Item = Token<K>>,
    K: TokenKind,
{
    fn input(&self) -> Input;
    fn set_input(&mut self, input: Input);
}

I removed the lifetime and replaced &'a str with Input.

Because lexer.rs changed quite significantly, I would like to focus only on the important parts.

Keep in mind that I removed all the lifetimes. If you need a reference, feel free to check the project on my GitHub under the pt-3 tag.

OK, let's dig into the details.

The most important change is the function signature:

F: Fn(&mut Input) -> Option<(K, TextRange)>,

Now it takes Input instead of &'a str (easy to predict). The more interesting part is in the Option: instead of returning the rest of the input, we mutate the input in place (hence the &mut argument) and return the actual span. Therefore we don't need the Offset trait anymore and can remove the whole module.
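Side by side, the change looks roughly like this (the old signature is reconstructed from the description above, so treat it as an approximation):

// Before: borrow the input and return the matched kind plus the *rest* of the input.
F: Fn(&'a str) -> Option<(K, &'a str)>

// After: mutate the input in place and return the matched kind plus its span.
F: Fn(&mut Input) -> Option<(K, TextRange)>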

Secondly, in the Iterator implementation I'm using TextRange::covering to expand a run of error tokens into a single one.

// Merge consecutive error tokens into one token
// whose span covers them all.
while first.kind.is_error() {
    match self.peek() {
        Some(token) if token.kind.is_error() => {
            first.span = TextRange::covering(first.span, token.span);
            self.lex();
        }
        _ => break,
    }
}
Some(first)

It's much clearer now!
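For instance, two adjacent one-character errors merge into a single two-character span (a hypothetical check, using the same constructors as earlier):

let first = TextRange(TextSize::zero(), TextSize::of("@"));     // 0..1
let second = TextRange(TextSize::of("@"), TextSize::of("@#"));  // 1..2

assert_eq!(
    TextRange::covering(first, second),
    TextRange(TextSize::zero(), TextSize::of("@#"))             // 0..2
);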


With these changes in place, I need to fix my tests.

First of all, the lexer function:

Lex::new(input, |i: &mut Input| {
    Some(match i.as_ref().chars().next()? {
        c if c.is_whitespace() => {
            // Count the leading whitespace characters, then chomp them in one go.
            let len = i.as_ref()
                .chars()
                .take_while(|c| c.is_whitespace())
                .count();
            (Token::Trivia, i.chomp(len))
        }
        c if c.is_alphanumeric() => {
            // Same idea for alphanumeric runs.
            let len = i.as_ref()
                .chars()
                .take_while(|c| c.is_alphanumeric())
                .count();
            (Token::Atom, i.chomp(len))
        }
        '(' => (Token::OpenParent, i.chomp(1)),
        ')' => (Token::CloseParent, i.chomp(1)),
        _ => (Token::Error, i.chomp(1)),
    })
})

As you can see, thanks to i.chomp() I could return to simply counting whitespace characters. But... does it really work? We need to make a small change in the test to find out.

let res: Vec<_> = lexer.map(|t| t.display(input).to_string()).collect();

I added the display(input) call, which creates the temporary DisplayToken structure. That's how we keep compatibility with the old test expectations.
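For a made-up input like "(hi)", the collected strings would look roughly like this (hypothetical values, following the Display format above):

// res == ["OpenParent `(`", "Atom `hi`", "CloseParent `)`"]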

That's it! We removed the lifetimes and all tests pass. In the next step, I'll prepare a Node structure to represent our Concrete Syntax Tree.