렉서

토큰

렉서(토크나이저, 스캐너)는 소스를 토큰으로 바꿉니다. 파서가 토큰을 소비하므로 원문의 공백·주석은 이후 신경 덜어도 됩니다.

먼저 + 문자 하나만 토큰으로 바꿔 봅시다.

rust

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Token {
    /// Token Type
    pub kind: Kind,

    /// Start offset in source
    pub start: usize,

    /// End offset in source
    pub end: usize,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Kind {
    Eof, // end of file
    Plus,
}

+ 하나는 다음 출력을 줍니다

[
    Token { kind: Kind::Plus, start: 0, end: 1 },
    Token { kind: Kind::Eof,  start: 1, end: 1 }
]

문자열을 도는 방법은 인덱스를 추적하며 마치 C를 쓰는 것처럼 할 수도 있고, string 문서에서 Chars 반복자를 쓸 수도 있습니다.

INFO

Chars는 인덱스 추적과 경계 검사를 추상화해 줍니다.

chars.next()는 Option<char>를 줍니다. char는 ASCII 0–255가 아니라 UTF-8 유니코드 스칼라로 범위는 0~0x10FFFF입니다.

렉서 뼈대를 정의합니다

rust

use std::str::Chars;

struct Lexer<'a> {
    /// Source Text
    source: &'a str,

    /// The remaining characters
    chars: Chars<'a>
}

impl<'a> Lexer<'a> {
    pub fn new(source: &'a str) -> Self {
        Self {
            source,
            chars: source.chars()
        }
    }
}

INFO

수명 'a는 반복자가 참조하는 곳을 나타냅니다. 여기선 &'a str입니다.

chars.next()를 반복해 char에 매치하면 토큰으로 바꿉니다. 마지막 토큰은 항상 Kind::Eof입니다.

rust

impl<'a> Lexer<'a> {
    fn read_next_kind(&mut self) -> Kind {
        while let Some(c) = self.chars.next() {
            match c {
              '+' => return Kind::Plus,
              _ => {}
            }
        }
        Kind::Eof
    }

    fn read_next_token(&mut self) -> Token {
        let start = self.offset();
        let kind = self.read_next_kind();
        let end = self.offset();
        Token { kind, start, end }
    }

    /// Get the length offset from the source text, in UTF-8 bytes
    fn offset(&self) -> usize {
        self.source.len() - self.chars.as_str().len()
    }
}

fn offset 안의 .len()과 .as_str().len()이 O(n)처럼 느껴지니 깊게 파 봅니다.

.as_str()는 문자열 슬라이스 포인터를 반환합니다

rust

// https://github.com/rust-lang/rust/blob/b998821e4c51c44a9ebee395c91323c374236bbb/library/core/src/str/iter.rs#L112-L115

pub fn as_str(&self) -> &'a str {
    // SAFETY: `Chars` is only made from a str, which guarantees the iter is valid UTF-8.
    unsafe { from_utf8_unchecked(self.iter.as_slice()) }
}

슬라이스는 포인터와 길이로 표현된 메모리 뷰입니다. .len()은 슬라이스에 들어 있는 메타데이터를 돌려줍니다

rust

// https://github.com/rust-lang/rust/blob/b998821e4c51c44a9ebee395c91323c374236bbb/library/core/src/str/mod.rs#L157-L159

pub const fn len(&self) -> usize {
    self.as_bytes().len()
}

rust

// https://github.com/rust-lang/rust/blob/b998821e4c51c44a9ebee395c91323c374236bbb/library/core/src/str/mod.rs#L323-L325

pub const fn as_bytes(&self) -> &[u8] {
    // SAFETY: const sound because we transmute two types with the same layout
    unsafe { mem::transmute(self) }
}

rust

// https://github.com/rust-lang/rust/blob/b998821e4c51c44a9ebee395c91323c374236bbb/library/core/src/slice/mod.rs#L129-L138

pub const fn len(&self) -> usize {
    // FIXME: Replace with `crate::ptr::metadata(self)` when that is const-stable.
    // As of this writing this causes a "Const-stable functions can only call other
    // const-stable functions" error.

    // SAFETY: Accessing the value from the `PtrRepr` union is safe since *const T
    // and PtrComponents<T> have the same memory layouts. Only std can make this
    // guarantee.
    unsafe { crate::ptr::PtrRepr { const_ptr: self }.components.metadata }
}

위 코드는 최종적으로 단일 데이터 접근으로 컴파일되므로 .as_str().len()은 사실 O(1)입니다.

Peek

++, += 같은 다문자 연산자를 토큰화하려면 peek 헬퍼가 필요합니다:

rust

fn peek(&self) -> Option<char> {
    self.chars.clone().next()
}

원본 chars를 전진시키고 싶지 않으니 반복자를 복제합니다.

INFO

소스를 보면 clone은 추적·경계 인덱스만 복사해 비쌉니다.

rust

// https://github.com/rust-lang/rust/blob/b998821e4c51c44a9ebee395c91323c374236bbb/library/core/src/slice/iter.rs#L148-L152

impl<T> Clone for Iter<'_, T> {
    fn clone(&self) -> Self {
        Iter { ptr: self.ptr, end: self.end, _marker: self._marker }
    }
}

peek과 chars.next()의 차이: peek은 항상 같은 다음 char를 돌려주고, next는 앞으로 진행하며 다른 char를 줍니다.

abc 문자열로 보면:

peek()를 반복하면 Some(a), Some(a), Some(a), …
chars.next() 반복은 Some('a'), Some('b'), Some('c'), None

peek이 있으면 ++, +=는 중첩 if로 토큰화할 수 있습니다.

jsparagus의 실제 구현 예:

rust

// https://github.com/mozilla-spidermonkey/jsparagus/blob/master/crates/parser/src/lexer.rs#L1769-L1791

'+' => match self.peek() {
    Some('+') => {
        self.chars.next();
        return self.set_result(
            TerminalId::Increment,
            SourceLocation::new(start, self.offset()),
            TokenValue::None,
        );
    }
    Some('=') => {
        self.chars.next();
        return self.set_result(
            TerminalId::AddAssign,
            SourceLocation::new(start, self.offset()),
            TokenValue::None,
        );
    }
    _ => return self.set_result(
        TerminalId::Plus,
        SourceLocation::new(start, self.offset()),
        TokenValue::None,
    ),
},

위 패턴은 모든 연산자에 적용되니 JavaScript 렉싱 지식을 넓힙니다.

JavaScript

Rust 렉서는 따분합니다. 문자마다 긴 if 체인을 두는 것 같습니다.

본격적인 즐거움은 JavaScript 렉싱을 시작할 때입니다.

ECMAScript Language Specification을 펴고 JavaScript를 다시 배웁니다.

TIP

명세를 처음 열었을 때 용어만 가득해 구석에서 울고 싶었던 기억이 납니다. 이해가 안 되면 명세 읽기 가이드로 넘어가 보세요.

주석

주석에는 의미가 없어 런타임이라면 건너뛸 수 있지만, 린터나 번들러라면 고려해야 합니다.

식별자와 Unicode

코드 대부분은 ASCII지만 11장 ECMAScript Language: Source Text는 소스가 Unicode여야 한다고 하고, 12.6 Names and Keywords는 식별자가 Unicode Annex #31의 Default Identifier Syntax를 따른다고 합니다. 세부 규격:

IdentifierStartChar ::
    UnicodeIDStart

IdentifierPartChar ::
    UnicodeIDContinue

UnicodeIDStart ::
    any Unicode code point with the Unicode property “ID_Start”

UnicodeIDContinue ::
    any Unicode code point with the Unicode property “ID_Continue”

즉 var ಠ_ಠ는 가능하지만 var 🦀은 안 됩니다. ಠ는 "ID_Start" 속성을 갖지만 🦀은 그렇지 않습니다.

INFO

이 목적 전용으로 unicode-id-start 크레이트를 냈습니다. unicode_id_start::is_id_start(char), unicode_id_start::is_id_continue(char)로 검사하면 됩니다.

키워드

if, while, for 등 모든 키워드는 통째로 토큰화해야 합니다. 토큰 종류 enum에 넣어두면 파서에서 문자열 비교가 필요 없습니다.

rust

pub enum Kind {
    Identifier,
    If,
    While,
    For
}

TIP

undefined는 키워드가 아니므로 여기 넣지 않아도 됩니다.

키워드 토큰화는 위 식별자 판별과 문자열 매칭이 전부입니다.

rust

fn match_keyword(&self, ident: &str) -> Kind {
    // all keywords are 1 < length <= 10
    if ident.len() == 1 || ident.len() > 10 {
        return Kind::Identifier;
    }
    match ident {
        "if" => Kind::If,
        "while" => Kind::While,
        "for" => Kind::For,
        _ => Kind::Identifier
    }
}

토큰 값(Token Value)

컴파일러 후속 단계에서 식별자·숫자·문자열을 자주 비교합니다(예: 린터에서 식별자 검사).

지금은 소스 텍스트 그대로이니 Rust 타입으로 바꾸면 다루기 쉽습니다.

rust

pub enum Kind {
    Eof, // end of file
    Plus,
    Identifier,
    Number,
    String,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Token {
    /// Token Type
    pub kind: Kind,

    /// Start offset in source
    pub start: usize,

    /// End offset in source
    pub end: usize,

    pub value: TokenValue,
}

#[derive(Debug, Clone, PartialEq)]
pub enum TokenValue {
    None,
    Number(f64),
    String(String),
}

식별자 foo나 문자열 "bar"를 토큰화하면

Token { kind: Kind::Identifier, start: 0, end: 2, value: TokenValue::String("foo") }
Token { kind: Kind::String, start: 0, end: 4, value: TokenValue::String("bar") }

Rust 문자열로는 let s = self.source[token.start..token.end].to_string() 후 token.value = TokenValue::String(s)로 저장합니다.

숫자 1.23을 토큰화하면 Token { start: 0, end: 3 } 같은 토큰이 됩니다. Rust f64로는 문자열 parse로 self.source[token.start..token.end].parse::<f64>() 한 뒤 token.value에 넣습니다. 2진·8진·정수 파싱 예는 jsparagus를 참고하세요.

Rust 최적화

더 작은 토큰

Kind enum 안에 토큰 값을 넣고 싶은 유혹이 듭니다:

rust

pub enum Kind {
    Number(f64),
    String(String),
}

Rust enum 크기는 변안들의 합집합입니다. 원래 1바이트 enum에 비해 훨씬 큽니다. 파서에서 Kind를 아주 많이 쓰므로 1바이트가 여러 바이트보다 빠릅니다.

문자열 인턴

컴파일러에서 String은 다음 이유로 비효율적입니다:

힙 할당 객체
문자열 비교는 O(n)

문자열 인턴은 캐시에 고유 ID로 한 번만 저장해 이를 해소합니다. 서로 다른 식별자·문자열당 힙 할당은 한 번이고 비교는 O(1)이 됩니다.

crates.io에 장단점이 다른 인턴 라이브러리가 많습니다.

시작으로 string-cache면 충분합니다. Atom 타입과 컴파일 타임 atom!("string") 인터페이스가 있습니다.

string-cache를 쓰면 TokenValue는

rust

#[derive(Debug, Clone, PartialEq)]
pub enum TokenValue {
    None,
    Number(f64),
    String(Atom), 
}

문자열 비교는 matches!(value, TokenValue::String(atom!("string")))처럼 씁니다.

렉서 ​

토큰 ​

Peek ​

JavaScript ​

주석 ​

식별자와 Unicode ​

키워드 ​

토큰 값(Token Value) ​

Rust 최적화 ​

더 작은 토큰 ​

문자열 인턴 ​

렉서

토큰

Peek

JavaScript

주석

식별자와 Unicode

키워드

토큰 값(Token Value)

Rust 최적화

더 작은 토큰

문자열 인턴