Table of Contents
- What Is Tokenization?
- Token Types
- Character Classification Helpers
- Reading Variable Names
- The Main Tokenize Function
- Worked Example
- What’s Next
What Is Tokenization?
Before we can understand the structure of an expression like (A & B) -> C, we need to break it into meaningful chunks. This process is called tokenization (or lexing). Each chunk is called a token.
Consider the expression (A & B) -> C. A human can immediately pick out the parentheses, the variables, and the operators - but to our program it’s just a flat string of characters. The tokenizer’s job is to scan through that string character by character and produce a list of structured objects:
( → paren "("
A → variable "A"
& → operator "&"
B → variable "B"
) → paren ")"
-> → operator "->"
C → variable "C"
Once we have this token list, the parser (Chapter 3) can work with meaningful units instead of raw characters.
Token Types
We use three token types:
- paren - an opening or closing parenthesis: ( or )
- operator - a logical operator: ~, &, |, ->, <->
- variable - a variable name like A, B, myVar, or x1
Each token is a simple object with a type and a value:
// Example tokens:
{ type: 'paren', value: '(' }
{ type: 'operator', value: '->' }
{ type: 'variable', value: 'A' }
Character Classification Helpers
The tokenizer needs to identify what kind of character it’s looking at. We write a small helper function for each category.
Whitespace - we skip these entirely:
function isWhiteSpace(input, i) {
return /\s/.test(input[i]);
}
Parentheses - just ( and ):
function isParenthesis(input, i) {
return input[i] === '(' || input[i] === ')';
}
Negation - we accept ~, !, and the Unicode ¬:
function isNotOperator(input, i) {
return input[i] === '~' || input[i] === '!' || input[i] === '¬';
}
Conjunction and disjunction - single characters, but we also accept the Unicode symbols:
function isAndOperator(input, i) {
return input[i] === '&' || input[i] === '∧';
}
function isOrOperator(input, i) {
return input[i] === '|' || input[i] === '∨';
}
Implication - a two-character operator, either -> or =>:
function isImpliedOperator(input, i) {
if (input[i] !== '-' && input[i] !== '=') return false;
const check = input.slice(i, i + 2);
return check === '->' || check === '=>';
}
Biconditional - a three-character operator, either <-> or <=>:
function isIfAndOnlyIfOperator(input, i) {
if (input[i] !== '<') return false;
const check = input.slice(i, i + 3);
return check === '<->' || check === '<=>';
}
Design choice: Each helper takes the full input string and the current index i, rather than a single character. This makes the multi-character checks straightforward - we can slice ahead without worrying about bounds, because slice safely returns a shorter string if we run past the end.
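A quick sanity check of that slicing behavior (the two multi-character helpers are repeated here so the snippet runs on its own):

```javascript
function isImpliedOperator(input, i) {
  if (input[i] !== '-' && input[i] !== '=') return false;
  const check = input.slice(i, i + 2);
  return check === '->' || check === '=>';
}

function isIfAndOnlyIfOperator(input, i) {
  if (input[i] !== '<') return false;
  const check = input.slice(i, i + 3);
  return check === '<->' || check === '<=>';
}

// Matches at the start of '->', even mid-string.
console.log(isImpliedOperator('A -> B', 2));      // true
// A lone '-' at the end of input: slice returns just '-', so no match.
console.log(isImpliedOperator('A -', 2));         // false
// '<->' is only recognized from its '<'; the implication check ignores it.
console.log(isIfAndOnlyIfOperator('A <-> B', 2)); // true
console.log(isImpliedOperator('A <-> B', 2));     // false
```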
Variable characters - letters, digits, and underscores. This lets users write variable names like x1 or my_var:
function isVariableChar(chr) {
return /[A-Za-z0-9_]/.test(chr);
}
Reading Variable Names
Since variable names can be more than one character, we need a function that reads the entire name starting from a given position:
function getVarName(input, i) {
let varName = '';
while (i < input.length && isVariableChar(input[i])) {
varName += input[i];
i++;
}
return varName;
}
It starts at position i and consumes characters as long as they’re valid variable characters. The resulting string is returned, and the caller advances i by varName.length.
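For example (repeating both helpers so the snippet is self-contained):

```javascript
function isVariableChar(chr) {
  return /[A-Za-z0-9_]/.test(chr);
}

function getVarName(input, i) {
  let varName = '';
  while (i < input.length && isVariableChar(input[i])) {
    varName += input[i];
    i++;
  }
  return varName;
}

// Reads 'my_var' and stops at the space; the caller then advances i by 6.
console.log(getVarName('my_var & x1', 0)); // 'my_var'
// Starting at index 9 picks up the second variable.
console.log(getVarName('my_var & x1', 9)); // 'x1'
```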
The Main Tokenize Function
Now we put it all together. The tokenize function walks through the input character by character. At each position it checks what kind of token starts there, creates the appropriate token, and advances the index past whatever it consumed:
function tokenize(input) {
let tokens = [];
let i = 0;
while (i < input.length) {
if (isWhiteSpace(input, i)) {
i++;
continue;
}
if (isParenthesis(input, i)) {
tokens.push({ type: 'paren', value: input[i] });
i++;
continue;
}
if (isNotOperator(input, i) || isAndOperator(input, i) || isOrOperator(input, i)) {
tokens.push({ type: 'operator', value: input[i] });
i++;
continue;
}
if (isImpliedOperator(input, i)) {
tokens.push({ type: 'operator', value: '->' });
i += 2;
continue;
}
if (isIfAndOnlyIfOperator(input, i)) {
tokens.push({ type: 'operator', value: '<->' });
i += 3;
continue;
}
if (isVariableChar(input[i])) {
const varName = getVarName(input, i);
tokens.push({ type: 'variable', value: varName });
i += varName.length;
continue;
}
throw new Error('Unknown token at position ' + i + ": '" + input[i] + "'");
}
return tokens;
}
Important: Multi-character operators deserve care, because the biconditional <-> (3 chars) contains the implication -> (2 chars). In our code the first-character guards make this safe even though the implication check runs first: isImpliedOperator only fires when the current character is - or =, and <-> begins with <, so the implication check skips it and the biconditional check consumes all three characters. If the helpers matched anywhere inside the operator, we would have to test the longer <-> before the shorter ->, or <-> would be incorrectly split.
If none of the checks match, we throw an error with the position and the offending character. This gives the user a clear message like: Unknown token at position 5: '#'.
Worked Example
Let’s trace through tokenizing (A & B) -> C:
| Position | Character(s) | Action | Token |
|---|---|---|---|
| 0 | ( | Parenthesis | { type: 'paren', value: '(' } |
| 1 | A | Variable | { type: 'variable', value: 'A' } |
| 2 | (space) | Whitespace | skip |
| 3 | & | And operator | { type: 'operator', value: '&' } |
| 4 | (space) | Whitespace | skip |
| 5 | B | Variable | { type: 'variable', value: 'B' } |
| 6 | ) | Parenthesis | { type: 'paren', value: ')' } |
| 7 | (space) | Whitespace | skip |
| 8–9 | -> | Implies operator | { type: 'operator', value: '->' } |
| 10 | (space) | Whitespace | skip |
| 11 | C | Variable | { type: 'variable', value: 'C' } |
Result: an array of 7 tokens, ready for the parser.
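To confirm the trace, we can run the tokenizer end to end. The snippet below repeats all the definitions from this chapter so it runs standalone, then tokenizes the example and exercises the error path:

```javascript
function isWhiteSpace(input, i) { return /\s/.test(input[i]); }
function isParenthesis(input, i) { return input[i] === '(' || input[i] === ')'; }
function isNotOperator(input, i) { return input[i] === '~' || input[i] === '!' || input[i] === '¬'; }
function isAndOperator(input, i) { return input[i] === '&' || input[i] === '∧'; }
function isOrOperator(input, i) { return input[i] === '|' || input[i] === '∨'; }
function isImpliedOperator(input, i) {
  if (input[i] !== '-' && input[i] !== '=') return false;
  const check = input.slice(i, i + 2);
  return check === '->' || check === '=>';
}
function isIfAndOnlyIfOperator(input, i) {
  if (input[i] !== '<') return false;
  const check = input.slice(i, i + 3);
  return check === '<->' || check === '<=>';
}
function isVariableChar(chr) { return /[A-Za-z0-9_]/.test(chr); }
function getVarName(input, i) {
  let varName = '';
  while (i < input.length && isVariableChar(input[i])) { varName += input[i]; i++; }
  return varName;
}

function tokenize(input) {
  let tokens = [];
  let i = 0;
  while (i < input.length) {
    if (isWhiteSpace(input, i)) { i++; continue; }
    if (isParenthesis(input, i)) { tokens.push({ type: 'paren', value: input[i] }); i++; continue; }
    if (isNotOperator(input, i) || isAndOperator(input, i) || isOrOperator(input, i)) {
      tokens.push({ type: 'operator', value: input[i] }); i++; continue;
    }
    if (isImpliedOperator(input, i)) { tokens.push({ type: 'operator', value: '->' }); i += 2; continue; }
    if (isIfAndOnlyIfOperator(input, i)) { tokens.push({ type: 'operator', value: '<->' }); i += 3; continue; }
    if (isVariableChar(input[i])) {
      const varName = getVarName(input, i);
      tokens.push({ type: 'variable', value: varName }); i += varName.length; continue;
    }
    throw new Error("Unknown token at position " + i + ": '" + input[i] + "'");
  }
  return tokens;
}

const tokens = tokenize('(A & B) -> C');
console.log(tokens.length);                      // 7
console.log(tokens.map(t => t.value).join(' ')); // ( A & B ) -> C

// Unsupported characters surface as a clear error.
try {
  tokenize('A & B #');
} catch (e) {
  console.log(e.message); // Unknown token at position 6: '#'
}
```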
What’s Next
We now have a flat list of tokens, but we don’t yet understand the structure of the expression - which operators bind tighter, where the sub-expressions are, or how parentheses group things.
In Chapter 3: Parsing with Recursive Descent, we’ll build a parser that consumes this token list and produces an Abstract Syntax Tree (AST), respecting operator precedence and parenthesization.
Previous: Chapter 1: Setting Up the Frontend