nubpaws.dev

Chapter 2: Tokenizing Logical Expressions

Build a tokenizer that converts a raw boolean expression string into a structured array of tokens - the first step in our parsing pipeline.

February 28, 2026 · NubPaws
Tags: JavaScript, Tokenization, Parsing, Tutorial

Table of Contents

  1. What Is Tokenization?
  2. Token Types
  3. Character Classification Helpers
  4. Reading Variable Names
  5. The Main Tokenize Function
  6. Worked Example
  7. What’s Next

What Is Tokenization?

Before we can understand the structure of an expression like (A & B) -> C, we need to break it into meaningful chunks. This process is called tokenization (or lexing). Each chunk is called a token.

Consider the expression (A & B) -> C. A human can immediately pick out the parentheses, the variables, and the operators - but to our program it’s just a flat string of characters. The tokenizer’s job is to scan through that string character by character and produce a list of structured objects:

( → paren "("
A → variable "A"
& → operator "&"
B → variable "B"
) → paren ")"
-> → operator "->"
C → variable "C"

Once we have this token list, the parser (Chapter 3) can work with meaningful units instead of raw characters.

Token Types

We use three token types:

  • paren - opening or closing parenthesis: (, )
  • operator - a logical operator: ~, &, |, ->, <->
  • variable - a variable name like A, B, myVar, or x1

Each token is a simple object with a type and a value:

// Example tokens:
{ type: 'paren',    value: '('   }
{ type: 'operator', value: '->'  }
{ type: 'variable', value: 'A'   }
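For the expression (A & B) -> C from the previous section, the complete token list is just an array of these objects:

```javascript
// Token list for "(A & B) -> C":
const tokens = [
  { type: 'paren',    value: '(' },
  { type: 'variable', value: 'A' },
  { type: 'operator', value: '&' },
  { type: 'variable', value: 'B' },
  { type: 'paren',    value: ')' },
  { type: 'operator', value: '->' },
  { type: 'variable', value: 'C' },
];

console.log(tokens.length); // 7
```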

Character Classification Helpers

The tokenizer needs to identify what kind of character it’s looking at. We write a small helper function for each category.

Whitespace - we skip these entirely:

function isWhiteSpace(input, i) {
  return /\s/.test(input[i]);
}

Parentheses - just ( and ):

function isParenthesis(input, i) {
  return input[i] === '(' || input[i] === ')';
}

Negation - we accept ~, !, and the Unicode ¬:

function isNotOperator(input, i) {
  return input[i] === '~' || input[i] === '!' || input[i] === '¬';
}

Conjunction and disjunction - single characters, but we also accept the Unicode symbols:

function isAndOperator(input, i) {
  return input[i] === '&' || input[i] === '∧';
}

function isOrOperator(input, i) {
  return input[i] === '|' || input[i] === '∨';
}

Implication - a two-character operator, either -> or =>:

function isImpliedOperator(input, i) {
  if (input[i] !== '-' && input[i] !== '=') return false;

  const check = input.slice(i, i + 2);
  return check === '->' || check === '=>';
}
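A quick sanity check of this helper (repeated here so the snippet runs on its own):

```javascript
function isImpliedOperator(input, i) {
  if (input[i] !== '-' && input[i] !== '=') return false;

  const check = input.slice(i, i + 2);
  return check === '->' || check === '=>';
}

console.log(isImpliedOperator('A -> B', 2)); // true  ('->' starts at index 2)
console.log(isImpliedOperator('A => B', 2)); // true  ('=>' is accepted too)
console.log(isImpliedOperator('A -> B', 0)); // false ('A' is neither '-' nor '=')
```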

Biconditional - a three-character operator, either <-> or <=>:

function isIfAndOnlyIfOperator(input, i) {
  if (input[i] !== '<') return false;

  const check = input.slice(i, i + 3);
  return check === '<->' || check === '<=>';
}
Design choice:

Each of the operator helpers takes the full input string and the current index i rather than a single character. This makes the multi-character checks straightforward - we can slice ahead without worrying about bounds, because slice simply returns a shorter string (or an empty one) if we run past the end. The one exception is isVariableChar below, which classifies a single character, since getVarName feeds it one character at a time.
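The slice behaviour this relies on is easy to verify:

```javascript
const input = 'A -';            // the '-' is the last character (index 2)
console.log(input.slice(2, 4)); // '-'  (only one character available; no error)
console.log(input.slice(5, 7)); // ''   (entirely past the end; still no error)
```

So a check like input.slice(i, i + 2) === '->' simply evaluates to false near the end of the string instead of throwing.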

Variable characters - letters, digits, and underscores. This lets users write variable names like x1 or my_var:

function isVariableChar(chr) {
  return /[A-Za-z0-9_]/.test(chr);
}

Reading Variable Names

Since variable names can be more than one character, we need a function that reads the entire name starting from a given position:

function getVarName(input, i) {
  let varName = '';

  while (i < input.length && isVariableChar(input[i])) {
    varName += input[i];
    i++;
  }

  return varName;
}

It starts at position i and consumes characters as long as they’re valid variable characters. The resulting string is returned, and the caller advances i by varName.length.
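For example (repeating both helpers so the snippet is self-contained):

```javascript
function isVariableChar(chr) {
  return /[A-Za-z0-9_]/.test(chr);
}

function getVarName(input, i) {
  let varName = '';

  while (i < input.length && isVariableChar(input[i])) {
    varName += input[i];
    i++;
  }

  return varName;
}

console.log(getVarName('my_var & x1', 0)); // 'my_var'
console.log(getVarName('my_var & x1', 9)); // 'x1'
```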

The Main Tokenize Function

Now we put it all together. The tokenize function walks through the input character by character. At each position it checks what kind of token starts there, creates the appropriate token, and advances the index past whatever it consumed:

function tokenize(input) {
  const tokens = [];
  let i = 0;

  while (i < input.length) {
    // Whitespace is skipped without emitting a token.
    if (isWhiteSpace(input, i)) {
      i++;
      continue;
    }

    if (isParenthesis(input, i)) {
      tokens.push({ type: 'paren', value: input[i] });
      i++;
      continue;
    }

    // Single-character operators: ~, !, ¬, &, ∧, |, ∨.
    if (isNotOperator(input, i) || isAndOperator(input, i) || isOrOperator(input, i)) {
      tokens.push({ type: 'operator', value: input[i] });
      i++;
      continue;
    }

    // Two-character operator; '=>' is normalized to '->'.
    if (isImpliedOperator(input, i)) {
      tokens.push({ type: 'operator', value: '->' });
      i += 2;
      continue;
    }

    // Three-character operator; '<=>' is normalized to '<->'.
    if (isIfAndOnlyIfOperator(input, i)) {
      tokens.push({ type: 'operator', value: '<->' });
      i += 3;
      continue;
    }

    // Variable names may span several characters, so read the whole name
    // and jump past it.
    if (isVariableChar(input[i])) {
      const varName = getVarName(input, i);
      tokens.push({ type: 'variable', value: varName });
      i += varName.length;
      continue;
    }

    // Nothing matched: report the position and the offending character.
    throw new Error('Unknown token at position ' + i + ": '" + input[i] + "'");
  }

  return tokens;
}
Important:

In general, a tokenizer must match the longest operator first. If we only recognized the two-character ->, the biconditional <-> would break: at the < none of the checks would match, and tokenize would throw an "Unknown token" error. In our code the implication and biconditional checks can safely appear in either order, because each helper is keyed on the operator's first character - isIfAndOnlyIfOperator only fires on <, while isImpliedOperator only fires on - or =.

If none of the checks match, we throw an error with the position and the offending character. This gives the user a clear message like: Unknown token at position 5: '#'.

Worked Example

Let’s trace through tokenizing (A & B) -> C:

Position | Character(s) | Action           | Token
-------- | ------------ | ---------------- | ---------------------------------
0        | (            | Parenthesis      | { type: 'paren', value: '(' }
1        | A            | Variable         | { type: 'variable', value: 'A' }
2        | (space)      | Whitespace       | skip
3        | &            | And operator     | { type: 'operator', value: '&' }
4        | (space)      | Whitespace       | skip
5        | B            | Variable         | { type: 'variable', value: 'B' }
6        | )            | Parenthesis      | { type: 'paren', value: ')' }
7        | (space)      | Whitespace       | skip
8–9      | ->           | Implies operator | { type: 'operator', value: '->' }
10       | (space)      | Whitespace       | skip
11       | C            | Variable         | { type: 'variable', value: 'C' }

Result: an array of 7 tokens, ready for the parser.
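If you want to try this end to end, here is a condensed, self-contained version of the tokenizer (the same logic as above, with the classification helpers inlined) that you can paste into a browser console or Node REPL:

```javascript
// Condensed tokenizer: same behaviour as the version built above,
// with the classification helpers inlined.
function tokenize(input) {
  const tokens = [];
  let i = 0;
  while (i < input.length) {
    const c = input[i];
    if (/\s/.test(c)) { i++; continue; }
    if (c === '(' || c === ')') { tokens.push({ type: 'paren', value: c }); i++; continue; }
    if ('~!¬&∧|∨'.includes(c)) { tokens.push({ type: 'operator', value: c }); i++; continue; }
    if (input.slice(i, i + 3) === '<->' || input.slice(i, i + 3) === '<=>') {
      tokens.push({ type: 'operator', value: '<->' }); i += 3; continue;
    }
    if (input.slice(i, i + 2) === '->' || input.slice(i, i + 2) === '=>') {
      tokens.push({ type: 'operator', value: '->' }); i += 2; continue;
    }
    if (/[A-Za-z0-9_]/.test(c)) {
      let name = '';
      while (i < input.length && /[A-Za-z0-9_]/.test(input[i])) name += input[i++];
      tokens.push({ type: 'variable', value: name });
      continue;
    }
    throw new Error("Unknown token at position " + i + ": '" + c + "'");
  }
  return tokens;
}

console.log(tokenize('(A & B) -> C').map(t => t.value).join(' '));
// "( A & B ) -> C"

try {
  tokenize('A # B');
} catch (e) {
  console.log(e.message); // "Unknown token at position 2: '#'"
}
```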

What’s Next

We now have a flat list of tokens, but we don’t yet understand the structure of the expression - which operators bind tighter, where the sub-expressions are, or how parentheses group things.

In Chapter 3: Parsing with Recursive Descent, we’ll build a parser that consumes this token list and produces an Abstract Syntax Tree (AST), respecting operator precedence and parenthesization.


Previous: Chapter 1: Setting Up the Frontend

Series: Truth Table, CNF, and DNF Generator