Engineering · 7 min read

How MathExec Compiles LaTeX to PyTorch

A deep dive into MathExec's formula compiler: how we parse LaTeX expressions and generate equivalent PyTorch modules with trainable parameters.

Kingsley Michael

February 25, 2026

How MathExec Compiles LaTeX to PyTorch

One of the core pieces of MathExec is the formula compiler: the system that takes a LaTeX expression like y = σ(W₂ · ReLU(W₁x + b₁) + b₂) and produces a working PyTorch nn.Module with the right parameters, shapes, and forward pass.

This post walks through how it works, what design decisions we made along the way, and where we ran into interesting problems.

The pipeline

LaTeX string → Token stream → AST → PyTorch Module

Each stage handles a specific kind of complexity, and keeping them separate makes the compiler easier to test and extend.

Step 1: Tokenization

The compiler first tokenizes the LaTeX into meaningful symbols:

"y = \sigma(W_2 \cdot ReLU(W_1 x + b_1) + b_2)"
# →
['y', '=', 'sigma', '(', 'W_2', 'cdot', 'ReLU', '(', 'W_1', 'x', '+', 'b_1', ')', '+', 'b_2', ')']

We handle LaTeX commands (\sigma, \cdot, \frac), subscripts (W_1), superscripts (x^2), and Greek letters. The tokenizer also normalizes variant representations: \sigmoid, \sigma, and σ all become the same token.

One tricky part is distinguishing between subscripts that are part of a variable name (like W_1 meaning "weight matrix 1") and subscripts that have mathematical meaning (like x_i meaning "the i-th element of x"). We treat numeric subscripts as name qualifiers and alphabetic subscripts as indexing operations, which works for the vast majority of ML formulas.
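
As a rough illustration, a regex-based tokenizer along these lines handles both concerns. The alias table, token names, and helper below are simplified stand-ins, not the actual MathExec implementation:

import re

# Variant spellings normalize to a single token (illustrative subset).
ALIASES = {r"\sigma": "sigma", r"\sigmoid": "sigma", "σ": "sigma", r"\cdot": "cdot"}

TOKEN_RE = re.compile(
    r"\\[A-Za-z]+"              # LaTeX commands: \sigma, \cdot, \frac
    r"|[A-Za-z]+_[A-Za-z0-9]+"  # subscripted names: W_1, x_i
    r"|[A-Za-z]+"               # bare identifiers: ReLU, x, y
    r"|\d+(?:\.\d+)?"           # numeric literals
    r"|[σαβγ]"                  # Greek letters typed directly
    r"|[=+\-*/^(){}]"           # operators and grouping
)

def tokenize(latex):
    return [ALIASES.get(m.group(), m.group().lstrip("\\"))
            for m in TOKEN_RE.finditer(latex)]

def subscript_kind(token):
    """Numeric subscripts (W_1) qualify a name; alphabetic ones (x_i) index."""
    _, _, sub = token.partition("_")
    return "name qualifier" if sub.isdigit() else "index"

print(tokenize(r"y = \sigma(W_2 \cdot ReLU(W_1 x + b_1) + b_2)"))
print(subscript_kind("W_1"), "|", subscript_kind("x_i"))   # name qualifier | index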

Step 2: AST construction

Tokens are parsed into an abstract syntax tree following mathematical operator precedence:

  • Function application binds tightest: σ(...), ReLU(...)
  • Multiplication (explicit \cdot or implicit juxtaposition): W₁x
  • Addition/subtraction: ... + b₁
  • Comparison/assignment: y = ...

Implicit multiplication is one of the more interesting parsing challenges. In mathematical notation, Wx means W times x, but there's no operator between them. The parser inserts an implicit multiplication node whenever two operands appear adjacent without an operator. This also handles cases like 2x (scalar times variable) and W₁W₂x (chained matrix multiplications).

Parentheses and function application are handled by recursive descent. When the parser sees a known function name followed by (, it consumes everything up to the matching ) as the function's argument. This naturally handles nested expressions like σ(ReLU(...)).
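
A condensed recursive-descent parser in that spirit, continuing from the token stream above. The tuple-based AST and the KNOWN_FUNCS set are illustrative simplifications of what the real compiler uses:

KNOWN_FUNCS = {"sigma", "ReLU", "softmax", "tanh", "exp", "log", "sqrt"}

class Parser:
    def __init__(self, tokens):
        self.toks, self.pos = tokens, 0

    def peek(self):
        return self.toks[self.pos] if self.pos < len(self.toks) else None

    def eat(self, expected=None):
        tok = self.toks[self.pos]
        if expected is not None and tok != expected:
            raise SyntaxError(f"expected {expected!r}, got {tok!r} at token {self.pos}")
        self.pos += 1
        return tok

    def parse_expr(self):      # addition and subtraction (loosest binding here)
        node = self.parse_term()
        while self.peek() in ("+", "-"):
            node = (self.eat(), node, self.parse_term())
        return node

    def parse_term(self):      # explicit \cdot and implicit multiplication
        node = self.parse_factor()
        while True:
            if self.peek() == "cdot":
                self.eat()
                node = ("mul", node, self.parse_factor())
            elif self.peek() not in (None, "+", "-", ")", "="):
                # two operands adjacent with no operator: implicit multiply
                node = ("mul", node, self.parse_factor())
            else:
                return node

    def parse_factor(self):    # function application, parentheses, atoms
        tok = self.eat()
        if tok in KNOWN_FUNCS:           # σ(...), ReLU(...): consume the argument
            self.eat("(")
            arg = self.parse_expr()
            self.eat(")")
            return ("call", tok, arg)
        if tok == "(":
            node = self.parse_expr()
            self.eat(")")
            return node
        return ("var", tok)

# Right-hand side of y = \sigma(W_2 \cdot ReLU(W_1 x + b_1) + b_2)
rhs = ['sigma', '(', 'W_2', 'cdot', 'ReLU', '(', 'W_1', 'x', '+', 'b_1', ')', '+', 'b_2', ')']
print(Parser(rhs).parse_expr())
# ('call', 'sigma', ('+', ('mul', ('var', 'W_2'), ('call', 'ReLU', ...)), ('var', 'b_2')))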

Step 3: Code generation

The AST is walked to emit PyTorch code. Each node maps to a PyTorch operation:

LaTeX           PyTorch                          Notes
Wx              nn.Linear(in, out)               Capital letter = weight matrix
Wx + b          nn.Linear(in, out, bias=True)    Bias detected and folded in
σ(...)          torch.sigmoid(...)               Sigmoid activation
ReLU(...)       torch.relu(...)                  ReLU activation
softmax(...)    F.softmax(..., dim=-1)           Softmax with last-dim default
tanh(...)       torch.tanh(...)                  Hyperbolic tangent
x^2             x ** 2                           Element-wise power
\frac{a}{b}     a / b                            Division
\sqrt{x}        torch.sqrt(x)                    Square root
\exp(x)         torch.exp(x)                     Exponential
\log(x)         torch.log(x)                     Natural logarithm

When the compiler encounters a Wx + b pattern (linear transformation plus bias), it folds both into a single nn.Linear layer with bias=True rather than creating separate weight and bias parameters. This is both more efficient and produces cleaner generated code.
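
A sketch of that folding rule, using the toy tuple AST from the parser sketch above. Node shapes, function names, and the in_dim/out_dim placeholders are illustrative; the real code generator tracks dimensions separately:

def is_linear_pattern(node):
    """Match ('+', ('mul', ('var', W), inner), ('var', b)) with W uppercase, b lowercase."""
    if node[0] != "+":
        return None
    left, right = node[1], node[2]
    if (left[0] == "mul" and left[1][0] == "var" and left[1][1][0].isupper()
            and right[0] == "var" and right[1][0].islower()):
        return left[1][1], left[2], right[1]     # weight name, input expr, bias name
    return None

def emit(node, layers):
    """Return a forward-pass expression; append nn.Linear declarations to `layers`."""
    match = is_linear_pattern(node)
    if match:
        weight, inner, bias = match
        layers.append(f"self.{weight} = nn.Linear(in_dim, out_dim, bias=True)  # folds {weight} and {bias}")
        return f"self.{weight}({emit(inner, layers)})"
    kind = node[0]
    if kind == "call":
        fn = {"sigma": "torch.sigmoid", "ReLU": "torch.relu", "tanh": "torch.tanh"}[node[1]]
        return f"{fn}({emit(node[2], layers)})"
    if kind == "var":
        return node[1]
    op = {"+": " + ", "-": " - ", "mul": " @ "}[kind]
    return f"({emit(node[1], layers)}{op}{emit(node[2], layers)})"

ast = ("call", "sigma",
       ("+", ("mul", ("var", "W_2"),
              ("call", "ReLU",
               ("+", ("mul", ("var", "W_1"), ("var", "x")), ("var", "b_1")))),
        ("var", "b_2")))
declarations = []
print(emit(ast, declarations))   # torch.sigmoid(self.W_2(torch.relu(self.W_1(x))))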

Step 4: Shape inference

Parameter shapes are inferred from the data at training time, not at compile time. When you provide a CSV with 10 input columns, the compiler sets input_dim=10 on the first layer. Output dimensions are inferred from the target variable.

Hidden layer sizes are a harder problem because they're not specified in the formula. y = σ(W₂ · ReLU(W₁x + b₁) + b₂) tells us there's a hidden layer, but not how wide it should be. We default to 64 hidden units, which is a reasonable starting point for most tabular datasets. You can override this in the training configuration.

For deeper networks, each intermediate dimension defaults to the same hidden size. We've found that 64 works well for datasets under 10,000 rows, and users who need larger networks usually know enough to adjust the setting.
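
Sketched out, the inference step is little more than reading the data's shape. The pandas usage, the categorical-output rule, and the function name below are assumptions for the example; the real logic also applies overrides from the training configuration:

import pandas as pd

DEFAULT_HIDDEN = 64   # default hidden width, overridable in the training configuration

def infer_dims(csv_path, target_column, n_hidden_layers=1):
    df = pd.read_csv(csv_path)
    input_dim = df.shape[1] - 1                          # every column except the target
    target = df[target_column]
    # One output per class for categorical targets, a single output for regression.
    output_dim = target.nunique() if target.dtype == object else 1
    hidden_dims = [DEFAULT_HIDDEN] * n_hidden_layers     # same width at every hidden layer
    return input_dim, hidden_dims, output_dim

# A CSV with 10 feature columns plus a numeric target would give (10, [64], 1).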

Handling ambiguity

Mathematical notation is inherently ambiguous. Wx + b could mean matrix multiplication or element-wise multiplication. We use conventions that match how most ML textbooks write formulas:

  • Capital letters (W, M, A) → weight matrices (nn.Linear)
  • Lowercase letters (b, c) → bias vectors (nn.Parameter)
  • Greek letters (α, β, γ) → scalar parameters
  • x, X → input data (not trainable)
  • y → output/target

These conventions handle about 95% of ML formulas correctly. For the remaining 5%, users can use explicit annotations in the formula editor to override the compiler's assumptions.
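
A compact way to express the convention table; the category labels and the Greek-letter set here are illustrative rather than MathExec's exact internal names:

GREEK = set("αβγδλμ") | {"alpha", "beta", "gamma"}

def classify_symbol(name):
    base = name.split("_")[0]          # numeric qualifiers like W_1 reduce to W
    if base in ("x", "X"):
        return "input (not trainable)"
    if base == "y":
        return "output / target"
    if base in GREEK:
        return "scalar parameter (nn.Parameter)"
    if base[0].isupper():
        return "weight matrix (nn.Linear)"
    return "bias vector (nn.Parameter)"

print({s: classify_symbol(s) for s in ["W_1", "b_1", "x", "y", "alpha"]})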

Another source of ambiguity is operator precedence with implicit multiplication. Does 2Wx mean (2W)x or 2(Wx)? Since scalar multiplication associates with the matrix product, the result is the same either way. But for an expression like ReLU Wx + b, the parser needs to recognize that ReLU is a function being applied, not a variable being multiplied. We maintain a dictionary of known function names to resolve this.

Loss function selection

The compiler also selects an appropriate loss function based on the output activation:

  • σ(...) (sigmoid output) → BCELoss (binary cross-entropy)
  • softmax(...) → CrossEntropyLoss
  • No activation or linear output → MSELoss (mean squared error)

This heuristic is right for the most common cases. If you're doing something unusual (like regression with a sigmoid output for bounded predictions), you can override the loss function in the training panel.
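
The selection itself is a short lookup keyed on the outermost activation. A sketch, with the override argument standing in for the training-panel setting:

import torch.nn as nn

def select_loss(output_activation=None, override=None):
    if override is not None:
        return getattr(nn, override)()     # e.g. override="MSELoss" for bounded regression
    if output_activation == "sigma":
        return nn.BCELoss()                # binary cross-entropy for sigmoid outputs
    if output_activation == "softmax":
        return nn.CrossEntropyLoss()       # multi-class classification
    return nn.MSELoss()                    # linear output: mean squared error

print(select_loss("sigma"))               # BCELoss()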

Example: 2-layer MLP

Input:

y = \sigma(W_2 \cdot ReLU(W_1 x + b_1) + b_2)

Generated PyTorch:

import torch
import torch.nn as nn

class FormulaModel(nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        hidden = 64  # default hidden size
        self.layer1 = nn.Linear(input_dim, hidden)
        self.layer2 = nn.Linear(hidden, output_dim)

    def forward(self, x):
        h = torch.relu(self.layer1(x))
        return torch.sigmoid(self.layer2(h))
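
The result drops into a normal training loop like any hand-written module. A quick smoke test, with arbitrary dimensions:

model = FormulaModel(input_dim=10, output_dim=1)
batch = torch.randn(32, 10)          # 32 rows, 10 features
preds = model(batch)
print(preds.shape)                   # torch.Size([32, 1]); values lie in (0, 1) from the sigmoid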

The compiler handles nested expressions of arbitrary depth, so you can write complex architectures as a single formula. A 4-layer network with mixed activations works just as well:

y = softmax(W_4 \cdot ReLU(W_3 \cdot tanh(W_2 \cdot ReLU(W_1 x + b_1) + b_2) + b_3) + b_4)

Error handling and edge cases

Not every formula compiles cleanly. The compiler needs to handle malformed input gracefully.

Unbalanced parentheses are the most common error. y = σ(Wx + b is missing a closing paren. The compiler detects this during parsing and reports the location of the mismatch.
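
That check is a single pass over the token stream with a stack. A simplified sketch, with illustrative error wording:

def check_parens(tokens):
    """Report the position of the first unbalanced parenthesis."""
    stack = []
    for i, tok in enumerate(tokens):
        if tok == "(":
            stack.append(i)
        elif tok == ")":
            if not stack:
                raise SyntaxError(f"unexpected ')' at token {i}")
            stack.pop()
    if stack:
        raise SyntaxError(f"'(' at token {stack[-1]} was never closed")

try:
    check_parens(['sigma', '(', 'W', 'x', '+', 'b'])    # y = σ(Wx + b, missing ')'
except SyntaxError as err:
    print(err)                                          # '(' at token 1 was never closed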

Unknown functions like y = foo(Wx + b) produce a warning. If the function name doesn't match any known activation or operation, the compiler treats it as a user-defined function and falls back to a no-op, with a message suggesting alternatives.

Circular definitions like y = y + 1 are caught during AST analysis. The compiler checks that the output variable doesn't appear on the right-hand side in a way that would create a feedback loop (skip connections like y = f(x) + x are fine because x is the input, not the output).
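
With the tuple AST from earlier, the check amounts to collecting the variable names on the right-hand side; a simplified sketch:

def rhs_variables(node):
    """Collect every variable name appearing in an expression tree."""
    if node[0] == "var":
        return {node[1]}
    names = set()
    for child in node[1:]:
        if isinstance(child, tuple):
            names |= rhs_variables(child)
    return names

def check_not_circular(output_var, rhs_ast):
    """Reject definitions like y = y + 1; skip connections on x are still fine."""
    if output_var in rhs_variables(rhs_ast):
        raise ValueError(f"'{output_var}' appears on both sides of the formula")

check_not_circular("y", ("+", ("call", "sigma", ("var", "x")), ("var", "x")))  # fine: y = σ(x) + x
# check_not_circular("y", ("+", ("var", "y"), ("var", "1")))  # would raise ValueError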

Type mismatches between operations (like adding a scalar to a matrix in a way that doesn't broadcast) are caught at training time when actual tensor shapes are known. The compiler generates code that's structurally valid, but shape errors only surface when data flows through the model.

We've found that clear error messages matter more than perfect error detection. Users don't mind if the compiler occasionally produces code that fails at training time, as long as the error message tells them what went wrong and how to fix it.

Performance considerations

The compilation step itself takes 10-50 milliseconds for typical formulas. It's not a bottleneck. The generated PyTorch code is standard nn.Module code with no overhead compared to hand-written models. There's no interpreter or formula evaluation at runtime; the compilation is a one-time code generation step.

For training, performance depends on the model size and dataset. Tabular models with a few hundred parameters train in seconds. Larger architectures (4+ layers, 256+ hidden units) on datasets with tens of thousands of rows might take a minute or two. This is the same performance you'd get from equivalent hand-written PyTorch, since that's exactly what the compiler generates.

We benchmarked the generated code against hand-written equivalents on several standard datasets (Iris, Boston Housing, MNIST subsets) and the training time difference is negligible, within 1-2% on average. The compiler doesn't introduce abstractions that slow things down. The output is the same nn.Module code you'd write yourself.

What's next

We're working on expanding the compiler to handle more architecture types:

  • Attention mechanisms: softmax(QK^T / \sqrt{d})V
  • Convolutions: Spatial operators in formula notation
  • Recurrent structures: Formulas with temporal subscripts
  • Skip connections: y = f(x) + x residual patterns (partially supported already)

The goal is to cover the architectures you'd find in a typical ML course or research paper, so you can go from reading the paper to training the model in under a minute.


Want to try it? Write a formula in MathExec and see what it compiles to.
