The Formula Compiler Problem: Why Parsing Math is Harder Than Parsing Code
Programming languages are designed to be parsed. They have formal grammars, unambiguous syntax, and explicit operators. Python doesn't make you guess whether * means multiplication or a footnote.
Mathematical notation is different. It evolved over centuries for human readers, not machines. The same symbol can mean different things in different contexts. Operators are often implicit. Conventions vary between fields. When we set out to build a LaTeX-to-PyTorch compiler, we expected the hard part to be code generation. We were wrong. The hard part is parsing.
This post documents the ambiguities we encountered and how we resolved them.
Ambiguity 1: Implicit multiplication
In math, Wx means W times x. There's no operator. In code, Wx is a single variable name. This is the most fundamental difference between mathematical and programming notation.
The formula W₁W₂x means "multiply W₁ by W₂ by x" (a chain of matrix multiplications). But the parser sees three tokens with no operators between them. It has to infer the multiplication.
Our rule: whenever two non-operator tokens appear adjacent, insert an implicit multiplication node in the AST. This handles:
- Wx → W * x
- 2x → 2 * x
- W₁W₂x → W₁ * W₂ * x
- ab → a * b
The tricky case is ab vs abs. Is that a * b or the abs function? We maintain a dictionary of known function names (abs, sigmoid, ReLU, softmax, etc.) and check against it before inserting implicit multiplication. If the token sequence matches a known function, it's a function call. Otherwise, it's multiplication.
This works most of the time. It fails when someone names a variable log or uses a function name we don't recognize. But in ML formulas, this is rare.
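In code, the rule and the dictionary check look roughly like this. This is a minimal sketch, not the actual MathExec internals; the token representation and helper names are illustrative:

```python
# Illustrative sketch of the implicit-multiplication rule.
KNOWN_FUNCTIONS = {"abs", "sigmoid", "ReLU", "softmax", "tanh", "exp", "log", "sqrt"}

def expand_letter_run(run):
    """Disambiguate a run of letters: 'abs' is a function, 'ab' is a * b."""
    if run in KNOWN_FUNCTIONS:
        return [run]
    tokens = []
    for ch in run:
        if tokens:
            tokens.append("*")  # implicit multiplication between variables
        tokens.append(ch)
    return tokens

def insert_implicit_mul(tokens):
    """Insert '*' between any two adjacent operand tokens."""
    out = []
    for tok in tokens:
        if out and out[-1][-1].isalnum() and tok[0].isalnum():
            out.append("*")
        out.append(tok)
    return out

print(expand_letter_run("ab"))                   # ['a', '*', 'b']
print(expand_letter_run("abs"))                  # ['abs']
print(insert_implicit_mul(["W_1", "W_2", "x"]))  # ['W_1', '*', 'W_2', '*', 'x']
```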
Ambiguity 2: Subscript semantics
W₁ means "weight matrix number 1." x_i means "the i-th element of x." Same syntactic structure (base + subscript), completely different meanings.
The distinction matters because W₁ tells the compiler to create a separate nn.Linear layer (layer 1), while x_i is an indexing operation on the input tensor.
Our heuristic: numeric subscripts on capital letters are layer indices (W₁, W₂ → layer 1, layer 2). Alphabetic subscripts on lowercase letters are indexing (x_i → x[:, i]). This covers the standard ML conventions.
Where it breaks: a₁ is ambiguous. Is it "parameter number 1" or "the first element of vector a"? We default to treating it as a named parameter (like a_1), which matches how most textbooks use numbered lowercase variables (bias terms, coefficients).
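The heuristic itself is small. A sketch, assuming a simple (base, subscript) pair rather than our real AST types:

```python
def classify_subscript(base, sub):
    """Decide whether base_sub is a layer index, tensor indexing, or a parameter."""
    if base.isupper() and sub.isdigit():
        return ("layer", int(sub))        # W_1 -> separate nn.Linear (layer 1)
    if base.islower() and sub.isalpha():
        return ("index", sub)             # x_i -> x[:, i]
    # Ambiguous cases like a_1 default to a named parameter
    return ("parameter", f"{base}_{sub}")

print(classify_subscript("W", "1"))  # ('layer', 1)
print(classify_subscript("x", "i"))  # ('index', 'i')
print(classify_subscript("a", "1"))  # ('parameter', 'a_1')
```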
Ambiguity 3: What is a function vs. a variable?
In ReLU(Wx + b), ReLU is clearly a function. But what about f(x) in y = f(x) + x? Is f a known function, a user-defined function, or a variable being multiplied by (x)?
We use a two-pass approach:
- First pass: Check against a dictionary of known operations (sigmoid, ReLU, tanh, softmax, exp, log, sqrt, etc.). If the token matches, it's a function call.
- Second pass: If the token is an unknown word followed by (, treat it as a generic function. We can't compile it to a specific PyTorch operation, but we can flag it for the user or fall back to LLM interpretation.
The fallback is important. Mathematical notation is open-ended. New function names appear in every paper. We can't maintain a complete dictionary. Instead, we handle the common cases deterministically and use LLM-assisted compilation for everything else.
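A condensed sketch of the two passes. The KNOWN_OPS table and the returned tags are illustrative names, and the real compiler works on AST nodes rather than raw strings:

```python
import torch

# Pass 1 dictionary: names we can compile directly to PyTorch calls.
KNOWN_OPS = {
    "sigmoid": torch.sigmoid, "ReLU": torch.relu, "tanh": torch.tanh,
    "exp": torch.exp, "log": torch.log, "sqrt": torch.sqrt,
}

def classify_token(name, next_token):
    # Pass 1: known operation -> compile to a specific PyTorch call.
    if name in KNOWN_OPS:
        return ("known_function", KNOWN_OPS[name])
    # Pass 2: unknown word followed by '(' -> generic function;
    # flag it for the user or route it to the LLM fallback.
    if next_token == "(" and name.isalpha():
        return ("generic_function", name)
    # Otherwise it's a variable; adjacency means implicit multiplication.
    return ("variable", name)

print(classify_token("ReLU", "("))  # known_function
print(classify_token("mish", "("))  # generic_function -> LLM fallback
print(classify_token("W", "x"))     # variable
```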
Ambiguity 4: Variable roles
In y = Wx + b, what are W, x, b, and y?
A programmer would say: "they're all variables." But in ML, they have specific roles:
- W is a learnable weight matrix
- b is a learnable bias vector
- x is the input (not trainable)
- y is the output (not trainable)
The compiler needs to know these roles to generate the right PyTorch code. W becomes nn.Linear, b becomes the bias term inside nn.Linear, x is the forward pass input, and y is what the forward method returns.
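For y = Wx + b, the generated module looks roughly like this. This is a representative sketch of the output; in_features and out_features are dimensions the compiler infers or asks the user for:

```python
import torch.nn as nn

class GeneratedModel(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        # W and b together become one nn.Linear: W is the weight, b the bias.
        self.linear = nn.Linear(in_features, out_features, bias=True)

    def forward(self, x):        # x: the non-trainable input
        y = self.linear(x)       # y = Wx + b
        return y                 # y: what the forward method returns
```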
We use capitalization and naming conventions:
| Pattern | Role | PyTorch mapping |
|---|---|---|
| Capital letters (W, M, A) | Weight matrices | nn.Linear parameters |
| Lowercase b, c | Biases | bias=True in nn.Linear |
| x, X | Input data | Forward pass argument |
| y | Output | Forward pass return |
| Greek letters (α, β) | Scalar parameters | nn.Parameter |
| Numbers (2, 3.14) | Constants | Literal values |
This convention-based approach handles about 95% of ML formulas. For the other 5%, the compiler asks the user to clarify (via annotations in the formula editor) or uses the LLM fallback.
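The table above can be encoded as a handful of ordered rules. This sketch is illustrative; the role labels and pattern checks are simplified stand-ins for the compiler's actual classifier:

```python
import re

def infer_role(name):
    if re.fullmatch(r"[0-9.]+", name):
        return "constant"          # 2, 3.14 -> literal values
    if name in ("x", "X"):
        return "input"             # forward pass argument
    if name == "y":
        return "output"            # forward pass return
    if name[0] in "αβγλ":
        return "scalar_parameter"  # Greek letters -> nn.Parameter
    if name[0].isupper():
        return "weight"            # W, M, A -> nn.Linear parameters
    if name[0] in ("b", "c"):
        return "bias"              # bias=True in nn.Linear
    return "unknown"               # ask the user or fall back to the LLM

for v in ("W", "x", "b", "α", "3.14", "z"):
    print(v, "->", infer_role(v))
```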
Ambiguity 5: Operator precedence with mixed notation
Consider ReLU W₁x + b₁. Is this ReLU(W₁x + b₁) or ReLU(W₁x) + b₁?
In standard math, function application has the highest precedence, so ReLU W₁x + b₁ should mean ReLU(W₁ * x) + b₁. But in ML papers, people almost always write ReLU(W₁x + b₁) with explicit parentheses.
When parentheses are present, there's no ambiguity. When they're missing, we use a heuristic: activation functions (ReLU, sigmoid, tanh) are assumed to wrap the entire remaining expression up to the next closing parenthesis or end of scope. This matches ML convention, not strict mathematical precedence.
We log a warning when this heuristic kicks in, because the user might have meant something different.
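A token-level sketch of the heuristic. The real parser does this on the AST, and nested scopes need more care than this shows:

```python
import logging

ACTIVATIONS = {"ReLU", "sigmoid", "tanh"}

def wrap_bare_activation(tokens):
    """Insert parentheses after a bare activation, extending to the next
    closing parenthesis or end of scope: ReLU W1 x + b1 -> ReLU(W1 x + b1)."""
    out, i = [], 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in ACTIVATIONS and (i + 1 == len(tokens) or tokens[i + 1] != "("):
            logging.warning("Assuming %s wraps the remaining expression", tok)
            # End of scope: the next ')' or the end of the token stream.
            j = i + 1
            while j < len(tokens) and tokens[j] != ")":
                j += 1
            out += [tok, "("] + tokens[i + 1:j] + [")"] + tokens[j:j + 1]
            i = j + 1
        else:
            out.append(tok)
            i += 1
    return out

print(wrap_bare_activation(["ReLU", "W1", "x", "+", "b1"]))
# ['ReLU', '(', 'W1', 'x', '+', 'b1', ')']
```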
Comparison to programming language parsing
| Aspect | Programming languages | Math notation |
|---|---|---|
| Grammar | Formal, unambiguous | Informal, contextual |
| Multiplication | Explicit (*, @) | Implicit (Wx) |
| Function calls | f(x) is always a call | f(x) might be f * x |
| Variable types | Declared or inferred | Inferred from conventions |
| Operator precedence | Defined by spec | "Whatever the reader expects" |
| Whitespace | Usually insignificant | Sometimes significant (d x vs dx) |
Programming languages are designed for machines to parse deterministically. Math notation is designed for humans to read with context. Building a compiler that bridges the two means encoding human conventions as formal rules, which is inherently incomplete.
What we don't even try to handle
Some mathematical notation is genuinely beyond rule-based compilation:
Summation with complex bounds: y = Σᵢ₌₁ⁿ wᵢxᵢ (weighted sum) could compile to a simple linear layer, but the subscript indexing pattern is hard to parse reliably. We handle the simple case (Σ over one variable) but not nested or conditional summations.
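For the simple case, the mapping is direct: a weighted sum over one index is a linear layer without bias. A sketch, with an assumed input size n:

```python
import torch
import torch.nn as nn

# y = Σᵢ₌₁ⁿ wᵢxᵢ compiles to a single linear layer without bias.
n = 8
layer = nn.Linear(n, 1, bias=False)   # learnable w_1 .. w_n

x = torch.randn(1, n)
y = layer(x)                          # same as (layer.weight * x).sum()
print(torch.allclose(y, (layer.weight * x).sum()))  # True
```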
Piecewise functions: f(x) = {x if x > 0, 0 otherwise} is ReLU, but recognizing piecewise definitions and mapping them to known functions requires pattern matching against an open-ended set of cases.
Integral and differential operators: These appear in physics-informed neural networks but map to computational graphs that don't have simple PyTorch equivalents.
For these cases, we fall back to LLM-assisted compilation, where a language model reads the LaTeX and generates appropriate PyTorch code. This works surprisingly well for common patterns but isn't deterministic, which means you might get slightly different code on repeated compilations.
The 95/5 split
After building the compiler, we found that about 95% of formulas that users actually type into MathExec fall into patterns that can be parsed deterministically. The remaining 5% require either LLM assistance or user clarification.
The 95% covers: linear models, polynomial models, logistic regression, MLPs of any depth with standard activations, and named architectures (LSTM, GRU, CNN, Transformer). These are the formulas people use most.
The 5% covers: attention mechanisms written in full mathematical notation, custom loss functions, unusual activation functions, and physics-inspired formulas. These are important but less frequent in day-to-day use.
Our approach is to keep expanding the deterministic coverage while maintaining the LLM fallback for everything else. Each formula that fails deterministic compilation is a signal to add a new pattern to the parser.
Lessons for anyone building a math parser
If you're building something that needs to parse mathematical notation (and not just render it, which MathJax and KaTeX handle well), here's what we've learned:
Start with a known-function dictionary, not a grammar. Trying to write a formal grammar for mathematical notation is a rabbit hole. There are too many exceptions and conventions. A dictionary of recognized patterns plus heuristics for everything else gets you further, faster.
Test against real user input, not textbook examples. Users write LaTeX differently than textbooks. They omit parentheses, use non-standard function names, mix notation styles within a single formula. Your test suite should include messy, real-world input, not just clean examples.
Fall back gracefully. You won't handle 100% of formulas deterministically. Having an LLM fallback that generates "best effort" code is better than showing an error. The user can always inspect and correct the generated code.
Log everything. Every formula that hits the LLM fallback is a signal to improve the deterministic parser. We track which formulas trigger fallback and prioritize the most common patterns for rule-based coverage.
Curious how your formula compiles? Try it in MathExec and see the generated PyTorch code.