LaTeX to PyTorch: A Reference Taxonomy of Formula Patterns
If you've read an ML paper or textbook, you've seen the formulas. If you've implemented models in PyTorch, you've written the code. But there's no standard reference that maps one to the other.
This post is that reference. Each entry shows a LaTeX pattern, what it means architecturally, and what the equivalent PyTorch code looks like. These mappings are derived from MathExec's formula compiler, so every entry has been tested against real compilation and training.
Notation conventions
Before the taxonomy, you need to know the conventions. Mathematical notation in ML is surprisingly inconsistent, but most papers follow these defaults:
| Symbol | Meaning | PyTorch equivalent |
|---|---|---|
| x | Input features | x tensor of shape (batch_size, n_features) |
| y | Output / target | Target tensor |
| W, M, A (capitals) | Weight matrices | nn.Linear parameters |
| b, c (lowercase) | Bias vectors | bias=True in nn.Linear |
| α, β, γ (Greek) | Scalar parameters | nn.Parameter(torch.tensor(...)) |
| σ | Sigmoid function | torch.sigmoid() |
| Subscripts (W₁, b₂) | Layer index | Separate nn.Linear per index |
Category 1: Linear models
y = mx + b
The simplest possible model. One input feature, one weight, one bias.
# PyTorch equivalent
self.linear = nn.Linear(1, 1) # 2 parameters
y = Wx + b
Multivariate linear regression. The capital W indicates a weight matrix that handles multiple input features.
self.linear = nn.Linear(n_features, 1) # n_features + 1 parameters
y = Wx (no bias)
Sometimes you see the bias omitted. This compiles to nn.Linear with bias=False.
self.linear = nn.Linear(n_features, 1, bias=False) # n_features parameters
y = ax² + bx + c
Polynomial regression. The input x is feature-expanded (squared, cubed, etc.) before being fed to a linear model.
# Input: x → [x, x², x³, ...]
self.linear = nn.Linear(degree, 1) # degree + 1 parameters
Key insight: Polynomial regression is just linear regression on transformed features. The model itself is still linear.
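That insight can be sketched directly: expand the input into powers, then apply an ordinary linear layer. The helper name below is illustrative, not MathExec's generated code.

```python
import torch
import torch.nn as nn

def expand_poly(x, degree):
    # (batch, 1) -> (batch, degree): [x, x^2, ..., x^degree]
    return torch.cat([x ** d for d in range(1, degree + 1)], dim=1)

linear = nn.Linear(3, 1)           # degree 3 -> 3 weights + 1 bias = 4 parameters
x = torch.randn(8, 1)
y = linear(expand_poly(x, 3))      # still just linear regression on expanded features
```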
Category 2: Classification (single activation wrapping)
y = σ(Wx + b)
Logistic regression. The sigmoid σ squashes the linear output to [0, 1] for binary classification.
self.linear = nn.Linear(n_features, 1)
# In forward: torch.sigmoid(self.linear(x))
# Loss: BCELoss
Compilation rule: When the compiler sees σ(...) wrapping a linear expression, it uses binary cross-entropy loss automatically.
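Put together, the compiled model might look like the sketch below (class and variable names are illustrative, not MathExec's actual output): a single linear layer, sigmoid in forward, and BCELoss selected because σ wraps the output.

```python
import torch
import torch.nn as nn

class LogisticRegression(nn.Module):   # illustrative name
    def __init__(self, n_features):
        super().__init__()
        self.linear = nn.Linear(n_features, 1)

    def forward(self, x):
        return torch.sigmoid(self.linear(x))

model = LogisticRegression(n_features=4)
loss_fn = nn.BCELoss()                               # chosen because σ(...) wraps the output
x = torch.randn(8, 4)
y = torch.randint(0, 2, (8, 1)).float()
loss = loss_fn(model(x), y)
```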
y = softmax(Wx + b)
Multi-class classification. Softmax outputs a probability distribution over classes.
self.linear = nn.Linear(n_features, n_classes)
# In forward: F.softmax(self.linear(x), dim=-1)
# Loss: CrossEntropyLoss
Note: PyTorch's CrossEntropyLoss applies log-softmax internally and expects raw logits. For numerical stability, the generated code therefore typically passes logits straight to the loss (or uses log_softmax with NLLLoss) rather than applying softmax and then taking its log.
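The stable pattern looks like this: logits go to the loss unchanged, and softmax is applied only when probabilities are actually needed.

```python
import torch
import torch.nn as nn

linear = nn.Linear(4, 3)                        # n_features=4, n_classes=3
x = torch.randn(8, 4)
targets = torch.randint(0, 3, (8,))             # class indices, not one-hot

logits = linear(x)
loss = nn.CrossEntropyLoss()(logits, targets)   # no explicit softmax here
probs = torch.softmax(logits, dim=-1)           # only for reporting / inference
```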
Category 3: Multi-layer networks (nested activations)
y = σ(W₂ · ReLU(W₁x + b₁) + b₂)
Two-layer neural network for binary classification. The nesting depth tells you the layer count.
self.layer1 = nn.Linear(n_features, hidden_dim) # W₁, b₁
self.layer2 = nn.Linear(hidden_dim, 1) # W₂, b₂
# In forward:
# h = torch.relu(self.layer1(x))
# return torch.sigmoid(self.layer2(h))
Compilation rules for nested formulas:
- Each W_n · activation(...) pattern becomes one nn.Linear layer
- The innermost layer connects to n_features
- Hidden dimensions default to 64 (configurable)
- The outermost activation determines the loss function
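Applying those rules to y = σ(W₂ · ReLU(W₁x + b₁) + b₂) gives a module like the sketch below (names are illustrative; hidden_dim=64 is the stated default):

```python
import torch
import torch.nn as nn

class TwoLayerNet(nn.Module):                   # illustrative name
    def __init__(self, n_features, hidden_dim=64):
        super().__init__()
        self.layer1 = nn.Linear(n_features, hidden_dim)  # W₁, b₁ (innermost, connects to n_features)
        self.layer2 = nn.Linear(hidden_dim, 1)           # W₂, b₂

    def forward(self, x):
        h = torch.relu(self.layer1(x))                   # inner activation
        return torch.sigmoid(self.layer2(h))             # outer σ -> BCE loss

out = TwoLayerNet(n_features=5)(torch.randn(8, 5))
```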
y = W₃ · ReLU(W₂ · ReLU(W₁x + b₁) + b₂) + b₃
Three-layer MLP for regression (no outer activation = unbounded output = MSE loss).
self.layer1 = nn.Linear(n_features, hidden)
self.layer2 = nn.Linear(hidden, hidden)
self.layer3 = nn.Linear(hidden, 1)
# forward: relu → relu → linear
y = softmax(W₂ · ReLU(W₁x + b₁) + b₂)
Neural network for multi-class classification. Same structure as the sigmoid version, but softmax output + cross-entropy loss.
Depth detection
The compiler counts nesting levels to determine depth:
- σ(Wx + b) → 1 layer
- σ(W₂ · ReLU(W₁x + b₁) + b₂) → 2 layers
- σ(W₃ · ReLU(W₂ · ReLU(...) + b₂) + b₃) → 3 layers
Each subscript index on the weight matrices should be sequential. W₁, W₂, W₃ tells the compiler exactly how many layers to create.
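As a rough illustration only (this is not MathExec's actual parser), depth detection can be approximated by counting distinct weight-matrix subscripts:

```python
import re

def estimate_depth(formula: str) -> int:
    # Hypothetical sketch: count distinct subscripts on W (W₁, W₂, ... or W1, W2, ...).
    subs = set(re.findall(r"W([₁₂₃₄₅₆₇₈₉\d])", formula))
    if subs:
        return len(subs)
    return 1 if "W" in formula else 0   # bare W -> single layer
```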
Category 4: Activation variants
All of these are drop-in replacements for ReLU in the patterns above:
| LaTeX | PyTorch | When to use |
|---|---|---|
| ReLU(x) | torch.relu(x) | Default choice for hidden layers |
| tanh(x) | torch.tanh(x) | When outputs should be in [-1, 1] |
| σ(x) | torch.sigmoid(x) | Binary output or gating |
| GELU(x) | F.gelu(x) | Transformer-style networks |
| SiLU(x) | F.silu(x) | Swish activation, a smooth alternative to ReLU |
| ELU(x) | F.elu(x) | Smoother than ReLU near zero |
| LeakyReLU(x) | F.leaky_relu(x) | Avoids dead neurons |
These can be mixed within a single formula. y = σ(W₂ · tanh(W₁x + b₁) + b₂) uses tanh in the hidden layer and sigmoid at the output.
Category 5: Named architectures
Some formulas trigger specialized model classes rather than generic MLP compilation:
y = LSTM(x)
Maps to a recurrent model with LSTM cells followed by a linear output layer.
self.lstm = nn.LSTM(n_features, hidden_dim, batch_first=True)
self.fc = nn.Linear(hidden_dim, output_dim)
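A sketch of the corresponding forward pass; taking the last time step's hidden state is one common choice (the pooling strategy here is an assumption, not necessarily what the compiler emits):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=3, hidden_size=16, batch_first=True)
fc = nn.Linear(16, 1)

x = torch.randn(8, 20, 3)        # (batch, seq_len, n_features)
out, (h_n, c_n) = lstm(x)        # out: (batch, seq_len, hidden_size)
y = fc(out[:, -1, :])            # last time step -> (batch, 1)
```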
y = GRU(x)
Same structure as LSTM but with GRU cells (fewer parameters, often comparable performance).
y = Conv1d(x)
Maps to a 1D convolutional model, typically for sequence or time-series data.
self.conv1 = nn.Conv1d(n_features, 32, kernel_size=3)
self.pool = nn.AdaptiveAvgPool1d(1)
self.fc = nn.Linear(32, output_dim)
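One wrinkle worth showing: nn.Conv1d expects (batch, channels, length), so feature-last input has to be permuted first. The layout assumption below is illustrative.

```python
import torch
import torch.nn as nn

conv1 = nn.Conv1d(3, 32, kernel_size=3)
pool = nn.AdaptiveAvgPool1d(1)
fc = nn.Linear(32, 1)

x = torch.randn(8, 20, 3)            # (batch, seq_len, n_features)
h = conv1(x.permute(0, 2, 1))        # -> (batch, 32, seq_len - 2)
y = fc(pool(h).squeeze(-1))          # global average pool -> (batch, 1)
```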
y = Transformer(x)
Maps to a transformer encoder layer followed by pooling and a linear output.
encoder_layer = nn.TransformerEncoderLayer(d_model=n_features, nhead=4)
self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
self.fc = nn.Linear(n_features, output_dim)
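A sketch of the forward pass, with mean pooling over the sequence assumed as the pooling step. Note that d_model must be divisible by nhead, so n_features = 4 with nhead = 4 works here.

```python
import torch
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(d_model=4, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
fc = nn.Linear(4, 1)

x = torch.randn(8, 10, 4)          # (batch, seq_len, d_model)
y = fc(encoder(x).mean(dim=1))     # pool over the sequence -> (batch, 1)
```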
Category 6: Residual patterns
y = ReLU(W₂ · ReLU(W₁x + b₁) + b₂ + x)
The + x at the end creates a skip connection. The compiler detects this pattern and wraps it in a residual block where the input is added to the transformed output.
# forward:
h = torch.relu(self.layer1(x))
h = self.layer2(h)
return torch.relu(h + x) # residual connection
Important: Skip connections require the input and output dimensions to match. If they don't, the compiler inserts a projection layer to align the dimensions.
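A sketch of a residual block with the projection the text describes; using a bias-free linear layer for the projection is an assumed choice, and the class name is illustrative.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):          # illustrative name
    def __init__(self, n_features, hidden_dim):
        super().__init__()
        self.layer1 = nn.Linear(n_features, hidden_dim)
        self.layer2 = nn.Linear(hidden_dim, hidden_dim)
        # Project x only when dimensions differ, so h + proj(x) is valid.
        self.proj = (nn.Identity() if n_features == hidden_dim
                     else nn.Linear(n_features, hidden_dim, bias=False))

    def forward(self, x):
        h = torch.relu(self.layer1(x))
        h = self.layer2(h)
        return torch.relu(h + self.proj(x))   # skip connection

out = ResidualBlock(5, 64)(torch.randn(8, 5))
```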
Decision tree: Which pattern do I need?
Is your target continuous (regression)?
├── Yes → How complex is the relationship?
│ ├── Linear → y = Wx + b
│ ├── Curved → y = ax² + bx + c
│ └── Complex → y = W₂ · ReLU(W₁x + b₁) + b₂
│
└── No (classification)
├── Binary (2 classes)?
│ ├── Simple → y = σ(Wx + b)
│ └── Complex → y = σ(W₂ · ReLU(W₁x + b₁) + b₂)
│
└── Multi-class (3+ classes)?
├── Simple → y = softmax(Wx + b)
└── Complex → y = softmax(W₂ · ReLU(W₁x + b₁) + b₂)
Start at the simplest formula that matches your task. Only add complexity (more layers, different activations) when the simpler version demonstrably underperforms on your data.
Edge cases and gotchas
Implicit multiplication: Wx means matrix multiplication, not element-wise. The compiler inserts nn.Linear for capital-letter-times-variable patterns.
Subscript ambiguity: x₁ means "feature 1" (not a separate variable), but W₁ means "weight matrix for layer 1." The compiler uses the letter case to disambiguate.
Missing bias: If you write y = Wx without + b, the compiler generates nn.Linear(n, m, bias=False). Most practitioners want bias, so y = Wx + b is usually what you mean.
Activation at the wrong level: y = ReLU(σ(Wx + b)) puts ReLU after sigmoid, which is unusual and probably a mistake. The compiler will compile it faithfully, but the model likely won't train well because sigmoid already bounds the output to [0, 1] and ReLU just clips the negatives (which don't exist).
Common mistakes
A few patterns that compile but produce unexpected results:
Double activation: y = ReLU(σ(Wx + b)) puts ReLU after sigmoid. Since sigmoid outputs are in [0, 1], ReLU just passes them through (no negative values to clip). The ReLU is a no-op. You probably meant y = σ(W₂ · ReLU(W₁x + b₁) + b₂) (ReLU in the hidden layer, sigmoid at the output).
Wrong output activation: y = ReLU(Wx + b) uses ReLU as the output activation for a regression model. This means the model can only predict non-negative values. If your target variable has negative values, the model will never predict them. Use no output activation for regression: y = Wx + b.
Softmax for binary: y = softmax(Wx + b) with a binary target creates unnecessary complexity. Softmax with 2 output classes is mathematically equivalent to sigmoid with 1 output, but uses twice the parameters. Use y = σ(Wx + b) for binary classification.
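The equivalence is easy to verify numerically: softmax over two logits [z₀, z₁] assigns class 1 exactly the probability sigmoid(z₁ − z₀).

```python
import torch

z = torch.randn(8, 2)                           # two logits per example
p_softmax = torch.softmax(z, dim=-1)[:, 1]      # P(class 1) via softmax
p_sigmoid = torch.sigmoid(z[:, 1] - z[:, 0])    # same probability via sigmoid
```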
Extending this taxonomy
This reference covers the patterns MathExec's compiler handles deterministically. For formulas that don't match any pattern, the compiler falls back to LLM-assisted compilation. The LLM reads the LaTeX and generates PyTorch code, which works for many custom architectures but isn't deterministic: you might get slightly different code on repeated compilations.
If you find a formula that should compile but doesn't, or that compiles to something unexpected, let us know. Every report helps us extend the deterministic coverage.
This reference is updated as MathExec's compiler expands. Try any of these formulas in MathExec to see the generated PyTorch code.
