LaTeX to PyTorch: A Reference Taxonomy of Formula Patterns
If you've read an ML paper or textbook, you've seen the formulas. If you've implemented models in PyTorch, you've written the code. But there's no standard reference that maps one to the other.
This post is that reference. Each entry shows a LaTeX pattern, what it means architecturally, and what the equivalent PyTorch code looks like. These mappings are derived from MathExec's formula compiler, so every entry has been tested against real compilation and training.
Notation conventions
Before the taxonomy, you need to know the conventions. Mathematical notation in ML is surprisingly inconsistent, but most papers follow these defaults:
| Symbol | Meaning | PyTorch equivalent |
|---|---|---|
| x | Input features | x tensor of shape (batch_size, n_features) |
| y | Output / target | Target tensor |
| W, M, A (capitals) | Weight matrices | nn.Linear parameters |
| b, c (lowercase) | Bias vectors | bias=True in nn.Linear |
| α, β, γ (Greek) | Scalar parameters | nn.Parameter(torch.tensor(...)) |
| σ | Sigmoid function | torch.sigmoid() |
| Subscripts (W₁, b₂) | Layer index | Separate nn.Linear per index |
Category 1: Linear models
y = mx + b
The simplest possible model. One input feature, one weight, one bias.
# PyTorch equivalent
self.linear = nn.Linear(1, 1) # 2 parameters
y = Wx + b
Multivariate linear regression. The capital W indicates a weight matrix that handles multiple input features.
self.linear = nn.Linear(n_features, 1) # n_features + 1 parameters
y = Wx (no bias)
Sometimes you see the bias omitted. This compiles to nn.Linear with bias=False.
self.linear = nn.Linear(n_features, 1, bias=False) # n_features parameters
y = ax² + bx + c
Polynomial regression. The input x is feature-expanded (squared, cubed, etc.) before being fed to a linear model.
# Input: x → [x, x², x³, ...]
self.linear = nn.Linear(degree, 1) # degree + 1 parameters
Key insight: Polynomial regression is just linear regression on transformed features. The model itself is still linear.
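That insight can be sketched directly: expand the input into powers, then apply an ordinary linear layer. The helper name below is illustrative, not MathExec's generated code.

```python
import torch
import torch.nn as nn

def expand_poly(x, degree):
    # (batch, 1) -> (batch, degree): [x, x^2, ..., x^degree]
    return torch.cat([x ** d for d in range(1, degree + 1)], dim=1)

linear = nn.Linear(3, 1)           # degree 3 -> 3 weights + 1 bias = 4 parameters
x = torch.randn(8, 1)
y = linear(expand_poly(x, 3))      # still just linear regression on expanded features
```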
Category 2: Classification (single activation wrapping)
y = σ(Wx + b)
Logistic regression. The sigmoid σ squashes the linear output to [0, 1] for binary classification.
self.linear = nn.Linear(n_features, 1)
# In forward: torch.sigmoid(self.linear(x))
# Loss: BCELoss
Compilation rule: When the compiler sees σ(...) wrapping a linear expression, it uses binary cross-entropy loss automatically.
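Put together, the compiled model might look like the sketch below (class and variable names are illustrative, not MathExec's actual output): a single linear layer, sigmoid in forward, and BCELoss selected because σ wraps the output.

```python
import torch
import torch.nn as nn

class LogisticRegression(nn.Module):   # illustrative name
    def __init__(self, n_features):
        super().__init__()
        self.linear = nn.Linear(n_features, 1)

    def forward(self, x):
        return torch.sigmoid(self.linear(x))

model = LogisticRegression(n_features=4)
loss_fn = nn.BCELoss()                               # chosen because σ(...) wraps the output
x = torch.randn(8, 4)
y = torch.randint(0, 2, (8, 1)).float()
loss = loss_fn(model(x), y)
```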
y = softmax(Wx + b)
Multi-class classification. Softmax outputs a probability distribution over classes.
self.linear = nn.Linear(n_features, n_classes)
# In forward: F.softmax(self.linear(x), dim=-1)
# Loss: CrossEntropyLoss
Note: PyTorch's CrossEntropyLoss applies log-softmax internally and expects raw logits. For numerical stability, the generated code therefore typically passes logits straight to the loss (or uses log_softmax with NLLLoss) rather than applying softmax and then taking its log.
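The stable pattern looks like this: logits go to the loss unchanged, and softmax is applied only when probabilities are actually needed.

```python
import torch
import torch.nn as nn

linear = nn.Linear(4, 3)                        # n_features=4, n_classes=3
x = torch.randn(8, 4)
targets = torch.randint(0, 3, (8,))             # class indices, not one-hot

logits = linear(x)
loss = nn.CrossEntropyLoss()(logits, targets)   # no explicit softmax here
probs = torch.softmax(logits, dim=-1)           # only for reporting / inference
```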
Category 3: Multi-layer networks (nested activations)
y = σ(W₂ · ReLU(W₁x + b₁) + b₂)
Two-layer neural network for binary classification. The nesting depth tells you the layer count.
self.layer1 = nn.Linear(n_features, hidden_dim) # W₁, b₁
self.layer2 = nn.Linear(hidden_dim, 1) # W₂, b₂
# In forward:
# h = torch.relu(self.layer1(x))
# return torch.sigmoid(self.layer2(h))
Compilation rules for nested formulas:
- Each W_n · activation(...) pattern becomes one nn.Linear layer
- The innermost layer connects to n_features
- Hidden dimensions default to 64 (configurable)
- The outermost activation determines the loss function
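Applying those rules to y = σ(W₂ · ReLU(W₁x + b₁) + b₂) gives a module like the sketch below (names are illustrative; hidden_dim=64 is the stated default):

```python
import torch
import torch.nn as nn

class TwoLayerNet(nn.Module):                   # illustrative name
    def __init__(self, n_features, hidden_dim=64):
        super().__init__()
        self.layer1 = nn.Linear(n_features, hidden_dim)  # W₁, b₁ (innermost, connects to n_features)
        self.layer2 = nn.Linear(hidden_dim, 1)           # W₂, b₂

    def forward(self, x):
        h = torch.relu(self.layer1(x))                   # inner activation
        return torch.sigmoid(self.layer2(h))             # outer σ -> BCE loss

out = TwoLayerNet(n_features=5)(torch.randn(8, 5))
```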
y = W₃ · ReLU(W₂ · ReLU(W₁x + b₁) + b₂) + b₃
Three-layer MLP for regression (no outer activation = unbounded output = MSE loss).
self.layer1 = nn.Linear(n_features, hidden)
self.layer2 = nn.Linear(hidden, hidden)
self.layer3 = nn.Linear(hidden, 1)
# forward: relu → relu → linear
y = softmax(W₂ · ReLU(W₁x + b₁) + b₂)
Neural network for multi-class classification. Same structure as the sigmoid version, but softmax output + cross-entropy loss.
Depth detection
The compiler counts nesting levels to determine depth:
- σ(Wx + b) → 1 layer
- σ(W₂ · ReLU(W₁x + b₁) + b₂) → 2 layers
- σ(W₃ · ReLU(W₂ · ReLU(...) + b₂) + b₃) → 3 layers
Each subscript index on the weight matrices should be sequential. W₁, W₂, W₃ tells the compiler exactly how many layers to create.
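As a rough illustration only (this is not MathExec's actual parser), depth detection can be approximated by counting distinct weight-matrix subscripts:

```python
import re

def estimate_depth(formula: str) -> int:
    # Hypothetical sketch: count distinct subscripts on W (W₁, W₂, ... or W1, W2, ...).
    subs = set(re.findall(r"W([₁₂₃₄₅₆₇₈₉\d])", formula))
    if subs:
        return len(subs)
    return 1 if "W" in formula else 0   # bare W -> single layer
```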
Category 4: Activation variants
All of these are drop-in replacements for ReLU in the patterns above:
| LaTeX | PyTorch | When to use |
|---|---|---|
| ReLU(x) | torch.relu(x) | Default choice for hidden layers |
| tanh(x) | torch.tanh(x) | When outputs should be in [-1, 1] |
| σ(x) | torch.sigmoid(x) | Binary output or gating |
| GELU(x) | F.gelu(x) | Transformer-style networks |
| SiLU(x) | F.silu(x) | Swish activation, a smooth alternative to ReLU |
| ELU(x) | F.elu(x) | Smoother than ReLU near zero |
| LeakyReLU(x) | F.leaky_relu(x) | Avoids dead neurons |
These can be mixed within a single formula. y = σ(W₂ · tanh(W₁x + b₁) + b₂) uses tanh in the hidden layer and sigmoid at the output.
Category 5: Named architectures
Some formulas trigger specialized model classes rather than generic MLP compilation:
y = LSTM(x)
Maps to a recurrent model with LSTM cells followed by a linear output layer.
self.lstm = nn.LSTM(n_features, hidden_dim, batch_first=True)
self.fc = nn.Linear(hidden_dim, output_dim)
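A sketch of the corresponding forward pass; taking the last time step's hidden state is one common choice (the pooling strategy here is an assumption, not necessarily what the compiler emits):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=3, hidden_size=16, batch_first=True)
fc = nn.Linear(16, 1)

x = torch.randn(8, 20, 3)        # (batch, seq_len, n_features)
out, (h_n, c_n) = lstm(x)        # out: (batch, seq_len, hidden_size)
y = fc(out[:, -1, :])            # last time step -> (batch, 1)
```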
y = GRU(x)
Same structure as LSTM but with GRU cells (fewer parameters, often comparable performance).
y = Conv1d(x)
Maps to a 1D convolutional model, typically for sequence or time-series data.
self.conv1 = nn.Conv1d(n_features, 32, kernel_size=3)
self.pool = nn.AdaptiveAvgPool1d(1)
self.fc = nn.Linear(32, output_dim)
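One wrinkle worth showing: nn.Conv1d expects (batch, channels, length), so feature-last input has to be permuted first. The layout assumption below is illustrative.

```python
import torch
import torch.nn as nn

conv1 = nn.Conv1d(3, 32, kernel_size=3)
pool = nn.AdaptiveAvgPool1d(1)
fc = nn.Linear(32, 1)

x = torch.randn(8, 20, 3)            # (batch, seq_len, n_features)
h = conv1(x.permute(0, 2, 1))        # -> (batch, 32, seq_len - 2)
y = fc(pool(h).squeeze(-1))          # global average pool -> (batch, 1)
```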
y = Transformer(x)
Maps to a transformer encoder layer followed by pooling and a linear output.
encoder_layer = nn.TransformerEncoderLayer(d_model=n_features, nhead=4)
self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
self.fc = nn.Linear(n_features, output_dim)
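A sketch of the forward pass, with mean pooling over the sequence assumed as the pooling step. Note that d_model must be divisible by nhead, so n_features = 4 with nhead = 4 works here.

```python
import torch
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(d_model=4, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
fc = nn.Linear(4, 1)

x = torch.randn(8, 10, 4)          # (batch, seq_len, d_model)
y = fc(encoder(x).mean(dim=1))     # pool over the sequence -> (batch, 1)
```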
Category 6: Residual patterns
y = ReLU(W₂ · ReLU(W₁x + b₁) + b₂ + x)
The + x at the end creates a skip connection. The compiler detects this pattern and wraps it in a residual block where the input is added to the transformed output.
# forward:
h = torch.relu(self.layer1(x))
h = self.layer2(h)
return torch.relu(h + x) # residual connection
Important: Skip connections require the input and output dimensions to match. If they don't, the compiler inserts a projection layer to align the dimensions.
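A sketch of a residual block with the projection the text describes; using a bias-free linear layer for the projection is an assumed choice, and the class name is illustrative.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):          # illustrative name
    def __init__(self, n_features, hidden_dim):
        super().__init__()
        self.layer1 = nn.Linear(n_features, hidden_dim)
        self.layer2 = nn.Linear(hidden_dim, hidden_dim)
        # Project x only when dimensions differ, so h + proj(x) is valid.
        self.proj = (nn.Identity() if n_features == hidden_dim
                     else nn.Linear(n_features, hidden_dim, bias=False))

    def forward(self, x):
        h = torch.relu(self.layer1(x))
        h = self.layer2(h)
        return torch.relu(h + self.proj(x))   # skip connection

out = ResidualBlock(5, 64)(torch.randn(8, 5))
```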
Decision tree: Which pattern do I need?
Is your target continuous (regression)?
├── Yes → How complex is the relationship?
│ ├── Linear → y = Wx + b
│ ├── Curved → y = ax² + bx + c
│ └── Complex → y = W₂ · ReLU(W₁x + b₁) + b₂
│
└── No (classification)
├── Binary (2 classes)?
│ ├── Simple → y = σ(Wx + b)
│ └── Complex → y = σ(W₂ · ReLU(W₁x + b₁) + b₂)
│
└── Multi-class (3+ classes)?
├── Simple → y = softmax(Wx + b)
└── Complex → y = softmax(W₂ · ReLU(W₁x + b₁) + b₂)
Start at the simplest formula that matches your task. Only add complexity (more layers, different activations) when the simpler version demonstrably underperforms on your data.
Edge cases and gotchas
Implicit multiplication: Wx means matrix multiplication, not element-wise. The compiler inserts nn.Linear for capital-letter-times-variable patterns.
Subscript ambiguity: x₁ means "feature 1" (not a separate variable), but W₁ means "weight matrix for layer 1." The compiler uses the letter case to disambiguate.
Missing bias: If you write y = Wx without + b, the compiler generates nn.Linear(n, m, bias=False). Most practitioners want bias, so y = Wx + b is usually what you mean.
Activation at the wrong level: y = ReLU(σ(Wx + b)) puts ReLU after sigmoid, which is unusual and probably a mistake. The compiler will compile it faithfully, but the model likely won't train well because sigmoid already bounds the output to [0, 1] and ReLU just clips the negatives (which don't exist).
Common mistakes
A few patterns that compile but produce unexpected results:
Double activation: y = ReLU(σ(Wx + b)) puts ReLU after sigmoid. Since sigmoid outputs are in [0, 1], ReLU just passes them through (no negative values to clip). The ReLU is a no-op. You probably meant y = σ(W₂ · ReLU(W₁x + b₁) + b₂) (ReLU in the hidden layer, sigmoid at the output).
Wrong output activation: y = ReLU(Wx + b) uses ReLU as the output activation for a regression model. This means the model can only predict non-negative values. If your target variable has negative values, the model will never predict them. Use no output activation for regression: y = Wx + b.
Softmax for binary: y = softmax(Wx + b) with a binary target creates unnecessary complexity. Softmax with 2 output classes is mathematically equivalent to sigmoid with 1 output, but uses twice the parameters. Use y = σ(Wx + b) for binary classification.
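The equivalence is easy to verify numerically: softmax over two logits [z₀, z₁] assigns class 1 exactly the probability sigmoid(z₁ − z₀).

```python
import torch

z = torch.randn(8, 2)                           # two logits per example
p_softmax = torch.softmax(z, dim=-1)[:, 1]      # P(class 1) via softmax
p_sigmoid = torch.sigmoid(z[:, 1] - z[:, 0])    # same probability via sigmoid
```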
Extending this taxonomy
This reference covers the patterns MathExec's compiler handles deterministically. For formulas that don't match any pattern, the compiler falls back to LLM-assisted compilation. The LLM reads the LaTeX and generates PyTorch code, which works for many custom architectures but isn't deterministic: you might get slightly different code on repeated compilations.
If you find a formula that should compile but doesn't, or that compiles to something unexpected, let us know. Every report helps us extend the deterministic coverage.
This reference is updated as MathExec's compiler expands. Try any of these formulas in MathExec to see the generated PyTorch code.
