[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"blog-post-latex-pytorch-formula-taxonomy":3},{"id":4,"title":5,"slug":6,"excerpt":7,"category":8,"tags":9,"author_name":14,"cover_image":15,"status":16,"view_count":17,"reading_time_minutes":18,"published_at":19,"updated_at":19,"created_at":20,"content":21,"meta_description":22,"og_image":15,"canonical_url":23,"author_uid":23,"previous_slugs":24,"images":25},"69c95b46b422d9f69ccff184","LaTeX to PyTorch: A Reference Taxonomy of Formula Patterns and What They Compile To","latex-pytorch-formula-taxonomy","A definitive reference mapping LaTeX formula patterns to their PyTorch equivalents. Every entry backed by MathExec's compiler.","engineering",[10,11,12,13],"reference","pytorch","latex","taxonomy","Kingsley Michael","https:\u002F\u002Fmathexec.com\u002Fblog\u002Fimages\u002F69c95b46b422d9f69ccff184\u002F67918ae3-6893-4bb0-b56b-928b83c488af.png","published",14,8,"2026-04-01T15:27:37.410000","2026-03-29T17:02:52.547000","# LaTeX to PyTorch: A Reference Taxonomy of Formula Patterns\n\nIf you've read an ML paper or textbook, you've seen the formulas. If you've implemented models in PyTorch, you've written the code. But there's no standard reference that maps one to the other.\n\nThis post is that reference. Each entry shows a LaTeX pattern, what it means architecturally, and what the equivalent PyTorch code looks like. These mappings are derived from MathExec's formula compiler, so every entry has been tested against real compilation and training.\n\n## Notation conventions\n\nBefore the taxonomy, you need to know the conventions. Mathematical notation in ML is surprisingly inconsistent, but most papers follow these defaults:\n\n| Symbol | Meaning | PyTorch equivalent |\n|--------|---------|-------------------|\n| `x` | Input features | `x` tensor (batch_size, n_features) |\n| `y` | Output \u002F target | Target tensor |\n| `W`, `M`, `A` (capitals) | Weight matrices | `nn.Linear` parameters |\n| `b`, `c` (lowercase) | Bias vectors | `bias=True` in `nn.Linear` |\n| `α`, `β`, `γ` (Greek) | Scalar parameters | `nn.Parameter(torch.tensor(...))` |\n| `σ` | Sigmoid function | `torch.sigmoid()` |\n| Subscripts (`W₁`, `b₂`) | Layer index | Separate `nn.Linear` per index |\n\n## Category 1: Linear models\n\n### `y = mx + b`\n\nThe simplest possible model. One input feature, one weight, one bias.\n\n```python\n# PyTorch equivalent\nself.linear = nn.Linear(1, 1)  # 2 parameters\n```\n\n### `y = Wx + b`\n\nMultivariate linear regression. The capital `W` indicates a weight matrix that handles multiple input features.\n\n```python\nself.linear = nn.Linear(n_features, 1)  # n_features + 1 parameters\n```\n\n### `y = Wx` (no bias)\n\nSometimes you see the bias omitted. This compiles to `nn.Linear` with `bias=False`.\n\n```python\nself.linear = nn.Linear(n_features, 1, bias=False)  # n_features parameters\n```\n\n### `y = ax² + bx + c`\n\nPolynomial regression. The input `x` is feature-expanded (squared, cubed, etc.) before being fed to a linear model.\n\n```python\n# Input: x → [x, x², x³, ...]\nself.linear = nn.Linear(degree, 1)  # degree + 1 parameters\n```\n\n**Key insight**: Polynomial regression is just linear regression on transformed features. The model itself is still linear.\n\n## Category 2: Classification (single activation wrapping)\n\n### `y = σ(Wx + b)`\n\nLogistic regression. 
## Category 2: Classification (single activation wrapping)

### `y = σ(Wx + b)`

Logistic regression. The sigmoid `σ` squashes the linear output to [0, 1] for binary classification.

```python
self.linear = nn.Linear(n_features, 1)
# In forward: torch.sigmoid(self.linear(x))
# Loss: BCELoss
```

**Compilation rule**: When the compiler sees `σ(...)` wrapping a linear expression, it uses binary cross-entropy loss automatically.

### `y = softmax(Wx + b)`

Multi-class classification. Softmax outputs a probability distribution over classes.

```python
self.linear = nn.Linear(n_features, n_classes)
# In forward: F.softmax(self.linear(x), dim=-1)
# Loss: CrossEntropyLoss
```

**Note**: PyTorch's `CrossEntropyLoss` includes softmax internally, so the generated code typically uses `log_softmax` for numerical stability rather than `softmax` followed by `NLLLoss`.

## Category 3: Multi-layer networks (nested activations)

### `y = σ(W₂ · ReLU(W₁x + b₁) + b₂)`

Two-layer neural network for binary classification. The nesting depth tells you the layer count.

```python
self.layer1 = nn.Linear(n_features, hidden_dim)  # W₁, b₁
self.layer2 = nn.Linear(hidden_dim, 1)            # W₂, b₂
# In forward:
#   h = torch.relu(self.layer1(x))
#   return torch.sigmoid(self.layer2(h))
```

**Compilation rules for nested formulas**:
- Each `W_n · activation(...)` pattern becomes one `nn.Linear` layer
- The innermost layer connects to `n_features`
- Hidden dimensions default to 64 (configurable)
- The outermost activation determines the loss function

### `y = W₃ · ReLU(W₂ · ReLU(W₁x + b₁) + b₂) + b₃`

Three-layer MLP for regression (no outer activation = unbounded output = MSE loss).

```python
self.layer1 = nn.Linear(n_features, hidden)
self.layer2 = nn.Linear(hidden, hidden)
self.layer3 = nn.Linear(hidden, 1)
# forward: relu → relu → linear
```

### `y = softmax(W₂ · ReLU(W₁x + b₁) + b₂)`

Neural network for multi-class classification. Same structure as the sigmoid version, but softmax output + cross-entropy loss.

### Depth detection

The compiler counts nesting levels to determine depth:
- `σ(Wx + b)` → 1 layer
- `σ(W₂ · ReLU(W₁x + b₁) + b₂)` → 2 layers
- `σ(W₃ · ReLU(W₂ · ReLU(...) + b₂) + b₃)` → 3 layers

Each subscript index on the weight matrices should be sequential. `W₁`, `W₂`, `W₃` tells the compiler exactly how many layers to create.

## Category 4: Activation variants

All of these are drop-in replacements for ReLU in the patterns above:

| LaTeX | PyTorch | When to use |
|-------|---------|-------------|
| `ReLU(x)` | `torch.relu(x)` | Default choice for hidden layers |
| `tanh(x)` | `torch.tanh(x)` | When outputs should be in [-1, 1] |
| `σ(x)` | `torch.sigmoid(x)` | Binary output or gating |
| `GELU(x)` | `F.gelu(x)` | Transformer-style networks |
| `SiLU(x)` | `F.silu(x)` | Swish activation, smooth alternative to ReLU |
| `ELU(x)` | `F.elu(x)` | Smoother than ReLU near zero |
| `LeakyReLU(x)` | `F.leaky_relu(x)` | Avoids dead neurons |

These can be mixed within a single formula: `y = σ(W₂ · tanh(W₁x + b₁) + b₂)` uses tanh in the hidden layer and sigmoid at the output.
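To make the mixed-activation case concrete, here is a sketch of the two-layer model that formula describes, following the same compilation rules as Category 3. The class name and the default hidden dimension of 64 are illustrative.

```python
import torch
import torch.nn as nn

# Sketch of y = σ(W₂ · tanh(W₁x + b₁) + b₂); names are illustrative.
class TanhSigmoidNet(nn.Module):
    def __init__(self, n_features: int, hidden_dim: int = 64):
        super().__init__()
        self.layer1 = nn.Linear(n_features, hidden_dim)  # W₁, b₁
        self.layer2 = nn.Linear(hidden_dim, 1)           # W₂, b₂

    def forward(self, x):
        h = torch.tanh(self.layer1(x))        # hidden activation: tanh
        return torch.sigmoid(self.layer2(h))  # output activation: sigmoid → BCELoss
```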
## Category 5: Named architectures

Some formulas trigger specialized model classes rather than generic MLP compilation:

### `y = LSTM(x)`

Maps to a recurrent model with LSTM cells followed by a linear output layer.

```python
self.lstm = nn.LSTM(n_features, hidden_dim, batch_first=True)
self.fc = nn.Linear(hidden_dim, output_dim)
```

### `y = GRU(x)`

Same structure as LSTM but with GRU cells (fewer parameters, often comparable performance).

### `y = Conv1d(x)`

Maps to a 1D convolutional model, typically for sequence or time-series data.

```python
self.conv1 = nn.Conv1d(n_features, 32, kernel_size=3)
self.pool = nn.AdaptiveAvgPool1d(1)
self.fc = nn.Linear(32, output_dim)
```

### `y = Transformer(x)`

Maps to a transformer encoder layer followed by pooling and a linear output.

```python
encoder_layer = nn.TransformerEncoderLayer(d_model=n_features, nhead=4)
self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
self.fc = nn.Linear(n_features, output_dim)
```

## Category 6: Residual patterns

### `y = ReLU(W₂ · ReLU(W₁x + b₁) + b₂ + x)`

The `+ x` at the end creates a skip connection. The compiler detects this pattern and wraps it in a residual block where the input is added to the transformed output.

```python
# forward:
h = torch.relu(self.layer1(x))
h = self.layer2(h)
return torch.relu(h + x)  # residual connection
```

**Important**: Skip connections require the input and output dimensions to match. If they don't, the compiler inserts a projection layer to align the dimensions.
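Here is a sketch of how a residual block with an optional projection on the skip path can look. The block structure and names are illustrative of the general pattern, not the compiler's exact generated code.

```python
import torch
import torch.nn as nn

# Residual block with an optional projection on the skip path.
# Illustrative sketch; not the compiler's exact output.
class ResidualBlock(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.layer1 = nn.Linear(in_dim, hidden_dim)   # W₁, b₁
        self.layer2 = nn.Linear(hidden_dim, out_dim)  # W₂, b₂
        # Projection aligns the skip connection when in_dim != out_dim
        self.project = nn.Identity() if in_dim == out_dim else nn.Linear(in_dim, out_dim)

    def forward(self, x):
        h = torch.relu(self.layer1(x))
        h = self.layer2(h)
        return torch.relu(h + self.project(x))  # skip connection (+ x)
```

When `in_dim == out_dim`, the projection collapses to `nn.Identity()` and this reduces to the plain `h + x` form shown above.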
## Decision tree: Which pattern do I need?

```
Is your target continuous (regression)?
├── Yes → How complex is the relationship?
│   ├── Linear → y = Wx + b
│   ├── Curved → y = ax² + bx + c
│   └── Complex → y = W₂ · ReLU(W₁x + b₁) + b₂
│
└── No (classification)
    ├── Binary (2 classes)?
    │   ├── Simple → y = σ(Wx + b)
    │   └── Complex → y = σ(W₂ · ReLU(W₁x + b₁) + b₂)
    │
    └── Multi-class (3+ classes)?
        ├── Simple → y = softmax(Wx + b)
        └── Complex → y = softmax(W₂ · ReLU(W₁x + b₁) + b₂)
```

Start at the simplest formula that matches your task. Only add complexity (more layers, different activations) when the simpler version demonstrably underperforms on your data.

## Edge cases and gotchas

**Implicit multiplication**: `Wx` means matrix multiplication, not element-wise. The compiler inserts `nn.Linear` for capital-letter-times-variable patterns.

**Subscript ambiguity**: `x₁` means "feature 1" (not a separate variable), but `W₁` means "weight matrix for layer 1." The compiler uses the letter case to disambiguate.

**Missing bias**: If you write `y = Wx` without `+ b`, the compiler generates `nn.Linear(n, m, bias=False)`. Most practitioners want bias, so `y = Wx + b` is usually what you mean.

**Activation at the wrong level**: `y = ReLU(σ(Wx + b))` puts ReLU *after* sigmoid, which is unusual and probably a mistake. The compiler will compile it faithfully, but the model likely won't train well because sigmoid already bounds the output to [0, 1] and ReLU just clips the negatives (which don't exist).

## Common mistakes

A few patterns that compile but produce unexpected results:

**Double activation**: `y = ReLU(σ(Wx + b))` puts ReLU after sigmoid. Since sigmoid outputs are in [0, 1], ReLU just passes them through (no negative values to clip). The ReLU is a no-op. You probably meant `y = σ(W₂ · ReLU(W₁x + b₁) + b₂)` (ReLU in the hidden layer, sigmoid at the output).

**Wrong output activation**: `y = ReLU(W₂ · ReLU(W₁x + b₁) + b₂)` uses ReLU as the output activation for a regression model. This means the model can only predict non-negative values. If your target variable has negative values, the model will never predict them. Use no output activation for regression: `y = W₂ · ReLU(W₁x + b₁) + b₂`.

**Softmax for binary**: `y = softmax(Wx + b)` with a binary target creates unnecessary complexity. Softmax with 2 output classes is mathematically equivalent to sigmoid with 1 output, but uses twice the parameters. Use `y = σ(Wx + b)` for binary classification.

## Extending this taxonomy

This reference covers the patterns MathExec's compiler handles deterministically. For formulas that don't match any pattern, the compiler falls back to LLM-assisted compilation. The LLM reads the LaTeX and generates PyTorch code, which works for many custom architectures but isn't deterministic: you might get slightly different code on repeated compilations.

If you find a formula that should compile but doesn't, or that compiles to something unexpected, let us know. Every report helps us extend the deterministic coverage.

---

*This reference is updated as MathExec's compiler expands. Try any of these formulas in [MathExec](https://mathexec.com/app) to see the generated PyTorch code.*