We Counted: The Average ML Paper Formula Takes 94 Lines of PyTorch
Every ML paper has a "Model" section where the architecture is described in mathematical notation. The formula is clean. It fits on one line. y = σ(W₂ · ReLU(W₁x + b₁) + b₂) tells you everything about a 2-layer binary classifier in 28 characters.
Then you implement it and write 80+ lines of Python.
We wanted to quantify this gap. So we took 10 standard ML formulas, wrote the complete, runnable PyTorch code for each (including data loading, training loop, and evaluation), and counted the lines.
The rules
To keep the comparison fair, we established rules for what counts as a "line of PyTorch":
- Complete and runnable: The script must run end-to-end. Imports, model definition, data loading, training loop, and evaluation. No "..." or "# rest of the code here" shortcuts.
- Minimal but clean: No unnecessary comments, no debug prints, no visualization. Just the code needed to load data, train the model, and print the final metric.
- Standard practices: Using `torch.utils.data.DataLoader`, an `nn.Module` subclass, and a standard training loop with an optimizer and loss function. No third-party libraries (no Lightning, no sklearn).
- Blank lines count: Because they exist in real code and affect readability.
For each formula, we also counted the LaTeX character count (including backslashes and braces).
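The counting itself is simple. The helpers below (`latex_chars` and `pytorch_lines` are our names for this sketch, not a published tool) show the convention: every character of the LaTeX source counts, and every line of the script counts, blanks included.

```python
def latex_chars(formula: str) -> int:
    # Every character counts: backslashes, braces, and spaces included
    return len(formula)

def pytorch_lines(source: str) -> int:
    # Blank lines count too, per the rules above
    return len(source.rstrip("\n").split("\n"))

assert latex_chars(r"\sigma(Wx+b)") == 12
assert pytorch_lines("import torch\n\nprint(1)\n") == 3
```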
The results
| # | Formula | LaTeX chars | PyTorch lines | Ratio |
|---|---|---|---|---|
| 1 | y = mx + b | 11 | 72 | 1:6.5 |
| 2 | y = ax² + bx + c | 18 | 78 | 1:4.3 |
| 3 | y = Wx + b | 11 | 74 | 1:6.7 |
| 4 | y = σ(Wx + b) | 14 | 79 | 1:5.6 |
| 5 | y = softmax(Wx + b) | 20 | 82 | 1:4.1 |
| 6 | y = σ(W₂·ReLU(W₁x+b₁)+b₂) | 28 | 89 | 1:3.2 |
| 7 | y = W₃·ReLU(W₂·ReLU(W₁x+b₁)+b₂)+b₃ | 38 | 95 | 1:2.5 |
| 8 | y = softmax(W₂·ReLU(W₁x+b₁)+b₂) | 34 | 92 | 1:2.7 |
| 9 | y = LSTM(x) | 12 | 104 | 1:8.7 |
| 10 | y = Transformer(x) | 18 | 112 | 1:6.2 |
| | Average | 20.4 | 87.7 | 1:5.0 |
The average formula is 20 characters of LaTeX and 88 lines of PyTorch. The median is closer to 85 lines. The range spans from 72 (simple linear regression) to 112 (Transformer).
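The summary numbers fall straight out of the table; a quick check in Python:

```python
# Per-row values from the results table (order matches rows 1-10)
chars = [11, 18, 11, 14, 20, 28, 38, 34, 12, 18]
lines = [72, 78, 74, 79, 82, 89, 95, 92, 104, 112]

mean_chars = sum(chars) / len(chars)           # 20.4
mean_lines = sum(lines) / len(lines)           # 87.7
ordered = sorted(lines)
median_lines = (ordered[4] + ordered[5]) / 2   # 85.5
# Mean of the per-row ratios, just over 5 (the table rounds to 1:5.0)
mean_ratio = sum(l / c for l, c in zip(lines, chars)) / len(lines)

print(mean_chars, mean_lines, median_lines)
```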
Note: Our original estimate of 94 lines was from an earlier version of this analysis that included a slightly different set of formulas. The methodology is the same.
Where the lines go
The interesting question isn't "how many lines" but "where are they?" We broke down a typical implementation (formula #6, the 2-layer MLP) into categories:
| Category | Lines | % of total |
|---|---|---|
| Imports | 4 | 4.5% |
| Model class definition | 12 | 13.5% |
| Dataset class | 15 | 16.9% |
| Data loading and splitting | 11 | 12.4% |
| Training loop | 22 | 24.7% |
| Evaluation | 9 | 10.1% |
| Configuration (optimizer, loss, device) | 8 | 9.0% |
| Preprocessing (scaling, encoding) | 8 | 9.0% |
The model class, the part that directly corresponds to the formula, is 12 lines (13.5%). The other 77 lines (86.5%) are infrastructure: loading data, batching it, running the training loop, tracking metrics, evaluating on a test set.
86.5% of the code has nothing to do with the formula itself.
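Both headline percentages can be recomputed directly from the breakdown table:

```python
# Category line counts for formula #6, from the table above
breakdown = {
    "imports": 4, "model class": 12, "dataset class": 15,
    "data loading": 11, "training loop": 22, "evaluation": 9,
    "configuration": 8, "preprocessing": 8,
}
total = sum(breakdown.values())                     # 89 lines
model_pct = breakdown["model class"] / total * 100  # 13.5%
infra_pct = 100 - model_pct                         # 86.5%
print(total, round(model_pct, 1), round(infra_pct, 1))
```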
The boilerplate breakdown
Data loading (26 lines, ~30%)
Loading a CSV into PyTorch requires creating a custom `Dataset` class with `__init__`, `__len__`, and `__getitem__`, then wrapping it in a `DataLoader`. For a simple CSV:

```python
class CSVDataset(Dataset):
    def __init__(self, path, target_col):
        df = pd.read_csv(path)
        self.X = torch.tensor(df.drop(columns=[target_col]).values, dtype=torch.float32)
        self.y = torch.tensor(df[target_col].values, dtype=torch.float32)
    def __len__(self):
        return len(self.X)
    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]
```
That's 10 lines for the simplest possible case. No categorical encoding, no missing value handling, no feature scaling. Add those and you're at 20+ lines before you've even defined the model.
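And the class alone isn't enough; you still wire it up before training can start. A runnable sketch (the tiny CSV here is synthetic so the snippet is self-contained; real data just makes these lines longer):

```python
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader

class CSVDataset(Dataset):
    def __init__(self, path, target_col):
        df = pd.read_csv(path)
        self.X = torch.tensor(df.drop(columns=[target_col]).values, dtype=torch.float32)
        self.y = torch.tensor(df[target_col].values, dtype=torch.float32)
    def __len__(self):
        return len(self.X)
    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

# Synthetic stand-in data so the example runs end-to-end
pd.DataFrame({"f1": [0.1, 0.5, 0.9, 0.2],
              "f2": [1.0, 0.0, 1.0, 0.0],
              "churned": [0, 1, 1, 0]}).to_csv("churn.csv", index=False)

dataset = CSVDataset("churn.csv", target_col="churned")
loader = DataLoader(dataset, batch_size=2, shuffle=True)
X_batch, y_batch = next(iter(loader))
print(X_batch.shape)  # torch.Size([2, 2])
```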
In MathExec, this is: "Upload CSV, select target column."
Training loop (22 lines, ~25%)
The standard PyTorch training loop is remarkably verbose for what it does:
```python
for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    for X_batch, y_batch in train_loader:
        X_batch, y_batch = X_batch.to(device), y_batch.to(device)
        optimizer.zero_grad()
        output = model(X_batch)
        loss = criterion(output.squeeze(), y_batch)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    # ... logging, validation, early stopping
```
Every PyTorch project writes some version of this loop. It's always the same. The only things that change are the loss function and whether you squeeze the output.
In MathExec, this is: "Click Train."
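The evaluation category (9 lines in the breakdown) is the same story. A self-contained sketch for the binary case, with a stand-in model and random data so it runs on its own:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in model and data so the snippet is runnable in isolation
model = nn.Sequential(nn.Linear(2, 1), nn.Sigmoid())
X = torch.randn(64, 2)
y = (X[:, 0] > 0).float()
test_loader = DataLoader(TensorDataset(X, y), batch_size=16)
device = torch.device("cpu")

# The evaluation block itself: eval mode, no gradients, accumulate accuracy
model.eval()
correct = total = 0
with torch.no_grad():
    for X_batch, y_batch in test_loader:
        X_batch, y_batch = X_batch.to(device), y_batch.to(device)
        preds = (model(X_batch).squeeze() > 0.5).float()
        correct += (preds == y_batch).sum().item()
        total += y_batch.numel()
print(f"test accuracy: {correct / total:.3f}")
```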
Configuration (8 lines, ~9%)
Choosing a device, creating an optimizer, selecting a loss function, setting hyperparameters:
```python
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = ChurnModel(input_dim=13).to(device)
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
num_epochs = 200
batch_size = 32
```
These are decisions, not creativity. For most experiments, the defaults (Adam, lr=0.001, batch_size=32) work fine. But you still have to write the lines.
The model class: the only part that matters
The 12-line model class is the only part that directly encodes the formula. And even there, half the code is PyTorch ceremony:
```python
class Model(nn.Module):                        # ceremony
    def __init__(self, input_dim):             # ceremony
        super().__init__()                     # ceremony
        self.layer1 = nn.Linear(input_dim, 64) # actual model
        self.layer2 = nn.Linear(64, 1)         # actual model
    def forward(self, x):                      # ceremony
        h = torch.relu(self.layer1(x))         # actual model
        return torch.sigmoid(self.layer2(h))   # actual model
```
Four lines of actual model logic; the other eight lines of the full 12-line class are ceremony. The formula y = σ(W₂ · ReLU(W₁x + b₁) + b₂) is faithfully expressed in those four lines. The other 85 lines of the script exist to make PyTorch happy.
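To be fair, PyTorch itself can trim the class ceremony: `nn.Sequential` expresses the same formula without defining a class at all.

```python
import torch
import torch.nn as nn

# y = σ(W₂·ReLU(W₁x+b₁)+b₂) as a module chain, no class required
input_dim = 13
model = nn.Sequential(
    nn.Linear(input_dim, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
    nn.Sigmoid(),
)
out = model(torch.randn(4, input_dim))
print(out.shape)  # torch.Size([4, 1])
```

The ceremony shrinks, but the data pipeline, training loop, and configuration around the model do not.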
Complexity scaling
As formulas get more complex, the PyTorch line count grows slowly while the formula length grows quickly. The LSTM implementation is 104 lines because the model class is larger (LSTM requires handling hidden state), but the formula is still 12 characters.
The ratio of PyTorch lines to LaTeX characters is highest for models whose short formulas hide the most machinery (1:8.7 for the LSTM, 1:6.5 for linear regression) and lowest for deep MLPs (1:2.5 for the 3-layer MLP). This makes sense: as the formula gets longer, a larger fraction of the code is actual model logic rather than boilerplate. But even at the best ratio, you're still writing 2.5 lines of code per character of math.
What this means
The 86.5% boilerplate figure is the core argument for formula-level ML tools. Most of the code in a standard PyTorch script is not about your model. It's about the plumbing that connects your data to your model to your metrics.
There are several ways to reduce this:
- Libraries (PyTorch Lightning, fastai) reduce boilerplate by 40-60% but still require you to write the model class and some configuration
- AutoML (AutoGluon, FLAML) eliminates the model class but also eliminates your control over the architecture
- Formula compilation (MathExec) eliminates everything except the formula itself
The question is what you're willing to give up. If you need full control, write the 88 lines. If you need an answer fast, write the 28 characters.
The framework response
Frameworks like PyTorch Lightning exist specifically to reduce this boilerplate. Lightning eliminates the training loop, the device management, and the DataLoader ceremony. A Lightning version of the same model might be 35-40 lines instead of 88.
But Lightning doesn't eliminate the model class. You still write the `__init__` and `forward` methods. You still decide on the optimizer and loss function. And you add a new dependency with its own learning curve and conventions.
scikit-learn gets even more concise for the models it supports: a call like `LogisticRegression().fit(X, y)` is one line. But you lose fine-grained control over the architecture, and sklearn's neural network support is limited compared to PyTorch's.
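For comparison, a minimal sketch of formula #6 in scikit-learn (an `MLPClassifier` with one 64-unit ReLU layer; the data here is synthetic, and the hyperparameters are illustrative):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 13))
y = (X[:, 0] > 0).astype(int)  # a linearly separable toy target

# One hidden ReLU layer of 64 units with a logistic output:
# the same shape as y = σ(W₂·ReLU(W₁x+b₁)+b₂)
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0).fit(X, y)
print(round(clf.score(X, y), 2))
```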
The point isn't that PyTorch is bad. PyTorch is the right tool when you need full control. The point is that most of the code in a PyTorch script has nothing to do with the mathematical idea you're implementing, and for rapid experimentation, that overhead isn't worth paying.
Why we measure this
We track these numbers because they define the gap MathExec is trying to close. If the average formula takes 88 lines of PyTorch, and MathExec reduces that to 0 lines (just the formula), the value proposition is concrete and measurable: you save 88 lines of code per experiment. Over 10 experiments in a session, that's 880 lines of code you didn't have to write, debug, or maintain.
The formula-to-lines ratio is also a useful proxy for how much "translation tax" a given model type carries. LSTM has the highest ratio (1:8.7) because recurrent models require the most hidden-state plumbing. Linear regression has a lower ratio (1:6.5) but is still far from 1:1.
A 1:1 ratio, where the code is exactly as long as the formula, is probably impossible. But getting closer to it is the goal.
The complete PyTorch scripts for all 10 formulas are available in our GitHub repo. Try any of these formulas in MathExec to see the same model trained in under 60 seconds.