Engineering · 8 min read

60-Second Model: Benchmarking Time-to-First-Prediction Across 5 ML Workflows

We timed 5 different ML workflows on the same task: raw PyTorch, Lightning, scikit-learn, AutoML, and MathExec. The gap was bigger than we expected.


Kingsley Michael

April 1, 2026


"Time-to-first-prediction" is a metric nobody talks about. People benchmark model accuracy, training speed, inference latency. But the time from "I want to try this model" to "I have a prediction on my data" is arguably the metric that determines whether an idea gets tested at all.

If it takes 20 minutes, you test 2-3 ideas in a work session. If it takes 60 seconds, you test 20. That changes how you think about modeling.

We ran the same task through 5 different ML workflows and timed everything.

The task

Binary classification on a customer churn dataset. 5,000 rows, 8 features (numeric and categorical), binary target. Standard tabular ML problem. The model: a 2-layer neural network with ReLU hidden layer and sigmoid output.

We measured:

  • Wall clock time from opening the tool to seeing the first prediction
  • Lines of code written (excluding imports, for fairness)
  • Keystrokes (approximate, from screen recording)
  • Decisions required (how many things you need to configure before training)

The contenders

1. Raw PyTorch (from scratch)

Start with a blank Python file. Write the model class, the Dataset class, the training loop, the evaluation logic.

import torch
import torch.nn as nn

class ChurnModel(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.layer1 = nn.Linear(input_dim, 64)
        self.layer2 = nn.Linear(64, 1)

    def forward(self, x):
        h = torch.relu(self.layer1(x))
        return torch.sigmoid(self.layer2(h))

This is 9 lines for the model. Then you need data loading (~15 lines), the training loop (~20 lines), evaluation (~10 lines), and assorted setup (device selection, optimizer, loss function, ~8 lines). Total: about 62 lines of actual code.
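For reference, the training loop portion looks roughly like this. It's a minimal sketch, not our exact benchmark code: train_loader and the epoch count are assumptions, and preprocessing is omitted.

# Sketch of the remaining pieces -- assumes train_loader yields
# (features, target) batches of float32 tensors, built elsewhere
model = ChurnModel(input_dim=8)
loss_fn = nn.BCELoss()                     # model already applies sigmoid
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for epoch in range(20):                    # epoch count is an arbitrary choice
    for x, y in train_loader:
        optimizer.zero_grad()
        pred = model(x).squeeze(-1)        # (batch, 1) -> (batch,) to match y
        loss = loss_fn(pred, y)
        loss.backward()
        optimizer.step()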

2. PyTorch Lightning

Lightning reduces boilerplate by providing LightningModule and Trainer. You still define the model, but the training loop, logging, and device management are handled.

import torch
import torch.nn as nn
import pytorch_lightning as pl

class ChurnModel(pl.LightningModule):
    def __init__(self, input_dim):
        super().__init__()
        self.layer1 = nn.Linear(input_dim, 64)
        self.layer2 = nn.Linear(64, 1)
        self.loss_fn = nn.BCELoss()

    def forward(self, x):
        h = torch.relu(self.layer1(x))
        return torch.sigmoid(self.layer2(h))

    def training_step(self, batch, batch_idx):
        x, y = batch
        pred = self(x)
        return self.loss_fn(pred.squeeze(), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.001)

About 35 lines total, plus DataLoader setup (~10 lines) and a Trainer call (~3 lines). Total: about 48 lines.
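Those remaining pieces are short. Here's a sketch of the DataLoader setup and Trainer call, assuming the features and targets are already preprocessed into float32 tensors (batch size and epoch count are arbitrary choices):

from torch.utils.data import DataLoader, TensorDataset

# X_train, y_train: assumed preprocessed float32 tensors
train_loader = DataLoader(TensorDataset(X_train, y_train),
                          batch_size=64, shuffle=True)

trainer = pl.Trainer(max_epochs=20)
trainer.fit(ChurnModel(input_dim=8), train_loader)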

3. scikit-learn

For this specific task (tabular binary classification), sklearn is the most concise traditional option:

from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

# X, y: features and churn labels, already loaded from the CSV
X_train, X_test, y_train, y_test = train_test_split(X, y)
model = MLPClassifier(hidden_layer_sizes=(64,), max_iter=200)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))

About 12 lines including data loading. But you need to preprocess (handle categoricals, scale numerics) yourself. Realistically 20-25 lines for a clean pipeline.
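A clean version of that pipeline might look like the sketch below. The column names are placeholders for illustration, not the benchmark dataset's actual schema:

from sklearn.compose import ColumnTransformer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column split -- substitute your dataset's actual columns
numeric_cols = ["tenure", "monthly_spend", "support_tickets"]
categorical_cols = ["plan", "region"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

model = make_pipeline(preprocess,
                      MLPClassifier(hidden_layer_sizes=(64,), max_iter=200))
model.fit(X_train, y_train)
print(model.score(X_test, y_test))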

4. AutoML (AutoGluon)

AutoGluon automates model selection and hyperparameter tuning:

from autogluon.tabular import TabularDataset, TabularPredictor

train_data = TabularDataset("train.csv")   # file names assumed for illustration
test_data = TabularDataset("test.csv")
predictor = TabularPredictor(label='churned').fit(train_data, time_limit=120)
predictions = predictor.predict(test_data)

About 8 lines. But AutoGluon needs installation (large dependency, several GB), and the fit call searches over multiple model types, which takes 2-5 minutes even with time_limit=120. You also don't control the architecture.

5. MathExec

Open the app, type the formula, upload the CSV, map the target column, click Train.

Formula: y = σ(W₂ · ReLU(W₁x + b₁) + b₂)

Lines of code: 0. Keystrokes: ~40 (typing the formula). Decisions: 1 (which column is the target).

Results

Workflow           | Time to first prediction | Lines of code | Key decisions | Setup overhead
Raw PyTorch        | ~14 min                  | 62            | 8+            | None (you build everything)
Lightning          | ~9 min                   | 48            | 6             | pip install pytorch-lightning
scikit-learn       | ~4 min                   | 24            | 3             | pip install scikit-learn
AutoML (AutoGluon) | ~7 min                   | 8             | 1             | pip install autogluon (slow)
MathExec           | ~47 sec                  | 0             | 1             | Open browser tab

The "key decisions" column counts things like: choosing an optimizer, setting a learning rate, deciding on batch size, configuring the data loader, selecting a loss function, etc. Each decision is a potential mistake that costs debugging time.

What the numbers don't capture

Cognitive load

Raw PyTorch requires you to hold the entire pipeline in your head: data preprocessing, model architecture, training mechanics, evaluation strategy. You're simultaneously an architect and a plumber. That mental overhead is why the 14-minute workflow feels even longer than it is.

MathExec and sklearn have low cognitive load because they collapse multiple decisions into defaults. The tradeoff is that you give up control. But for a first experiment, control isn't what you need. You need signal.

Error surface

In the PyTorch workflows, we hit two bugs during timing: a shape mismatch (forgot to squeeze the output tensor) and a data type error (target column was float64 instead of float32). These are common first-attempt errors that add 2-5 minutes of debugging.

In MathExec and sklearn, these errors don't exist because the tool handles tensor shape and type conversion internally.
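In the PyTorch workflow, both fixes are one-liners once you spot them. A sketch, where model, x, and y are the objects from the raw PyTorch workflow above:

# Fix 1: shape mismatch -- BCELoss wants pred and target shaped alike
pred = model(x).squeeze(-1)                # (batch, 1) -> (batch,)

# Fix 2: dtype mismatch -- BCELoss wants a float32 target
y = y.to(torch.float32)                    # was float64 from pandas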

The iteration multiplier

One training run isn't the full picture. After seeing results, you typically want to try 2-3 variants: different architectures, different hyperparameters, different feature sets.

In PyTorch, each variant requires editing the model class and potentially the training loop. In MathExec, you delete the formula and type a new one. The iteration cost is ~10 seconds vs ~3-5 minutes. Over a session with 5 experiments, that's 50 seconds vs 15-25 minutes.

When to use what

These tools aren't interchangeable. They serve different stages:

  • MathExec: First 5-10 experiments on a new problem. Fast iteration, formula-level specification, automatic experiment tracking. Best for "is this model family worth pursuing?"
  • scikit-learn: Quick baselines when you know the algorithm and just need a number. Great for tabular data with standard models.
  • AutoML: When you want the tool to search the model space for you. Good when you don't have strong priors about the architecture.
  • Lightning: When you've picked an architecture and want to train it properly with logging, checkpointing, and multi-GPU support.
  • Raw PyTorch: When you need full control over every detail. Custom loss functions, unusual training schedules, research-grade reproducibility.

The mistake is using a heavyweight tool for a lightweight question. You don't need Lightning to check if logistic regression beats a neural network on your dataset. You don't need raw PyTorch to train y = σ(Wx + b).
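To make that concrete: the logistic-regression check is a two-line sklearn experiment. A sketch, assuming the same preprocessed X and y as before:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# If this score matches the neural network's, the extra capacity isn't buying much
print(cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean())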

Methodology notes

Times were measured on a MacBook Pro M2, with all tools pre-installed. The PyTorch times assume familiarity with the API (no time spent reading docs). We used a clean Python environment for each workflow. The AutoGluon time excludes installation but includes the fit search time.

Screen recordings of each workflow are available on request.

The hidden cost: context switching

One thing the time measurements don't capture is context switching cost. When you're in a raw PyTorch workflow, you're constantly switching between thinking about the model and thinking about the infrastructure. "Does the DataLoader need to shuffle? What dtype does the target need to be? Should I use BCELoss or BCEWithLogitsLoss?"

These micro-decisions individually take seconds, but they pull your attention away from the actual question you're trying to answer: "Does this model architecture work for my data?"

In the MathExec workflow, there are zero infrastructure decisions. You think about the formula, you think about the data, you look at the results. The tool handles everything between input and output.

This is also why the iteration multiplier matters more than the initial time. The first experiment might take 14 minutes in PyTorch and 47 seconds in MathExec, a 17x difference. But the fifth experiment takes 5 minutes in PyTorch (you already have the template) and 10 seconds in MathExec (just swap the formula). The gap widens over a session, not narrows.

Limitations of this benchmark

A few caveats to keep in mind:

This is one task. Binary classification on a clean tabular dataset is the best case for all tools. If the task involved custom loss functions, multi-modal input, or unusual training schedules, the raw PyTorch advantage grows because you need that flexibility.

Pre-installed tools. We didn't count installation time. AutoGluon takes several minutes to install and downloads gigabytes of dependencies. For a first-time user, the real time-to-first-prediction is much longer.

Expert users. The PyTorch times assume familiarity. A beginner would take 30-60 minutes, not 14. The benchmark measures the floor, not the ceiling.

Single metric. We measured time-to-first-prediction, not time-to-best-model. AutoML would likely produce a better final model given enough time, because it searches over many architectures automatically.

We encourage anyone to run this benchmark on their own setup and share results. The methodology is straightforward: same task, same dataset, same model architecture, different tools.

Why time-to-first-prediction matters

Most benchmarks in ML compare final model quality: accuracy, F1, AUC. But those benchmarks assume you've already built and trained the model. They don't capture the time and effort required to get to that point.

Time-to-first-prediction is the metric that determines whether you test an idea at all. If it takes 20 minutes, you'll only test ideas you're fairly confident about. If it takes 60 seconds, you'll test speculative ideas too. And breakthroughs come from speculative ideas, not from safe bets.


Try the 47-second workflow yourself: MathExec. Upload a CSV, type a formula, click Train.

