Building a Data Studio: NL-Powered Data Transforms
One of the most requested features after launch was better data handling. Users wanted to clean, transform, and explore their datasets before training. So we built the Data Studio.
This post covers the architecture, the design decisions we made, and what we learned along the way.
The problem
Raw CSV data is rarely ready for training. Columns need renaming, missing values need handling, features need engineering. A dataset of customer records might have ages stored as strings, missing values scattered across columns, and a target variable that needs to be derived from existing fields.
Traditionally, this means opening a Jupyter notebook and writing pandas code. Load the CSV, inspect the columns, write a few .dropna() and .astype() calls, create new features, save the cleaned version. It's not hard, but it's yet another context switch away from the actual modeling work.
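For concreteness, the manual loop looks something like this (file and column names here are hypothetical):

import pandas as pd

# The notebook version: load, inspect, fix types, handle missing values, save.
df = pd.read_csv("customers.csv")
print(df.dtypes)                                        # spot ages stored as strings
df["age"] = pd.to_numeric(df["age"], errors="coerce")   # strings -> numbers, bad values -> NaN
df = df.dropna(subset=["age"])                          # drop rows with no usable age
df.to_csv("customers_clean.csv", index=False)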
We wanted to bring that capability directly into MathExec, but without requiring users to write Python.
The solution: Natural language transforms
In Data Studio, you type what you want in plain English:
"Drop rows where age is missing, then normalize the salary column"
MathExec's LLM backend generates the equivalent pandas code, executes it in a sandbox, and shows you a before/after preview. If it looks right, you apply it. If not, you refine the instruction and try again.
This turns data cleaning from a programming task into a conversation. You describe what you want, see the result, and iterate until you're happy.
More examples of what you can say
- "Convert the date column to year and month features"
- "One-hot encode the category column"
- "Remove outliers in the price column (keep values within 3 standard deviations)"
- "Create a new column called BMI calculated from height_cm and weight_kg"
- "Rename 'col1' to 'temperature' and 'col2' to 'humidity'"
- "Fill missing values in the income column with the median"
These range from simple operations to multi-step transforms. The LLM handles both.
Architecture
User instruction → LLM (Gemini) → pandas code → Sandbox execution → Preview → Apply
The prompt engineering
Getting reliable code generation from an LLM requires careful prompt design. We send the model:
- The user's instruction
- Column names, data types, and 3-5 sample values per column
- Summary statistics (min, max, mean, null count) for numeric columns
- A strict output format requiring only pandas code, no explanation or markdown
The sample values and statistics give the LLM enough context to write correct code without seeing the full dataset. If a user says "remove outliers," the LLM can look at the distribution stats to decide what constitutes an outlier.
We also include a few constraints in the prompt:
- The input dataframe is always called df
- The output must also be called df
- Only pandas and numpy are available
- No file I/O, no network calls, no imports beyond pandas/numpy
- The code must not reference columns that don't exist in the dataframe
These constraints reduce the space of possible outputs and make the generated code more predictable.
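Putting the column context and the constraints together, prompt construction looks roughly like this (a simplified sketch; build_prompt and the exact wording are illustrative, not our production prompt):

import pandas as pd

def build_prompt(instruction: str, df: pd.DataFrame) -> str:
    # Per-column context: dtype, a few sample values, and basic stats.
    lines = []
    for col in df.columns:
        samples = df[col].dropna().head(3).tolist()
        entry = f"- {col} ({df[col].dtype}): samples={samples}"
        if pd.api.types.is_numeric_dtype(df[col]):
            entry += (f", min={df[col].min()}, max={df[col].max()},"
                      f" mean={df[col].mean():.2f}, nulls={df[col].isna().sum()}")
        lines.append(entry)
    schema = "\n".join(lines)
    return (
        "You transform a pandas DataFrame named `df`.\n"
        "Rules: output only pandas code, no explanation or markdown; "
        "the result must be assigned back to `df`; only pandas and numpy "
        "are available; no file I/O, network calls, or other imports.\n\n"
        f"Columns:\n{schema}\n\nInstruction: {instruction}"
    )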
The LLM returns something like:
df = df.dropna(subset=['age'])
df['salary'] = (df['salary'] - df['salary'].mean()) / df['salary'].std()
Sandboxed execution
We execute the generated code in a restricted environment with only pandas and numpy available. The code runs on a copy of the data, not the original. This is important for two reasons: safety (we don't want generated code deleting files or making network requests) and reversibility (if the transform produces garbage, the original data is untouched).
The sandbox uses Python's exec() with a restricted global namespace. We strip out builtins that could be dangerous (open, eval, __import__, os, sys) and only inject pd (pandas) and np (numpy) into the execution scope.
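A simplified sketch of that setup (the helper names are ours for illustration):

import pandas as pd
import numpy as np

# Builtins the generated code may use; open, eval, __import__, os, sys
# are simply absent from this namespace.
SAFE_BUILTINS = {"len": len, "range": range, "min": min, "max": max,
                 "sum": sum, "abs": abs, "round": round}

def run_transform(code: str, df: pd.DataFrame) -> pd.DataFrame:
    # Execute against a copy so a bad transform never touches the original.
    scope = {"__builtins__": SAFE_BUILTINS, "pd": pd, "np": np, "df": df.copy()}
    exec(code, scope)
    result = scope.get("df")
    if not isinstance(result, pd.DataFrame):
        raise TypeError("generated code must leave a DataFrame named 'df'")
    return result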
If the generated code raises an exception, we catch it and show the error message to the user with a suggestion to rephrase the instruction. Common errors include referencing columns that don't exist (usually a spelling issue) or type mismatches (trying to do arithmetic on string columns).
Before/after preview
Before any transform is applied, the user sees a side-by-side view: the first few rows of the dataset before and after the transform. Changed cells are highlighted, added columns are marked in green, and removed columns are marked in red.
This preview step is the single most important UI element in Data Studio. It builds trust. Users can verify that the LLM did what they asked before committing to the change. In our testing, about 85% of transforms are accepted on the first try. The other 15% get refined with a follow-up instruction.
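The diff itself is cheap to compute with pandas (a sketch; preview_diff is an illustrative name, and the highlighting lives in the UI layer):

import pandas as pd

def preview_diff(before: pd.DataFrame, after: pd.DataFrame, n: int = 5):
    added = after.columns.difference(before.columns)      # marked green
    removed = before.columns.difference(after.columns)    # marked red
    # Cell-level changes only make sense on rows and columns both frames share.
    shared_cols = before.columns.intersection(after.columns)
    shared_rows = before.index.intersection(after.index)
    changed = before.loc[shared_rows, shared_cols].compare(
        after.loc[shared_rows, shared_cols])
    return added, removed, changed.head(n)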
Transform pipeline
Each transform is recorded in a pipeline:
- Drop missing ages
- Normalize salary
- One-hot encode category
The pipeline is more than just a log. Users can:
- Reorder transforms: Drag steps to change the execution order
- Remove transforms: Delete a step and all subsequent steps re-execute
- Replay transforms: Apply the same pipeline to a different dataset (see the sketch after this list)
- View the code: See the pandas code that was generated for each step
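A sketch of the underlying structure (TransformStep and replay are illustrative names; run_transform is the sandbox sketch from earlier):

from dataclasses import dataclass
import pandas as pd

@dataclass
class TransformStep:
    instruction: str   # what the user typed
    code: str          # the pandas code the LLM generated

def replay(pipeline: list[TransformStep], df: pd.DataFrame) -> pd.DataFrame:
    # Re-run every step in order against a fresh dataset.
    for step in pipeline:
        df = run_transform(step.code, df)
    return df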
Pipelines are saved per-project and persisted in localStorage. When you come back to a project next week, your transform pipeline is still there, ready to replay on fresh data.
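Because each step is just two strings, the pipeline serializes naturally to JSON for storage (a sketch against the TransformStep above; the exact stored format here is illustrative):

import json
from dataclasses import asdict

def serialize(pipeline: list[TransformStep]) -> str:
    return json.dumps([asdict(step) for step in pipeline])

def deserialize(raw: str) -> list[TransformStep]:
    return [TransformStep(**d) for d in json.loads(raw)]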
Integration with training
The Data Studio connects directly to the training workflow. When you switch from the Data Studio tab to the Canvas and hit Train, MathExec uses the transformed version of your data. You don't need to export or re-upload anything.
This matters because data cleaning and feature engineering are iterative. You might train a model, see poor results, go back to Data Studio to engineer a new feature, and train again. That loop should be frictionless.
What we learned
Building Data Studio taught us several things about LLM-powered tools:
Show, don't tell. The before/after preview is the most important UI element. Users need to see what changed. Showing "Transform applied successfully" is not enough. People don't trust what they can't see, and they shouldn't have to.
Reversibility matters. Every transform must be undoable. We store the full pipeline, not just the final state. If a user applies 5 transforms and realizes the third one was wrong, they can remove it and the pipeline re-executes from that point.
LLMs are good at pandas. For single-step data transforms, LLM-generated pandas code is correct about 90% of the time. The preview step catches the other 10%. Multi-step instructions ("do X then Y then Z") have a lower success rate, around 75%, because errors compound. We encourage users to apply transforms one at a time.
Column context is everything. The quality of generated code improves dramatically when you include sample values and statistics in the prompt. Without context, the LLM has to guess what "normalize" means for a given column. With context, it can choose between z-score normalization, min-max scaling, or log transforms based on the distribution.
Error messages help iteration. When a transform fails, showing the raw Python traceback isn't helpful for most users. We extract the relevant part of the error and pair it with a suggestion: "Column 'Age' not found. Did you mean 'age'?" This small touch reduces the number of failed attempts significantly.
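That suggestion falls out of the standard library (a sketch; friendly_error is an illustrative name):

import difflib
import pandas as pd

def friendly_error(exc: Exception, df: pd.DataFrame) -> str:
    # Map a raw KeyError on a missing column to an actionable hint.
    if isinstance(exc, KeyError):
        missing = str(exc).strip("'\"")
        close = difflib.get_close_matches(missing, list(df.columns), n=1)
        hint = f" Did you mean '{close[0]}'?" if close else ""
        return f"Column '{missing}' not found.{hint}"
    return f"Transform failed: {exc}. Try rephrasing the instruction."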
What didn't work
Not everything we tried panned out.
Our first approach was to generate transforms without showing the code at all. Just the instruction and the result. This felt cleaner, but users kept asking "what did it actually do?" Hiding the code created anxiety. We added a collapsible "View code" section to each pipeline step, and usage went up immediately. Transparency beats simplicity when people's data is involved.
We also experimented with multi-turn conversations, where the LLM remembered previous transforms and could reference them. "Now do the same thing to the price column." This worked sometimes but was unreliable. The LLM would lose track of what "the same thing" meant across turns. We scrapped it in favor of independent, single-instruction transforms. Less elegant, more reliable.
Finally, we considered running transforms server-side for larger datasets. In the current design, only the LLM code generation happens on the backend (which sees column names, samples, and statistics, not the full dataset); the actual pandas execution happens in the browser via Pyodide. This keeps data client-side (a privacy win) but limits dataset size to what the browser can handle. For our target use case (quick experimentation on CSVs under 100MB), this works well. For production-scale data, users should be using proper ETL tools anyway.
What's next for Data Studio
We're working on a few things for the next iteration:
- Transform templates: Pre-built transforms for common tasks (standardize all numeric columns, drop all columns with >50% missing values) that you can apply with one click; see the sketch after this list
- Auto-suggestions: Based on data profiling, Data Studio will suggest transforms before you ask. If a column has 30% null values, it'll suggest strategies for handling them
- Richer previews: Distribution plots and correlation changes alongside the table preview, so you can see how transforms affect the statistical properties of your data
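For a sense of scope, the first two templates are roughly these pandas one-liners (illustrative sketches, not the shipped implementations):

import pandas as pd

def standardize_numeric(df: pd.DataFrame) -> pd.DataFrame:
    # z-score every numeric column.
    num = df.select_dtypes("number").columns
    df[num] = (df[num] - df[num].mean()) / df[num].std()
    return df

def drop_sparse_columns(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    # Keep only columns whose null fraction is at or below the threshold.
    return df.loc[:, df.isna().mean() <= threshold]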
Try it
Upload a dataset in MathExec, switch to the Data Studio tab, and type a transform instruction. Start with something simple like "show summary statistics" and work your way up to complex feature engineering.
Data Studio is available to all MathExec users. No API key required for basic transforms.
