In March 2026, three LLM agents generated over 600,000 lines of code, ran 850 experiments, and helped secure a first-place finish in a Kaggle Playground competition focused on telecom customer churn prediction. An NVIDIA developer blog post describes the workflow in detail, positioning LLM agents as the solution to the code-writing bottleneck that has historically limited rapid ML experimentation.
The competition metric was area under the curve (AUC). The winning solution was a four-level stack of 150 models selected from the 850 experiments. Three LLMs drove the agent workflow: GPT-5.4 Pro, Gemini 3.1 Pro, and Claude Opus 4.6, operating in a human-in-the-loop configuration.
The two bottlenecks LLM agents address
The post frames modern ML competition performance as determined by how quickly you can generate, test, and iterate on ideas. Two bottlenecks have historically limited this: how quickly you can write code for new experiments, and how quickly you can execute those experiments. According to the post, GPU-accelerated libraries — NVIDIA cuDF, NVIDIA cuML, XGBoost, and PyTorch — have largely solved the execution problem. LLM agents address the code-writing problem, unlocking a new scale of rapid iterative experimentation.
The four-step workflow
The workflow the agents followed maps onto a standard Kaggle Grandmaster playbook: exploratory data analysis, baseline construction, feature engineering, and model combination.
In Step 1, an LLM agent performs EDA to understand data structure before generating any full pipeline. The key questions are the number of rows and columns, the target column format, whether the task is classification or regression, what features are available and their types, and whether missing data exists. The post offers two patterns: when using an LLM in a chat window, the developer runs the code and feeds plots and text back; when using an LLM with code execution like Claude Code, the agent writes and runs its own EDA code directly.
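As a rough illustration of the first pattern, the EDA code an agent produces might look like the minimal sketch below. The file name train.csv and the churn target column are hypothetical stand-ins, since the post does not publish the competition schema; cuDF could be swapped in for pandas with the same calls.

```python
import pandas as pd

# Hypothetical file and column names; the actual competition schema is not shown in the post.
train = pd.read_csv("train.csv")

# How many rows and columns, and which features are numeric vs. categorical.
print(train.shape)
print(train.dtypes)

# Target format: a small set of discrete values implies classification, continuous values imply regression.
print(train["churn"].value_counts(dropna=False))

# Missing data, most-affected columns first.
print(train.isna().sum().sort_values(ascending=False).head(20))
```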
In Step 2, once the model understands the data, agents build the first full pipeline: a k-fold model that saves out-of-fold (OOF) predictions and test predictions to disk as NumPy files. Each experiment reports a CV score and saves predictions as train_oof_[MODEL]_[VERSION].npy and test_preds_[MODEL]_[VERSION].npy. The post is explicit that these files are important for later combination steps. Baselines span gradient boosted decision trees, neural networks, and classical ML models.
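A baseline in this shape could look like the following sketch, here using XGBoost as one of the GBDT baselines the post mentions. The arrays X, y, and X_test, the fold count, and the hyperparameters are illustrative assumptions; only the file-naming convention comes from the post.

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

def run_baseline(X, y, X_test, model_name="xgb", version="v1", n_splits=5):
    """K-fold baseline that reports a CV AUC and saves OOF/test predictions to disk.

    X, y, X_test are assumed to be NumPy arrays; hyperparameters are placeholders.
    """
    oof = np.zeros(len(X))
    test_preds = np.zeros(len(X_test))
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

    for fold, (tr_idx, va_idx) in enumerate(skf.split(X, y)):
        model = xgb.XGBClassifier(
            tree_method="hist",
            device="cuda",          # GPU training, in line with the post's GPU-first workflow
            n_estimators=1000,
            learning_rate=0.05,
        )
        model.fit(X[tr_idx], y[tr_idx], eval_set=[(X[va_idx], y[va_idx])], verbose=False)
        oof[va_idx] = model.predict_proba(X[va_idx])[:, 1]
        test_preds += model.predict_proba(X_test)[:, 1] / n_splits

    print(f"CV AUC: {roc_auc_score(y, oof):.5f}")
    # Naming convention from the post, so later combination steps can locate every experiment.
    np.save(f"train_oof_{model_name}_{version}.npy", oof)
    np.save(f"test_preds_{model_name}_{version}.npy", test_preds)
    return oof, test_preds
```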
Step 3 uses agents for feature engineering and model improvement. Because LLM agents can write code as quickly as it is requested, the limiting factor shifts to execution speed. The post describes running each experiment with NVIDIA cuDF, cuML, GPU-accelerated gradient boosting, and PyTorch on GPUs to keep the iteration cycle fast. Ideas for new experiments can come from prompting agents to find and read research papers, read forums and publicly shared code, perform additional EDA to identify feature-target relationships, or brainstorm collaboratively with the human operator. Every experiment, whether it improves the metric or not, saves its OOF and test predictions to disk.
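A typical feature-engineering experiment in this loop might add ratio and group-aggregation features with cuDF so the transform runs on the GPU. The sketch below assumes hypothetical telecom column names (tenure_months, total_charges, monthly_charges, contract_type), which are not published in the post; the same code works with pandas by changing the import.

```python
import cudf

def add_features(df: cudf.DataFrame) -> cudf.DataFrame:
    """Example GPU-side feature engineering; column names are hypothetical."""
    df = df.copy()

    # Ratio feature: total spend per month of tenure (+1 guards against division by zero).
    df["charge_per_tenure"] = df["total_charges"] / (df["tenure_months"] + 1)

    # Group aggregation: mean monthly charge per contract type, and each customer's deviation from it.
    means = (
        df.groupby("contract_type", as_index=False)["monthly_charges"]
          .mean()
          .rename(columns={"monthly_charges": "contract_mean_monthly"})
    )
    df = df.merge(means, on="contract_type", how="left")
    df["monthly_charges_vs_contract_mean"] = df["monthly_charges"] - df["contract_mean_monthly"]
    return df
```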
In Step 4, agents combine the accumulated experiments. The post describes several combination techniques. Agents can summarize all model types and feature engineering across hundreds of files by reading experiment scripts and aggregating results. They can combine ideas from multiple models into a stronger single model. They can build ensembles and multi-level stacks. They can use OOF and test predictions as pseudo-labels for knowledge distillation into new models — the post gives a direct prompt example: “Can you please train a new single NN or GBDT using knowledge distillation from all our OOF and Test PREDs and make a new high performing single model?” For the final stacking step, the agents try hill climbing and multiple meta models including Ridge/Logistic regression, neural networks, and GBDT stackers.
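The post does not publish the exact hill-climbing variant used, but a common equal-weight greedy version over the saved OOF files would look roughly like this sketch. The labels file y_train.npy is a hypothetical name; the OOF file pattern follows the convention from Step 2.

```python
import glob
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels file saved earlier in the pipeline.
y_train = np.load("y_train.npy")

# Load every saved experiment's out-of-fold predictions.
oofs = {f: np.load(f) for f in sorted(glob.glob("train_oof_*.npy"))}

# Start from the single best experiment by AUC.
best_file = max(oofs, key=lambda f: roc_auc_score(y_train, oofs[f]))
blend, chosen = oofs[best_file].copy(), [best_file]
best_auc = roc_auc_score(y_train, blend)

# Greedily add whichever experiment most improves the blended AUC (repeats allowed).
improved = True
while improved:
    improved = False
    for f, oof in oofs.items():
        candidate = (blend * len(chosen) + oof) / (len(chosen) + 1)  # equal-weight mean update
        auc = roc_auc_score(y_train, candidate)
        if auc > best_auc:
            best_auc, best_f, best_candidate = auc, f, candidate
            improved = True
    if improved:
        blend, chosen = best_candidate, chosen + [best_f]

print(f"Hill-climb blend AUC: {best_auc:.5f} using {len(chosen)} selections")
```

The same saved OOF and test prediction files feed the meta models (Ridge/Logistic regression, neural networks, GBDT stackers) and the knowledge-distillation step, which is why the post insists on saving every experiment's predictions regardless of whether it improved the score.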
Why the stack worked
The advantage the post identifies is not any single technique but the combination of scale and speed. LLM agents can write code faster than human engineers, GPU-accelerated libraries can execute that code faster than CPU alternatives, and the two together compress the idea-to-result loop enough to explore 850 experiments in a single competition cycle. The four-level stack of 150 models is a direct output of that exploratory capacity — it would not be practical to build manually within a competition timeline.
The human-in-the-loop structure means the operator guided the agents through the workflow stages, reviewed results, and directed the exploration rather than leaving agents fully autonomous. The post frames this as applicable to any tabular data prediction task, not just Kaggle competitions, for anyone searching for high-performing solutions under time constraints.