Scientific programming differs from general-purpose coding. For instance, a web server should be robust to all sorts of terrible inputs and never go down. In contrast, if I’m writing a scientific data analysis pipeline, I want it to come to a screeching halt at the first indication that the inputs deviate from my expectations.
This post builds on the Agentic Git Flow workflow with domain-specific considerations for scientific code.
## Correctness is paramount, but typical tests can be challenging to formulate
In science we sometimes don’t know the right answer, making it challenging to write an effective unit test. One useful strategy is to compare implementations: for example, I can often write a model component in a horribly slow but obviously correct way with nested for loops, and then use the slow version’s output to test a fast vectorized version.
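Here is a minimal sketch of this pattern; the task (pairwise Hamming distances) and all names are hypothetical stand-ins:

```python
import numpy as np

def hamming_slow(seqs: np.ndarray) -> np.ndarray:
    """Pairwise Hamming distances via nested for loops: slow but obviously correct."""
    n = len(seqs)
    out = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(n):
            out[i, j] = sum(a != b for a, b in zip(seqs[i], seqs[j]))
    return out

def hamming_fast(seqs: np.ndarray) -> np.ndarray:
    """Vectorized version: compare all pairs at once via broadcasting."""
    return (seqs[:, None, :] != seqs[None, :, :]).sum(axis=2)

def test_implementations_agree() -> None:
    rng = np.random.default_rng(0)
    seqs = rng.integers(0, 4, size=(20, 50))  # 20 toy sequences of length 50
    np.testing.assert_array_equal(hamming_slow(seqs), hamming_fast(seqs))
```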
If the code is mathematical, I have found it useful to use the Math PR Summarizer agent to translate the code back into mathematical notation. Seeing the model in a different, more readable format can help catch errors.
## Fail fast
“Defensive programming” in industry means having code fail gracefully, which often means silently absorbing errors. That is great for web servers but hugely problematic for science: if something is wrong, I want to know immediately via an exception that stops everything.
Because Claude is trained to be a helpful coder on average, and “good” code means “defensive” in this sense for most settings, it will write code that handles edge cases with defaults and silently absorbs errors.
That is bad for us, so here is a section of my CLAUDE.md:
#### Fail-Fast, No Fallbacks
- **No Silent Fallbacks**: Code must fail immediately when expected conditions aren't met. Silent fallback behavior masks bugs and creates unpredictable systems.
- **Explicit Error Messages**: When something goes wrong, stop execution with clear error messages explaining what failed and what was expected.
- **Example**: `raise ValueError(f"Required model {model_name} not found")` instead of falling back to first available model.
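To make the contrast concrete, here is a minimal sketch; the model registry is hypothetical:

```python
def get_model(models: dict[str, object], model_name: str) -> object:
    """Fail fast: an unknown model name halts execution with a clear message."""
    if model_name not in models:
        raise ValueError(
            f"Required model {model_name!r} not found; "
            f"available models: {sorted(models)}"
        )
    return models[model_name]

# The silent-fallback version, models.get(model_name, default_model), would
# happily mask a typo and run the analysis with the wrong model.
```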
## Backwards compatibility is not a thing
Similarly, in most settings you want to be careful when changing an API, because other users or software may depend on it. In contrast, our scientific APIs typically have a single “user” within the same package, and for that use case backwards compatibility is clutter. Here is the relevant section of my CLAUDE.md:
### ⚠️ **IMPORTANT: Rewrite Project - Breaking Changes Encouraged**
**DASM2 is a complete rewrite**, not an actively used codebase with external dependencies. This means:
- **Breaking changes are encouraged** when they follow best practices
- **No backward compatibility constraints** - optimize for clean architecture
- **Clean module organization** - each module has a single, clear purpose
This approach ensures the codebase remains maintainable and forces explicit dependencies that make the architecture clear to all developers.
## Pipelines
Pipelines are sometimes not considered real software engineering, but they are! As such, the same processes and clean code considerations apply.
For example, can you write a complex pipeline in one enormous Snakefile? Yes you can! Is that a good idea? No it is not!
A clean code approach is to make a small package with a nice API that gets called by the Snakefile. This package can have tests! Personally, I have been moving toward giving the package a small command-line interface built with Google’s Python Fire which then gets called by Snakemake.
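As a sketch of this setup (the package, module, and function names are hypothetical), the package might expose one pipeline step like so:

```python
# mypipeline/cli.py -- a thin Fire CLI over the tested package internals.
import fire

def filter_sequences(input_path: str, output_path: str, max_mutations: int = 10) -> None:
    """Filter sequences by mutation count; the real logic lives in the package."""
    ...

if __name__ == "__main__":
    fire.Fire(filter_sequences)
```

A Snakemake rule can then invoke `python -m mypipeline.cli {input} {output} --max_mutations=5` from its `shell` directive, keeping the Snakefile itself thin.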
To ensure a computational pipeline is doing the right thing, I use the following three strategies:
- Read the original papers/documentation describing the data and identify “invariants” that should be true for your pipeline. For instance, we should have X sequences in the output and they should have at most Y mutations. Assert that these numbers are all correct in the full pipeline (see the sketch after this list).
- Write a special-case implementation for a small subset of the data. Make this eminently readable. Check it carefully, then ensure that your automated pipeline gives the same results as that miniature case.
- Make an interactive visualization of the processed data before you start modeling it. Why not? Code is cheap and Altair is fun!
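Here is a minimal sketch of the first strategy; the invariant values and data structures are hypothetical:

```python
# Invariants drawn from the paper describing the data (values hypothetical).
EXPECTED_N_SEQUENCES = 12_345
MAX_MUTATIONS = 30

def check_invariants(seqs: list[str], mutation_counts: list[int]) -> None:
    """Halt the pipeline immediately if the output violates a known invariant."""
    assert len(seqs) == EXPECTED_N_SEQUENCES, (
        f"Expected {EXPECTED_N_SEQUENCES} sequences, got {len(seqs)}"
    )
    assert max(mutation_counts) <= MAX_MUTATIONS, (
        f"Found a sequence with {max(mutation_counts)} mutations "
        f"(maximum allowed: {MAX_MUTATIONS})"
    )
```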
## Additional safeguards
Because we want scientific code to be correct, it is worth adding verification tools, which become much easier to adopt when coding with agents. For example, it has become best practice in industry to use static type checkers such as mypy to ensure that types agree between calls in dynamically typed languages such as Python. (Note that Claude Code itself is written in TypeScript, a typed superset of JavaScript.)
Having types makes code easier to understand and catches bugs. It also catches bad design decisions: for example, having a single function return multiple different types indicates a bug or bad design. It’s easy to find those code smells by examining the type annotations; a problematic return type might be `Union[List, Tuple]`, or worse, `Any`.
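As a contrived sketch of the smell and its fix:

```python
from pathlib import Path
from typing import Union

# Code smell: the return type depends on a flag, so every caller must branch.
def load_data(path: str, as_dict: bool = False) -> Union[list[float], dict[int, float]]:
    values = [float(line) for line in Path(path).read_text().splitlines()]
    return dict(enumerate(values)) if as_dict else values

# Better: two functions, each with a single, predictable return type.
def load_values(path: str) -> list[float]:
    return [float(line) for line in Path(path).read_text().splitlines()]

def load_indexed_values(path: str) -> dict[int, float]:
    return dict(enumerate(load_values(path)))
```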
However, maintaining type annotations is dreadful when done by hand, which is why we hadn’t used them in our code. But now it’s easy: coding agents can add types and fix type errors automatically, which we do by having `make check` run mypy. Agents do need some guidelines to avoid overly broad types; these can live in your CLAUDE.md or in a review document.
We also use ruff, which does basic linting and formatting. Same deal: just add it as a Makefile target and insist that it run cleanly before PRs are merged.
## Notebooks
Notebooks are great for science: they are a form of literate programming where one can see plots alongside the code that generated them.

The current state of using Jupyter notebooks with Claude is decent. On the plus side, Claude can look at the plots in your notebook and interpret them directly, and if you want it to run your notebook and then read the result, you can ask it to use nbconvert. It is also good at editing individual cells in your notebook. On the minus side, if you ask it to do a major rearrangement of cells, it will sometimes get confused and break the Jupyter notebook syntax.
Because of this, I have been moving toward plain Python that generates HTML visualizations. Another option is Marimo notebooks, as their plain Python syntax should work better with agentic coding tools (H/T Jesse Bloom).
This is part 3 of a 4-part series on agentic coding:
- Agentic Coding from First Principles
- Agentic Git Flow
- Writing Scientific Code Using Agents (this post)
- The Human Experience of Coding with an Agent