Quoting Kenny Workman on Harness Design

Describing results from their manuscript “SpatialBench: Can Agents Analyze Real-World Spatial Biology Data?” (arXiv), Kenny Workman writes:

The most actionable result: harness design should be a first class object for engineering and benchmarking. While largely considered “glue code”, tools, prompts, control flow, etc. can change the outcome as much as swapping the base models.

What is meant by “harness”? From the manuscript:

a harness denotes the full execution wrapper around a base model: the system prompt, available tools, control flow (planning and retry policies), answer schema enforcement, and the runtime environment used to execute code. All harnesses provide an interactive compute setting with access to common scientific Python tooling and the local workspace containing the problem’s data snapshot. Harnesses differ primarily in their prompting strategy, tool routing, and how they structure multi-step work (e.g., whether they enforce intermediate checks, how they respond to errors, and how they decide when to stop).
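To make that definition concrete, here is a minimal sketch of a harness in Python. It is my own toy illustration, not code from SpatialBench: the `Harness` class, the `CALL` tool convention, `stub_model`, and `is_integer` are all hypothetical names. It covers only the system prompt, tools, a step budget with retries, and answer validation, and it shows how the same "model" can succeed or fail depending solely on the wrapper around it.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

# Hypothetical types for illustration only (not the SpatialBench API).
Model = Callable[[str, str], str]  # (system_prompt, message) -> raw reply
Tool = Callable[[str], str]        # argument string -> tool output


@dataclass
class Harness:
    system_prompt: str
    validate: Callable[[str], bool]               # answer schema enforcement
    tools: Dict[str, Tool] = field(default_factory=dict)
    max_steps: int = 3                            # control flow: step/retry budget

    def run(self, model: Model, task: str) -> Optional[str]:
        message = task
        for _ in range(self.max_steps):
            reply = model(self.system_prompt, message)
            if reply.startswith("CALL "):
                # Assumed toy convention: "CALL <tool_name> <argument>"
                name, _, arg = reply[5:].partition(" ")
                tool = self.tools.get(name)
                result = tool(arg) if tool else f"unknown tool {name!r}"
                message = f"{task}\n\nTool {name} returned: {result}"
                continue
            if self.validate(reply):
                return reply
            # Feed the failure back to the model and try again.
            message = f"{task}\n\nPrevious answer {reply!r} was rejected; reply with digits only."
        return None  # step budget exhausted without a valid answer


def stub_model(system_prompt: str, message: str) -> str:
    # Toy stand-in for a base model: it only answers cleanly after feedback.
    return "42" if "rejected" in message else "about 42"


def is_integer(reply: str) -> bool:
    return reply.strip().isdigit()


one_shot = Harness(system_prompt="Answer with an integer.", validate=is_integer, max_steps=1)
with_retry = Harness(system_prompt="Answer with an integer.", validate=is_integer, max_steps=3)

print(one_shot.run(stub_model, "How many clusters?"))    # None
print(with_retry.run(stub_model, "How many clusters?"))  # 42
```

The model is identical in both runs; only the retry budget differs. Real harnesses vary along many more axes (prompting strategy, tool routing, intermediate checks, stopping rules), but the same principle applies.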

The results:

A chart depicting the impact of harness design on SpatialBench benchmark scores.

I find it striking how much harness design affects benchmark performance. It suggests we’re still very early in optimizing AI models for biological tasks.

The SpatialBench repo can be found on GitHub.


This work by Derek Croote is licensed under CC BY-NC 4.0.