
Evaluating Large Language Models on Agentic Workspace Bootstrapping

Ethan Henley, Gianni Crivello · 18 min read

TL;DR

We want Agents to configure and manage their own environments through Agent Routines. One particularly helpful routine is the Boot Agent Routine: given a BOOT.md file, we let an Agent install its packages, write initial library code, and create the AGENTS.md and SKILL.md files that downstream tasks will rely on. We want the Agent to do this well and in alignment with the user's use case.

We present an evaluation of three frontier LLMs on this task.

We use a two-phase methodology, a deterministic check followed by an LLM-as-judge pass with calibrated rubrics across five quality dimensions, to evaluate Claude Opus 4.5, GPT-5.2 Codex, and Kimi K2.5 over 30 independent trials.

Our results indicate that while the models achieve comparable aggregate performance (let's go open-source Kimi!), significant differences emerge in specific capabilities: Claude Opus 4.5 demonstrates superior performance on holistic project documentation (+10.6 points on AGENTS.md), while Kimi K2.5 shows advantages in structured knowledge representation (the quality of the SKILL.md files).

All models exhibited a consistent weakness in generating safety-critical guidelines.

Why do this?

The Nightshift project was started to help us ship coding agents to our customers, namely business users: people who aren't technical and won't even be vibe coding. They simply want an "engineer" or "IT guy" in their computer that can help them automate work tasks. That means we need to set the Agent up for success. We want an Agent (through Agent Routines) to manage its workspace for the duration of its life: writing library code, updating SKILL.md files based on session history, updating its AGENTS.md, and scheduling recurring work.

When you run `nightshift install --prefix ~/.agent`, the CLI takes you through a "boot" sequence. This uses a new file we're proposing, BOOT.md, which tells the Agent what the user wants to accomplish; it's then up to the Agent to configure all of the above.

You might wonder, "can't we just use something like skills.sh?" We certainly could, but then we would sacrifice flexibility for our users by requiring them to manually install SKILLS for the agent based on their use case. Instead, we have found that it's a much better experience to declaratively tell Nightshift what the Agent will be used for and let the Agent bootstrap itself. Also, most SKILLS or AGENTS files are vibe coded anyway.

Let’s formalize this a bit.


Introduction

The emergence of fully declarative agentic AI systems has created new requirements for workspace initialization and configuration.

Unlike traditional software scaffolding, agentic workspaces require rich semantic documentation that enables downstream AI systems to understand project context, adhere to safety boundaries, and invoke appropriate capabilities.

Our interpretation of the current "boot agent" paradigm addresses this challenge by tasking an LLM with generating the foundational layer for a new workspace. This includes:

  1. Library Code: Python code to support the use case
  2. Python Packages: dependencies installed to support the use case
  3. Skill Files: Domain-specific knowledge encapsulated in structured markdown
  4. Agent Configuration: Project-level instructions including commands, safety boundaries, and capability routing

We have found that the quality of these artifacts directly impacts the effectiveness and safety of subsequent agentic operations within the workspace.

Questions We Wanted to Answer

1: How do frontier LLMs (closed and open source) compare on agentic workspace bootstrapping tasks?

2: What quality dimensions differentiate model performance, and where do systematic weaknesses exist?

3: What is the variance in output quality across repeated trials, and how does this affect deployment reliability?

Before we get started, you can run this eval yourself! Check out https://docs.nightshift.sh/eval for more info.


Task Definition

The boot agent task presents the model with a BOOT.md file containing:

  • User intent statement (natural language description of project purpose)
  • Suggested skill domains (semantic categories for knowledge encapsulation)
  • Technical context (target platform, data sources, authentication methods)
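For concreteness, here is a minimal sketch of how an eval harness might represent a parsed BOOT.md, populated with the test configuration described below. The `BootSpec` shape and field names are illustrative assumptions for this post, not Nightshift's actual schema.

```typescript
// Illustrative only: a hypothetical in-memory shape for a parsed BOOT.md.
// Field names are assumptions for this sketch, not Nightshift's real schema.
interface BootSpec {
  userIntent: string;        // natural language description of project purpose
  skillDomains: string[];    // suggested semantic categories for SKILL.md files
  technicalContext: {
    platform: string;        // target runtime
    dataSources: string[];   // where the data lives
    authentication: string;  // how the agent is expected to authenticate
  };
}

// The test configuration used in this evaluation, expressed in that shape.
const testBoot: BootSpec = {
  userIntent:
    "Supply chain analytics for Odoo ERP data stored as parquet files on Google Cloud Storage",
  skillDomains: ["schema reference", "purchasing analysis", "query patterns", "calculation formulas"],
  technicalContext: {
    platform: "Python 3.13",
    dataSources: ["parquet files on GCS, queried via DuckDB"],
    authentication: "GCS HMAC keys",
  },
};
```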

The model must autonomously:

  1. Generate domain-specific SKILL.md files
  2. Generate one project-level AGENTS.md configuration file
  3. Install appropriate dependencies (evaluated but not scored)

Test Configuration

For this test we chose an environment target we have worked extensively with in the past: procurement/purchasing analysis with the Odoo ERP (https://www.odoo.com/).

We worked with procurement experts to benchmark different Agent configurations until we received expert approval on a series of tasks.

We thus had a clear understanding of the skills and direction an agent would need to succeed in this environment. For skills, the agent must have a deep understanding of analytics, purchasing logic, Odoo schema context, and common formulas. The AGENTS.md file must clearly define skill locations so search is seamless, and must provide proper commands for navigating the agent's environment and tool set.

We define the following parameters for the agent environment in the test BOOT.md file:

| Parameter | Value |
| --- | --- |
| User Intent | "Supply chain analytics for Odoo ERP data stored as parquet files on Google Cloud Storage" |
| Skill Domains | Schema reference, purchasing analysis, query patterns, calculation formulas |
| Target Environment | Python 3.13, DuckDB, GCS with HMAC authentication |
| Files per Trial | 5 (4 skills + 1 AGENTS.md) |
| Trials per Model | 10 |
| Total Evaluations | 150 |

We use the following frontier (as of writing) models for this analysis:

| Model | Provider | Context Window |
| --- | --- | --- |
| Claude Opus 4.5 | Anthropic | 200K |
| GPT-5.2 Codex | OpenAI | 128K |
| Kimi K2.5 | Moonshot AI | 128K |

All models were accessed via API, and each trial ran in an isolated workspace with fresh state.
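As a rough illustration of what that looks like (not the actual harness code), each trial can be run in a throwaway directory so no state leaks between runs. The function names and model identifiers below are hypothetical.

```typescript
// Hypothetical trial loop: names, model identifiers, and signatures here are
// assumptions for this sketch, not the real Nightshift eval harness.
import { mkdtemp } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";

const MODELS = ["claude-opus-4.5", "gpt-5.2-codex", "kimi-k2.5"];
const TRIALS_PER_MODEL = 10;

async function runAllTrials(
  runBootAgent: (model: string, workspace: string) => Promise<void>,
  evaluateWorkspace: (model: string, workspace: string) => Promise<void>,
) {
  for (const model of MODELS) {
    for (let trial = 0; trial < TRIALS_PER_MODEL; trial++) {
      // Fresh, isolated directory per trial so earlier outputs can't leak into later ones.
      const workspace = await mkdtemp(join(tmpdir(), `boot-eval-${model}-${trial}-`));
      await runBootAgent(model, workspace);      // agent writes SKILL.md files + AGENTS.md here
      await evaluateWorkspace(model, workspace); // phase 1 + phase 2 checks run on the result
    }
  }
}
```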

Evaluation Framework

This is a two-phase evaluation: the first phase is deterministic and the second uses an LLM-as-a-Judge. The deterministic check ensures that the files exist and are in the right places; for example, all skills must be written to `.opencode/skills/[skill-name]/SKILL.md`, and AGENTS.md must be located in the workspace root.
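A minimal sketch of what that deterministic phase might look like, assuming Node's standard fs APIs (this is an approximation, not the exact code in boot-agent.ts):

```typescript
// Sketch of the phase-1 structural check: do the expected files exist in the
// expected places? Illustrative approximation, not the shipped code.
import { existsSync, readdirSync } from "node:fs";
import { join } from "node:path";

function structuralCheck(workspace: string): { ok: boolean; problems: string[] } {
  const problems: string[] = [];

  // AGENTS.md must sit at the workspace root.
  if (!existsSync(join(workspace, "AGENTS.md"))) {
    problems.push("missing AGENTS.md at workspace root");
  }

  // Every skill must live at .opencode/skills/<skill-name>/SKILL.md.
  const skillsDir = join(workspace, ".opencode", "skills");
  const skillNames = existsSync(skillsDir)
    ? readdirSync(skillsDir, { withFileTypes: true })
        .filter((entry) => entry.isDirectory())
        .map((entry) => entry.name)
    : [];

  if (skillNames.length === 0) {
    problems.push("no skill directories found under .opencode/skills/");
  }
  for (const name of skillNames) {
    if (!existsSync(join(skillsDir, name, "SKILL.md"))) {
      problems.push(`skill "${name}" has no SKILL.md`);
    }
  }

  return { ok: problems.length === 0, problems };
}
```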

We employed GPT-4o as the evaluation judge, selected for its demonstrated calibration on structured assessment tasks. The judge received the generated content along with the original user intent and a detailed rubric.

Each file type was evaluated on five dimensions (0-20 points each, total 0-100). See Appendix C and Appendix D for full scoring criteria.

Table 1: Skill File Dimensions

| Code | Dimension | Description |
| --- | --- | --- |
| D1 | Knowledge Delta | Expert knowledge not derivable from first principles |
| D2 | Specificity & Actionability | Concrete, executable instructions |
| D3 | Anti-Patterns & Safety | Explicit NEVER/ALWAYS/ASK-FIRST rules |
| D4 | Structure & Discoverability | Clear triggers and progressive disclosure |
| D5 | Tailoring to Intent | Customization to user's purpose and stack |

Table 2: AGENTS.md Dimensions

| Code | Dimension | Description |
| --- | --- | --- |
| D1 | Project Specificity | Grounded in actual workspace paths and packages |
| D2 | Command Accuracy | Exact, copy-pasteable commands |
| D3 | Safety Boundaries | Three-tier permission structure |
| D4 | Code Style Concreteness | Style shown through code examples |
| D5 | Skill Catalog & Routing | Complete skill listing with paths and triggers |

The judge was instructed to be "harsh" and reserve scores above 15 for exceptional work, reducing ceiling effects common in LLM evaluation.
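Putting the second phase together, a judge verdict for one file reduces to five 0-20 dimension scores that sum to a 0-100 total. The sketch below assumes an OpenAI-style chat completion call with a JSON response format; the prompt wording is paraphrased, and the exact prompts live in the file referenced in Appendix A.

```typescript
// Illustrative LLM-as-judge call: the client setup, prompt wording, and response
// shape are assumptions for this sketch; see Appendix A for the real prompts.
import OpenAI from "openai";

interface JudgeVerdict {
  d1: number; d2: number; d3: number; d4: number; d5: number; // each 0-20
}

const client = new OpenAI();

async function judgeFile(rubric: string, userIntent: string, fileContent: string): Promise<number> {
  const response = await client.chat.completions.create({
    model: "gpt-4o",
    response_format: { type: "json_object" },
    messages: [
      {
        role: "system",
        content:
          "You are a harsh evaluator. Score each dimension 0-20 using the rubric. " +
          "Reserve scores above 15 for exceptional work. Reply as JSON with keys d1-d5.",
      },
      {
        role: "user",
        content: `Rubric:\n${rubric}\n\nUser intent:\n${userIntent}\n\nFile:\n${fileContent}`,
      },
    ],
  });

  const verdict = JSON.parse(response.choices[0].message.content ?? "{}") as JudgeVerdict;
  // Total on the 0-100 scale reported in the results tables.
  return verdict.d1 + verdict.d2 + verdict.d3 + verdict.d4 + verdict.d5;
}
```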


Results

Aggregate Performance

Table 3: Overall Model Performance (All Files)

| Model | μ | Median | σ | Min | Max | n |
| --- | --- | --- | --- | --- | --- | --- |
| Claude Opus 4.5 | 69.62 | 71.5 | 8.74 | 47 | 83 | 50 |
| Kimi K2.5 | 69.52 | 68.5 | 7.68 | 54 | 84 | 50 |
| GPT-5.2 Codex | 67.70 | 66.0 | 5.31 | 59 | 82 | 50 |

The three models achieve comparable aggregate performance, with Claude Opus 4.5 and Kimi K2.5 effectively tied (Δ=0.1) and GPT-5.2 Codex trailing by approximately 2 points.

Performance by File Type

Disaggregating by file type reveals meaningful differences obscured by aggregate scores.

Table 4: Skill File Performance (n=40 per model)

| Model | μ | Median | σ |
| --- | --- | --- | --- |
| Kimi K2.5 | 67.90 | 68.0 | 7.19 |
| GPT-5.2 Codex | 67.53 | 66.0 | 5.43 |
| Claude Opus 4.5 | 67.28 | 67.0 | 8.03 |

Table 5: AGENTS.md Performance (n=10 per model)

| Model | μ | Median | σ |
| --- | --- | --- | --- |
| Claude Opus 4.5 | 79.00 | 80.5 | 3.83 |
| Kimi K2.5 | 76.00 | 77.0 | 6.18 |
| GPT-5.2 Codex | 68.40 | 68.5 | 5.02 |

Finding 1: Claude Opus 4.5's overall advantage derives primarily from AGENTS.md generation, where it outperforms GPT-5.2 Codex by 10.6 points, a substantial effect size. On skill files alone, all models perform within 0.6 points of each other.

Performance by Skill Domain

Table 6: Mean Score by Skill Domain

| Skill | Best Model | Score | Second | Score | Third | Score |
| --- | --- | --- | --- | --- | --- | --- |
| odoo-schema | GPT-5.2 Codex | 64.7 | Kimi K2.5 | 60.1 | Claude Opus 4.5 | 57.3 |
| purchasing | Claude Opus 4.5 | 75.0 | Kimi K2.5 | 74.0 | GPT-5.2 Codex | 73.0 |
| analytics | Kimi K2.5 | 71.9 | Claude Opus 4.5 | 69.7 | GPT-5.2 Codex | 66.7 |
| formulas | Claude Opus 4.5 | 67.1 | GPT-5.2 Codex | 65.7 | Kimi K2.5 | 65.6 |

Finding 2: Model advantages are domain-dependent. GPT-5.2 Codex leads on schema documentation (+7.4 vs Claude), while Claude Opus 4.5 excels on domain expertise skills (purchasing, formulas). This suggests different model strengths in technical specification versus domain reasoning.

Dimension Analysis

Table 7: Mean Dimension Scores for Skill Files (0-20 scale)

| Dimension | Claude Opus 4.5 | GPT-5.2 Codex | Kimi K2.5 | Overall μ |
| --- | --- | --- | --- | --- |
| D1: Knowledge Delta | 13.7 | 13.7 | 13.8 | 13.7 |
| D2: Specificity | 15.6 | 15.7 | 15.1 | 15.5 |
| D3: Anti-Patterns | 10.4 | 9.9 | 10.0 | 10.1 |
| D4: Structure | 14.2 | 14.9 | 15.2 | 14.8 |
| D5: Tailoring | 13.3 | 13.2 | 13.9 | 13.5 |

Table 8: Mean Dimension Scores for AGENTS.md (0-20 scale)

| Dimension | Claude Opus 4.5 | GPT-5.2 Codex | Kimi K2.5 | Overall μ |
| --- | --- | --- | --- | --- |
| D1: Project Specificity | 16.9 | 15.1 | 17.5 | 16.5 |
| D2: Command Accuracy | 14.9 | 15.0 | 14.5 | 14.8 |
| D3: Safety Boundaries | 15.4 | 13.7 | 15.1 | 14.7 |
| D4: Code Style | 15.8 | 13.8 | 15.5 | 15.0 |
| D5: Skill Catalog | 16.0 | 10.8 | 13.4 | 13.4 |

Finding 3: D3 (Anti-Patterns & Safety) is the weakest dimension across all models for skill files (μ=10.1), falling into the "adequate" tier. This represents a systematic capability gap.

Finding 4: The largest inter-model variance occurs on AGENTS.md D5 (Skill Catalog & Routing), where Claude Opus 4.5 (16.0) outperforms GPT-5.2 Codex (10.8) by 5.2 points. GPT-5.2 Codex frequently fails to generate complete skill catalogs with routing descriptions.

Consistency Analysis

Table 9: Output Consistency (Standard Deviation)

| Model | σ (Overall) | σ (Skills) | σ (AGENTS.md) |
| --- | --- | --- | --- |
| GPT-5.2 Codex | 5.31 | 5.43 | 5.02 |
| Kimi K2.5 | 7.68 | 7.19 | 6.18 |
| Claude Opus 4.5 | 8.74 | 8.03 | 3.83 |

Finding 5: GPT-5.2 Codex produces the most consistent outputs overall (σ=5.31), while Claude Opus 4.5 shows the highest variance on skills but lowest on AGENTS.md (σ=3.83). This suggests Claude has internalized a stable template for project documentation while exhibiting more variation on domain-specific content.

Score Distribution Analysis

Table 10: Score Distribution by Model

| Score Range | Claude Opus 4.5 | GPT-5.2 Codex | Kimi K2.5 |
| --- | --- | --- | --- |
| ≤50 | 2 (4%) | 0 (0%) | 0 (0%) |
| 51-60 | 5 (10%) | 2 (4%) | 6 (12%) |
| 61-70 | 17 (34%) | 32 (64%) | 22 (44%) |
| 71-80 | 21 (42%) | 15 (30%) | 18 (36%) |
| 81-90 | 5 (10%) | 1 (2%) | 4 (8%) |

Finding 6: GPT-5.2 Codex clusters tightly in the 61-70 range (64% of outputs), while Claude Opus 4.5 shows a bimodal tendency with more outputs in both tails (4% below 50, 10% above 80). For risk-averse deployments, GPT-5.2 Codex offers more predictable quality floors.


Conclusion

Interpretation of Results

Our findings suggest that current frontier models achieve comparable aggregate performance on agentic bootstrapping tasks, but with distinct capability profiles:

Claude Opus 4.5 excels at holistic project documentation, consistently generating comprehensive AGENTS.md files with complete skill catalogs and safety boundaries. However, it shows higher variance on individual skill files and occasionally produces outlier low scores.

GPT-5.2 Codex provides the most consistent outputs with the narrowest score distribution. It excels at technical specification tasks (schema documentation, command accuracy) but underperforms on integrative documentation requiring cross-file awareness.

Kimi K2.5 shows balanced performance with particular strength in structural organization (D4) and user intent alignment (D5). It achieves the highest scores on skill files overall, though by a narrow margin.

Our Takeaways

| Use Case | Recommended Model | Rationale |
| --- | --- | --- |
| Production agentic systems | Claude Opus 4.5 | Best safety documentation, complete skill routing |
| Consistent batch generation | GPT-5.2 Codex | Lowest variance, predictable quality floor |
| Domain expertise capture | Kimi K2.5 or Claude | Higher knowledge delta and tailoring scores |

Limitations

  1. Single task domain: Results may not generalize beyond supply chain analytics. Future work should try other use-case environments with tailored BOOT.md files.
  2. Judge model bias: GPT-4o as judge may systematically favor certain output styles.
  3. Sample size: 10 trials per model limits statistical power for small effects.

There are many other limitations, but as this is a preliminary analysis, we will stick to the most glaring ones for now.

Thank you!

Thanks for reading! Feel free to join our Slack to ask questions or share feedback, and be sure to check out the Nightshift repo!


Appendix A: Judge Prompts

Complete prompts for skill and AGENTS.md evaluation are documented in:

  • src/cli/cmd/eval/boot-agent.ts (lines 116-176, 178-238)

Appendix B: Supplementary Tables

Table B1: Per-Skill Dimension Breakdown (Claude Opus 4.5)

| Skill | D1 | D2 | D3 | D4 | D5 | Total |
| --- | --- | --- | --- | --- | --- | --- |
| odoo-schema | 11.9 | 14.0 | 8.8 | 12.5 | 10.1 | 57.3 |
| purchasing | 15.7 | 15.9 | 12.5 | 15.6 | 15.3 | 75.0 |
| analytics | 13.3 | 17.8 | 9.8 | 14.6 | 14.2 | 69.7 |
| formulas | 13.9 | 14.5 | 10.6 | 14.3 | 13.8 | 67.1 |

Table B2: Per-Skill Dimension Breakdown (GPT-5.2 Codex)

| Skill | D1 | D2 | D3 | D4 | D5 | Total |
| --- | --- | --- | --- | --- | --- | --- |
| odoo-schema | 12.9 | 14.5 | 8.6 | 15.2 | 13.5 | 64.7 |
| purchasing | 15.3 | 16.0 | 11.7 | 15.6 | 14.4 | 73.0 |
| analytics | 12.6 | 17.5 | 9.2 | 14.4 | 13.0 | 66.7 |
| formulas | 14.0 | 14.8 | 10.3 | 14.5 | 12.1 | 65.7 |

Table B3: Per-Skill Dimension Breakdown (Kimi K2.5)

| Skill | D1 | D2 | D3 | D4 | D5 | Total |
| --- | --- | --- | --- | --- | --- | --- |
| odoo-schema | 12.6 | 12.9 | 7.9 | 14.1 | 12.6 | 60.1 |
| purchasing | 15.3 | 15.4 | 12.1 | 16.3 | 14.9 | 74.0 |
| analytics | 13.8 | 18.0 | 9.9 | 15.6 | 14.6 | 71.9 |
| formulas | 13.5 | 14.1 | 10.0 | 14.6 | 13.4 | 65.6 |

Appendix C: Skill File Grading Rubric

Each dimension is scored 0-20 points.

D1: Knowledge Delta (0-20)

  • 16-20: Contains genuine expert knowledge the model doesn't already possess: decision trees, non-obvious trade-offs, domain-specific heuristics, and hard-won patterns that can't be derived from first principles.
  • 11-15: Mix of expert knowledge and commonly known information; some genuinely useful insights buried among filler.
  • 6-10: Mostly restates what models already know with occasional useful specifics.
  • 0-5: Pure redundancy: "what is X" explanations, generic best practices, nothing an LLM couldn't generate unprompted.

D2: Specificity & Actionability (0-20)

  • 16-20: Every instruction is concrete and executable: exact commands are copy-pasteable, code examples compile, file paths are real, and an agent can follow them without interpretation or guesswork.
  • 11-15: Most instructions are actionable but some steps require the agent to fill in gaps or interpret vague phrasing.
  • 6-10: Mixes actionable commands with vague directives like "ensure quality" or "follow best practices."
  • 0-5: Almost entirely abstract prose with no executable instructions.

D3: Anti-Patterns & Safety Boundaries (0-20)

  • 16-20: Defines explicit NEVER/ALWAYS/ASK-FIRST rules grounded in hard-won experience, each with a concrete reason: an agent reading this would avoid the specific mistakes that cause real damage in this domain.
  • 11-15: Has some specific anti-patterns but missing reasoning, or covers obvious cases while missing the non-obvious dangerous ones.
  • 6-10: Generic warnings like "be careful" without naming specific failure modes.
  • 0-5: No anti-patterns or safety boundaries mentioned at all.

D4: Structure & Discoverability (0-20)

  • 16-20: The description field alone tells an agent exactly what this skill does and when to activate it; content is organized with progressive disclosure (critical decisions first, reference material later) and stays concise enough to fit in context without truncation.
  • 11-15: Reasonable structure but the description is too vague to trigger correctly, or key information isn't surfaced early enough.
  • 6-10: Flat structure with no clear hierarchy; an agent would have to read the entire document to find what it needs.
  • 0-5: No description, no structure, wall of text.

D5: Tailoring to User Intent (0-20)

  • 16-20: Every section is customized to the user's stated purpose, chosen tech stack, and actual project structure: the skill reflects genuine understanding of the bootstrapped environment.
  • 11-15: Partially adapted to the user intent but contains generic sections that could apply to any project.
  • 6-10: Superficially references the user intent but the substance is generic.
  • 0-5: Shows no adaptation to the specific user intent, installed packages, or library structure.

Appendix D: AGENTS.md Grading Rubric

Each dimension is scored 0-20 points.

D1: Project Specificity (0-20)

  • 16-20: Every section is grounded in the actual workspace: real package names, real file paths, real architecture decisions. The file could only belong to this project and no other.
  • 11-15: References real project details but pads with generic advice that could apply anywhere.
  • 6-10: Mostly generic with a few project-specific references sprinkled in.
  • 0-5: A template with no project-specific content.

D2: Command Accuracy (0-20)

  • 16-20: Build, test, lint, and run commands are exact, copy-pasteable, and verified to work in this environment: an agent can execute them without modification or interpretation.
  • 11-15: Most commands look correct but some may have wrong flags, missing arguments, or reference tools that weren't installed.
  • 6-10: Commands are present but aspirational ("run the tests") rather than exact.
  • 0-5: No commands section, or commands are clearly wrong.

D3: Safety Boundaries (0-20)

  • 16-20: Three clear tiers (ALWAYS do / ASK FIRST / NEVER do) name specific commands, files, and operations that matter for this project: an agent knows exactly where the guardrails are.
  • 11-15: Has safety boundaries but missing a tier, or the rules are somewhat generic rather than project-specific.
  • 6-10: Generic platitudes like "don't delete important files" without naming specifics.
  • 0-5: No safety boundaries mentioned.

D4: Code Style Concreteness (0-20)

  • 16-20: Formatting rules and design patterns are shown through short code examples: an agent can pattern-match against them without interpreting prose descriptions of style.
  • 11-15: Some code examples but also relies on prose descriptions for key style rules.
  • 6-10: Mostly prose ("use clean architecture," "follow SOLID") with minimal or no code examples.
  • 0-5: No style section, or purely abstract descriptions.

D5: Skill Catalog & Routing (0-20)

  • 16-20: Every available skill is listed with its exact name, file path, and a one-line description of when to use it: an agent can scan the list and know which skill to invoke for any given task without guessing or searching the filesystem.
  • 11-15: Skills are listed but missing file paths, or descriptions are too vague to enable routing decisions.
  • 6-10: Skills are mentioned by name only without paths or descriptions of when to use them.
  • 0-5: Skills are not mentioned at all in the AGENTS.md.