Evaluating Large Language Models on Agentic Workspace Bootstrapping
TL;DR
We want Agents to configure and manage their own environments through Agent Routines. One routine that is particularly helpful is the Boot Agent Routine: given a BOOT.md file, we let an Agent configure its packages, write initial library code, and create the AGENTS.md and SKILL.md files that will be used in downstream tasks. We want the agent to do this well and to align with the user's use case.
We present an evaluation of three frontier LLMs on this task.
We use both a deterministic check and an LLM-as-judge methodology with calibrated rubrics across five quality dimensions to evaluate Claude Opus 4.5, GPT-5.2 Codex, and Kimi K2.5 across 30 independent trials.
Our results indicate that while the models achieve comparable aggregate performance (let's go, open-source Kimi!), notable differences emerge in specific capabilities: Claude Opus 4.5 demonstrates superior performance on holistic project documentation (+10.6 points on AGENTS.md), while Kimi K2.5 shows advantages in structured knowledge representation (the quality of the SKILL.md files).
All models exhibited a consistent weakness in generating safety-critical guidelines.
Why do this?
The Nightshift project was started to help us ship coding agents to our customers: business users, people who aren't technical and won't even be vibe coding. They simply want an "engineer" or "IT guy" in their computer who can help them automate work tasks. That means we need to set the Agent up for success. We want to support the use case of an Agent (through Agent Routines) managing its workspace for the duration of its life: writing library code, updating SKILLS.md based on session history, updating its AGENTS.md, and scheduling recurring work.
When you run `nightshift install --prefix ~/.agent`, the CLI takes you through a "boot" sequence. This uses a new file we're proposing, BOOT.md, which lets the agent know what the user wants to accomplish; it's then up to the agent to configure all of the above.
You might wonder, "can't we just use something like skills.sh?", and the answer is we certainly could, but then we would sacrifice flexibility for our users by requiring them to manually install the SKILLS for the agent based on their use case. Instead, we have found that it's a much better experience for users to declaratively tell Nightshift what they want the agent to be used for and let the agent bootstrap itself. Also, most SKILLS or AGENTS files are vibe coded anyway.
Let’s formalize this a bit.
Introduction
The emergence of fully declarative agentic AI systems has created new requirements for workspace initialization and configuration.
Unlike traditional software scaffolding, agentic workspaces require rich semantic documentation that enables downstream AI systems to understand project context, adhere to safety boundaries, and invoke appropriate capabilities.
Our interpretation of the current "boot agent" paradigm addresses this challenge by tasking an LLM with generating the foundational layer for a new workspace. This includes:
- Library Code: Python code to support the use case
- Python Packages: Python packages to support the use case
- Skill Files: Domain-specific knowledge encapsulated in structured markdown
- Agent Configuration: Project-level instructions including commands, safety boundaries, and capability routing
We have found that the quality of these artifacts directly impacts the effectiveness and safety of subsequent agentic operations within the workspace.
Questions We Wanted to Answer
1: How do frontier LLMs (closed and open source) compare on agentic workspace bootstrapping tasks?
2: What quality dimensions differentiate model performance, and where do systematic weaknesses exist?
3: What is the variance in output quality across repeated trials, and how does this affect deployment reliability?
Before we get started, you can run this eval yourself! Check out https://docs.nightshift.sh/eval for more info.
Task Definition
The boot agent task presents the model with a BOOT.md file containing:
- User intent statement (natural language description of project purpose)
- Suggested skill domains (semantic categories for knowledge encapsulation)
- Technical context (target platform, data sources, authentication methods)
The model must autonomously:
- Generate domain-specific SKILL.md files
- Generate one project-level AGENTS.md configuration file
- Install appropriate dependencies (evaluated but not scored)
Test Configuration
For this test we chose an environment target we have worked extensively with in the past: procurement/purchasing analysis with the Odoo ERP (https://www.odoo.com/).
We worked with procurement experts to benchmark different Agent configurations until we received expert approval on a series of tasks.
We thus had a clear understanding of the skills and direction an agent would need to succeed in this environment. For skills, the agent must have a deep understanding of analytics, purchasing logic, Odoo schema context, and common formulas. The AGENTS.md file must clearly define skill locations so that search is seamless, and must provide correct commands for navigating the agent's environment and tool set.
We define the following parameters for the test BOOT.md file and the evaluation setup.
| Parameter | Value |
|---|---|
| User Intent | "Supply chain analytics for Odoo ERP data stored as parquet files on Google Cloud Storage" |
| Skill Domains | Schema reference, purchasing analysis, query patterns, calculation formulas |
| Target Environment | Python 3.13, DuckDB, GCS with HMAC authentication |
| Files per Trial | 5 (4 skills + 1 AGENTS.md) |
| Trials per Model | 10 |
| Total Evaluations | 150 |
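For illustration, a BOOT.md capturing these parameters might look roughly like the sketch below. This is a hypothetical rendering: the section headings and skill names shown here are assumptions for this example, not a specification of the BOOT.md format.

```markdown
# BOOT.md (hypothetical sketch)

## Intent
Supply chain analytics for Odoo ERP data stored as parquet files on Google Cloud Storage.

## Suggested Skill Domains
- odoo-schema: schema reference for the exported Odoo tables
- purchasing: purchasing analysis workflows
- analytics: common query patterns
- formulas: calculation formulas used in reporting

## Technical Context
- Python 3.13
- DuckDB for querying parquet files
- GCS access via HMAC credentials
```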
We use the following frontier (as of writing) models for this analysis:
| Model | Provider | Context Window |
|---|---|---|
| Claude Opus 4.5 | Anthropic | 200K |
| GPT-5.2 Codex | OpenAI | 128K |
| Kimi K2.5 | Moonshot AI | 128K |
All models were accessed via API with each trial using an isolated workspace with fresh state.
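To ground the task, the library code the boot agent is expected to write in this environment centers on querying the Odoo parquet exports with DuckDB over GCS using HMAC credentials. Below is a minimal sketch of that kind of code; the bucket name, table path, and column names are hypothetical and only illustrate the target stack, not anything a model actually produced.

```python
import os

import duckdb

# Open an in-memory DuckDB connection and enable httpfs so gs:// paths are readable.
con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")

# Register GCS HMAC credentials (key id and secret taken from the environment).
con.execute(f"""
    CREATE SECRET gcs_hmac (
        TYPE gcs,
        KEY_ID '{os.environ['GCS_HMAC_KEY_ID']}',
        SECRET '{os.environ['GCS_HMAC_SECRET']}'
    )
""")

# Example aggregation over a hypothetical purchase_order parquet export.
top_vendors = con.execute("""
    SELECT partner_id,
           COUNT(*)          AS order_count,
           SUM(amount_total) AS total_spend
    FROM read_parquet('gs://example-bucket/odoo/purchase_order/*.parquet')
    GROUP BY partner_id
    ORDER BY total_spend DESC
    LIMIT 10
""").fetchall()
print(top_vendors)
```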
Evaluation Framework
This is a two-phase evaluation: the first phase is deterministic and the second uses an LLM-as-a-judge. The deterministic check ensures that the files exist and are in the right places. For example, all skills must be written to `.opencode/skills/[skill-name]/SKILL.md`, and AGENTS.md must be located in the workspace root.
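As a concrete illustration of phase one, a structural check might look like the sketch below. This is not the actual eval code (which lives in the TypeScript CLI; see Appendix A); the function name and return shape are assumptions, but the paths and skill names match the test configuration.

```python
from pathlib import Path

# Skill names expected from the test BOOT.md (schema, purchasing, analytics, formulas).
EXPECTED_SKILLS = ["odoo-schema", "purchasing", "analytics", "formulas"]


def deterministic_check(workspace: Path) -> list[str]:
    """Return a list of structural failures for a bootstrapped workspace."""
    failures = []
    if not (workspace / "AGENTS.md").is_file():
        failures.append("AGENTS.md missing from workspace root")
    for skill in EXPECTED_SKILLS:
        skill_file = workspace / ".opencode" / "skills" / skill / "SKILL.md"
        if not skill_file.is_file():
            failures.append(f"missing {skill_file.relative_to(workspace)}")
    return failures
```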
We employed GPT-4o as the evaluation judge, selected for its demonstrated calibration on structured assessment tasks. The judge received the generated content along with the original user intent and a detailed rubric.
Each file type was evaluated on five dimensions (0-20 points each, total 0-100). See Appendix C and Appendix D for full scoring criteria.
Table 1: Skill File Dimensions
| Code | Dimension | Description |
|---|---|---|
| D1 | Knowledge Delta | Expert knowledge not derivable from first principles |
| D2 | Specificity & Actionability | Concrete, executable instructions |
| D3 | Anti-Patterns & Safety | Explicit NEVER/ALWAYS/ASK-FIRST rules |
| D4 | Structure & Discoverability | Clear triggers and progressive disclosure |
| D5 | Tailoring to Intent | Customization to user's purpose and stack |
Table 2: AGENTS.md Dimensions
| Code | Dimension | Description |
|---|---|---|
| D1 | Project Specificity | Grounded in actual workspace paths and packages |
| D2 | Command Accuracy | Exact, copy-pasteable commands |
| D3 | Safety Boundaries | Three-tier permission structure |
| D4 | Code Style Concreteness | Style shown through code examples |
| D5 | Skill Catalog & Routing | Complete skill listing with paths and triggers |
The judge was instructed to be "harsh" and reserve scores above 15 for exceptional work, reducing ceiling effects common in LLM evaluation.
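For a feel of phase two, the sketch below shows the shape of an LLM-as-judge call using the OpenAI Python client with a JSON response. It is a simplified stand-in, not our implementation: the real prompts and parsing live in the TypeScript eval command listed in Appendix A, and the prompt text here is abbreviated.

```python
import json

from openai import OpenAI

client = OpenAI()


def judge_skill_file(user_intent: str, rubric: str, skill_markdown: str) -> dict:
    """Score one generated file on five 0-20 dimensions and return parsed scores."""
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": "You are a harsh grader. Reserve scores above 15 for "
                           "exceptional work. Respond with JSON only.",
            },
            {
                "role": "user",
                "content": f"User intent:\n{user_intent}\n\nRubric:\n{rubric}\n\n"
                           f"File to grade:\n{skill_markdown}\n\n"
                           'Return {"D1": int, "D2": int, "D3": int, "D4": int, "D5": int}, '
                           "each 0-20.",
            },
        ],
    )
    scores = json.loads(response.choices[0].message.content)
    scores["total"] = sum(scores[d] for d in ("D1", "D2", "D3", "D4", "D5"))
    return scores
```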
Results
Aggregate Performance
Table 3: Overall Model Performance (All Files)
| Model | μ | Median | σ | Min | Max | n |
|---|---|---|---|---|---|---|
| Claude Opus 4.5 | 69.62 | 71.5 | 8.74 | 47 | 83 | 50 |
| Kimi K2.5 | 69.52 | 68.5 | 7.68 | 54 | 84 | 50 |
| GPT-5.2 Codex | 67.70 | 66.0 | 5.31 | 59 | 82 | 50 |
The three models achieve comparable aggregate performance, with Claude Opus 4.5 and Kimi K2.5 effectively tied (Δ=0.1) and GPT-5.2 Codex trailing by approximately 2 points.
Performance by File Type
Disaggregating by file type reveals meaningful differences obscured by aggregate scores.
Table 4: Skill File Performance (n=40 per model)
| Model | μ | Median | σ |
|---|---|---|---|
| Kimi K2.5 | 67.90 | 68.0 | 7.19 |
| GPT-5.2 Codex | 67.53 | 66.0 | 5.43 |
| Claude Opus 4.5 | 67.28 | 67.0 | 8.03 |
Table 5: AGENTS.md Performance (n=10 per model)
| Model | μ | Median | σ |
|---|---|---|---|
| Claude Opus 4.5 | 79.00 | 80.5 | 3.83 |
| Kimi K2.5 | 76.00 | 77.0 | 6.18 |
| GPT-5.2 Codex | 68.40 | 68.5 | 5.02 |
Finding 1: Claude Opus 4.5's overall advantage derives primarily from AGENTS.md generation, where it outperforms GPT-5.2 Codex by 10.6 points, a substantial effect size. On skill files alone, all models perform within 0.6 points of each other.
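To put that gap in context: using the AGENTS.md standard deviations from Table 5 and treating per-trial scores as independent, the pooled σ is √((3.83² + 5.02²) / 2) ≈ 4.46, so the 10.6-point gap corresponds to a Cohen's d of roughly 10.6 / 4.46 ≈ 2.4. That is a very large standardized effect, though with only 10 trials per model the estimate is imprecise.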
Performance by Skill Domain
Table 6: Mean Score by Skill Domain
| Skill | Best Model | Score | Second | Score | Third | Score |
|---|---|---|---|---|---|---|
| odoo-schema | GPT-5.2 Codex | 64.7 | Kimi K2.5 | 60.1 | Claude Opus 4.5 | 57.3 |
| purchasing | Claude Opus 4.5 | 75.0 | Kimi K2.5 | 74.0 | GPT-5.2 Codex | 73.0 |
| analytics | Kimi K2.5 | 71.9 | Claude Opus 4.5 | 69.7 | GPT-5.2 Codex | 66.7 |
| formulas | Claude Opus 4.5 | 67.1 | GPT-5.2 Codex | 65.7 | Kimi K2.5 | 65.6 |
Finding 2: Model advantages are domain-dependent. GPT-5.2 Codex leads on schema documentation (+7.4 vs Claude), while Claude Opus 4.5 excels on domain expertise skills (purchasing, formulas). This suggests different model strengths in technical specification versus domain reasoning.
Dimension Analysis
Table 7: Mean Dimension Scores for Skill Files (0-20 scale)
| Dimension | Claude Opus 4.5 | GPT-5.2 Codex | Kimi K2.5 | Overall μ |
|---|---|---|---|---|
| D1: Knowledge Delta | 13.7 | 13.7 | 13.8 | 13.7 |
| D2: Specificity | 15.6 | 15.7 | 15.1 | 15.5 |
| D3: Anti-Patterns | 10.4 | 9.9 | 10.0 | 10.1 |
| D4: Structure | 14.2 | 14.9 | 15.2 | 14.8 |
| D5: Tailoring | 13.3 | 13.2 | 13.9 | 13.5 |
Table 8: Mean Dimension Scores for AGENTS.md (0-20 scale)
| Dimension | Claude Opus 4.5 | GPT-5.2 Codex | Kimi K2.5 | Overall μ |
|---|---|---|---|---|
| D1: Project Specificity | 16.9 | 15.1 | 17.5 | 16.5 |
| D2: Command Accuracy | 14.9 | 15.0 | 14.5 | 14.8 |
| D3: Safety Boundaries | 15.4 | 13.7 | 15.1 | 14.7 |
| D4: Code Style | 15.8 | 13.8 | 15.5 | 15.0 |
| D5: Skill Catalog | 16.0 | 10.8 | 13.4 | 13.4 |
Finding 3: D3 (Anti-Patterns & Safety) is the weakest dimension across all models for skill files (μ=10.1), falling into the "adequate" tier. This represents a systematic capability gap.
Finding 4: The largest inter-model variance occurs on AGENTS.md D5 (Skill Catalog & Routing), where Claude Opus 4.5 (16.0) outperforms GPT-5.2 Codex (10.8) by 5.2 points. GPT-5.2 Codex frequently fails to generate complete skill catalogs with routing descriptions.
Consistency Analysis
Table 9: Output Consistency (Standard Deviation)
| Model | σ (Overall) | σ (Skills) | σ (AGENTS.md) |
|---|---|---|---|
| GPT-5.2 Codex | 5.31 | 5.43 | 5.02 |
| Kimi K2.5 | 7.68 | 7.19 | 6.18 |
| Claude Opus 4.5 | 8.74 | 8.03 | 3.83 |
Finding 5: GPT-5.2 Codex produces the most consistent outputs overall (σ=5.31), while Claude Opus 4.5 shows the highest variance on skills but lowest on AGENTS.md (σ=3.83). This suggests Claude has internalized a stable template for project documentation while exhibiting more variation on domain-specific content.
Score Distribution Analysis
Table 10: Score Distribution by Model
| Score Range | Claude Opus 4.5 | GPT-5.2 Codex | Kimi K2.5 |
|---|---|---|---|
| ≤50 | 2 (4%) | 0 (0%) | 0 (0%) |
| 51-60 | 5 (10%) | 2 (4%) | 6 (12%) |
| 61-70 | 17 (34%) | 32 (64%) | 22 (44%) |
| 71-80 | 21 (42%) | 15 (30%) | 18 (36%) |
| 81-90 | 5 (10%) | 1 (2%) | 4 (8%) |
Finding 6: GPT-5.2 Codex clusters tightly in the 61-70 range (64% of outputs), while Claude Opus 4.5 shows a bimodal tendency with more outputs in both tails (4% scoring ≤50, 10% above 80). For risk-averse deployments, GPT-5.2 Codex offers more predictable quality floors.
Conclusion
Interpretation of Results
Our findings suggest that current frontier models achieve comparable aggregate performance on agentic bootstrapping tasks, but with distinct capability profiles:
Claude Opus 4.5 excels at holistic project documentation, consistently generating comprehensive AGENTS.md files with complete skill catalogs and safety boundaries. However, it shows higher variance on individual skill files and occasionally produces outlier low scores.
GPT-5.2 Codex provides the most consistent outputs with the narrowest score distribution. It excels at technical specification tasks (schema documentation, command accuracy) but underperforms on integrative documentation requiring cross-file awareness.
Kimi K2.5 shows balanced performance with particular strength in structural organization (D4) and user intent alignment (D5). It achieves the highest scores on skill files overall, though by a narrow margin.
Our Takeaways
| Use Case | Recommended Model | Rationale |
|---|---|---|
| Production agentic systems | Claude Opus 4.5 | Best safety documentation, complete skill routing |
| Consistent batch generation | GPT-5.2 Codex | Lowest variance, predictable quality floor |
| Domain expertise capture | Kimi K2.5 or Claude | Higher knowledge delta and tailoring scores |
Limitations
- Single task domain: Results may not generalize beyond supply chain analytics. Future work should cover other use-case environments with tailored BOOT.md files.
- Judge model bias: GPT-4o as judge may systematically favor certain output styles.
- Sample size: 10 trials per model limits statistical power for small effects.
There are many others, but as this is a preliminary analysis we will stick with some of the most glaring limitations for now.
Thank you!
Thanks for reading, and feel free to join our Slack to ask us any questions/provide feedback, and be sure to check out the Nightshift repo!
Appendix A: Judge Prompts
Complete prompts for skill and AGENTS.md evaluation are documented in:
`src/cli/cmd/eval/boot-agent.ts` (lines 116-176, 178-238)
Appendix B: Supplementary Tables
Table B1: Per-Skill Dimension Breakdown (Claude Opus 4.5)
| Skill | D1 | D2 | D3 | D4 | D5 | Total |
|---|---|---|---|---|---|---|
| odoo-schema | 11.9 | 14.0 | 8.8 | 12.5 | 10.1 | 57.3 |
| purchasing | 15.7 | 15.9 | 12.5 | 15.6 | 15.3 | 75.0 |
| analytics | 13.3 | 17.8 | 9.8 | 14.6 | 14.2 | 69.7 |
| formulas | 13.9 | 14.5 | 10.6 | 14.3 | 13.8 | 67.1 |
Table B2: Per-Skill Dimension Breakdown (GPT-5.2 Codex)
| Skill | D1 | D2 | D3 | D4 | D5 | Total |
|---|---|---|---|---|---|---|
| odoo-schema | 12.9 | 14.5 | 8.6 | 15.2 | 13.5 | 64.7 |
| purchasing | 15.3 | 16.0 | 11.7 | 15.6 | 14.4 | 73.0 |
| analytics | 12.6 | 17.5 | 9.2 | 14.4 | 13.0 | 66.7 |
| formulas | 14.0 | 14.8 | 10.3 | 14.5 | 12.1 | 65.7 |
Table B3: Per-Skill Dimension Breakdown (Kimi K2.5)
| Skill | D1 | D2 | D3 | D4 | D5 | Total |
|---|---|---|---|---|---|---|
| odoo-schema | 12.6 | 12.9 | 7.9 | 14.1 | 12.6 | 60.1 |
| purchasing | 15.3 | 15.4 | 12.1 | 16.3 | 14.9 | 74.0 |
| analytics | 13.8 | 18.0 | 9.9 | 15.6 | 14.6 | 71.9 |
| formulas | 13.5 | 14.1 | 10.0 | 14.6 | 13.4 | 65.6 |
Appendix C: Skill File Grading Rubric
Each dimension is scored 0-20 points.
D1: Knowledge Delta (0-20)
- 16-20: Contains genuine expert knowledge the model doesn't already possess: decision trees, non-obvious trade-offs, domain-specific heuristics, and hard-won patterns that can't be derived from first principles.
- 11-15: Mix of expert knowledge and commonly known information; some genuinely useful insights buried among filler.
- 6-10: Mostly restates what models already know with occasional useful specifics.
- 0-5: Pure redundancy: "what is X" explanations, generic best practices, nothing an LLM couldn't generate unprompted.
D2: Specificity & Actionability (0-20)
- 16-20: Every instruction is concrete and executable: exact commands are copy-pasteable, code examples compile, file paths are real, and an agent can follow them without interpretation or guesswork.
- 11-15: Most instructions are actionable but some steps require the agent to fill in gaps or interpret vague phrasing.
- 6-10: Mixes actionable commands with vague directives like "ensure quality" or "follow best practices."
- 0-5: Almost entirely abstract prose with no executable instructions.
D3: Anti-Patterns & Safety Boundaries (0-20)
- 16-20: Defines explicit NEVER/ALWAYS/ASK-FIRST rules grounded in hard-won experience, each with a concrete reason: an agent reading this would avoid the specific mistakes that cause real damage in this domain.
- 11-15: Has some specific anti-patterns but missing reasoning, or covers obvious cases while missing the non-obvious dangerous ones.
- 6-10: Generic warnings like "be careful" without naming specific failure modes.
- 0-5: No anti-patterns or safety boundaries mentioned at all.
D4: Structure & Discoverability (0-20)
- 16-20: The description field alone tells an agent exactly what this skill does and when to activate it; content is organized with progressive disclosure (critical decisions first, reference material later) and stays concise enough to fit in context without truncation.
- 11-15: Reasonable structure but the description is too vague to trigger correctly, or key information isn't surfaced early enough.
- 6-10: Flat structure with no clear hierarchy; an agent would have to read the entire document to find what it needs.
- 0-5: No description, no structure, wall of text.
D5: Tailoring to User Intent (0-20)
- 16-20: Every section is customized to the user's stated purpose, chosen tech stack, and actual project structure: the skill reflects genuine understanding of the bootstrapped environment.
- 11-15: Partially adapted to the user intent but contains generic sections that could apply to any project.
- 6-10: Superficially references the user intent but the substance is generic.
- 0-5: Shows no adaptation to the specific user intent, installed packages, or library structure.
Appendix D: AGENTS.md Grading Rubric
Each dimension is scored 0-20 points.
D1: Project Specificity (0-20)
- 16-20: Every section is grounded in the actual workspace: real package names, real file paths, real architecture decisions. The file could only belong to this project and no other.
- 11-15: References real project details but pads with generic advice that could apply anywhere.
- 6-10: Mostly generic with a few project-specific references sprinkled in.
- 0-5: A template with no project-specific content.
D2: Command Accuracy (0-20)
- 16-20: Build, test, lint, and run commands are exact, copy-pasteable, and verified to work in this environment: an agent can execute them without modification or interpretation.
- 11-15: Most commands look correct but some may have wrong flags, missing arguments, or reference tools that weren't installed.
- 6-10: Commands are present but aspirational ("run the tests") rather than exact.
- 0-5: No commands section, or commands are clearly wrong.
D3: Safety Boundaries (0-20)
- 16-20: Three clear tiers (ALWAYS do / ASK FIRST / NEVER do) name specific commands, files, and operations that matter for this project: an agent knows exactly where the guardrails are.
- 11-15: Has safety boundaries but missing a tier, or the rules are somewhat generic rather than project-specific.
- 6-10: Generic platitudes like "don't delete important files" without naming specifics.
- 0-5: No safety boundaries mentioned.
D4: Code Style Concreteness (0-20)
- 16-20: Formatting rules and design patterns are shown through short code examples: an agent can pattern-match against them without interpreting prose descriptions of style.
- 11-15: Some code examples but also relies on prose descriptions for key style rules.
- 6-10: Mostly prose ("use clean architecture," "follow SOLID") with minimal or no code examples.
- 0-5: No style section, or purely abstract descriptions.
D5: Skill Catalog & Routing (0-20)
- 16-20: Every available skill is listed with its exact name, file path, and a one-line description of when to use it: an agent can scan the list and know which skill to invoke for any given task without guessing or searching the filesystem.
- 11-15: Skills are listed but missing file paths, or descriptions are too vague to enable routing decisions.
- 6-10: Skills are mentioned by name only without paths or descriptions of when to use them.
- 0-5: Skills are not mentioned at all in the AGENTS.md.
