Full-Stack Test with Claude Code + Sub-agents: The Decisive Factor in the 'Process Power' of Four Domestic Models

Background and Methods

Evaluating AI programming models on real full-stack tasks is the only way to see whether they can actually "deliver." This test used a stack of Astro + TypeScript + Tailwind + WordPress Headless (REST) to deliver a company website and a standalone blog (list, detail, search, pagination, SEO). Each model received standardized prompts and constraints (strict TS, pnpm, responsiveness and performance targets), and the entire front-end and back-end pipeline ran under the automatic coordination of Claude Code's sub-agents, with latency, repair cost, and billing data recorded throughout.
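The "strict TS" constraint can be pinned in project configuration rather than restated in every prompt. A minimal sketch (Astro ships a `strict` tsconfig preset; the extra options here are illustrative, since the exact settings used in the test were not published):

```json
{
  "extends": "astro/tsconfigs/strict",
  "compilerOptions": {
    "noUncheckedIndexedAccess": true,
    "noFallthroughCasesInSwitch": true
  }
}
```

Baking the constraint into `tsconfig.json` means any agent that loosens types fails the build instead of silently drifting.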

Workflow and Sub-agent Settings

Five sub-agents were used: Project Coordination, UI/UX, Front-end, WordPress Integration, and Blog Development. Project Coordination owned milestones, quality gates, and cross-agent handoffs; UI/UX provided the design system and component specifications; Front-end implemented the page skeletons and interactions; WP Integration handled the API, categories/search, caching, and error handling; Blog Development completed dynamic routing, rendering, and SEO. The run used auto-accept ("dangerous") mode, and every bug went through a closed loop of "error reproduction → expected behavior → minimal reproduction → log snippet" to keep long chains from drifting out of control.
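Claude Code sub-agents are defined as Markdown files with YAML frontmatter under `.claude/agents/`. The article does not publish its agent definitions, so the following is a hypothetical sketch of what the WordPress Integration agent might look like:

```markdown
---
name: wp-integration
description: Owns the WordPress REST layer - endpoints, categories/search, caching, error handling. Use for any task that touches wp-json.
tools: Read, Edit, Bash
---

You are the WordPress Integration agent. Before writing code, verify the
REST root responds and pretty permalinks are enabled. For every bug, report:
error reproduction, expected behavior, minimal reproduction, log snippet.
Hand results back to the coordinator only after the quality gate passes.
```

Keeping the reporting format in the agent's system prompt is what turns the "closed loop" from a convention into an enforced behavior.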

Test Results: Speed/Completeness/Cost

  • Completeness and Stability: GLM-4.5 was the most stable, with good process adherence and a smooth end-to-end closed loop; DeepSeek V3.1 was robust, with occasional CSS/routing flaws that were usable after repair; Kimi K2 produced good interaction and aesthetics, but build and repair times were high; Qwen3-coder drifted from the process easily (package management, agent calls) and needed strong constraints.

  • Speed: Qwen ≈ 16–20 minutes to a working sample; GLM ≈ 1.5 hours to delivery; DeepSeek ≈ 3 hours; Kimi ≈ 6 hours. Fast does not equal worry-free: rework can swallow the nominal speed advantage.

  • Aesthetics and Consistency: Kimi > GLM > DeepSeek > Qwen. Component consistency and responsive details affect the first-screen look and feel and conversion.

  • Cost and Rework: Summing the full bill plus rework costs, GLM came out best, followed by DeepSeek; whether Kimi or Qwen makes sense depends on how much rework you can tolerate from being slow (Kimi) or fast (Qwen).

In a nutshell: GLM gets the job done completely; Kimi gets the job done beautifully; Qwen gets the job done quickly; DeepSeek gets the job done stably.

Common Pitfalls and Countermeasures

  • WordPress is not "install and go." If the initial setup and permalinks (/%postname%/) are not in place, the front-end blog will show "no content." Countermeasure: make installation status, permalinks, CORS, REST route connectivity, and sample articles a pre-flight checklist.
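Parts of that pre-flight checklist can be automated. A sketch in TypeScript (the helper names are illustrative, and `cms.example.com` in the usage note stands in for the real WordPress host):

```typescript
// Hypothetical pre-flight helpers for a headless WordPress backend.
// restEndpoints() lists the URLs to probe before the front-end build
// starts; isPrettyPermalink() catches the classic "plain permalinks"
// misconfiguration that leaves the blog showing "no content."

export function restEndpoints(base: string): string[] {
  const root = base.replace(/\/+$/, ""); // drop trailing slashes
  return [
    `${root}/wp-json/`,                       // REST root: proves the API is routable
    `${root}/wp-json/wp/v2/posts?per_page=1`, // at least one sample article exists
    `${root}/wp-json/wp/v2/categories`,       // categories needed for list/search pages
  ];
}

export function isPrettyPermalink(structure: string): boolean {
  // The test stack expects /%postname%/ style permalinks; a "plain"
  // structure (?p=123) usually signals an unfinished WP install.
  return structure.includes("%postname%");
}

// Probe each endpoint and collect failures (needs network access).
export async function precheck(base: string): Promise<string[]> {
  const failures: string[] = [];
  for (const url of restEndpoints(base)) {
    try {
      const res = await fetch(url);
      if (!res.ok) failures.push(`${url} -> HTTP ${res.status}`);
    } catch (err) {
      failures.push(`${url} -> ${String(err)}`); // DNS/CORS/network error
    }
  }
  return failures;
}
```

Running `precheck("https://cms.example.com")` as the agent's first task turns "no content" from a late mystery into an immediate, named failure.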

  • API debugging tends to sprawl. Countermeasure: standardize the "debugging script": reproduce the error → state the expected behavior → provide a minimal reproduction → attach logs → verify with a regression check; and abstract the data access layer into a single point so loading/error states are handled uniformly.

  • Package management and command deviation. Countermeasure: Solidify pnpm in both the prompt and the package.json script; integrate CI pre-checks into the first step of the agent task.
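pnpm can also be enforced at the repository level so an agent that reaches for npm or yarn fails fast. A common pattern (pnpm's documentation recommends `only-allow`; the version in `packageManager` is a placeholder):

```json
{
  "packageManager": "pnpm@9.0.0",
  "scripts": {
    "preinstall": "npx only-allow pnpm"
  }
}
```

The `preinstall` hook aborts any `npm install`/`yarn install`, and the `packageManager` field lets Corepack pin the exact pnpm version, so the constraint holds even when the prompt is ignored.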

Engineering Insights

  • Organize first, then generate. Process decomposition and role division determine code quality and rhythm stability.
  • Swimlane delivery: Information architecture → design system → page skeleton → data access → SEO/performance, with supporting quality gates for phased acceptance.
  • Anchor on logs. Let the agent iterate around logs and reproduction paths to reduce “context loss.”
  • Standardize “repair wording,” and the agent can handle common faults in batches. Essentially, developers are shifting from “coders” to “AI workflow architects,” with the boundary being the workflow, not the single-point capability of the model.
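One hypothetical way to standardize the "repair wording" is a fixed template that every bug handed to an agent must follow, mirroring the closed loop described earlier:

```text
BUG: <one-line symptom>
REPRO: <exact command or click path that triggers it>
EXPECTED: <what a correct run looks like>
MINIMAL CASE: <smallest route/component/input that still fails>
LOGS: <relevant stack trace or console/network snippet>
DONE WHEN: <regression check that must pass>
```

A fixed shape lets the agent batch-process similar faults and gives the coordinator an objective "done" criterion per fix.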

Selection Suggestions and Conclusion

  • Comprehensive delivery/peace of mind first: GLM-4.5.
  • Visual presentation first: Kimi K2 (reserve time for building and repairs).
  • Extreme speed first: Qwen3-coder (must strengthen process and dependency constraints).
  • Robust and tunable: DeepSeek V3.1.

Conclusion: The single-point skill of "writing code" is no longer decisive; process design and collaborative orchestration set the ceiling. Once the process is clear and the quality gates are defined, swapping the model is just changing the knife, not changing the chef.