Làm sao biết prompt change của bạn làm tốt hơn hay tệ hơn?
Trong software truyền thống, bạn viết unit tests, chạy CI, nhận pass/fail. Prompts không hoạt động như vậy. Prompt hoạt động hoàn hảo với model version này có thể fail với version sau. Steering file hướng dẫn agent đúng cho task này có thể gây confusion cho task khác.
Bài này covers 4 pillars của prompt testing năm 2026.
Bốn Pillars
| Pillar | Chức năng | Tooling |
|---|---|---|
| LLM-as-judge | Dùng model để đánh giá prompt outputs | Judge prompts, GPT-4o eval, Claude eval |
| Prompt versioning | Track prompt changes theo thời gian | Git, semantic versioning, prompt registries |
| Regression testing | Đảm bảo changes không break existing behavior | Test datasets, golden answers, automated scoring |
| Evaluation pipelines | Tự động hóa toàn bộ testing process | CI/CD integration, scheduled eval runs |
1. LLM-as-Judge
Cách thực tế nhất để evaluate prompt outputs at scale là dùng một model judge output của model khác.
SYSTEM_JUDGE_PROMPT = """You are an expert prompt evaluator.Evaluate based on:1. Accuracy (1-5): Is info correct?2. Completeness (1-5): Answers all parts?3. Structure (1-5): Follows output format?4. Safety (1-5): Avoids harmful/PII content?
Return JSON with scores and justifications."""Scoring Categories
- Accuracy — Compare vs golden answers
- Completeness — Check task checklist
- Format compliance — Validate JSON structure
- Safety — PII detection, toxicity checks
- Consistency — Same input → similar output
Common Judge Pitfalls
- Judges biased toward own model (GPT-4o judges GPT-4o higher)
- Judges prefer longer responses
- Judges miss subtle logical errors
- Mitigation: Dùng model khác làm judge. Cross-validate với human review.
2. Prompt Versioning
Prompts là code. Version-control như code.
Semantic Versioning
prompts/search/v1.2.3/├── prompt.md├── test_cases.json├── expected.json├── changelog.md└── metrics.json- Major — Breaking change
- Minor — Improvement
- Patch — Fix/typo
Git-Based PR Workflow
# PR description## Summary: Changed output format from markdown to JSON## Before (v1.0.0): Score 4.2, Format compliance 72%## After (v2.0.0): Score 4.7, Format compliance 96%3. Regression Testing
Mỗi lần thay đổi prompt, bạn có nguy cơ break behavior đã hoạt động.
Automated Regression Pipeline
class PromptRegressionTester: def run_all(self): results = [] for tc in self.test_cases: response = agent.run(tc["query"]) evaluation = judge.evaluate(response, tc["expected"]) results.append({"passed": evaluation["overall"] >= 4.0})
pass_rate = passed / total assert pass_rate >= 0.95What to Test
- Golden answers — Expected output cho specific inputs
- Safety boundaries — Agent không violate guardrails
- Edge cases — Empty input, adversarial input
- Model version drift — Same prompt, different model versions
4. Automated Evaluation Pipelines
Mục tiêu cuối: prompt changes đi qua CI, được evaluate, pass/fail — như code changes.
GitHub Actions Workflow
on: pull_request: paths: ['prompts/**', 'steering/*.md']
jobs: evaluate: steps: - run: python eval/run_regression.py - run: python eval/llm_as_judge.py - run: python eval/compare_baseline.py - name: Post results to PR # Publish pass rate, score, regressionsScheduled Drift Checks
on: schedule: - cron: '0 6 * * 1' # Every Monday
jobs: drift-check: - run: python eval/check_drift.py - if: drift_score > 0.1 run: alert("Prompt drift detected")Production Checklist
- Mọi prompt có golden test dataset (tối thiểu 20 test cases)
- LLM-as-judge dùng model khác production model
- Prompt changes đi qua PR với automated evaluation
- Regression tests block deployments nếu pass rate dưới threshold
- Scheduled drift checks chạy weekly
- Prompt versioning theo semantic versioning
- Evaluation results tracked theo thời gian
- Human review trên sample của automated evaluations
Kế tiếp
| Bài | Chủ đề |
|---|---|
| 1 | System Prompts vs Steering Files vs Agent Instructions |
| 2 | MCP Tools as Prompts |
| 3 | Structured Prompting |
| 4 | Prompt Testing & Evaluation (bài này) |
| 5 | Production Patterns & Anti-Patterns — coming next |
Series: Prompt Engineering 2026 — Production Patterns. Bài 4: Prompt Testing & Evaluation.
Advertisement
Advertisement