---
name: skill-optimizer
description: "GEPA-powered framework for optimizing any Claude skill. Use when asked to: (1) improve a skill's performance, (2) analyze why a skill fails on certain tasks, (3) optimize skill instructions based on execution failures, (4) systematically test and evolve skill quality, (5) create optimized variants of existing skills. Works with skills in ~/.claude/skills/, .claude/skills/, or /mnt/skills/."
---

# Skill Optimizer

A meta-skill for **automatically improving any Claude skill** using GEPA (Genetic-Pareto) reflective text evolution.

## When to Use This Skill

Use this skill when the user asks to:
- "Improve my PPTX skill"
- "Optimize the docx skill"
- "Why does my skill fail on edge cases?"
- "Make my skill better at handling [specific scenario]"
- "Test and improve this skill"
- "Analyze skill performance"

## Quick Reference

### 1. Analyze a Skill

```bash
# Load and analyze any skill
python -c "
from src.core.skill_loader import SkillLoader
loader = SkillLoader()
skill = loader.load('~/.claude/skills/my-skill')
print(f'Skill: {skill.name}')
print(f'Sections: {list(skill.sections.keys())}')
print(f'References: {len(skill.references)}')
print(f'Scripts: {len(skill.scripts)}')
"
```

### 2. Generate Test Cases

```bash
python -m src.generators.test_generator \
    --skill-path ~/.claude/skills/my-skill \
    --output test_cases.yaml \
    --count 15
```

### 3. Run Optimization

```bash
python -m src.core.skill_optimizer optimize \
    --config config.yaml \
    --max-iterations 10
```

---

## Complete Optimization Workflow

When asked to optimize a skill, follow this workflow:

### Step 1: Locate and Load the Skill

```python
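from pathlib import Path

from src.core.skill_loader import SkillLoader

loader = SkillLoader()
skill_name = "my-skill"  # placeholder: name of the skill to optimize

# Standard locations, searched in order
skill_locations = [
    "~/.claude/skills/{name}",      # User skills (Claude Code)
    "./.claude/skills/{name}",       # Project skills
    "/mnt/skills/public/{name}",     # Claude.ai public
    "/mnt/skills/user/{name}",       # Claude.ai user
]

# Pick the first location that exists on disk
candidates = [Path(loc.format(name=skill_name)).expanduser() for loc in skill_locations]
skill_path = next((p for p in candidates if p.is_dir()), None)
if skill_path is None:
    raise FileNotFoundError(f"Skill '{skill_name}' not found in standard locations")

# Load the skill
skill = loader.load(str(skill_path))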

# Analyze components
print(f"Name: {skill.name}")
print(f"Description: {skill.description}")
print(f"Main instructions: {len(skill.instructions)} chars")
print(f"Sections: {list(skill.sections.keys())}")
print(f"Reference files: {[r.name for r in skill.references]}")
print(f"Scripts: {[s.name for s in skill.scripts]}")
```

### Step 2: Identify Optimization Goals

Ask the user or infer from context:

| Goal | Focus Areas |
|------|-------------|
| **Quality** | Output accuracy, completeness, formatting |
| **Reliability** | Error handling, edge cases, validation |
| **Performance** | Speed, token efficiency, file size |
| **Compliance** | Brand guidelines, templates, standards |
| **Usability** | Clear instructions, good examples |

### Step 3: Create Test Cases

```python
from src.generators.test_generator import TestCaseGenerator

generator = TestCaseGenerator(skill)
test_cases = generator.generate(
    count=20,
    include_edge_cases=True,
    categories=["simple", "complex", "edge_case"]
)

# Save for optimization
generator.save_to_yaml(test_cases, "test_cases.yaml")
```

**Test Case Categories:**

| Category | Purpose | Example |
|----------|---------|---------|
| `simple` | Basic functionality | "Create a 3-slide presentation" |
| `complex` | Multi-step tasks | "Create deck with charts, images, and animations" |
| `edge_case` | Boundary conditions | "Handle 50+ slides", "Unicode characters" |
| `failure_mode` | Known problems | "Long titles that overflow" |
| `brand` | Compliance tests | "Use exact brand colors" |
| `template` | Template usage | "Apply corporate template correctly" |

### Step 4: Configure Optimization

Create `config.yaml`:

```yaml
skill:
  path: ~/.claude/skills/pptx
  name: pptx
  components:
    - SKILL.md
    - html2pptx.md
    - css.md

optimization:
  max_iterations: 10
  max_evaluations: 100
  population_size: 5
  batch_size: 5
  
  # What to optimize
  optimize:
    - instructions      # Main SKILL.md content
    - design_philosophy # Design guidelines
    - validation_rules  # Quality checks
    - examples          # Code examples
    - workflows         # Step-by-step procedures

evaluation:
  metrics:
    - name: task_completion
      weight: 0.25
      type: binary
      description: "Did the task complete successfully?"
      
    - name: output_quality
      weight: 0.30
      type: llm_judge
      description: "Quality of the generated output"
      
    - name: error_rate
      weight: 0.20
      type: computed
      description: "Number of errors encountered"
      
    - name: edge_case_handling
      weight: 0.15
      type: binary
      description: "Handles edge cases correctly"
      
    - name: efficiency
      weight: 0.10
      type: computed
      description: "Execution time and resource usage"

claude:
  model: claude-sonnet-4-20250514
  timeout: 300
  executor_mode: cli  # Use Claude Code CLI

output:
  dir: ./optimization_results
  save_checkpoints: true
  checkpoint_every: 2
```
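
The metric weights above sum to 1.0, so per-metric scores can be folded into one weighted total. The sketch below shows one way to do that. It assumes every metric has already been normalized to a score in [0, 1] where higher is better (including `error_rate`). The `weighted_score` helper and the sample scores are hypothetical, not part of the optimizer's API.

```python
import math

# Weights copied from the evaluation.metrics section of config.yaml above.
METRIC_WEIGHTS = {
    "task_completion": 0.25,
    "output_quality": 0.30,
    "error_rate": 0.20,
    "edge_case_handling": 0.15,
    "efficiency": 0.10,
}


def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-metric scores in [0, 1] into one weighted total."""
    total_weight = sum(weights.values())
    if not math.isclose(total_weight, 1.0):
        raise ValueError(f"weights sum to {total_weight}, expected 1.0")
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"missing scores for: {sorted(missing)}")
    return sum(weights[name] * scores[name] for name in weights)


# Hypothetical scores for one candidate skill.
example_scores = {
    "task_completion": 1.0,
    "output_quality": 0.8,
    "error_rate": 0.5,
    "edge_case_handling": 0.0,
    "efficiency": 0.6,
}
print(round(weighted_score(example_scores, METRIC_WEIGHTS), 2))
```

Checking that the weights sum to 1.0 up front catches a common config mistake: editing one weight and forgetting to rebalance the others.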

### Step 5: Run GEPA Optimization

```python
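import yaml  # PyYAML

from src.core.skill_optimizer import SkillOptimizer, OptimizationConfig

# Load configuration
config = OptimizationConfig.from_yaml("config.yaml")

# Load test cases. Assumes a top-level `test_cases` list, as in the YAML
# examples in this document; adapt if your project ships its own loader.
with open("test_cases.yaml") as f:
    test_cases = yaml.safe_load(f)["test_cases"]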

# Create optimizer
optimizer = SkillOptimizer(config, test_cases)

# Run optimization
result = optimizer.optimize()

# Review results
print(f"Initial scores: {result.initial_scores}")
print(f"Final scores: {result.final_scores}")
print(f"Improvement: {result.improvement}")
print(f"Best candidate: {result.best_candidate.id}")
```
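
When reviewing results, it helps to look at per-metric changes and not only the overall improvement, since a gain in one metric can hide a regression in another. The sketch below assumes `result.initial_scores` and `result.final_scores` are plain dicts mapping metric names to floats. The source does not confirm that shape, and the helper names are hypothetical.

```python
def score_deltas(initial: dict[str, float], final: dict[str, float]) -> dict[str, float]:
    """Per-metric change (final minus initial) for metrics present in both runs."""
    return {
        name: round(final[name] - initial[name], 4)
        for name in initial
        if name in final
    }


def regressions(deltas: dict[str, float], tolerance: float = 0.0) -> list[str]:
    """Names of metrics that dropped by more than `tolerance`."""
    return sorted(name for name, delta in deltas.items() if delta < -tolerance)


# Hypothetical scores standing in for result.initial_scores / result.final_scores.
initial = {"task_completion": 0.70, "output_quality": 0.60, "efficiency": 0.80}
final = {"task_completion": 0.85, "output_quality": 0.75, "efficiency": 0.70}

deltas = score_deltas(initial, final)
print(deltas)
print(regressions(deltas))
```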

### Step 6: Apply Improvements

```bash
# Review the optimized skill
diff ~/.claude/skills/my-skill/SKILL.md \
     ./optimization_results/optimized_skill/SKILL.md

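# Apply if satisfied (back up the original first so the change is reversible)
cp ~/.claude/skills/my-skill/SKILL.md \
   ~/.claude/skills/my-skill/SKILL.md.bak
cp ./optimization_results/optimized_skill/SKILL.md \
   ~/.claude/skills/my-skill/SKILL.md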
```

---

## The GEPA Algorithm

GEPA (Genetic-Pareto) optimizes skills through **reflective text evolution**:

```
┌─────────────────────────────────────────────────────────────────────┐
│                                                                      │
│   ┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐  │
│   │  Skill   │────▶│  Execute │────▶│ Evaluate │────▶│ Reflect  │  │
│   │  v1.0    │     │  Tasks   │     │ Outputs  │     │ Failures │  │
│   └──────────┘     └──────────┘     └──────────┘     └──────────┘  │
│        ▲                                                    │       │
│        │           ┌──────────┐     ┌──────────┐           │       │
│        │           │  Pareto  │◀────│  Mutate  │◀──────────┘       │
│        │           │ Selection│     │  Skill   │                   │
│        │           └──────────┘     └──────────┘                   │
│        │                 │                                          │
│        └─────────────────┘                                          │
│                                                                      │
│   Output: Skill v2.0 with targeted improvements                     │
└─────────────────────────────────────────────────────────────────────┘
```

### Key Concepts

1. **Execution Traces**: Capture what Claude actually does when using the skill
   - Reasoning steps
   - Code generated
   - Commands run
   - Errors encountered
   - Files created

2. **Reflection**: Analyze failures to propose targeted improvements
   - "Why did this task fail?"
   - "What instructions were ambiguous?"
   - "What edge cases weren't handled?"

3. **Mutation**: Generate improved skill variants
   - Add clarifying instructions
   - Include better examples
   - Add validation steps
   - Improve error handling

4. **Pareto Selection**: Balance multiple objectives
   - Quality vs. Speed
   - Reliability vs. Flexibility
   - Completeness vs. Conciseness
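
To make the Pareto selection step concrete, here is a minimal sketch of generic Pareto dominance over per-metric scores. It illustrates the idea of keeping every candidate that is not beaten on all objectives at once. It is not necessarily the exact selection rule GEPA uses, and the candidate ids and scores are made up.

```python
def dominates(a: dict[str, float], b: dict[str, float]) -> bool:
    """True if `a` is at least as good as `b` on every metric and strictly better on one."""
    return all(a[m] >= b[m] for m in a) and any(a[m] > b[m] for m in a)


def pareto_front(candidates: dict[str, dict[str, float]]) -> list[str]:
    """Return the ids of candidates that no other candidate dominates."""
    return [
        cid
        for cid, scores in candidates.items()
        if not any(
            dominates(other, scores)
            for oid, other in candidates.items()
            if oid != cid
        )
    ]


# Hypothetical candidates scored on two objectives (higher is better).
candidates = {
    "gen1_v1": {"quality": 0.70, "speed": 0.90},
    "gen2_v1": {"quality": 0.85, "speed": 0.60},
    "gen2_v3": {"quality": 0.65, "speed": 0.80},  # worse than gen1_v1 on both
}
print(pareto_front(candidates))
```

Here `gen1_v1` and `gen2_v1` both survive because each wins on a different objective, while `gen2_v3` is dropped because `gen1_v1` beats it on both.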

---

## Execution Trace Capture

When Claude executes a task, capture:

```python
from dataclasses import dataclass


@dataclass
class ExecutionTrace:
    task_id: str
    prompt: str
    success: bool
    execution_time: float
    
    # What Claude decided
    reasoning: list[str]
    decisions: list[dict]
    
    # What Claude produced
    code_blocks: list[dict]
    commands: list[dict]
    files_created: list[dict]
    
    # What went wrong
    errors: list[str]
    warnings: list[str]
    
    # Metrics
    tokens_used: int
```
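
Traces are only useful for reflection if they can be stored and reloaded. The sketch below shows one way to persist a trace as JSON using `dataclasses.asdict`. `MiniTrace` is a hypothetical, trimmed-down stand-in for `ExecutionTrace` so the example stays self-contained, and the on-disk format is an assumption.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class MiniTrace:
    """Hypothetical, trimmed-down stand-in for ExecutionTrace."""
    task_id: str
    prompt: str
    success: bool
    execution_time: float
    errors: list[str] = field(default_factory=list)
    tokens_used: int = 0


def trace_to_json(trace: MiniTrace) -> str:
    """Serialize a trace to a JSON string, e.g. for a file under traces/."""
    return json.dumps(asdict(trace), indent=2, ensure_ascii=False)


trace = MiniTrace(
    task_id="test_001",
    prompt="Create a 3-slide presentation",
    success=False,
    execution_time=12.5,
    errors=["Title overflowed slide bounds"],
    tokens_used=1800,
)

# Round-trip through JSON to confirm nothing is lost.
restored = json.loads(trace_to_json(trace))
print(restored["task_id"], restored["success"])
```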

### Trace for Reflection

```python
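def to_reflection_text(trace: ExecutionTrace) -> str:
    """Convert a trace to text for LLM reflection."""
    fence = "`" * 3  # built at runtime so the template holds no literal code fence
    reasoning = "\n".join(f"- {r}" for r in trace.reasoning)
    errors = "\n".join(f"- ❌ {e}" for e in trace.errors)
    code = trace.code_blocks[0]["content"][:500] if trace.code_blocks else "None"
    # Only add an ellipsis when the prompt was actually truncated
    prompt = trace.prompt[:200] + ("..." if len(trace.prompt) > 200 else "")
    return f"""
## Task: {trace.task_id}
**Prompt**: {prompt}
**Success**: {trace.success}

### Claude's Reasoning:
{reasoning}

### Errors:
{errors}

### Code Generated:
{fence}
{code}
{fence}
"""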
```

---

## Built-in Evaluators

| Evaluator | Type | What It Checks |
|-----------|------|----------------|
| `BinaryEvaluator` | binary | Success/failure |
| `FileExistsEvaluator` | file_exists | Expected files created |
| `FileValidityEvaluator` | file_validity | Files not corrupt (valid PPTX, DOCX, etc.) |
| `ContentMatchEvaluator` | content_match | Expected content present |
| `ErrorRateEvaluator` | error_rate | Error count |
| `EfficiencyEvaluator` | efficiency | Time/token usage |
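
As an illustration of what a file-validity check might involve: PPTX and DOCX files are ZIP packages that contain a `[Content_Types].xml` part, so a basic corruption check can be done with the standard library alone. This is a sketch under that assumption, not the actual `FileValidityEvaluator` implementation. The function name is hypothetical.

```python
import zipfile
from pathlib import Path


def is_valid_office_file(path: str) -> bool:
    """Return True if `path` is a readable ZIP that contains [Content_Types].xml."""
    file_path = Path(path)
    if not file_path.is_file() or not zipfile.is_zipfile(file_path):
        return False
    with zipfile.ZipFile(file_path) as archive:
        # testzip() returns the name of the first corrupt member, or None if all are OK.
        if archive.testzip() is not None:
            return False
        return "[Content_Types].xml" in archive.namelist()
```

A check like this only confirms the container is intact. It does not prove the document opens cleanly in PowerPoint or Word.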

### Custom Evaluator Example

```python
from src.evaluators.base import BaseEvaluator, EvaluationResult

class BrandComplianceEvaluator(BaseEvaluator):
    """Check brand guideline compliance."""
    
    def __init__(self, brand_config: dict):
        super().__init__(name="brand_compliance")
        self.brand = brand_config
    
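    def evaluate(self, trace, test_case=None):
        issues = []
        score = 1.0
        colors_checked = 0  # summed over all files, so it is defined even with no presentations
        brand_colors = set(self.brand['colors'].values())

        # Check colors in every presentation the task produced
        for file in trace.files_created:
            if file['type'] != 'presentation':
                continue
            # _extract_colors must be supplied by the implementer (not shown here)
            colors_used = self._extract_colors(file['path'])
            colors_checked += len(colors_used)

            for color in colors_used:
                if color not in brand_colors:
                    issues.append(f"Non-brand color: {color}")
                    score -= 0.1

        return EvaluationResult(
            metric_name=self.name,
            score=max(0.0, score),
            details={"colors_checked": colors_checked},
            issues=issues
        )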
```

---

## Skill-Specific Optimization Examples

### PPTX Skill

```yaml
# test_cases.yaml for PPTX
test_cases:
  - id: template_corporate
    description: "Use corporate template correctly"
    prompt: |
      Using the Acme Corp template, create a 5-slide 
      quarterly results presentation with charts.
    expected_outputs:
      - type: file
        pattern: "*.pptx"
    quality_criteria:
      - "Uses template master slides"
      - "Brand colors applied"
      - "Charts properly formatted"
    tags: [template, brand, charts]
    
  - id: edge_long_title
    description: "Handle title overflow"
    prompt: |
      Create presentation titled: "Understanding the 
      Comprehensive Long-Term Strategic Impact of AI 
      on Global Economic Systems and International Trade"
    tags: [edge_case, overflow]
    
  - id: image_embedding
    description: "Embed images correctly"
    prompt: |
      Create a presentation with 3 product photos, 
      each properly sized and positioned.
    quality_criteria:
      - "Images not distorted"
      - "Proper aspect ratio maintained"
      - "Images centered on slides"
    tags: [images, layout]
```

### DOCX Skill

```yaml
test_cases:
  - id: report_structure
    description: "Create structured report"
    prompt: |
      Create a 10-page project report with executive 
      summary, sections, subsections, and appendix.
    quality_criteria:
      - "Proper heading hierarchy"
      - "Table of contents works"
      - "Page numbers correct"
      
  - id: track_changes
    description: "Work with tracked changes"
    prompt: |
      Open the document and accept all tracked changes,
      then add a new paragraph with revision marks.
    tags: [editing, collaboration]
```

---

## Output Structure

After optimization, the output directory contains:

```
optimization_results/
├── report.md                    # Human-readable summary
├── config_used.yaml             # Configuration snapshot
├── optimized_skill/
│   ├── SKILL.md                 # ✨ Optimized instructions
│   ├── references/              # Updated reference files
│   └── scripts/                 # Updated scripts
├── candidates/
│   ├── gen1_v1.md
│   ├── gen2_v3.md
│   └── ...
├── traces/
│   ├── test_001.json
│   └── ...
├── metrics.json                 # Score history
└── checkpoints/
    └── checkpoint_gen5.json
```

---

## Best Practices

### 1. Start with Known Failures

If you know the skill fails on certain tasks, encode those failures as regression tests:

```yaml
test_cases:
  - id: known_failure_001
    description: "Long titles overflow"
    prompt: "Create presentation with very long title..."
    tags: [known_failure, regression]
```

### 2. Include Diverse Test Cases

```yaml
# Mix of complexities
- simple: 30%
- medium: 40%
- complex: 20%
- edge_case: 10%
```
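
To turn a percentage mix like the one above into concrete per-category counts, one option is largest-remainder rounding, which guarantees the counts add up to the requested total. The helper name is hypothetical, and the category names are taken from the mix above.

```python
def allocate_counts(mix_percent: dict[str, int], total: int) -> dict[str, int]:
    """Split `total` test cases across categories using largest-remainder rounding."""
    if sum(mix_percent.values()) != 100:
        raise ValueError("percentages must sum to 100")
    counts: dict[str, int] = {}
    remainders: dict[str, int] = {}
    for name, pct in mix_percent.items():
        # Integer arithmetic avoids floating-point rounding surprises.
        counts[name], remainders[name] = divmod(pct * total, 100)
    leftover = total - sum(counts.values())
    # Give the remaining cases to the categories with the largest remainders.
    for name in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        counts[name] += 1
    return counts


mix = {"simple": 30, "medium": 40, "complex": 20, "edge_case": 10}
print(allocate_counts(mix, 20))
```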

### 3. Use Meaningful Metrics

Match metrics to what "good" means for this skill:

```yaml
# For a code generation skill
metrics:
  - name: compiles
    type: binary
  - name: passes_tests
    type: ratio
  - name: code_quality
    type: llm_judge
```

### 4. Iterate Incrementally

```bash
# Quick test run first
python -m src.core.skill_optimizer optimize \
    --config config.yaml \
    --max-iterations 3 \
    --mock

# Review, adjust, then full run
python -m src.core.skill_optimizer optimize \
    --config config.yaml \
    --max-iterations 15
```

### 5. Review Before Applying

Always diff the changes:

```bash
diff -u original_SKILL.md optimized_SKILL.md | less
```
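
The same review can be done from Python, for example inside a script that gates the apply step. This minimal sketch uses the standard library's `difflib.unified_diff`. The `skill_diff` helper and the file labels are illustrative only.

```python
import difflib


def skill_diff(original: str, optimized: str) -> str:
    """Return a unified diff between two versions of a SKILL.md file."""
    lines = difflib.unified_diff(
        original.splitlines(keepends=True),
        optimized.splitlines(keepends=True),
        fromfile="original/SKILL.md",
        tofile="optimized/SKILL.md",
    )
    return "".join(lines)


before = "# My Skill\nCreate slides.\n"
after = "# My Skill\nCreate slides. Shorten titles that overflow.\n"
print(skill_diff(before, after))
```

An empty result means the optimizer made no changes to that file.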

---

## Troubleshooting

| Issue | Solution |
|-------|----------|
| "Skill not found" | Check path, use absolute path |
| "No test cases" | Generate with test_generator |
| "All tests fail" | Check Claude Code CLI works |
| "No improvement" | Add more diverse test cases |
| "Regression" | Keep edge-case tests in the suite and increase the weight of `edge_case_handling` |

---

## CLI Commands

```bash
# Initialize optimization for a skill
skill-optimizer init --skill-path PATH --output-dir DIR

# Generate test cases
skill-test-gen --skill-path PATH --output FILE --count N

# Run optimization
skill-optimizer optimize --config CONFIG [--max-iterations N]

# Evaluate without optimization
skill-optimizer evaluate --skill-path PATH --test-cases FILE

# Compare original vs optimized
skill-optimizer compare --original PATH --optimized PATH
```

---

## Integration with Claude Code

This skill drives the Claude Code CLI, so make sure it is installed:

```bash
# Check Claude Code is available
which claude

# Test execution
claude --version
```

The executor invokes:
```bash
claude --add-dir {skill_path} --output-format json -p "{prompt}"
```
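
A rough sketch of how an executor could issue that call from Python. It uses only the CLI flags shown above and passes the arguments as a list, so prompts with quotes or newlines need no shell escaping. The function names and the default timeout (taken from `config.yaml`) are assumptions, not the project's actual executor.

```python
import subprocess


def build_claude_command(skill_path: str, prompt: str) -> list[str]:
    """Build the argv list for the Claude Code CLI call described above."""
    return [
        "claude",
        "--add-dir", skill_path,
        "--output-format", "json",
        "-p", prompt,
    ]


def run_task(skill_path: str, prompt: str, timeout: int = 300) -> subprocess.CompletedProcess:
    """Run one task; the caller inspects returncode, stdout, and stderr."""
    return subprocess.run(
        build_claude_command(skill_path, prompt),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
```

`subprocess.run` raises `subprocess.TimeoutExpired` when the timeout is hit, which an executor would typically record as a failed trace.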

---

## Expected Results

Indicative ranges based on GEPA research; actual results vary by skill and test suite:

| Metric | Typical Improvement |
|--------|---------------------|
| Task Completion | +10% to +15% |
| Output Quality | +15% to +25% |
| Error Rate | -30% to -50% |
| Edge Case Handling | +20% to +40% |

---

## References

- [GEPA Paper](https://arxiv.org/abs/2507.19457) - Reflective Prompt Evolution
- [Claude Code Documentation](https://docs.anthropic.com)
- [Skill Creator Best Practices](https://docs.anthropic.com)
