AI product craft
These skills encode the frameworks that distinguish disciplined AI product development from vibes-driven shipping: defaulting to “no AI,” building evals before committing engineering, designing counter-metrics that catch Goodhart failures, and translating model behavior into business outcomes.
ai-pm-frameworks
Triggers: “should we use AI for this?”, “what AI approach fits here?”, “help me scope this AI feature”
The core AI PM decision framework. Covers:
- The default-to-no question. Is this actually an AI problem, or is it a UI, data, or rules problem? The skill forces this question before anything else.
- Build vs. buy vs. rule. When to use a pre-built API, when to fine-tune, when to write a rule instead.
- Eval design. Golden dataset structure, metric choices, the minimum viable eval before committing engineering resources.
- Model selection. How to match model capability, cost, and latency to the actual requirement — not the model that impressed in a demo.
- Guardrail planning. Safety sandwiches around LLM calls. What belongs in code vs. what belongs in the model.
Reads profile.md (for stack context) and strategy/<area>.md (for product context) when present.
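As a rough illustration of the minimum viable eval, the sketch below scores a non-AI rule baseline against a tiny golden dataset. Everything in it is hypothetical: `GoldenCase`, `exact_match_accuracy`, `rule_baseline`, and the ticket-routing examples are assumptions made for illustration, not part of the skill itself.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class GoldenCase:
    """One hand-labeled example: an input and the label a reviewer accepted."""
    text: str
    expected: str


def exact_match_accuracy(
    cases: Sequence[GoldenCase], predict: Callable[[str], str]
) -> float:
    """Fraction of golden cases where the prediction equals the label."""
    if not cases:
        raise ValueError("golden dataset is empty")
    hits = sum(1 for case in cases if predict(case.text) == case.expected)
    return hits / len(cases)


def rule_baseline(text: str) -> str:
    """Hypothetical non-AI baseline: a keyword rule for routing support tickets."""
    lowered = text.lower()
    return "billing" if "refund" in lowered or "invoice" in lowered else "other"


# Illustrative golden dataset; a real one would be larger and reviewed by humans.
GOLDEN = [
    GoldenCase("I want a refund for last month", "billing"),
    GoldenCase("Where is my invoice?", "billing"),
    GoldenCase("The app crashes on login", "other"),
    GoldenCase("I was charged twice", "billing"),
]

baseline_accuracy = exact_match_accuracy(GOLDEN, rule_baseline)
print(f"rule baseline accuracy: {baseline_accuracy:.2f}")
```

A model candidate can be passed in as another `predict` function and scored on the same cases. If the rule already clears the bar the team set, the default-to-no answer may well stand.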
metrics-design
Triggers: “design the metrics for [feature]”, “what should I measure?”, “help me define success metrics”
Designs the full metric tree for an AI feature:
- North star — the one metric that, if it moves, you’re confident the product is improving
- Leading indicators — what you can measure before the north star moves
- Lagging indicators — downstream outcomes you’re ultimately optimizing for
- Counter-metrics — the signals that catch Goodhart-style failures (you optimized the thing you measured into a worse product)
- Operational metrics — latency, cost, error rate; the infra signals that keep the north star credible
Reads strategy/<area>.md when present — the strategy’s key metrics become the north stars, and this skill expands them.
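One way to picture the metric tree is as a small data structure that can report which branches are still empty. The sketch below is an assumption about shape only: `MetricTree`, `empty_branches`, and the example metrics are hypothetical and do not describe the skill's actual output format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MetricTree:
    """Hypothetical container for the five branches of a metric tree."""
    north_star: str
    leading: List[str] = field(default_factory=list)
    lagging: List[str] = field(default_factory=list)
    counter: List[str] = field(default_factory=list)
    operational: List[str] = field(default_factory=list)

    def empty_branches(self) -> List[str]:
        """Return the names of branches that have no metrics yet."""
        branches: Dict[str, List[str]] = {
            "leading": self.leading,
            "lagging": self.lagging,
            "counter": self.counter,
            "operational": self.operational,
        }
        return [name for name, metrics in branches.items() if not metrics]


# Illustrative values for an imaginary support-reply drafting feature.
tree = MetricTree(
    north_star="tickets resolved without escalation",
    leading=["draft acceptance rate"],
    lagging=["customer satisfaction score"],
    operational=["p95 latency", "cost per draft", "error rate"],
)

missing = tree.empty_branches()
print(missing)  # ['counter']
```

In this example the tree has no counter-metrics, which is exactly the gap that lets a Goodhart-style failure go unnoticed.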
demo-prep
Triggers: “help me prep for this demo”, “I’m demoing [X] to [audience]”, “prep a high-stakes demo”
Five-pass prep for high-stakes demos:
- Headline — what single thing should the audience remember?
- Audience — what do they already know, what do they care about, what will they push on?
- Pre-mortem — what could go wrong? How do you handle each failure mode?
- Script — the actual flow, with specific words for the hard transitions
- Q&A — the questions they will ask, with prepared answers
Extra handling for AI-feature demos: behavioral uncertainty, “why did it do that” questions, confidence calibration, and the specific hazard of demos that work in rehearsal but fail in front of an audience.
prioritization-frameworks
Triggers: “help me prioritize this backlog”, “should I pick X or Y?”, “RICE vs ICE for this”
Picks the right prioritization framework for the situation:
| Framework | Best for |
|---|---|
| RICE | Large backlogs with heterogeneous items; teams that over-weight gut feel |
| ICE | Fast triage; smaller backlogs; when effort is hard to estimate |
| WSJF | Programs where cost of delay is the key variable; agile-at-scale |
| MoSCoW | Scope negotiation with stakeholders; fixed-deadline releases |
| Kano | Feature discovery; understanding which attributes delight vs. satisfy |
| Cost of Delay | When sequencing decisions need to be explained to executives |
Forces hidden assumptions into the open: who scored this, on what basis, what would change the ranking?
Reads strategy/<area>.md when present — backlog items are scored against strategic tracks, and items that don’t fit any track are candidates for the “not working on” list.
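For the scored frameworks in the table, the commonly cited formulas are short enough to write down. The sketch below assumes the usual definitions (RICE as reach × impact × confidence ÷ effort, ICE as impact × confidence × ease, WSJF as cost of delay ÷ job size). The `BacklogItem` type and the two backlog entries are hypothetical, and the skill may weight or scale things differently.

```python
from dataclasses import dataclass
from typing import List


def rice(reach: float, impact: float, confidence: float, effort: float) -> float:
    """RICE score: reach * impact * confidence / effort."""
    if effort <= 0:
        raise ValueError("effort must be positive")
    return reach * impact * confidence / effort


def ice(impact: float, confidence: float, ease: float) -> float:
    """ICE score: impact * confidence * ease."""
    return impact * confidence * ease


def wsjf(cost_of_delay: float, job_size: float) -> float:
    """WSJF score: cost of delay divided by job size."""
    if job_size <= 0:
        raise ValueError("job_size must be positive")
    return cost_of_delay / job_size


@dataclass(frozen=True)
class BacklogItem:
    """Hypothetical backlog entry carrying the four RICE inputs."""
    name: str
    reach: float       # users affected per quarter
    impact: float      # relative impact multiplier
    confidence: float  # 0.0 to 1.0
    effort: float      # person-months


backlog: List[BacklogItem] = [
    BacklogItem("smart-search", reach=2000, impact=1.0, confidence=0.8, effort=4),
    BacklogItem("onboarding-fix", reach=500, impact=2.0, confidence=0.5, effort=1),
]

ranked = sorted(
    backlog,
    key=lambda item: rice(item.reach, item.impact, item.confidence, item.effort),
    reverse=True,
)
ranked_names = [item.name for item in ranked]
print(ranked_names)
```

The inputs matter more than the arithmetic: each number is someone's estimate, which is why the questions about who scored an item and on what basis come first.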
decision-log
Triggers: “log this decision”, “document why we chose X”, “ADR for [decision]”
Captures meaningful product/technical decisions in ADR format:
- Decision: what was chosen
- Context: what made this a real decision (not a default)
- Alternatives considered: what else was on the table
- Rationale: why this option
- Reversibility: easily reversible / reversible with effort / effectively irreversible
- Revisit triggers: what would make you revisit this?
- Confidence: high / medium / low at time of decision
Especially load-bearing for AI decisions tied to specific model versions. When you upgrade from Claude 3.5 Sonnet to Claude 3.7 Sonnet, you want a record of why you chose the previous model and what the revisit criteria were.
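The fields of a decision record map naturally onto a small typed structure. The sketch below is one possible shape, assuming nothing about how the skill actually stores entries: `DecisionRecord`, the two enums, `to_text`, and the example decision are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Reversibility(Enum):
    EASY = "easily reversible"
    WITH_EFFORT = "reversible with effort"
    IRREVERSIBLE = "effectively irreversible"


class Confidence(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


@dataclass
class DecisionRecord:
    """Hypothetical in-memory form of one ADR-style entry."""
    decision: str
    context: str
    alternatives: List[str]
    rationale: str
    reversibility: Reversibility
    revisit_triggers: List[str]
    confidence: Confidence

    def to_text(self) -> str:
        """Render the record as labeled lines, one field per line."""
        return "\n".join([
            f"Decision: {self.decision}",
            f"Context: {self.context}",
            f"Alternatives considered: {'; '.join(self.alternatives)}",
            f"Rationale: {self.rationale}",
            f"Reversibility: {self.reversibility.value}",
            f"Revisit triggers: {'; '.join(self.revisit_triggers)}",
            f"Confidence: {self.confidence.value}",
        ])


# Illustrative entry; the decision and its details are made up.
record = DecisionRecord(
    decision="Pin the summarization feature to one model version",
    context="Eval scores varied between versions on the golden dataset",
    alternatives=["Always use the latest version", "Write a rules-based summary"],
    rationale="Pinned version passed the eval threshold at acceptable cost",
    reversibility=Reversibility.WITH_EFFORT,
    revisit_triggers=["New model version released", "Eval pass rate drops"],
    confidence=Confidence.MEDIUM,
)

text = record.to_text()
print(text)
```

Using enums for reversibility and confidence keeps entries comparable over time, since free-text values tend to drift.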
self-reflect
Triggers: “let me reflect on myself”, “self-reflection session”, “how am I doing as a leader?”
Guides reflection across five dimensions:
- Behavior under pressure (what triggers you, what you default to)
- Communication (how you’re landing, what’s not getting through)
- Time and energy (what you’re giving energy to vs. what you wish you were)
- Fulfillment (what’s meaningful, what’s not)
- Advocating for yourself (are you making your work visible, are you asking for what you need?)
Writes dated entries to ~/bettersense-work-reflections/self/reflections.md. The patterns-watch skill reads this alongside stakeholder and wins data.
patterns-watch
Triggers: “anything notable across my reflections this week?”, “Sunday evening check-in”, “patterns across my work”
Cross-cutting scan across the entire reflection ecosystem (stakeholder + wins + self files). Surfaces unsolicited patterns:
- Attention gaps — stakeholders you haven’t reflected on in a while
- Contradictions — stated focus vs. actual logged work
- Trajectory shifts — patterns that have changed direction
- Type imbalances — wins log skewed toward one type (e.g. all delivery, no mentorship)
Every observation is cited to specific dated entries. Never claims a pattern without evidence.
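To make the attention-gap check concrete, here is a minimal sketch that flags stakeholders whose most recent reflection is older than a cutoff. The function name, the 21-day default, and the in-memory mapping of stakeholder to last entry date are assumptions for illustration; the on-disk format of the reflection files is not specified here.

```python
from datetime import date
from typing import Dict, List


def attention_gaps(
    last_reflected: Dict[str, date], today: date, max_days: int = 21
) -> List[str]:
    """Return stakeholders whose latest entry is more than max_days old, sorted by name."""
    return sorted(
        name
        for name, last_seen in last_reflected.items()
        if (today - last_seen).days > max_days
    )


# Illustrative data: stakeholder label -> date of the most recent reflection entry.
last_entries = {
    "eng-manager": date(2025, 6, 10),
    "design-lead": date(2025, 5, 1),
    "vp-product": date(2025, 5, 30),
}

gaps = attention_gaps(last_entries, today=date(2025, 6, 15))
print(gaps)  # ['design-lead']
```

Returning the dates alongside the names would keep the check in line with the rule that every observation cites specific dated entries.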
Most useful as a recurring scan:
/schedule "Every Sunday at 7pm, run /bettersense:patterns-watch"