Initial commit
This commit is contained in:
309
.claude/skills/shipping-and-launch/SKILL.md
Normal file
309
.claude/skills/shipping-and-launch/SKILL.md
Normal file
@@ -0,0 +1,309 @@
|
||||
---
|
||||
name: shipping-and-launch
|
||||
description: Prepares production launches. Use when preparing to deploy to production. Use when you need a pre-launch checklist, when setting up monitoring, when planning a staged rollout, or when you need a rollback strategy.
|
||||
---
|
||||
|
||||
# Shipping and Launch
|
||||
|
||||
## Overview
|
||||
|
||||
Ship with confidence. The goal is not just to deploy — it's to deploy safely, with monitoring in place, a rollback plan ready, and a clear understanding of what success looks like. Every launch should be reversible, observable, and incremental.
|
||||
|
||||
## When to Use
|
||||
|
||||
- Deploying a feature to production for the first time
|
||||
- Releasing a significant change to users
|
||||
- Migrating data or infrastructure
|
||||
- Opening a beta or early access program
|
||||
- Any deployment that carries risk (all of them)
|
||||
|
||||
## The Pre-Launch Checklist
|
||||
|
||||
### Code Quality
|
||||
|
||||
- [ ] All tests pass (unit, integration, e2e)
|
||||
- [ ] Build succeeds with no warnings
|
||||
- [ ] Lint and type checking pass
|
||||
- [ ] Code reviewed and approved
|
||||
- [ ] No TODO comments that should be resolved before launch
|
||||
- [ ] No `console.log` debugging statements in production code
|
||||
- [ ] Error handling covers expected failure modes
|
||||
|
||||
### Security
|
||||
|
||||
- [ ] No secrets in code or version control
|
||||
- [ ] `npm audit` shows no critical or high vulnerabilities
|
||||
- [ ] Input validation on all user-facing endpoints
|
||||
- [ ] Authentication and authorization checks in place
|
||||
- [ ] Security headers configured (CSP, HSTS, etc.)
|
||||
- [ ] Rate limiting on authentication endpoints
|
||||
- [ ] CORS configured to specific origins (not wildcard)
|
||||
|
||||
### Performance
|
||||
|
||||
- [ ] Core Web Vitals within "Good" thresholds
|
||||
- [ ] No N+1 queries in critical paths
|
||||
- [ ] Images optimized (compression, responsive sizes, lazy loading)
|
||||
- [ ] Bundle size within budget
|
||||
- [ ] Database queries have appropriate indexes
|
||||
- [ ] Caching configured for static assets and repeated queries
|
||||
|
||||
### Accessibility
|
||||
|
||||
- [ ] Keyboard navigation works for all interactive elements
|
||||
- [ ] Screen reader can convey page content and structure
|
||||
- [ ] Color contrast meets WCAG 2.1 AA (4.5:1 for text)
|
||||
- [ ] Focus management correct for modals and dynamic content
|
||||
- [ ] Error messages are descriptive and associated with form fields
|
||||
- [ ] No accessibility warnings in axe-core or Lighthouse
|
||||
|
||||
### Infrastructure
|
||||
|
||||
- [ ] Environment variables set in production
|
||||
- [ ] Database migrations applied (or ready to apply)
|
||||
- [ ] DNS and SSL configured
|
||||
- [ ] CDN configured for static assets
|
||||
- [ ] Logging and error reporting configured
|
||||
- [ ] Health check endpoint exists and responds
|
||||
|
||||
### Documentation
|
||||
|
||||
- [ ] README updated with any new setup requirements
|
||||
- [ ] API documentation current
|
||||
- [ ] ADRs written for any architectural decisions
|
||||
- [ ] Changelog updated
|
||||
- [ ] User-facing documentation updated (if applicable)
|
||||
|
||||
## Feature Flag Strategy
|
||||
|
||||
Ship behind feature flags to decouple deployment from release:
|
||||
|
||||
```typescript
|
||||
// Feature flag check
|
||||
const flags = await getFeatureFlags(userId);
|
||||
|
||||
if (flags.taskSharing) {
|
||||
// New feature: task sharing
|
||||
return <TaskSharingPanel task={task} />;
|
||||
}
|
||||
|
||||
// Default: existing behavior
|
||||
return null;
|
||||
```
|
||||
|
||||
**Feature flag lifecycle:**
|
||||
|
||||
```
|
||||
1. DEPLOY with flag OFF → Code is in production but inactive
|
||||
2. ENABLE for team/beta → Internal testing in production environment
|
||||
3. GRADUAL ROLLOUT → 5% → 25% → 50% → 100% of users
|
||||
4. MONITOR at each stage → Watch error rates, performance, user feedback
|
||||
5. CLEAN UP → Remove flag and dead code path after full rollout
|
||||
```
|
||||
|
||||
**Rules:**
|
||||
- Every feature flag has an owner and an expiration date
|
||||
- Clean up flags within 2 weeks of full rollout
|
||||
- Don't nest feature flags (creates exponential combinations)
|
||||
- Test both flag states (on and off) in CI
|
||||
|
||||
## Staged Rollout
|
||||
|
||||
### The Rollout Sequence
|
||||
|
||||
```
|
||||
1. DEPLOY to staging
|
||||
└── Full test suite in staging environment
|
||||
└── Manual smoke test of critical flows
|
||||
|
||||
2. DEPLOY to production (feature flag OFF)
|
||||
└── Verify deployment succeeded (health check)
|
||||
└── Check error monitoring (no new errors)
|
||||
|
||||
3. ENABLE for team (flag ON for internal users)
|
||||
└── Team uses the feature in production
|
||||
└── 24-hour monitoring window
|
||||
|
||||
4. CANARY rollout (flag ON for 5% of users)
|
||||
└── Monitor error rates, latency, user behavior
|
||||
└── Compare metrics: canary vs. baseline
|
||||
└── 24-48 hour monitoring window
|
||||
└── Advance only if all thresholds pass (see table below)
|
||||
|
||||
5. GRADUAL increase (25% -> 50% -> 100%)
|
||||
└── Same monitoring at each step
|
||||
└── Ability to roll back to previous percentage at any point
|
||||
|
||||
6. FULL rollout (flag ON for all users)
|
||||
└── Monitor for 1 week
|
||||
└── Clean up feature flag
|
||||
```
|
||||
|
||||
### Rollout Decision Thresholds
|
||||
|
||||
Use these thresholds to decide whether to advance, hold, or roll back at each stage:
|
||||
|
||||
| Metric | Advance (green) | Hold and investigate (yellow) | Roll back (red) |
|
||||
|--------|-----------------|-------------------------------|-----------------|
|
||||
| Error rate | Within 10% of baseline | 10-100% above baseline | >2x baseline |
|
||||
| P95 latency | Within 20% of baseline | 20-50% above baseline | >50% above baseline |
|
||||
| Client JS errors | No new error types | New errors at <0.1% of sessions | New errors at >0.1% of sessions |
|
||||
| Business metrics | Neutral or positive | Decline <5% (may be noise) | Decline >5% |
|
||||
|
||||
### When to Roll Back
|
||||
|
||||
Roll back immediately if:
|
||||
- Error rate increases by more than 2x baseline
|
||||
- P95 latency increases by more than 50%
|
||||
- User-reported issues spike
|
||||
- Data integrity issues detected
|
||||
- Security vulnerability discovered
|
||||
|
||||
## Monitoring and Observability
|
||||
|
||||
### What to Monitor
|
||||
|
||||
```
|
||||
Application metrics:
|
||||
├── Error rate (total and by endpoint)
|
||||
├── Response time (p50, p95, p99)
|
||||
├── Request volume
|
||||
├── Active users
|
||||
└── Key business metrics (conversion, engagement)
|
||||
|
||||
Infrastructure metrics:
|
||||
├── CPU and memory utilization
|
||||
├── Database connection pool usage
|
||||
├── Disk space
|
||||
├── Network latency
|
||||
└── Queue depth (if applicable)
|
||||
|
||||
Client metrics:
|
||||
├── Core Web Vitals (LCP, INP, CLS)
|
||||
├── JavaScript errors
|
||||
├── API error rates from client perspective
|
||||
└── Page load time
|
||||
```
|
||||
|
||||
### Error Reporting
|
||||
|
||||
```typescript
|
||||
// Set up error boundary with reporting
|
||||
class ErrorBoundary extends React.Component {
|
||||
componentDidCatch(error: Error, info: React.ErrorInfo) {
|
||||
// Report to error tracking service
|
||||
reportError(error, {
|
||||
componentStack: info.componentStack,
|
||||
userId: getCurrentUser()?.id,
|
||||
page: window.location.pathname,
|
||||
});
|
||||
}
|
||||
|
||||
render() {
|
||||
if (this.state.hasError) {
|
||||
return <ErrorFallback onRetry={() => this.setState({ hasError: false })} />;
|
||||
}
|
||||
return this.props.children;
|
||||
}
|
||||
}
|
||||
|
||||
// Server-side error reporting
|
||||
app.use((err: Error, req: Request, res: Response, next: NextFunction) => {
|
||||
reportError(err, {
|
||||
method: req.method,
|
||||
url: req.url,
|
||||
userId: req.user?.id,
|
||||
});
|
||||
|
||||
// Don't expose internals to users
|
||||
res.status(500).json({
|
||||
error: { code: 'INTERNAL_ERROR', message: 'Something went wrong' },
|
||||
});
|
||||
});
|
||||
```
|
||||
|
||||
### Post-Launch Verification
|
||||
|
||||
In the first hour after launch:
|
||||
|
||||
```
|
||||
1. Check health endpoint returns 200
|
||||
2. Check error monitoring dashboard (no new error types)
|
||||
3. Check latency dashboard (no regression)
|
||||
4. Test the critical user flow manually
|
||||
5. Verify logs are flowing and readable
|
||||
6. Confirm rollback mechanism works (dry run if possible)
|
||||
```
|
||||
|
||||
## Rollback Strategy
|
||||
|
||||
Every deployment needs a rollback plan before it happens:
|
||||
|
||||
```markdown
|
||||
## Rollback Plan for [Feature/Release]
|
||||
|
||||
### Trigger Conditions
|
||||
- Error rate > 2x baseline
|
||||
- P95 latency > [X]ms
|
||||
- User reports of [specific issue]
|
||||
|
||||
### Rollback Steps
|
||||
1. Disable feature flag (if applicable)
|
||||
OR
|
||||
1. Deploy previous version: `git revert <commit> && git push`
|
||||
2. Verify rollback: health check, error monitoring
|
||||
3. Communicate: notify team of rollback
|
||||
|
||||
### Database Considerations
|
||||
- Migration [X] has a rollback: `npx prisma migrate rollback`
|
||||
- Data inserted by new feature: [preserved / cleaned up]
|
||||
|
||||
### Time to Rollback
|
||||
- Feature flag: < 1 minute
|
||||
- Redeploy previous version: < 5 minutes
|
||||
- Database rollback: < 15 minutes
|
||||
```
|
||||
## See Also
|
||||
|
||||
- For security pre-launch checks, see `references/security-checklist.md`
|
||||
- For performance pre-launch checklist, see `references/performance-checklist.md`
|
||||
- For accessibility verification before launch, see `references/accessibility-checklist.md`
|
||||
|
||||
## Common Rationalizations
|
||||
|
||||
| Rationalization | Reality |
|
||||
|---|---|
|
||||
| "It works in staging, it'll work in production" | Production has different data, traffic patterns, and edge cases. Monitor after deploy. |
|
||||
| "We don't need feature flags for this" | Every feature benefits from a kill switch. Even "simple" changes can break things. |
|
||||
| "Monitoring is overhead" | Not having monitoring means you discover problems from user complaints instead of dashboards. |
|
||||
| "We'll add monitoring later" | Add it before launch. You can't debug what you can't see. |
|
||||
| "Rolling back is admitting failure" | Rolling back is responsible engineering. Shipping a broken feature is the failure. |
|
||||
|
||||
## Red Flags
|
||||
|
||||
- Deploying without a rollback plan
|
||||
- No monitoring or error reporting in production
|
||||
- Big-bang releases (everything at once, no staging)
|
||||
- Feature flags with no expiration or owner
|
||||
- No one monitoring the deploy for the first hour
|
||||
- Production environment configuration done by memory, not code
|
||||
- "It's Friday afternoon, let's ship it"
|
||||
|
||||
## Verification
|
||||
|
||||
Before deploying:
|
||||
|
||||
- [ ] Pre-launch checklist completed (all sections green)
|
||||
- [ ] Feature flag configured (if applicable)
|
||||
- [ ] Rollback plan documented
|
||||
- [ ] Monitoring dashboards set up
|
||||
- [ ] Team notified of deployment
|
||||
|
||||
After deploying:
|
||||
|
||||
- [ ] Health check returns 200
|
||||
- [ ] Error rate is normal
|
||||
- [ ] Latency is normal
|
||||
- [ ] Critical user flow works
|
||||
- [ ] Logs are flowing
|
||||
- [ ] Rollback tested or verified ready
|
||||
Reference in New Issue
Block a user