Disaster Recovery Plan
Document ID: PLCY-DRP-001
Version: 1.1
Effective Date: December 22, 2025
Last Review: December 22, 2025
Owner: Hop And Haul Team
CONFIDENTIAL
This document is CONFIDENTIAL and for internal use only. Do not distribute outside the organization.
1. Purpose
This document establishes the disaster recovery (DR) procedures for Hop And Haul's production infrastructure, ensuring business continuity and data integrity in the event of system failures, natural disasters, or security incidents.
2. Scope
This policy applies to all Hop And Haul production systems including:
- AWS infrastructure (compute, storage, networking)
- Cloudflare DNS and Zero Trust services
- PostgreSQL databases
- Swift Vapor application servers
- Authentication and authorization systems
3. Infrastructure Architecture
3.1 Network Architecture
| Layer | Technology | Purpose |
|---|---|---|
| DNS | Cloudflare | DNS resolution, DDoS protection |
| Edge Security | Cloudflare Zero Trust | Identity-aware access, tunnel ingress |
| Tunnels | Cloudflare Tunnel | No public ports exposed |
| Compute | AWS EC2 (single instance) | Application hosting |
| Database | AWS RDS PostgreSQL Multi-AZ | Persistent data storage with automatic failover |
| Storage | AWS S3 | Object storage, AMI backups |
3.2 Capacity Sizing
| Resource | Specification | Justification |
|---|---|---|
| EC2 Instance | r6g.xlarge (32GB RAM, 4 vCPU) | Headroom for 5,000 concurrent connections |
| RDS Instance | db.t3.small (2GB) or db.t3.medium (4GB) | Connection pooling limits actual DB connections to ~50 |
| Max Users | 5,000 drivers/riders | Same user pool (drivers are riders) |
| Max WebSocket Connections | 5,000 | GPS updates via persistent WebSocket |
| Max RPS | 5,000 | Well within single-instance Swift Vapor capacity |
Sizing Analysis:
| Metric | Calculation | Result |
|---|---|---|
| WebSocket memory | 5,000 connections x 50KB | 250MB |
| Swift Vapor overhead | Base + connection handling | ~2GB |
| Available headroom | 32GB - ~3GB used (250MB + ~2GB, rounded up) | 29GB buffer |
| RDS connections | Vapor connection pool | 20-50 actual connections |
| RDS memory pressure | 50 connections x 10MB work_mem | 500MB |
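The sizing arithmetic above can be reproduced with a short Python sketch. The constants come from the tables in this section; the function and variable names are illustrative and not part of any existing tooling, and decimal units (1GB = 1,000MB) are assumed because the 250MB result implies them.

```python
# Illustrative re-computation of the Section 3.2 sizing analysis.
# All names are hypothetical; the figures come from the tables above.

KB_PER_MB = 1_000  # assumption: decimal units, as the 250MB result implies
MB_PER_GB = 1_000


def websocket_memory_mb(connections: int, kb_per_connection: int) -> float:
    """Memory held by open WebSocket connections, in MB."""
    return connections * kb_per_connection / KB_PER_MB


def ec2_headroom_gb(total_gb: float, websocket_mb: float, vapor_overhead_gb: float) -> float:
    """RAM left on the instance after WebSocket state and Vapor overhead."""
    return total_gb - (websocket_mb / MB_PER_GB + vapor_overhead_gb)


def rds_work_mem_mb(connections: int, work_mem_mb: int) -> int:
    """work_mem pressure if each pooled connection uses one work_mem allocation."""
    return connections * work_mem_mb


ws_mb = websocket_memory_mb(5_000, 50)
headroom = ec2_headroom_gb(32, ws_mb, 2)
rds_mb = rds_work_mem_mb(50, 10)

print(f"WebSocket memory: {ws_mb:.0f}MB")
print(f"EC2 headroom: {headroom:.2f}GB")
print(f"RDS work_mem pressure: {rds_mb}MB")
```

The unrounded headroom is 29.75GB; the table rounds usage up to 3GB, which gives the more conservative 29GB figure.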
3.3 Security Posture
| Control | Implementation |
|---|---|
| Public Port Exposure | Zero (all traffic via Cloudflare Tunnel) |
| Authentication | Stateless JWT with role-based access |
| Multi-tenancy | Organization-scoped data isolation |
| Application Runtime | Swift Vapor (compiled, memory-safe) |
| Database Access | Application-only (no direct access) |
3.4 Single-Instance Architecture Rationale
Given the workload characteristics (5,000 max concurrent users, WebSocket-based GPS), a single EC2 instance provides:
| Benefit | Description |
|---|---|
| Simplicity | No load balancer, no session affinity concerns |
| WebSocket affinity | All connections to single instance, no sticky session routing |
| Cost efficiency | One instance vs. cluster overhead |
| Reduced failure modes | Fewer components to fail |
| RDS Multi-AZ | Database HA handled by AWS, not application |
Trade-off acknowledged: Single point of failure for compute. Mitigation: AMI-based rapid recovery (see Section 5.2).
4. Recovery Objectives
4.1 Recovery Time Objective (RTO)
| Tier | Systems | RTO |
|---|---|---|
| Tier 1 - Critical | Authentication, Core API, Driver matching | 1 hour |
| Tier 2 - Essential | Reporting, Notifications, Voice agent | 4 hours |
| Tier 3 - Standard | Analytics, Admin dashboards | 24 hours |
4.2 Recovery Point Objective (RPO)
| Data Category | RPO | Backup Method |
|---|---|---|
| Transaction data | 5 minutes | Continuous replication |
| User/Driver data | 1 hour | Hourly snapshots |
| Audit logs | 1 hour | Continuous streaming |
| Configuration | 24 hours | Daily snapshots |
| Analytics | 24 hours | Daily exports |
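As a rough illustration of how drill results might be compared with these objectives, the following Python sketch encodes the RTO and RPO tables as lookups. The dictionary keys and function names are hypothetical; only the durations come from Sections 4.1 and 4.2.

```python
from datetime import timedelta

# RTO targets from Section 4.1, keyed by tier number.
RTO_BY_TIER = {
    1: timedelta(hours=1),
    2: timedelta(hours=4),
    3: timedelta(hours=24),
}

# RPO targets from Section 4.2; the key names are illustrative.
RPO_BY_CATEGORY = {
    "transaction": timedelta(minutes=5),
    "user_driver": timedelta(hours=1),
    "audit_logs": timedelta(hours=1),
    "configuration": timedelta(hours=24),
    "analytics": timedelta(hours=24),
}


def meets_rto(tier: int, actual_recovery: timedelta) -> bool:
    """True if the measured recovery time is within the tier's RTO."""
    return actual_recovery <= RTO_BY_TIER[tier]


def meets_rpo(category: str, data_loss_window: timedelta) -> bool:
    """True if the measured data-loss window is within the category's RPO."""
    return data_loss_window <= RPO_BY_CATEGORY[category]


print(meets_rto(1, timedelta(minutes=8)))
print(meets_rto(1, timedelta(hours=2)))
```

For example, an 8-minute recovery of a Tier 1 system meets its target, while a 2-hour recovery of the same system does not.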
5. Backup Strategy
5.1 RDS PostgreSQL (Multi-AZ)
RDS Multi-AZ provides automatic failover with no application changes required.
| Feature | Configuration |
|---|---|
| Multi-AZ | Enabled (synchronous standby in separate AZ) |
| Automatic failover | Yes (60-120 seconds) |
| Backup window | 03:00-04:00 UTC |
| Maintenance window | Sun 05:00-06:00 UTC |
| Backup Type | Frequency | Retention | Storage |
|---|---|---|---|
| Automated snapshots | Daily | 30 days | RDS snapshots |
| Point-in-time recovery | Continuous (5-min granularity) | 30 days | RDS transaction logs |
| Manual snapshots | Before major changes | Indefinite | RDS snapshots |
| Monthly archive | Monthly | 7 years | S3 Glacier Deep Archive |
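Before starting a point-in-time recovery, it can help to confirm that the target timestamp falls inside the restorable window. The sketch below is a minimal check under two assumptions: the window starts 30 days before the current time (the retention in the table above), and the caller supplies the latest restorable time, which RDS reports for each instance. The function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

PITR_RETENTION = timedelta(days=30)  # retention from the table above


def pitr_target_is_valid(target: datetime, latest_restorable: datetime, now: datetime) -> bool:
    """True if `target` lies between the retention boundary and the latest restorable time."""
    earliest_restorable = now - PITR_RETENTION
    return earliest_restorable <= target <= latest_restorable


# Example values only; a real check would use the time reported by RDS.
now = datetime(2025, 12, 22, 12, 0, tzinfo=timezone.utc)
latest = now - timedelta(minutes=5)

print(pitr_target_is_valid(now - timedelta(days=2), latest, now))   # inside the window
print(pitr_target_is_valid(now - timedelta(days=31), latest, now))  # older than retention
print(pitr_target_is_valid(now, latest, now))                       # newer than latest restorable time
```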
5.2 EC2 Application (AMI-Based)
Application instances are backed up via Amazon Machine Images (AMIs), not container images.
| Backup Type | Frequency | Retention | Storage |
|---|---|---|---|
| Golden AMI | After each deployment | 30 days (last 10 versions) | EC2 AMI |
| Pre-change AMI | Before config changes | 7 days | EC2 AMI |
| Weekly AMI | Sunday 02:00 UTC | 90 days | EC2 AMI |
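The retention rules above could be applied by a cleanup job along the following lines. This is a sketch, not existing tooling: the tuple layout, the AMI IDs, and the function name are hypothetical, and it reads "30 days (last 10 versions)" as "expire golden AMIs older than 30 days, and also any beyond the 10 most recent", which is one possible interpretation.

```python
from datetime import datetime, timedelta, timezone

# Age-based retention from the table above.
RETENTION = {
    "golden": timedelta(days=30),
    "pre-change": timedelta(days=7),
    "weekly": timedelta(days=90),
}
GOLDEN_MAX_VERSIONS = 10  # assumption: "last 10 versions" caps the golden set


def expired_amis(amis, now):
    """Return the sorted IDs of AMIs that fall outside retention.

    `amis` is a list of (ami_id, kind, created_at) tuples.
    """
    expired = {ami_id for ami_id, kind, created in amis if now - created > RETENTION[kind]}
    # Newest golden AMIs first; anything past the version cap also expires.
    goldens = sorted((a for a in amis if a[1] == "golden"), key=lambda a: a[2], reverse=True)
    expired.update(ami_id for ami_id, _, _ in goldens[GOLDEN_MAX_VERSIONS:])
    return sorted(expired)


now = datetime(2025, 12, 22, tzinfo=timezone.utc)
inventory = [
    ("ami-golden-new", "golden", now - timedelta(days=3)),
    ("ami-golden-old", "golden", now - timedelta(days=45)),
    ("ami-prechange", "pre-change", now - timedelta(days=8)),
    ("ami-weekly", "weekly", now - timedelta(days=60)),
]
print(expired_amis(inventory, now))
```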
AMI Contents:
| Component | Included in AMI |
|---|---|
| Swift Vapor binary | Yes (compiled application) |
| System configuration | Yes (systemd, cloudflared, etc.) |
| Cloudflare Tunnel daemon | Yes (cloudflared) |
| Environment variables | No (pulled from Secrets Manager at boot) |
| SSL/TLS certificates | No (Cloudflare-managed) |
Recovery from AMI:
| Step | Action | Time |
|---|---|---|
| 1 | Launch new EC2 from latest AMI | 2-3 min |
| 2 | Instance boots, pulls secrets | 1 min |
| 3 | Cloudflare Tunnel reconnects | 30 sec |
| 4 | Health check passes | 30 sec |
| Total | | < 5 min |
5.3 Application State
| Component | Backup Method | Frequency |
|---|---|---|
| Swift Vapor source | Git repository | On commit |
| Compiled binary | Included in AMI | On deploy |
| Environment variables | AWS Secrets Manager | Versioned |
| SSL/TLS certificates | Cloudflare managed | Automatic |
| Cloudflare Tunnel config | Included in AMI + Cloudflare dashboard | On change |
5.4 Backup Verification
| Test | Frequency | Owner |
|---|---|---|
| AMI launch test | Weekly | Infrastructure |
| RDS snapshot restore test | Monthly | Infrastructure |
| Point-in-time recovery test | Quarterly | Infrastructure |
| Full DR drill (AMI + RDS) | Quarterly | Infrastructure + Operations |
6. Disaster Scenarios
6.1 Scenario Matrix
| Scenario | Likelihood | Impact | Recovery Procedure |
|---|---|---|---|
| EC2 instance failure | Low | High | Launch from AMI (< 5 min) |
| RDS primary failure | Low | Medium | Automatic Multi-AZ failover (60-120 sec) |
| RDS AZ failure | Very Low | Medium | Automatic Multi-AZ failover (60-120 sec) |
| Database corruption | Low | Critical | Point-in-time recovery |
| Cloudflare outage | Very Low | Critical | Direct access failover |
| Security breach | Low | Critical | Isolation + forensics |
| Application bug (data loss) | Medium | Medium | Point-in-time recovery |
| Bad deployment | Medium | High | Rollback to previous AMI (< 5 min) |
6.2 EC2 Instance Failure
Because the application runs on a single instance, an EC2 failure causes a full outage until recovery completes.
| Step | Action | Owner | Target Time |
|---|---|---|---|
| 1 | Detect instance unavailability | CloudWatch alarm | < 1 min |
| 2 | Confirm failure (not transient) | Auto-recovery or operator | 2 min |
| 3 | Launch new instance from latest AMI | Infrastructure | 2-3 min |
| 4 | Instance boots, pulls secrets from Secrets Manager | Automatic | 1 min |
| 5 | Cloudflare Tunnel reconnects | Automatic | 30 sec |
| 6 | Health check passes, traffic resumes | Automatic | 30 sec |
| 7 | Notify stakeholders | Operations | 5 min |
| Total | | | < 10 min |
Automation option: EC2 Auto Recovery can automatically recover the instance after an underlying hardware failure.
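The target total for this scenario can be sanity-checked by summing the upper bound of each step. The sketch below assumes that stakeholder notification (step 7) runs in parallel with recovery and does not delay service restoration; that assumption is not stated in the table, and the names are illustrative.

```python
# (step, upper-bound seconds, blocks service restoration?)
EC2_FAILURE_STEPS = [
    ("Detect instance unavailability", 60, True),
    ("Confirm failure (not transient)", 120, True),
    ("Launch new instance from latest AMI", 180, True),
    ("Instance boots, pulls secrets", 60, True),
    ("Cloudflare Tunnel reconnects", 30, True),
    ("Health check passes, traffic resumes", 30, True),
    ("Notify stakeholders", 300, False),  # assumption: runs in parallel
]


def restoration_seconds(steps):
    """Sum the worst-case duration of the steps that block restoration."""
    return sum(seconds for _, seconds, blocking in steps if blocking)


total = restoration_seconds(EC2_FAILURE_STEPS)
print(f"Worst-case restoration: {total / 60:.0f} min")
```

Under these assumptions the blocking steps add up to 8 minutes, which is inside the 10-minute target.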
6.3 RDS Multi-AZ Failover
RDS handles this automatically with no operator intervention.
| Step | Action | Owner | Target Time |
|---|---|---|---|
| 1 | Primary instance failure detected | RDS | Immediate |
| 2 | DNS CNAME updated to standby | RDS | 30-60 sec |
| 3 | Standby promoted to primary | RDS | 30-60 sec |
| 4 | Application reconnects automatically | Vapor connection pool | < 30 sec |
| 5 | New standby provisioned in background | RDS | Minutes (non-blocking) |
| Total | | | 60-120 sec (automatic) |
Application behavior during failover:
- Vapor connection pool detects closed connections
- Automatic reconnection to new primary
- Brief error responses during 60-120 second window
- No data loss (synchronous replication)
6.4 Database Corruption
| Step | Action | Owner | Target Time |
|---|---|---|---|
| 1 | Detect data anomaly | Automated/Operations | Variable |
| 2 | Stop application writes | Infrastructure | 5 min |
| 3 | Identify corruption timestamp | Infrastructure | 30 min |
| 4 | Initiate point-in-time recovery to new instance | Infrastructure | 15 min |
| 5 | RDS restores to target time | RDS | 30-60 min |
| 6 | Verify data integrity on new instance | Operations | 30 min |
| 7 | Update application connection string | Infrastructure | 5 min |
| 8 | Resume operations | Operations | 5 min |
| 9 | Post-incident review | All teams | 24 hours |
| Total | | | ~2-2.5 hours (steps 2-8) |
6.5 Bad Deployment Rollback
| Step | Action | Owner | Target Time |
|---|---|---|---|
| 1 | Detect application issue | Monitoring/Operations | Variable |
| 2 | Decision to rollback | Operations | 5 min |
| 3 | Launch instance from previous AMI | Infrastructure | 2-3 min |
| 4 | Instance boots, tunnel reconnects | Automatic | 1-2 min |
| 5 | Terminate bad instance | Infrastructure | 1 min |
| 6 | Verify rollback successful | Operations | 5 min |
| Total | | | < 15 min |
6.6 Cloudflare Service Disruption
| Step | Action | Owner | Target Time |
|---|---|---|---|
| 1 | Detect Cloudflare unavailability | Monitoring | < 5 min |
| 2 | Assess scope (DNS vs Tunnel vs Full) | Infrastructure | 5 min |
| 3 | Activate backup DNS (Route 53) | Infrastructure | 10 min |
| 4 | Temporarily assign public IP to EC2 | Infrastructure | 5 min |
| 5 | Update security group for direct access | Infrastructure | 5 min |
| 6 | Enable AWS WAF rules | Infrastructure | 5 min |
| 7 | Monitor for attacks | Security | Continuous |
| 8 | Revert when Cloudflare restored | Infrastructure | 15 min |
7. Multi-Tenant Considerations
7.1 Tenant Isolation During Recovery
| Requirement | Implementation |
|---|---|
| Data isolation maintained | Organization-scoped recovery queries |
| No cross-tenant data exposure | JWT org claims validated post-recovery |
| Tenant-specific rollback | Supported via org_id partitioning |
| Audit trail preservation | Tenant-scoped audit logs maintained |
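A post-recovery check of organization claims might look like the sketch below. It is a simplified illustration using only the Python standard library, not the production validation path: it assumes HS256-signed tokens and a claim named `org_id`, neither of which is specified in this document, and it deliberately omits expiry and header checks. The helper names are hypothetical.

```python
import base64
import hashlib
import hmac
import json


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def _b64url_decode(segment: str) -> bytes:
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def sign_hs256(payload: dict, key: bytes) -> str:
    """Build an HS256 token for testing (assumed algorithm)."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url_encode(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    signature = hmac.new(key, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url_encode(signature)}"


def org_claim_matches(token: str, key: bytes, resource_org_id: str) -> bool:
    """True only if the signature verifies and the org_id claim equals the resource's org."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        signing_input = f"{header_b64}.{payload_b64}".encode("ascii")
        expected = hmac.new(key, signing_input, hashlib.sha256).digest()
        if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
            return False
        claims = json.loads(_b64url_decode(payload_b64))
    except ValueError:  # malformed token, bad base64, or invalid JSON
        return False
    return isinstance(claims, dict) and claims.get("org_id") == resource_org_id


key = b"example-key-for-illustration-only"
token = sign_hs256({"sub": "user-1", "org_id": "org-1"}, key)
print(org_claim_matches(token, key, "org-1"))
print(org_claim_matches(token, key, "org-2"))
```

Running a check like this against a sample of tokens and records after recovery gives a quick signal that tenant scoping still holds.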
7.2 Tenant Communication
| Event | Notification | Channel |
|---|---|---|
| Planned maintenance | 72 hours advance | Email + In-app |
| Unplanned outage | Within 15 minutes | Status page + Email |
| Recovery complete | Immediately | Status page + Email |
| Post-incident report | Within 48 hours | |
8. Recovery Procedures
8.1 Swift Vapor Application Recovery (AMI-Based)
Recovery Checklist:
[ ] Identify target AMI (latest golden or specific version)
[ ] Verify AMI available in region
[ ] Launch new EC2 instance from AMI
- Instance type: r6g.xlarge
- Subnet: private subnet with NAT
- Security group: app-server-sg (no inbound, egress to RDS + Secrets Manager)
- IAM role: fleetlink-app-role (Secrets Manager read)
[ ] Wait for instance status checks to pass
[ ] Verify Cloudflare Tunnel connectivity (check Cloudflare dashboard)
[ ] Run health check: curl https://api.fleetlink.com/health
[ ] Verify JWT validation: test auth endpoint
[ ] Confirm database connectivity: check /health/db endpoint
[ ] Verify WebSocket connections accepting
[ ] Monitor error rates in CloudWatch
[ ] Terminate old instance (if applicable)
8.2 PostgreSQL Recovery (RDS Multi-AZ)
For Multi-AZ Failover (automatic):
No action required; RDS handles the failover automatically.
[ ] Monitor RDS events for failover completion
[ ] Verify application reconnected (check logs)
[ ] Confirm no data loss
For Point-in-Time Recovery:
Recovery Checklist:
[ ] Identify target recovery timestamp
[ ] Initiate PITR via RDS console/CLI
- New instance identifier: fleetlink-db-recovered-YYYYMMDD
- Target time: [specific timestamp]
- Instance class: db.t3.small or db.t3.medium
- Multi-AZ: Yes
[ ] Wait for new instance availability (30-60 min)
[ ] Run data integrity queries:
- SELECT COUNT(*) FROM users;
- SELECT COUNT(*) FROM rides WHERE created_at > '[recovery_point]';
- SELECT COUNT(DISTINCT org_id) FROM users;
[ ] Update Secrets Manager with new endpoint
[ ] Restart application to pick up new connection
[ ] Test read operations
[ ] Test write operations
[ ] Verify org_id constraints intact
[ ] Resume normal operations
[ ] Delete old instance after verification period (24-48 hours)
8.3 Authentication Recovery
| Component | Recovery Method |
|---|---|
| JWT signing keys | Restore from Secrets Manager (versioned) |
| Role definitions | Restored with AMI (compiled into app) |
| Active sessions | Stateless - no recovery needed |
| Refresh tokens | Re-authentication required |
| Organization configs | Restored with database |
8.4 WebSocket Connection Recovery
| Scenario | Client Behavior | Server Behavior |
|---|---|---|
| Instance restart | Clients auto-reconnect | Accept new connections |
| Brief network blip | WebSocket ping/pong timeout | Connection cleanup |
| Cloudflare tunnel restart | Transparent to clients | Re-establishes tunnel |
Client reconnection policy:
- Exponential backoff: 1s, 2s, 4s, 8s, max 30s
- GPS updates queued locally during disconnection
- Batch upload on reconnection
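The reconnection policy above can be sketched as follows. The example is in Python for illustration only (the actual clients may be written in another language), and the function and class names are hypothetical. The policy does not mention jitter, so none is added here.

```python
from collections import deque


def backoff_delays(attempts: int, base: float = 1.0, factor: float = 2.0, cap: float = 30.0):
    """Delay in seconds before each reconnection attempt, doubling up to `cap`."""
    return [min(base * factor ** n, cap) for n in range(attempts)]


class GpsUpdateBuffer:
    """Holds GPS updates while disconnected and releases them as one batch."""

    def __init__(self):
        self._pending = deque()

    def record(self, update):
        self._pending.append(update)

    def drain(self):
        """Return all queued updates in arrival order and empty the buffer."""
        batch = list(self._pending)
        self._pending.clear()
        return batch


print(backoff_delays(7))

buffer = GpsUpdateBuffer()
buffer.record({"lat": 40.0, "lon": -75.0})
buffer.record({"lat": 40.1, "lon": -75.1})
batch = buffer.drain()
print(len(batch))
```

With the defaults, the first seven delays are 1, 2, 4, 8, 16, 30, and 30 seconds, matching the sequence in the policy once the 30-second cap applies.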
9. Testing Schedule
9.1 DR Test Calendar
| Test Type | Frequency | Scope | Duration |
|---|---|---|---|
| AMI launch test | Weekly | Launch from latest AMI, verify health | 30 min |
| RDS failover test | Monthly | Trigger Multi-AZ failover | 15 min |
| RDS PITR test | Quarterly | Point-in-time recovery to test instance | 2 hours |
| Full DR drill | Quarterly | AMI recovery + RDS restore | 4 hours |
| Tabletop exercise | Semi-annually | All scenarios | Half day |
9.2 Test Documentation
Each DR test must document:
| Field | Required |
|---|---|
| Test date and time | Yes |
| Scenario tested | Yes |
| Participants | Yes |
| Actual recovery time | Yes |
| Issues encountered | Yes |
| Remediation actions | Yes |
| Sign-off | Yes |
10. Roles and Responsibilities
10.1 DR Team
| Role | Primary | Backup |
|---|---|---|
| DR Coordinator | Infrastructure Director | Operations Director |
| Database Recovery | Senior DBA | Infrastructure Engineer |
| Application Recovery | Lead Developer | DevOps Engineer |
| Network Recovery | Infrastructure Engineer | Security Engineer |
| Communications | Operations Director | Support Lead |
10.2 Escalation Matrix
| Severity | Response Time | Escalation Path |
|---|---|---|
| Critical (full outage) | Immediate | DR Coordinator > CTO > CEO |
| High (partial outage) | 15 minutes | On-call > DR Coordinator |
| Medium (degraded) | 1 hour | On-call > Team Lead |
| Low (non-impacting) | 4 hours | Standard ticket |
11. Dependencies
11.1 External Dependencies
| Service | Criticality | Fallback | Recovery |
|---|---|---|---|
| AWS EC2 | Critical | Launch from AMI | < 5 min |
| AWS RDS | Critical | Multi-AZ automatic failover | 60-120 sec |
| Cloudflare | Critical | Route 53 + direct access | 15 min |
| Secrets Manager | High | Values cached in app memory | Restart required |
11.2 Internal Dependencies
| System | Depends On | Impact if Unavailable |
|---|---|---|
| API | RDS, Secrets Manager | Full outage |
| WebSocket | EC2 instance | GPS updates paused, client queue |
| Auth | Secrets Manager (JWT keys) | No new sessions |
| Voice Agent | API, Twilio | Voice unavailable |
| Notifications | API, Email provider | Delayed communications |
11.3 Single Point of Failure Analysis
| Component | SPOF? | Mitigation |
|---|---|---|
| EC2 instance | Yes | AMI-based recovery < 5 min |
| RDS primary | No | Multi-AZ automatic failover |
| Cloudflare Tunnel | No | Auto-reconnect, can use direct access |
| Secrets Manager | No | Regional service, multi-AZ |
| S3 (AMIs) | No | Regional service, 11 9s durability |
12. Document References
| Document | Relevance |
|---|---|
| PLCY-INC-001 Incident Response | Incident declaration procedures |
| PLCY-SEC-001 Security Controls | Security requirements during recovery |
| PLCY-AUD-001 Audit Trail Specs | Audit requirements for DR events |
| PLCY-RET-001 Records Retention | Backup retention requirements |
13. Review and Maintenance
| Activity | Frequency | Owner |
|---|---|---|
| Policy review | Annual | Infrastructure Director |
| Contact list update | Quarterly | Operations |
| Procedure validation | After each DR test | DR Coordinator |
| Technology review | Semi-annual | Infrastructure |
14. Revision History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | December 22, 2025 | Infrastructure Director | Initial release |
| 1.1 | December 22, 2025 | Infrastructure Director | Updated to single-box AMI architecture, RDS Multi-AZ, capacity sizing |