Disaster Recovery Plan

Document ID: PLCY-DRP-001
Version: 1.1
Effective Date: December 22, 2025
Last Review: December 22, 2025
Owner: Hop And Haul Team

CONFIDENTIAL

This document is CONFIDENTIAL and for internal use only. Do not distribute outside the organization.

1. Purpose

This document establishes the disaster recovery (DR) procedures for Hop And Haul's production infrastructure, ensuring business continuity and data integrity in the event of system failures, natural disasters, or security incidents.

2. Scope

This policy applies to all Hop And Haul production systems including:

AWS infrastructure (compute, storage, networking)
Cloudflare DNS and Zero Trust services
PostgreSQL databases
Swift Vapor application servers
Authentication and authorization systems

3. Infrastructure Architecture

3.1 Network Architecture

Layer	Technology	Purpose
DNS	Cloudflare	DNS resolution, DDoS protection
Edge Security	Cloudflare Zero Trust	Identity-aware access, tunnel ingress
Tunnels	Cloudflare Tunnel	No public ports exposed
Compute	AWS EC2 (single instance)	Application hosting
Database	AWS RDS PostgreSQL Multi-AZ	Persistent data storage with automatic failover
Storage	AWS S3	Object storage, AMI backups

3.2 Capacity Sizing

Resource	Specification	Justification
EC2 Instance	r6g.xlarge (32GB RAM, 4 vCPU)	Headroom for 5000 concurrent connections
RDS Instance	db.t3.small (2GB) or db.t3.medium (4GB)	Connection pooling limits actual DB connections to ~50
Max Users	5,000 drivers/riders	Same user pool (drivers are riders)
Max WebSocket Connections	5,000	GPS updates via persistent WebSocket
Max RPS	5,000	Well within single-instance Swift Vapor capacity

Sizing Analysis:

Metric	Calculation	Result
WebSocket memory	5,000 connections x 50KB	250MB
Swift Vapor overhead	Base + connection handling	~2GB
Available headroom	32GB - 3GB used	29GB buffer
RDS connections	Vapor connection pool	20-50 actual connections
RDS memory pressure	50 connections x 10MB work_mem	500MB
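
The arithmetic behind the sizing table can be reproduced with a short script. This is a minimal sketch that only restates the figures above; the constant and function names are illustrative, and decimal units (1 GB = 1000 MB) are assumed.

```python
# Capacity figures taken from the sizing tables above; decimal units assumed.
CONNECTIONS = 5_000        # max concurrent WebSocket connections
KB_PER_CONNECTION = 50     # per-connection memory estimate
VAPOR_OVERHEAD_GB = 2.0    # "~2GB" base + connection handling
INSTANCE_RAM_GB = 32       # r6g.xlarge
DB_CONNECTIONS = 50        # upper bound of the Vapor connection pool
WORK_MEM_MB = 10           # per-connection work_mem figure from the table


def websocket_memory_mb(connections: int, kb_each: int) -> float:
    """Total WebSocket memory in MB."""
    return connections * kb_each / 1000


def headroom_gb(total_gb: float, used_gb: float) -> float:
    """RAM left after the application's estimated usage."""
    return total_gb - used_gb


ws_mb = websocket_memory_mb(CONNECTIONS, KB_PER_CONNECTION)
used_gb = VAPOR_OVERHEAD_GB + ws_mb / 1000   # 2.25 GB before rounding
rds_mb = DB_CONNECTIONS * WORK_MEM_MB

print(f"WebSocket memory: {ws_mb:.0f} MB")
print(f"Estimated usage: {used_gb:.2f} GB (the table rounds up to 3 GB)")
print(f"Headroom at 3 GB used: {headroom_gb(INSTANCE_RAM_GB, 3):.0f} GB")
print(f"RDS work_mem pressure: {rds_mb} MB")
```

The 3GB "used" figure in the table is the 2.25GB estimate rounded up, so the 29GB buffer is slightly conservative.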

3.3 Security Posture

Control	Implementation
Public Port Exposure	Zero (all traffic via Cloudflare Tunnel)
Authentication	Stateless JWT with role-based access
Multi-tenancy	Organization-scoped data isolation
Application Runtime	Swift Vapor (compiled, memory-safe)
Database Access	Application-only (no direct access)

3.4 Single-Instance Architecture Rationale

Given the workload characteristics (5,000 max concurrent users, WebSocket-based GPS), a single EC2 instance provides:

Benefit	Description
Simplicity	No load balancer, no session affinity concerns
WebSocket affinity	All connections to single instance, no sticky session routing
Cost efficiency	One instance vs. cluster overhead
Reduced failure modes	Fewer components to fail
RDS Multi-AZ	Database HA handled by AWS, not application

Trade-off acknowledged: Single point of failure for compute. Mitigation: AMI-based rapid recovery (see Section 5.2).

4. Recovery Objectives

4.1 Recovery Time Objective (RTO)

Tier	Systems	RTO
Tier 1 - Critical	Authentication, Core API, Driver matching	1 hour
Tier 2 - Essential	Reporting, Notifications, Voice agent	4 hours
Tier 3 - Standard	Analytics, Admin dashboards	24 hours

4.2 Recovery Point Objective (RPO)

Data Category	RPO	Backup Method
Transaction data	5 minutes	Continuous replication
User/Driver data	1 hour	Hourly snapshots
Audit logs	1 hour	Continuous streaming
Configuration	24 hours	Daily snapshots
Analytics	24 hours	Daily exports

5. Backup Strategy

5.1 RDS PostgreSQL (Multi-AZ)

RDS Multi-AZ provides automatic failover with no application changes required.

Feature	Configuration
Multi-AZ	Enabled (synchronous standby in separate AZ)
Automatic failover	Yes (60-120 seconds)
Backup window	03:00-04:00 UTC
Maintenance window	Sun 05:00-06:00 UTC

Backup Type	Frequency	Retention	Storage
Automated snapshots	Daily	30 days	RDS snapshots
Point-in-time recovery	Continuous (5-min granularity)	30 days	RDS transaction logs
Manual snapshots	Before major changes	Indefinite	RDS snapshots
Monthly archive	Monthly	7 years	S3 Glacier Deep Archive

5.2 EC2 Application (AMI-Based)

Application instances are backed up via Amazon Machine Images (AMIs), not container images.

Backup Type	Frequency	Retention	Storage
Golden AMI	After each deployment	30 days (last 10 versions)	EC2 AMI
Pre-change AMI	Before config changes	7 days	EC2 AMI
Weekly AMI	Sunday 02:00 UTC	90 days	EC2 AMI
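
The Golden AMI retention rule "30 days (last 10 versions)" can be read in more than one way. The sketch below shows one possible reading, stated as an assumption: an image is kept only if it is at most 30 days old and among the 10 newest. The function name, parameters, and sample history are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def amis_to_keep(created, now, max_age_days=30, max_versions=10):
    """Return creation times of golden AMIs to retain, newest first.

    Assumed reading of "30 days (last 10 versions)": keep an AMI only if it
    is at most max_age_days old AND among the max_versions newest.
    """
    cutoff = now - timedelta(days=max_age_days)
    newest_first = sorted(created, reverse=True)
    recent = [t for t in newest_first if t >= cutoff]
    return recent[:max_versions]


now = datetime(2025, 12, 22, tzinfo=timezone.utc)
# Hypothetical history: one golden AMI every 2 days for 40 days (20 images).
history = [now - timedelta(days=2 * i) for i in range(20)]
keep = amis_to_keep(history, now)
print(len(keep), keep[-1].date())
```

Under this reading, a period with no deployments for more than 30 days would leave no golden AMI at all, so a real pruning job would probably want to always keep at least the newest image.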

AMI Contents:

Component	Included in AMI
Swift Vapor binary	Yes (compiled application)
System configuration	Yes (systemd, cloudflared, etc.)
Cloudflare Tunnel daemon	Yes (cloudflared)
Environment variables	No (pulled from Secrets Manager at boot)
SSL/TLS certificates	No (Cloudflare-managed)

Recovery from AMI:

Step	Action	Time
1	Launch new EC2 from latest AMI	2-3 min
2	Instance boots, pulls secrets	1 min
3	Cloudflare Tunnel reconnects	30 sec
4	Health check passes	30 sec
Total		< 5 min
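
As a cross-check of the total, the per-step times can be summed. This is a minimal sketch that assumes the steps run strictly one after another; the step list mirrors the table above and the helper name is illustrative.

```python
# (step, fastest seconds, slowest seconds) from the "Recovery from AMI" table.
STEPS = [
    ("Launch new EC2 from latest AMI", 120, 180),
    ("Instance boots, pulls secrets", 60, 60),
    ("Cloudflare Tunnel reconnects", 30, 30),
    ("Health check passes", 30, 30),
]


def total_range(steps):
    """Sum the per-step bounds, assuming the steps run strictly in sequence."""
    best = sum(low for _, low, _ in steps)
    worst = sum(high for _, _, high in steps)
    return best, worst


best, worst = total_range(STEPS)
print(f"best case: {best / 60:.1f} min, worst case: {worst / 60:.1f} min")
```

The upper bounds add up to exactly 5 minutes, so the "< 5 min" target holds only when the instance launch finishes faster than its 3-minute upper bound.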

5.3 Application State

Component	Backup Method	Frequency
Swift Vapor source	Git repository	On commit
Compiled binary	Included in AMI	On deploy
Environment variables	AWS Secrets Manager	Versioned
SSL/TLS certificates	Cloudflare managed	Automatic
Cloudflare Tunnel config	Included in AMI + Cloudflare dashboard	On change

5.4 Backup Verification

Test	Frequency	Owner
AMI launch test	Weekly	Infrastructure
RDS snapshot restore test	Monthly	Infrastructure
Point-in-time recovery test	Quarterly	Infrastructure
Full DR drill (AMI + RDS)	Quarterly	Infrastructure + Operations

6. Disaster Scenarios

6.1 Scenario Matrix

Scenario	Likelihood	Impact	Recovery Procedure
EC2 instance failure	Low	High	Launch from AMI (< 5 min)
RDS primary failure	Low	Medium	Automatic Multi-AZ failover (60-120 sec)
RDS AZ failure	Very Low	Medium	Automatic Multi-AZ failover (60-120 sec)
Database corruption	Low	Critical	Point-in-time recovery
Cloudflare outage	Very Low	Critical	Direct access failover
Security breach	Low	Critical	Isolation + forensics
Application bug (data loss)	Medium	Medium	Point-in-time recovery
Bad deployment	Medium	High	Rollback to previous AMI (< 5 min)

6.2 EC2 Instance Failure

Because the architecture is single-box, an EC2 failure causes a full outage until recovery completes.

Step	Action	Owner	Target Time
1	Detect instance unavailability	CloudWatch alarm	< 1 min
2	Confirm failure (not transient)	Auto-recovery or operator	2 min
3	Launch new instance from latest AMI	Infrastructure	2-3 min
4	Instance boots, pulls secrets from Secrets Manager	Automatic	1 min
5	Cloudflare Tunnel reconnects	Automatic	30 sec
6	Health check passes, traffic resumes	Automatic	30 sec
7	Notify stakeholders	Operations	5 min
Total (steps 1-6, service restored)			< 10 min

Automation option: EC2 Auto Recovery can automatically recover the instance after an underlying hardware failure.

6.3 RDS Multi-AZ Failover

RDS handles this automatically with no operator intervention.

Step	Action	Owner	Target Time
1	Primary instance failure detected	RDS	Immediate
2	DNS CNAME updated to standby	RDS	30-60 sec
3	Standby promoted to primary	RDS	30-60 sec
4	Application reconnects automatically	Vapor connection pool	< 30 sec
5	New standby provisioned in background	RDS	Minutes (non-blocking)
Total			60-120 sec (automatic)

Application behavior during failover:

Vapor connection pool detects closed connections
Automatic reconnection to new primary
Brief error responses during 60-120 second window
No data loss (synchronous replication)

6.4 Database Corruption

Step	Action	Owner	Target Time
1	Detect data anomaly	Automated/Operations	Variable
2	Stop application writes	Infrastructure	5 min
3	Identify corruption timestamp	Infrastructure	30 min
4	Initiate point-in-time recovery to new instance	Infrastructure	15 min
5	RDS restores to target time	RDS	30-60 min
6	Verify data integrity on new instance	Operations	30 min
7	Update application connection string	Infrastructure	5 min
8	Resume operations	Operations	5 min
9	Post-incident review	All teams	24 hours
Total (steps 2-8)			~2-2.5 hours

6.5 Bad Deployment Rollback

Step	Action	Owner	Target Time
1	Detect application issue	Monitoring/Operations	Variable
2	Decision to rollback	Operations	5 min
3	Launch instance from previous AMI	Infrastructure	2-3 min
4	Instance boots, tunnel reconnects	Automatic	1-2 min
5	Terminate bad instance	Infrastructure	1 min
6	Verify rollback successful	Operations	5 min
Total			< 15 min

6.6 Cloudflare Service Disruption

Step	Action	Owner	Target Time
1	Detect Cloudflare unavailability	Monitoring	< 5 min
2	Assess scope (DNS vs Tunnel vs Full)	Infrastructure	5 min
3	Activate backup DNS (Route 53)	Infrastructure	10 min
4	Temporarily assign public IP to EC2	Infrastructure	5 min
5	Update security group for direct access	Infrastructure	5 min
6	Enable AWS WAF rules	Infrastructure	5 min
7	Monitor for attacks	Security	Continuous
8	Revert when Cloudflare restored	Infrastructure	15 min

7. Multi-Tenant Considerations

7.1 Tenant Isolation During Recovery

Requirement	Implementation
Data isolation maintained	Organization-scoped recovery queries
No cross-tenant data exposure	JWT org claims validated post-recovery
Tenant-specific rollback	Supported via org_id partitioning
Audit trail preservation	Tenant-scoped audit logs maintained

7.2 Tenant Communication

Event	Notification	Channel
Planned maintenance	72 hours advance	Email + In-app
Unplanned outage	Within 15 minutes	Status page + Email
Recovery complete	Immediately	Status page + Email
Post-incident report	Within 48 hours	Email

8. Recovery Procedures

8.1 Swift Vapor Application Recovery (AMI-Based)

Recovery Checklist:
[ ] Identify target AMI (latest golden or specific version)
[ ] Verify AMI available in region
[ ] Launch new EC2 instance from AMI
    - Instance type: r6g.xlarge
    - Subnet: private subnet with NAT
    - Security group: app-server-sg (no inbound; egress to RDS, Secrets Manager, and the Cloudflare edge for cloudflared)
    - IAM role: fleetlink-app-role (Secrets Manager read)
[ ] Wait for instance status checks to pass
[ ] Verify Cloudflare Tunnel connectivity (check Cloudflare dashboard)
[ ] Run health check: curl https://api.fleetlink.com/health
[ ] Verify JWT validation: test auth endpoint
[ ] Confirm database connectivity: check /health/db endpoint
[ ] Verify WebSocket connections are being accepted
[ ] Monitor error rates in CloudWatch
[ ] Terminate old instance (if applicable)
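
The health-check steps in this checklist lend themselves to a small polling helper. The sketch below is illustrative only: the function names, attempt count, and delay are assumptions, and it treats any HTTP 200 from the health URL as healthy.

```python
import time
from typing import Callable
from urllib.request import urlopen


def wait_until_healthy(
    check: Callable[[], bool],
    attempts: int = 10,
    delay_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call check() until it returns True; return the number of attempts used.

    Raises TimeoutError if the service is still unhealthy after `attempts` tries.
    """
    for attempt in range(1, attempts + 1):
        try:
            if check():
                return attempt
        except OSError:
            # Refused connections, DNS errors, and HTTP errors are expected
            # while the new instance is still booting.
            pass
        if attempt < attempts:
            sleep(delay_seconds)
    raise TimeoutError(f"not healthy after {attempts} attempts")


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """True if a GET to url returns HTTP 200 (urllib raises on 4xx/5xx)."""
    with urlopen(url, timeout=timeout) as response:
        return response.status == 200
```

A recovery run might call `wait_until_healthy(lambda: http_ok("https://api.fleetlink.com/health"))` and then repeat the call for the /health/db endpoint. The `sleep` parameter exists so the helper can be exercised in tests without real waiting.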

8.2 PostgreSQL Recovery (RDS Multi-AZ)

For Multi-AZ Failover (automatic):

No operator action is required for the failover itself; RDS handles it automatically. Afterwards:
[ ] Monitor RDS events for failover completion
[ ] Verify application reconnected (check logs)
[ ] Confirm no data loss

For Point-in-Time Recovery:

Recovery Checklist:
[ ] Identify target recovery timestamp
[ ] Initiate PITR via RDS console/CLI
    - New instance identifier: fleetlink-db-recovered-YYYYMMDD
    - Target time: [specific timestamp]
    - Instance class: db.t3.small or db.t3.medium
    - Multi-AZ: Yes
[ ] Wait for new instance availability (30-60 min)
[ ] Run data integrity queries:
    - SELECT COUNT(*) FROM users;
    - SELECT COUNT(*) FROM rides WHERE created_at > '[recovery_point]';
    - SELECT COUNT(DISTINCT org_id) FROM users;
[ ] Update Secrets Manager with new endpoint
[ ] Restart application to pick up new connection
[ ] Test read operations
[ ] Test write operations
[ ] Verify org_id constraints intact
[ ] Resume normal operations
[ ] Delete old instance after verification period (24-48 hours)
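
Before initiating a point-in-time recovery, the instance identifier and target time can be prepared and sanity-checked. This is a minimal sketch: the helper names are hypothetical, the 30-day window comes from Section 5.1, and the sample timestamps are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=30)  # PITR retention from Section 5.1


def recovered_instance_id(today: datetime) -> str:
    """Build the identifier using the checklist's YYYYMMDD naming pattern."""
    return f"fleetlink-db-recovered-{today:%Y%m%d}"


def validate_target_time(target: datetime, now: datetime) -> None:
    """Reject restore targets that are in the future or older than retention."""
    if target.tzinfo is None or now.tzinfo is None:
        raise ValueError("use timezone-aware timestamps (UTC) to avoid ambiguity")
    if target > now:
        raise ValueError("target time is in the future")
    if now - target > RETENTION:
        raise ValueError("target time is outside the 30-day PITR window")


# Hypothetical example: restore to three hours before "now".
now = datetime(2025, 12, 22, 14, 0, tzinfo=timezone.utc)
target = now - timedelta(hours=3)
validate_target_time(target, now)
print(recovered_instance_id(now), target.isoformat())
```

The check is deliberately local. The restore request itself remains the authority on whether a given timestamp is restorable, since the most recent restorable point can lag the current time by a few minutes.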

8.3 Authentication Recovery

Component	Recovery Method
JWT signing keys	Restore from Secrets Manager (versioned)
Role definitions	Restored with AMI (compiled into app)
Active sessions	Stateless - no recovery needed
Refresh tokens	Re-authentication required
Organization configs	Restored with database

8.4 WebSocket Connection Recovery

Scenario	Client Behavior	Server Behavior
Instance restart	Clients auto-reconnect	Accept new connections
Brief network blip	WebSocket ping/pong timeout	Connection cleanup
Cloudflare tunnel restart	Transparent to clients	Re-establishes tunnel

Client reconnection policy:

Exponential backoff: 1s, 2s, 4s, 8s, max 30s
GPS updates queued locally during disconnection
Batch upload on reconnection
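
The reconnection policy above can be sketched as follows. This is an illustration in Python, not the client's actual code: it assumes the delay keeps doubling after 8s (so 16s comes next) until it reaches the 30s cap, and the buffer size limit is a hypothetical value.

```python
from collections import deque
from itertools import islice


def backoff_delays(base=1.0, factor=2.0, cap=30.0):
    """Yield reconnect delays in seconds: 1, 2, 4, 8, 16, then 30 forever."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


class GpsBuffer:
    """Holds GPS fixes while disconnected and releases them as one batch."""

    def __init__(self, max_items=10_000):
        # Bounded so a long outage cannot exhaust device memory;
        # when full, the oldest fixes are dropped first.
        self._items = deque(maxlen=max_items)

    def add(self, fix):
        self._items.append(fix)

    def drain(self):
        """Return all queued fixes in order and clear the buffer."""
        batch = list(self._items)
        self._items.clear()
        return batch


print(list(islice(backoff_delays(), 7)))
```

With up to 5,000 clients reconnecting after an instance restart, adding random jitter to each delay may be worth considering so that reconnect attempts do not arrive in synchronized waves.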

9. Testing Schedule

9.1 DR Test Calendar

Test Type	Frequency	Scope	Duration
AMI launch test	Weekly	Launch from latest AMI, verify health	30 min
RDS failover test	Monthly	Trigger Multi-AZ failover	15 min
RDS PITR test	Quarterly	Point-in-time recovery to test instance	2 hours
Full DR drill	Quarterly	AMI recovery + RDS restore	4 hours
Tabletop exercise	Semi-annually	All scenarios	Half day

9.2 Test Documentation

Each DR test must document:

Field	Required
Test date and time	Yes
Scenario tested	Yes
Participants	Yes
Actual recovery time	Yes
Issues encountered	Yes
Remediation actions	Yes
Sign-off	Yes
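
One way to capture these fields consistently is a small record type that also compares the actual recovery time against the RTO in Section 4.1. The sketch below is an assumption about how such a record could look: the class, field names, and the added `tier` field are hypothetical, and the sample values are invented.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# RTO targets from Section 4.1.
RTO = {
    "Tier 1": timedelta(hours=1),
    "Tier 2": timedelta(hours=4),
    "Tier 3": timedelta(hours=24),
}


@dataclass
class DrTestRecord:
    tested_at: datetime
    scenario: str
    participants: list
    actual_recovery: timedelta
    tier: str  # not a required field in the table; added here for the RTO check
    issues: list = field(default_factory=list)
    remediation: list = field(default_factory=list)
    signed_off_by: str = ""

    def met_rto(self) -> bool:
        """True if the measured recovery time is within the tier's RTO."""
        return self.actual_recovery <= RTO[self.tier]

    def missing_fields(self) -> list:
        """Names of required text fields that are still empty."""
        required = {
            "scenario": self.scenario,
            "participants": self.participants,
            "signed_off_by": self.signed_off_by,
        }
        return [name for name, value in required.items() if not value]


# Invented sample values for illustration.
record = DrTestRecord(
    tested_at=datetime(2025, 12, 22, 2, 0),
    scenario="EC2 instance failure",
    participants=["Infrastructure"],
    actual_recovery=timedelta(minutes=7),
    tier="Tier 1",
)
print(record.met_rto(), record.missing_fields())
```

A record that fails `met_rto()` or still lists missing fields would be a natural trigger for the remediation and sign-off steps.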

10. Roles and Responsibilities

10.1 DR Team

Role	Primary	Backup
DR Coordinator	Infrastructure Director	Operations Director
Database Recovery	Senior DBA	Infrastructure Engineer
Application Recovery	Lead Developer	DevOps Engineer
Network Recovery	Infrastructure Engineer	Security Engineer
Communications	Operations Director	Support Lead

10.2 Escalation Matrix

Severity	Response Time	Escalation Path
Critical (full outage)	Immediate	DR Coordinator > CTO > CEO
High (partial outage)	15 minutes	On-call > DR Coordinator
Medium (degraded)	1 hour	On-call > Team Lead
Low (non-impacting)	4 hours	Standard ticket

11. Dependencies

11.1 External Dependencies

Service	Criticality	Fallback	Recovery
AWS EC2	Critical	Launch from AMI	< 5 min
AWS RDS	Critical	Multi-AZ automatic failover	60-120 sec
Cloudflare	Critical	Route 53 + direct access	15 min
Secrets Manager	High	Values cached in app memory	Restart required

11.2 Internal Dependencies

System	Depends On	Impact if Unavailable
API	RDS, Secrets Manager	Full outage
WebSocket	EC2 instance	GPS updates paused, client queue
Auth	Secrets Manager (JWT keys)	No new sessions
Voice Agent	API, Twilio	Voice unavailable
Notifications	API, Email provider	Delayed communications

11.3 Single Point of Failure Analysis

Component	SPOF?	Mitigation
EC2 instance	Yes	AMI-based recovery < 5 min
RDS primary	No	Multi-AZ automatic failover
Cloudflare Tunnel	No	Auto-reconnect, can use direct access
Secrets Manager	No	Regional service, multi-AZ
S3 (AMIs)	No	Regional service, 99.999999999% (11 nines) durability

12. Document References

Document	Relevance
PLCY-INC-001 Incident Response	Incident declaration procedures
PLCY-SEC-001 Security Controls	Security requirements during recovery
PLCY-AUD-001 Audit Trail Specs	Audit requirements for DR events
PLCY-RET-001 Records Retention	Backup retention requirements

13. Review and Maintenance

Activity	Frequency	Owner
Policy review	Annual	Infrastructure Director
Contact list update	Quarterly	Operations
Procedure validation	After each DR test	DR Coordinator
Technology review	Semi-annual	Infrastructure

14. Revision History

Version	Date	Author	Changes
1.0	December 22, 2025	Infrastructure Director	Initial release
1.1	December 22, 2025	Infrastructure Director	Updated to single-box AMI architecture, RDS Multi-AZ, capacity sizing
