Disaster Recovery Plan

Document ID: PLCY-DRP-001
Version: 1.1
Effective Date: December 22, 2025
Last Review: December 22, 2025
Owner: Hop And Haul Team


CONFIDENTIAL

This document is CONFIDENTIAL and for internal use only. Do not distribute outside the organization.

1. Purpose

This document establishes the disaster recovery (DR) procedures for Hop And Haul's production infrastructure, ensuring business continuity and data integrity in the event of system failures, natural disasters, or security incidents.


2. Scope

This policy applies to all Hop And Haul production systems including:

  • AWS infrastructure (compute, storage, networking)
  • Cloudflare DNS and Zero Trust services
  • PostgreSQL databases
  • Swift Vapor application servers
  • Authentication and authorization systems

3. Infrastructure Architecture

3.1 Network Architecture

Layer | Technology | Purpose
DNS | Cloudflare | DNS resolution, DDoS protection
Edge Security | Cloudflare Zero Trust | Identity-aware access, tunnel ingress
Tunnels | Cloudflare Tunnel | No public ports exposed
Compute | AWS EC2 (single instance) | Application hosting
Database | AWS RDS PostgreSQL Multi-AZ | Persistent data storage with automatic failover
Storage | AWS S3 | Object storage, AMI backups

3.2 Capacity Sizing

Resource | Specification | Justification
EC2 Instance | r6g.xlarge (32GB RAM, 4 vCPU) | Headroom for 5,000 concurrent connections
RDS Instance | db.t3.small (2GB) or db.t3.medium (4GB) | Connection pooling limits actual DB connections to ~50
Max Users | 5,000 drivers/riders | Same user pool (drivers are riders)
Max WebSocket Connections | 5,000 | GPS updates via persistent WebSocket
Max RPS | 5,000 | Well within single-instance Swift Vapor capacity

Sizing Analysis:

Metric | Calculation | Result
WebSocket memory | 5,000 connections x 50KB | 250MB
Swift Vapor overhead | Base + connection handling | ~2GB
Available headroom | 32GB - 3GB used | 29GB buffer
RDS connections | Vapor connection pool | 20-50 actual connections
RDS memory pressure | 50 connections x 10MB work_mem | 500MB
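
The sizing figures above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: every name is an assumption introduced here, decimal units are assumed because they match the table's 250MB result, and the "3GB used" figure is treated as the ~2.25GB subtotal rounded up.

```python
# Illustrative re-derivation of the Section 3.2 sizing figures.
# All names are hypothetical; the input values are taken from the tables above.

KB_PER_MB = 1_000   # decimal units, matching the table's 250MB result
MB_PER_GB = 1_000

MAX_CONNECTIONS = 5_000
KB_PER_WEBSOCKET = 50
VAPOR_OVERHEAD_GB = 2.0
INSTANCE_RAM_GB = 32
ROUNDED_USAGE_GB = 3        # the table's "3GB used" figure
POOL_CONNECTIONS = 50       # upper end of the 20-50 pool range
WORK_MEM_MB = 10

websocket_mb = MAX_CONNECTIONS * KB_PER_WEBSOCKET / KB_PER_MB
estimated_usage_gb = websocket_mb / MB_PER_GB + VAPOR_OVERHEAD_GB
headroom_gb = INSTANCE_RAM_GB - ROUNDED_USAGE_GB
rds_work_mem_mb = POOL_CONNECTIONS * WORK_MEM_MB

print(f"WebSocket memory:   {websocket_mb:.0f} MB")
print(f"Estimated usage:    {estimated_usage_gb:.2f} GB (rounded up to {ROUNDED_USAGE_GB} GB)")
print(f"Available headroom: {headroom_gb} GB")
print(f"RDS work_mem total: {rds_work_mem_mb} MB")
```

Note that PostgreSQL applies work_mem per sort or hash operation rather than per connection, so the 500MB figure should be read as an estimate that assumes roughly one such operation per connection at a time.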

3.3 Security Posture

Control | Implementation
Public Port Exposure | Zero (all traffic via Cloudflare Tunnel)
Authentication | Stateless JWT with role-based access
Multi-tenancy | Organization-scoped data isolation
Application Runtime | Swift Vapor (compiled, memory-safe)
Database Access | Application-only (no direct access)

3.4 Single-Instance Architecture Rationale

Given the workload characteristics (5,000 max concurrent users, WebSocket-based GPS), a single EC2 instance provides:

Benefit | Description
Simplicity | No load balancer, no session affinity concerns
WebSocket affinity | All connections to single instance, no sticky session routing
Cost efficiency | One instance vs. cluster overhead
Reduced failure modes | Fewer components to fail
RDS Multi-AZ | Database HA handled by AWS, not application

Trade-off acknowledged: Single point of failure for compute. Mitigation: AMI-based rapid recovery (see Section 5.2).


4. Recovery Objectives

4.1 Recovery Time Objective (RTO)

Tier | Systems | RTO
Tier 1 - Critical | Authentication, Core API, Driver matching | 1 hour
Tier 2 - Essential | Reporting, Notifications, Voice agent | 4 hours
Tier 3 - Standard | Analytics, Admin dashboards | 24 hours

4.2 Recovery Point Objective (RPO)

Data Category | RPO | Backup Method
Transaction data | 5 minutes | Continuous replication
User/Driver data | 1 hour | Hourly snapshots
Audit logs | 1 hour | Continuous streaming
Configuration | 24 hours | Daily snapshots
Analytics | 24 hours | Daily exports
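
One way to monitor these targets is to compare the age of the newest recovery point in each data category against its RPO. The following is a minimal sketch under stated assumptions: the dictionary keys, the function name, and the sample timestamps are hypothetical, and how the newest recovery point is obtained for each category is left out.

```python
from datetime import datetime, timedelta, timezone

# RPO targets from Section 4.2; the keys are hypothetical labels.
RPO_TARGETS = {
    "transaction": timedelta(minutes=5),
    "user_driver": timedelta(hours=1),
    "audit_logs": timedelta(hours=1),
    "configuration": timedelta(hours=24),
    "analytics": timedelta(hours=24),
}


def rpo_overrun(category: str, last_recovery_point: datetime, now: datetime) -> timedelta:
    """Return how far the newest recovery point exceeds the RPO (zero if within target)."""
    age = now - last_recovery_point
    return max(age - RPO_TARGETS[category], timedelta(0))


# Hypothetical sample: the newest recovery point observed for each category.
now = datetime(2025, 12, 22, 12, 0, tzinfo=timezone.utc)
last_points = {
    "transaction": now - timedelta(minutes=3),
    "user_driver": now - timedelta(minutes=90),
    "audit_logs": now - timedelta(minutes=10),
    "configuration": now - timedelta(hours=2),
    "analytics": now - timedelta(hours=25),
}

overruns = {c: rpo_overrun(c, t, now) for c, t in last_points.items()}
violations = {c: o for c, o in overruns.items() if o > timedelta(0)}
print(violations)
```

In this sample, only the user/driver and analytics categories would be reported, because their newest recovery points are older than the 1-hour and 24-hour targets respectively.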

5. Backup Strategy

5.1 RDS PostgreSQL (Multi-AZ)

RDS Multi-AZ provides automatic failover with no application changes required.

Feature | Configuration
Multi-AZ | Enabled (synchronous standby in separate AZ)
Automatic failover | Yes (60-120 seconds)
Backup window | 03:00-04:00 UTC
Maintenance window | Sun 05:00-06:00 UTC

Backup Type | Frequency | Retention | Storage
Automated snapshots | Daily | 30 days | RDS snapshots
Point-in-time recovery | Continuous (5-min granularity) | 30 days | RDS transaction logs
Manual snapshots | Before major changes | Indefinite | RDS snapshots
Monthly archive | Monthly | 7 years | S3 Glacier Deep Archive

5.2 EC2 Application (AMI-Based)

Application instances are backed up via Amazon Machine Images (AMIs), not container images.

Backup Type | Frequency | Retention | Storage
Golden AMI | After each deployment | 30 days (last 10 versions) | EC2 AMI
Pre-change AMI | Before config changes | 7 days | EC2 AMI
Weekly AMI | Sunday 02:00 UTC | 90 days | EC2 AMI

AMI Contents:

Component | Included in AMI
Swift Vapor binary | Yes (compiled application)
System configuration | Yes (systemd, cloudflared, etc.)
Cloudflare Tunnel daemon | Yes (cloudflared)
Environment variables | No (pulled from Secrets Manager at boot)
SSL/TLS certificates | No (Cloudflare-managed)

Recovery from AMI:

Step | Action | Time
1 | Launch new EC2 from latest AMI | 2-3 min
2 | Instance boots, pulls secrets | 1 min
3 | Cloudflare Tunnel reconnects | 30 sec
4 | Health check passes | 30 sec
Total | | < 5 min
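
Step 1 depends on identifying the latest usable AMI. The sketch below shows one possible selection rule over image records shaped like the EC2 DescribeImages response (the ImageId, CreationDate, and State fields). The function name, the versions_back parameter, and the sample image IDs are hypothetical; fetching the records from AWS is not shown.

```python
from typing import Optional


def pick_ami(images: list, versions_back: int = 0) -> Optional[str]:
    """Return the ImageId of the Nth most recent available AMI (0 = latest)."""
    available = [img for img in images if img.get("State") == "available"]
    # CreationDate is an ISO 8601 UTC string, so lexicographic order is chronological.
    available.sort(key=lambda img: img["CreationDate"], reverse=True)
    if 0 <= versions_back < len(available):
        return available[versions_back]["ImageId"]
    return None


# Hypothetical sample data in the DescribeImages response shape.
images = [
    {"ImageId": "ami-0aaa", "CreationDate": "2025-12-14T02:00:00.000Z", "State": "available"},
    {"ImageId": "ami-0bbb", "CreationDate": "2025-12-20T18:30:00.000Z", "State": "available"},
    {"ImageId": "ami-0ccc", "CreationDate": "2025-12-22T09:15:00.000Z", "State": "pending"},
]

print(pick_ami(images))                    # latest usable AMI
print(pick_ami(images, versions_back=1))   # previous AMI, e.g. for a rollback
```

Images that are not yet in the available state are skipped, so an AMI still being created after a deployment would not be chosen for recovery.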

5.3 Application State

Component | Backup Method | Frequency
Swift Vapor source | Git repository | On commit
Compiled binary | Included in AMI | On deploy
Environment variables | AWS Secrets Manager | Versioned
SSL/TLS certificates | Cloudflare managed | Automatic
Cloudflare Tunnel config | Included in AMI + Cloudflare dashboard | On change

5.4 Backup Verification

Test | Frequency | Owner
AMI launch test | Weekly | Infrastructure
RDS snapshot restore test | Monthly | Infrastructure
Point-in-time recovery test | Quarterly | Infrastructure
Full DR drill (AMI + RDS) | Quarterly | Infrastructure + Operations

6. Disaster Scenarios

6.1 Scenario Matrix

Scenario | Likelihood | Impact | Recovery Procedure
EC2 instance failure | Low | High | Launch from AMI (< 5 min)
RDS primary failure | Low | Medium | Automatic Multi-AZ failover (60-120 sec)
RDS AZ failure | Very Low | Medium | Automatic Multi-AZ failover (60-120 sec)
Database corruption | Low | Critical | Point-in-time recovery
Cloudflare outage | Very Low | Critical | Direct access failover
Security breach | Low | Critical | Isolation + forensics
Application bug (data loss) | Medium | Medium | Point-in-time recovery
Bad deployment | Medium | High | Rollback to previous AMI (< 5 min)

6.2 EC2 Instance Failure

Because the architecture is single-box, an EC2 failure causes a full outage until recovery completes.

Step | Action | Owner | Target Time
1 | Detect instance unavailability | CloudWatch alarm | < 1 min
2 | Confirm failure (not transient) | Auto-recovery or operator | 2 min
3 | Launch new instance from latest AMI | Infrastructure | 2-3 min
4 | Instance boots, pulls secrets from Secrets Manager | Automatic | 1 min
5 | Cloudflare Tunnel reconnects | Automatic | 30 sec
6 | Health check passes, traffic resumes | Automatic | 30 sec
7 | Notify stakeholders | Operations | 5 min
Total (steps 1-6, until traffic resumes) | | | < 10 min

Automation option: EC2 Auto Recovery can automatically recover the instance after a hardware failure.
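
The Section 6.2 target times can be checked by summing the upper bound of each sequential step up to the point where traffic resumes. This is a rough sketch: it assumes the steps run strictly in sequence, takes the larger value wherever a range is given, and uses hypothetical names throughout.

```python
from datetime import timedelta

# Upper-bound target times for steps 1-6 of the Section 6.2 table.
STEPS_UNTIL_TRAFFIC_RESUMES = [
    ("Detect instance unavailability", timedelta(minutes=1)),
    ("Confirm failure (not transient)", timedelta(minutes=2)),
    ("Launch new instance from latest AMI", timedelta(minutes=3)),
    ("Instance boots, pulls secrets", timedelta(minutes=1)),
    ("Cloudflare Tunnel reconnects", timedelta(seconds=30)),
    ("Health check passes, traffic resumes", timedelta(seconds=30)),
]

TARGET_TOTAL = timedelta(minutes=10)   # "< 10 min" total from the table
TIER_1_RTO = timedelta(hours=1)        # Section 4.1


def worst_case(steps) -> timedelta:
    """Sum the upper-bound duration of each sequential step."""
    return sum((duration for _, duration in steps), timedelta())


outage = worst_case(STEPS_UNTIL_TRAFFIC_RESUMES)
print(f"Worst-case outage: {outage}")
print(f"Within 10 min target: {outage < TARGET_TOTAL}")
print(f"Within Tier 1 RTO:    {outage <= TIER_1_RTO}")
```

Under these assumptions the sequential steps sum to 8 minutes. The 5-minute stakeholder notification is treated here as running alongside the recovery rather than extending the outage.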

6.3 RDS Multi-AZ Failover

RDS handles this automatically with no operator intervention.

Step | Action | Owner | Target Time
1 | Primary instance failure detected | RDS | Immediate
2 | DNS CNAME updated to standby | RDS | 30-60 sec
3 | Standby promoted to primary | RDS | 30-60 sec
4 | Application reconnects automatically | Vapor connection pool | < 30 sec
5 | New standby provisioned in background | RDS | Minutes (non-blocking)
Total | | | 60-120 sec (automatic)

Application behavior during failover:

  • Vapor connection pool detects closed connections
  • Automatic reconnection to new primary
  • Brief error responses during 60-120 second window
  • No data loss (synchronous replication)
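
The reconnect behavior listed above can be illustrated with a small retry loop. This Python sketch is not the Vapor connection pool's implementation; it only models the pattern of discarding a broken connection and reconnecting until the failover window has passed. The function name, parameters, and the 1-second retry delay are assumptions.

```python
import time


def run_with_reconnect(operation, connect, max_wait_s=120.0, retry_delay_s=1.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Run operation(conn), reconnecting on ConnectionError until max_wait_s elapses."""
    deadline = clock() + max_wait_s
    conn = None
    while True:
        try:
            if conn is None:
                conn = connect()          # may raise while the endpoint is moving
            return operation(conn)
        except ConnectionError:
            conn = None                   # discard the broken connection
            if clock() >= deadline:
                raise                     # failover window exceeded; surface the error
            sleep(retry_delay_s)
```

The default max_wait_s of 120 seconds mirrors the upper end of the failover window. Only idempotent operations should be retried this way, since a write that failed mid-flight may or may not have been committed.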

6.4 Database Corruption

Step | Action | Owner | Target Time
1 | Detect data anomaly | Automated/Operations | Variable
2 | Stop application writes | Infrastructure | 5 min
3 | Identify corruption timestamp | Infrastructure | 30 min
4 | Initiate point-in-time recovery to new instance | Infrastructure | 15 min
5 | RDS restores to target time | RDS | 30-60 min
6 | Verify data integrity on new instance | Operations | 30 min
7 | Update application connection string | Infrastructure | 5 min
8 | Resume operations | Operations | 5 min
9 | Post-incident review | All teams | 24 hours
Total | | | ~2 hours
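
Steps 3 and 4 hinge on choosing a restore timestamp that precedes the corruption and still falls inside the 30-day point-in-time recovery window. The sketch below shows one way to validate that choice. The function name and the one-minute safety margin are assumptions; in RDS, the latest_restorable argument would correspond to the instance's LatestRestorableTime value.

```python
from datetime import datetime, timedelta, timezone

PITR_RETENTION = timedelta(days=30)      # Section 5.1 retention
SAFETY_MARGIN = timedelta(minutes=1)     # hypothetical buffer before the corruption


def pitr_restore_time(corruption_at: datetime, latest_restorable: datetime,
                      now: datetime) -> datetime:
    """Choose a restore timestamp just before the corruption, within the PITR window."""
    target = corruption_at - SAFETY_MARGIN
    if target < now - PITR_RETENTION:
        raise ValueError("corruption predates the PITR window; use a snapshot or archive")
    # Recent transaction logs may not be restorable yet, so never aim past that point.
    return min(target, latest_restorable)


# Hypothetical example: corruption identified 30 minutes ago.
now = datetime(2025, 12, 22, 12, 0, tzinfo=timezone.utc)
restore_at = pitr_restore_time(
    corruption_at=now - timedelta(minutes=30),
    latest_restorable=now - timedelta(minutes=3),
    now=now,
)
print(restore_at.isoformat())
```

If the corruption is older than the retention window, the function raises instead of guessing, which signals that recovery would have to start from a manual snapshot or the monthly archive.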

6.5 Bad Deployment Rollback

Step | Action | Owner | Target Time
1 | Detect application issue | Monitoring/Operations | Variable
2 | Decision to rollback | Operations | 5 min
3 | Launch instance from previous AMI | Infrastructure | 2-3 min
4 | Instance boots, tunnel reconnects | Automatic | 1-2 min
5 | Terminate bad instance | Infrastructure | 1 min
6 | Verify rollback successful | Operations | 5 min
Total | | | < 15 min

6.6 Cloudflare Service Disruption

Step | Action | Owner | Target Time
1 | Detect Cloudflare unavailability | Monitoring | < 5 min
2 | Assess scope (DNS vs Tunnel vs Full) | Infrastructure | 5 min
3 | Activate backup DNS (Route 53) | Infrastructure | 10 min
4 | Temporarily assign public IP to EC2 | Infrastructure | 5 min
5 | Update security group for direct access | Infrastructure | 5 min
6 | Enable AWS WAF rules | Infrastructure | 5 min
7 | Monitor for attacks | Security | Continuous
8 | Revert when Cloudflare restored | Infrastructure | 15 min

7. Multi-Tenant Considerations

7.1 Tenant Isolation During Recovery

Requirement | Implementation
Data isolation maintained | Organization-scoped recovery queries
No cross-tenant data exposure | JWT org claims validated post-recovery
Tenant-specific rollback | Supported via org_id partitioning
Audit trail preservation | Tenant-scoped audit logs maintained

7.2 Tenant Communication

Event | Notification | Channel
Planned maintenance | 72 hours advance | Email + In-app
Unplanned outage | Within 15 minutes | Status page + Email
Recovery complete | Immediately | Status page + Email
Post-incident report | Within 48 hours | Email

8. Recovery Procedures

8.1 Swift Vapor Application Recovery (AMI-Based)

Recovery Checklist:
[ ] Identify target AMI (latest golden or specific version)
[ ] Verify AMI available in region
[ ] Launch new EC2 instance from AMI
    - Instance type: r6g.xlarge
    - Subnet: private subnet with NAT
    - Security group: app-server-sg (no inbound; egress to RDS, Secrets Manager, and the Cloudflare edge for the tunnel)
    - IAM role: fleetlink-app-role (Secrets Manager read)
[ ] Wait for instance status checks to pass
[ ] Verify Cloudflare Tunnel connectivity (check Cloudflare dashboard)
[ ] Run health check: curl https://api.fleetlink.com/health
[ ] Verify JWT validation: test auth endpoint
[ ] Confirm database connectivity: check /health/db endpoint
[ ] Verify WebSocket connections are being accepted
[ ] Monitor error rates in CloudWatch
[ ] Terminate old instance (if applicable)
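
The health-check items in this checklist could be scripted as a polling loop that waits for the new instance to report healthy. The following is a minimal sketch: the function names, attempt count, and interval are assumptions, and the full URL for the database check is inferred from the /health/db path in the checklist.

```python
import time
import urllib.request


def http_ok(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if a GET to url answers with HTTP 200, False on any network error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:  # URLError, HTTPError, and timeouts are all OSError subclasses
        return False


def wait_until_healthy(check, attempts: int = 10, interval_s: float = 5.0,
                       sleep=time.sleep):
    """Call check() until it returns True; return the attempt number, or None on failure."""
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        if attempt < attempts:
            sleep(interval_s)
    return None


# Example usage (not executed here):
# wait_until_healthy(lambda: http_ok("https://api.fleetlink.com/health"))
# wait_until_healthy(lambda: http_ok("https://api.fleetlink.com/health/db"))  # assumed URL
```

Passing the check as a callable keeps the polling logic independent of the endpoint, so the same loop could cover the JWT and WebSocket verifications if suitable checks were written for them.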

8.2 PostgreSQL Recovery (RDS Multi-AZ)

For Multi-AZ Failover (automatic):

No recovery action is required; RDS handles the failover automatically. After the failover:
[ ] Monitor RDS events for failover completion
[ ] Verify application reconnected (check logs)
[ ] Confirm no data loss

For Point-in-Time Recovery:

Recovery Checklist:
[ ] Identify target recovery timestamp
[ ] Initiate PITR via RDS console/CLI
    - New instance identifier: fleetlink-db-recovered-YYYYMMDD
    - Target time: [specific timestamp]
    - Instance class: db.t3.small or db.t3.medium
    - Multi-AZ: Yes
[ ] Wait for new instance availability (30-60 min)
[ ] Run data integrity queries:
    - SELECT COUNT(*) FROM users;
    - SELECT COUNT(*) FROM rides WHERE created_at > '[recovery_point]';
    - SELECT COUNT(DISTINCT org_id) FROM users;
[ ] Update Secrets Manager with new endpoint
[ ] Restart application to pick up new connection
[ ] Test read operations
[ ] Test write operations
[ ] Verify org_id constraints intact
[ ] Resume normal operations
[ ] Delete old instance after verification period (24-48 hours)

8.3 Authentication Recovery

Component | Recovery Method
JWT signing keys | Restore from Secrets Manager (versioned)
Role definitions | Restored with AMI (compiled into app)
Active sessions | Stateless - no recovery needed
Refresh tokens | Re-authentication required
Organization configs | Restored with database

8.4 WebSocket Connection Recovery

Scenario | Client Behavior | Server Behavior
Instance restart | Clients auto-reconnect | Accept new connections
Brief network blip | WebSocket ping/pong timeout | Connection cleanup
Cloudflare tunnel restart | Transparent to clients | Re-establishes tunnel

Client reconnection policy:

  • Exponential backoff: 1s, 2s, 4s, 8s, max 30s
  • GPS updates queued locally during disconnection
  • Batch upload on reconnection
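
The reconnection policy above can be sketched as a capped exponential backoff plus a local buffer that is drained in batches after reconnecting. This is an illustration in Python rather than the client's actual code; the class name, buffer limit, batch size, and the choice to drop the oldest points when the buffer is full are all assumptions.

```python
from collections import deque
from itertools import islice


def backoff_delays(base_s: float = 1.0, factor: float = 2.0, cap_s: float = 30.0):
    """Yield reconnect delays that double each time and never exceed cap_s."""
    delay = base_s
    while True:
        yield min(delay, cap_s)
        delay = min(delay * factor, cap_s)


class GpsBuffer:
    """Holds GPS points while disconnected and hands them back in batches."""

    def __init__(self, max_points: int = 10_000):
        # Assumption: when the buffer is full, the oldest points are dropped.
        self._points = deque(maxlen=max_points)

    def record(self, point) -> None:
        self._points.append(point)

    def drain_batches(self, batch_size: int = 100):
        """Yield buffered points oldest-first in lists of at most batch_size."""
        while self._points:
            count = min(batch_size, len(self._points))
            yield [self._points.popleft() for _ in range(count)]


print(list(islice(backoff_delays(), 7)))
```

With the defaults, the delays follow the policy's 1s, 2s, 4s, 8s progression, continue to 16s, and then hold at the 30s cap.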

9. Testing Schedule

9.1 DR Test Calendar

Test Type | Frequency | Scope | Duration
AMI launch test | Weekly | Launch from latest AMI, verify health | 30 min
RDS failover test | Monthly | Trigger Multi-AZ failover | 15 min
RDS PITR test | Quarterly | Point-in-time recovery to test instance | 2 hours
Full DR drill | Quarterly | AMI recovery + RDS restore | 4 hours
Tabletop exercise | Bi-annually | All scenarios | Half day

9.2 Test Documentation

Each DR test must document:

Field | Required
Test date and time | Yes
Scenario tested | Yes
Participants | Yes
Actual recovery time | Yes
Issues encountered | Yes
Remediation actions | Yes
Sign-off | Yes

10. Roles and Responsibilities

10.1 DR Team

Role | Primary | Backup
DR Coordinator | Infrastructure Director | Operations Director
Database Recovery | Senior DBA | Infrastructure Engineer
Application Recovery | Lead Developer | DevOps Engineer
Network Recovery | Infrastructure Engineer | Security Engineer
Communications | Operations Director | Support Lead

10.2 Escalation Matrix

Severity | Response Time | Escalation Path
Critical (full outage) | Immediate | DR Coordinator > CTO > CEO
High (partial outage) | 15 minutes | On-call > DR Coordinator
Medium (degraded) | 1 hour | On-call > Team Lead
Low (non-impacting) | 4 hours | Standard ticket

11. Dependencies

11.1 External Dependencies

Service | Criticality | Fallback | Recovery
AWS EC2 | Critical | Launch from AMI | < 5 min
AWS RDS | Critical | Multi-AZ automatic failover | 60-120 sec
Cloudflare | Critical | Route 53 + direct access | 15 min
Secrets Manager | High | Values cached in app memory | Restart required

11.2 Internal Dependencies

System | Depends On | Impact if Unavailable
API | RDS, Secrets Manager | Full outage
WebSocket | EC2 instance | GPS updates paused, client queue
Auth | Secrets Manager (JWT keys) | No new sessions
Voice Agent | API, Twilio | Voice unavailable
Notifications | API, Email provider | Delayed communications

11.3 Single Point of Failure Analysis

Component | SPOF? | Mitigation
EC2 instance | Yes | AMI-based recovery < 5 min
RDS primary | No | Multi-AZ automatic failover
Cloudflare Tunnel | No | Auto-reconnect, can use direct access
Secrets Manager | No | Regional service, multi-AZ
S3 (AMIs) | No | Regional service, 11 9s durability

12. Document References

Document | Relevance
PLCY-INC-001 Incident Response | Incident declaration procedures
PLCY-SEC-001 Security Controls | Security requirements during recovery
PLCY-AUD-001 Audit Trail Specs | Audit requirements for DR events
PLCY-RET-001 Records Retention | Backup retention requirements

13. Review and Maintenance

Activity | Frequency | Owner
Policy review | Annual | Infrastructure Director
Contact list update | Quarterly | Operations
Procedure validation | After each DR test | DR Coordinator
Technology review | Semi-annual | Infrastructure

14. Revision History

Version | Date | Author | Changes
1.0 | December 22, 2025 | Infrastructure Director | Initial release
1.1 | December 22, 2025 | Infrastructure Director | Updated to single-box AMI architecture, RDS Multi-AZ, capacity sizing

CONFIDENTIAL - Internal Use Only - Hop And Haul Policy Documentation