Infrastructure Sizing Specification
Document ID: PLCY-INF-001
Version: 1.0
Effective Date: December 22, 2025
Last Review: December 22, 2025
Owner: Hop And Haul Team
CONFIDENTIAL
This document is CONFIDENTIAL and for internal use only. Do not distribute outside the organization.
1. Purpose
This document defines the infrastructure sizing requirements for Hop And Haul's production environment, establishing capacity baselines, scaling thresholds, and resource allocation decisions.
2. Workload Characteristics
2.1 User Population
| Metric | Value | Notes |
|---|---|---|
| Maximum users | 5,000 | Drivers and riders are the same user pool |
| Peak concurrent users | 5,000 | All users may be active simultaneously |
| User sessions | Stateless JWT | No server-side session storage |
2.2 Traffic Patterns
| Traffic Type | Characteristics |
|---|---|
| API requests | Bursty, correlated with ride activity |
| WebSocket connections | Persistent, one per active driver |
| GPS updates | 1 update per 5 seconds per active driver |
| Peak hours | 7-9 AM, 4-7 PM local time |
2.3 Request Volume
| Metric | Calculation | Result |
|---|---|---|
| Max concurrent WebSocket | 5,000 drivers | 5,000 connections |
| GPS updates per second | 5,000 / 5 sec interval | 1,000 messages/sec |
| API RPS (peak) | 5,000 users x 1 req/sec | 5,000 RPS |
| API RPS (sustained) | 5,000 users x 0.1 req/sec | 500 RPS |
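The figures in the table follow directly from the user population and the GPS interval. A minimal sketch of that arithmetic is below; the constant and function names are illustrative, not part of any existing tooling, and the worst case assumes every user is an active driver.

```python
# Inputs taken from sections 2.1 and 2.2; names are illustrative.
MAX_USERS = 5_000
GPS_INTERVAL_SEC = 5
PEAK_REQ_PER_USER_PER_SEC = 1.0
SUSTAINED_REQ_PER_USER_PER_SEC = 0.1


def request_volume(users: int = MAX_USERS) -> dict:
    """Derive worst-case volumes, assuming every user is an active driver."""
    return {
        "websocket_connections": users,
        "gps_messages_per_sec": users / GPS_INTERVAL_SEC,
        "api_rps_peak": users * PEAK_REQ_PER_USER_PER_SEC,
        "api_rps_sustained": users * SUSTAINED_REQ_PER_USER_PER_SEC,
    }


if __name__ == "__main__":
    for name, value in request_volume().items():
        print(f"{name}: {value:,.0f}")
```

Passing a different `users` value (for example 10,000) shows how each figure scales linearly with the user population.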
3. Compute Sizing (EC2)
3.1 Selected Instance
| Attribute | Value |
|---|---|
| Instance type | r6g.xlarge |
| vCPU | 4 |
| Memory | 32 GB |
| Network | Up to 10 Gbps |
| Architecture | ARM64 (Graviton2) |
| Pricing tier | On-demand or Reserved |
3.2 Sizing Justification
Memory Analysis:
| Component | Memory Usage |
|---|---|
| Swift Vapor runtime | 500 MB |
| Application heap | 1-2 GB |
| WebSocket connections (5,000 x 50KB) | 250 MB |
| Connection buffers | 500 MB |
| OS and system | 1 GB |
| Total estimated | ~4 GB |
| Available headroom | 28 GB |
CPU Analysis:
| Workload | CPU Estimate |
|---|---|
| 5,000 RPS API handling | ~20% of 4 vCPU |
| 1,000 msg/sec WebSocket | ~10% of 4 vCPU |
| JWT validation | ~5% of 4 vCPU |
| Database queries | ~10% of 4 vCPU |
| Total estimated | ~45% |
| Available headroom | 55% |
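The memory and CPU totals above can be reproduced by summing the table rows. The sketch below does so with the estimates copied from both tables; the dictionary keys are illustrative labels. Because the application heap is given as a range, the memory total comes out as 3.25 to 4.25 GB, which the table rounds to ~4 GB.

```python
INSTANCE_MEMORY_GB = 32  # r6g.xlarge

# (low, high) memory estimates in GB, copied from the memory table.
MEMORY_GB = {
    "swift_vapor_runtime": (0.5, 0.5),
    "application_heap": (1.0, 2.0),
    "websocket_connections": (0.25, 0.25),
    "connection_buffers": (0.5, 0.5),
    "os_and_system": (1.0, 1.0),
}

# CPU estimates as a percentage of the 4 vCPUs, copied from the CPU table.
CPU_PCT = {
    "api_handling": 20,
    "websocket_messages": 10,
    "jwt_validation": 5,
    "database_queries": 10,
}


def memory_range_gb() -> tuple:
    """Return the (low, high) total memory estimate in GB."""
    low = sum(lo for lo, _ in MEMORY_GB.values())
    high = sum(hi for _, hi in MEMORY_GB.values())
    return low, high


def cpu_total_pct() -> int:
    """Return the summed CPU estimate as a percentage of the instance."""
    return sum(CPU_PCT.values())


if __name__ == "__main__":
    low, high = memory_range_gb()
    print(f"memory: {low:.2f}-{high:.2f} GB, "
          f"headroom at least {INSTANCE_MEMORY_GB - high:.2f} GB")
    print(f"cpu: {cpu_total_pct()}%, headroom {100 - cpu_total_pct()}%")
```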
Why r6g.xlarge:
| Reason | Explanation |
|---|---|
| Graviton2 (ARM) | Up to 40% better price/performance vs comparable x86 instances (AWS figure) |
| Memory headroom | 32GB provides buffer for growth |
| Network capacity | 10 Gbps handles WebSocket + API |
| Swift support | Swift fully supports ARM64 Linux |
3.3 Alternative Sizing Options
| Scenario | Instance | Rationale |
|---|---|---|
| Cost-optimized | r6g.large (16GB) | Sufficient for current load |
| Growth headroom | r6g.xlarge (32GB) | Recommended |
| Peak buffer | r6g.2xlarge (64GB) | If approaching limits |
4. Database Sizing (RDS PostgreSQL)
4.1 Selected Instance
| Attribute | Value |
|---|---|
| Instance class | db.t3.small (2GB) or db.t3.medium (4GB) |
| Engine | PostgreSQL 15 |
| Multi-AZ | Enabled |
| Storage | gp3, 100GB initial |
| IOPS | 3,000 baseline (gp3 default) |
4.2 Sizing Justification
Connection Analysis:
| Metric | Value |
|---|---|
| Application connection pool | 20-50 connections |
| Actual concurrent queries | 10-20 |
| RDS max_connections (t3.small) | 112 |
| RDS max_connections (t3.medium) | 225 |
| Utilization | <50% |
Memory Analysis:
| Component | Estimate |
|---|---|
| shared_buffers (25% of RAM) | 500MB (2GB) or 1GB (4GB) |
| work_mem per connection | 10MB |
| Active work_mem (20 queries) | 200MB |
| OS and overhead | 500MB |
| Total pressure | ~1.2-1.7 GB |
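The "Total pressure" range is the sum of shared_buffers, active work_mem, and OS overhead for each candidate instance class. A minimal sketch of that sum, using the rounded figures from the table (the names are illustrative):

```python
WORK_MEM_MB = 10          # per connection, from the table
ACTIVE_QUERIES = 20       # upper end of "actual concurrent queries"
OS_OVERHEAD_MB = 500

# shared_buffers at 25% of RAM, rounded as in the table.
SHARED_BUFFERS_MB = {"db.t3.small": 500, "db.t3.medium": 1000}
# Nominal RAM per instance class, rounded to match the table's figures.
INSTANCE_RAM_MB = {"db.t3.small": 2000, "db.t3.medium": 4000}


def memory_pressure_mb(instance_class: str) -> int:
    """Sum the estimated memory consumers for one instance class."""
    active_work_mem = WORK_MEM_MB * ACTIVE_QUERIES
    return SHARED_BUFFERS_MB[instance_class] + active_work_mem + OS_OVERHEAD_MB


if __name__ == "__main__":
    for cls in SHARED_BUFFERS_MB:
        used = memory_pressure_mb(cls)
        share = 100 * used / INSTANCE_RAM_MB[cls]
        print(f"{cls}: ~{used} MB ({share:.1f}% of RAM)")
```

One caveat on the estimate: PostgreSQL applies work_mem per sort or hash operation rather than strictly per connection, so a complex query can use several multiples of it. The 200 MB figure is therefore closer to a floor than a ceiling.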
Storage Analysis:
| Data Type | Estimated Size |
|---|---|
| User records (5,000) | 10 MB |
| Organization records | 1 MB |
| Ride history (12 months) | 5 GB |
| GPS traces (compressed, 12 months) | 20 GB |
| Audit logs (24 months) | 10 GB |
| Indexes | 10 GB |
| Total | ~45 GB |
| Provisioned | 100 GB |
4.3 Multi-AZ Configuration
| Feature | Setting |
|---|---|
| Multi-AZ deployment | Enabled |
| Synchronous replication | Yes (automatic) |
| Automatic failover | Yes (60-120 seconds) |
| Backup retention | 30 days |
| Point-in-time recovery | Enabled |
4.4 Recommendation
| Load Level | Instance Class | Reasoning |
|---|---|---|
| Conservative | db.t3.small (2GB) | Sufficient for workload |
| Recommended | db.t3.medium (4GB) | Headroom for query complexity |
| Growth | db.t3.large (8GB) | If adding analytics queries |
5. Network Architecture
5.1 Cloudflare Configuration
| Component | Configuration |
|---|---|
| DNS | Cloudflare (proxied) |
| DDoS protection | Included |
| Zero Trust | Enabled for admin access |
| Tunnel | Single tunnel to EC2 |
| WebSocket support | Enabled |
5.2 AWS Networking
| Component | Configuration |
|---|---|
| VPC | Single VPC, single region |
| Subnets | Private (app), Private (RDS) |
| NAT Gateway | For outbound (Secrets Manager, etc.) |
| Security Groups | No inbound, egress restricted |
| VPC Endpoints | Secrets Manager, S3 |
5.3 No Public Exposure
| Layer | Public Access |
|---|---|
| EC2 | No public IP |
| RDS | No public access |
| S3 | VPC endpoint only |
| Ingress | Cloudflare Tunnel only |
6. WebSocket Capacity
6.1 Connection Limits
| Limit | Value | Notes |
|---|---|---|
| OS file descriptors | 65,535 default | Increase to 100,000 |
| Swift NIO connections | No hard limit | Memory-bound |
| Cloudflare concurrent | 100+ per tunnel | Well above need |
| Target connections | 5,000 | Workload requirement |
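Each WebSocket connection consumes a file descriptor, so the descriptor limit has to sit above the 5,000-connection target. The sketch below shows one way to check and raise the soft limit from inside a process, using Python's Unix-only `resource` module. It only illustrates the capping logic: the production service is Swift Vapor, where the equivalent limit would be set on that process, and the function names here are hypothetical.

```python
import resource  # Unix-only standard library module

TARGET_NOFILE = 100_000  # target from the connection limits table


def desired_soft_limit(soft: int, hard: int, target: int = TARGET_NOFILE,
                       infinity: int = resource.RLIM_INFINITY) -> int:
    """Return the soft limit to request: the target, capped by the hard limit."""
    if soft == infinity or soft >= target:
        return soft               # already high enough
    if hard == infinity:
        return target
    return min(target, hard)      # an unprivileged process cannot exceed hard


def raise_nofile_limit(target: int = TARGET_NOFILE) -> int:
    """Try to raise this process's soft RLIMIT_NOFILE; return the result."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    wanted = desired_soft_limit(soft, hard, target)
    if wanted != soft:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
        except (ValueError, OSError):
            pass  # the platform refused; keep the current limit
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


if __name__ == "__main__":
    print("soft RLIMIT_NOFILE is now", raise_nofile_limit())
```

If the hard limit itself is below the target, the soft limit can only be raised that far without elevated privileges, so the hard limit would need to be changed at the service or OS level.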
6.2 GPS Update Flow
Driver App → Cloudflare Tunnel → EC2 (Swift Vapor) → RDS (batch write)
The Driver App to EC2 leg is a persistent WebSocket; the batch write to RDS runs every 30 seconds.
| Stage | Latency Target |
|---|---|
| Client to Cloudflare | < 50ms |
| Cloudflare to EC2 | < 10ms |
| EC2 processing | < 5ms |
| Batch DB write | < 50ms |
| Total | < 115ms |
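The flow above buffers GPS updates on the application server and writes them to RDS in batches every 30 seconds. A minimal sketch of such a buffer follows. The `GpsUpdate` fields, the `GpsBatcher` class, and the `write_batch` callable are all hypothetical stand-ins; the actual service is written in Swift, and its data model is not described in this document.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class GpsUpdate:
    driver_id: str
    lat: float
    lon: float
    recorded_at: float


@dataclass
class GpsBatcher:
    # Hypothetical writer, e.g. a function that issues one multi-row INSERT.
    write_batch: Callable[[List[GpsUpdate]], None]
    flush_interval_sec: float = 30.0
    clock: Callable[[], float] = time.monotonic
    _buffer: List[GpsUpdate] = field(default_factory=list, init=False)
    _last_flush: Optional[float] = field(default=None, init=False)

    def add(self, update: GpsUpdate) -> None:
        """Buffer one update and flush if the interval has elapsed."""
        now = self.clock()
        if self._last_flush is None:
            self._last_flush = now
        self._buffer.append(update)
        if now - self._last_flush >= self.flush_interval_sec:
            self.flush()

    def flush(self) -> None:
        """Write all buffered updates as one batch and reset the timer."""
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self.write_batch(batch)
        self._last_flush = self.clock()
```

At 1,000 messages per second, a 30-second interval yields batches of roughly 30,000 rows. This sketch only flushes when a new update arrives; a real service would likely also flush from a timer and on shutdown so that a quiet period does not strand buffered updates.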
6.3 Memory per Connection
| Component | Size |
|---|---|
| Swift NIO channel | ~10KB |
| Application state | ~20KB |
| Buffers | ~20KB |
| Total per connection | ~50KB |
| 5,000 connections | ~250MB |
7. Secrets Management
7.1 AWS Secrets Manager
| Secret | Rotation |
|---|---|
| RDS credentials | 30 days (automatic) |
| JWT signing key | Manual (on compromise) |
| Cloudflare API token | Manual |
| Third-party API keys | Per provider policy |
7.2 Application Secret Loading
| Timing | Behavior |
|---|---|
| Boot | Load all secrets from Secrets Manager |
| Runtime | Cached in memory |
| Rotation | Restart required, unless the application implements in-process refresh |
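The table describes two loading behaviors: load once at boot and cache, or refresh while running. The sketch below models both with a single cache class. `SecretCache` and its `fetch` callable are hypothetical; `fetch` stands in for whatever call retrieves a secret from Secrets Manager. With `max_age_sec=None` a secret is fetched once and kept until restart; with a value set, it is re-fetched after that age.

```python
import time
from typing import Callable, Dict, Iterable, Optional, Tuple


class SecretCache:
    def __init__(self, fetch: Callable[[str], str],
                 max_age_sec: Optional[float] = None,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch              # hypothetical Secrets Manager lookup
        self._max_age_sec = max_age_sec  # None means cache until restart
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}

    def preload(self, names: Iterable[str]) -> None:
        """Load all named secrets, as done at boot."""
        for name in names:
            self.get(name)

    def get(self, name: str) -> str:
        """Return the cached value, re-fetching it if it is too old."""
        now = self._clock()
        entry = self._entries.get(name)
        if entry is not None:
            value, loaded_at = entry
            if self._max_age_sec is None or now - loaded_at < self._max_age_sec:
                return value
        value = self._fetch(name)
        self._entries[name] = (value, now)
        return value
```

For the RDS credentials, which rotate every 30 days, a refresh age well below the rotation period would let the application pick up new credentials without a restart. Existing database connections would still need to be re-established with the new credentials, which this sketch does not handle.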
8. Monitoring and Alerting
8.1 CloudWatch Metrics
| Metric | Warning | Critical |
|---|---|---|
| EC2 CPU | > 70% | > 90% |
| EC2 Memory | > 80% | > 95% |
| RDS CPU | > 70% | > 85% |
| RDS connections | > 80 | > 100 |
| RDS storage | > 80% | > 90% |
| WebSocket connections | > 4,500 | > 4,900 |
8.2 Application Metrics
| Metric | Warning | Critical |
|---|---|---|
| API latency p99 | > 500ms | > 1000ms |
| Error rate | > 1% | > 5% |
| WebSocket disconnects/min | > 100 | > 500 |
| GPS update lag | > 10s | > 30s |
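The warning and critical thresholds from sections 8.1 and 8.2 can be expressed as a lookup table plus a small classification function. The sketch below assumes a strict "greater than" comparison, as the tables state; the metric keys are illustrative names, not actual CloudWatch metric identifiers.

```python
from typing import NamedTuple


class Threshold(NamedTuple):
    warning: float
    critical: float


# Values copied from the CloudWatch and application metric tables.
THRESHOLDS = {
    "ec2_cpu_pct": Threshold(70, 90),
    "ec2_memory_pct": Threshold(80, 95),
    "rds_cpu_pct": Threshold(70, 85),
    "rds_connections": Threshold(80, 100),
    "rds_storage_pct": Threshold(80, 90),
    "websocket_connections": Threshold(4_500, 4_900),
    "api_latency_p99_ms": Threshold(500, 1_000),
    "error_rate_pct": Threshold(1, 5),
    "websocket_disconnects_per_min": Threshold(100, 500),
    "gps_update_lag_sec": Threshold(10, 30),
}


def classify(metric: str, value: float) -> str:
    """Map a metric reading to OK, WARNING, or CRITICAL."""
    threshold = THRESHOLDS[metric]
    if value > threshold.critical:
        return "CRITICAL"
    if value > threshold.warning:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    print(classify("ec2_cpu_pct", 75))
    print(classify("gps_update_lag_sec", 45))
```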
9. Cost Estimation
9.1 Monthly Costs (us-east-1, On-Demand)
| Resource | Specification | Monthly Cost |
|---|---|---|
| EC2 r6g.xlarge | 730 hours | ~$150 |
| RDS db.t3.medium Multi-AZ | 730 hours | ~$100 |
| RDS storage (100GB gp3) | Multi-AZ | ~$25 |
| NAT Gateway | Data transfer | ~$50 |
| Secrets Manager | 5 secrets | ~$3 |
| CloudWatch | Metrics + logs | ~$30 |
| S3 (AMIs, backups) | ~50GB | ~$2 |
| Total | | ~$360/month |
9.2 Reserved Instance Savings
| Commitment | EC2 Savings | RDS Savings |
|---|---|---|
| 1-year RI | ~30% | ~30% |
| 3-year RI | ~50% | ~50% |
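To see how the reserved-instance discounts affect the monthly total, the sketch below sums the line items from section 9.1 and applies the approximate savings to the two instance-hour lines only. Treating storage, NAT Gateway, and the other services as not discounted is an assumption of this sketch, and the dictionary keys are illustrative.

```python
# Approximate on-demand monthly costs in USD, from section 9.1.
MONTHLY_COST_USD = {
    "ec2_r6g_xlarge": 150,
    "rds_t3_medium_multi_az": 100,
    "rds_storage": 25,
    "nat_gateway": 50,
    "secrets_manager": 3,
    "cloudwatch": 30,
    "s3": 2,
}

# Approximate savings in percent, from section 9.2.
RI_DISCOUNT_PCT = {"on_demand": 0, "1_year": 30, "3_year": 50}

# Assumption: only instance-hour charges are covered by a reservation.
RI_ELIGIBLE = {"ec2_r6g_xlarge", "rds_t3_medium_multi_az"}


def monthly_total(commitment: str = "on_demand") -> float:
    """Return the estimated monthly cost under the given commitment."""
    pct = RI_DISCOUNT_PCT[commitment]
    total = 0.0
    for name, cost in MONTHLY_COST_USD.items():
        if name in RI_ELIGIBLE:
            cost = cost * (100 - pct) / 100
        total += cost
    return total


if __name__ == "__main__":
    for commitment in RI_DISCOUNT_PCT:
        print(f"{commitment}: ~${monthly_total(commitment):.0f}/month")
```

Under these assumptions the estimate drops from about $360 per month on demand to about $285 with 1-year reservations and about $235 with 3-year reservations.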
10. Scaling Triggers
10.1 When to Scale Up
| Metric | Threshold | Action |
|---|---|---|
| CPU | > 80% sustained for 1 hour | Consider larger instance |
| Memory | > 90% sustained | Upgrade instance class |
| RDS connections | > 100 sustained | Increase pool or instance |
| WebSocket connections | > 4,500 (approaching limit) | Plan for horizontal scale |
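The scale-up triggers depend on a metric staying above its threshold for a sustained period rather than on a single spike. A minimal sketch of that check is below. It assumes evenly spaced samples (for example one per minute, so 60 samples cover the 1-hour CPU window); the sampling period and the function name are assumptions, not something this document specifies.

```python
from typing import Sequence


def sustained_above(samples: Sequence[float], threshold: float,
                    window: int) -> bool:
    """Return True if the most recent `window` samples all exceed threshold."""
    if window <= 0 or len(samples) < window:
        return False  # not enough data to call it sustained
    return all(value > threshold for value in samples[-window:])


if __name__ == "__main__":
    # 60 one-minute CPU samples at 85% would trip the 80% / 1 hour trigger.
    cpu_samples = [85.0] * 60
    print(sustained_above(cpu_samples, threshold=80.0, window=60))
```

A single sample at or below the threshold inside the window resets the condition, which keeps short bursts during peak hours from triggering a resize.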
10.2 When to Consider Horizontal Scaling
| Trigger | Description |
|---|---|
| Users > 10,000 | Single-box limits approaching |
| WebSocket > 8,000 | Connection density limits |
| Regulatory | Data residency requirements |
| Availability | Required RTO < 5 min (single-instance recovery time no longer acceptable) |
10.3 Horizontal Scaling Architecture (Future)
If scaling beyond single-box:
| Component | Strategy |
|---|---|
| API | ALB + multiple EC2 |
| WebSocket | Sticky sessions or Redis pub/sub |
| Database | Read replicas for queries |
| Sessions | Already stateless (JWT) |
11. Document References
| Document | Relevance |
|---|---|
| PLCY-DRP-001 Disaster Recovery Plan | Recovery procedures |
| PLCY-SEC-001 Security Controls | Security requirements |
| PLCY-RSK-001 Risk Assessment | Capacity risks |
12. Revision History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | December 22, 2025 | Infrastructure Director | Initial release |