This is a fictional but realistic Solution Architecture Document for Meridian Financial Services’ Customer API Platform. It demonstrates the Architecture Description Standard at Comprehensive depth — the highest level of documentation rigour. Every section is completed with realistic content to show what a mature, well-documented SAD looks like for a Tier 1 Critical, regulated financial services API platform.
Fictional company: Meridian Financial Services (MFS) — a mid-sized UK retail bank.
Fictional solution: Customer API Platform — a cloud-native REST API providing account and transaction data to partner fintech applications under Open Banking regulations.
This SAD describes the architecture of the Customer API Platform (CAP), Meridian Financial Services’ Open Banking and partner API solution. It replaces the legacy SOAP-based Partner Integration Layer (PIL) and provides secure, high-performance RESTful APIs exposing account information and transaction data to authorised third-party providers (TPPs) and partner fintech applications.
In scope:
API Gateway and all microservices (Account, Transaction, Auth, Notification)
AWS infrastructure across all environments (dev, test, staging, production, DR)
Integration with core banking system, fraud detection, and notification services
Security architecture including OAuth 2.0, mTLS, and encryption
The Customer API Platform (CAP) is a cloud-native, microservices-based REST API platform that exposes account information and transaction data to authorised partner fintech applications and third-party providers. It is Meridian Financial Services’ primary channel for Open Banking compliance and strategic partner integrations.
CAP replaces the legacy SOAP-based Partner Integration Layer, which suffered from poor scalability, high latency, and an inability to meet the performance and security requirements of the UK Open Banking standard. The new platform is built on AWS using containerised microservices orchestrated by Amazon EKS, fronted by AWS API Gateway, and secured with OAuth 2.0 and mutual TLS.
The current Partner Integration Layer (PIL) was built in 2016 on Oracle SOA Suite 11g, hosted on-premises in MFS’ Slough data centre. It provides SOAP/XML interfaces to 8 existing partner integrations.
Key limitations:
Performance: Average response time of 1.2 seconds (P95: 3.8 seconds), far above the 1-second response-time limit mandated by Open Banking
Scalability: Vertically scaled on two physical servers; cannot handle projected 5,000 req/s demand
Security: Does not support OAuth 2.0 or mTLS as required by Open Banking security profile
Supportability: Oracle SOA Suite 11g reached end-of-support in 2022; two critical CVEs remain unpatched
Cost: Annual licensing and support costs of GBP 280,000 plus 3 FTEs for manual operations
Onboarding: Partner onboarding requires 4 weeks of manual configuration and testing
What is being retained: Core banking Oracle database (read replicas will be consumed via new integration layer)
What is being replaced: Oracle SOA Suite middleware, SOAP/XML interfaces, on-premises hosting
What is being decommissioned: PIL application servers (post 6-month parallel-run period)
Caching to avoid recomputation / repeated downstream calls
Yes — ElastiCache Redis used for session state, partner JWT verification keys, and short-lived rate-limiter counters; ~85% cache hit rate on partner authentication, eliminating ~12M Cognito calls per month
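As a rough illustration of that cache-aside flow, the sketch below serves a partner's JWT verification key from Redis and falls back to the identity provider only on a miss. Class, key scheme, and TTL are hypothetical, and the Jedis client is an assumption; the SAD does not name a Redis library.

```java
import java.util.function.Function;
import redis.clients.jedis.Jedis;

// Hypothetical sketch: partner JWT verification keys served cache-aside from Redis.
public class JwkCache {

    private static final int TTL_SECONDS = 3600; // illustrative TTL, not from the SAD

    private final Jedis redis;
    private final Function<String, String> idpFetch; // stand-in for the real IdP call

    public JwkCache(Jedis redis, Function<String, String> idpFetch) {
        this.redis = redis;
        this.idpFetch = idpFetch;
    }

    public String verificationKeyFor(String partnerId) {
        String cacheKey = "jwk:partner:" + partnerId;   // illustrative key scheme
        String cached = redis.get(cacheKey);
        if (cached != null) {
            return cached;                              // ~85% of lookups end here
        }
        String pem = idpFetch.apply(partnerId);         // cache miss: call the IdP
        redis.setex(cacheKey, TTL_SECONDS, pem);        // populate with expiry
        return pem;
    }
}
```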
Batch processes consolidated rather than continuously polling
Yes — transaction enrichment runs as nightly batch (00:30 UTC) rather than per-event; Featurespace ARIC fraud signals consumed via webhook (push) rather than polling
Async / event-driven patterns to flatten peak load
Yes — EventBridge + SQS for transaction events, partner notifications, and audit log shipping; consumer pods scale on queue depth via Karpenter, releasing capacity when idle
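A minimal sketch of that fire-and-forget pattern using the AWS SDK for Java v2: the API thread publishes a transaction event to EventBridge and returns, and SQS-backed consumers process it later. Bus name, event source, and detail type are illustrative, not taken from the SAD.

```java
import software.amazon.awssdk.services.eventbridge.EventBridgeClient;
import software.amazon.awssdk.services.eventbridge.model.PutEventsRequest;
import software.amazon.awssdk.services.eventbridge.model.PutEventsRequestEntry;

// Sketch: publish a transaction event asynchronously instead of doing work in-line.
public class TransactionEventPublisher {

    private final EventBridgeClient eventBridge = EventBridgeClient.create();

    public void publish(String transactionJson) {
        PutEventsRequestEntry entry = PutEventsRequestEntry.builder()
                .eventBusName("cap-events")            // assumed bus name
                .source("cap.transaction-service")     // assumed source
                .detailType("TransactionRecorded")     // assumed detail type
                .detail(transactionJson)               // event payload as JSON
                .build();

        eventBridge.putEvents(PutEventsRequest.builder().entries(entry).build());
    }
}
```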
Heavy framework choices weighed against lighter alternatives
Considered — Spring Boot retained for the core API (existing team skill, mature ecosystem); Lambda evaluated and rejected for synchronous APIs (cold-start latency would breach P95 < 200ms SLA)
No — production and non-production environments are in separate AWS accounts with no direct connectivity. Data flows between environments only through the CI/CD pipeline (GitHub Actions deploying to each environment in sequence).
Partner applications access the API programmatically; there are no end-user compute or BYOD requirements. Internal administrators use standard corporate Windows 11 laptops via VPN.
eu-west-2 (London) chosen primarily for UK data residency. AWS London is on track for 100% renewable energy matching by 2025 (AWS commitment). DR region eu-west-1 (Ireland) operates at lower carbon intensity than the AWS European average.
Non-production environments auto-shutdown out of hours
Yes — dev and test EKS clusters scale to zero application pods 19:00-07:00 weekdays and all weekend (system pods remain); staging follows the same schedule outside release weeks. Non-prod RDS instances paused on the same schedule. Estimated saving: 62% of non-prod compute and 55% of non-prod RDS spend.
Compute family chosen for performance-per-watt
Yes — Graviton3 (c7g.xlarge / m7g.xlarge) throughout. AWS publishes that Graviton3 uses up to 60% less energy than comparable x86 instances (m6i) for the same performance; Graviton3 was the dominant factor in the 2025-Q3 cost reduction.
Auto-scaling configured to release capacity when idle
Yes — Karpenter consolidates underutilised nodes within 5 minutes of becoming idle; HPA scales pods on CPU and queue depth; idle workloads return resources to the pool rather than being held.
DR strategy proportionate to recovery objective
Warm standby in eu-west-1 (RDS read replica + S3 cross-region replication; EKS cluster scaled to minimum). Hot active-active was considered and rejected: it would have doubled the compute footprint for an RTO improvement (1 hour down to near-zero) that the business sponsor confirmed was unnecessary.
Ingestion
Account and transaction data replicated from core banking Oracle DB via CDC (nightly batch + near-real-time CDC for balances); consent records created via Auth Service
Schema validation, data type enforcement, PII field identification and tagging at ingestion
Processing
API requests query PostgreSQL; PII fields decrypted only at point of use within service; response payloads assembled and returned
Column-level decryption in application code; no PII in logs; request/response audit events emitted
Production data used in staging environment only, with all PII fields masked using a deterministic tokenisation approach (Delphix DataVault). Account numbers, names, and addresses are replaced with realistic synthetic data. Test and development environments use entirely synthetic data generated by the API team.
Yes — checksums (SHA-256) are computed for all data replicated from core banking and validated on ingestion. Transaction amounts are verified using double-entry accounting reconciliation jobs that run hourly, comparing aggregated balances against core banking source of truth.
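A minimal sketch of the ingestion-side check, assuming the record payload and its core-banking checksum arrive together; class and method names are hypothetical, and the constant-time comparison is a general precaution the SAD does not specify.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// Sketch: recompute SHA-256 over the replicated record and compare with the
// checksum shipped from core banking.
public final class IngestionChecksum {

    public static boolean matches(String recordPayload, String expectedHex)
            throws NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] actual = digest.digest(recordPayload.getBytes(StandardCharsets.UTF_8));
        byte[] expected = HexFormat.of().parseHex(expectedHex);
        return MessageDigest.isEqual(actual, expected); // constant-time compare
    }
}
```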
No — no data is stored on end-user devices. All data is served via API and is not cached client-side (Cache-Control: no-store headers applied to all API responses containing customer data).
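One way to enforce that header centrally is a servlet filter; the hedged Spring sketch below assumes a /accounts path prefix and filter-based placement, neither of which the SAD specifies.

```java
import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;
import org.springframework.stereotype.Component;
import org.springframework.web.filter.OncePerRequestFilter;

import java.io.IOException;

// Illustrative filter: stamp Cache-Control: no-store on customer-data responses.
@Component
public class NoStoreHeaderFilter extends OncePerRequestFilter {

    @Override
    protected void doFilterInternal(HttpServletRequest request,
                                    HttpServletResponse response,
                                    FilterChain chain)
            throws ServletException, IOException {
        if (request.getRequestURI().startsWith("/accounts")) {  // assumed prefix
            response.setHeader("Cache-Control", "no-store");
        }
        chain.doFilter(request, response);
    }
}
```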
Yes — all customer data (PII and transaction data) must remain within the United Kingdom (eu-west-2 London region). The DR region (eu-west-1 Ireland) stores only non-PII operational data (metrics, redacted logs). Cross-region replication for RDS is configured to exclude PII columns (custom replication using CDC with PII filtering). Audit logs in S3 are replicated to eu-west-1 with PII fields encrypted using a region-specific KMS key that prevents decryption outside eu-west-2.
Retention periods minimised to regulator + business need
Yes — transaction data retained for 7 years (FCA SYSC requirement); audit logs 7 years; access logs 13 months (regulatory minimum); ephemeral session data ≤ 24 hours. Lifecycle policies enforce expiry automatically; no “indefinite” retention.
Older data tiered to cold/archive storage
Yes — audit logs and transaction archives transition S3 Standard → Standard-IA (30 days) → Glacier Instant Retrieval (90 days) → Glacier Deep Archive (1 year). RDS snapshots > 35 days exported to S3 Glacier. ~78% of historical data sits in archive tiers.
Unused or duplicate replicas identified and removed
Yes — weekly orphaned-snapshot job; quarterly review of read replicas (currently 2, justified by read traffic distribution). No legacy unused buckets (verified via AWS Trusted Advisor).
Compression applied to reduce storage and transfer
Yes — Brotli compression on HTTPS responses (~70% reduction on JSON payloads); gzip on S3 audit log uploads; Parquet (with Snappy) for analytics exports to Snowflake.
Cross-region replication justified by recovery requirement
Yes — cross-region replication is limited to audit logs, operational metrics, and the PII-filtered RDS replica described above; customer PII is not replicated (data sovereignty + reduced cross-region transfer carbon cost). Daily encrypted backup snapshots provide a secondary recovery path for the RDS primary.
Large data transfers scheduled to off-peak windows
Yes — nightly Snowflake export runs 02:00-04:00 UTC; weekly partner reconciliation transfers run Sunday 03:00 UTC; both deliberately scheduled when UK grid carbon intensity is lowest (per carbonintensity.org.uk historical data).
Confidentiality
Critical — exposure of customer financial data would trigger mandatory FCA notification, potential regulatory fines (up to 4% of annual turnover under GDPR), and severe reputational damage
Integrity
High — manipulation of transaction or balance data could lead to incorrect financial reporting and partner disputes
Availability
Critical — outage breaches CMA Open Banking mandate and SLA commitments to 25+ partners; estimated GBP 45,000/hour revenue impact
Non-Repudiation
High — inability to prove API request/response authenticity could undermine dispute resolution with partners and regulators
JWTs are signed (RS256) and optionally encrypted (A256GCM); token binding to mTLS certificate thumbprint prevents token replay; refresh tokens are single-use with rotation
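The token-binding check could look roughly like the sketch below: verify the RS256 signature, then compare the token's cnf/x5t#S256 claim against the thumbprint of the client certificate presented on this mTLS connection (RFC 8705 style). The library choice (Nimbus JOSE+JWT) is an assumption; the SAD does not name one.

```java
import com.nimbusds.jose.crypto.RSASSAVerifier;
import com.nimbusds.jwt.SignedJWT;

import java.security.MessageDigest;
import java.security.cert.X509Certificate;
import java.security.interfaces.RSAPublicKey;
import java.util.Base64;
import java.util.Map;

// Sketch: validate a certificate-bound access token.
public final class CertBoundTokenValidator {

    public static boolean isValid(String accessToken,
                                  RSAPublicKey issuerKey,
                                  X509Certificate clientCert) throws Exception {
        SignedJWT jwt = SignedJWT.parse(accessToken);
        if (!jwt.verify(new RSASSAVerifier(issuerKey))) {    // RS256 signature check
            return false;
        }
        Map<String, Object> cnf = jwt.getJWTClaimsSet().getJSONObjectClaim("cnf");
        if (cnf == null) {
            return false;                                    // not a bound token
        }
        String bound = (String) cnf.get("x5t#S256");
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(clientCert.getEncoded());            // cert thumbprint
        String actual = Base64.getUrlEncoder().withoutPadding().encodeToString(digest);
        return actual.equals(bound);                         // thumbprints must match
    }
}
```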
What are the session timeout and concurrency limits?
Internal: 30-minute idle timeout, 8-hour absolute; External: access tokens 5-minute absolute, no concurrency limits on stateless API access
Network segmentation
VPC with public, private, and isolated subnets across 2 AZs; security groups per service (allow only required ports/protocols); NACLs as secondary layer; EKS pods use Calico network policies for pod-to-pod segmentation
Ingress filtering
AWS WAF v2 (OWASP Top 10 rules, rate limiting, geo-restriction to permitted countries), Shield Advanced, API Gateway throttling; NLB in public subnet routes to API Gateway
Egress filtering
NAT Gateway with fixed Elastic IPs for outbound (partner webhooks, Featurespace); egress security groups restrict destinations to known endpoints; VPC Flow Logs for monitoring
Encryption in transit
TLS 1.3 enforced for partner API traffic; TLS 1.2 minimum for all other connections; certificates managed by AWS Certificate Manager (ACM) for public endpoints; private CA for internal mTLS
Security event logging
All API requests logged with partner ID, IP, timestamp, requested scopes, response status; authentication events (success/failure); authorisation decisions; admin actions. Logs forwarded to Splunk via Fluent Bit
SIEM integration
Splunk Enterprise (corporate instance) — all security events forwarded via HTTP Event Collector (HEC); custom Splunk correlation rules for anomaly detection
UC-01: Retrieve Account Balance
Trigger
Partner app sends GET /accounts/{accountId}/balance request
Pre-conditions
Partner has valid OAuth 2.0 access token with accounts:read scope; customer has granted consent to this TPP for this account
Main Flow
1. Partner sends HTTPS request with Bearer token and mTLS client certificate to API Gateway.
2. API Gateway validates request structure and routes to Auth Service.
3. Auth Service validates OAuth token, verifies mTLS certificate binding, and checks the consent record in PostgreSQL (a consent-check sketch follows this use case).
4. Auth Service returns authorisation decision to API Gateway.
5. API Gateway routes to Account Service.
6. Account Service checks Redis cache for balance (60s TTL).
7. Cache hit: return cached balance. Cache miss: Account Service queries core banking read replica via JDBC, caches the result, and returns the balance.
8. API Gateway returns JSON response to partner.
9. Audit event emitted to EventBridge.
Post-conditions
Partner receives account balance; audit log records the access; cache updated if miss occurred
Views Involved
Logical (services), Integration & Data Flow (API flow), Physical (EKS, RDS, Redis, Direct Connect), Data (account data, cache), Security (OAuth, mTLS, consent, audit)
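For illustration, step 3's consent lookup might reduce to a single indexed query. The sketch below assumes a hypothetical consent table and column names; the SAD does not publish the schema.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Sketch: does an active consent record link this TPP, account, and scope?
public class ConsentChecker {

    private static final String SQL = """
            SELECT 1 FROM consent
            WHERE tpp_id = ? AND account_id = ? AND scope = ?
              AND status = 'AUTHORISED' AND expires_at > now()
            """;

    public boolean hasActiveConsent(Connection db, String tppId,
                                    String accountId, String scope) throws SQLException {
        try (PreparedStatement ps = db.prepareStatement(SQL)) {
            ps.setString(1, tppId);
            ps.setString(2, accountId);
            ps.setString(3, scope);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next();  // a matching row means consent is active
            }
        }
    }
}
```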
UC-02: Rate Limit Exceeded
Actor(s)
Partner fintech application
Trigger
Partner exceeds 100 req/s rate limit
Pre-conditions
Partner is authenticated and making valid requests
Main Flow
1. Partner sends request to API Gateway.
2. API Gateway rate-limiting check identifies that the partner has exceeded the 100 req/s quota (a Redis-based sketch of this check follows the use case).
3. API Gateway returns HTTP 429 Too Many Requests with a Retry-After header.
4. Rate limit event logged and counted.
5. If sustained (> 5 min), Splunk alert triggers notification to the Partner Manager.
6. Notification Service sends email to the partner's registered technical contact.
Post-conditions
Partner receives 429 response; partner is notified; rate limit event logged for analysis
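The limit in step 2 is enforced at API Gateway, but a fixed-window equivalent in application code would look roughly like this; key naming and window handling are illustrative.

```java
import redis.clients.jedis.Jedis;

// Sketch: fixed one-second window per partner, backed by Redis INCR.
public class PartnerRateLimiter {

    private static final int LIMIT_PER_SECOND = 100;

    private final Jedis redis;

    public PartnerRateLimiter(Jedis redis) {
        this.redis = redis;
    }

    /** Returns true if the request may proceed, false if it should get HTTP 429. */
    public boolean allow(String partnerId) {
        String key = "rl:" + partnerId + ":" + (System.currentTimeMillis() / 1000);
        long count = redis.incr(key);
        if (count == 1) {
            redis.expire(key, 2);  // window key lives slightly past its second
        }
        return count <= LIMIT_PER_SECOND;
    }
}
```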
ADR-001: Amazon EKS over Amazon ECS for Container Orchestration
Context
The platform requires a container orchestration solution to run microservices. Both Amazon EKS (managed Kubernetes) and Amazon ECS (AWS-native container service) were evaluated.
Decision
Use Amazon EKS (Kubernetes).
Alternatives Considered
ECS Fargate: Lower operational overhead, but limited task-level networking control and no support for the Kubernetes service meshes (Istio, Linkerd) required for the sidecar-based mTLS mesh. ECS on EC2: More control but still lacks the Kubernetes ecosystem (Helm, Argo CD, Calico network policies). Self-managed Kubernetes on EC2: Maximum control but unacceptable operational burden for a 6-person platform team.
Consequences
Positive: rich ecosystem (Helm, Argo CD, Calico, Prometheus), strong portability to other clouds, existing team Kubernetes skills. Negative: higher operational complexity than ECS Fargate, Kubernetes version upgrade overhead every 12-14 months.
Quality Attribute Tradeoffs
Operational Excellence: increased complexity (negative) offset by richer observability tooling (positive). Reliability: Kubernetes self-healing (positive). Cost: slightly higher than ECS Fargate due to node management (negative). Portability: significantly better (positive).
ADR-002: PostgreSQL over DynamoDB for Primary Data Store
Status
Accepted
Date
2024-10-05
Context
The platform needs a primary data store for account metadata, transaction data, and consent records. The data is relational (accounts have transactions, consent links customers to TPPs and accounts) and requires strong consistency for financial accuracy.
Decision
Use Amazon RDS PostgreSQL 16.
Alternatives Considered
DynamoDB: Excellent scalability and operational simplicity, but poor fit for relational queries (joins across accounts/transactions/consent), no native support for field-level encryption patterns used for PII, and team has limited DynamoDB experience. Aurora PostgreSQL: Considered, but standard RDS PostgreSQL meets performance requirements at lower cost; Aurora’s distributed storage overhead is unnecessary at current data volumes.
Consequences
Positive: strong relational model for financial data, excellent ecosystem (pg_cron, pgcrypto for field-level encryption), team expertise, straightforward backup/recovery. Negative: vertical scaling limits (mitigated by read replicas and Redis caching), operational overhead of PostgreSQL tuning.
Quality Attribute Tradeoffs
Performance: adequate for 5,000 req/s with caching layer (neutral). Reliability: Multi-AZ provides HA (positive). Cost: lower than Aurora at current scale (positive). Portability: standard PostgreSQL, highly portable (positive).
ADR-003: Event-Driven Architecture for Notifications and Audit
Status
Accepted
Date
2024-10-08
Context
The platform must send notifications (partner webhooks, internal alerts, compliance emails) and write audit logs. These operations must not increase API response latency.
Decision
Use Amazon EventBridge with SQS for asynchronous notification and audit processing.
Alternatives Considered
Synchronous processing: Simple but adds 50-100ms to every API response for audit writes and notification dispatch; unacceptable for P95 < 200ms target. Amazon SNS + SQS: Works but lacks EventBridge’s content-based filtering and schema registry. Apache Kafka (MSK): Powerful but over-engineered for current throughput (5,000 events/s); operational overhead of Kafka cluster management not justified.
Consequences
Positive: API response latency unaffected by notification/audit processing, natural decoupling enables independent scaling of Notification Service, EventBridge schema registry aids contract evolution. Negative: eventual consistency for audit logs (acceptable: audit logs are written within seconds), added infrastructure complexity.
Quality Attribute Tradeoffs
Performance: significant improvement in P95 latency (positive). Reliability: event replay capability aids recovery (positive). Cost: EventBridge pricing is consumption-based, cost-effective at current volumes (positive). Operational Excellence: additional component to monitor (negative, mitigated by managed service).
What metrics are collected for capacity monitoring?
CPU utilisation, memory utilisation, pod count, HPA scaling events, RDS connections, RDS storage, Redis memory, API Gateway request count, EKS node count
How are capacity trends analysed?
Weekly automated report from Grafana (30-day trend); monthly capacity review meeting with SRE and Platform team; quarterly projection against growth model
Are capacity thresholds and alerts configured?
Yes — alerts at 70% (warning) and 85% (critical) for CPU, memory, storage, and connection pools
Is there a capacity planning process?
Yes — annual capacity plan updated quarterly; aligned with partner onboarding forecast from business development team
Is the application deployed across multiple hosting venues for continuity?
Yes — primary in eu-west-2 (London) with DR in eu-west-1 (Ireland) in an active-passive (warm standby) configuration
What is the DR strategy?
Active-passive (warm standby): the DR region runs an EKS cluster with minimum nodes (2), an RDS read replica (promoted during failover), and pre-configured EventBridge rules, and is scaled up during failover.
Are there data sovereignty requirements affecting geographic choices?
Yes — PII must remain in UK (eu-west-2). DR region stores non-PII data only. Failover for PII-containing services requires manual approval from Compliance.
Scaling approach
Full auto-scaling (Horizontal Pod Autoscaler on all services; Karpenter for EKS node auto-scaling)
Scaling details
HPA scales pods based on CPU (target 60%) and custom metrics (request queue depth). Karpenter provisions new Graviton nodes within 90 seconds. API Gateway scales automatically as a managed service. RDS: read replicas can be added; vertical scaling requires brief downtime (planned maintenance window). ElastiCache: cluster mode with automatic resharding.
Component failures: Each microservice runs 4+ replicas across 2 AZs; Kubernetes automatically reschedules failed pods. Pod disruption budgets ensure minimum 2 replicas during rolling updates.
Graceful degradation: If core banking is unavailable, Account Service returns cached data from Redis (with staleness indicator). If Featurespace ARIC is unavailable, Transaction Service returns full data without fraud scoring (with logged exception).
Circuit breaker patterns: Resilience4j circuit breakers on Core Banking Adapter (open after 5 consecutive failures, half-open after 30s) and Featurespace client (open after 3 failures, half-open after 15s); a configuration sketch follows this list.
Health checks: Kubernetes liveness probes (HTTP /health/live, 10s interval), readiness probes (HTTP /health/ready, 5s interval, checks DB connectivity). Failed readiness removes pod from service.
Testing practices: Monthly chaos testing with Gremlin (pod kill, AZ failure simulation, network latency injection). Quarterly DR failover drill. Annual game day exercise simulating multi-component failure.
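An illustrative Resilience4j configuration for the Core Banking Adapter breaker, combined with the cached-balance fallback from the graceful-degradation item above. The count-based window approximates "5 consecutive failures", and the fetch/fallback suppliers are stand-ins for the real calls.

```java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;

import java.time.Duration;
import java.util.function.Supplier;

// Sketch: breaker around the core banking call, falling back to cached data.
public class CoreBankingBreaker {

    private final CircuitBreaker breaker;

    public CoreBankingBreaker() {
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
                .slidingWindowType(CircuitBreakerConfig.SlidingWindowType.COUNT_BASED)
                .slidingWindowSize(5)                            // last 5 calls
                .failureRateThreshold(100)                       // all 5 must fail
                .waitDurationInOpenState(Duration.ofSeconds(30)) // half-open after 30s
                .build();
        this.breaker = CircuitBreaker.of("coreBanking", config);
    }

    public String balance(Supplier<String> coreBankingCall,
                          Supplier<String> cachedFallback) {
        Supplier<String> guarded = CircuitBreaker.decorateSupplier(breaker, coreBankingCall);
        try {
            return guarded.get();
        } catch (Exception e) {
            return cachedFallback.get();  // stale-but-available response, per the SAD
        }
    }
}
```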
Backup encryption
All backups encrypted with AWS KMS CMK (same key as source data); cross-region copies re-encrypted with region-specific CMK
Access control
Backup operations restricted to DBA IAM role and AWS Backup service role; snapshot sharing disabled; cross-account backup vault in isolated security account
Will the current design scale to accommodate projected growth?
Yes for 3-year horizon. At the 5-year mark, PostgreSQL vertical scaling may reach limits; migration to Aurora PostgreSQL or introduction of read replica sharding will be evaluated at the 3-year review.
Are there known seasonal or cyclical demand patterns?
Yes — 30% traffic increase on salary payment dates (25th-28th of month), 50% increase in January (financial year activities), and 20% reduction during UK bank holidays. Auto-scaling handles these patterns.
Compute selection and rightsizing
Graviton3 instances (m7g.xlarge) selected for best price-performance; pod resource requests set based on 6 months of production metrics; quarterly rightsizing review using AWS Compute Optimizer
Caching
Redis cache-aside pattern for account balances (60s TTL); API Gateway response caching for partner metadata (5-min TTL); DNS caching for internal service discovery (30s TTL)
Connection pooling
HikariCP connection pools per service: Account Service (max 20), Transaction Service (max 30); PgBouncer considered but not needed at current scale
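For reference, the Account Service pool settings above map to a HikariCP configuration along these lines; the JDBC endpoint, idle minimum, and timeout are assumptions.

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

// Sketch: Account Service pool sized per the figures above (max 20 connections).
public final class AccountServiceDataSource {

    public static HikariDataSource create() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:postgresql://cap-db:5432/cap"); // assumed endpoint
        config.setMaximumPoolSize(20);        // per the table above
        config.setMinimumIdle(5);             // assumed
        config.setConnectionTimeout(2_000);   // assumed: fail fast under load
        config.setPoolName("account-service");
        return new HikariDataSource(config);
    }
}
```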
Asynchronous processing
Audit logging and notifications fully asynchronous via EventBridge + SQS; no synchronous writes in API response path except primary query
Content delivery
Not applicable (API-only, no static assets); API Gateway edge-optimised endpoint provides global edge routing
Database optimisation
Composite indexes on frequently queried columns (account_id + date range); partitioned transaction table by month; EXPLAIN ANALYZE review for all new queries; pg_stat_statements monitoring for slow queries
Most cost-effective options intentionally not selected
[x]
Graviton instances are more cost-effective than x86 equivalents (20% saving); however, Multi-AZ RDS and Redis cluster mode were chosen for reliability over single-AZ (30% cost premium justified by Tier 1 criticality)
Yes — detailed cost modelling performed using AWS Pricing Calculator and validated against 6 months of production billing data. TCO comparison conducted against legacy PIL (on-premises Oracle SOA Suite) showing 45% reduction in total annual operating cost.
No — the design fully meets all requirements. The primary cost decision was reserving capacity (1-year reserved instances for RDS and ElastiCache) which reduced annual cost by GBP 38,000 compared to on-demand pricing.
Reserved capacity
1-year reserved instances for RDS (db.r7g.xlarge) and ElastiCache (cache.r7g.large); EKS nodes use Savings Plans (1-year, partial upfront)
Rightsizing reviews
Monthly review of AWS Compute Optimizer recommendations; quarterly review of pod resource requests vs actual utilisation
Waste elimination
Automated shutdown of dev and test EKS clusters at 19:00 weekdays and all weekend (Lambda-based scheduler); Spot instances for non-production node groups
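The Lambda-based scheduler might be as small as the sketch below, which stops a list of hypothetical non-prod RDS instances when invoked by an EventBridge schedule; the real job also scales down the EKS clusters.

```java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import software.amazon.awssdk.services.rds.RdsClient;
import software.amazon.awssdk.services.rds.model.StopDbInstanceRequest;

import java.util.List;

// Sketch: scheduled Lambda pausing non-prod RDS instances out of hours.
public class NonProdShutdownHandler implements RequestHandler<Object, String> {

    private static final List<String> NON_PROD_DBS =
            List.of("cap-dev-db", "cap-test-db");   // hypothetical identifiers

    private final RdsClient rds = RdsClient.create();

    @Override
    public String handleRequest(Object event, Context context) {
        for (String id : NON_PROD_DBS) {
            rds.stopDBInstance(StopDbInstanceRequest.builder()
                    .dbInstanceIdentifier(id)
                    .build());                       // paused until the morning start job
        }
        return "stopped " + NON_PROD_DBS.size() + " instances";
    }
}
```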
Budget governance
AWS Budget alerts at 80% and 100% of monthly forecast; approval required from Platform Lead for any change > GBP 500/month
Has the hosting location been chosen to reduce environmental impact?
Partially — eu-west-2 (London) was chosen primarily for data sovereignty, but AWS London region operates at a lower carbon intensity than some other European regions. AWS is committed to 100% renewable energy by 2025 for all regions.
What is the expected workload demand pattern?
Variable — significant peaks during UK business hours (08:00-18:00) and month-end; lower demand evenings and weekends
Yes — regulatory obligation for 24x7 availability (Open Banking). However, traffic drops significantly outside UK business hours.
Can the solution be shut down or scaled down during off-peak hours?
Partially — auto-scaling reduces pod count during off-peak (minimum 2 replicas maintained for HA); EKS nodes scale down from 8 to 4 overnight
Are non-production environments configured to downscale or shut down when not in use?
Yes — dev and test clusters shut down at 19:00 weekdays and fully off at weekends (saves approximately GBP 3,000/month); staging runs 24x7 only during release weeks
How do the language and framework choices contribute to efficiency?
Java 21 with virtual threads (Project Loom) reduces memory overhead for concurrent request handling; GraalVM Native Image evaluated but deferred due to reflection-heavy Spring Boot framework
Has the code been optimised for the target platform and workload?
Yes — connection pooling (HikariCP), efficient JSON serialisation (Jackson with afterburner module), lazy database fetching to avoid unnecessary data transfer
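Registering the Afterburner module is a one-liner; a minimal standalone version is shown below, though in the real services the ObjectMapper would be Spring-managed.

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.module.afterburner.AfterburnerModule;

// Sketch: Afterburner swaps reflective property access for generated bytecode.
public final class JsonConfig {

    public static ObjectMapper objectMapper() {
        ObjectMapper mapper = new ObjectMapper();
        mapper.registerModule(new AfterburnerModule()); // faster ser/deser
        return mapper;
    }
}
```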
Are efficient algorithms and data structures used?
Yes — database queries use indexed lookups; pagination enforced on all list endpoints to prevent unbounded result sets; Redis cache reduces redundant core banking queries by 85%
Is the number of vCPU hours per job/request minimised?
Yes — average request processing time is 15ms CPU time; async offloading of audit/notification reduces per-request compute by approximately 40% compared to synchronous design
Is data held close to compute to reduce network transfer?
Yes — Redis cache co-located in same VPC/AZ as application pods; PostgreSQL in same region; core banking read replicas in same AWS region connected via Direct Connect
Are data replicas minimised?
Replicas are justified: RDS Multi-AZ (HA requirement), Redis replicas (HA), DR read replica (regulatory DR requirement). No unnecessary copies.
Is old or unused data removed to reduce storage?
Yes — S3 lifecycle policies transition audit logs through the archive tiers (see storage tiering above) and expire them once the 7-year retention obligation is met; transaction data is purged at the end of its 7-year retention period; Redis TTLs evict stale cache entries
Are efficient data formats and compression used?
Yes — Brotli compression on API responses (see above; ~70% reduction on JSON payloads); PostgreSQL TOAST compression for large text fields; S3 objects compressed before archival
Are jobs prioritised and distributed to optimise resource usage?
Yes — nightly batch jobs (data replication from core banking) scheduled during off-peak hours (02:00-05:00 UTC) to use capacity freed by auto-scaling
Are efficient networking patterns used?
Yes — VPC endpoints for S3, SQS, EventBridge, and Secrets Manager to avoid NAT Gateway charges and Internet transit; Direct Connect for high-volume core banking traffic
Yes — all microservices are developed internally by the API Team.
Source control platform
GitHub Enterprise (MFS organisation)
CI/CD platform
GitHub Actions (corporate standard)
Build automation
GitHub Actions workflows triggered on push and PR; Maven builds for Java services, npm for Node.js; Docker multi-stage builds for container images
Deployment automation
Argo CD (GitOps) for Kubernetes deployments; Terraform for infrastructure changes; Helm charts for all services
Test automation
Unit tests (JUnit 5, Jest), integration tests (Testcontainers), contract tests (Pact), security scanning, and container image scanning — all in CI pipeline
Migration pattern
Strangler Fig — partner traffic gradually migrated from legacy PIL to new CAP using API Gateway routing rules; both systems run in parallel during transition
Data migration mode
Continuous Sync — core banking data replicated to CAP’s PostgreSQL via CDC; no bulk data migration required (CAP reads from core banking, not PIL)
Data migration method
CDC (Change Data Capture) from the core banking Oracle database to PostgreSQL: Debezium + Kafka Connect for the initial setup, Oracle GoldenGate for steady-state replication
Data volume to migrate
0 GB (no data migrated from PIL; CAP builds its own data store from core banking source)
End-user cutover approach
Phased — partners migrated individually over 3-month window; each partner given 4-week notice and 2-week parallel-run period
External system cutover
Phased — partners cut over individually; legacy PIL endpoints deprecated with 6-month sunset notice
Maximum acceptable downtime
Zero — parallel run ensures no downtime; partners switch DNS/config to new endpoints at their convenience during migration window
Rollback plan
API Gateway routing rules can redirect traffic back to legacy PIL within 5 minutes; partner-specific rollback possible without affecting other partners
Acceptance criteria
All 8 legacy partners migrated and confirmed; PIL traffic at zero for 30 consecutive days; PIL decommission approval from all stakeholders
Transient infrastructure needed?
Yes — Debezium + Kafka Connect cluster for initial CDC setup (decommissioned after steady-state CDC established via direct Oracle-to-PostgreSQL replication)
Release cadence
Weekly (every Tuesday); hotfixes as needed (emergency change process)
Release process
Feature branch → PR (automated tests + 2 approvals) → merge to main → automated deploy to staging → manual approval gate → blue-green deploy to production via Argo CD
Release validation
Automated smoke tests post-deploy (5-minute suite); canary analysis (10% traffic for 15 minutes); automated rollback if error rate > 0.1%
Feature flags / toggles
LaunchDarkly for feature flags; used for partner-specific feature rollouts and kill switches for new functionality
Support tiers
L1: MFS Service Desk (basic triage); L2: SRE team (6 engineers, dedicated to CAP and 2 other platform services); L3: API development team; L4: Solution Architect / CTO
Support hours
24x7 (SRE on-call rotation); development team: UK business hours (09:00-17:30) with on-call for P1 escalations
Karpenter scale-to-zero on dev/stage EKS clusters 19:00-07:00 weekdays + all weekend; non-prod RDS paused via Lambda cron; enforced by AWS Config rule (alerts FinOps if a non-prod resource runs continuously > 24h without a documented exception)
Periodic right-sizing review cadence
Quarterly via AWS Compute Optimizer + Datadog. Last review (Q1 2026) downgraded 18 over-provisioned pods, recovering ~GBP 2,400/month
Unused / orphaned resource reclamation
Weekly Lambda job tags resources idle > 14 days; FinOps reviews and confirms before deletion. Scope: snapshots, EBS volumes, ELB targets, unused security groups
Carbon footprint reported alongside cost
Yes — monthly FinOps review includes AWS Customer Carbon Footprint Tool output; reported to ARB and Sustainability committee quarterly
Yes — decommissioning runbook requires Terraform destroy + S3 bucket emptying + KMS key scheduled-deletion; CMDB entry marked Retired only after AWS Cost Explorer confirms zero spend for 30 days
EKS cluster and node groups are always running (managed by Karpenter auto-scaling).
RDS PostgreSQL instances are always running (Multi-AZ).
ElastiCache Redis cluster is always running (cluster mode).
Kubernetes deployments are managed by Argo CD; pods start in order: Auth Service first (dependency for other services), then Account Service and Transaction Service (parallel), then Notification Service.
Kubernetes readiness probes ensure services are only added to the load balancer after successful health checks (database connectivity, Redis connectivity, configuration loaded).
API Gateway is always available (managed service); no start-up required.
Full start-up from cold (e.g., after a DR failover scale-up) takes approximately 8 minutes.
Software lifecycle management
EKS: upgraded within 60 days of new minor release; RDS PostgreSQL: minor versions applied in monthly maintenance window; Java/Node.js: upgraded within 90 days of LTS release; all dependencies tracked by Snyk
Hardware lifecycle management
N/A — all cloud-managed; Graviton instance generations reviewed annually for cost/performance improvements
Expected solution lifespan
7-10 years; major architecture review planned at 5 years (2030)
End-of-life triggers
Replacement by next-generation API platform; regulatory change removing Open Banking obligation (unlikely); AWS service deprecation
Decommissioning blockers
25+ partner integrations dependent on the platform; 7-year audit log retention obligation
Data disposal
Customer data: secure deletion from RDS (NIST 800-88 compliant); audit logs: retained in S3 Glacier until 7-year obligation met, then lifecycle-expired; encryption keys: scheduled for deletion after data disposal
Infrastructure disposal
Terraform destroy for all AWS resources; DNS records removed; IAM roles deleted; GitHub repositories archived (not deleted, for audit trail)
Application portability
All microservices are containerised with standard Kubernetes manifests (Helm charts); PostgreSQL is standard (no AWS-specific extensions); data exportable via pg_dump; audit logs in S3 exportable via standard S3 API
Data portability
PostgreSQL: pg_dump/pg_restore to any PostgreSQL host; S3 audit logs: standard object download; Redis: cache can be rebuilt from source data (no persistent data); EventBridge schemas documented in JSON Schema
Vendor lock-in assessment
Overall: Low-Moderate. Primary lock-in is AWS IAM/KMS (High) and EventBridge (Moderate). All other components use standard, portable technologies. Estimated exit effort: 3-4 months for a 6-person team.
| Risk | Response | Mitigation | Residual Risk | Review Date |
|------|----------|------------|---------------|-------------|
| R-001 | Mitigate | Contract testing against core banking schema (Pact); advance notification agreement with DBA team (60-day notice for schema changes); schema compatibility layer in Core Banking Adapter | Medium | 2025-11-01 |
| R-002 | Mitigate | Self-service partner onboarding portal (Phase 2, delivered); automated API key provisioning; partner onboarding runbook; escalation to additional support resource if queue > 5 partners | Low | 2025-11-01 |
| R-003 | Mitigate | Snyk continuous monitoring with P1 alert on critical CVEs; pre-built patched base images maintained in ECR; emergency deployment pipeline (bypasses staging for security patches); rollback capability | Medium | 2025-11-01 |
| R-004 | Accept (with mitigation) | Active-passive DR in eu-west-1; quarterly DR drills; RTO 1 hour validated through testing; accept 15-minute RPO for async replication lag | | |
Does the design materially change the organisation’s technology risk profile?
No — the design reduces risk by replacing unsupported legacy middleware with a modern, actively maintained platform. The introduction of cloud-hosted customer data is covered by the existing AWS risk assessment (MFS-TRA-2023-012).
If yes, has this been evaluated with Risk and Controls teams?
N/A — as noted above, the design does not materially change the risk profile, so no additional evaluation was required.
This SAD was assessed at Comprehensive depth. The scores below reflect a mature, well-documented architecture for a Tier 1 Critical, regulated financial services platform.
| Section | Score | Justification |
|---------|-------|---------------|
| 0. Document Control | 5 | Full version history, multiple contributors and approvers, clear scope, related documents referenced |
| 1. Executive Summary | 5 | Clear business drivers with priority, strategic alignment with reuse assessment, current-state architecture documented, business criticality justified with revenue impact |
| 2. Stakeholders & Concerns | 5 | Comprehensive stakeholder register including external parties, concerns matrix fully mapped to sections, regulatory context with five applicable regulations |
| 3.1 Logical View | 5 | Full component decomposition with technology choices, design patterns documented with rationale, vendor lock-in assessed for all components, service-to-capability mapping complete |
| 3.2 Integration & Data Flow | 5 | All internal and external integrations documented with protocols and authentication, API contracts versioned, end-user access patterns documented, SLAs defined per interface |
| 3.3 Physical View | 5 | Deployment diagram described, compute fully specified (Graviton instances, pod sizing), full networking documented including Direct Connect, environments listed with sizing, security agents deployed |
| 3.4 Data View | 5 | All data stores classified with retention and encryption, field-level encryption for PII, data sovereignty addressed with cross-region filtering, DPIA completed, data integrity controls evidenced |
| 3.5 Security View | 5 | STRIDE threat model with 7 threats and mitigations, comprehensive IAM (internal + external + privileged), mTLS and OAuth 2.0 FAPI, HSM-backed encryption, SIEM integration with correlation rules |
| 3.6 Scenarios | 5 | Three architecturally significant use cases crossing all views, three ADRs with alternatives and quality attribute tradeoffs |
| 4.1 Operational Excellence | 5 | Centralised logging with Splunk, Grafana dashboards, PagerDuty alerting with escalation, Jaeger distributed tracing, comprehensive runbooks, capacity planning process |
| 4.2 Reliability | 5 | Multi-AZ with active-passive DR, RTO 1 hr / RPO 15 min validated through quarterly testing, chaos testing with Gremlin, fault tolerance with circuit breakers, immutable backups |
| 4.3 Sustainability | 4 | Graviton instances for energy efficiency, non-prod auto-shutdown, auto-scaling for demand matching. Score reduced from 5: no carbon metrics baselined, no formal sustainability KPIs |
| 5. Lifecycle | 5 | Full CI/CD with security scanning, Strangler Fig migration plan, test strategy covering all types, weekly releases with blue-green and canary, team skills assessed, exit plan documented |