feat(ecs_service): native ECS deployment strategies (rolling/blue_green/linear/canary)#5
feat(ecs_service): native ECS deployment strategies (rolling/blue_green/linear/canary)#5mabadir wants to merge 16 commits into
Conversation
…en/linear/canary) Replace the CodeDeploy/external-controller blue/green wiring with the ECS deployment controller's built-in strategies: - deployment_type now accepts rolling | blue_green | linear | canary; the deployment controller is always ECS (CODE_DEPLOY mapping removed) - new deployment_strategy_config seeds bake time + canary/linear tuning at create time; deployment_configuration is in ignore_changes because the Flightcontrol deploy manager passes the authoritative config (including pause lifecycle hooks) on every UpdateService call - native strategies wire load_balancer.advanced_configuration: alternate target group (tg-2), production listener rule (first ALB rule or the NLB listener), optional test_listener_rule_arn, and a new ECS infrastructure role (AmazonECSInfrastructureRolePolicyForLoadBalancers) - target groups tg-1/tg-2 are now production/alternate and gate on any native traffic-shift strategy, not just blue_green; outputs renamed accordingly (production_/alternate_target_group_*) and ecs_infrastructure_role_arn added - provider floor bumped to aws >= 6.21 (linear/canary configuration) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Always provision the production/alternate target-group pair, the ECS infrastructure role, and the service's load_balancer advanced_configuration whenever a load balancer is attached — not just for native traffic-shift deployment_types. Rolling deployments serve from the production target group only, so any service can now switch between rolling / blue_green / linear / canary on a single UpdateService call with zero Terraform changes; deployment_type is purely the create-time seed. Also fixes the module test suite: mock the iam_policy_document / partition / region / vpc data sources and computed ARNs so all 11 runs execute (basic_service was failing on main and skipping the rest). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| production_listener_rule = ( | ||
| local.enable_nlb_listener | ||
| ? aws_lb_listener.nlb[0].arn | ||
| : aws_lb_listener_rule.alb["0"].arn | ||
| ) |
There was a problem hiding this comment.
Only the first ALB listener rule participates in traffic shifting
production_listener_rule is hardcoded to alb["0"].arn. The AWS ECS advanced_configuration API only accepts a single production listener rule ARN, so during blue_green/linear/canary deployments ECS rewrites only this one rule. Any additional listener rules (alb["1"], alb["2"], …) continue forwarding to tg_1 for the entire deployment — traffic on those rules never shifts to the new revision. Consider adding a validation block that requires length(var.load_balancer_attachment.listener_rules) <= 1 when using a native traffic-shift strategy, or document this constraint prominently so callers with multi-rule services aren't caught off guard.
Prompt To Fix With AI
This is a comment left during a code review.
Path: compute/ecs_service/ecs_service.tf
Line: 102-106
Comment:
**Only the first ALB listener rule participates in traffic shifting**
`production_listener_rule` is hardcoded to `alb["0"].arn`. The AWS ECS `advanced_configuration` API only accepts a single production listener rule ARN, so during blue_green/linear/canary deployments ECS rewrites only this one rule. Any additional listener rules (`alb["1"]`, `alb["2"]`, …) continue forwarding to tg_1 for the entire deployment — traffic on those rules never shifts to the new revision. Consider adding a validation block that requires `length(var.load_balancer_attachment.listener_rules) <= 1` when using a native traffic-shift strategy, or document this constraint prominently so callers with multi-rule services aren't caught off guard.
How can I resolve this? If you propose a fix, please make it concise.| dynamic "load_balancer" { | ||
| for_each = local.enable_load_balancer && var.deployment_type == "blue_green" ? [1] : [] | ||
| for_each = local.enable_load_balancer ? [1] : [] | ||
| content { | ||
| target_group_arn = aws_lb_target_group.tg_1[0].arn | ||
| container_name = local.lb_container_name | ||
| container_port = local.lb_container_port | ||
|
|
||
| advanced_configuration { | ||
| alternate_target_group_arn = aws_lb_target_group.tg_2[0].arn | ||
| production_listener_rule = ( | ||
| local.enable_nlb_listener | ||
| ? aws_lb_listener.nlb[0].arn | ||
| : aws_lb_listener_rule.alb["0"].arn | ||
| ) | ||
| test_listener_rule = var.test_listener_rule_arn | ||
| role_arn = aws_iam_role.ecs_infrastructure[0].arn | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
advanced_configuration wired unconditionally for rolling deployments — not validated against real AWS API
The whole design bet is that AWS silently accepts advanced_configuration (alternate TG, production listener rule, infrastructure role) when deployment_configuration is absent (the rolling default). All tests run with command = plan against mocked providers, so the AWS API's acceptance of this combination has not been exercised. If AWS rejects advanced_configuration on a CreateService call that carries no deployment_configuration, every rolling service with a load balancer (the module default) would fail to provision. A real tofu apply against AWS — even in a scratch account — should be added to the test suite to confirm this assumption before merging.
Prompt To Fix With AI
This is a comment left during a code review.
Path: compute/ecs_service/ecs_service.tf
Line: 93-111
Comment:
**`advanced_configuration` wired unconditionally for rolling deployments — not validated against real AWS API**
The whole design bet is that AWS silently accepts `advanced_configuration` (alternate TG, production listener rule, infrastructure role) when `deployment_configuration` is absent (the rolling default). All tests run with `command = plan` against mocked providers, so the AWS API's acceptance of this combination has not been exercised. If AWS rejects `advanced_configuration` on a `CreateService` call that carries no `deployment_configuration`, every rolling service with a load balancer (the module default) would fail to provision. A real `tofu apply` against AWS — even in a scratch account — should be added to the test suite to confirm this assumption before merging.
How can I resolve this? If you propose a fix, please make it concise.| output "target_group_arn" { | ||
| description = "The ARN of the target group (null if load balancer disabled or blue/green deployment)." | ||
| value = local.enable_load_balancer && var.deployment_type == "rolling" ? aws_lb_target_group.this[0].arn : null | ||
| description = "The ARN of the production target group the service serves from (alias of production_target_group_arn; null if load balancer disabled)." | ||
| value = local.enable_load_balancer ? aws_lb_target_group.tg_1[0].arn : null | ||
| } | ||
|
|
||
| output "target_group_arn_suffix" { | ||
| description = "The ARN suffix of the target group for CloudWatch metrics." | ||
| value = local.enable_load_balancer && var.deployment_type == "rolling" ? aws_lb_target_group.this[0].arn_suffix : null | ||
| description = "The ARN suffix of the production target group for CloudWatch metrics." | ||
| value = local.enable_load_balancer ? aws_lb_target_group.tg_1[0].arn_suffix : null | ||
| } |
There was a problem hiding this comment.
target_group_name dropped without a backward-compatible alias
target_group_arn was deliberately kept as an alias for production_target_group_arn, but target_group_name — the natural companion — was removed entirely. Callers that reference module.xxx.target_group_name (e.g., in CloudWatch dashboards or external scripts) will get a Terraform error on upgrade. This is not listed as a named breaking change in the PR description. Adding output "target_group_name" { value = local.enable_load_balancer ? aws_lb_target_group.tg_1[0].name : null } would match the alias pattern used for the ARN.
Prompt To Fix With AI
This is a comment left during a code review.
Path: compute/ecs_service/outputs.tf
Line: 96-104
Comment:
**`target_group_name` dropped without a backward-compatible alias**
`target_group_arn` was deliberately kept as an alias for `production_target_group_arn`, but `target_group_name` — the natural companion — was removed entirely. Callers that reference `module.xxx.target_group_name` (e.g., in CloudWatch dashboards or external scripts) will get a Terraform error on upgrade. This is not listed as a named breaking change in the PR description. Adding `output "target_group_name" { value = local.enable_load_balancer ? aws_lb_target_group.tg_1[0].name : null }` would match the alias pattern used for the ARN.
How can I resolve this? If you propose a fix, please make it concise.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
- Reject >1 ALB listener rule for native traffic-shift strategies: advanced_configuration only rewrites a single production listener rule, so extra rules would keep serving the old revision for the whole deployment. Documented the constraint on listener_rules since the strategy is a per-deployment decision the deploy manager can change without Terraform. - Restore the target_group_name output as an alias of production_target_group_name, matching the target_group_arn alias. - Make TestEcsServiceWithAlb assert the deployed rolling service carries advanced_configuration (alternate TG, production listener rule, infrastructure role), validating that the real AWS API accepts it on CreateService without a deployment_configuration. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Ravion Module Publish PlanDry run only. No Ravion API mutations were made.
Diffsrvn-ecs-cluster 0.1.1 -> 0.2.0--- remote
+++ compiled
- **Public and private Network Load Balancers** for TCP/UDP and static IP use cases
- **CloudWatch Container Insights** dashboard metrics for production visibility
- Terraform source: [flightcontrolhq/modules/compute/ecs_cluster](https://github.com/flightcontrolhq/modules/tree/rvn-ecs-cluster@0.1.1/compute/ecs_cluster)
+ Terraform source: [flightcontrolhq/modules/compute/ecs_cluster](https://github.com/flightcontrolhq/modules/tree/rvn-ecs-cluster@0.2.0/compute/ecs_cluster)
## Use cases
@@
Select an **EC2 instance type** to enable EC2 capacity. Leave it blank for a Fargate-only cluster.
+ ### Default capacity provider
+
+ Services that do not set their own capacity provider strategy use the cluster default. AWS does not allow mixing Fargate and EC2 capacity providers in that default, so it commits to a single one. Leave **Default capacity provider** blank to pick automatically (EC2 when enabled, then Fargate, then Fargate Spot), or choose one of the enabled providers explicitly.
+
## Load balancers
### Public application load balancer
@@
| Fargate | No | `true` | Allow services to use Fargate capacity |
| Fargate Spot | No | `true` | Allow services to use lower-cost interruptible Fargate Spot |
| EC2 instance type | No | — | Enables EC2 capacity when selected |
+ | Default capacity provider | No | Automatic | Provider used by services without their own strategy |
| Public load balancer | No | `false` | Create an internet-facing ALB |
| Private load balancer | No | `false` | Create an internal ALB |
| Public NLB | No | `false` | Create an internet-facing Network Load Balancer |
@@
base_path: compute/ecs_cluster
branch: main
execution_environment_id: << module.input.execution_environment_id >>
- ref: rvn-ecs-cluster@0.1.1
+ ref: rvn-ecs-cluster@0.2.0
repo: https://github.com/flightcontrolhq/modules
stack_id: <<stack.id>>
terraform_variables:
...overrides: << module.input.advanced_terraform_variables >>
+ capacity_provider_default: << module.input.capacity_provider_default >>
container_insights_enabled: << module.input.container_insights_enabled >>
ec2_ami_id: << module.input.ec2_ami_id >>
ec2_instance_type: << module.input.ec2_instance_type >>rvn-ecs-web 0.3.0 -> 0.4.0--- remote
+++ compiled
queue_overflow: oldest
queue_size: 1
infrastructure:
+ ecs_alternate_target_group_arn: <<stack.output.alternate_target_group_arn>>
ecs_cluster_arn: <<stack.output.service_cluster>>
+ ecs_infrastructure_role_arn: <<stack.output.ecs_infrastructure_role_arn>>
+ ecs_production_listener_rule_arn: <<stack.output.production_listener_rule_arn>>
ecs_service_arns:
- <<stack.output.service_arn>>
+ ecs_test_listener_rule_arn: <<stack.output.test_listener_rule_arn>>
inputs:
- description: Pass the image tag or digest to deploy. For Nixpacks or Dockerfile builds, this is resolved in the Ravion-created ECR repository. For Prebuilt image from registry mode, this is resolved in the repository configured on the module. Do not pass a full image URI.
id: image_ref
@@
max: 50000
min: 1
type: number
+ - collapsible: true
+ default: false
+ description: Create a dedicated ALB listener rule that routes test traffic to the green (alternate) target group during blue/green, linear, and canary deployments, so the new revision can be validated before production traffic shifts. Has no effect for rolling deployments.
+ id: green_alb_listener_rule_enabled
+ label: Create a listener rule to test green deployments
+ type: boolean
+ - collapsible: true
+ default: X-Ravion-Test
+ description: HTTP header that routes a request to the green (test) target group during a deployment. Send this header with the value below to reach the new revision before production traffic shifts. Requests without it keep hitting production.
+ id: test_header_name
+ label: Test traffic header name
+ required: true
+ show_when:
+ green_alb_listener_rule_enabled: true
+ type: string
+ - collapsible: true
+ default: "1"
+ description: Value the test traffic header must carry to be routed to the green (test) target group.
+ id: test_header_value
+ label: Test traffic header value
+ required: true
+ show_when:
+ green_alb_listener_rule_enabled: true
+ type: string
- id: section_task
label: Container resources
type: section
@@
The module is intentionally focused on web services behind an Application Load Balancer. It uses the selected ECS cluster to inherit AWS account, region, VPC, subnets, capacity providers, load balancer listeners, and load balancer security groups.
- Terraform source: [flightcontrolhq/modules/compute/ecs_service](https://github.com/flightcontrolhq/modules/tree/rvn-ecs-web@0.3.0/compute/ecs_service)
+ Terraform source: [flightcontrolhq/modules/compute/ecs_service](https://github.com/flightcontrolhq/modules/tree/rvn-ecs-web@0.4.0/compute/ecs_service)
## Use cases
@@
base_path: compute/ecs_service
branch: main
execution_environment_id: << module.input.execution_environment_id >>
- ref: rvn-ecs-web@0.3.0
+ ref: rvn-ecs-web@0.4.0
repo: https://github.com/flightcontrolhq/modules
stack_id: <<stack.id>>
terraform_variables:
@@
execution_role_arn: << module.input.execution_role_arn || nil >>
execution_role_policies: << module.input.execution_role_policies >>
force_new_deployment: << module.input.force_new_deployment >>
+ green_alb_listener_rule_enabled: << module.input.green_alb_listener_rule_enabled >>
health_check_grace_period_seconds: << module.input.health_check_grace_period_seconds >>
launch_type: '<< module.input.capacity_provider == "ec2" ? "EC2" : "FARGATE" >>'
load_balancer_attachment:
@@
task_role_arn: << module.input.task_role_arn || nil >>
task_role_inline_policies: << module.input.task_role_inline_policies >>
task_role_policies: << module.input.task_role_policies >>
+ test_header_name: << module.input.test_header_name >>
+ test_header_value: << module.input.test_header_value >>
volumes: '<< module.input.volumes != nil ? map(module.input.volumes, #.volume_type == "efs" ? {"name": #.name, "efs_volume_configuration": #.efs_volume_configuration} : {"name": #.name, "docker_volume_configuration": #.docker_volume_configuration}) : [] >>'
vpc_id: << module.input.vpc_id >>
wait_for_steady_state: false |
AWS rejects explicit from_port/to_port when ip_protocol is "-1", which broke updates to existing rules (the sg-module refactor changed the ECS instance ingress rule from -1/-1 to 0/0). Null the ports in the shared security-groups module whenever the protocol is "-1" or "all", and use -1 in the ecs_cluster caller for clarity. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
# Conflicts: # compute/ecs_cluster/README.md # compute/ecs_cluster/locals.tf # compute/ecs_cluster/rvn-ecs-cluster-definition.yml # compute/ecs_service/README.md # compute/ecs_service/rvn-ecs-web-definition.yml
777e8bd to
bd9ac38
Compare
Summary
Replaces the dual-service / CodeDeploy-era blue/green design with the native ECS deployment controller strategies:
rolling,blue_green,linear, andcanary— and makes the strategy a per-deployment decision that requires no Terraform changes to switch.Native strategies (b4ee71a)
deployment_controller.type = "ECS"always — the CodeDeploy/EXTERNAL mapping is removeddeployment_typevalueslinear/canaryalongsiderolling/blue_greendeployment_strategy_configvariable (bake time, canary %, linear step %) that seedsdeployment_configurationat create timetest_listener_rule_arnvariable for blue/green test-traffic validation (TEST_TRAFFIC_SHIFTstages)AmazonECSInfrastructureRolePolicyForLoadBalancers— ECS assumes it to rewrite listener rules and (de)register targets during traffic shiftsload_balancer.advanced_configurationwiring (production/alternate target group, production + optional test listener rule, role)deployment_configurationadded toignore_changes: the Flightcontrol deploy manager passes the authoritative configuration (strategy, bake times, pause lifecycle hooks) on everyUpdateServicecall, so Terraform must not fight itlinear_configuration/canary_configuration)Strategy is per-deployment (3c72a84)
Whenever a load balancer is attached, the module now always provisions:
tg-1) + alternate (tg-2) target-group pair (the rolling-onlyaws_lb_target_group.thisis removed)advanced_configurationRolling deployments serve from the production target group only and never touch the alternate, so any service can switch between all four strategies on a single
UpdateServicecall.deployment_typeis purely the create-time seed.Trade-off: every load-balanced rolling service carries one idle target group and one unused IAM role.
Outputs
production_target_group_arn/alternate_target_group_arn(+ names),ecs_infrastructure_role_arntarget_group_arnretained as an alias of the production target groupTests
Fixed the module test suite so it actually runs:
basic_servicewas failing onmain(mock provider emits invalid JSON policies / ARNs), which skipped every other run. Addedmock_data/mock_resourcedefaults and fixed two stale assertions.tofu validate✅ —tofu test✅ 11/11 passedBreaking changes
deployment_controlleris alwaysECS; services created with the old CodeDeploy blue/green mapping are not migrated (pre-GA, clean removal)aws_lb_target_group.thisremoved — existing load-balanced services will see target-group recreation on upgrade🤖 Generated with Claude Code
Greptile Summary
This PR replaces the CodeDeploy/dual-service blue-green design with native ECS deployment controller strategies (rolling, blue_green, linear, canary) and restructures target-group provisioning so the strategy is a per-deployment decision rather than a Terraform-level choice.
load_balancer.advanced_configurationare provisioned for every LB-attached service, regardless ofdeployment_type;deployment_configurationandload_balancerare placed inignore_changesso the Flightcontrol deploy manager owns the authoritative configuration.deployment_strategy_config,test_listener_rule_arn, and new named outputs (production_target_group_arn,alternate_target_group_arn,ecs_infrastructure_role_arn) alongside backward-compatible aliases for existing callers;deployment_typenow acceptslinearandcanaryin addition torollingandblue_green.canary_deploymentandlinear_deploymenttest cases added (all plan-only against mocked providers).Confidence Score: 3/5
Two unresolved questions about real-world AWS API behavior make this risky to merge without additional validation.
The core architectural bet — that AWS accepts
advanced_configurationon aCreateServicecall wheredeployment_configurationis absent (rolling default) — is not exercised by any real apply; all 11 tests run against mocked providers withcommand = plan. If AWS rejects this combination, every load-balanced rolling service (the module default) would fail to provision. Separately,production_listener_ruleis hardcoded to the first ALB rule, so services with multiple listener rules silently get incomplete traffic shifting during blue_green/linear/canary deployments without any validation or warning. The infrastructure and IAM changes themselves are clean and the structural refactoring is well-thought-out.compute/ecs_service/ecs_service.tf — the
advanced_configurationblock and the hardcodedalb["0"]reference are the two areas that need the most scrutiny before this lands in production.Important Files Changed
advanced_configurationnow always wired for every LB-attached service;production_listener_rulehardcoded to first ALB rule (alb["0"]), which silently excludes additional rules from traffic shiftingaws_lb_target_group.this; tg_1/tg_2 now always provisioned when LB is attached — clean and consistentAmazonECSInfrastructureRolePolicyForLoadBalancers; correctly gated onenable_load_balancer; trust policy and attachment look correcttarget_group_arnretained as alias;target_group_namedropped without a backward-compatible alias despite being the matching pair to the retained ARN outputdeployment_strategy_config,test_listener_rule_arnvariables and expandeddeployment_typevalidation look correct;optional()defaults are saneplanonly with mocked providers, so no real AWS API validation is exercisedFlowchart
%%{init: {'theme': 'neutral'}}%% flowchart TD A[terraform apply\necs_service module] --> B{enable_load_balancer?} B -- No --> C[No TGs, no\nadvanced_configuration] B -- Yes --> D[Create tg-1 production\n+ tg-2 alternate] D --> E[Create ECS\ninfrastructure IAM role] E --> F[Wire advanced_configuration\nalways — all strategies] F --> G{deployment_type seed} G -- rolling --> H[No deployment_configuration block\nignore_changes covers it] G -- blue_green / linear / canary --> I[deployment_configuration block\nstrategy + bake/canary/linear tuning] H --> J[ECS service created\nserves from tg-1 only] I --> J J --> K[Deploy Manager calls UpdateService\nwith authoritative strategy\n+ pause lifecycle hooks] K -- rolling --> L[ECS: task replacement\nvia min/max healthy pct\ntg-1 only] K -- blue_green/linear/canary --> M[ECS: rewrites production_listener_rule\nbetween tg-1 and tg-2\nvia infrastructure role]Prompt To Fix All With AI
Reviews (1): Last reviewed commit: "feat(ecs_service): make deployment strat..." | Re-trigger Greptile