Terraform Stacks - Part 2 - Deferred Planning

In part 1 of our look at Terraform stacks, we touched on its capability to assist with a dependency planning issue that was prevalent in Kubernetes use-cases.

In fact, I did a really rubbish job of explaining it, as I didn’t properly understand it at the time of writing. Here is what I wrote:

Imagine you are building a Kubernetes cluster in one of the cloud providers, but you also want to manage your Kubernetes environment using Terraform within that same configuration. Your Kubernetes provider block needs details that aren’t available until the cloud provider has finished building resources.

These unknown provider attributes often meant that Kubernetes resources were moved into a separate configuration/workspace, so that the order of execution could be controlled manually and/or by pipelines.

This is totally incorrect - I just didn’t have my head screwed on at the time of writing.

When Dependencies Are Not A Problem

The kind of dependency I alluded to in my previous article is not an issue. This is easily handled by regular Terraform dependency graphs. For example, let’s say I want to create a K8s cluster in Azure and create a namespace in one run - not a problem. My Kubernetes provider block just needs to rely on outputs from the relevant Azure resources.

terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "4.6.0"
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "2.33.0"
    }
  }
}

provider "azurerm" {
  features {}
  subscription_id = "e98088f7-5cd3-4c77-8dc7-7468bacdd6a5"
}

provider "kubernetes" {
  # kube_config is a list with a single element, so index it with [0]
  host                   = azurerm_kubernetes_cluster.this.kube_config[0].host
  client_certificate     = base64decode(azurerm_kubernetes_cluster.this.kube_config[0].client_certificate)
  client_key             = base64decode(azurerm_kubernetes_cluster.this.kube_config[0].client_key)
  cluster_ca_certificate = base64decode(azurerm_kubernetes_cluster.this.kube_config[0].cluster_ca_certificate)
}

resource "azurerm_resource_group" "this" {
  name     = "rg-k8s"
  location = "uksouth"
}

resource "azurerm_kubernetes_cluster" "this" {
  name                = "k8s-cluster"
  location            = azurerm_resource_group.this.location
  resource_group_name = azurerm_resource_group.this.name
  dns_prefix          = "k8s"

  default_node_pool {
    name       = "default"
    node_count = 1
    vm_size    = "Standard_DS2_v2"
  }

  identity {
    type = "SystemAssigned"
  }

  network_profile {
    network_plugin = "azure"
    network_policy = "azure"
  }
}

resource "kubernetes_namespace" "this" {
  metadata {
    name = "example-namespace"
  }
}

Running a plan (or apply) against this yields no issues at all…

However, this is not the case for all resources…

When Dependencies Are A Problem

Ok, so if the provider block can handle these dependencies, what is the problem? Certain resources, such as kubernetes_manifest, need to perform server-side checks as part of planning.

The kubernetes_manifest resource has a nice warning in the documentation that states:

This resource requires API access during planning time. This means the cluster has to be accessible at plan time and thus cannot be created in the same apply operation. We recommend only using this resource for custom resources or resources not yet fully supported by the provider.

So, if we had a config that was using this, what would happen then?

# Example code

terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "4.6.0"
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "2.33.0"
    }
  }
}

provider "azurerm" {
  features {}
  subscription_id = "e98088f7-5cd3-4c77-8dc7-7468bacdd6a5"
}

provider "kubernetes" {
  # kube_config is a list with a single element, so index it with [0]
  host                   = azurerm_kubernetes_cluster.this.kube_config[0].host
  client_certificate     = base64decode(azurerm_kubernetes_cluster.this.kube_config[0].client_certificate)
  client_key             = base64decode(azurerm_kubernetes_cluster.this.kube_config[0].client_key)
  cluster_ca_certificate = base64decode(azurerm_kubernetes_cluster.this.kube_config[0].cluster_ca_certificate)
}

resource "azurerm_resource_group" "this" {
  name     = "rg-k8s"
  location = "uksouth"
}

resource "azurerm_kubernetes_cluster" "this" {
  name                = "k8s-cluster"
  location            = azurerm_resource_group.this.location
  resource_group_name = azurerm_resource_group.this.name
  dns_prefix          = "k8s"

  default_node_pool {
    name       = "default"
    node_count = 1
    vm_size    = "Standard_DS2_v2"
  }

  identity {
    type = "SystemAssigned"
  }

  network_profile {
    network_plugin = "azure"
    network_policy = "azure"
  }
}

resource "kubernetes_manifest" "example_crd" {
  manifest = {
    "apiVersion" = "apiextensions.k8s.io/v1"
    "kind"       = "CustomResourceDefinition"
    "metadata" = {
      "name" = "examplecrds.mikeguy.co.uk"
    }
    "spec" = {
      "group" = "mikeguy.co.uk"
      "versions" = [
        {
          "name"    = "v1"
          "served"  = true
          "storage" = true
          "schema" = {
            "openAPIV3Schema" = {
              "type" = "object"
              "properties" = {
                "spec" = {
                  "type" = "object"
                  "properties" = {
                    "foo" = {
                      "type" = "string"
                    }
                    "bar" = {
                      "type" = "integer"
                    }
                  }
                }
              }
            }
          }
        }
      ]
      "scope" = "Namespaced"
      "names" = {
        "plural"     = "examplecrds"
        "singular"   = "examplecrd"
        "kind"       = "ExampleCRD"
        "shortNames" = ["ecrd"]
      }
    }
  }
}

Bugger. It doesn’t work! So, what do we need to do? Probably something a bit clunky, like commenting it out, running an apply, then re-running with the code uncommented. Or splitting it into separate workspaces and executing them in order, manually or with a custom pipeline. Both work, but even once the cluster is built, we may still have issues.

Let’s say our cluster is built, and we want to add a custom resource definition and a custom resource at the same time…

The same issue! This time the APIs required for the custom resource are not yet available (as the custom resource definition has not yet been created), so again, we will have to take a stepped approach to deployment. Depending on how you are using the provider, you can see how this could become painful.
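To make that scenario concrete, here is a minimal sketch of a custom resource that could sit next to the example_crd resource from the earlier example. Only the group, version and kind come from that CRD. The resource name example_cr, the metadata and the spec values are illustrative assumptions, not the config I actually ran.

```hcl
# Hypothetical custom resource based on the CRD defined in example_crd.
resource "kubernetes_manifest" "example_cr" {
  manifest = {
    "apiVersion" = "mikeguy.co.uk/v1"
    "kind"       = "ExampleCRD"
    "metadata" = {
      "name"      = "example"  # assumed name
      "namespace" = "default"  # the CRD is Namespaced, so a namespace is needed
    }
    "spec" = {
      "foo" = "hello" # string, per the CRD schema
      "bar" = 1       # integer, per the CRD schema
    }
  }

  # This orders the apply, but it does not help the plan: the provider
  # still has to look up the ExampleCRD kind on the cluster at plan time.
  depends_on = [kubernetes_manifest.example_crd]
}
```

The depends_on is there to show that ordinary dependency ordering is not enough here, because the check happens during planning rather than during apply.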

This is the main use-case that HashiCorp have published for the benefit of deferred planning, though I’m sure there are others out there (do let me know of any you’ve come across - it’s always good to know).

There be Gremlins!

As you may expect, with Stacks being in public preview, there are some issues (or “features” perhaps) that caused me some pain when playing. I’ll start by providing you with a working config in the following sections.

Of course, some of these could just be me being an idiot. I’ve still not actually gone and read the docs yet! I’m confident that, where there are bugs, HashiCorp will iron them out. I’ve summarised some of the problems I had in the Issues section.

How Does Terraform Stacks Help?

By utilising Stacks, we can break things up into a logical order, without having to create a load of custom pipelines (or manual work) to handle the ordering. The HashiCorp docs state the following:

When you deploy a Stack that includes resources that depend on the availability of APIs provisioned by other components in your Stack, HCP Terraform recognizes the dependency between components, and automatically defers the plan and apply steps for your components until they can complete successfully.

Nice! Let’s re-jig our previous example to use a stack instead. Unlike part 1, I’m just going to use a couple of quick local modules to illustrate the concept. We will create two simple local modules - one called “aks_cluster” and one called “k8s_crds” (not shown for brevity).

Let’s configure our stacks configuration files:

# components.tfstack.hcl

required_providers {
  azurerm = {
    source  = "hashicorp/azurerm"
    version = "4.6.0"
  }
  kubernetes = {
    source  = "hashicorp/kubernetes"
    version = "2.33.0"
  }
}

provider "azurerm" "this" {
  config {
    features {}
    tenant_id       = "f8667506-a537-4c81-842a-41fd0e547e43"
    subscription_id = var.subscription_id
    use_cli         = false
    use_oidc        = true
    oidc_token      = var.identity_token
    client_id       = var.client_id
  }
}

provider "kubernetes" "this" {
  config {
    host                   = component.aks_cluster.kube_config_host
    client_certificate     = component.aks_cluster.kube_config_client_certificate
    client_key             = component.aks_cluster.kube_config_client_key
    cluster_ca_certificate = component.aks_cluster.kube_config_cluster_ca_certificate
  }
}


component "aks_cluster" {
  source = "./modules/aks_cluster"
  inputs = {
    cluster_name        = var.cluster_name
    dns_prefix          = var.dns_prefix
    node_count          = var.node_count
    node_size           = var.node_size
    resource_group_name = var.resource_group_name
  }
  providers = {
    azurerm = provider.azurerm.this
  }
}

component "crds" {
  source  = "./modules/k8s_crds"
  inputs = {}
  providers = {
    kubernetes = provider.kubernetes.this
  }
}

# variables.tfstack.hcl

variable "cluster_name" {
  description = "The name of the AKS cluster"
  type        = string
}

variable "dns_prefix" {
  description = "The DNS prefix of the AKS cluster"
  type        = string
}

variable "resource_group_name" {
  description = "The name of the resource group"
  type        = string
}

variable "location" {
  description = "The location of the resources"
  type        = string
}

variable "node_count" {
  description = "The number of nodes in the AKS cluster"
  type        = number
}

variable "node_size" {
  description = "The size of the nodes in the AKS cluster"
  type        = string
}

variable "subscription_id" {
  description = "The subscription ID for the Azure account"
  type        = string
}

variable "identity_token" {
  description = "The OIDC token for authentication"
  ephemeral   = true
  type        = string
}

variable "client_id" {
  description = "The client ID for the Azure service principal"
  type        = string
}

# deployments.tfdeploy.hcl

identity_token "azurerm" {
  audience = ["api://AzureADTokenExchange"]
}

deployment "prd" {
  inputs = {
    identity_token      = identity_token.azurerm.jwt
    subscription_id     = "e98088f7-5cd3-4c77-8dc7-7468bacdd6a5"
    client_id           = "5d1c9802-3fc3-49e4-a1f9-320156a72d1c"
    cluster_name        = "my-aks-cluster"
    dns_prefix          = "myaks"
    resource_group_name = "my-resource-group"
    location            = "uksouth"
    node_count          = 1
    node_size           = "Standard_DS2_v2"
  }
}

Notice how our Kubernetes provider block is now referencing the outputs from the aks_cluster component? This is similar to what you’d do in a single configuration file with a regular Terraform workspace. The difference is that Stacks will recognise this and realise that it needs to defer planning of the crds component until the aks_cluster component has been created. No more dodgy phased deployments.
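The aks_cluster module isn’t shown, but for those component.aks_cluster.* references to resolve, it would need outputs along these lines. This is a minimal sketch, not the module I actually used: it assumes the cluster resource inside the module is named azurerm_kubernetes_cluster.this, and that the base64 decoding happens in the module, since the Stacks provider block above passes the values through as they are.

```hcl
# modules/aks_cluster/outputs.tf (hypothetical sketch)

output "kube_config_host" {
  value     = azurerm_kubernetes_cluster.this.kube_config[0].host
  sensitive = true # kube_config is marked sensitive by the azurerm provider
}

output "kube_config_client_certificate" {
  value     = base64decode(azurerm_kubernetes_cluster.this.kube_config[0].client_certificate)
  sensitive = true
}

output "kube_config_client_key" {
  value     = base64decode(azurerm_kubernetes_cluster.this.kube_config[0].client_key)
  sensitive = true
}

output "kube_config_cluster_ca_certificate" {
  value     = base64decode(azurerm_kubernetes_cluster.this.kube_config[0].cluster_ca_certificate)
  sensitive = true
}
```

The output names match the attributes referenced in the kubernetes provider block, which is all the stack configuration cares about.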

Once I had a working config (which took a few attempts!) I could see a successful plan was awaiting my review.

Clicking into the quick view, we can now see that the crds component of the stack has been marked as having deferred changes. Once I approve the initial apply, it is only going to roll out the resources in the aks_cluster component.

Approving the plan, we can see it goes through deployment, then automatically replans. At this point, it waits for my review and approval again.

Ignore the additional changes to the aks_cluster - I’d got Cursor AI to quickly knock up the AKS config, and it isn’t idempotent (I do love Cursor though, maybe one for another article).

Clicking the view replan button, I can see the replan has been successful, and I can approve it.

Finally, a quick check in the Azure portal (I couldn’t be bothered to set up kubectl with the kubeconfig - don’t judge me 😉), and I can see the CRD has been created.

Reducing Manual Intervention

In this case, we obviously had to review and approve plans twice. If you recall in the previous article, I mentioned that there were other types of orchestrate rules that could help us. In this case, there is an action of deferral_replan.

I ran out of time to play with this today, but my assumption is that we can include this and skip the manual approval step. I’ll continue to play with this and share with you in articles as I learn more.
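For reference, this is roughly the shape an orchestrate rule takes in the preview, using the auto_approve rule type rather than deferral_replan. Treat it as an untested sketch: the rule name prd_no_removals is made up, and whether a rule like this also covers the replan that follows a deferral is exactly the assumption I still need to verify.

```hcl
# deployments.tfdeploy.hcl (untested sketch, preview syntax)

orchestrate "auto_approve" "prd_no_removals" {
  # Only consider plans for the prd deployment defined earlier.
  check {
    condition = context.plan.deployment == deployment.prd
    reason    = "Only the prd deployment is auto-approved."
  }

  # Refuse to auto-approve any plan that removes resources.
  check {
    condition = context.plan.changes.remove == 0
    reason    = "Plan removes ${context.plan.changes.remove} resources."
  }
}
```

Every check has to pass for the plan to be approved automatically. If one fails, the plan falls back to waiting for manual review.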

Issues

As stated, I did have a few issues crop up. I’m sharing them so you’re aware, not to shame the product. It’s still in public preview after all, and I’m confident HashiCorp will iron these out and give us a great solution.

Missing Functions and Weird Errors

Originally, I was going to pass my CRDs in as a variable input but ended up hard coding them in the module. When I tried to use the file() function within my deployment, I got an error that there is no function named file.

On top of this, the error itself was screwed up - it suggested that crd1 = file... was part of my identity_token block. It absolutely wasn’t, and there were no missing braces etc. I’ve no idea what was going on here!

Instance Count Unknown

Ok, so I can’t use file(). Fine. I’ll just put the CRD directly into my deployment code. Something like this…

Yeah, this didn’t work either. Despite the module using a for_each over a map (which should be fine), I got the following error:

No idea why. I double checked my code, and it was ok. So, either I’m more tired than I realised, or this is a bug/limitation.

Can’t Destroy

I spotted this when I was writing part 1, but completely forgot to mention it. When I wanted to clean up my environment, I went to queue a destroy plan (as you can in a workspace), but lo and behold, you can’t! You can only destroy the workspace, which doesn’t delete the resources.

Not a huge issue, and probably just something that isn’t part of the preview for now. I’m sure this will be added in due course.

Conclusion

I’m still enjoying getting stuck into Stacks. Whilst there are some teething issues, that is to be expected. Hopefully they will be squashed soon, as I can definitely see the benefits of this to many organisations.

Stay tuned for more parts as I continue to document my learnings. Not sure how many more there will be. One? Two? Who knows. I’m edgy like that.
