Tanzu Kubernetes Grid 1.2.1 deployment is stuck due to cert-manager pod creation issue

Tanzu Kubernetes Grid (TKG) provides a consistent, upstream-compatible implementation of Kubernetes that is tested, signed, and supported by VMware. Recently I was working on a TKGm 1.2.1 air-gapped installation in a vSphere environment and ran into an issue: the installation got stuck at one point for about 30 minutes and eventually failed. I am writing it up here in case someone else faces the same problem and wants to know how to work around it and proceed with the TKG installation.

Symptoms:

  • Tanzu Kubernetes Grid (TKG) 1.2.1 deployment gets stuck at step 4/8
  • The TKG deployment is taking place in an air-gapped environment
  • You see messages similar to the following in the tkg init output: 
I0120 12:17:46.445757 cert_manager.go:453] Waiting for cert-manager to be available...
I0120 12:17:46.456910 cert_manager.go:419] Updating Namespace="cert-manager-test"
I0120 12:17:46.603389 cert_manager.go:411] Creating Issuer="test-selfsigned" Namespace="cert-manager-test"
.
.
I0120 12:47:48.637468 client.go:150] Deleting kind cluster: tkg-kind-c03t3ku440qoiqq7imrg
E0120 12:47:52.174348 common.go:40]
Error: : unable to set up management cluster: unable to initialize providers: timed out waiting for the condition, this can be possible because of the outbound connectivity issue. Please check deployed nodes for outbound connectivity.
E0120 12:47:52.174648 common.go:44]
Detailed log about the failure can be found at: /tmp/tkg-20210120T121642480345317.log

In the above logs we can see that the installation was stuck for 30 minutes and then failed. This matches the default cert-manager-timeout value set for cert-manager in the ~/.tkg/config.yaml file. To troubleshoot further, we need to get the cert-manager pod logs from the kind cluster.

To connect to the kind cluster, run the command below:

kubectl get all -A --kubeconfig /root/.kube-tkg/tmp/config_xxxxxxxx

Note: The location of the kubeconfig file used to connect to the kind cluster can be found in the init log (tkg-20210120T121642480345317.log). The log entry looks something like this:

I0120 13:10:44.465469 init.go:173] Bootstrapper created. Kubeconfig: /root/.kube-tkg/tmp/config_xxxxxxxx
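Rather than searching the log by eye, the kubeconfig path can be pulled out with a small helper. This is only a sketch: the function name kind_kubeconfig_from_log is made up for illustration, and it assumes the log line keeps the "Bootstrapper created. Kubeconfig: <path>" format shown above.

```shell
# Print the kind cluster kubeconfig path recorded in a tkg init log.
# Assumes the "Bootstrapper created. Kubeconfig: <path>" format shown above.
# kind_kubeconfig_from_log is a hypothetical helper name, not part of the tkg CLI.
kind_kubeconfig_from_log() {
  # $1: path to the tkg init log file
  # Keep only the path after "Kubeconfig: "; if several entries match, use the last one.
  sed -n 's/.*Bootstrapper created\. Kubeconfig: \([^[:space:]]*\).*/\1/p' "$1" | tail -n 1
}

# Example usage, with the log path taken from the failure message:
# kubectl get all -A --kubeconfig "$(kind_kubeconfig_from_log /tmp/tkg-20210120T121642480345317.log)"
```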
  • You see that the cert-manager pod creation is failing when you run kubectl get all -A on the kind cluster
  • You see ImagePullBackOff errors when you run kubectl describe pod against the cert-manager pod
  • You see events similar to the following when you run kubectl describe pod against the cert-manager pod:
Jan 25 07:22:15 tkg-kind-c0772pe440qsvjjovfbg-control-plane kubelet[733]: E0125 07:22:15.698512     733 remote_image.go:113] PullImage "registry.domain.local/newapp/cert-manager/cert-manager-controller:v0.16.1_vmware.1" from image service failed: rpc error: code = Unknown desc = failed to pull and unpack image "registry.domain.local/newapp/cert-manager/cert-manager-controller:v0.16.1_vmware.1": failed to resolve reference "registry.domain.local/newapp/cert-manager/cert-manager-controller:v0.16.1_vmware.1": get TLSConfig for registry "https://registry.domain.local": failed to load CA file: open /etc/containerd/tkg-registry-ca.crt: no such file or directory
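The checks described in the bullets above can be run together roughly as follows. Treat this as a sketch under assumptions: the helper name inspect_cert_manager is hypothetical, and it assumes the pods live in the cert-manager namespace (the upstream default), which you should confirm from the kubectl get all -A output.

```shell
# List and describe the cert-manager pods on the bootstrap kind cluster.
# inspect_cert_manager is a hypothetical helper name.
# Assumption: the pods run in the "cert-manager" namespace.
inspect_cert_manager() {
  # $1: kubeconfig path of the kind cluster, taken from the tkg init log
  kubectl get pods -n cert-manager --kubeconfig "$1"
  # The Events section of the describe output is where image pull failures show up.
  kubectl describe pods -n cert-manager --kubeconfig "$1"
}

# Example usage:
# inspect_cert_manager /root/.kube-tkg/tmp/config_xxxxxxxx
```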

Resolution:

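This is a known issue affecting Tanzu Kubernetes Grid 1.2.1, and there is currently no resolution. The kubelet error above shows the image pull failing because the CA file /etc/containerd/tkg-registry-ca.crt is missing on the kind node. To work around the issue, set one of the following two groups of variables in the ~/.tkg/config.yaml file.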

TKG_CUSTOM_IMAGE_REPOSITORY: <your-harbor-fqdn>/library
TKG_CUSTOM_IMAGE_REPOSITORY_SKIP_TLS_VERIFY: true
TKG_CUSTOM_IMAGE_REPOSITORY_CA_CERTIFICATE: Cg==

OR

TKG_CUSTOM_IMAGE_REPOSITORY: <your-harbor-fqdn>/library
TKG_CUSTOM_IMAGE_REPOSITORY_SKIP_TLS_VERIFY: false
TKG_CUSTOM_IMAGE_REPOSITORY_CA_CERTIFICATE: <base64 encoded harbor-ca-cert>
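For the second option, the certificate value has to be base64 encoded on a single line. A minimal sketch of producing it is below; the helper name encode_ca_cert and the file name harbor-ca.crt are placeholders, not something the tkg CLI provides. It also shows that the Cg== value used in the first option is simply a base64-encoded newline, in other words an effectively empty certificate.

```shell
# Base64-encode a PEM CA certificate as a single line, suitable for
# TKG_CUSTOM_IMAGE_REPOSITORY_CA_CERTIFICATE.
# encode_ca_cert is a hypothetical helper name.
encode_ca_cert() {
  # $1: path to the PEM-encoded CA certificate (for example ./harbor-ca.crt)
  # tr strips the line wraps that some base64 implementations insert.
  base64 < "$1" | tr -d '\n'
}

# The placeholder value from the first option is just an encoded newline:
printf '\n' | base64    # prints Cg==

# Example usage:
# encode_ca_cert ./harbor-ca.crt
```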

Note: You can also export these settings as environment variables on the bootstrap machine and then run the tkg init command.
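As a sketch, the environment variable form of the second option would look like the following, run in the same shell session before tkg init. The values in angle brackets are the same placeholders as in the config.yaml snippet and must be replaced with your own registry FQDN and encoded certificate.

```shell
# Same settings as the config.yaml workaround, exported for the current shell.
# Replace the <...> placeholders with your own values before running tkg init.
export TKG_CUSTOM_IMAGE_REPOSITORY="<your-harbor-fqdn>/library"
export TKG_CUSTOM_IMAGE_REPOSITORY_SKIP_TLS_VERIFY=false
export TKG_CUSTOM_IMAGE_REPOSITORY_CA_CERTIFICATE="<base64 encoded harbor-ca-cert>"
```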

Once the config.yaml file is updated, clean up the existing failed kind deployment and re-run the tkg init command. This time the installation should proceed and start deploying the management cluster.

