Troubleshooting Tanzu Kubernetes Grid Clusters with Crash Diagnostics

Crash Diagnostics, also known as Crashd, is an open source project that makes it easier to diagnose problems with unstable or even unresponsive Kubernetes clusters. If you are a Tanzu admin or a consultant, the tool comes in handy for collecting diagnostic information about the cluster you are working on. In this blog post we will see how to use Crashd to collect diagnostic information from a TKG environment.

Install Crashd:

Go to the TKG 1.2.1 product download page and download Crash Diagnostics v0.3.2. Install crashd on your TKG bootstrap machine.

1. Run the below command to unpack the binary

tar -xvf crashd-linux-amd64-v0.3.2-vmware.1.tar.gz

In the crashd folder that is created, you will see the below 3 files.

  • args
  • diagnostics.crsh
  • crashd-PLATFORM-amd64-v0.3.2+vmware.1

2. From inside the crashd folder, move the binary file into the /usr/local/bin folder by running

mv crashd-linux-amd64-v0.3.2+vmware.1 /usr/local/bin/crashd

3. Make it executable by running the below command.

chmod +x /usr/local/bin/crashd

Now you should be able to run crashd commands from the bootstrap machine.

Prerequisites

Before you can start running crashd commands and collecting logs, make sure the below programs are installed on your bootstrap machine.

  • kind
  • kubectl
  • scp
  • ssh

Apart from kind, the other three programs were already installed on my bootstrap machine.
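A quick way to see which of these tools still need installing is to probe the PATH. The below is a minimal sketch and not part of Crashd; the function name missing_tools is made up for illustration.

```shell
# Print the name of every given tool that is not found on PATH.
# Returns 0 when all tools are present, 1 when at least one is missing.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    return "$rc"
}

# Usage: list whichever of the four prerequisites are absent.
# missing_tools kind kubectl scp ssh
```

On a machine like mine, the usage line would print only kind.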

Install Kind

1. Download the binary by running

curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.10.0/kind-linux-amd64

2. Make the file executable by running

chmod +x ./kind

3. Move the binary into the /usr/local/bin folder.

mv ./kind /usr/local/bin/kind

Update args file:

Depending on which cluster's logs you need to collect, update target and mgmt_cluster_config in the args file. Valid target options are bootstrap, mgmt, or workload. Also make sure the args file points to the correct SSH private key.

A sample args file looks like the below. Note the values of target, ssh_pk_file and mgmt_cluster_config.

# ######################################################
# Crashd script argument file
#
# This file defines CLI argument values that are passed
# to Crashd when running scripts for troubleshooting TKG
# clusters.
# ######################################################
# target: specifies cluster to target.
# Valid targets are: {bootstrap, mgmt, workload}
target=mgmt
# infra: the underlying infrastructure used by the TKG cluster.
# Valid values are: {vsphere, aws}
infra=vsphere
# workdir: a local directory where collected files are staged.
workdir=./workdir
# ######################################################
# Management Cluster
# The following arguments are used to collect information
# from a named management cluster.
# ######################################################
# ssh_user: the user ID used for SSH connections to cluster nodes.
ssh_user=capv
# ssh_pk_file: the path to the private key file created to SSH
# into management cluster nodes.
ssh_pk_file=/home/ubuntu/.ssh/id_rsa
# mgmt_cluster_ns: the namespace where the management cluster
# is deployed in the cluster.
mgmt_cluster_ns=tkg-system
# mgmt_cluster_config: the kubeconfig file path for the management cluster.
mgmt_cluster_config=/home/ubuntu/.kube/config
# ######################################################
# Workload Cluster
# The following arguments are used to collect information
# from one or more workload clusters that are managed
# by the management cluster configured above.
# ######################################################
# workload_clusters: a comma separated list of workload cluster names
# [uncomment below]
workload_clusters=vsphere-workload-1
# workload_cluster_ns: the namespace where the workload cluster
# is hosted in the management plane.
# [uncomment below]
workload_cluster_ns=default
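Since the same keys get edited every time you switch between targets, a small helper can save some typing. This is only a sketch under a few assumptions: set_arg is a hypothetical name, the key already exists in the file as an uncommented KEY=VALUE line, and the value contains no backslashes.

```shell
# Set KEY=VALUE in a Crashd args file by rewriting the existing KEY= line.
# Usage: set_arg FILE KEY VALUE
# If no line starts with "KEY=", the file is left unchanged.
set_arg() {
    file=$1; key=$2; value=$3
    tmp=$(mktemp) || return 1
    awk -v k="$key" -v v="$value" '
        index($0, k "=") == 1 { print k "=" v; next }  # the KEY= line itself
        { print }                                      # all other lines, comments included
    ' "$file" > "$tmp" && cat "$tmp" > "$file"
    rc=$?
    rm -f "$tmp"
    return "$rc"
}

# Usage: point the args file at the workload cluster.
# set_arg args target workload
```

Because only lines that begin with the key are matched, the comment lines describing each argument stay as they are.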

Collect mgmt cluster log bundle using crashd

To collect the diagnostics logs from the TKG management cluster, run the below command from the folder that contains the args and diagnostics.crsh files. I used the above sample args file to collect the mgmt cluster diagnostics logs.

crashd run --args-file args diagnostics.crsh

Once the command completes successfully, the management cluster diagnostics file is generated.
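Because the command depends on the two files being in the current folder, it can help to check for them first. A minimal sketch, with has_crashd_inputs being a hypothetical helper name:

```shell
# Return 0 only if DIR (default: current folder) holds both files
# that the crashd command in this post expects.
has_crashd_inputs() {
    dir=${1:-.}
    [ -f "$dir/args" ] && [ -f "$dir/diagnostics.crsh" ]
}

# Usage: run crashd only when both files are present.
# has_crashd_inputs . && crashd run --args-file args diagnostics.crsh
```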

Collect workload cluster log bundle using crashd

To collect the workload cluster log bundle, set the target in the args file to workload. We also need to run a few additional commands before generating the workload cluster diagnostics logs.

1. Run the below command to list all the clusters in the TKG env.

tkg get clusters --include-management-cluster

2. Run the below command to fetch the kubeconfig for the mgmt cluster

tkg get credentials tkg-mgmt-vsphere-20210129224846 --namespace tkg-system

3. Run the below command to switch the context to the management cluster. The context name appears in the output of the previous command.

kubectl config use-context tkg-mgmt-vsphere-20210129224846-admin@tkg-mgmt-vsphere-20210129224846

4. Now, simply run the below crashd command to collect the workload cluster logs

crashd run --args-file args diagnostics.crsh

Once the crashd command completes successfully, the crashd diagnostics bundle is collected for the workload cluster as well.
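The context name used in step 3 appears to follow a NAME-admin@NAME pattern. Assuming that pattern holds for your cluster (it is inferred from the single example in this post, so check it against the output of kubectl config get-contexts), the name can be built from the cluster name. admin_context is a hypothetical helper name.

```shell
# Build the admin context name for a cluster, following the
# NAME-admin@NAME pattern seen in step 3 (an assumption, not a guarantee).
admin_context() {
    printf '%s-admin@%s\n' "$1" "$1"
}

# Usage:
# kubectl config use-context "$(admin_context tkg-mgmt-vsphere-20210129224846)"
```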

Collect kind cluster log bundle using crashd

If we run into issues while deploying TKG, we may need to collect the kind cluster logs. They are helpful when troubleshooting management cluster deployment issues.

To collect the kind cluster diagnostics logs, update the args file as follows:

1. Set target to bootstrap.

2. Set mgmt_cluster_config to the config file created for the bootstrapper. Its path can be found in the TKG init logs as below.

Bootstrapper created. Kubeconfig: /home/ubuntu/.kube-tkg/tmp/config_6dE79GcH

3. Once the above two are set, we can run the crashd command to collect the kind cluster logs.

crashd run --args-file args diagnostics.crsh
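If the init output has been saved to a file, the kubeconfig path from step 2 can be pulled out instead of copied by hand. This is a sketch that assumes the log line has exactly the form shown above; bootstrap_kubeconfig and the tkg-init.log file name are hypothetical.

```shell
# Read log text on stdin and print the path that follows
# "Bootstrapper created. Kubeconfig:". If several lines match,
# only the last one is printed; if none match, nothing is printed.
bootstrap_kubeconfig() {
    sed -n 's/^.*Bootstrapper created\. Kubeconfig: *//p' | tail -n 1
}

# Usage, assuming the init output was saved to tkg-init.log:
# bootstrap_kubeconfig < tkg-init.log
```

The printed path is the value to put in mgmt_cluster_config.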

I hope this blog post helps in understanding the steps needed to collect crashd diagnostics for a TKGm environment. If it has helped you, please share it with others too. Happy learning!! 🙂

