TKGI Cluster creation fails with error “1 of 7 pre-start scripts failed. Failed Jobs: pks-nsx-t-prepare-master-vm”

One of the commonly reported issues in TKGI is cluster creation failing with the error "1 of 7 pre-start scripts failed. Failed Jobs: pks-nsx-t-prepare-master-vm". In this blog post, we will look at the issue in more detail.

Symptoms:

The BOSH task fails with the error below:

Task 305253 | 02:58:24 | Updating instance master: master/6415b432-ccc2-4faa-89c8-911914cb9d44 (0) (canary) (00:02:16)
                      L Error: Action Failed get_task: Task a2ce132f-3bcb-414b-4d5d-f17ec4e2e7cc result: 1 of 7 pre-start scripts failed. Failed Jobs: pks-nsx-t-prepare-master-vm. Successful Jobs: etcd, bpm, bosh-dns, syslog_forwarder, ncp, pks-nsx-t-ncp.
Task 305253 | 03:00:40 | Error: Action Failed get_task: Task a2ce132f-3bcb-414b-4d5d-f17ec4e2e7cc result: 1 of 7 pre-start scripts failed. Failed Jobs: pks-nsx-t-prepare-master-vm. Successful Jobs: etcd, bpm, bosh-dns, syslog_forwarder, ncp, pks-nsx-t-ncp.

From the above task log, we can see that one of the pre-start scripts has failed and that the failed job is pks-nsx-t-prepare-master-vm. If we SSH into the master node of the failed PKS cluster and review the pks-nsx-t-prepare-master-vm logs, we see the following:

pks-nsx-t-prepare-master-vm/pre-start.stdout.log:
Creating Load Balancer
[POST /logical-routers][400] createLogicalRouterBadRequest  &{RelatedAPIError:{Details: ErrorCode:10000 ErrorData:<nil> ErrorMessage:Found errors in the request. Please refer to the related errors for details. ModuleName:ROUTING} RelatedErrors:[0xc0000b46e0 0xc0000b4730]}

pks-nsx-t-prepare-master-vm/pre-start.stderr.log:
time="2020-10-19T06:01:50Z" level=error msg="Failed to createT1Router: &{ManagedResource:{RevisionedResource:{Resource:{Links:[] Schema: Self:<nil>} Revision:<nil>} CreateTime:0 CreateUser: LastModifiedTime:0 LastModifiedUser: SystemOwned:<nil> Description: DisplayName:lb-pks-999225f5-40af-48f8-8958-5eff0139f4fc-cluster-router ID: ResourceType: Tags:[0xc000210400]} AdvancedConfig:<nil> AllocationProfile:0xc000010068 EdgeClusterID:a4270323-3696-469d-b12b-0ba4b5737ed7 EdgeClusterMemberIndices:[] FailoverMode: FirewallSections:[] HighAvailabilityMode:ACTIVE_STANDBY PreferredEdgeClusterMemberIndex:<nil> RouterType:0xc000488190}" pks-networking=networkManager
Error: [POST /logical-routers][400] createLogicalRouterBadRequest  &{RelatedAPIError:{Details: ErrorCode:10000 ErrorData:<nil> ErrorMessage:Found errors in the request. Please refer to the related errors for details. ModuleName:ROUTING} RelatedErrors:[0xc0000b46e0 0xc0000b4730]}
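
The logs above can also be pulled without an interactive session. The following is a minimal sketch, not a definitive procedure. It assumes the BOSH CLI is already targeted at the TKGI BOSH Director and that the cluster deployment is named service-instance_<cluster-UUID>. The helper names prestart_log_path and show_prestart_logs are hypothetical, and /var/vcap/sys/log is assumed to be the job log directory on the master VM.

```shell
# Build the path of a pre-start log for a BOSH job (hypothetical helper).
# $1 = job name, $2 = stream: stdout or stderr
# Assumption: job logs live under /var/vcap/sys/log on the VM.
prestart_log_path() {
  printf '/var/vcap/sys/log/%s/pre-start.%s.log\n' "$1" "$2"
}

# Tail both pre-start logs of the failed job on the first master VM.
# $1 = BOSH deployment name, assumed to look like service-instance_<cluster-UUID>
show_prestart_logs() {
  job="pks-nsx-t-prepare-master-vm"
  for stream in stdout stderr; do
    bosh -d "$1" ssh master/0 -c "sudo tail -n 50 $(prestart_log_path "$job" "$stream")"
  done
}
```

Call show_prestart_logs with the deployment name of the failed cluster as reported by bosh deployments.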

Cause:

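This error is mostly seen when NSX-T has reached one of its configuration limits for the objects that PKS/TKGI creates. In the logs above, the request that fails is the creation of the Tier-1 router for the cluster load balancer. To understand more about the limits, review the VMware configuration maximums (configmax) guide for NSX-T. Make sure to choose the right product version and category, and then click VIEW LIMITS.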

Validate if you hit the limits:

To check whether you have hit any configuration maximums in NSX-T, run the API command below. You can run it from the NSX-T Manager node after switching to root mode, or from any other Linux machine that has connectivity to NSX-T.

curl -k -X GET -u 'USERNAME:PASSWORD' "https://NSX-T-FQDN/api/v1/loadbalancer/usage-per-node/<edge-node-id>"

Make sure to replace the username, password, NSX-T FQDN, and edge-node-id in the command with the values from your environment. The edge node ID can be found in the NSX-T GUI under System > Fabric > Nodes > Edge Transport Node > Edge-vm > Overview.

The output of the command will look similar to the following:

curl -k -X GET -u 'admin:VMware1!VMware1!' https://nsxmanager.corp.local/api/v1/loadbalancer/usage-per-node/28cdb16d-3d05-4ed3-835c-b18405b3d45b
{
"form_factor": "LARGE_VIRTUAL_MACHINE",
"edge_cluster_id": "8952b9b1-f66c-4ab3-a6cc-74b833f369d8",
"current_credit_number": 40,
"remaining_credit_number": 0,
"usage_percentage": 100.0,
"severity": "RED",
"current_pool_members": 69,
"current_virtual_servers": 33,
"current_pools": 23,
"current_small_load_balancer_services": 10,
"current_medium_load_balancer_services": 3,
"current_large_load_balancer_services": 0,
"remaining_small_load_balancer_services": 0,
"remaining_medium_load_balancer_services": 0,
"remaining_large_load_balancer_services": 0,
"remaining_pool_members": 7431,
"type": "LbEdgeNodeUsage",
"node_id": "28cdb16d-3d05-4ed3-835c-b18405b3d45b"
}

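This API output shows the current and remaining load balancer (LB) services on that edge node. In the output above, remaining_credit_number and all three remaining_*_load_balancer_services values are 0, usage_percentage is 100.0, and severity is RED, which confirms that this edge node cannot host any more LB services.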
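
The check described above can be scripted. The following is a minimal sketch that reads the API response on standard input and reports whether any LB service slots remain. The function names lb_field and lb_status are hypothetical, and the parsing assumes the flat "key": number layout shown in the sample output rather than a full JSON parser.

```shell
# Extract one numeric field from the usage-per-node response on stdin.
# $1 = field name. Hypothetical helper; assumes a flat "key": number layout.
lb_field() {
  sed -n "s/.*\"$1\"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p" | head -n 1
}

# Print "exhausted" when no small, medium, or large LB service slots remain,
# "available" when at least one remains, and "unknown" if a field is missing.
lb_status() {
  json=$(cat)
  total=0
  for size in small medium large; do
    n=$(printf '%s\n' "$json" | lb_field "remaining_${size}_load_balancer_services")
    if [ -z "$n" ]; then
      echo "unknown"
      return 1
    fi
    total=$((total + n))
  done
  if [ "$total" -eq 0 ]; then
    echo "exhausted"
  else
    echo "available"
  fi
}

# Example (placeholders as in the command above):
# curl -k -s -u 'USERNAME:PASSWORD' "https://NSX-T-FQDN/api/v1/loadbalancer/usage-per-node/<edge-node-id>" | lb_status
```

For the sample output above, where all three remaining values are 0, lb_status prints exhausted.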

Repeat the same API call for each of the remaining edge nodes in the edge cluster, substituting the ID of each node.
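
Checking every edge node by hand gets tedious, so a small loop can help. The following is a minimal sketch, assuming you have already collected the edge node IDs from the NSX-T GUI. The function names usage_url and check_edge_nodes are hypothetical, and the grep filter assumes the response is pretty-printed with one field per line, as in the sample output.

```shell
# Build the usage-per-node URL for one edge node (hypothetical helper).
# $1 = NSX-T FQDN, $2 = edge node ID
usage_url() {
  printf 'https://%s/api/v1/loadbalancer/usage-per-node/%s\n' "$1" "$2"
}

# Print the remaining-capacity fields for every edge node ID given.
# $1 = NSX-T FQDN, $2 = 'USERNAME:PASSWORD', remaining arguments = edge node IDs
check_edge_nodes() {
  host=$1
  creds=$2
  shift 2
  for node in "$@"; do
    echo "== $node"
    # -k skips certificate verification, as in the original command.
    curl -k -s -u "$creds" "$(usage_url "$host" "$node")" | grep '"remaining_'
  done
}
```

Call check_edge_nodes with the NSX-T FQDN, the credentials, and one or more edge node IDs. An edge node that still shows a non-zero remaining_*_load_balancer_services value has capacity left.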

Resolution:

The API output above confirms that we are hitting the NSX-T load balancer limits. To resolve the issue, either delete older PKS clusters that are no longer required or add more edge nodes to the edge cluster in NSX-T.

