I’d like to thank my colleague Robbie Hancock (https://twitter.com/blobbieh) for teasing out the troubleshooting steps that identified this problem.

**Symptoms**:

If you have VMs that:

  1. Stop responding to network requests
  2. Cannot be pinged from another VM or the ESG
  3. Do not have entries in their ARP table
  4. Resume passing traffic once a ping is initiated from the affected VM to the ESG or another VM (the ARP table entry appears)
  5. Show the problem again after a period of inactivity (I suspect the ARP table entry ages out)
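Symptom 3 can be checked from inside the guest. Below is a minimal sketch that assumes the affected VM is a Linux guest, where the kernel exposes the ARP table as text in `/proc/net/arp`. The function name `has_arp_entry` and the sample addresses are made up for illustration.

```python
def has_arp_entry(arp_table_text, ip):
    """Return True if `ip` has a resolved entry in /proc/net/arp style text.

    Assumes the Linux layout: one header row, then whitespace-separated
    columns (IP address, HW type, Flags, HW address, Mask, Device).
    """
    for line in arp_table_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 4 and fields[0] == ip:
            # Flags value 0x0 marks an incomplete entry (no MAC resolved yet).
            return int(fields[2], 16) != 0
    return False


if __name__ == "__main__":
    # Hypothetical gateway address; replace with the ESG or peer VM address.
    with open("/proc/net/arp") as handle:
        print(has_arp_entry(handle.read(), "192.168.10.1"))
```

Running it before and after a ping from the affected VM should show the entry appearing, which matches symptom 4.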

To troubleshoot the issue:
Log on to an NSX controller and identify which controller is the master for the affected VNI (5001 is used here as an example):

show control-cluster logical-switches vni 5001

This will show which controller is the master. Cross-reference this IP with the relevant NSX Controller in vCenter and SSH into that one.

Check that all ESXi Hosts have a connection to controllers:

show control-cluster logical-switches connection-table 5001

If the problem is present, this will show that only some of the ESXi hosts are connected to the controllers. The missing host is most likely the one running the affected VM.
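With many hosts, spotting the missing one by eye is error-prone. The sketch below compares a list of the ESXi host IPs you expect to see against text pasted from the connection-table output. It does not assume any particular column layout: it simply pulls every valid IPv4 address out of the pasted text. The function names and the addresses are hypothetical.

```python
import ipaddress
import re


def extract_ips(text):
    """Return the set of valid IPv4 addresses found anywhere in `text`."""
    found = set()
    for candidate in re.findall(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", text):
        try:
            found.add(str(ipaddress.IPv4Address(candidate)))
        except ipaddress.AddressValueError:
            pass  # looked like an IP but is not one, e.g. 999.1.1.1
    return found


def missing_hosts(expected_hosts, connection_table_text):
    """Return expected host IPs that do not appear in the pasted output."""
    missing = set(expected_hosts) - extract_ips(connection_table_text)
    return sorted(missing, key=ipaddress.IPv4Address)


if __name__ == "__main__":
    # Hypothetical inventory and pasted output; substitute your own.
    expected = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]
    pasted = "10.0.0.11 ...\n10.0.0.12 ...\n"
    print(missing_hosts(expected, pasted))
```

Any host it reports is a candidate for the joined-VNI check that follows.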

Check which VNIs are present on the affected host ( used here for example)

show control-cluster logical-switches joined-vnis
If you get “error: not found” on a host with a lot of VMs on it, that’s not right! You should get a long list of VNIs.

So now we have a pointer to the cause. To confirm it, migrate the affected VM onto another host and see whether the problem persists. If the migration instantly resolves the problem, you have hit this issue.

**The Fix**

Connect to the ESXi host over SSH (for example with PuTTY) and restart the user world agent (netcpa):

Recheck whether the ESXi host has joined the VNIs:

show control-cluster logical-switches joined-vnis

This is a known issue with NSX 6.1.x and is covered in the KB.