Openstack NFS Backend Causing Total Hangs

I haven't been a big fan of NFS ever since I worked as a Linux/Unix administrator back in the good old days. Whenever the NFS server hung, lost network connectivity or otherwise went away, every client with a mount from it blocked completely, waiting for the server to come back, because the client sits so deep in the kernel. All commands, even "ls", froze, and the only cure was to forcibly reboot the clients to get them back online. Neat, eh?

When NFS v4.1 emerged back in 2010, hopes were that it would fix everything. I was a bit sceptical but decided to give it a shot, and indeed the protocol and its implementations brought many fixes that improved stability: blocking locks that let the client poll the server to check whether a lock has been released rather than only waiting for notifications, timeouts for an unavailable server, and parallel access. From what I saw, I couldn't really break it beyond repair.

As time went by, Openstack added the option to use NFS as a storage backend. We decided to use it for one deployment where the technology seemed appropriate: we didn't need highly available storage with replication that occupies twice the space, but we did need Cinder volumes to be mountable across the hypervisors. I had a feeling something could go wrong during the installation, because I remembered all those nights rebooting servers over iLO / IPMI.

Anyway, all was well until quite a lot of machines using the storage started running heavy rsyncs over the Internet. And then it happened: all the compute and storage nodes started to hang. Nothing we did restored the service; restarting nfsd, remounting, etc. did not work. Most commands hung (ls, mount, df), even when not explicitly run against an NFS mount.

The only solution was to restart the affected compute nodes, and with them all the virtual machines running on them, even the ones that didn't use the NFS storage backend. You can imagine this is not great for a production environment.

Root Cause Investigation On Openstack And NFS Level

We run CentOS 7.2 with the NFS 4.1 protocol and Openstack Kilo.

Deeper investigation didn't reveal the root cause easily. The logs didn't really suggest anything, and running straces didn't help either. We tested multiple hypotheses: for example, that there were not enough threads handling requests, so we tried increasing their number. It was hard to determine whether we had really hit that limit.

 ## to apply immediately
 echo 32 > /proc/fs/nfsd/threads
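The echo above takes effect immediately but does not survive a service restart. A sketch of making it persistent, assuming RHEL/CentOS 7 where the nfs-server service reads the RPCNFSDCOUNT variable from /etc/sysconfig/nfs (the edit is demonstrated on a temporary copy so it can be shown without root):

```shell
# On a real server the target file is /etc/sysconfig/nfs, followed by
# `systemctl restart nfs-server`; here we edit a throwaway copy.
cfg=$(mktemp)
printf '#RPCNFSDCOUNT=8\n' > "$cfg"   # the stock file ships the default commented out

# Uncomment the variable if needed and raise the thread count to 32
sed -i 's/^#\{0,1\}RPCNFSDCOUNT=.*/RPCNFSDCOUNT=32/' "$cfg"

cat "$cfg"   # RPCNFSDCOUNT=32
```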

We also suspected deadlocks on some resources, because an strace of an ls command showed it hanging on a particular file: in this case, a stat system call against one of the Cinder volume files. We couldn't confirm a deadlock in the log files even with the highest debugging level. Another unpleasant thing about NFS is that you don't really have a configuration file where you can edit options: you can only change some parameters as mount options on the clients and, for the server, the variables in /etc/sysconfig/nfs.
We decreased the grace and lease times for locking in /etc/sysconfig/nfs.

 # Set V4 grace period in seconds (stock default is 90; the values below are examples)
 NFSD_V4_GRACE=15
 # Set V4 lease period in seconds
 NFSD_V4_LEASE=15
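On the client side, the few knobs that do exist are mount options. For illustration, a tuned fstab entry might look like the following (a sketch only; the export path, mountpoint and values are illustrative, not what we ran):

```
# /etc/fstab: NFSv4.1, 60 s RPC timeout (timeo is in tenths of a second),
# 2 retransmits before "server not responding" is reported
nfsserver:/export/cinder  /var/lib/cinder/mnt  nfs4  vers=4.1,timeo=600,retrans=2,_netdev  0  0
```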

Then we had a breakthrough when looking at the network packets transferred during the hang.
A tcpdump trace showed loads of duplicate ACKs followed by retransmissions.

 2016-06-21 17:50:26.598318  12.123375 server -> client TCP 96 [TCP Dup ACK 123984#157] nfs > netconfsoapbeep [PSH, ACK] Seq=524186336 Ack=534174314 Win=33120 Len=0 TSval=3002711016 TSecr=253665 124298 0.000000 524186336
 2016-06-21 17:50:26.598339  12.123396 client -> server TCP 96 [TCP Dup ACK 123985#157] netconfsoapbeep > nfs [ACK] Seq=524186331 Ack=524186336 Win=501 Len=0 TSval=253991 TSecr=3002711338 124299 0.000021 524186331
 2016-06-21 17:50:26.600256  12.123313 server -> client TCP 96 [TCP Dup ACK 123984#158] nfs > netconfsoapbeep [PSH, ACK] Seq=524186336 Ack=534174314 Win=33120 Len=0 TSval=3002711016 TSecr=253665 124300 0.001917 524186336
 2016-06-21 17:50:26.600277  12.123334 client -> server TCP 96 [TCP Dup ACK 123985#158] netconfsoapbeep > nfs [ACK]

And the retransmission:

 2016-06-21 17:50:26.604125  12.125361 client -> server RPC 1434 [TCP Fast Retransmission] Continuation 124308 0.002836 534174314 3432156765

A TCP fast retransmission happens when the sender infers, from a run of duplicate ACKs, that a segment was lost and resends it without waiting for a timeout. According to a bug reported against Red Hat that is still unfixed, under heavy load you can end up in a retransmission loop that the system can't get out of. This is how Red Hat explains it:

"If TCP data is missing from a TCP stream then the receiver will send a series of duplicate ACKs to initiate a fast retransmission. This is expected in normal TCP operation as long as the connection recovers from the packet drop. But in this case the connection never recovered and there were duplicate ACKs every few milliseconds. "
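For reference, this is roughly how such a trace can be captured and filtered (a sketch; the interface name and capture file are placeholders, and NFS is assumed to be on its standard TCP port 2049):

```shell
# Capture all NFS traffic during a hang (run as root on the server or a client)
tcpdump -i eth0 -s 0 -w nfs-hang.pcap port 2049

# Afterwards, pull out the duplicate ACKs and retransmissions with tshark
tshark -r nfs-hang.pcap -Y 'tcp.analysis.duplicate_ack || tcp.analysis.retransmission'
```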

The Solution Of Getting A Stable NFS Storage Backend For Openstack

A workaround for this issue was to disable F-RTO (forward RTO recovery) in the kernel's TCP settings.

In /etc/sysctl.conf, change the following parameter from its default of 2 to 0:

 net.ipv4.tcp_frto = 0

Then execute

 sysctl -p
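A quick way to confirm the running kernel picked up the change (assuming a Linux host):

```shell
# Read the live value straight from procfs; it prints 0 once the change is applied
cat /proc/sys/net/ipv4/tcp_frto
```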

With a smile on my face I can say that for the past three months we haven't experienced a single issue with NFS, and it has been as stable as it could be.

As for a permanent solution, I guess we'll have to wait for NFS 4.2; fingers crossed it will fix everything for real this time.

Please feel free to share your experience with NFS in the comments; I would love to hear from you!