I’ve not been a big NFS fan ever since I worked as a Linux/Unix administrator back in the good old days. Whenever the NFS server hung, lost network connectivity, or otherwise went away, every client with a mount from that server was completely blocked waiting for it to come back, because the NFS client sits so deep in the kernel.
All commands, even “ls”, froze, and the only cure was to forcibly reboot the clients to get them back online. Neat, eh?
When NFS v4.1 emerged back in 2010, hopes were that it would fix everything. I was a bit skeptical but decided to give it a shot, and indeed, many fixes in the protocol and implementation improved stability: blocking locks that let the client poll the server to see whether a lock has been released instead of only waiting for notifications, a timeout for an unavailable server, and parallel access. From what I saw, I couldn’t really break it beyond repair.
As time went by, OpenStack offered the option to use NFS as a storage backend. We decided to use it for one deployment where we saw this technology as appropriate: we didn’t need highly available storage with replication that occupies twice the space, but we did need Cinder volumes to be mountable across the hypervisors. I had a feeling something could go wrong during the installation, because I remembered all those nights rebooting servers from iLO/IPMI.
Anyway, all was good until quite a lot of machines using the storage started getting heavy syncs over the Internet. And then it happened: all the compute and storage nodes started to hang. There was nothing we could do to restore the service; restarting nfsd, remounting, etc. did not work. Most commands hung (ls, mount, df), even ones not explicitly run against an NFS mount.
The only solution was to restart the affected compute nodes, and hence all the virtual machines running on them, even the ones that didn’t use the NFS storage backend. You can imagine this is not great for a production environment.
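For what it’s worth, the telltale sign of this state is processes stuck in uninterruptible sleep (state D) on the NFS mount. A rough way to spot them with standard procps tooling (nothing specific to our setup):
# List processes in uninterruptible sleep (state D) and the kernel function
# they are blocked in; on a hung NFS mount the wait channel is typically
# an nfs_* or rpc_* function.
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'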
ROOT CAUSE INVESTIGATION AT THE OPENSTACK AND NFS LEVEL
We were running CentOS 7.2 with the NFS 4.1 protocol and OpenStack Kilo.
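As an aside, the protocol version a client actually negotiated is easy to double-check on the mount itself; a quick sketch (output details vary between nfs-utils versions):
# Show mounted NFS filesystems with their negotiated options; look for vers=4.1.
nfsstat -m
# Or read the mount options straight from the kernel.
grep ' nfs4 ' /proc/mounts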
A deeper investigation didn’t reveal the root cause easily. The information in the logs didn’t really suggest anything, and running traces didn’t help either. We tested multiple assumptions; for example, that there were not enough threads handling requests, so we tried to increase them, although it was hard to tell whether we had actually hit that limit.
/etc/sysconfig/nfs
RPCNFSDCOUNT=32
## to apply immediately
echo 32 > /proc/fs/nfsd/threads
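One way to get at least a rough answer to the “did we hit the thread limit” question, assuming a kernel that exposes the nfsd pool statistics (CentOS 7 does), is to watch the counters below; this is a sketch rather than the exact check we ran:
# Columns are roughly: pool, packets-arrived, sockets-enqueued, threads-woken,
# threads-timedout (the exact set depends on the kernel version).
# A steadily growing sockets-enqueued value means requests had to queue because
# no nfsd thread was free, i.e. the thread count really is the bottleneck.
watch -n 5 cat /proc/fs/nfsd/pool_stats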
We also suspected a deadlock on some resource, because a trace of an ls command showed it hanging on a particular file; in our case it was a stat system call against one of the cinder-volume files. We couldn’t confirm a deadlock in the log files even with the highest debugging option. Another unpleasant thing about NFS is that there is no real configuration file where you can edit options: you can only change some parameters as mount options on the client, and for the server only the variables in /etc/sysconfig/nfs. We decreased the grace and lease times for locking in /etc/sysconfig/nfs:
# Set V4 grace period in seconds
NFSD_V4_GRACE=30
#
# Set V4 lease period in seconds
NFSD_V4_LEASE=30
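For reference, the “highest debugging option” mentioned above means the kernel’s RPC/NFS debug flags, which are toggled with rpcdebug rather than a config file; roughly:
# Enable maximum debug output for the NFS server and the RPC layer;
# the messages end up in the kernel log (dmesg / /var/log/messages).
rpcdebug -m nfsd -s all
rpcdebug -m rpc -s all
# ...reproduce the problem, then switch the (very noisy) logging off again.
rpcdebug -m nfsd -c all
rpcdebug -m rpc -c all
Note that the grace and lease changes themselves only take effect after the NFS server is restarted (e.g. systemctl restart nfs-server on CentOS 7).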
Then we had a breakthrough while looking at the network packets transferred during the hang.
The TCP dump trace showed loads of duplicate ACKs followed by retransmissions.
2016-06-21 17:50:26.598318 12.123375 server -> client TCP 96 [TCP Dup ACK 123984#157] nfs > netconfsoapbeep [PSH, ACK] Seq=524186336 Ack=534174314 Win=33120 Len=0 TSval=3002711016 TSecr=253665 124298 0.000000 524186336
2016-06-21 17:50:26.598339 12.123396 client -> server TCP 96 [TCP Dup ACK 123985#157] netconfsoapbeep > nfs [ACK] Seq=524186331 Ack=524186336 Win=501 Len=0 TSval=253991 TSecr=3002711338 124299 0.000021 524186331
2016-06-21 17:50:26.600256 12.123313 server -> client TCP 96 [TCP Dup ACK 123984#158] nfs > netconfsoapbeep [PSH, ACK] Seq=524186336 Ack=534174314 Win=33120 Len=0 TSval=3002711016 TSecr=253665 124300 0.001917 524186336
2016-06-21 17:50:26.600277 12.123334 client -> server TCP 96 [TCP Dup ACK 123985#158] netconfsoapbeep > nfs [ACK]
And the retransmission:
2016-06-21 17:50:26.604125 12.125361 client -> server RPC 1434 [TCP Fast Retransmission] Continuation 124308 0.002836 534174314 3432156765
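In case someone wants to reproduce this kind of analysis, a capture along these lines is enough to see the pattern; the interface and file names are placeholders, not our exact commands:
# Capture NFS traffic (TCP port 2049) on the storage interface during a hang.
tcpdump -i eth0 -s 0 -w /tmp/nfs-hang.pcap port 2049
# Then filter the capture down to duplicate ACKs and retransmissions.
tshark -r /tmp/nfs-hang.pcap -Y 'tcp.analysis.duplicate_ack || tcp.analysis.retransmission'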
A TCP fast retransmission happens when the sender concludes, from a run of duplicate ACKs, that a segment was lost and resends it without waiting for the retransmission timeout. According to a bug reported against Red Hat that is still unfixed, under heavy load you can end up in a retransmission loop that the system cannot get out of. This is how Red Hat explains it:
“If TCP data is missing from a TCP stream then the receiver will send a series of duplicate ACKs to initiate a fast retransmission. This is expected in normal TCP operation as long as the connection recovers from the packet drop. But in this case, the connection never recovered and there were duplicate ACKs every few milliseconds. “
THE SOLUTION: A STABLE NFS STORAGE BACKEND FOR OPENSTACK
A workaround for this issue was to disable F-RTO (TCP forward RTO recovery) via a kernel parameter: in /etc/sysctl.conf, change the following setting from its default of 2 to 0.
net.ipv4.tcp_frto = 0
Then execute:
sysctl -p
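A quick sanity check that the new value is active on the running system (it will also survive reboots, since it lives in /etc/sysctl.conf):
# Should print: net.ipv4.tcp_frto = 0
sysctl net.ipv4.tcp_frto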
With a smile on my face, I can say that for the past three months we haven’t experienced any issues with NFS, and it’s as stable as it can be.
And for a permanent solution, I guess we’ll have to wait for NFS 4.2; fingers crossed it will fix everything for real this time.
Please feel free to share your experience with NFS in the comments, I would love to hear from you!
2 Responses
The blog post also mentions a permanent solution for the issue, which is to upgrade to NFS 4.2. NFS 4.2 includes a number of improvements that can help to prevent the race condition from occurring. However, NFS 4.2 is not yet widely supported by OpenStack.
Sadly, not much new development for the NFS backend in general. While I do agree there are better alternatives, it should still be a viable option for certain small deployments. Now people are forced to use SDS solutions like Ceph even for the smallest OpenStack deployments, exactly because NFS is so unreliable.