"Stale file handle" error when a process tries to read a file that another process has already deleted

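I am writing a stress-test suite for an NFS-based distributed file system.

In some cases, when one process deletes a file while another process tries to read from it, I get a "Stale file handle" error (errno 116).

Is this error expected and acceptable under such a race condition?

The test works as follows: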

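  1. Start x client machines
  2. Each client machine runs y processes
  3. Each process can perform any file operation, such as stat/read/delete/open
  4. The file operations mentioned are standard Python calls: os.stat/read/os.remove/open
  5. All files are empty (0 bytes of data)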
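
For illustration, one such worker process could be sketched roughly as below. This is not the actual test suite: the names do_operation and worker, the seed parameter, and the choice to create a missing file on "open" are all assumptions made for the sketch.

```python
import os
import random


def do_operation(op, path):
    """Run one file operation; return ("success", None) or ("failed", errno)."""
    try:
        if op == "stat":
            os.stat(path)
        elif op == "open":
            # Assumption: "open" creates the file if it is missing, then closes it.
            with open(path, "a"):
                pass
        elif op == "read":
            with open(path, "rb") as f:
                f.read()
        elif op == "delete":
            os.remove(path)
        else:
            raise ValueError("unknown operation: %s" % op)
        return ("success", None)
    except OSError as exc:
        # On NFS this is where errno 116 (ESTALE on Linux) would surface.
        return ("failed", exc.errno)


def worker(directory, names, iterations, seed):
    """Hypothetical worker loop: random operation on a random file each step."""
    rng = random.Random(seed)
    results = []
    for _ in range(iterations):
        op = rng.choice(["stat", "read", "delete", "open"])
        path = os.path.join(directory, rng.choice(names))
        results.append((op, path) + do_operation(op, path))
    return results
```

With y such workers on each of x clients all picking from the same set of names, a delete on one client can easily land between another client's lookup and read.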

The file exists, as a successful stat operation shows:

controller_debug.log.2:2016-10-26 15:02:30,156;INFO - [LG-E27A-LNX:0xa]: finished 640522b4d94c453ea545cb86568320ca, result: success | stat | /JUyw481MfvsBHOm1KQu7sHRB6ffAXKjwIATlsXmOgWh8XKQaIrPbxLgAo7sucdAM/o6V266xE8bTaUGzk8YDMfDAJp0YIfbT4fIK1oZ2R20tRX3xFCvjISj7WuMEwEV41 | data: {} | 2016/10/26 15:02:30.156

Process 0x1 on client CLIENT-A completed a successful delete:

controller_debug.log.2:2016-10-26 15:02:30,164;INFO - [CLIENT-A:0x1]: finished 5f5dfe6a06de495f851745a78857eec1, result: success | delete | /JUyw481MfvsBHOm1KQu7sHRB6ffAXKjwIATlsXmOgWh8XKQaIrPbxLgAo7sucdAM/o6V266xE8bTaUGzk8YDMfDAJp0YIfbT4fIK1oZ2R20tRX3xFCvjISj7WuMEwEV41 | data: {} | 2016/10/26 15:02:30.161

3 ms later, the "read" operation of process 0xb on client CLIENT-B failed with "Stale file handle":

controller_debug.log.2:2016-10-26 15:02:30,164;INFO - [CLIENT-B:0xb]: finished e84e2064ead042099310af1bd44821c0, result: failed | read | /mnt/DIRSPLIT-node0.b27-1/JUyw481MfvsBHOm1KQu7sHRB6ffAXKjwIATlsXmOgWh8XKQaIrPbxLgAo7sucdAM/o6V266xE8bTaUGzk8YDMfDAJp0YIfbT4fIK1oZ2R20tRX3xFCvjISj7WuMEwEV41 | [errno:116] | Stale file handle | 142 | data: {} | 2016/10/26 15:02:30.160
controller_debug.log.2:2016-10-26 15:02:30,164;ERROR - Operation read FAILED UNEXPECTEDLY on File JUyw481MfvsBHOm1KQu7sHRB6ffAXKjwIATlsXmOgWh8XKQaIrPbxLgAo7sucdAM/o6V266xE8bTaUGzk8YDMfDAJp0YIfbT4fIK1oZ2R20tRX3xFCvjISj7WuMEwEV41 due to Stale file handle

Thanks

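This is entirely expected. The NFS specification is clear about what happens when a file handle is used after the object it refers to (a file or a directory) has been deleted. Section 4 addresses this explicitly. For example: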

The persistent filehandle will become stale or invalid when the file system object is removed. When the server is presented with a persistent filehandle that refers to a deleted object, it MUST return an error of NFS4ERR_STALE.

The problem is common enough that it has its own entry, section A.10, in the NFS FAQ, which lists as one common cause of ESTALE errors:

The file handle refers to a deleted file. After a file is deleted on the server, clients don't find out until they try to access the file with a file handle they had cached from a previous LOOKUP. Using rsync or mv to replace a file while it is in use on another client is a common scenario that results in an ESTALE error.
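
For a stress test like the one in the question, this suggests counting ESTALE on a file that another process may have deleted as an expected outcome of the race rather than as an unexpected failure. A minimal sketch, assuming a hypothetical helper classify_failure and a flag that the test harness would have to supply itself:

```python
import errno

# Errors that a concurrent delete on another client can legitimately cause.
# ENOENT: the name was already gone when the path was looked up.
# ESTALE: the client still held a file handle for an object that was removed
#         (errno 116 on Linux).
EXPECTED_RACE_ERRNOS = {errno.ENOENT, errno.ESTALE}


def classify_failure(exc, file_may_be_deleted):
    """Return "expected" or "unexpected" for a failed file operation.

    file_may_be_deleted is an assumption of this sketch: the harness must
    know whether any process is allowed to delete the file concurrently.
    """
    if file_may_be_deleted and exc.errno in EXPECTED_RACE_ERRNOS:
        return "expected"
    return "unexpected"
```

Any other errno, or ESTALE on a file that nothing is supposed to delete, would still be reported as unexpected.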

The expected solution is that your client application must close and reopen the file to see what has happened. Or, as the FAQ puts it:

... to recover from an ESTALE error, an application must close the file or directory where the error occurred, and reopen it so the NFS client can resolve the pathname again and retrieve the new file handle.
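
In Python, that close-and-reopen recovery could look roughly like the sketch below. The function name read_with_reopen, the retries parameter, and the open_func hook (there only so the behaviour can be exercised without an NFS mount) are assumptions, not part of any NFS or Python API.

```python
import errno


def read_with_reopen(path, retries=1, open_func=open):
    """Read a whole file, reopening it by path if the handle is stale."""
    attempt = 0
    while True:
        try:
            # Opening by path lets the NFS client resolve the pathname again;
            # the with-block closes the old file object before any retry.
            with open_func(path, "rb") as f:
                return f.read()
        except OSError as exc:
            # Retry only on ESTALE, and only a bounded number of times.
            if exc.errno != errno.ESTALE or attempt >= retries:
                raise
            attempt += 1
```

If the file really was deleted, the reopen should fail with ENOENT (FileNotFoundError), which this sketch lets propagate; if the file was replaced, the retry reads the new one.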