Python/Linux: How to determine when a moved file is fully available?
I have a folder that constantly has new files added to it. I have a Python script that uses os.listdir() to find these files and then automatically runs an analysis on them. However, the files are quite large, so they show up in os.listdir() before they have actually been fully written/copied. Is there any way to tell which files are not still in the middle of being moved? Comparing sizes with os.path.getsize() doesn't seem to work.
Python 3.7.3 on Raspbian Buster, Raspberry Pi 4. I'm new to programming and to Linux.
Thanks!
For a conceptual explanation of atomic and cross-filesystem moves, see this: moves in Python (it can really save you time).
You can take any of the following approaches to solve your problem:
-> Use Pyinotify to monitor filesystem events: usage of Pynotify (a minimal sketch follows after this list).
-> Use flock to lock the file for a few seconds.
-> Use lsof to check which processes are currently using a particular file:
from subprocess import check_output, Popen, PIPE, CalledProcessError

try:
    lsout = Popen(['lsof', filename], stdout=PIPE, shell=False)
    check_output(["grep", filename], stdin=lsout.stdout, shell=False)
except CalledProcessError:
    # check_output raises CalledProcessError when grep finds no match,
    # i.e. no process is currently using the file
    pass
Just put your logging/handling code in the except block.
-> A daemon that watches the parent folder for any changes, for example using the watchdog library: watchdog implementation.
-> You can check which files another process is using by looping over the PIDs in /proc for a particular id (assuming you have control over the program that keeps adding the new files, so that you can identify its id).
-> psutil can be used to check whether a file still has open handles (see the sketch after this list).
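A minimal sketch of the Pyinotify approach (the watch directory is a placeholder, not code from the original answer): react to IN_CLOSE_WRITE, which fires when a writer closes the file, and IN_MOVED_TO, which fires when a file has been moved into the watched folder, so the callback only ever sees files that are already complete.

import pyinotify

WATCH_DIR = '/path/to/incoming'  # placeholder: the folder your script scans

class Handler(pyinotify.ProcessEvent):
    # IN_CLOSE_WRITE: a file that was opened for writing has been closed
    def process_IN_CLOSE_WRITE(self, event):
        print('ready:', event.pathname)

    # IN_MOVED_TO: a file has been moved into the watched directory
    def process_IN_MOVED_TO(self, event):
        print('ready:', event.pathname)

wm = pyinotify.WatchManager()
wm.add_watch(WATCH_DIR, pyinotify.IN_CLOSE_WRITE | pyinotify.IN_MOVED_TO)
notifier = pyinotify.Notifier(wm, Handler())
notifier.loop()

And a rough sketch of the psutil check (the helper name file_is_open is illustrative, not from the original answer): if no running process has the file open any more, it should be safe to analyse it.

import os
import psutil

def file_is_open(path):
    # True if any running process currently has `path` open
    path = os.path.abspath(path)
    for proc in psutil.process_iter(['open_files']):
        for f in (proc.info['open_files'] or []):
            if f.path == path:
                return True
    return False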
In programming this is known as concurrency, which is when computations happen simultaneously and the order of execution is not guaranteed. In your case, one program begins to read a file before another program has finished writing to it. This particular problem is called the readers-writers problem, and it is actually quite common in embedded systems.
There are many solutions to this problem, but the simplest and most common one is a lock. The simplest kind of lock protects a resource from being accessed by more than one program at the same time; in effect, it makes sure that operations on the resource happen atomically. A lock is implemented as an object that can be acquired or released (these are usually methods of the object). The program tries to acquire the lock in a loop that iterates for as long as the lock has not been acquired. Once the lock is acquired, the program holding it may execute some block of code (usually guarded by a simple if-statement), after which the lock is released. Note that what I am referring to here as a program is typically called a thread.
In Python, you can use a threading.Lock object for this. First, create a Lock object:
from threading import Lock
file_lock = Lock()
Then, in each thread, wait to acquire the lock before proceeding. With blocking=True, the thread simply halts until the lock is acquired, so no loop is needed:
file_lock.acquire(blocking=True)
# atomic operation on the file
file_lock.release()
Note that the same lock object must be shared by every thread. Acquire the lock before reading or writing the file and release it afterwards; this ensures that those operations can no longer happen at the same time.
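As a small usage sketch (not part of the original answer; analyse and path are placeholders): the same acquire/release pattern is usually written with a with block, which releases the lock automatically even if the code inside raises an exception.

with file_lock:
    # read or write the file here; the lock is released automatically
    # when the block exits, even on an exception
    analyse(path)  # hypothetical analysis function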