OSError 24 when opening FITS with Astropy

Question

First of all, I have already read the following:

https://astropy.readthedocs.io/en/latest/io/fits/appendix/faq.html#i-m-opening-many-fits-files-in-a-loop-and-getting-oserror-too-many-open-files

and some of the links from that page, but none of them worked...

My problem is with opening huge (>80 MB each) and numerous (~3000) FITS files in a Jupyter Notebook. The relevant code snippet is the following:

# Dictionary to store NxN data matrices of cropped image tiles
CroppedObjects = {}

# Define some other variables used below...
# ...

# Iterate over all images ('j') which contain the current object, indexed by 'i'
for i in range(0, len(finalObjects)):
    for j in range(0, len(containingImages[containedObj[i]])):

        countImages += 1

        # Path to the current image: 'mnt/...'
        current_image_path = ImagePaths[int(containingImages[containedObj[i]][j])]

        # Open .fits images
        with fits.open(current_image_path, memmap=False) as hdul:
            # Collect image data
            image_data = fits.getdata(current_image_path)

            # Collect WCS data from the current .fits's header
            ImageWCS = wcs.WCS(hdul[1].header)

            # Cropping parameters:
            # 1. Sky-coordinates of the croppable object
            # 2. Size of the crop, already defined above
            Coordinates = coordinates.SkyCoord(finalObjects[i][1]*u.deg,finalObjects[i][2]*u.deg, frame='fk5')
            size = (cropSize*u.pixel, cropSize*u.pixel)

            try:
                # Cut out the image tile
                cutout = Cutout2D(image_data, position=Coordinates, size=size, wcs=ImageWCS, mode='strict')

                # Write the cutout to a new FITS file
                cutout_filename = "Cropped_Images_Sorted/Cropped_" + str(containedObj[i]) + current_image_path[-23:]

                # Save data to dictionary
                CroppedObjects[cutout_filename] = cutout.data

                foundImages += 1

            except:
                pass

            else:
                del image_data
                continue

        # Memory maintenance
        gc.collect()

        # Progress bar
        sys.stdout.write("\rProgress: [{0}{1}] {2:.3f}%\tElapsed: {3}\tRemaining: {4}  {5}".format(u'\u2588' * int(countImages/allCrops * progressbar_width),
                                                                                                   u'\u2591' * (progressbar_width - int(countImages/allCrops * progressbar_width)),
                                                                                                   countImages/allCrops * 100,
                                                                                                   datetime.now()-starttime,
                                                                                                   (datetime.now()-starttime)/countImages * (allCrops - countImages),
                                                                                                   foundImages))

        sys.stdout.flush()

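Well, it actually does three things:

Opens a specific FITS file
Cuts out a square from it (but strictly, so if the arrays only partially overlap, the try statement skips to the next step in the loop)
Updates the progress bar

Then it moves on to the next file, does the same thing, and so iterates over all my FITS files.

But: if I try to run this code, it stops after finding about 1000 images and gives OSError: [Errno 24] Too many open files on the line: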

image_data = fits.getdata(current_image_path)

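I tried everything that was supposed to solve the problem, but nothing helped... Not even setting memory mapping to False, or using fits.getdata and gc.collect()... I also tried many minor changes, such as running without the try statement and cutting out all image tiles without any restriction. The del inside the else statement is another of my miserable attempts. What else can I try to make this finally work?
Also, please feel free to ask if something is unclear! I will do my best to help you understand the problem!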

Answer 1

I have had similar issues in the past (see here). In the end I got it to work roughly like this:

total = 0
for filename in filenames:
    with fits.open(filename, memmap=False) as hdulist:
        data = hdulist['spam'].data
    total += data.sum()

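Some notes:

Open the file with fits.open, using memmap=False
Use it in a with block, so that the file is closed reliably
Keep the with block short: just load the data you need into memory, then close the file by exiting the block
Do whatever you need to do with the data after the file is closed; this may not really be needed, but if the problem is that Python references to data in the file are preventing it from being closed, then this simplifies the situation. I don't think the cutout code is the problem in your example, but it could be - try commenting it out?
Don't make the extra fits.getdata call; I think that opens the file again
del and gc.collect are not needed; if the code is as simple as suggested here, there are no circular references and Python will reliably delete objects when they go out of scope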

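Now it could be that this doesn't help and you still have the issue. In that case, the way to proceed is to make a minimal reproducible example that Astropy developers can run (like I did here) but that doesn't work for you, and then file an issue with Astropy, giving your Python version, Astropy version, and operating system, or post it here. The point is: this is complex and probably runtime/version dependent, so you need to try to pin it down to an example that anyone can run but that fails for you.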
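
A minimal sketch of what such a reproducible example could look like, assuming it is acceptable to generate small synthetic files instead of using the real images. The helper names (write_sample_files, sum_all, environment_report), the file count, and the use of the primary HDU (index 0) are all made up for illustration; raise the count towards the number of real files when trying to trigger the error.

```python
import os
import platform
import sys
import tempfile

import astropy
import numpy as np
from astropy.io import fits


def write_sample_files(directory, count=20):
    """Write `count` tiny FITS files (hypothetical stand-ins for real images)."""
    paths = []
    for k in range(count):
        path = os.path.join(directory, f"sample_{k:04d}.fits")
        fits.PrimaryHDU(data=np.ones((4, 4), dtype=np.float32)).writeto(path)
        paths.append(path)
    return paths


def sum_all(paths):
    """Open every file in a short with block and use the data after closing."""
    total = 0.0
    for path in paths:
        with fits.open(path, memmap=False) as hdulist:
            data = hdulist[0].data  # loaded into memory inside the block
        total += float(data.sum())  # file is already closed here
    return total


def environment_report():
    """Collect the version details worth quoting in a bug report."""
    return {
        "python": sys.version.split()[0],
        "astropy": astropy.__version__,
        "platform": platform.platform(),
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        print(sum_all(write_sample_files(directory, count=20)))
    print(environment_report())
```

If a script like this runs cleanly with thousands of files but the real loop still fails, the difference between the two is the place to look.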

Answer 2

This is the line that is hurting you:

image_data = fits.getdata(current_image_path)

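You just opened the file on the previous line with memmap=False, but on this line you re-open it with memmap=True, and you keep the file open by holding on to a reference to image_data: you wrap it in a Cutout2D and then keep a reference to the data with: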

CroppedObjects[cutout_filename] = cutout.data

As far as I know, Cutout2D does not copy the data unless it has to, so you are still effectively holding a reference to image_data, which is mmap'd.
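
The mechanism can be illustrated with the standard library alone. This is only a sketch, not Astropy's code: it assumes a Unix-like system (the resource module is not available on Windows) and relies on the fact that Python's mmap module duplicates the file descriptor by default, so a mapping that is still referenced keeps a descriptor in use even after the file object is closed. The function names and the limit of 64 are made up for the demonstration.

```python
import errno
import mmap
import os
import resource
import tempfile


def make_sample_file():
    """Create a small non-empty file to map (hypothetical stand-in for a FITS file)."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"x" * 4096)
        return f.name


def mappings_until_emfile(path, soft_limit=64):
    """Keep references to mmaps of closed files until the descriptor limit is hit.

    Returns (number of mappings held, errno of the failure or None).
    """
    old_soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Temporarily lower the soft limit so the failure shows up quickly.
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, hard))
    held = []
    try:
        for _ in range(soft_limit + 1):
            with open(path, "rb") as f:
                # The file object is closed at the end of this block,
                # but the mapping keeps its own duplicated descriptor.
                held.append(mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ))
        return len(held), None
    except OSError as exc:
        return len(held), exc.errno
    finally:
        for m in held:
            m.close()  # releasing the mappings releases the descriptors
        resource.setrlimit(resource.RLIMIT_NOFILE, (old_soft, hard))


if __name__ == "__main__":
    sample = make_sample_file()
    held, err = mappings_until_emfile(sample)
    os.remove(sample)
    print(held, err == errno.EMFILE)
```

Every with block here closes its file, yet the loop still runs out of descriptors, because the list keeps the mappings alive. In the question, the CroppedObjects dictionary plays the role of that list.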

Solution: don't use fits.getdata here. See the warning about it in the docs:

These functions are useful for interactive Python sessions and simple analysis scripts, but should not be used for application code, as they are highly inefficient. For example, each call to getval() requires re-parsing the entire FITS file. Code that makes repeated use of these functions should instead open the file with open() and access the data structures directly.

So in your case you want to replace the line:

image_data = fits.getdata(current_image_path)

with

image_data = hdul[1].data

As @Christoph wrote in his answer, get rid of all the del image_data and gc.collect() stuff, since it is not helping you anyway.

Addendum: from the API documentation for Cutout2D:

If False (default), then the cutout data will be a view into the original data array. If True, then the cutout data will hold a copy of the original data array.

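So this states explicitly (and I confirmed it by looking at the code) that Cutout2D just takes a view into the original data array, which means it keeps a reference to it. If you want, you can avoid this by calling Cutout2D(..., copy=True). If you do that, you can probably also drop memmap=False. Using mmap may or may not be useful: it depends partly on the size of the images and on the available physical RAM. In your case it may be faster, since you are not using the whole images but only taking cutouts of them. That means using memmap=True could be more efficient, as it can avoid paging the entire image array into memory.

But this may also depend on many factors, so you may want to run some performance tests with fits.open(..., memmap=False)+Cutout2D(..., copy=False) versus fits.open(..., memmap=True)+Cutout2D(..., copy=True), perhaps with a smaller number of files.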

Tags: python, fits, astropy, jupyter-notebook