Duplicating a class instance with deepcopy causes infinite recursion somehow

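I am simply trying to make an independent copy of my URL class in Python so that I can modify the copy without affecting the original.

The following is a stripped-down, executable version of my problem code: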

from bs4 import BeautifulSoup
from copy import deepcopy
from urllib import request

url_dict = {}


class URL:
    def __init__(self, url, depth, log_entry=None, soup=None):
        self.url = url
        self.depth = depth  # Current, not total, depth level
        self.log_entry = log_entry
        self.soup = soup
        self.indent = '    ' * (5 - self.depth)
        self.log_url = 'test.com'

        # Blank squad
        self.parsed_list = []

    def get_log_output(self):
        return self.indent + self.log_url

    def get_print_output(self):
        if self.log_entry is not None:
            return self.indent + self.log_url + ' | ' + self.log_entry

        return self.indent + self.log_url

    def set_soup(self):
        if self.soup is None:
            code = ''

            try:  # Read and store code for parsing
                code = request.urlopen(self.url).read()
            except Exception as exception:
                print(str(exception))

            self.soup = BeautifulSoup(code, features='lxml')


def crawl(current_url, current_depth):
    current_check_link = current_url
    has_crawled = current_check_link in url_dict
    
    if current_depth > 0 and not has_crawled:
        current_crawl_job = URL(current_url, current_depth)
        current_crawl_job.set_soup()
        url_dict[current_check_link] = deepcopy(current_crawl_job)


for link in ['http://xts.site.nfoservers.com']:  # Crawl for each URL the user inputs
    crawl(link, 3)

The resulting exception:

Traceback (most recent call last):
File "/home/[CENSORED]/.vscode-oss/extensions/ms-python.python-2020.10.332292344/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_trace_dispatch_regular.py", line 374, in __call__
if cache_skips.get(frame_cache_key) == 1:
RecursionError: maximum recursion depth exceeded in comparison
Fatal Python error: _Py_CheckRecursiveCall: Cannot recover from stack overflow.
Python runtime state: initialized

I can't tell where this particular infinite recursion is occurring. I've read through questions such as "RecursionError when python copy.deepcopy", but I'm not even sure that it applies to my use case. If it does apply, then my brain just can't seem to understand it, as I'm under the impression that deepcopy() should just take each self attribute value and duplicate it into the new instance. If that's not the case, then I would love some enlightenment. All the articles in my search results are similar to this one and aren't very helpful for my situation.

Note that I'm not simply looking for a modified snippet of my code that fixes this. I mainly want to understand what exactly is going on here, so that I can both fix it now and avoid it in the future.

Edit: This seems to be a conflict between deepcopy and the set_soup() method. If I replace

url_dict[current_check_link] = deepcopy(current_crawl_job)

with

url_dict[current_check_link] = current_crawl_job

the code snippet above runs without errors. Likewise, if I remove current_crawl_job.set_soup() entirely, I don't get any errors either. I just can't have both.

Edit 2: If I remove either of the two snippets below (the try/except block or the BeautifulSoup assignment),

try:  # Read and store code for parsing
    code = request.urlopen(self.url).read()
except Exception as exception:
    print(str(exception))

self.soup = BeautifulSoup(code, features='lxml')

the error disappears again and the program runs fine.

This article states:

Deep copy is a process in which the copying process occurs recursively. It means first constructing a new collection object and then recursively populating it with copies of the child objects found in the original.

So my understanding is:

from copy import deepcopy

A = [1, 2, [3, 4], 5]

B = deepcopy(A)  # Recurses one level deep to copy the inner list

C = [1, [2, [3, [4, [5, [6]]]]]]

D = deepcopy(C)  # Recurses five levels deep, copying each nested list in turn
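To make that concrete, here is a minimal sketch that uses only the standard library. The helper name nest is hypothetical and exists only for illustration. It shows that deepcopy copes with modest nesting but raises RecursionError once the nesting is deeper than the interpreter's recursion limit allows:

```python
import copy
import sys


def nest(depth):
    """Build a list nested `depth` levels deep, e.g. nest(2) == [[[]]]."""
    node = []
    for _ in range(depth):
        node = [node]
    return node


shallow = nest(10)
deep = nest(sys.getrecursionlimit() * 2)  # deeper than the limit can handle

shallow_copy = copy.deepcopy(shallow)  # modest nesting copies without trouble

try:
    copy.deepcopy(deep)
    deep_failed = False
except RecursionError:
    # deepcopy is recursive, so each extra nesting level costs stack frames
    deep_failed = True

print(deep_failed)
```

The important point is that the failure depends on how deep the object graph is, not on how many bytes it occupies.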

My best guess

Python has a maximum recursion depth limit to prevent a stack overflow.

You can find the current maximum recursion depth limit using:

import sys
print(sys.getrecursionlimit())

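In your case, you are trying to deep copy a class object, and the recursive calls needed to copy that object must be exceeding the maximum recursion limit.

Possible solution

You can tell Python to use a higher maximum recursion limit with:
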
limit = 2000
sys.setrecursionlimit(limit)

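Or you could increase the limit as your program progresses. For more information, visit this link.

I'm not 100% sure that increasing the limit will do the job, but I'm fairly sure that some child object of your class object has so many inner objects that deepcopy goes crazy!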
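If you do want to try the higher limit, a safer pattern is to raise it only around the copy and restore it afterwards. The sketch below is an assumption about how one might do that, not something from your code: recursion_limit, nest and depth_of are hypothetical helpers, and a nested list stands in for your object. Be aware that a limit set too high can crash the interpreter with a real stack overflow instead of a clean RecursionError.

```python
import copy
import sys
from contextlib import contextmanager


@contextmanager
def recursion_limit(limit):
    """Temporarily raise the recursion limit and restore it on exit."""
    previous = sys.getrecursionlimit()
    sys.setrecursionlimit(max(limit, previous))
    try:
        yield
    finally:
        sys.setrecursionlimit(previous)


def nest(depth):
    """Build a list nested `depth` levels deep."""
    node = []
    for _ in range(depth):
        node = [node]
    return node


def depth_of(node):
    """Count nesting levels iteratively, so the check itself cannot recurse."""
    depth = 0
    while node:
        node = node[0]
        depth += 1
    return depth


base = sys.getrecursionlimit()
deep = nest(base)  # too deep for deepcopy at the default limit

with recursion_limit(base * 5):
    clone = copy.deepcopy(deep)  # succeeds only because the limit was raised

print(depth_of(clone) == base)
```

This treats the symptom rather than the cause, so it is best seen as a diagnostic aid.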

Edit

I have been told that the following line is the culprit:

self.soup = BeautifulSoup(code, features='lxml')

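When you call current_crawl_job.set_soup(), the soup attribute of your class, which was None, is replaced with a complex BeautifulSoup object. This is what causes trouble for deepcopy. A likely explanation is that a parsed document is a large web of objects that reference one another (parents, children, siblings), and deepcopy follows each of those references recursively, so a big page can need far more nested calls than the recursion limit allows.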
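Here is a rough illustration of that idea without BeautifulSoup. The Node class and its next_element attribute are hypothetical stand-ins for parsed elements, not bs4's real implementation. Even though the structure is "flat" (a simple chain, no nesting), deepcopy has to recurse once per link to copy it:

```python
import copy
import sys


class Node:
    """Hypothetical stand-in for a parsed element (not bs4's real class)."""

    def __init__(self, name):
        self.name = name
        self.next_element = None  # reference to the node that follows this one


def build_chain(length):
    """Link `length` nodes one after another and return the first node."""
    head = Node('n0')
    tail = head
    for i in range(1, length):
        tail.next_element = Node(f'n{i}')
        tail = tail.next_element
    return head


short = build_chain(50)
long_chain = build_chain(sys.getrecursionlimit())

short_copy = copy.deepcopy(short)  # a short chain copies fine

try:
    copy.deepcopy(long_chain)
    long_failed = False
except RecursionError:
    # copying node k requires copying node k+1 first, so the calls pile up
    long_failed = True

print(long_failed)
```

That would also fit your observation that an empty document (code = '') copies fine while a real page does not.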

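Suggestion

In the set_soup method, keep the self.soup attribute as the raw HTML string, and convert it to a BeautifulSoup object only at the point where you need to work with it. This should solve your deepcopy problem.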
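A minimal sketch of that suggestion follows. The html attribute and the make_soup method are names I made up for illustration, the class is trimmed down from yours, and fetching the page is left out. Only the raw string lives on the instance, so deepcopy has nothing deep to walk:

```python
from copy import deepcopy


class URL:
    def __init__(self, url, depth, html=''):
        self.url = url
        self.depth = depth
        self.html = html  # raw markup only: a plain string is trivial to copy

    def make_soup(self):
        """Parse the stored markup on demand; the result is not kept on self."""
        # Third-party import kept local so copying never depends on the parser.
        from bs4 import BeautifulSoup
        return BeautifulSoup(self.html, features='lxml')


page = '<div>' * 5000 + '</div>' * 5000  # large markup, still just a string
original = URL('http://example.com', 3, html=page)

clone = deepcopy(original)  # no parse tree on the instance, so no deep recursion
clone.html = '<p>changed</p>'

print(original.html == page)
```

Call make_soup() wherever you actually need to query or modify the document, and write the result back to html as a string if you change it.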