python - Why do multiple threads and different functions/scopes share a single import process?

This pitfall is the first truly hard-to-find bug I have hit since I started working with python years ago.

Let me give an oversimplified example. I have these files/dir:

[xiaobai@xiaobai import_pitfall]$ tree -F -C -a
.
├── import_all_pitall/
│   ├── hello.py
│   └── __init__.py
└── thread_test.py

1 directory, 3 files
[xiaobai@xiaobai import_pitfall]$

Contents of thread_test.py:

[xiaobai@xiaobai import_pitfall]$ cat thread_test.py 
import time
import threading

def do_import1():
    print( "do_import 1A" )
    from import_all_pitall import hello
    print( "do_import 1B", id(hello), locals() )

def do_import2():
    print( "do_import 2A" )
    from import_all_pitall import hello as h
    print( "do_import 2B", id(h), locals() )

def do_import3():
    print( "do_import 3A" )
    import import_all_pitall.hello as h2
    #no problem if import different module #import urllib as h2
    print( "do_import 3B", id(h2), locals() )

print( "main 1" )
t = threading.Thread(target=do_import1)
print( "main 2" )
t.start()
print( "main 3" )
t2 = threading.Thread(target=do_import2)
print( "main 4" )
t2.start()
print( "main 5" )
print(globals()) #no such hello
#time.sleep(2) #slightly wait for do_import 1A import finished to test print hello below.
#print( "main 6", id(hello), locals() ) #"name 'hello' not defined" error even do_import1 was success
do_import3()
print( "main -1" )
[xiaobai@xiaobai import_pitfall]$

Contents of hello.py:

[xiaobai@xiaobai import_pitfall]$ cat import_all_pitall/hello.py
print( "haha0" )
import time
t = time.time()
print( "haha1" )
def do_task():
    success = 0
    while not success:
        try:
            time.sleep(1)
            undefined_func( "Done haha" )
            success = 1
        except Exception as e:
            print("exception occur", e)
            print( "haha time is ", t )
do_task()
print( "haha -1" )
[xiaobai@xiaobai import_pitfall]$

And import_all_pitall/__init__.py is an empty file.

Let's run it:

[xiaobai@xiaobai import_pitfall]$ python thread_test.py 
main 1
main 2
do_import 1A
 main 3
haha0
haha1
main 4
do_import 2A
main 5
{'do_import1': <function do_import1 at 0x7f9d884760c8>, 'do_import3': <function do_import3 at 0x7f9d884a6758>, 'do_import2': <function do_import2 at 0x7f9d884a66e0>, '__builtins__': <module '__builtin__' (built-in)>, '__file__': 'thread_test.py', 't2': <Thread(Thread-2, started 140314429765376)>, '__package__': None, 'threading': <module 'threading' from '/usr/lib64/python2.7/threading.pyc'>, 't': <Thread(Thread-1, started 140314438158080)>, 'time': <module 'time' from '/usr/lib64/python2.7/lib-dynload/timemodule.so'>, '__name__': '__main__', '__doc__': None}
do_import 3A
('exception occur', NameError("global name 'undefined_func' is not defined",))
('haha time is ', 1439451183.753475)
('exception occur', NameError("global name 'undefined_func' is not defined",))
('haha time is ', 1439451183.753475)
('exception occur', NameError("global name 'undefined_func' is not defined",))
('haha time is ', 1439451183.753475)
('exception occur', NameError("global name 'undefined_func' is not defined",))
('haha time is ', 1439451183.753475)
('exception occur', NameError("global name 'undefined_func' is not defined",))
('haha time is ', 1439451183.753475)
('exception occur', NameError("global name 'undefined_func' is not defined",))
('haha time is ', 1439451183.753475)
^C('exception occur', NameError("global name 'undefined_func' is not defined",))
('haha time is ', 1439451183.753475)
('exception occur', NameError("global name 'undefined_func' is not defined",))
('haha time is ', 1439451183.753475)
^C('exception occur', NameError("global name 'undefined_func' is not defined",))
('haha time is ', 1439451183.753475)
^C^C('exception occur', NameError("global name 'undefined_func' is not defined",))
('haha time is ', 1439451183.753475)
... #Forever

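Look carefully: where are "do_import 2B" and "do_import 3B"? The other two calls simply hang on the import statement. They never even reach the first line of the module, because time.time() ran only once (every "haha time is" line shows the same timestamp). They hang only because the same module was first imported by another thread/function that is still stuck in its "unfinished" loop. My whole system is large and multithreaded, so this was very hard to debug before I understood the situation.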

After I comment out the '#undefined_func( "Done haha" )' in hello.py:

print( "haha0" )
import time
t = time.time()
print( "haha1" )
def do_task():
    success = 0
    while not success:
        try:
            time.sleep(1)
            #undefined_func( "Done haha" )
            success = 1
        except Exception as e:
            print("exception occur", e)
            print( "haha time is ", t )
do_task()
print( "haha -1" )

and run it:

[xiaobai@xiaobai import_pitfall]$ python3 thread_test.py 
main 1
main 2
do_import 1A
main 3
main 4
do_import 2A
main 5
{'do_import3': <function do_import3 at 0x7f31a462c048>, '__package__': None, 't2': <Thread(Thread-2, started 139851179529984)>, '__name__': '__main__', '__cached__': None, 'threading': <module 'threading' from '/usr/lib64/python3.4/threading.py'>, '__doc__': None, 'do_import2': <function do_import2 at 0x7f31ac1d56a8>, 'do_import1': <function do_import1 at 0x7f31ac2c0bf8>, '__spec__': None, 't': <Thread(Thread-1, started 139851187922688)>, '__file__': 'thread_test.py', 'time': <module 'time' from '/usr/lib64/python3.4/lib-dynload/time.cpython-34m.so'>, '__loader__': <_frozen_importlib.SourceFileLoader object at 0x7f31ac297048>, '__builtins__': <module 'builtins' (built-in)>}
do_import 3A
haha0
haha1
haha -1
do_import 1B 139851188124312 {'hello': <module 'import_all_pitall.hello' from '/home/xiaobai/note/python/import_pitfall/import_all_pitall/hello.py'>}
do_import 2B 139851188124312 {'h': <module 'import_all_pitall.hello' from '/home/xiaobai/note/python/import_pitfall/import_all_pitall/hello.py'>}
do_import 3B 139851188124312 {'h2': <module 'import_all_pitall.hello' from '/home/xiaobai/note/python/import_pitfall/import_all_pitall/hello.py'>}
main -1
[xiaobai@xiaobai import_pitfall]$ 

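I print the id and find that they all share the same id, 139851188124312. So the 3 functions share the same import object/process. But this doesn't make sense to me. I thought the object was local to the function, because if I try to print the imported "hello" object in the global scope, it throws an error: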

Edit thread_test.py to print the hello object in the global scope:

...
print( "main 5" )
print(globals()) #no such hello
time.sleep(2) #slightly wait for do_import 1A import finished to test print hello below.
print( "main 6", id(hello), locals() ) #"name 'hello' not defined" error even do_import1 was success
do_import3()
print( "main -1" )

Let's run it:

[xiaobai@xiaobai import_pitfall]$ python3 thread_test.py 
main 1
main 2
do_import 1A
main 3
main 4
do_import 2A
main 5
{'t': <Thread(Thread-1, started 140404878976768)>, '__spec__': None, 'time': <module 'time' from '/usr/lib64/python3.4/lib-dynload/time.cpython-34m.so'>, '__cached__': None, '__loader__': <_frozen_importlib.SourceFileLoader object at 0x7fb296b87048>, 'do_import2': <function do_import2 at 0x7fb296ac56a8>, 'do_import1': <function do_import1 at 0x7fb296bb0bf8>, '__doc__': None, '__file__': 'thread_test.py', 'do_import3': <function do_import3 at 0x7fb28ef19f28>, 't2': <Thread(Thread-2, started 140404870584064)>, '__name__': '__main__', '__package__': None, '__builtins__': <module 'builtins' (built-in)>, 'threading': <module 'threading' from '/usr/lib64/python3.4/threading.py'>}
haha0
haha1
haha -1
do_import 1B 140404879178392 {'hello': <module 'import_all_pitall.hello' from '/home/xiaobai/note/python/import_pitfall/import_all_pitall/hello.py'>}
do_import 2B 140404879178392 {'h': <module 'import_all_pitall.hello' from '/home/xiaobai/note/python/import_pitfall/import_all_pitall/hello.py'>}
Traceback (most recent call last):
  File "thread_test.py", line 31, in <module>
    print( "main 6", id(hello), locals() ) #"name 'hello' not defined" error even do_import1 was success
NameError: name 'hello' is not defined
[xiaobai@xiaobai import_pitfall]$ 

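hello is not global, so why can it be shared by different threads in different functions? Why doesn't python allow a unique local import? Why does python share the import process and make all the other threads "wait" for no apparent reason, just because one thread hangs during the import?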

I suggest you name your threads and print threading.current_thread().name in all of your print calls. That will make the output much easier to follow.

Look carefully, where are "do_import 2B" and "do_import 3B"?

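Python is still loading the module at that point, and the Python import process is thread-safe. This means two threads cannot load the same module at the same time. It is not about time.time() being processed; it is about the import lock, which the first thread holds until the module's top-level code has finished.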
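
The blocking can be reproduced in a few lines. This is only a sketch, assuming Python 3; the module name slow_mod and its 0.5 second sleep are made up for the illustration and are not part of the question's code. Two threads import the same module, and the second one has to wait until the first has finished running the module's top-level code.

```python
import importlib
import os
import sys
import tempfile
import threading
import time

# Hypothetical module that is slow to import: its top-level code sleeps.
MODULE_SOURCE = (
    "import time\n"
    "time.sleep(0.5)\n"
    "LOADED_AT = time.time()\n"
)

tmp_dir = tempfile.mkdtemp()
with open(os.path.join(tmp_dir, "slow_mod.py"), "w") as f:
    f.write(MODULE_SOURCE)
sys.path.insert(0, tmp_dir)
importlib.invalidate_caches()

results = {}

def worker():
    start = time.time()
    import slow_mod  # whichever thread arrives second blocks here
    waited = time.time() - start
    results[threading.current_thread().name] = (waited, id(slow_mod), slow_mod.LOADED_AT)

threads = [threading.Thread(target=worker, name="T%d" % i) for i in (1, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

for name, (waited, module_id, loaded_at) in sorted(results.items()):
    print(name, "waited %.2fs" % waited, "module id", module_id)
```

Both threads report roughly the same wait and the same module id, and LOADED_AT is identical for both, which suggests the top-level code ran once while the other thread waited. If the top-level code never returns, as in the question's hello.py, the waiting thread never gets past its import statement.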

I print the id and figure they all share the same id 140589697897480

Yes, because Python loads a module only once. Think of your Python module as a singleton.

hello is not global, but why can it be shared by different threads in different functions?

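That is because hello is a local variable that points to the shared module. If, as said above, you think of the module as a singleton, and you keep in mind that all memory is shared between the threads of one process, then the singleton is shared by all threads.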
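
A small sketch of the difference between the local name and the shared object, using the standard json module as a stand-in for hello (the function and alias names are invented for the example):

```python
def first():
    import json as mod_in_first    # local name, exists only inside first()
    return id(mod_in_first)

def second():
    import json as mod_in_second   # another local name, inside second()
    return id(mod_in_second)

# Both local names were bound to the very same module object.
same_object = first() == second()
print("same module object:", same_object)

# The names themselves never reach the enclosing scope.
try:
    mod_in_first
    name_visible = True
except NameError:
    name_visible = False
print("name visible outside the function:", name_visible)
```

The import statement does two things: it obtains the module object, which is shared process-wide, and it binds a name to it, which follows the normal scoping rules. Only the second part is local.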

As many people say, it's not a bug, it's a feature :)


Here is another example. Consider 2 files: main.py is the file that is executed, and other.py is the file that is imported.

Here is main.py:

import threading
import logging
logging.basicConfig(level=logging.INFO)

def do_import_1():
    import other
    logging.info("I am %s and who did the import job ? %s", threading.current_thread().name, other.who_did_the_job.name)

def do_import_2():
    import other
    logging.info(other.who_did_the_job.name)
    logging.info("I am %s and who did the import job ? %s", threading.current_thread().name, other.who_did_the_job.name)

thread_import_1 = threading.Thread(target=do_import_1, name="Thread import 1")
thread_import_2 = threading.Thread(target=do_import_2, name="Thread import 2")

thread_import_1.start()
thread_import_2.start()

Here is other.py:

import threading

who_did_the_job = threading.current_thread()
print "Thread loading the module : ", who_did_the_job.name

I use logging to avoid problems when the 2 threads try to write to stdout at the same time. Here is the result I get (python 2.7):

Thread loading the module :  Thread import 1
INFO:root:I am Thread import 1 and who did the import job ? Thread import 1
INFO:root:Thread import 1
INFO:root:I am Thread import 2 and who did the import job ? Thread import 1

As you can see, the module is imported only once.

To answer one of the questions -

I print the id and figure they all share the same id 140589697897480. So 3 functions share the same import object/process.

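Yes. When you import a module, python creates the module object and caches it in sys.modules. For any subsequent import of that module, python fetches the module object from sys.modules and returns it; it does not import the module again.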

For the second part of the same question -

But this doesn't make sense to me, I thought the object is local to the function, because if I try to print the imported "hello" object on global scope, it will throw an error

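Well, sys.modules is not local, but the name hello is local to the function. As said above, if you try to import the module again, python first looks in sys.modules to see whether it has already been imported. If it is there, python returns it; otherwise python imports it and adds it to sys.modules.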
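
The lookup described above can be observed directly. The following is a sketch rather than a description of the interpreter's internals; the module name cached_mod is hypothetical and is written to a temporary directory just for this demonstration.

```python
import importlib
import os
import sys
import tempfile

# Hypothetical module that announces when its top-level code runs.
tmp_dir = tempfile.mkdtemp()
with open(os.path.join(tmp_dir, "cached_mod.py"), "w") as f:
    f.write('print("top-level code of cached_mod is running")\n')
sys.path.insert(0, tmp_dir)
importlib.invalidate_caches()

def import_in_function():
    import cached_mod   # the name is local, the object comes from sys.modules
    return cached_mod

in_cache_before = "cached_mod" in sys.modules   # not imported yet
first = import_in_function()                    # runs the top-level code
in_cache_after = "cached_mod" in sys.modules    # now cached
second = import_in_function()                   # cache hit, nothing is re-run

print(in_cache_before, in_cache_after)
print(first is second, first is sys.modules["cached_mod"])
```

The message from cached_mod appears once even though the import statement executes twice, and both calls return the object stored in sys.modules.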


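As for the first program: when a python module is imported, its code is run from the top level, and your hello.py calls do_task() there. Its while loop is effectively infinite, because undefined_func raises NameError on every pass, so success is never set to 1. So the import never finishes.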

If you don't want the infinite loop to run on import, put the code that should not run when the module is imported inside -

if __name__ == '__main__':

The code inside the above if statement runs only when the script is run directly; it does not run when the module is imported.
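
A sketch of how the guard behaves, assuming a hypothetical guarded_hello.py (a stand-in for hello.py with a trivial do_task) written to a temporary directory. The same file is first imported and then executed as a script through runpy.run_path, which approximates `python guarded_hello.py`.

```python
import importlib
import os
import runpy
import sys
import tempfile

# Hypothetical guarded version of a module like hello.py.
GUARDED_SOURCE = '''
RAN_MAIN = False

def do_task():
    return "task finished"

if __name__ == "__main__":
    RAN_MAIN = True
    print(do_task())
'''

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "guarded_hello.py")
with open(path, "w") as f:
    f.write(GUARDED_SOURCE)
sys.path.insert(0, tmp_dir)
importlib.invalidate_caches()

# Imported: __name__ is "guarded_hello", so the guarded block is skipped.
import guarded_hello
print("imported, RAN_MAIN =", guarded_hello.RAN_MAIN)

# Run as a script: __name__ is "__main__", so the guarded block executes.
as_script = runpy.run_path(path, run_name="__main__")
print("run as script, RAN_MAIN =", as_script["RAN_MAIN"])
```

With the long-running call behind the guard, an import returns as soon as the definitions have been executed, so other threads importing the same module are not left waiting.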


I guess that when you say -

After I comment out the '#undefined_func( "Done haha" )' in hello.py

you actually disabled the infinite loop, and that is why the import succeeded.