Fatal error using python-javabridge JVM in Celery thread with NLTK on Mac OS X
I am using the Python wrapper for Weka, which is based on python-javabridge. I have a long task to perform, so I am using Celery to run it. The problem is that I get
A fatal error has been detected by the Java Runtime Environment:
SIGSEGV (0xb) at pc=0x00007fff91a3c16f, pid=11698, tid=3587
JRE version: (8.0_31-b13) (build )
Java VM: Java HotSpot(TM) 64-Bit Server VM (25.31-b07 mixed mode bsd-amd64 compressed oops)
Problematic frame:
C [libdispatch.dylib+0x616f] _dispatch_async_f_slow+0x18b
Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
If you would like to submit a bug report, please visit:
http://bugreport.java.com/bugreport/crash.jsp
The crash happened outside the Java Virtual Machine in native code.
See problematic frame for where to report the bug.
when starting the JVM inside the thread. These two lines of code (from weka.core.jvm) are used to start it:
javabridge.start_vm(run_headless=True)
javabridge.attach()
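As far as I understand, this could be because the JVM is not attached to the Celery thread. However, javabridge.attach() is indeed run inside that thread.
What am I missing?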
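Edit: I identified the code that causes the problem. It is related to the NLTK tokenizer. The following code, based on the hello.py example in the answer below, reproduces the error: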
# hello.py
from nltk.tokenize import RegexpTokenizer
import javabridge
from celery import Celery

app = Celery('hello', broker='amqp://guest@localhost//', backend='amqp')

started = False

@app.task
def hello():
    global started
    if not started:
        print 'Starting the VM'
        javabridge.start_vm(run_headless=True)
        started = True
    sentence = "This is a sentence with some numbers like 1, 2 or and some weird symbols like @, $ or ! :)"
    tokenizer = RegexpTokenizer(r'\w+')
    tokenized_sentence = tokenizer.tokenize(sentence.lower())
    print "Tokens:", tokenized_sentence
    return javabridge.run_script('java.lang.String.format("Hello, %s!", greetee);',
                                 dict(greetee='world'))
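Without starting the JVM, the code runs correctly. It also works when it is not run as a Celery task. I do not understand why it crashes.
Edit 2: It actually works in a clean Ubuntu environment (Dockerized), but not on Mac OS X Yosemite (v10.3).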
Edit 3: As mentioned in the comments, it works if the statement from nltk.tokenize import RegexpTokenizer is executed inside the task wrapper, that is, inside the hello() function.
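The workaround from Edit 3 can be sketched as below. This is only an illustration, not the exact code from the comments: tokenize_in_task is a hypothetical helper name, and in hello.py the same two statements would sit directly inside hello(). The point is that the import runs on the first call, in the process that executes the task, instead of when hello.py is first loaded.

```python
def tokenize_in_task(sentence):
    """Tokenize `sentence`, importing NLTK only when the task body runs.

    Hypothetical helper for illustration; not part of the original hello.py.
    """
    # Deferred import: executed on the first call, inside the worker process
    # that runs the task, rather than at module load time.
    from nltk.tokenize import RegexpTokenizer

    tokenizer = RegexpTokenizer(r'\w+')
    return tokenizer.tokenize(sentence.lower())
```

Python caches imported modules, so the import statement only does real work on the first call in each process.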
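Celery starts several separate worker processes, by default one per CPU core (four in the run shown below; see the -c command-line option of celery worker). You need to make sure that the JVM is started in every one of them. This example works for me: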
# hello.py
import os
import threading
from celery import Celery
import javabridge

app = Celery('hello', broker='amqp://guest@localhost//', backend='amqp')

started = False

@app.task
def hello():
    global started
    if not started:
        print 'Starting the VM'
        javabridge.start_vm(run_headless=True)
        started = True
    return javabridge.run_script('java.lang.String.format("Hello, %s!", greetee);',
                                 dict(greetee='world'))
and
# client.py
from hello import hello
r = hello.delay()
print r.get(timeout=1)
Installation on a fresh Ubuntu 14.04 machine:
$ sudo apt-get update -y
$ sudo apt-get install -y openjdk-7-jdk python-pip python-numpy python-dev rabbitmq-server
$ sudo pip install celery javabridge
$ sudo /etc/init.d/rabbitmq-server start
Start the worker:
$ celery -A hello worker
...
-------------- celery@a7cc1bedc40d v3.1.17 (Cipater)
---- **** -----
--- * *** * -- Linux-3.16.7-tinycore64-x86_64-with-Ubuntu-14.04-trusty
-- * - **** ---
- ** ---------- [config]
- ** ---------- .> app: hello:0x7f5464766b50
- ** ---------- .> transport: amqp://guest:**@localhost:5672//
- ** ---------- .> results: amqp
- *** --- * --- .> concurrency: 4 (prefork)
-- ******* ----
--- ***** ----- [queues]
-------------- .> celery exchange=celery(direct) key=celery
[2015-04-21 10:04:31,262: WARNING/MainProcess] celery@a7cc1bedc40d ready.
In another window, run the client five times:
$ python client.py
Hello, world!
$ python client.py
Hello, world!
$ python client.py
Hello, world!
$ python client.py
Hello, world!
$ python client.py
Hello, world!
In the worker window, observe that the JVM is started on the first four calls from the client (which go to four different processes) but not on the fifth:
[2015-04-21 10:05:53,491: WARNING/Worker-1] Starting the VM
[2015-04-21 10:05:55,028: WARNING/Worker-2] Starting the VM
[2015-04-21 10:05:56,411: WARNING/Worker-3] Starting the VM
[2015-04-21 10:05:57,318: WARNING/Worker-4] Starting the VM
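The log shows one "Starting the VM" line per worker because the started flag is a module-level global, and each worker process has its own copy of it. A minimal sketch of that guard follows, written so that it runs without a JVM. The name ensure_vm_started and the PID comparison are my own illustrative assumptions, not part of hello.py; the start function is passed in as a parameter.

```python
import os

# PID of the process in which start_vm() last ran (None if never started).
_vm_pid = None


def ensure_vm_started(start_vm):
    """Run start_vm() at most once per process.

    Returns True if start_vm() was called by this invocation, False if the
    VM had already been started in the current process.
    """
    global _vm_pid
    if _vm_pid == os.getpid():
        return False  # already started in this process
    start_vm()
    _vm_pid = os.getpid()
    return True
```

Inside the task this could be called as ensure_vm_started(lambda: javabridge.start_vm(run_headless=True)) before javabridge.run_script. Comparing PIDs rather than using a plain boolean is a design choice of this sketch: a child process that inherits the flag from its parent still starts its own VM.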