Python re.split() vs nltk word_tokenize and sent_tokenize
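I was going through this question.
I was just wondering whether NLTK is faster than regex at word/sentence tokenization.
The default nltk.word_tokenize() uses the Treebank tokenizer, which emulates the tokenizer from the Penn Treebank.
Note that str.split() does not produce tokens in the linguistic sense, e.g.: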
>>> sent = "This is a foo, bar sentence."
>>> sent.split()
['This', 'is', 'a', 'foo,', 'bar', 'sentence.']
>>> from nltk import word_tokenize
>>> word_tokenize(sent)
['This', 'is', 'a', 'foo', ',', 'bar', 'sentence', '.']
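str.split() is usually used to split a string on a specified delimiter, e.g. in a tab-separated file you can use str.split('\t'), or to split on the newline \n when your text file has one sentence per line.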
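For comparison with the regex approach the question asks about, here is a minimal sketch of a regex tokenizer that separates punctuation from words. The pattern and the name regex_tokenize are illustrative assumptions, not what NLTK uses internally; only the standard re module is needed.

```python
import re

# Hypothetical pattern: a run of word characters, or a single
# character that is neither a word character nor whitespace
# (i.e. one punctuation mark).
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def regex_tokenize(text):
    """Return the word and punctuation tokens found in text."""
    return TOKEN_RE.findall(text)

sent = "This is a foo, bar sentence."
print(regex_tokenize(sent))
# ['This', 'is', 'a', 'foo', ',', 'bar', 'sentence', '.']
```

A pattern this simple matches word_tokenize() on the toy sentence but not in general: it splits "don't" into 'don', "'", 't', whereas the Treebank convention yields 'do' and "n't".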
Let's do some benchmarking in python3:
import time
from nltk import word_tokenize
import urllib.request
url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
response = urllib.request.urlopen(url)
data = response.read().decode('utf8')
for _ in range(10):
    start = time.time()
    for line in data.split('\n'):
        line.split()
    print ('str.split():\t', time.time() - start)
for _ in range(10):
    start = time.time()
    for line in data.split('\n'):
        word_tokenize(line)
    print ('word_tokenize():\t', time.time() - start)
[out]:
str.split(): 0.05451083183288574
str.split(): 0.054320573806762695
str.split(): 0.05368804931640625
str.split(): 0.05416440963745117
str.split(): 0.05299568176269531
str.split(): 0.05304527282714844
str.split(): 0.05356955528259277
str.split(): 0.05473494529724121
str.split(): 0.053118228912353516
str.split(): 0.05236077308654785
word_tokenize(): 4.056122779846191
word_tokenize(): 4.052812337875366
word_tokenize(): 4.042144775390625
word_tokenize(): 4.101543664932251
word_tokenize(): 4.213029146194458
word_tokenize(): 4.411528587341309
word_tokenize(): 4.162556886672974
word_tokenize(): 4.225975036621094
word_tokenize(): 4.22914719581604
word_tokenize(): 4.203172445297241
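To put a regex splitter into the same kind of comparison, a sketch along these lines could be used. It runs on a small synthetic corpus instead of the DSL test file (an assumption made so the snippet needs neither network access nor NLTK), so its numbers are not comparable to the timings above.

```python
import re
import timeit

# Synthetic stand-in for the downloaded corpus (assumption).
data = "This is a foo, bar sentence.\n" * 10000
lines = data.split('\n')

# Compile once so compilation cost stays out of the timed loop.
whitespace = re.compile(r'\s+')

def with_str_split():
    for line in lines:
        line.split()

def with_re_split():
    for line in lines:
        whitespace.split(line)

# timeit.repeat returns one total time per repetition.
str_times = timeit.repeat(with_str_split, number=1, repeat=5)
re_times = timeit.repeat(with_re_split, number=1, repeat=5)

print('str.split():\t', min(str_times))
print('re.split():\t', min(re_times))
```

timeit uses time.perf_counter() and disables garbage collection while timing, which tends to give steadier figures than subtracting time.time() values by hand.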
If we try another tokenizer in bleeding edge NLTK, ported from https://github.com/jonsafari/tok-tok/blob/master/tok-tok.pl:
import time
from nltk.tokenize import ToktokTokenizer
import urllib.request
url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
response = urllib.request.urlopen(url)
data = response.read().decode('utf8')
toktok = ToktokTokenizer().tokenize
for _ in range(10):
    start = time.time()
    for line in data.split('\n'):
        toktok(line)
    print ('toktok:\t', time.time() - start)
[out]:
toktok: 1.5902607440948486
toktok: 1.5347232818603516
toktok: 1.4993178844451904
toktok: 1.5635688304901123
toktok: 1.5779635906219482
toktok: 1.8177132606506348
toktok: 1.4538452625274658
toktok: 1.5094449520111084
toktok: 1.4871931076049805
toktok: 1.4584410190582275
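(Note: the text file comes from https://github.com/Simdiva/DSL-Task)
If we look at the native perl implementation, the python and perl timings for ToktokTokenizer are comparable. Bear in mind, though, that in the python implementation the regexes are pre-compiled, while in perl they are not: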
alvas@ubi:~$ wget https://raw.githubusercontent.com/jonsafari/tok-tok/master/tok-tok.pl
--2016-02-11 20:36:36-- https://raw.githubusercontent.com/jonsafari/tok-tok/master/tok-tok.pl
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.31.17.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.31.17.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2690 (2.6K) [text/plain]
Saving to: ‘tok-tok.pl’
100%[===============================================================================================================================>] 2,690 --.-K/s in 0s
2016-02-11 20:36:36 (259 MB/s) - ‘tok-tok.pl’ saved [2690/2690]
alvas@ubi:~$ wget https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt
--2016-02-11 20:36:38-- https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.31.17.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.31.17.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3483550 (3.3M) [text/plain]
Saving to: ‘test.txt’
100%[===============================================================================================================================>] 3,483,550 363KB/s in 7.4s
2016-02-11 20:36:46 (459 KB/s) - ‘test.txt’ saved [3483550/3483550]
alvas@ubi:~$ time perl tok-tok.pl < test.txt > /tmp/null
real 0m1.703s
user 0m1.693s
sys 0m0.008s
alvas@ubi:~$ time perl tok-tok.pl < test.txt > /tmp/null
real 0m1.715s
user 0m1.704s
sys 0m0.008s
alvas@ubi:~$ time perl tok-tok.pl < test.txt > /tmp/null
real 0m1.700s
user 0m1.686s
sys 0m0.012s
alvas@ubi:~$ time perl tok-tok.pl < test.txt > /tmp/null
real 0m1.727s
user 0m1.700s
sys 0m0.024s
alvas@ubi:~$ time perl tok-tok.pl < test.txt > /tmp/null
real 0m1.734s
user 0m1.724s
sys 0m0.008s
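(Note: when timing tok-tok.pl, we had to redirect the output to a file, so the timings here include the time the machine takes to write to that file, whereas the nltk.tokenize.ToktokTokenizer timings do not include any time spent writing to a file.)
As for sent_tokenize(), it's a little different, and comparing speed benchmarks without considering accuracy is a little quirky.
Consider this:
- If a regex splits a textfile/paragraph into 1 sentence, then the speed is almost instantaneous, i.e. 0 work done. But that would be a horrible sentence tokenizer...
- If the sentences in a file are already separated by \n, then it is simply a matter of comparing str.split('\n') with re.split('\n'), which has nothing to do with sentence tokenization ;P
For information on how sent_tokenize() works in NLTK, see: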
- training data format for nltk punkt
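So to compare sent_tokenize() effectively against other regex-based methods (not str.split('\n')), one also has to evaluate accuracy, which requires a dataset of human-evaluated sentences in a tokenized format.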
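One way such an accuracy evaluation could look is sketched below. The function name sentence_scores and the choice of exact-match precision, recall and F1 over sets of sentences are assumptions for illustration; matching on sets ignores sentence order and duplicates.

```python
def sentence_scores(predicted, gold):
    """Precision, recall and F1 over exact-match sentences."""
    predicted_set = set(s.strip() for s in predicted)
    gold_set = set(s.strip() for s in gold)
    true_positives = len(predicted_set & gold_set)
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

gold = ["No!", "We are very far from that."]
# A splitter that does not split at all returns one long "sentence".
predicted = ["No! We are very far from that."]
print(sentence_scores(predicted, gold))  # (0.0, 0.0, 0.0)
```

With a score like this, a fast splitter that gets no sentence right shows up as 0.0 rather than as a speed win.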
Consider this task: https://www.hackerrank.com/challenges/from-paragraphs-to-sentences
Given the text:
In the third category he included those Brothers (the majority) who
saw nothing in Freemasonry but the external forms and ceremonies, and
prized the strict performance of these forms without troubling about
their purport or significance. Such were Willarski and even the Grand
Master of the principal lodge. Finally, to the fourth category also a
great many Brothers belonged, particularly those who had lately
joined. These according to Pierre's observations were men who had no
belief in anything, nor desire for anything, but joined the Freemasons
merely to associate with the wealthy young Brothers who were
influential through their connections or rank, and of whom there were
very many in the lodge.Pierre began to feel dissatisfied with what he
was doing. Freemasonry, at any rate as he saw it here, sometimes
seemed to him based merely on externals. He did not think of doubting
Freemasonry itself, but suspected that Russian Masonry had taken a
wrong path and deviated from its original principles. And so toward
the end of the year he went abroad to be initiated into the higher
secrets of the order.What is to be done in these circumstances? To
favor revolutions, overthrow everything, repel force by force?No! We
are very far from that. Every violent reform deserves censure, for it
quite fails to remedy evil while men remain what they are, and also
because wisdom needs no violence. "But what is there in running across
it like that?" said Ilagin's groom. "Once she had missed it and turned
it away, any mongrel could take it," Ilagin was saying at the same
time, breathless from his gallop and his excitement.
We want to get this:
In the third category he included those Brothers (the majority) who saw nothing in Freemasonry but the external forms and ceremonies, and prized the strict performance of these forms without troubling about their purport or significance.
Such were Willarski and even the Grand Master of the principal lodge.
Finally, to the fourth category also a great many Brothers belonged, particularly those who had lately joined.
These according to Pierre's observations were men who had no belief in anything, nor desire for anything, but joined the Freemasons merely to associate with the wealthy young Brothers who were influential through their connections or rank, and of whom there were very many in the lodge.
Pierre began to feel dissatisfied with what he was doing.
Freemasonry, at any rate as he saw it here, sometimes seemed to him based merely on externals.
He did not think of doubting Freemasonry itself, but suspected that Russian Masonry had taken a wrong path and deviated from its original principles.
And so toward the end of the year he went abroad to be initiated into the higher secrets of the order.
What is to be done in these circumstances?
To favor revolutions, overthrow everything, repel force by force?
No!
We are very far from that.
Every violent reform deserves censure, for it quite fails to remedy evil while men remain what they are, and also because wisdom needs no violence.
"But what is there in running across it like that?" said Ilagin's groom.
"Once she had missed it and turned it away, any mongrel could take it," Ilagin was saying at the same time, breathless from his gallop and his excitement.
So simply doing str.split('\n') will give you nothing. Even without considering the order of the sentences, you get 0 positive results:
>>> text = """In the third category he included those Brothers (the majority) who saw nothing in Freemasonry but the external forms and ceremonies, and prized the strict performance of these forms without troubling about their purport or significance. Such were Willarski and even the Grand Master of the principal lodge. Finally, to the fourth category also a great many Brothers belonged, particularly those who had lately joined. These according to Pierre's observations were men who had no belief in anything, nor desire for anything, but joined the Freemasons merely to associate with the wealthy young Brothers who were influential through their connections or rank, and of whom there were very many in the lodge.Pierre began to feel dissatisfied with what he was doing. Freemasonry, at any rate as he saw it here, sometimes seemed to him based merely on externals. He did not think of doubting Freemasonry itself, but suspected that Russian Masonry had taken a wrong path and deviated from its original principles. And so toward the end of the year he went abroad to be initiated into the higher secrets of the order.What is to be done in these circumstances? To favor revolutions, overthrow everything, repel force by force?No! We are very far from that. Every violent reform deserves censure, for it quite fails to remedy evil while men remain what they are, and also because wisdom needs no violence. "But what is there in running across it like that?" said Ilagin's groom. "Once she had missed it and turned it away, any mongrel could take it," Ilagin was saying at the same time, breathless from his gallop and his excitement. """
>>> answer = """In the third category he included those Brothers (the majority) who saw nothing in Freemasonry but the external forms and ceremonies, and prized the strict performance of these forms without troubling about their purport or significance.
... Such were Willarski and even the Grand Master of the principal lodge.
... Finally, to the fourth category also a great many Brothers belonged, particularly those who had lately joined.
... These according to Pierre's observations were men who had no belief in anything, nor desire for anything, but joined the Freemasons merely to associate with the wealthy young Brothers who were influential through their connections or rank, and of whom there were very many in the lodge.
... Pierre began to feel dissatisfied with what he was doing.
... Freemasonry, at any rate as he saw it here, sometimes seemed to him based merely on externals.
... He did not think of doubting Freemasonry itself, but suspected that Russian Masonry had taken a wrong path and deviated from its original principles.
... And so toward the end of the year he went abroad to be initiated into the higher secrets of the order.
... What is to be done in these circumstances?
... To favor revolutions, overthrow everything, repel force by force?
... No!
... We are very far from that.
... Every violent reform deserves censure, for it quite fails to remedy evil while men remain what they are, and also because wisdom needs no violence.
... "But what is there in running across it like that?" said Ilagin's groom.
... "Once she had missed it and turned it away, any mongrel could take it," Ilagin was saying at the same time, breathless from his gallop and his excitement."""
>>>
>>> output = text.split('\n')
>>> sum(1 for sent in text.split('\n') if sent in answer)
0
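By contrast, a regex that looks for sentence-final punctuation gets at least some of these boundaries. The rule below is a deliberately naive, hypothetical one (the names BOUNDARY_RE and regex_sent_split are not from any library), shown on a short excerpt of the text above; splitting on a zero-width match requires Python 3.7 or later.

```python
import re

# Hypothetical rule: a boundary follows ., ! or ? when the next
# non-space character is an uppercase ASCII letter.
BOUNDARY_RE = re.compile(r'(?<=[.!?])\s*(?=[A-Z])')

def regex_sent_split(text):
    """Split text at the naive boundaries, dropping empty pieces."""
    return [s for s in BOUNDARY_RE.split(text) if s]

excerpt = ("To favor revolutions, overthrow everything, repel force "
           "by force?No! We are very far from that.")
for sent in regex_sent_split(excerpt):
    print(sent)
# To favor revolutions, overthrow everything, repel force by force?
# No!
# We are very far from that.
```

The same rule misses a boundary before an opening quote (as in `violence. "But`) and splits wrongly after abbreviations such as "Mr. Smith", which is why speed alone says little about a sentence tokenizer.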