spaCy: optimizing tokenization
I'm currently trying to tokenize a text file where each line is the body of a tweet:
"According to data reported to FINRA, short volume percent for $SALT clocked in at 39.19% on 12-29-17 http://www.volumebot.com/?s=SALT"
"@Good2go @krueb The chart I posted definitely supports ng going lower. Gobstopper' 2.12, might even be conservative."
"@Crypt0Fortune Its not dumping as bad as it used to...."
"$XVG.X LOL. Someone just triggered a cascade of stop-loss orders and scooped up morons' coins. Oldest trick in the stock trader's book."
The file is 59,397 lines long (one day's worth of data) and I'm using spaCy for pre-processing/tokenization. It currently takes me around 8.5 minutes, and I was wondering if there's any way of optimising the following code to speed it up, as 8.5 minutes seems awfully long for this process:
import time
from datetime import timedelta
from os import listdir
from os.path import isfile, join

import spacy

nlp = spacy.load('en')  # full pipeline loaded (no components disabled)

def token_loop(path):
    store = []
    files = [f for f in listdir(path) if isfile(join(path, f))]
    start_time = time.monotonic()
    for filename in files:
        with open("./data/" + filename) as f:
            for line in f:
                tokens = nlp(line.lower())
                tokens = [token.lemma_ for token in tokens if not token.orth_.isspace()
                          and token.is_alpha and not token.is_stop and len(token.orth_) != 1]
                store.append(tokens)
    end_time = time.monotonic()
    print("Time taken to tokenize:", timedelta(seconds=end_time - start_time))
    return store
Although it says files, it currently only loops over one file.
Just to note, I only need this to tokenize the contents; I don't need any extra tagging etc.
It sounds like you haven't optimised the pipeline yet. You'll get a significant speedup from disabling the pipeline components you don't need, like so:
nlp = spacy.load('en', disable=['parser', 'tagger', 'ner'])
That alone should get you down to around the two-minute mark, or better.
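If you want to confirm that the disable list actually took effect, you can inspect the loaded pipeline components; a minimal sketch, assuming spaCy v2 and the 'en' shortcut model is installed:

import spacy

nlp = spacy.load('en', disable=['parser', 'tagger', 'ner'])
print(nlp.pipe_names)  # expect [] -- the tokenizer itself is not a pipeline component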
If you need a further speedup, look into multithreading with nlp.pipe. The documentation for multithreading is here:
https://spacy.io/usage/processing-pipelines#section-multithreading
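Putting both suggestions together, here's a rough sketch of what the loop might look like with parser, tagger and NER disabled and the lines streamed through nlp.pipe. The batch_size value is only illustrative, and join(path, filename) is used in place of the hard-coded "./data/" prefix:

import time
from datetime import timedelta
from os import listdir
from os.path import isfile, join

import spacy

# Load only what tokenization/lemmatization needs.
nlp = spacy.load('en', disable=['parser', 'tagger', 'ner'])

def token_loop(path):
    store = []
    files = [f for f in listdir(path) if isfile(join(path, f))]
    start_time = time.monotonic()
    for filename in files:
        with open(join(path, filename)) as f:
            lines = [line.lower() for line in f]
        # Stream the lines through the pipeline in batches instead of
        # calling nlp() once per line.
        for doc in nlp.pipe(lines, batch_size=1000):
            tokens = [token.lemma_ for token in doc if not token.orth_.isspace()
                      and token.is_alpha and not token.is_stop and len(token.orth_) != 1]
            store.append(tokens)
    end_time = time.monotonic()
    print("Time taken to tokenize:", timedelta(seconds=end_time - start_time))
    return store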
You can use nlp.pipe(all_lines) instead of nlp(line) for faster processing.
See spaCy's documentation: https://spacy.io/usage/processing-pipelines
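The practical difference is that nlp(line) processes one string and returns a single Doc, while nlp.pipe(all_lines) accepts an iterable of strings and yields Docs lazily, batching the work internally. A tiny sketch with made-up texts, assuming nlp has been loaded as above:

import spacy

nlp = spacy.load('en', disable=['parser', 'tagger', 'ner'])

doc = nlp("one tweet at a time")              # a single Doc
docs = nlp.pipe(["tweet one", "tweet two"])   # a generator of Docs
for doc in docs:
    print([token.lemma_ for token in doc])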