How to check if a word is in a string without using multiple loops

Question

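The goal of this program is to find an example sentence for each word in ner.txt. For example, if the word apple is in ner.txt, I want to check whether any sentence contains the word apple and output something like apple: you should buy an apple juice.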

The logic of the code is simple, since I only need one example sentence for each word in ner.txt. I am using NLTK to determine what counts as a sentence.

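The problem is at the bottom of the code. I am using two for loops to find an example sentence for each word. This is very slow and does not scale to large files. How can I make this efficient? Or is there a better approach that does not follow my logic?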

from nltk.tokenize import sent_tokenize

news_articles = "test.txt"
oov_ner = "ner.txt"

news_data = ""
with open(news_articles, "r") as inFile:
    news_data = inFile.read()

base_news = sent_tokenize(news_data)

with open(oov_ner, "r") as oovNER:
    oov_ner_content = oovNER.readlines()

oov_ner_data = [x.strip() for x in oov_ner_content]

my_dict = {}

for oovner in oov_ner_data:
    for news in base_news:
        if oovner in news:
            my_dict[oovner] = news
            print(my_dict)

Answer 1

Here is what I would do: split the process into two steps, index creation and lookup.

from nltk.tokenize import sent_tokenize, word_tokenize

# 1. create a reusable word index like {'worda': [2, 4, 10], 'wordb': [1, 9]}
with open("test.txt", "r", encoding="utf8") as fp:
    news_sentences = sent_tokenize(fp.read())

index = {}
for i, sentence in enumerate(news_sentences):
    for word in word_tokenize(sentence):
        word = word.lower()
        if word not in index:
            index[word] = []
        index[word].append(i)

# 2. look up words from that index and retrieve the associated sentences
with open("ner.txt", "r", encoding="utf8") as fp:
    oov_ner_data = [l.strip() for l in fp.readlines()]

matches = {}

for word in oov_ner_data:
    word = word.lower()
    if word in index:
        matches[word] = [news_sentences[i] for i in index[word]]

print(matches)

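Step 1 takes as long as it takes to run sent_tokenize() and word_tokenize() on your text. There is not much you can do about that. But you only need to do it once, and after that you can run different word lists against the index very quickly.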
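
To illustrate the "build once, look up many times" point, here is a minimal sketch that wraps the two steps in functions. The names tokenize, build_index and first_examples are hypothetical, and re.findall is only a stand-in for word_tokenize() so the snippet runs without NLTK data; it is an assumption, not the code above.

```python
import re
from collections import defaultdict

def tokenize(sentence):
    # Stand-in for nltk.word_tokenize (assumption): lowercased runs of word characters.
    return re.findall(r"\w+", sentence.lower())

def build_index(sentences):
    """Map each lowercased token to the positions of the sentences that contain it."""
    index = defaultdict(list)
    for i, sentence in enumerate(sentences):
        # set() keeps a sentence from being listed twice for a repeated word
        for token in set(tokenize(sentence)):
            index[token].append(i)
    return index

def first_examples(words, index, sentences):
    """Return one example sentence for every word that is present in the index."""
    return {w: sentences[index[w.lower()][0]] for w in words if w.lower() in index}

sentences = [
    "You should buy an apple juice.",
    "We embark at dawn.",
    "Apple pie is great.",
]
index = build_index(sentences)  # the slow step, done once

# Different word lists can now reuse the same index.
print(first_examples(["apple", "bark"], index, sentences))
print(first_examples(["dawn"], index, sentences))
```

Each lookup is a dictionary access, so the cost per word no longer depends on the number of sentences.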

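The advantage of running both sent_tokenize() and word_tokenize() is that it prevents false positives caused by partial matches. For example, if a sentence contains "embark", your solution would report a match for "bark", while mine would not. In other words, a faster solution that produces wrong results is not an improvement.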
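
The partial-match problem can be seen in a few lines. This is a small sketch; re.findall again stands in for word_tokenize() (an assumption, so that it runs without NLTK data), and the sentence is made up for illustration.

```python
import re

sentence = "We embark at dawn."

# Substring test, as in the question's code: "bark" is found inside "embark".
substring_hit = "bark" in sentence

# Token test: compare against whole words only.
# re.findall is a stand-in for nltk.word_tokenize (assumption).
tokens = set(re.findall(r"\w+", sentence.lower()))
token_hit = "bark" in tokens

print(substring_hit, token_hit)  # True False
```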

Answer 2

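Instead of taking one word at a time in the outer for loop, as you do now, I would swap the loops and break as soon as a word matching the sentence is found. This saves some time, because right now you take 'oovner' and try to match it against every sentence 'news' in 'base_news'. If you swap the loops, you can leave as soon as a match is found.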

This:

for oovner in oov_ner_data:
    for news in base_news:
        if oovner in news:
            my_dict[oovner] = news
            print(my_dict)

becomes this:

for news in base_news:
    for oovner in oov_ner_data:
        if oovner in news:
            my_dict[oovner] = news
            print(my_dict)
            break

I wouldn't say it is optimal, but it should speed things up.
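
Note that the break above stops at the first matching word in each sentence, so other words in that same sentence are not checked against it. One possible variant, sketched here under my own assumptions and not taken from the code above, tracks which words still need an example and stops once every word has one. The function name one_example_per_word and the sample data are hypothetical.

```python
def one_example_per_word(words, sentences):
    """Collect the first sentence containing each word; stop once all words are found."""
    remaining = set(words)
    examples = {}
    for sentence in sentences:
        if not remaining:
            break  # every word already has an example sentence
        # Substring match, as in the question's code (partial matches still count).
        found = {w for w in remaining if w in sentence}
        for w in found:
            examples[w] = sentence
        remaining -= found
    return examples

base_news = [
    "You should buy an apple juice.",
    "Pears and apples are fruit.",
    "Nothing relevant here.",
]
oov_ner_data = ["apple", "Pears"]
print(one_example_per_word(oov_ner_data, base_news))
```

This still has two nested loops, but the inner one shrinks as words are found and the outer one can end before the last sentence.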

python

nlp

nltk

python-3.x