Extracting words/phrase followed by a phrase

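I have a text file containing a list of phrases. Here is what the file looks like:

File name: KP.txt

From the input below (a paragraph), I want to extract the next two words that follow a phrase from KP.txt (the phrase can be any of the entries in the KP.txt file shown above). I only need the next 2 words.

Input: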

This is Lee. Thanks for contacting me. I wanted to know the exchange policy at Noriaqer hardware services.

In the example above, the phrase "I wanted to know" matches an entry in KP.txt. So if I extract the next 2 words after it, my output would be "exchange policy".

How can I extract this in Python?

You can use this:

with open("KP.txt") as fobj:
    # lower-case each phrase and skip blank lines (an empty phrase would make split() raise ValueError)
    phrases = [line.strip().lower() for line in fobj if line.strip()]

paragraph = input("Enter The Whole Paragraph in one line:\t").lower()

for phrase in phrases:
    if phrase in paragraph:
        temp = paragraph.split(phrase)[1:]
        for clause in temp:
            print(" ".join(clause.split()[:2]))

I think natural language processing might be a better solution, but this code should help :)

def search_in_text(kp,text):
    for line in kp:
        #if a search phrase found in kp lines
        if line in text:
            #the starting index of the two words
            i1=text.find(line)+len(line)
            #the end index of the following two words (first index+50 at maximum)
            i2=(i1+50) if len(text)>(i1+50) else len(text)
            #split the following text to words (next_words) and remove empty spaces
            next_words=[word for word in text[i1:i2].split(' ') if word!='']
            #return  only the next two words from (next_words)
            return next_words[0:2]        
    return [] # return empty list if no phrase matching
        
#read your kp file as a list of lines, skipping blank ones (an empty string is "in" every text)
with open("KP.txt") as f:
    kp=[line.strip() for line in f if line.strip()]
#input 1 
text = 'This is Lee. Thanks for contacting me. I wanted to know exchange policy at Noriaqer hardware services.'
print('input ->>',text)
output = search_in_text(kp,text)
print('output ->>',output)
input ->> This is Lee. Thanks for contacting me. I wanted to know exchange policy at Noriaqer hardware services.
output ->> ['exchange', 'policy']
#input 2
text = 'Boss was very angry and said: I wish to know why you are late?'
print('input ->>',text)
output = search_in_text(kp,text)
print('output ->>',output)
input ->> Boss was very angry and said: I wish to know why you are late?
output ->> ['why', 'you']

Assuming you already know how to read the input file into a list, this can be done with the help of a regular expression.

>>> import re
>>> wordlist = ['I would like to understand', 'I wanted to know', 'I wish to know', 'I am interested to know']
>>> input_text = 'This is Lee. Thanks for contacting me. I wanted to know exchange policy at Noriaqer hardware services.'
>>> def word_extraction (input_text, wordlist):
...     for word in wordlist:
...         if word in input_text:
...             # re.escape keeps any regex metacharacters in the phrase literal
...             output = re.search (r'(?<=%s)(.\w*){2}' % re.escape(word), input_text)
...             # there is no match when the phrase is the very last thing in the text
...             if output:
...                 print (output.group ().lstrip ())
...
>>> word_extraction(input_text, wordlist)
exchange policy
>>> input_text = 'This is Lee. Thanks for contacting me. I wish to know where is Noriaqer hardware.'
>>> word_extraction(input_text, wordlist)
where is
>>> input_text = 'This is Lee. Thanks for contacting me. I\'d like to know where is Noriaqer hardware.'
>>> word_extraction(input_text, wordlist)

>>>
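  1. First we check whether one of our phrases is in the sentence. This is not the most efficient approach if your list is large, but it works for now.
  2. Next, if the sentence contains a phrase from our "dictionary" of phrases, we use a regular expression to extract the words we want.
  3. Finally, we strip the leading whitespace in front of the target words.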
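
If the phrase list is large, one way to avoid looping over it for every input is to compile all phrases into a single alternation pattern. The following is only a sketch under some assumptions: the helper names `build_pattern` and `next_two_words` are made up for illustration, matching is case-insensitive, and a "word" is taken to be a run of `\w` characters, so punctuation between words is skipped rather than returned.

```python
import re

def build_pattern(phrases):
    """Compile one regex that matches any phrase and captures the next two words."""
    # Drop blank entries; try longer phrases first so they win over shorter prefixes.
    ordered = sorted((p.strip() for p in phrases if p.strip()), key=len, reverse=True)
    if not ordered:
        raise ValueError("phrase list is empty")
    alternation = "|".join(re.escape(p) for p in ordered)
    # \W+ skips whitespace/punctuation between words; each (\w+) captures one word.
    return re.compile(r"(?:%s)\W+(\w+)\W+(\w+)" % alternation, re.IGNORECASE)

def next_two_words(text, pattern):
    """Return one 'word1 word2' string per phrase occurrence found in text."""
    return [" ".join(m.groups()) for m in pattern.finditer(text)]

phrases = ['I would like to understand', 'I wanted to know',
           'I wish to know', 'I am interested to know']
pattern = build_pattern(phrases)
text = ('This is Lee. Thanks for contacting me. '
        'I wanted to know exchange policy at Noriaqer hardware services.')
print(next_two_words(text, pattern))  # ['exchange policy']
```

Because `\W+` also matches sentence punctuation, this sketch will happily read across a full stop; tighten the separator to `\s+` if that is not wanted.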

Regex notes:

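  • (?<=%s) is a lookbehind assertion: the match must be immediately preceded by the phrase substituted for %s (for example "I wanted to know").
  • (.\w*){2} matches any single character followed by zero or more word characters, and does so twice, so the match stops 2 words after the key phrase.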
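
To see what these two pieces actually match, here is a small experiment. It is a sketch only: `tokens_after` is a hypothetical helper, and the second and third input sentences are invented to show edge cases. Since `.` matches any character, a comma counts as the start of a "word", and there is no match at all when nothing follows the phrase.

```python
import re

phrase = "I wanted to know"
# re.escape keeps the lookbehind literal (and fixed-width, as Python's re requires).
pattern = re.compile(r"(?<=%s)(.\w*){2}" % re.escape(phrase))

def tokens_after(text):
    """Return the raw text matched after the phrase, or None if there is no match."""
    m = pattern.search(text)
    return m.group() if m else None

# Normal case: a space plus a word, twice.
print(repr(tokens_after("I wanted to know exchange policy at Noriaqer.")))  # ' exchange policy'

# Punctuation case: the comma is consumed by the first '.', with an empty \w*.
print(repr(tokens_after("I wanted to know, please, the price.")))  # ', please'

# Phrase at the very end: nothing for '.' to match, so search() returns None.
print(repr(tokens_after("I wanted to know")))  # None
```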