Best way to understand the input text before applying ngram
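I am currently reading text from an Excel file and applying bigrams to it. finalList, used in the sample code below, is the list of input words read from the input Excel file.
Stop words were removed from the input with the help of the following library: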
from nltk.corpus import stopwords
Bigram logic applied to the list of input words:
bigram=ngrams(finalList ,2)
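Input text: I completed my end-to-end process.
Current output: completed end, end end, end process.
Desired output: completed end-to-end, end-to-end process.
This means that a group of words such as (end-to-end) should be treated as one word.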
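One plausible explanation for the current output, assuming the tokenizer splits on hyphens before stop word removal (the question does not show that step): "end-to-end" becomes "end", "to", "end", and "to" is then dropped as a stop word. A minimal sketch of that assumption, using a tiny hypothetical stop word set in place of the NLTK list:

```python
import re

# Hypothetical stand-in for the NLTK English stop word list; illustration only.
STOP_WORDS = {"i", "my", "to"}

text = "I completed my end-to-end process."

# Assumed behaviour: keeping only runs of letters breaks the compound apart.
tokens = re.findall(r"[A-Za-z]+", text)

# Stop word removal then drops "I", "my" and the "to" from "end-to-end".
final_list = [t for t in tokens if t.lower() not in STOP_WORDS]

# Pair each word with its successor to form bigrams.
bigrams = list(zip(final_list, final_list[1:]))
print(bigrams)
```

If this is what happens, the fix belongs in the tokenization step rather than in the bigram step.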
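To solve your problem, clean the text with a regular expression that strips punctuation but leaves the hyphen alone, so that end-to-end survives as a single token. See this example: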
import re
text = 'I completed my end-to-end process..:?'
# Remove runs of '.', ':' and '?'. The hyphen is not in the character class, so 'end-to-end' stays intact.
pattern = re.compile(r"[.:?]+")
new_text = re.sub(pattern, '', text)
print(new_text)
# Output: I completed my end-to-end process
# Now you can generate bigrams manually.
# 1. Tokenize the new text
tok = new_text.split()
print(tok)  # If the token list is huge, print only the first five, like this: print(tok[:5])
# Output: ['I', 'completed', 'my', 'end-to-end', 'process']
# 2. Loop over the list and generate bigrams, store them in a list called bigrams
bigrams = []
for i in range(len(tok) - 1):  # -1 to avoid an index error on tok[i + 1]
    bigram = tok[i] + ' ' + tok[i + 1]
    bigrams.append(bigram)
# 3. Print your bigrams
for bi in bigrams:
    print(bi, end=', ')
# Output: I completed, completed my, my end-to-end, end-to-end process,
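The same idea can be combined with stop word removal to reach the desired output from the question. The following is only a sketch under stated assumptions: the token pattern, the small STOP_WORDS set and the variable names are illustrative and not part of the original code. In practice the set would come from the NLTK stop word list that the question already imports.

```python
import re

# Hypothetical stand-in for the NLTK English stop word list; illustration only.
STOP_WORDS = {"i", "my", "to"}

text = "I completed my end-to-end process."

# A token is a run of letters or digits, optionally continued by
# hyphen-joined runs, so "end-to-end" is matched as one token.
token_pattern = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")
tokens = token_pattern.findall(text)

# Filter whole tokens only: "to" inside "end-to-end" is never seen on its own.
final_list = [t for t in tokens if t.lower() not in STOP_WORDS]

# Pair each token with its successor to form bigrams.
bigrams = list(zip(final_list, final_list[1:]))
print(", ".join(" ".join(pair) for pair in bigrams))
```

Because the hyphenated compound is kept as one token before filtering, the stop word check cannot remove its inner "to". Passing final_list to the ngrams function used in the question should yield the same pairs as tuples.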
Hope this helps!