Python NLTK - Tokenize sentences into words while removing numbers
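Hoping someone can help with this! I have a list of sentences read from a text file. I am trying to tokenize the sentences into words while also removing the sentences that contain only numbers. The numbers do not appear at any regular interval.
The sentences I have: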
[
[' 1'],
['This is a text file,'],
['to keep the words,'],
[' 2'],
['Another line of the text:'],
[' 3']
]
Expected output:
[
['This', 'is', 'a', 'text', 'file,'],
['to', 'keep', 'the', 'words,'],
['Another', 'line', 'of', 'the', 'text:'],
]
After a little preprocessing, you can apply the tokenization:
import re
a = [
[' 1'],
['This is a text file,'],
['to keep the words,'],
[' 2'],
['Another line of the text:'],
[' 3']
]
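def replace_digit(string):
    # Remove every digit, then trim surrounding whitespace.
    # Number-only entries such as ' 1' become empty strings.
    # Note: digits inside ordinary sentences are removed as well.
    return re.sub(r'\d', '', string).strip()

# Each entry of `a` is a one-element list, so take its first item
process = [replace_digit(i[0]) for i in a]
# Drop the empty strings left behind by number-only entries
filtered = filter(lambda x: x, process)
# Split each remaining sentence on whitespace
tokenize = map(lambda x: x.split(), filtered)
print(list(tokenize))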
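If sentences that mix words and numbers should keep their digits, one possible variant is to skip only the entries that consist solely of digits instead of deleting digits everywhere. The sketch below is an illustration rather than part of the original answer: the function name `tokenize_sentences`, the variable `rows`, and the sample sentence 'Room 101 is open' are all hypothetical. It assumes each entry is a one-element list, as in the question.

```python
def tokenize_sentences(rows):
    """Split each sentence on whitespace, skipping number-only entries."""
    result = []
    for row in rows:
        text = row[0].strip()
        # Skip empty entries and entries made up solely of digits, e.g. ' 1'
        if not text or text.isdigit():
            continue
        result.append(text.split())
    return result


# Hypothetical sample data: same shape as the question, plus one mixed sentence
rows = [
    [' 1'],
    ['This is a text file,'],
    ['to keep the words,'],
    [' 2'],
    ['Room 101 is open'],
    [' 3'],
]
print(tokenize_sentences(rows))
```

Splitting with `str.split()` keeps punctuation attached to the words ('file,'), which matches the expected output in the question. NLTK's `word_tokenize` could be swapped in for `text.split()`, but it would emit the punctuation as separate tokens.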