Tokenize sentence into words python
I want to extract information from different sentences, so I am using nltk to split each sentence into words with this code:
import nltk

words = []
for i in range(len(sentences)):
    words.append(nltk.word_tokenize(sentences[i]))
words
It works fine, but I want something different. For example, I have this sentence:
'[\'Jan 31 19:28:14 nginx: 10.0.0.0 - - [31/Jan/2019:19:28:14 +0100] "POST /test/itf/ HTTP/x.x" 404 146 "-" "Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)"\']'
我希望 "Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)"
是一个词,而不是分成几个词。
Update:
I want something like this:
[
'Jan',
'31',
'19:28:14',
'nginx',
'10.0.0.0',
'31/Jan/2019:19:28:14',
'+0100',
'POST',
'/test/itf/',
'HTTP/x.x',
'404',
'146',
'Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)']
Any ideas on how to achieve this?!
Thanks in advance.
First you need to choose whether to use " or ', because mixing both is unusual and can cause odd behavior. After that it is just string manipulation:
s='"[\"Jan 31 19:28:14 nginx: 10.0.0.0 - - [31/Jan/2019:19:28:14 +0100] "POST /test/itf/ HTTP/x.x" 404 146 "-" "Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)"\"]" i want "Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)"'
words = s.split(' ') # split the sentence on spaces
# ['"["Jan', '31', '19:28:14', 'nginx:', '10.0.0.0', '-', '-', '[31/Jan/2019:19:28:14', '+0100]', '"POST', '/test/itf/', 'HTTP/x.x"', '404', '146', '"-"', '"Mozilla/5.2', '[en]', '(X11,', 'U;', 'OpenVAS-XX', '9.2.7)""]"', 'i', 'want', '"Mozilla/5.2', '[en]', '(X11,', 'U;', 'OpenVAS-XX', '9.2.7)"']
# then access your data list
words[0] # '"["Jan'
words[1] # '31'
words[2] # '19:28:14'
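If the goal is simply to keep the double-quoted fields together, the standard library's shlex module gets close, since shlex.split() honors shell-style quoting. This is a minimal sketch of my own rather than part of the answer above; note that it also keeps the quoted request line as a single token, which differs slightly from the desired output:

import shlex

s = 'Jan 31 19:28:14 nginx: 10.0.0.0 - - [31/Jan/2019:19:28:14 +0100] "POST /test/itf/ HTTP/x.x" 404 146 "-" "Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)"'
tokens = shlex.split(s)  # whitespace split that keeps "..." groups intact
tokens[-1]  # 'Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)'
tokens[9]   # 'POST /test/itf/ HTTP/x.x' (the quoted request also stays whole)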
You can do this using partition() with a space delimiter, regular expressions, and recursion, as shown below. I have to say, though, that this solution is strictly tied to the string format you provided.
import re

s_list = []

def str_partition(text):
    # split off the first space-delimited token
    parts = text.partition(" ")
    # strip brackets, quotes, and dashes from the token
    part = re.sub(r'[\[\]"\'\-]', '', parts[0])
    if part.startswith("nginx"):
        s_list.append(part.replace(":", ''))
    elif part != "":
        s_list.append(part)
    if not parts[2].startswith('"Moz'):
        # keep consuming tokens until the quoted user agent is reached
        str_partition(parts[2])
    else:
        # the remainder is the user agent: drop quotes and the trailing ]
        part = re.sub(r'["\']', '', parts[2])
        part = part[:-1]
        s_list.append(part)
    return
s = '[\'Jan 31 19:28:14 nginx: 10.0.0.0 - - [31/Jan/2019:19:28:14 +0100] "POST /test/itf/ HTTP/x.x" 404 146 "-" "Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)"\']'
str_partition(s)
print(s_list)
Output:
['Jan', '31', '19:28:14', 'nginx', '10.0.0.0', '31/Jan/2019:19:28:14', '+0100',
'POST', '/test/itf/', 'HTTP/x.x', '404', '146', 'Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)']
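One caveat: the recursion above only terminates because the '"Moz' check eventually matches; on a line that never contains it, str_partition() ends up calling itself on an empty string forever and raises RecursionError. A minimal iterative sketch of the same idea (my rewrite, under the same fixed-format assumption) avoids that:

import re

def split_log(text):
    tokens = []
    while text:
        head, _, text = text.partition(" ")
        head = re.sub(r'[\[\]"\'\-]', '', head)  # same cleanup as above
        if head.startswith("nginx"):
            tokens.append(head.replace(":", ""))
        elif head:
            tokens.append(head)
        if text.startswith('"Moz'):
            # the remainder is the user agent: drop quotes and the trailing ]
            tokens.append(re.sub(r'["\']', '', text)[:-1])
            break
    return tokens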
You can import re and parse the log line (which is not a natural-language sentence) with a regular expression:
import re
import nltk

sentences = ['[\'Jan 31 19:28:14 nginx: 10.0.0.0 - - [31/Jan/2019:19:28:14 +0100] "POST /test/itf/ HTTP/x.x" 404 146 "-" "Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)"\']']
rx = re.compile(r'\b(\w{3})\s+(\d{1,2})\s+(\d{1,2}:\d{1,2}:\d{2})\s+(\w+)\W+(\d{1,3}(?:\.\d{1,3}){3})(?:\s+\S+){2}\s+\[([^][\s]+)\s+([+\d]+)]\s+"([A-Z]+)\s+(\S+)\s+(\S+)"\s+(\d+)\s+(\d+)\s+\S+\s+"([^"]*)"')
words = []
for sent in sentences:
    m = rx.search(sent)
    if m:
        # the line matches the log format: take the capture groups as tokens
        words.append(list(m.groups()))
    else:
        # fall back to plain word tokenization for ordinary sentences
        words.append(nltk.word_tokenize(sent))
print(words)
See the Python demo.
The output will look like
[['Jan', '31', '19:28:14', 'nginx', '10.0.0.0', '31/Jan/2019:19:28:14', '+0100', 'POST', '/test/itf/', 'HTTP/x.x', '404', '146', 'Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)']]
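As a side note (my addition, not part of the original answer), the same pattern can be rewritten with named groups, which makes its thirteen captures easier to read and maintain; the group names below are invented for illustration:

import re

# reusing `sentences` from the snippet above
rx = re.compile(
    r'\b(?P<month>\w{3})\s+(?P<day>\d{1,2})\s+(?P<time>\d{1,2}:\d{1,2}:\d{2})\s+'
    r'(?P<proc>\w+)\W+(?P<ip>\d{1,3}(?:\.\d{1,3}){3})(?:\s+\S+){2}\s+'
    r'\[(?P<ts>[^][\s]+)\s+(?P<tz>[+\d]+)]\s+"(?P<method>[A-Z]+)\s+'
    r'(?P<path>\S+)\s+(?P<proto>\S+)"\s+(?P<status>\d+)\s+(?P<size>\d+)\s+'
    r'\S+\s+"(?P<agent>[^"]*)"')
m = rx.search(sentences[0])
if m:
    print(m.group('agent'))  # 'Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)'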