Tokenize the URL
I have a sentence (a log line) that looks like this: ['GET http://10.0.0.0:1000/ HTTP/X.X']
I want to have it in this form:
['GET', 'http://10.0.0.0:1000/', 'HTTP/X.X']
but that is not what I get. I used this code:
import re
import nltk

sentences = ['GET http://10.0.0.0:1000/ HTTP/X.X']
rx = re.compile(r'\b(\w{3})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})\s+(\w+)\W+(\d{1,3}(?:\.\d{1,3}){3})(?:\s+\S+){2}\s+\[([^][\s]+)\s+([+\d]+)]\s+"([A-Z]+)\s+(\S+)\s+(\S+)"\s+(\d+)\s+(\d+)\s+\S+\s+"([^"]*)')
words = []
for sent in sentences:
    m = rx.search(sent)
    if m:
        words.append(list(m.groups()))
    else:
        words.append(nltk.word_tokenize(sent))
print(words)
I get this output:
[['GET', 'http', ':', '//10.0.0.0:1000/', 'HTTP/X.X']]
Does anyone know where the error is, or why it is not working the way I want?
Thanks
sentences = ['GET http://10.0.0.0:1000/ HTTP/X.X']
words = []
for sent in sentences:
    words.append(sent.split(' '))
print(words)
Can you just split on spaces? I think nltk.word_tokenize is giving you the wrong output!
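One caveat (an assumption on my part, not something stated in the question): if the log fields can be separated by runs of spaces, `split(' ')` produces empty strings between consecutive separators, while `split()` with no argument treats any whitespace run as a single separator:

```python
sentences = ['GET  http://10.0.0.0:1000/   HTTP/X.X']  # irregular spacing

# split(' ') would yield empty strings here; split() with no
# argument collapses each run of whitespace into one separator
words = [sent.split() for sent in sentences]
print(words)  # [['GET', 'http://10.0.0.0:1000/', 'HTTP/X.X']]
```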
It looks like you want to split it on spaces. So:
sentences = ['GET http://10.0.0.0:1000/ HTTP/X.X']
words = [x.split(" ") for x in sentences]
print(words)
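If a regex approach is still preferred, the problem in the question is that the original pattern expects a full Apache-style log line (month, timestamp, IP address, status codes, and so on), so it never matches this three-field request line; the code then falls back to nltk.word_tokenize, which splits the URL at its punctuation. A minimal sketch of a pattern tailored to just the request line (the pattern is my assumption and is not tested against other log formats):

```python
import re

# Three fields: METHOD, URL, protocol; \S+ keeps the URL in one piece
rx = re.compile(r'([A-Z]+)\s+(\S+)\s+(HTTP/\S+)')

sentences = ['GET http://10.0.0.0:1000/ HTTP/X.X']
words = [list(m.groups()) for m in map(rx.search, sentences) if m]
print(words)  # [['GET', 'http://10.0.0.0:1000/', 'HTTP/X.X']]
```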