Tokenize the url

I have a sentence (a log line) that looks like this: ['GET http://10.0.0.0:1000/ HTTP/X.X'] and I want to have it in this form:

['GET', 'http://10.0.0.0:1000/', 'HTTP/X.X'] 

but that is not what I get. I used this code:

import re
import nltk  # needed for the word_tokenize fallback below

sentences = ['GET http://10.0.0.0:1000/ HTTP/X.X']
rx = re.compile(r'\b(\w{3})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})\s+(\w+)\W+(\d{1,3}(?:\.\d{1,3}){3})(?:\s+\S+){2}\s+\[([^][\s]+)\s+([+\d]+)]\s+"([A-Z]+)\s+(\S+)\s+(\S+)"\s+(\d+)\s+(\d+)\s+\S+\s+"([^"]*)')

words=[]
for sent in sentences:
    m = rx.search(sent)
    if m:
        words.append(list(m.groups()))
    else:
        words.append(nltk.word_tokenize(sent))  

print(words)

and I get this output:

[['GET', 'http', ':', '//10.0.0.0:1000/', 'HTTP/X.X']]

Does anyone know where the mistake is, or why it doesn't work the way I want? Thanks

import re

sentences = ['GET http://10.0.0.0:1000/ HTTP/X.X']

words=[]

for sent in sentences:
    words.append(sent.split(' '))  # split already returns a list

print(words)

Can't you just use a simple space split? Your regex expects a much longer log line (timestamp, IP, status codes), so it never matches this sentence, and the code falls back to nltk.word_tokenize, which splits http://... at the punctuation. I think nltk.word_tokenize is giving you the wrong output!

It looks like you want to split it on spaces. So:

sentences = ['GET http://10.0.0.0:1000/ HTTP/X.X']
words = [x.split(" ") for x in sentences]
print(words)
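If the fields might be separated by runs of whitespace (tabs or multiple spaces), a sketch using `str.split()` with no argument is slightly more robust than splitting on a single space, since it collapses consecutive separators and never produces empty tokens:

```python
sentences = ['GET http://10.0.0.0:1000/ HTTP/X.X']

# split() with no argument splits on any run of whitespace,
# so extra spaces or tabs between fields do not yield '' tokens
words = [sent.split() for sent in sentences]
print(words)  # [['GET', 'http://10.0.0.0:1000/', 'HTTP/X.X']]
```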