Re - ignoring some lines

Question

I am trying to convert my data into a list of dictionaries, for example:

example_dict = {"host": "146.204.224.152",
                "user_name": "feest6811",  # note: sometimes the user name is missing! In this case, use '-' as the value for the username.
                "time": "21/Jun/2019:15:45:24 -0700",
                "request": "POST /incentivize HTTP/1.1"}  # note: not everything is a POST

My data:

86.187.99.249 - tillman6650 [21/Jun/2019:15:46:03 -0700] "POST /efficient/unleash HTTP/1.1" 405 22390
76.72.133.93 - carroll1056 [21/Jun/2019:15:46:05 -0700] "POST /morph/optimize/plug-and-play HTTP/2.0" 400 27172
73.162.151.229 - dubuque3528 [21/Jun/2019:15:46:08 -0700] "DELETE /transition/holistic/e-business HTTP/2.0" 301 13923
13.112.8.80 - rau5026 [21/Jun/2019:15:46:09 -0700] "HEAD /ubiquitous/transparent HTTP/1.1" 200 16928
159.253.153.40 - - [21/Jun/2019:15:46:10 -0700] "POST /e-business HTTP/1.0" 504 19845
136.195.158.6 - feeney9464 [21/Jun/2019:15:46:11 -0700] "HEAD /open-source/markets HTTP/2.0" 204 21149
219.194.113.255 - - [21/Jun/2019:15:46:12 -0700] "PATCH /next-generation/niches/mindshare HTTP/1.0" 503 20246
59.101.239.174 - brekke3293 [21/Jun/2019:15:46:13 -0700] "DELETE /ubiquitous/seize/web-enabled HTTP/2.0" 302 14017

My code:

import re

# logdata holds the raw log text shown above
pattern = r"""
(?P<host>.*)            # user host
(-\ )                   # separator
(?P<user_name>\w*)      # user name
(\ \[)                  # separator: space and opening bracket
(?P<time>\S*\ -0700)    # time
(\]\ )                  # separator: closing bracket and space
(?P<request>.*")        # request, including the quotes
"""
for user in re.finditer(pattern, logdata, re.VERBOSE):
    print(user.groupdict())

Output:

{'host': '86.187.99.249 ', 'user_name': 'tillman6650', 'time': '21/Jun/2019:15:46:03 -0700', 'request': '"POST /efficient/unleash HTTP/1.1"'}
{'host': '76.72.133.93 ', 'user_name': 'carroll1056', 'time': '21/Jun/2019:15:46:05 -0700', 'request': '"POST /morph/optimize/plug-and-play HTTP/2.0"'}
{'host': '73.162.151.229 ', 'user_name': 'dubuque3528', 'time': '21/Jun/2019:15:46:08 -0700', 'request': '"DELETE /transition/holistic/e-business HTTP/2.0"'}
{'host': '13.112.8.80 ', 'user_name': 'rau5026', 'time': '21/Jun/2019:15:46:09 -0700', 'request': '"HEAD /ubiquitous/transparent HTTP/1.1"'}
{'host': '136.195.158.6 ', 'user_name': 'feeney9464', 'time': '21/Jun/2019:15:46:11 -0700', 'request': '"HEAD /open-source/markets HTTP/2.0"'}
{'host': '59.101.239.174 ', 'user_name': 'brekke3293', 'time': '21/Jun/2019:15:46:13 -0700', 'request': '"DELETE /ubiquitous/seize/web-enabled HTTP/2.0"'}

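In the given data some user names are "-", and my code simply skips those lines. I need to include those lines as well, using "-" as the value for the user name.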

Answer 1

You can change the current user_name regex to

(?P<user_name>[\w\-]*)

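Inside a character class the - symbol has a special meaning (it denotes a range; for example, 0-9 matches any digit from 0 to 9), so to match it literally you need to escape it with a backslash: \-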
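As a quick check, here is a minimal sketch of the question's pattern with only the user_name group changed, run against two of the sample lines from the question. The raw-string prefix and the records variable name are my own additions for illustration, not part of the original code.

```python
import re

# Two sample lines from the question; the second has '-' as the user name.
logdata = """86.187.99.249 - tillman6650 [21/Jun/2019:15:46:03 -0700] "POST /efficient/unleash HTTP/1.1" 405 22390
159.253.153.40 - - [21/Jun/2019:15:46:10 -0700] "POST /e-business HTTP/1.0" 504 19845"""

pattern = r"""
(?P<host>.*)              # user host (still captures the trailing space)
(-\ )                     # separator
(?P<user_name>[\w\-]*)    # user name: word characters or a literal '-'
(\ \[)                    # space and opening bracket
(?P<time>\S*\ -0700)      # time
(\]\ )                    # closing bracket and space
(?P<request>.*")          # request, including the surrounding quotes
"""

# records is a hypothetical name for the resulting list of dictionaries
records = [m.groupdict() for m in re.finditer(pattern, logdata, re.VERBOSE)]
for record in records:
    print(record)
```

With this change the second line should no longer be skipped, and its user_name comes out as '-'. The host value still carries a trailing space, as in the output shown in the question, so you may want to call .strip() on it.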

python

python-re