Convert a log.txt file to a JSON file

I need to convert a log file to a JSON file in order to train an unsupervised model. The log file has the following format:

40.77.167.191, 172.16.30.15 - - [08/May/2018:03:29:15 +0530] "GET /speedwav-full-chrome-side-beading-for-tata-indigo-cs-46901.html HTTP/1.1" 403 162 <0.000> <-> "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
66.249.79.25, 172.16.30.15 - - [08/May/2018:03:29:17 +0530] "GET /schneider-dc-control-relays-ca4kn31-t008000721.html HTTP/1.1" 200 14443 <0.445> <0.445> "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.79.25, 172.16.30.15 - - [08/May/2018:03:29:19 +0530] "GET /ajax/pdp/recentlyviewed/1184932 HTTP/1.1" 200 2 <0.089> <0.089> "https://www.tolexo.com/orient-18w-eternal-surface-panel-square-led-light-18w01-t14ori0043.html" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

I want to get a file in the following format:

40.77.167.191, 172.16.30.15 - - [08/May/2018:03:29:15 +0530] "GET /speedwav-full-chrome-side-beading-for-tata-indigo-cs-46901.html HTTP/1.1" 403 162 <0.000> <-> "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"

66.249.79.25, 172.16.30.15 - - [08/May/2018:03:29:17 +0530] "GET /schneider-dc-control-relays-ca4kn31-t008000721.html HTTP/1.1" 200 14443 <0.445> <0.445> "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

66.249.79.25, 172.16.30.15 - - [08/May/2018:03:29:19 +0530] "GET /ajax/pdp/recentlyviewed/1184932 HTTP/1.1" 200 2 <0.089> <0.089> "https://www.tolexo.com/orient-18w-eternal-surface-panel-square-led-light-18w01-t14ori0043.html" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Then I want to create a JSON file from it.

Use re.split.

For example:

import re

s = """40.77.167.191, 172.16.30.15 - - [08/May/2018:03:29:15 +0530] "GET /speedwav-full-chrome-side-beading-for-tata-indigo-cs-46901.html HTTP/1.1" 403 162 <0.000> <-> "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 66.249.79.25, 172.16.30.15 - - [08/May/2018:03:29:17 +0530] "GET /schneider-dc-control-relays-ca4kn31-t008000721.html HTTP/1.1" 200 14443 <0.445> <0.445> "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 66.249.79.25, 172.16.30.15 - - [08/May/2018:03:29:19 +0530] "GET /ajax/pdp/recentlyviewed/1184932 HTTP/1.1" 200 2 <0.089> <0.089> "https://www.tolexo.com/orient-18w-eternal-surface-panel-square-led-light-18w01-t14ori0043.html" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"""
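# Split on the leading "IP, IP" pair; the capture group keeps each pair in the result.
# The first element is the empty string before the first match, so drop it with [1:].
val = re.split(r"(\d+\.\d+\.\d+\.\d+, \d+\.\d+\.\d+\.\d+)", s)[1:]
# val alternates between an IP pair and the rest of that entry, so pair them up.
for v, w in zip(val[::2], val[1::2]):
    print((v, w))  # print each entry as a tuple (extra parentheses needed on Python 3)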

Output:

('40.77.167.191, 172.16.30.15', ' - - [08/May/2018:03:29:15 +0530] "GET /speedwav-full-chrome-side-beading-for-tata-indigo-cs-46901.html HTTP/1.1" 403 162 <0.000> <-> "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" ')
('66.249.79.25, 172.16.30.15', ' - - [08/May/2018:03:29:17 +0530] "GET /schneider-dc-control-relays-ca4kn31-t008000721.html HTTP/1.1" 200 14443 <0.445> <0.445> "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" ')
('66.249.79.25, 172.16.30.15', ' - - [08/May/2018:03:29:19 +0530] "GET /ajax/pdp/recentlyviewed/1184932 HTTP/1.1" 200 2 <0.089> <0.089> "https://www.tolexo.com/orient-18w-eternal-surface-panel-square-led-light-18w01-t14ori0043.html" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)')
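The split above separates the entries but does not yet produce JSON. As a rough sketch of the remaining step, each entry could be matched against a regular expression with named groups and the resulting dictionaries dumped with the json module. The field names (client_ip, proxy_ip, request_time and so on), the helper names parse_log and log_to_json, and the output filename are assumptions chosen for illustration; the source does not say what each column means, and entries that do not match the pattern are silently skipped.

```python
import json
import re

# Hypothetical field names: the original log format is not documented,
# so these labels are guesses based on the three sample entries.
ENTRY_RE = re.compile(
    r'(?P<client_ip>\d+\.\d+\.\d+\.\d+), (?P<proxy_ip>\d+\.\d+\.\d+\.\d+) '
    r'\S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'<(?P<request_time>[^>]*)> <(?P<upstream_time>[^>]*)> '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_log(text):
    """Return one dict of string fields per log entry found in text."""
    return [match.groupdict() for match in ENTRY_RE.finditer(text)]


def log_to_json(log_path, json_path):
    """Parse the log at log_path and write the entries to json_path as a JSON array."""
    with open(log_path, encoding="utf-8") as src:
        records = parse_log(src.read())
    with open(json_path, "w", encoding="utf-8") as dst:
        json.dump(records, dst, indent=2)
    return len(records)
```

Calling log_to_json("log.txt", "log.json") would then write one JSON object per entry. Because finditer scans the whole text, this works whether the entries sit on separate lines or run together as in the string s above. All values are kept as strings; converting status or size to numbers is left out here because some fields may contain "-".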