Python: fix multi-line log entries
I need to fix some multi-line log entries. I currently do this with Perl, but I need to move the functionality to Python.
Example multi-line entry:
2015-12-02T17:56:13.783276Z our-elb-prod 52.20.50.51:60944 10.30.0.32:80 0.000024 0.063357 0.000066 200 200 0 12164 "GET http://www.example.com:80/episodes/2014/10/ HTTP/1.0" "IgnitionOneBot/Nutch-1.9 (
This is the IgnitionOne Company Bot for Web Crawling.
IgnitionOne Company Site: http://www.example.com/
;
rong2 dot huang at ignitionone dot com
)" - -
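The Perl script that currently fixes these is:
# $fh is an already-opened read handle on the log file
while (my $row = <$fh>) {
    chomp $row;
    # a line that starts with a timestamp begins a new entry
    if ( $row =~ /^(\d{4})-(\d\d)-(\d\d)T(\d)/ ) {
        print "\n" if $. != 1;
    }
    print $row;
}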
It outputs the corrected single-line entry:
2015-12-02T17:56:13.783276Z our-elb-prod 52.20.50.51:60944 10.30.0.32:80 0.000024 0.063357 0.000066 200 200 0 12164 "GET http://www.example.com:80/episodes/2014/10/ HTTP/1.0" "IgnitionOneBot/Nutch-1.9 ( This is the IgnitionOne Company Bot for Web Crawling. IgnitionOne Company Site: http://www.example.com/ ; rong2 dot huang at ignitionone dot com )" - -
So in short: we look for any line that does not start with the date regex, and append each such line to the entry before it, without the \n.
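I have seen other ways to do this with awk and the like, but I need it to be pure Python. From what I have read, itertools looks like it might be the preferred way to solve this?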
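You can do this in Python with the re module, using a regex based on a negative lookahead: remove every newline that is not immediately followed by a timestamp.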
>>> import re
>>> s = '''2015-12-02T17:56:13.783276Z our-elb-prod 52.20.50.51:60944 10.30.0.32:80 0.000024 0.063357 0.000066 200 200 0 12164 "GET http://www.example.com:80/episodes/2014/10/ HTTP/1.0" "IgnitionOneBot/Nutch-1.9 (
This is the IgnitionOne Company Bot for Web Crawling.
IgnitionOne Company Site: http://www.example.com/
;
rong2 dot huang at ignitionone dot com
)" - -'''
>>> re.sub(r'\n(?!\d{4}-\d{2}-\d{2}T\d)', '', s)
'2015-12-02T17:56:13.783276Z our-elb-prod 52.20.50.51:60944 10.30.0.32:80 0.000024 0.063357 0.000066 200 200 0 12164 "GET http://www.example.com:80/episodes/2014/10/ HTTP/1.0" "IgnitionOneBot/Nutch-1.9 (This is the IgnitionOne Company Bot for Web Crawling.IgnitionOne Company Site: http://www.example.com/ ; rong2 dot huang at ignitionone dot com )" - -'
That is, for a file:
import re

# `file` is the path to the log file
with open(file) as f:
    fil = f.read()
# drop every newline that is not followed by a timestamp
print(re.sub(r'\n(?!\d{4}-\d{2}-\d{2}T\d)', '', fil))
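Note that f.read() loads the whole file into memory. If the logs are large, a line-by-line version closer to the original Perl loop may be preferable. The following is only a sketch of that idea: the names merge_entries, ENTRY_START and sep are made up for illustration, and the sample lines are shortened stand-ins for real log data. Passing sep=' ' puts a space at each join, which is closer to the desired output shown in the question.

```python
import re

# Same test as the Perl script: an entry starts with YYYY-MM-DDT followed by a digit.
ENTRY_START = re.compile(r'\d{4}-\d{2}-\d{2}T\d')


def merge_entries(lines, sep=''):
    """Yield one string per log entry, folding continuation lines into it.

    `lines` can be any iterable of strings, for example an open file object.
    `sep` is placed between the joined pieces (hypothetical parameter).
    """
    parts = []
    for line in lines:
        line = line.rstrip('\n')
        # A timestamp line closes the entry collected so far, if there is one.
        if ENTRY_START.match(line) and parts:
            yield sep.join(parts)
            parts = []
        parts.append(line)
    # Flush the last entry once the input is exhausted.
    if parts:
        yield sep.join(parts)


# Shortened, made-up sample in the same shape as the log above.
sample = [
    '2015-12-02T17:56:13.783276Z our-elb-prod "IgnitionOneBot/Nutch-1.9 (\n',
    'This is the IgnitionOne Company Bot for Web Crawling.\n',
    ')" - -\n',
    '2015-12-02T17:56:14.000000Z our-elb-prod "OtherBot" - -\n',
]

for entry in merge_entries(sample, sep=' '):
    print(entry)
```

With a real file, the same generator should work when given the file object directly, e.g. iterating over merge_entries(f) inside a with open(...) block, so only one entry is held in memory at a time.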