RegExp to extract data from a Redmine log file
I have a log file with over 1 million lines. I am trying to extract some data from the log based on a particular username.
Log sample:
Started POST "/projects/some-project/issues/update_form.js" for 194.176.105.12 at Tue Jun 10 14:58:59 +0200 2014
Processing by IssuesController#update_form as JS
Parameters: {"issue"=>{"is_private"=>"0", "done_ratio"=>"0", "fixed_version_id"=>"", "tracker_id"=>"2", "assigned_to_id"=>"", "due_date"=>"", "custom_field_values"=>{"12"=>[""], "16"=>[""]}, "subject"=>"", "start_date"=>"", "estimated_hours"=>"", "description"=>"", "status_id"=>"1", "priority_id"=>"2"}, "project_id"=>"barnet-and-chase-farm", "attachments"=>{"screenshot"=>{"name"=>"screenshot", "content"=>"", "description"=>""}}, "utf8"=>"✓", "authenticity_token"=>"sometoken"}
Current user: SOME.USERNAME (id=20)
Rendered issues/_form_custom_fields.html.erb (3.7ms)
Rendered issues/_attributes.html.erb (397.9ms)
Rendered plugins/redmine_screenshot_paste/app/views/issues/_screenshot.html.erb (0.6ms)
Rendered issues/_form.html.erb (418.6ms)
Rendered issues/update_form.js.erb (422.3ms)
Completed 200 OK in 1032.4ms (Views: 406.6ms | ActiveRecord: 22.7ms)
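The log file has many repeating blocks like the one above. The content of a block is variable, i.e. it may have different data, a different number of lines, etc. However, every block begins with the string Started and ends with the string Completed; both strings always appear at column 1 on a new line.
I need to extract only those blocks that contain the string Current user: SOME.USERNAME
What is the best way to achieve this? I am guessing a RegExp could do the trick, but I am not sure how to write it to get the desired result.
I can use the Linux command line (grep etc.) or some software like Sublime Text or Notepad++, or anything else the community recommends, e.g. a Python script.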
You can use this regex:
(?ms)^Started [^\n]*(?:(?!^Completed\b).)*?Current user: SOME\.USERNAME\b.*?^Completed\b[^\n]*
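To make the pieces of that pattern easier to follow, here is a sketch of the same expression written with re.VERBOSE and one comment per part. The key piece is the `(?:(?!^Completed\b).)*?` group, which stops the search for the user line from running past the end of the current block into the next one. The two-request sample log is made up for illustration, and `BLOCK_RE` and `sample` are hypothetical names.

```python
import re

# Same pattern as above, split up with re.VERBOSE. Literal spaces are
# written as [ ] because verbose mode ignores plain whitespace.
BLOCK_RE = re.compile(
    r"""
    ^Started[ ] [^\n]*                    # opening line of a block
    (?: (?!^Completed\b) . )*?            # any text, but never past a line starting with Completed
    Current[ ]user:[ ]SOME\.USERNAME\b    # the user line we are looking for
    .*?                                   # remainder of the block
    ^Completed\b [^\n]*                   # closing line of the block
    """,
    re.MULTILINE | re.DOTALL | re.VERBOSE,
)

# Made-up sample: the first block belongs to another user and must be skipped.
sample = (
    'Started GET "/a" for 10.0.0.1 at Tue Jun 10 14:58:59 +0200 2014\n'
    "Current user: OTHER.USER (id=21)\n"
    "Completed 200 OK in 10.0ms\n"
    'Started GET "/b" for 10.0.0.2 at Tue Jun 10 14:59:03 +0200 2014\n'
    "Current user: SOME.USERNAME (id=20)\n"
    "Completed 200 OK in 12.0ms\n"
)

matches = BLOCK_RE.findall(sample)
print("\n".join(matches))
```

Without the negative lookahead, a match starting at the first Started line could stretch across the first Completed line and swallow both blocks.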
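As a small Python snippet, you could do something like:
import re
import sys

# Usage: python my_script.py LOG_FILE USERNAME
log_path, user = sys.argv[1], sys.argv[2]
pattern = r'(?ms)^Started [^\n]*(?:(?!^Completed\b).)*?Current user: %s\b.*?^Completed\b[^\n]*' % re.escape(user)
# Reads the whole file into memory; UTF-8 is assumed because the sample contains "✓".
with open(log_path, encoding='utf-8') as f:
    print('\n'.join(re.findall(pattern, f.read())))
and call it like this: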
python my_script.py /path/to/log_file.txt SOME.USERNAME
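Since the log has over a million lines, reading it all at once may be undesirable. The following is a minimal sketch of a line-by-line alternative, not the regex approach above: it collects one block at a time and keeps it only if the user line appears. It assumes, as the question states, that blocks are never interleaved and that Started and Completed always begin at column 1. The function name `iter_user_blocks` and the sample data are hypothetical.

```python
def iter_user_blocks(lines, user):
    """Yield every Started..Completed block that names `user` as current user.

    `lines` is any iterable of lines (an open file works), so only one
    block is held in memory at a time.
    """
    needle = "Current user: " + user
    block = None      # lines collected so far, or None when outside a block
    wanted = False    # True once the current block mentions the user
    for line in lines:
        if line.startswith("Started "):
            block, wanted = [line], False      # a new block begins
        elif block is not None:
            block.append(line)
            text = line.strip()
            # Require the exact name so that a longer username sharing
            # the same prefix does not match.
            if text == needle or text.startswith(needle + " "):
                wanted = True
            if line.startswith("Completed"):
                if wanted:
                    yield "".join(block)
                block = None                   # block finished


# Made-up in-memory demo; with a real log, pass the open file object instead.
sample = (
    'Started GET "/a" for 10.0.0.1 at Tue Jun 10 14:58:59 +0200 2014\n'
    "Current user: SOME.USERNAME (id=20)\n"
    "Completed 200 OK in 10.0ms\n"
    'Started GET "/b" for 10.0.0.2 at Tue Jun 10 14:59:03 +0200 2014\n'
    "Current user: OTHER.USER (id=21)\n"
    "Completed 200 OK in 12.0ms\n"
)
blocks = list(iter_user_blocks(sample.splitlines(keepends=True), "SOME.USERNAME"))
print("".join(blocks), end="")
```

For a real file, something like `with open(path, encoding="utf-8") as f: for block in iter_user_blocks(f, "SOME.USERNAME"): ...` would stream through the log without loading it whole; the UTF-8 encoding is again an assumption.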