Extract a certain string which can appear several times in a file
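I have a text file that I want to read and extract a certain string from; the string can appear several times. Then I want to print the results.
The string I want to extract is the value of Rule MATCH Name.
Sample of the text file: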
201819:34:40Z ubuntu : Info: MODULE: FileScan MESSAGE: Scanning test
201809:34:40Z ubuntu: Alert: MODULE: FileScan MESSAGE: FILE: /test/76.bin SCORE: 140 TYPE: EXE AutoUpdates https://www.test.com/files: **Rule MATCH Name**: this_is_test1 SUBSCORE:100
201819:34:40Z ubuntu : Info: MODULE: FileScan MESSAGE: Scanning test
201809:34:40Z ubuntu: Alert: MODULE: FileScan MESSAGE: FILE: /test/7164.bin SCORE: 140 TYPE: EXE AutoUpdates https://www.test.com/files: **Rule MATCH Name**: this_is_test2 SUBSCORE:90
201819:34:40Z ubuntu : Info: MODULE: FileScan MESSAGE: Scanning test
201809:34:40Z ubuntu: Alert: MODULE: FileScan MESSAGE: FILE: /test/764.bin SCORE: 140 TYPE: EXE AutoUpdates https://www.test.com/files: **Rule MATCH Name**: this_is_test3 SUBSCORE:15
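You can solve this with a regular expression. RegExr is a good website for creating and testing regular-expression rules.
Once you have a rule that fits your problem, load the file, read the text with readlines(), and extract the values with Python's re module.
Here is a quick solution (I am not sure whether this is the value you want to extract):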
import re
fl = r'201819:34:40Z ubuntu : Info: MODULE: FileScan MESSAGE: Scanning test 201809:34:40Z ubuntu: Alert: MODULE: FileScan MESSAGE: FILE: /test/76.bin SCORE: 140 TYPE: EXE AutoUpdates https://www.test.com/files: Rule MATCH Name: this_is_test1 SUBSCORE:100 201819:34:40Z ubuntu : Info: MODULE: FileScan MESSAGE: Scanning test 201809:34:40Z ubuntu: Alert: MODULE: FileScan MESSAGE: FILE: /test/7164.bin SCORE: 140 TYPE: EXE AutoUpdates https://www.test.com/files: Rule MATCH Name: this_is_test2 SUBSCORE:90 201819:34:40Z ubuntu : Info: MODULE: FileScan MESSAGE: Scanning test 201809:34:40Z ubuntu: Alert: MODULE: FileScan MESSAGE: FILE: /test/764.bin SCORE: 140 TYPE: EXE AutoUpdates https://www.test.com/files: Rule MATCH Name: this_is_test3 SUBSCORE:15'
re.findall(r'Rule MATCH Name:\s(\w+)\s', fl)
# ['this_is_test1', 'this_is_test2', 'this_is_test3']
If you read from a file:
import re

found = []
with open('f.txt') as f:
    # Collect every rule name, one line of the file at a time.
    for line in f.readlines():
        found += re.findall(r'Rule MATCH Name:\s(\w+)\s', line)
print(found)  # ['this_is_test1', 'this_is_test2', 'this_is_test3']
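Note that the pattern above requires a whitespace character after the name, so a name at the very end of a file with no trailing newline would be missed. Below is a minimal sketch of a more tolerant variant. It assumes the `**` around the label in the sample may or may not be literally present in the log; the function name `extract_rule_names` and the sample list are hypothetical, made up for illustration.

```python
import re

# Optional literal "**" around the label; no trailing whitespace required.
RULE_NAME = re.compile(r'(?:\*\*)?Rule MATCH Name(?:\*\*)?:\s*(\w+)')


def extract_rule_names(lines):
    """Return every rule name found in an iterable of log lines."""
    names = []
    for line in lines:
        names.extend(RULE_NAME.findall(line))
    return names


# Hypothetical sample: one line without a match, one with "**" markers,
# and one where the name is the last thing on the line.
sample = [
    "201819:34:40Z ubuntu : Info: MODULE: FileScan MESSAGE: Scanning test",
    "201809:34:40Z ubuntu: Alert: MODULE: FileScan MESSAGE: FILE: /test/76.bin "
    "SCORE: 140 TYPE: EXE AutoUpdates https://www.test.com/files: "
    "**Rule MATCH Name**: this_is_test1 SUBSCORE:100",
    "Rule MATCH Name: this_is_test2",
]
print(extract_rule_names(sample))  # ['this_is_test1', 'this_is_test2']
```

Because an open file object yields its lines when iterated, the same function should also work as `extract_rule_names(f)` inside a `with open('f.txt') as f:` block.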
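Another simple approach is re.search. Follow this sketch, which prints every line that matches a pattern given on the command line: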
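import re
import sys

# Usage: python script.py PATTERN FILE
pattern = re.compile(sys.argv[1])
with open(sys.argv[2], "r") as file:
    for line in file:
        # Print every line that contains a match for the pattern.
        if pattern.search(line):
            print(line, end="")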
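The search-based script prints whole matching lines, while the question asks only for the value after Rule MATCH Name. One way to bridge that gap, shown here as a rough sketch rather than the answerer's own method, is to use a capture group with `re.finditer` and report just the captured name. The helper name `iter_rule_matches` and the two-line `log` list are assumptions made for illustration.

```python
import re

# The capture group isolates the value that follows the label.
PATTERN = re.compile(r'Rule MATCH Name:\s*(\w+)')


def iter_rule_matches(lines):
    """Yield (line_number, rule_name) for every match in the given lines."""
    for lineno, line in enumerate(lines, start=1):
        for match in PATTERN.finditer(line):
            yield lineno, match.group(1)


# Hypothetical input modeled on the sample log in the question.
log = [
    "201819:34:40Z ubuntu : Info: MODULE: FileScan MESSAGE: Scanning test",
    "201809:34:40Z ubuntu: Alert: MODULE: FileScan MESSAGE: FILE: /test/76.bin "
    "SCORE: 140 TYPE: EXE AutoUpdates https://www.test.com/files: "
    "Rule MATCH Name: this_is_test1 SUBSCORE:100",
]

for lineno, name in iter_rule_matches(log):
    print(f"{lineno}: {name}")  # 2: this_is_test1
```

Unlike `re.search`, which stops at the first match in a line, `finditer` also picks up a second rule name if one line happens to contain more than one.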