tshark extract fields with their string representation
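I have a pcap file captured with tshark that contains data I want to analyse and export to a CSV or xls file. According to the tshark documentation, I can either use the -z option with appropriate arguments, or use -T together with -E and -e. I'm using Python 3.6 on a Debian machine. At the moment, my command looks like this: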
command = "tshark -q -o tcp.relative_sequence_numbers:false -o tcp.analyze_sequence_numbers:false " \
          "-o tcp.track_bytes_in_flight:false -Q -l -z diameter,avp,272,Session-Id,Origin-Host," \
          "Origin-Realm,Destination-Realm,Auth-Application-Id,Service-Context-Id,CC-Request-Type,CC-Request-Number," \
          "Subscription-Id,CC-Session-Failover,Destination-Host,User-Name,Origin-State-Id," \
          "Multiple-Services-Credit-Control,Requested-Service-Unit,Used-Service-Unit,SN-Total-Used-Service-Unit," \
          "SN-Remaining-Service-Unit,Service-Identifier,Rating-Group,User-Equipment-Info,Service-Information," \
          "Route-Record,Credit-Control-Failure-Handling -r {}".format(args.input_file)
Later I process the output into a pandas DataFrame, like this:
# loops adding TCP and/or UDP ports to scan traffic from
if args.tcp:
    for port in args.tcp:
        command += " -d tcp.port=={},diameter".format(port)
if args.udp:
    for port in args.udp:
        command += " -d udp.port=={},diameter".format(port)
# calling subprocess with output redirection to task variable
task = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE)
# a loop adding new data dictionaries to data_list
for line in task.stdout:
    # firstly, decode the byte string and get rid of '
    line = re.sub(r"'", "", line.decode("utf-8"))
    # secondly, split the string on every whitespace or = to obtain a flat list of keys and values
    line = re.split(r"\s|=", line)
    # convert the list to an ordered dictionary to preserve column order:
    # each item i becomes a key and item i+1 becomes its value
    # (named `row` rather than `dict` to avoid shadowing the built-in)
    row = OrderedDict(line[i:i+2] for i in range(0, len(line)-2, 2))
    data_list.append(row)
# remove last 4 dictionaries (last 4 lines of task.stdout)
data_list = data_list[:-4]
df = pd.DataFrame(data_list).fillna("-")  # create data frame from list of dicts and fill each NaN with "-"
df.to_excel("{}.xls".format(args.output_file), index=False)
print("Please remember that 'frame' column may not correspond to row index!")
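Splitting on whitespace or = as above misparses any value that itself contains a space or an equals sign. As a rough sketch, assuming every output line is a sequence of name='value' pairs (the sample line below is made up for illustration and is not captured tshark output), a regular expression can extract the pairs directly. `parse_avp_line` is a hypothetical helper name, not part of tshark or pandas.

```python
import re
from collections import OrderedDict

# Matches name='value' pairs; a value may contain spaces or '=' but not a single quote.
PAIR_RE = re.compile(r"(\S+?)='([^']*)'")


def parse_avp_line(line):
    """Return an ordered mapping of field name -> value for one output line."""
    return OrderedDict(PAIR_RE.findall(line))


# Illustrative line only (assumed shape); real output carries more fields.
sample = "frame='12' time='1.5' CC-Request-Type='3' Origin-Host='host a.example'"
row = parse_avp_line(sample)
print(row["CC-Request-Type"], row["Origin-Host"])
```

Lines that contain no such pairs yield an empty mapping, so they could be skipped by testing `if row:` instead of slicing off a fixed number of trailing lines.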
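When I open the output file, I can see that it works fine, except that in e.g. the CC-Request-Number column I get numerical values instead of their string representation. For example, where Wireshark shows TERMINATION-REQUEST for a packet, the CC-Request-Number column of the output Excel file shows 3 in the row corresponding to that packet.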
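My question is: how can I translate this number into its string representation while using the -z option, or (as I guess from what I've read online) how can I get the values mentioned above using -T and -e? I listed all available fields with tshark -G, but there are too many of them and I can't think of any reasonable way to find the ones I want.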
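One way to narrow the field list down, sketched under the assumption that `tshark -G fields` prints tab-separated records in which field lines start with `F` followed by the descriptive name, the abbreviation, the type and the parent protocol. `find_fields` is a hypothetical helper and the sample lines are invented for illustration.

```python
def find_fields(lines, protocol, keyword=""):
    """Yield (abbrev, name) for fields of `protocol` whose abbreviation contains `keyword`."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        # Assumed record layout: F, name, abbrev, type, parent protocol, ...
        if len(parts) < 5 or parts[0] != "F":
            continue
        name, abbrev, parent = parts[1], parts[2], parts[4]
        if parent == protocol and keyword.lower() in abbrev.lower():
            yield abbrev, name


# Invented sample records; in practice the lines would come from something like
# subprocess.run(["tshark", "-G", "fields"], stdout=subprocess.PIPE,
#                universal_newlines=True).stdout.splitlines()
sample = [
    "P\tDiameter Protocol\tdiameter\n",
    "F\tCC-Request-Type\tdiameter.CC-Request-Type\tFT_UINT32\tdiameter\tBASE_DEC\t0x0\t\n",
    "F\tSource Port\ttcp.srcport\tFT_UINT16\ttcp\tBASE_PT_TCP\t0x0\t\n",
]
matches = list(find_fields(sample, "diameter", "request"))
print(matches)
```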
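Oddly, with -T fields and -e tshark always prints the numeric representation, whereas with the "Custom Fields" output format it prints the textual representation. The good news is that custom-field mode is actually about 3 times faster than -T fields mode. The bad news is that I know of no way to control the separator between custom fields, so it appears to be unusable if your field contents may contain spaces.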
Instead of -z, try this:
-o column.format:'"time", "%t", "type", "%Cus:diameter.CC-Request-Number"'
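From Python, the nested quotes in that option are easy to get wrong when the command is a single shell string. A minimal sketch, assuming the preference is spelled exactly as in the line above: build an argument list and pass it to `subprocess` without `shell=True`, so no shell quoting is involved. `build_tshark_args` and `capture.pcap` are hypothetical names used only for illustration.

```python
def build_tshark_args(pcap_path, columns):
    """Build a tshark argv list.

    columns: list of (title, format) pairs,
             e.g. ("type", "%Cus:diameter.CC-Request-Number").
    """
    fmt = ", ".join('"{}", "{}"'.format(title, spec) for title, spec in columns)
    return ["tshark", "-r", pcap_path, "-o", "column.format:" + fmt]


tshark_args = build_tshark_args(
    "capture.pcap",  # hypothetical input file
    [("time", "%t"), ("type", "%Cus:diameter.CC-Request-Number")],
)
print(tshark_args)
# The list can then be run with, for example:
# subprocess.Popen(tshark_args, stdout=subprocess.PIPE)
```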
Thanks to John Zwick's suggestion and the Python documentation on The ElementTree XML API, I implemented the code presented below (I downloaded dictionary.xml and chargecontrol.xml from the official Wireshark GitHub repository):
import re
import subprocess
import xml.etree.ElementTree as ET
from collections import OrderedDict

import pandas as pd

# `args` (input_file, output_file, tcp, udp) is parsed elsewhere in the script
chargecontrol_tree = ET.parse("chargecontrol.xml")
dictionary_tree = ET.parse("dictionary.xml")
chargecontrol_root = chargecontrol_tree.getroot()
dictionary_root = dictionary_tree.getroot()
# list that will contain data dictionaries
data_list = []
# base command
command = "tshark -q -o tcp.relative_sequence_numbers:false -o tcp.analyze_sequence_numbers:false " \
          "-o tcp.track_bytes_in_flight:false -Q -l -z diameter,avp,272,Session-Id,Origin-Host," \
          "Origin-Realm,Destination-Realm,Auth-Application-Id,Service-Context-Id,CC-Request-Type,CC-Request-Number," \
          "Subscription-Id-Data,Subscription-Id-Type,CC-Session-Failover,Destination-Host,User-Name,Origin-State-Id," \
          "Requested-Service-Unit,Used-Service-Unit,SN-Total-Used-Service-Unit," \
          "SN-Remaining-Service-Unit,Service-Identifier,Rating-Group,User-Equipment-Info,Service-Information," \
          "Route-Record,Credit-Control-Failure-Handling -r {}".format(args.input_file)
# loops adding tcp and/or udp ports to scan traffic from
if args.tcp:
    for port in args.tcp:
        command += " -d tcp.port=={},diameter".format(port)
if args.udp:
    for port in args.udp:
        command += " -d udp.port=={},diameter".format(port)
# calling subprocess with output redirection to task variable
task = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE)
# a loop adding new data dictionaries to data_list
for line in task.stdout:
    # firstly, decode the byte string and get rid of '
    line = re.sub(r"'", "", line.decode("utf-8"))
    # secondly, split the string on every whitespace or = to obtain a flat list of keys and values
    line = re.split(r"\s|=", line)
    # convert the list to an ordered dictionary to preserve column order:
    # each item i becomes a key and item i+1 becomes its value
    # (named `row` rather than `dict` to avoid shadowing the built-in)
    row = OrderedDict(line[i:i+2] for i in range(0, len(line)-2, 2))
    data_list.append(row)
# remove last 4 dictionaries (last 4 lines of task.stdout)
data_list = data_list[:-4]
df = pd.DataFrame(data_list).fillna("-")  # create data frame from list of dicts and fill each NaN with "-"
# values taken from official wireshark repository
# https://github.com/boundary/wireshark/blob/master/diameter/dictionary.xml
# https://github.com/wireshark/wireshark/blob/2832f4e97d77324b4e46aac40dae0ce898ae559d/diameter/chargecontrol.xml
df["Auth-Application-Id"] = df["Auth-Application-Id"].map(
    {node.attrib["code"]: node.attrib["name"]
     for node in dictionary_root.findall(".//*[@name='Auth-Application-Id']/enum")})
# columns whose numeric values have to be replaced with their names
for col in ["CC-Request-Type", "CC-Session-Failover", "Credit-Control-Failure-Handling", "Subscription-Id-Type"]:
    df[col] = df[col].map(
        {node.attrib["code"]: node.attrib["name"]
         for node in chargecontrol_root.findall(".//*[@name='{}']/enum".format(col))})
df.to_excel("{}.xls".format(args.output_file), index=False)
print("Please remember that 'frame' column may not correspond to row index!")
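The enum lookup used above can be tried in isolation. The sketch below uses a small inline XML string that only imitates the shape the code relies on (an element with a `name` attribute containing `enum` children with `code` and `name` attributes); it is a hypothetical excerpt, not a copy of chargecontrol.xml, and `enum_map` is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt shaped like the dictionary files; the real files are much larger.
XML = """
<dictionary>
  <application id="4" name="Diameter Credit Control Application">
    <avp name="CC-Request-Type" code="416">
      <type type-name="Enumerated"/>
      <enum name="INITIAL_REQUEST" code="1"/>
      <enum name="UPDATE_REQUEST" code="2"/>
      <enum name="TERMINATION_REQUEST" code="3"/>
      <enum name="EVENT_REQUEST" code="4"/>
    </avp>
  </application>
</dictionary>
"""


def enum_map(root, avp_name):
    """Return {code: name} for the enum children of the element named avp_name."""
    path = ".//*[@name='{}']/enum".format(avp_name)
    return {node.attrib["code"]: node.attrib["name"] for node in root.findall(path)}


root = ET.fromstring(XML)
mapping = enum_map(root, "CC-Request-Type")
# Fall back to the raw value when a code has no entry in the mapping.
decoded = [mapping.get(value, value) for value in ["1", "3", "9", "-"]]
print(decoded)
```

The fallback in `mapping.get(value, value)` is a design choice of this sketch: `Series.map` with a plain dict turns unmapped values, including the "-" placeholders, into NaN, so keeping the raw value may be preferable in the exported sheet.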