Python: Parsing a complex text file by delimiter
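I'm new to Python and more used to Java. I'm currently trying to parse a text file output by Praat. The file always has the same format and looks roughly like this, with a few more features: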
-- Voice report for 53. Sound T1_1001501_vowels --
Date: Tue Aug 7 12:15:41 2018
Time range of SELECTION
   From 0 to 0.696562 seconds (duration: 0.696562 seconds)
Pitch:
   Median pitch: 212.598 Hz
   Mean pitch: 211.571 Hz
   Standard deviation: 23.891 Hz
   Minimum pitch: 171.685 Hz
   Maximum pitch: 265.678 Hz
Pulses:
   Number of pulses: 126
   Number of periods: 113
   Mean period: 4.751119E-3 seconds
   Standard deviation of period: 0.539182E-3 seconds
Voicing:
   Fraction of locally unvoiced frames: 5.970% (12 / 201)
   Number of voice breaks: 1
   Degree of voice breaks: 2.692% (0.018751 seconds / 0.696562 seconds)
I would like to output something that looks like this:
0.696562,212.598,211.571,23.891,171.685,265.678,126,113,4.751119E-3,0.539182E-3,5.970,1,2.692
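So basically I want to print a string that contains only the number between the colon and the whitespace that follows it on each line, with the numbers separated by commas. I know this is probably a silly question, but I can't figure it out in Python; any help would be much appreciated!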
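OK, here is something simple that you will need to tweak a bit to make it work for you.
import re

with open("file.txt", "r") as f:
    lines = [s.strip() for s in f.readlines()]

numbers_list = []
for line in lines:
    # \d+ matches runs of digits only, so "212.598" comes back as ['212', '598'];
    # this pattern is the part that needs tweaking.
    numbers_list.append(re.findall(r'\d+', line))
print(numbers_list)
where file.txt is your file.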
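One possible tweak, as a rough sketch rather than a definitive fix: match a whole number (optional fraction and exponent) that follows a colon and at least one space, and keep only the first such match per line. The names VALUE_AFTER_COLON and extract_values are made up for illustration, and the indentation in the sample is an assumption. Requiring a space after the colon is what keeps the 12:15:41 in the Date line from matching.

```python
import re

# Hypothetical pattern: a colon, at least one space or tab, then a number
# with an optional fraction and an optional exponent.
VALUE_AFTER_COLON = re.compile(r":[ \t]+([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)")


def extract_values(text):
    """Return the first number that follows ': ' on each line, as strings."""
    values = []
    for line in text.splitlines():
        match = VALUE_AFTER_COLON.search(line)
        if match:
            values.append(match.group(1))
    return values


# The report from the question (leading spaces are assumed, not required).
SAMPLE = """\
-- Voice report for 53. Sound T1_1001501_vowels --
Date: Tue Aug 7 12:15:41 2018
Time range of SELECTION
   From 0 to 0.696562 seconds (duration: 0.696562 seconds)
Pitch:
   Median pitch: 212.598 Hz
   Mean pitch: 211.571 Hz
   Standard deviation: 23.891 Hz
   Minimum pitch: 171.685 Hz
   Maximum pitch: 265.678 Hz
Pulses:
   Number of pulses: 126
   Number of periods: 113
   Mean period: 4.751119E-3 seconds
   Standard deviation of period: 0.539182E-3 seconds
Voicing:
   Fraction of locally unvoiced frames: 5.970% (12 / 201)
   Number of voice breaks: 1
   Degree of voice breaks: 2.692% (0.018751 seconds / 0.696562 seconds)
"""

print(",".join(extract_values(SAMPLE)))
# prints 0.696562,212.598,211.571,23.891,171.685,265.678,126,113,4.751119E-3,0.539182E-3,5.970,1,2.692
```

Because the matches are kept as strings, values such as 5.970 and 4.751119E-3 keep the spelling they have in the report.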
Maybe:
# text is assumed to hold the contents of the report as one string
for line in text.splitlines():
    line = line.strip()
    head, sepa, tail = line.partition(":")
    if sepa:
        parts = tail.split(maxsplit=1)
        if parts and all(ch.isdigit() or ch in ".eE%-+" for ch in parts[0]):
            num = parts[0].replace("%", " ")
            try:
                print(float(num.strip()))
            except ValueError:
                print("invalid number:", num)
Output:
0.696562
212.598
211.571
23.891
171.685
265.678
126.0
113.0
0.004751119
0.000539182
5.97
1.0
2.692
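Note that converting with float() changes the spelling of some values (126 becomes 126.0, 5.970 becomes 5.97). If the comma-separated string from the question is the goal, a variation on the same partition idea could keep the original token and use float() only as a validity check. This is just a sketch; values_after_colon and the shortened report are made up for illustration.

```python
def values_after_colon(text):
    """Return the numeric token after the first colon on each line, as text."""
    values = []
    for line in text.splitlines():
        _, sep, tail = line.strip().partition(":")
        parts = tail.split(maxsplit=1)
        if not sep or not parts:
            continue  # no colon, or nothing after it (e.g. "Pitch:")
        token = parts[0].rstrip("%")
        try:
            float(token)  # used only to check that the token is a number
        except ValueError:
            continue  # e.g. "Tue" on the Date line
        values.append(token)  # keep the original spelling, e.g. "5.970"
    return values


# Shortened, hypothetical excerpt of the report from the question.
report = """Date: Tue Aug 7 12:15:41 2018
Pitch:
   Median pitch: 212.598 Hz
Pulses:
   Number of pulses: 126
   Mean period: 4.751119E-3 seconds
Voicing:
   Fraction of locally unvoiced frames: 5.970% (12 / 201)
"""

print(",".join(values_after_colon(report)))
# prints 212.598,126,4.751119E-3,5.970
```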
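Thanks for the help, everyone! I actually came up with this solution:
import csv

input_file = 't2_5.txt'
input_name = input_file[:-4]  # file name without the ".txt" extension

def parse(filepath):
    data = []
    with open(filepath, 'r') as file:
        # Skip the three header lines (title, date, "Time range of SELECTION").
        file.readline()
        file.readline()
        file.readline()
        for line in file:
            # Only lines that start with a space (the indented value lines) are parsed.
            if line[0] == ' ':
                start = line.find(':') + 2
                # find() returns -1 when nothing follows the value; the slice
                # below then stops just before the trailing newline.
                end = line.find(' ', start)
                if line[end - 1] == '%':
                    end -= 1
                number = line[start:end]
                data.append(number)
    # In Python 3, csv.writer needs a text-mode file opened with newline=''.
    with open(input_name + '_output.csv', 'w', newline='') as csvfile:
        wr = csv.writer(csvfile)
        wr.writerow(data)

parse(input_file)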
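For anyone running this on Python 3: csv.writer expects a text-mode stream there, not a file opened in binary mode. A tiny sketch of just the writing step, using an in-memory buffer so the resulting row is easy to see. row_to_csv is a made-up helper name, and the three values are simply taken from the report above.

```python
import csv
import io


def row_to_csv(values):
    """Write one list of strings as a single CSV row and return the text."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(values)
    return buffer.getvalue()


# repr() makes the line terminator added by csv.writer visible.
print(repr(row_to_csv(["0.696562", "212.598", "5.970"])))
# prints '0.696562,212.598,5.970\r\n'
```

When writing to a real file instead, opening it with mode 'w' and newline='' keeps the writer's line endings from being translated a second time.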