How to add data fields in a tuple with calculations involved in Python
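The code snippet below works for the old data format, but I am now trying to read an updated datasource.txt that has an additional data field. I tried regular expressions, but could not get them to work.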
data = {}
with open('datasource.txt') as f:
    for line in f:
        parts = line.split()
        team, a, b, c = parts if len(parts) == 5 else parts[:-1] + ['($0)'] + parts[-1]
        data[team] = tuple(map(sum, zip((int(a), float(b.replace(',', '')), float(c[2:-1].replace(',', ''))), data.get(team, (0, 0, 0)))))
data = {t: (a, b, c) for a, b, c, t in reversed(sorted((a, b, c, t) for t, (a, b, c) in data.items()))}
for team, (a, b, c) in data.items():
    print(f'{team:8} {a:4} {b:,} (${c:,})')
datasource.txt
alpha 1 54,00.01 ABC DSW2S
bravo 3 500,000.00 ACDEF
charlie 1 27,722.29 (0.45) DGAS-CAS
charlie 10 252,336,733.383 (2.06) DGAS-CAS
delta 2 11 () SWSDSASS-CCSSW
echo 5 143,299.00 (1) ACS34S1
echo 8 145,300 (5.01) ACS34S1
falcon 3 0.1234 DSS2SFS3
falcon 5 9.19 DSS2SFS3
lima 6 45.00181 (.9) FGF5GGD-DDD
romeo 12 980 ASDS SSSS SDSD
Expected output:
echo 13 288,599.0 (6.01) ACS34S1
romeo 12 980.0 ($0.0) ASDS SSSS SDSD
charlie 11 252,364,455.67299998 (2.51) DGAS-CAS
falcon 8 9.3134 ($0.0) DSS2SFS3
lima 6 45.00181 (.9) FGF5GGD-DDD
bravo 3 500,000.0 ($0.0) ACDEF
delta 2 11.0 (.0) SWSDSASS-CCSSW
alpha 1 54,000.01 ($0.0) ABC DSW2S
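I think one problem is this line (taking the case len(parts) == 5):
team, a, b, c = parts
There are only 4 variables on the left-hand side, but since len(parts) is 5, parts has 5 elements. That is a mismatch, because you cannot unpack 5 values into 4 variables, so an extra variable is needed there.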
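To make the mismatch concrete, here is a minimal sketch. The sample lines are taken from datasource.txt above; the `parse_line` helper, and its choice to treat a missing or empty parenthesized amount as 0.0, are assumptions for illustration and not part of the original code. Starred unpacking collects however many trailing fields there are, so one statement can handle lines with or without the amount, as well as codes that contain spaces.

```python
def parse_line(line):
    """Split one record into (team, count, total, amount, code).

    Hypothetical helper: assumes the optional 4th field is wrapped in
    parentheses and that everything after it belongs to the code.
    """
    team, a, b, *rest = line.split()
    amount = 0.0
    if rest and rest[0].startswith('('):
        # Drop '(', ')' and '$', then thousands separators; empty means 0.
        raw = rest.pop(0).strip('($)').replace(',', '')
        amount = float(raw) if raw else 0.0
    return team, int(a), float(b.replace(',', '')), amount, ' '.join(rest)


parts = 'charlie 1 27,722.29 (0.45) DGAS-CAS'.split()
try:
    team, a, b, c = parts  # 5 values on the right, 4 names on the left
except ValueError as exc:
    print(type(exc).__name__, exc)

print(parse_line('charlie 1 27,722.29 (0.45) DGAS-CAS'))
print(parse_line('romeo 12 980 ASDS SSSS SDSD'))
```

The tuples returned by `parse_line` could then be summed per team in a dictionary, much as the question's code already does for three fields.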
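You can do this with pandas.
- First, I did some preprocessing to get the data into a stable format: converting to int/float, adding a zero amount, ($0), where it is missing, joining the values of the last column, and so on.
- Then I used pandas groupby to sum.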
import pandas as pd
dl = []
with open('datasource.txt') as f:
    for line in f:
        parts = line.split()
        # Clean the data here: conversions to int/float etc.
        # Insert a zero amount when the 4th field is not of the form ($...).
        if not parts[3].startswith('($'):
            parts.insert(3, '0')
        # Join a code that contains spaces back into a single field.
        if len(parts) > 5:
            temp = ' '.join(parts[4:])
            parts = parts[:4] + [temp]
        parts[1] = int(parts[1])
        parts[2] = float(parts[2].replace(',', ''))
        parts[3] = float(parts[3].strip('($)'))
        dl.append(parts)
headers = ['col1', 'col2', 'col3', 'col4', 'col5']
df = pd.DataFrame(dl, columns=headers)
# Sum the numeric columns per team/code pair, then sort by count.
df = df.groupby(['col1', 'col5']).sum().reset_index()
df = df.sort_values('col2', ascending=False)
df['col4'] = '($' + df['col4'].astype(str) + ')'
df = df[headers]
print(df)
      col1  col2          col3    col4            col5
4     echo    13  2.885990e+05  (6.01)         ACS34S1
7    romeo    12  9.800000e+02  ($0.0)  ASDS SSSS SDSD
2  charlie    11  2.523645e+08  (2.51)        DGAS-CAS
5   falcon     8  9.313400e+00  ($0.0)        DSS2SFS3
6     lima     6  4.500181e+01    (.9)     FGF5GGD-DDD
1    bravo     3  5.000000e+05  ($0.0)           ACDEF
3    delta     2  1.100000e+01    (.0)  SWSDSASS-CCSSW
0    alpha     1  5.400010e+03  ($0.0)       ABC DSW2S
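The frame above prints col3 in scientific notation, whereas the expected output in the question uses thousands separators. One possible way to get closer is to turn the float column into strings with Python's `,` format specifier. This is only a sketch on a small hand-built frame: the two rows copy values from the output above, and the `formatted` name is an assumption.

```python
import pandas as pd

# Two rows copied from the aggregated output above, for illustration only.
df = pd.DataFrame({
    'col1': ['echo', 'romeo'],
    'col2': [13, 12],
    'col3': [288599.0, 980.0],
})

formatted = df.copy()
# '{:,}' inserts thousands separators; the column now holds strings,
# so do this only as a final display step, after all arithmetic.
formatted['col3'] = formatted['col3'].map('{:,}'.format)
print(formatted.to_string(index=False))
```

Because the formatted column is no longer numeric, any sorting or summing has to happen before this step.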