A better way of cleaning lists in Python
I extracted some text with Beautiful Soup. Unfortunately, all of the text was inside one element, so I saved it as a list.
Example:
souplist = [
[' Date', '', ' Fri 30th Apr 2021', '', ' 60084096-1', 'Type', '', ' Staff Travel (Rail)', '', 'Description', '', '', ' Stratford International', 'Amount', '', '', '', '', ' £25.10 Paid '],
[' Date', '', ' Tue 27th Apr 2021', '', ' 60084096-3', 'Type', '', ' Office Costs (Stationery & printing)', '', 'Description', '', '', ' AMAZON.CO.UK [***]', 'Amount', '', '', '', '', ' £42.98 Paid '],
[' Date', '', ' Tue 1st Dec 2020', '', ' 90012371-0', 'Type', '', ' Office Costs (Rent)', '', ' Amount', '', '', '', '', ' £3,500.00 Paid '],
[' Date', '', ' Wed 14th Oct 2020', '', ' 60064831-1', 'Type', '', ' Office Costs (Software & applications)', '', 'Description', '', '', ' MAILCHIMP', 'MISC', 'Amount', '', '', '', '', ' £38.13 Paid ']
]
I want to turn this into a DataFrame with the columns date, id, type, description and amount.
I tried a for loop like this:
claims = {'id': [], 'date': [], 'type': [], 'description': [], 'amount': []}
for i in range(len(souplist)):
    # fixed positions: these only hold for rows laid out exactly like the first one
    claims['id'].append(souplist[i][4])
    claims['date'].append(souplist[i][2])
    claims['type'].append(souplist[i][7])
    claims['description'].append(souplist[i][12])
    claims['amount'].append(souplist[i][18])
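However, in the larger data set the positions within each list change, and not all of the lists are the same length, so fixed indexes like these break. I'm not sure how to clean the lists.
Could you suggest a better approach?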
TL;DR use this:
souplist = [[val.strip() for val in lst if val] for lst in souplist]
claims = {'id': [], 'date': [], 'type': [], 'description': [], 'amount': []}
for lst in souplist:
    for key in claims:
        try:
            claims[key].append(lst[lst.index(key.title()) + 1])
        except ValueError:
            if key == 'id':
                claims[key].append(lst[2])
            else:
                claims[key].append(None)
Bonus points:
If you're going to use pandas (a great library!), you may want to convert the date column to a datetime column, index on the id column, and convert your amount to a number:
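import pandas as pd

df = pd.DataFrame(claims).set_index('id')
df.date = pd.to_datetime(df.date)
# dropping the '.' turns '£25.10 Paid' into 2510 (pence); this relies on every amount having two decimals
df.amount = df.amount.str.replace(r'£|( Paid)|\.|,', '', regex=True).astype(int)
# treat your amount as an integer to get exact maths; then /100 whenever printing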
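The same two conversions can be sanity-checked without pandas. The following is only a sketch using the standard library; the helper names `parse_claim_date` and `amount_to_pence` are made up for illustration, and it assumes dates always look like 'Fri 30th Apr 2021' (English day and month abbreviations) and amounts like '£25.10 Paid'.

```python
import re
from datetime import datetime
from decimal import Decimal


def parse_claim_date(text):
    """Parse e.g. 'Fri 30th Apr 2021' into a datetime.date (hypothetical helper)."""
    # drop the ordinal suffix (st/nd/rd/th) that follows the day number
    without_suffix = re.sub(r"(?<=\d)(st|nd|rd|th)\b", "", text.strip())
    return datetime.strptime(without_suffix, "%a %d %b %Y").date()


def amount_to_pence(text):
    """Convert e.g. '£3,500.00 Paid' into integer pence (hypothetical helper)."""
    # remove the currency sign, thousands separators and the 'Paid' marker
    number = re.sub(r"[£,]|Paid", "", text).strip()
    # Decimal keeps the arithmetic exact, unlike float
    return int(Decimal(number) * 100)


print(parse_claim_date("Tue 1st Dec 2020"))
print(amount_to_pence("£3,500.00 Paid"))
```

Going through `Decimal` rather than deleting the '.' means an amount written with a different number of decimal places would still be scaled correctly.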
souplist = [[val.strip() for val in lst if val] for lst in souplist]
removes all empty strings ('') and strips the surrounding whitespace from the non-empty strings in each list contained in souplist.
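To see why the stripping matters, here is a small illustration on the third row of the example data (the variable names `row` and `cleaned` are only for this sketch). In the raw row the label is ' Amount' with a leading space, so a lookup for 'Amount' only succeeds after cleaning.

```python
row = [' Date', '', ' Tue 1st Dec 2020', '', ' 90012371-0', 'Type', '',
       ' Office Costs (Rent)', '', ' Amount', '', '', '', '', ' £3,500.00 Paid ']

# same comprehension as above, applied to a single row
cleaned = [val.strip() for val in row if val]

print('Amount' in row)      # the raw label still carries a leading space
print('Amount' in cleaned)  # after stripping, the label matches exactly
print(cleaned)
```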
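Some of your expected fields appear to be labelled; for example, 'Date' is the element immediately before the value you want. You can take advantage of this: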
claims = {'id': [], 'date': [], 'type': [], 'description': [], 'amount': []}
for lst in souplist:
    for key in claims:
        try:
            # using key.title() to translate e.g. 'amount' -> 'Amount'
            # get the field from lst immediately after the label matching key.title()
            claims[key].append(lst[lst.index(key.title()) + 1])
        except ValueError:
            if key == 'id':
                # would be better to change code that generates `souplist`
                # so this doesn't break
                claims[key].append(lst[2])
            else:
                # the key doesn't exist as a label in the soup list
                # append None so that all claims key-lists are same length
                claims[key].append(None)
This gives:
>>> claims
{'id': ['60084096-1',
        '60084096-3',
        '90012371-0',
        '60064831-1',
       ],
 'date': ['Fri 30th Apr 2021',
          'Tue 27th Apr 2021',
          'Tue 1st Dec 2020',
          'Wed 14th Oct 2020',
         ],
 'type': ['Staff Travel (Rail)',
          'Office Costs (Stationery & printing)',
          'Office Costs (Rent)',
          'Office Costs (Software & applications)',
         ],
 'description': ['Stratford International',
                 'AMAZON.CO.UK [***]',
                 None,
                 'MAILCHIMP',
                ],
 'amount': ['£25.10 Paid',
            '£42.98 Paid',
            '£3,500.00 Paid',
            '£38.13 Paid',
           ],
}
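If you would rather build one record per claim than five parallel lists, the same label lookup can be wrapped in a function. This is only a sketch: `clean_row` and `extract_claim` are hypothetical names, and it assumes the id is always the element right after the date value, which is true of the sample rows but may not hold for your full data set.

```python
FIELDS = ("date", "type", "description", "amount")
LABELS = {field.title() for field in FIELDS}


def clean_row(row):
    # strip whitespace and drop empty (or whitespace-only) strings
    return [val.strip() for val in row if val.strip()]


def value_at(cleaned, pos):
    # return the element at pos unless it is out of range or is itself a label
    if pos < len(cleaned) and cleaned[pos] not in LABELS:
        return cleaned[pos]
    return None


def extract_claim(row):
    cleaned = clean_row(row)
    claim = {"id": None}
    for field in FIELDS:
        label = field.title()
        # the value is taken to be the element right after its label
        claim[field] = value_at(cleaned, cleaned.index(label) + 1) if label in cleaned else None
    if "Date" in cleaned:
        # assumption: the unlabelled id sits two places after the 'Date' label
        claim["id"] = value_at(cleaned, cleaned.index("Date") + 2)
    return claim


rent_row = [' Date', '', ' Tue 1st Dec 2020', '', ' 90012371-0', 'Type', '',
            ' Office Costs (Rent)', '', ' Amount', '', '', '', '', ' £3,500.00 Paid ']
print(extract_claim(rent_row))
```

A list of such dictionaries, for example `[extract_claim(row) for row in souplist]`, can be passed straight to `pd.DataFrame`, and a missing label simply shows up as None in that row.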