A better way of cleaning lists in Python

I extracted some text with Beautiful Soup. Unfortunately, all of the text was in a single element, which I saved as a list.

Example:

souplist = [
['   Date', '', '  Fri 30th Apr 2021', '', ' 60084096-1', 'Type', '', '  Staff Travel (Rail)', '', 'Description', '', '', '  Stratford International', 'Amount', '', '', '', '', '   £25.10 Paid  '],

['   Date', '', '  Tue 27th Apr 2021', '', ' 60084096-3', 'Type', '', '  Office Costs (Stationery & printing)', '', 'Description', '', '', '  AMAZON.CO.UK [***]', 'Amount', '', '', '', '', '   £42.98 Paid  '],
['   Date', '', '  Tue 1st Dec 2020', '', ' 90012371-0', 'Type', '', '  Office Costs (Rent)', '', '  Amount', '', '', '', '', '   £3,500.00 Paid  '],
['   Date', '', '  Wed 14th Oct 2020', '', ' 60064831-1', 'Type', '', '  Office Costs (Software & applications)', '', 'Description', '', '', '  MAILCHIMP', 'MISC', 'Amount', '', '', '', '', '   £38.13 Paid  ']
]

I want to turn this into a DataFrame with the columns date, id, type, description, and amount.

I tried a for loop, like this:

claims = {'id': [], 'date': [], 'type': [], 'description': [], 'amount': []}

# fixed positions: this breaks as soon as a row is shorter or its fields shift
for i in range(len(souplist)):
    claims['id'].append(souplist[i][4])
    claims['date'].append(souplist[i][2])
    claims['type'].append(souplist[i][7])
    claims['description'].append(souplist[i][12])
    claims['amount'].append(souplist[i][18])

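However, in the larger dataset the positions within each list change, and not all of the lists are the same length. I'm not sure how to clean the lists. Could you suggest a better approach?

TL;DR use this: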

souplist = [[val.strip() for val in lst if val] for lst in souplist]
claims = {'id':[],'date':[],'type':[],'description':[],'amount':[]}

for lst in souplist:
    for key in claims:
        try:
            claims[key].append(lst[lst.index(key.title())+1])
        except ValueError:
            if key == 'id':
                claims[key].append(lst[2])
            else:
                claims[key].append(None)

Bonus points: if you're going to use pandas (a great library!), you probably want to convert the date column to a datetime column, index on the id column, and convert your amount to a number:

import pandas as pd

df = pd.DataFrame(claims).set_index('id')
df.date = pd.to_datetime(df.date)
df.amount = df.amount.str.replace(r'£|( Paid)|\.|,', '', regex=True).astype(int)
# treat your amount as an integer (pence) to get exact maths; then /100 whenever printing
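To make the two conversions easier to inspect, here is a rough standard-library sketch of the same idea applied to single strings. The helper names `parse_amount_pence` and `parse_claim_date` are made up for illustration. It assumes every amount has exactly two decimal places and that day and month names are the English abbreviations shown in the question.

```python
import re
from datetime import datetime

def parse_amount_pence(text):
    """'£3,500.00 Paid' -> 350000 (integer pence)."""
    # Same pattern as the pandas call: drop '£', ' Paid', '.' and ','.
    # Assumes exactly two decimal places, otherwise the scale is wrong.
    return int(re.sub(r'£|( Paid)|\.|,', '', text))

def parse_claim_date(text):
    """'Fri 30th Apr 2021' -> datetime(2021, 4, 30)."""
    # Remove the ordinal suffix (st/nd/rd/th) that follows the day number,
    # then parse with an explicit format (English day/month abbreviations).
    no_suffix = re.sub(r'(?<=\d)(st|nd|rd|th)\b', '', text)
    return datetime.strptime(no_suffix, '%a %d %b %Y')

pence = parse_amount_pence('£3,500.00 Paid')
print(pence)                                        # 350000
print(f'£{pence / 100:,.2f}')                       # £3,500.00
print(parse_claim_date('Tue 1st Dec 2020').date())  # 2020-12-01
```

Keeping the amount as integer pence avoids floating-point rounding when summing; dividing by 100 is only needed for display.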

souplist = [[val.strip() for val in lst if val] for lst in souplist]

This removes every empty string ('') and strips the whitespace from the non-empty strings in each list contained in souplist.
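To see the effect on a single row, here is a small sketch using a shortened copy of the first row from the question. The whitespace-only entry and the second variant are additions for illustration, not part of the original data or answer: they show that `if val` alone keeps a string such as '   ' (which then strips to ''), so stripping before filtering may be safer if the larger dataset contains such entries.

```python
# Shortened copy of the first row from the question, plus one
# whitespace-only entry ('   ') added here purely for illustration.
row = ['   Date', '', '  Fri 30th Apr 2021', '   ', ' 60084096-1', 'Type']

# Original approach: drop empty strings, then strip the survivors.
# The whitespace-only entry passes the `if val` test and becomes ''.
cleaned = [val.strip() for val in row if val]
print(cleaned)
# ['Date', 'Fri 30th Apr 2021', '', '60084096-1', 'Type']

# Variant: strip first, then drop anything that ended up empty.
cleaned_strict = [s for s in (val.strip() for val in row) if s]
print(cleaned_strict)
# ['Date', 'Fri 30th Apr 2021', '60084096-1', 'Type']
```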

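Some of your expected fields appear to be labelled; for example, 'Date' is the element immediately before the value you want. You can use this to your advantage: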

claims = {'id':[],'date':[],'type':[],'description':[],'amount':[]}

for lst in souplist:
    for key in claims:
        try:
            # using key.title() to translate e.g. 'amount' -> 'Amount'
            # get the field from lst immediately after the label matching key.title()
            claims[key].append(lst[lst.index(key.title())+1])
        except ValueError:
            if key == 'id':
                # would be better to change code that generates `souplist`
                # so this doesn't break
                claims[key].append(lst[2])
            else:
                # the key doesn't exist as a label in the soup list
                # append None so that all claims key-lists are same length
                claims[key].append(None)

This gives:

>>> claims
{'id':          ['60084096-1',
                 '60084096-3',
                 '90012371-0',
                 '60064831-1',
                ],
 'date':        ['Fri 30th Apr 2021',
                 'Tue 27th Apr 2021',
                 'Tue 1st Dec 2020',
                 'Wed 14th Oct 2020',
                ],
 'type':        ['Staff Travel (Rail)',
                 'Office Costs (Stationery & printing)',
                 'Office Costs (Rent)',
                 'Office Costs (Software & applications)',
                ],
 'description': ['Stratford International',
                 'AMAZON.CO.UK [***]',
                 None,
                 'MAILCHIMP',
                ],
 'amount':      ['£25.10 Paid',
                 '£42.98 Paid',
                 '£3,500.00 Paid',
                 '£38.13 Paid',
                ],
}
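
If the rows get messier, the label lookup could be wrapped in a function that returns one dict per row. The following is only a sketch under stated assumptions: `extract_claim`, `LABELS`, and `ID_PATTERN` are hypothetical names, and finding the id with a regular expression assumes every id looks like digits, a hyphen, then digits (as in the four sample rows). It also guards against two cases the loop above does not: a label that is the last element of a row, and a label followed directly by another label.

```python
import re

LABELS = ('Date', 'Type', 'Description', 'Amount')
# Assumption: ids always look like digits, a hyphen, then digits.
ID_PATTERN = re.compile(r'\d+-\d+')

def extract_claim(raw_row):
    """Turn one raw row into a dict; missing fields become None."""
    # Strip every entry, then drop the ones that are empty.
    row = [s for s in (val.strip() for val in raw_row) if s]
    # The id has no label, so pick the first entry that looks like one.
    claim = {'id': next((s for s in row if ID_PATTERN.fullmatch(s)), None)}
    for label in LABELS:
        value = None
        if label in row:
            pos = row.index(label) + 1
            # Guard against a label at the very end of the row, or a
            # label followed directly by the next label (empty field).
            if pos < len(row) and row[pos] not in LABELS:
                value = row[pos]
        claim[label.lower()] = value
    return claim

# Rows 3 and 4 from the question: one has no description, one has an
# extra 'MISC' entry after the description.
rows = [
    ['   Date', '', '  Tue 1st Dec 2020', '', ' 90012371-0', 'Type', '',
     '  Office Costs (Rent)', '', '  Amount', '', '', '', '',
     '   £3,500.00 Paid  '],
    ['   Date', '', '  Wed 14th Oct 2020', '', ' 60064831-1', 'Type', '',
     '  Office Costs (Software & applications)', '', 'Description', '', '',
     '  MAILCHIMP', 'MISC', 'Amount', '', '', '', '', '   £38.13 Paid  '],
]
records = [extract_claim(r) for r in rows]
for record in records:
    print(record)
```

A list of dicts like `records` can be passed directly to `pd.DataFrame(records)`, which yields the same columns as the `claims` dict of lists.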