需要将一列数据拆分为 Pandas 数据框中的不同列

Need to split one Columns data into different columns in Pandas Data frame

我有一个 Csv 文件,其中合并了列,pandas 数据框以相同的方式显示它。需要根据需要拆分列 out put

Csv 样本

我当前来自 csv 文件的输入是:

"Date InformationIdNo.","Date out","Dr.","Cr."
"01 FEB Mart Purchase MATRSC203255H","30 DEC 21","-3,535.61","0","250 - PQRT14225","","",""
"01 FEB Cash Sales CCTR220307AXCDV","30 DEC 21","-34.33","0","20000 - DEFG12","","",""
"01 FEB TransferFT22032FQWE3","01 FEB 21","0","7,436.93","","","",""

还需要用第0个索引的信息列索引1

所需输出:

|   |  Date     |  Information                  | IdNo.           | Date out  | Dr.       | Cr.       | Balance    |
|---|-----------|-------------------------------|-----------------|-----------|-----------|-----------|------------|
| 0 | 01 FEB 21 | Mart Purchase 250 - PQRT14225 |  MATRSC203255H  | 30 DEC 21 | -3,535.61 |  0        | -3,978.61  |
| 1 | 01 FEB 21 | Cash Sales 20000 - DEFG1220   |  MATRSC203255H  | 30 DEC 21 | -34.33    |  0        | -3,944.29  |
| 2 | 01 FEB 21 | Transfer                      |  FT22032FQWE3   | 01 FEB 21 | 0         |  7,426.93 |  3,482.65  |

Input CSV file Screenshot

CSV file when opened in notepad

Output required

相信下面的代码思路已经很清晰了。首先,我们需要将 csv 文件中的数据更正为有效的 csv(逗号分隔值)格式。之后我们可以创建数据框。

'data.csv' 文件内容

"Date InformationIdNo."," Date out ","Dr."," Cr."
"01 FEB 21 Mart Purchase MATRSC203255H","30 DEC 21","-3,535.61","0"
"250 - PQRT14225","","",""
"01 FEB 21 Cash Sales CCTR220307AXCDV","30 DEC 21","-34.33","0"
"20000 - DEFG12","","",""
"01 FEB 21 TransferFT22032FQWE3"," 01 FEB 21","0","7,426.93"
"","","",""
"","","",""
"","","",""
"","","",""
"","","",""

可能的(快速)解决方案如下:

#pip install pandas

import re
import pandas as pd
from io import StringIO

with open("data.csv", "r", encoding='utf-8') as file:
    raw_data = file.read()
    
# convert txt to valid csv (comma separated values) format
raw_data = raw_data.replace(' - ', '-')
raw_data = raw_data.replace('Date InformationIdNo.', 'Date","Information","IdNo.')
raw_data = raw_data.replace('" Cr."', '"Cr","Information_add"')
raw_data = re.sub('(\d{2} [A-Z]{3} \d{2})', r'","', raw_data)
raw_data = re.sub('\n"([A-Z0-9-]+)","","",""\n', r',""\n', raw_data)
raw_data = re.sub(r',""{2,}', '', raw_data)
raw_data = re.sub('([A-Z0-9]{3,}",")', r'","","', raw_data)
raw_data = re.sub(',""+', r'', raw_data)
raw_data = re.sub('\n""+', r'', raw_data)

# # create dataframe and replace NaN with ""
df = pd.read_csv(StringIO(raw_data), sep=",")
df.fillna("", inplace=True)

# merge columns and drop temporary column
df['Information'] = df['Information'] + df['Information_add']
df.drop(['Information_add'], axis=1, inplace=True)

# cleanup column headers
df.columns = [name.strip() for name in df.columns]

# convert date to datetime format
df['Date'] = pd.to_datetime(df['Date'].str.title().str.strip(), format="%d %b %y", dayfirst=True)
df['Date out'] = pd.to_datetime(df['Date out'].str.title().str.strip(), format="%d %b %y", dayfirst=True)

df

Returns