Need to split one column's data into different columns in a pandas DataFrame
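I have a CSV file in which several columns are merged into one, and the pandas DataFrame shows it the same way. I need to split the columns to match the required output.
CSV sample
My current input from the CSV file is: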
"Date InformationIdNo.","Date out","Dr.","Cr."
"01 FEB Mart Purchase MATRSC203255H","30 DEC 21","-3,535.61","0","250 - PQRT14225","","",""
"01 FEB Cash Sales CCTR220307AXCDV","30 DEC 21","-34.33","0","20000 - DEFG12","","",""
"01 FEB TransferFT22032FQWE3","01 FEB 21","0","7,436.93","","","",""
The text at index 1 also needs to be appended to the Information column of the row at index 0.
Required output:
| | Date | Information | IdNo. | Date out | Dr. | Cr. | Balance |
|---|-----------|-------------------------------|-----------------|-----------|-----------|-----------|------------|
| 0 | 01 FEB 21 | Mart Purchase 250 - PQRT14225 | MATRSC203255H | 30 DEC 21 | -3,535.61 | 0 | -3,978.61 |
| 1 | 01 FEB 21 | Cash Sales 20000 - DEFG1220 | MATRSC203255H | 30 DEC 21 | -34.33 | 0 | -3,944.29 |
| 2 | 01 FEB 21 | Transfer | FT22032FQWE3 | 01 FEB 21 | 0 | 7,426.93 | 3,482.65 |
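I hope the idea behind the code below is clear enough. First, the data in the CSV file has to be corrected into valid CSV (comma-separated values) format. After that, we can create the DataFrame.
Contents of the 'data.csv' file: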
"Date InformationIdNo."," Date out ","Dr."," Cr."
"01 FEB 21 Mart Purchase MATRSC203255H","30 DEC 21","-3,535.61","0"
"250 - PQRT14225","","",""
"01 FEB 21 Cash Sales CCTR220307AXCDV","30 DEC 21","-34.33","0"
"20000 - DEFG12","","",""
"01 FEB 21 TransferFT22032FQWE3"," 01 FEB 21","0","7,426.93"
"","","",""
"","","",""
"","","",""
"","","",""
"","","",""
A possible (quick) solution is as follows:
#pip install pandas
import re
import pandas as pd
from io import StringIO
with open("data.csv", "r", encoding='utf-8') as file:
    raw_data = file.read()
# convert txt to valid csv (comma separated values) format
raw_data = raw_data.replace(' - ', '-')
raw_data = raw_data.replace('Date InformationIdNo.', 'Date","Information","IdNo.')
raw_data = raw_data.replace('" Cr."', '"Cr","Information_add"')
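# close the field after every date so the date becomes its own column
raw_data = re.sub(r'(\d{2} [A-Z]{3} \d{2})', r'\1","', raw_data)
# append each continuation line to the previous row as 'Information_add'
raw_data = re.sub(r'\n"([A-Z0-9-]+)","","",""\n', r',"\1"\n', raw_data)
raw_data = re.sub(r',""{2,}', '', raw_data)
# split the trailing ID off the information text
raw_data = re.sub(r'([A-Z0-9]{3,}",")', r'","\1', raw_data)
# drop empty fields and empty rows
raw_data = re.sub(r',""+', '', raw_data)
raw_data = re.sub(r'\n""+', '', raw_data)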
# create dataframe and replace NaN with ""
df = pd.read_csv(StringIO(raw_data), sep=",")
df.fillna("", inplace=True)
# merge columns and drop temporary column
df['Information'] = df['Information'] + df['Information_add']
df.drop(['Information_add'], axis=1, inplace=True)
# cleanup column headers
df.columns = [name.strip() for name in df.columns]
# convert date to datetime format
df['Date'] = pd.to_datetime(df['Date'].str.title().str.strip(), format="%d %b %y", dayfirst=True)
df['Date out'] = pd.to_datetime(df['Date out'].str.title().str.strip(), format="%d %b %y", dayfirst=True)
df
This returns a DataFrame with the columns Date, Information, IdNo., Date out, Dr. and Cr.
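If the chain of regex substitutions is hard to follow, the same idea can be sketched row by row with the standard `csv` module. This is only a sketch under assumptions, not the method used above: it assumes the merged first field always has the form date, free text, then an ID of at least three capital letters or digits at the very end, and that any non-empty line without a leading date continues the previous row. The names `RAW`, `MERGED` and `parse_rows` are made up for this example, and `RAW` is the 'data.csv' content from above with the empty padding rows shortened.

```python
import csv
import re
from io import StringIO

# Sample copied from the 'data.csv' contents above (padding rows shortened).
RAW = '''"Date InformationIdNo."," Date out ","Dr."," Cr."
"01 FEB 21 Mart Purchase MATRSC203255H","30 DEC 21","-3,535.61","0"
"250 - PQRT14225","","",""
"01 FEB 21 Cash Sales CCTR220307AXCDV","30 DEC 21","-34.33","0"
"20000 - DEFG12","","",""
"01 FEB 21 TransferFT22032FQWE3"," 01 FEB 21","0","7,426.93"
"","","",""
'''

# Assumed layout of the merged field: date, free text, trailing ID.
MERGED = re.compile(
    r"^(?P<date>\d{2} [A-Z]{3} \d{2})\s*(?P<info>.*?)\s*(?P<id>[A-Z0-9]{3,})$"
)


def parse_rows(text):
    """Return one dict per transaction, folding continuation lines into Information."""
    reader = csv.reader(StringIO(text))
    next(reader)  # skip the merged header row
    records = []
    for row in reader:
        first = row[0].strip() if row else ""
        if not first:
            continue  # empty padding row
        match = MERGED.match(first)
        if match:
            records.append({
                "Date": match["date"],
                "Information": match["info"],
                "IdNo.": match["id"],
                "Date out": row[1].strip(),
                "Dr.": row[2].strip(),
                "Cr.": row[3].strip(),
            })
        elif records:
            # No leading date: treat the line as extra Information text.
            records[-1]["Information"] += " " + first
    return records


records = parse_rows(RAW)
for record in records:
    print(record)
```

The list of dicts can then be handed to `pd.DataFrame(records)`, and the date columns converted with `pd.to_datetime` as in the code above. Note that the ID split is a heuristic: an Information text that itself ends in capital letters or digits directly before the ID would be split in the wrong place.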