Merging cells in the same column of the same DataFrame (Python)
I am trying to merge two cells together. Every unit under 'Chassis' should be an alphanumeric ID (ABCD123456), but the PO I am given occasionally pushes the last digit onto the next row (a row with no other data on it), so the data ends up looking like the first table below.

I initially tried to write a statement that looked at a cell, confirmed it was one digit short, then looked at the next cell and merged the two. I never got that to come close to producing any results. I then decided to duplicate the data frame, shift the second data frame up (so the missing digit is on the same row), and merge the two together. This is where I am now, and it raises an error. This is my first real code in Python, so I am fairly sure I am doing things inefficiently; please let me know where I can improve.
Currently I have this...
Col1 | Chassis | Other Columns... | Other Columns 2... |
---|---|---|---|
Nan | ABCD12345 | ABC | 123 |
Nan | 6 | Nan | Nan |
Nan | WXYZ987654 | GHI | 456 |
Nan | QRSTU654987 | Nan | 789 |
Nan | MNOP999999 | XYZ | Nan |
The end goal is this...
Col1 | Chassis | Other Columns... | Other Columns 2... |
---|---|---|---|
Nan | ABCD123456 | ABC | 123 |
Nan | WXYZ987654 | GHI | 456 |
Nan | QRSTU654987 | Nan | 789 |
Nan | MNOP999999 | XYZ | Nan |
import PyPDF2 as pdf2
import tabula as tb
import pandas as pd
import re
import csv
import os
os.listdir()
pd.set_option('display.max_columns', None)
#bring in pdf, remove first page, convert to csv
PO = 'PO.pdf'
pages = open(PO, 'rb')
readPDF = pdf2.PdfFileReader(pages)
totalpages = readPDF.numPages
x = '2-' + str(totalpages)
POCSV = tb.convert_into(PO, 'POCSV.csv', output_format = 'csv', pages = x)
#Convert column to string, create second data frame, shift said data frame up 1
df = pd.read_csv('POCSV.csv')
df['Chassis'] = df['Chassis'].astype(str)
dfshift = df.shift(-1)
dfshift.rename(columns=({'Chassis': 'Chassis Shifted'}), inplace = True,)
dfMerged = pd.concat([df, dfshift], axis=1)
#For each row combine rows, create new column
for ind, row in df.iterrows():
    dfMerged.loc[ind, 'Complete Chassis'] = row['Chassis'] + row["Chassis Shifted"]
print(dfMerged['Complete Chassis'])
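Judging from the code alone, a likely cause of the failure is that the loop iterates over `df`, while the 'Chassis Shifted' column only exists on `dfMerged`, so `row["Chassis Shifted"]` would raise a KeyError. The shift idea itself can work without a loop. Below is a minimal sketch on the sample data, assuming a spilled fragment is a row where every column except 'Chassis' holds the 'Nan' placeholder, and that a value spills onto at most one extra row. The names `is_fragment` and `next_chassis` are made up for this illustration.

```python
import pandas as pd

# Sample data mirroring the question; the string 'Nan' marks an empty cell.
df = pd.DataFrame({
    'Col1': ['Nan'] * 5,
    'Chassis': ['ABCD12345', '6', 'WXYZ987654', 'QRSTU654987', 'MNOP999999'],
    'Other Columns': ['ABC', 'Nan', 'GHI', 'Nan', 'XYZ'],
    'Other Columns 2': ['123', 'Nan', '456', '789', 'Nan'],
})

# Assumption: a row is a spilled fragment when every other column is 'Nan'.
others = df.columns.difference(['Chassis'])
is_fragment = df[others].eq('Nan').all(axis=1)

# Pull the next row's Chassis up one place, but keep it only where that
# next row is a fragment; everywhere else append an empty string.
next_is_fragment = is_fragment.shift(-1, fill_value=False)
next_chassis = df['Chassis'].shift(-1).where(next_is_fragment, '')

df['Complete Chassis'] = df['Chassis'] + next_chassis

# Drop the fragment rows now that their digit has been appended.
result = df.loc[~is_fragment].reset_index(drop=True)
print(result['Complete Chassis'].tolist())
# ['ABCD123456', 'WXYZ987654', 'QRSTU654987', 'MNOP999999']
```

If a value could spill over more than one row, this pairwise shift is not enough, and grouping consecutive rows is the safer choice.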
Create a virtual group for the Chassis column and merge the rows of each group:
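# Convert the 'Nan' placeholder strings to pd.NA
df = df.replace('Nan', pd.NA)
# Every column except 'Chassis'
cols = df.columns.difference(['Chassis'])
# True for rows that hold data in at least one other column
m = df[cols].any(axis=1)
# m.cumsum() gives each orphan row the group id of the row above it;
# summing the strings in a group concatenates the Chassis fragments
df = df.assign(Chassis=df.groupby(m.cumsum())['Chassis'] \
                         .transform('sum')).loc[m].reset_index(drop=True)
print(df)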
# Output
   Col1      Chassis  Other Columns  Other Columns 2
0  <NA>   ABCD123456            ABC              123
1  <NA>   WXYZ987654            GHI              456
2  <NA>  QRSTU654987           <NA>              789
3  <NA>   MNOP999999            XYZ             <NA>
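To see why the grouping works, it can help to print the intermediate values. The sketch below rebuilds the sample frame from the question and walks through the same steps one at a time. Two things here are my own substitutions rather than part of the answer: the mask uses `notna().any(axis=1)`, and the fragments are joined with `''.join` instead of `'sum'`. The names `has_data` and `group_id` are illustrative.

```python
import pandas as pd

# Sample data mirroring the question; 'Nan' strings become real missing values.
df = pd.DataFrame({
    'Col1': ['Nan'] * 5,
    'Chassis': ['ABCD12345', '6', 'WXYZ987654', 'QRSTU654987', 'MNOP999999'],
    'Other Columns': ['ABC', 'Nan', 'GHI', 'Nan', 'XYZ'],
    'Other Columns 2': ['123', 'Nan', '456', '789', 'Nan'],
}).replace('Nan', pd.NA)

# True where a row carries data in any column other than 'Chassis'.
others = df.columns.difference(['Chassis'])
has_data = df[others].notna().any(axis=1)

# Each True starts a new group; an orphan row (False) keeps the previous id.
group_id = has_data.cumsum()

# Join the Chassis fragments within each group and broadcast the result
# back to every row, then keep only the rows that carried real data.
joined = df.groupby(group_id)['Chassis'].transform(''.join)
result = df.assign(Chassis=joined).loc[has_data].reset_index(drop=True)

print(has_data.tolist())           # [True, False, True, True, True]
print(group_id.tolist())           # [1, 1, 2, 3, 4]
print(result['Chassis'].tolist())
# ['ABCD123456', 'WXYZ987654', 'QRSTU654987', 'MNOP999999']
```

Rows 0 and 1 share group id 1, which is why the stray '6' ends up appended to 'ABCD12345'. Because the grouping is cumulative, a value spilling over several consecutive rows would be joined the same way.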
Setup:
import pandas as pd
import numpy as np
data = {'Col1': ['Nan', 'Nan', 'Nan', 'Nan', 'Nan'],
        'Chassis': ['ABCD12345', '6', 'WXYZ987654', 'QRSTU654987', 'MNOP999999'],
        'Other Columns': ['ABC', 'Nan', 'GHI', 'Nan', 'XYZ'],
        'Other Columns 2': ['123', 'Nan', '456', '789', 'Nan']}
df = pd.DataFrame(data)
print(df)
# Output
   Col1      Chassis  Other Columns  Other Columns 2
0   Nan    ABCD12345            ABC              123
1   Nan            6            Nan              Nan
2   Nan   WXYZ987654            GHI              456
3   Nan  QRSTU654987            Nan              789
4   Nan   MNOP999999            XYZ              Nan