Update row index when all columns of the next row are NaN in a Pandas DataFrame
I have a Pandas DataFrame extracted from a PDF with tabula-py.
The PDF looks like this:
+--------------+--------+-------+
| name | letter | value |
+--------------+--------+-------+
| A short name | a | 1 |
+-------------------------------+
| Another | b | 2 |
+-------------------------------+
| A very large | c | 3 |
| name | | |
+-------------------------------+
| other one | d | 4 |
+-------------------------------+
| My name is | e | 5 |
| big | | |
+--------------+--------+-------+
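As you can see, A very large name has a line break and, since the original PDF has no borders, the DataFrame ends up with one row ['A very large', 'c', 3] and another row ['name', NaN, NaN], when I want only a single row with the content ['A very large name', 'c', 3].
The same happens with My name is big.
Since this happens in several rows, what I'm trying to achieve is to concatenate the content of the name cell with that of the previous row whenever the rest of the cells in the row are NaN, and then delete the NaN rows.
Any other strategy that gets the same result is welcome, though.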
import pandas as pd
import numpy as np
data = {
"name": ["A short name", "Another", "A very large", "name", "other one", "My name is", "big"],
"letter": ["a", "b", "c", np.nan, "d", "e", np.nan],
"value": [1, 2, 3, np.nan, 4, 5, np.nan],
}
df = pd.DataFrame(data)
data_expected = {
"name": ["A short name", "Another", "A very large name", "other one", "My name is big"],
"letter": ["a", "b", "c", "d", "e"],
"value": [1, 2, 3, 4, 5],
}
df_expected = pd.DataFrame(data_expected)
I'm trying code like this, but it doesn't work:
# Doesn't work, and not very `pandastonic`
nan_indexes = df[df.iloc[:, 1:].isna().all(axis='columns')].index
df.loc[nan_indexes - 1, "name"] = df.loc[nan_indexes - 1, "name"].str.cat(df.loc[nan_indexes, "name"], ' ')
# remove NaN rows
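A likely reason the attempt above fails: when Series.str.cat is given another Series, it aligns the two on their index labels, and here the labels differ (2 and 5 versus 3 and 6), so the concatenation yields NaN. The following is a minimal sketch of the same idea with the continuation names passed as a plain array so that no alignment happens. It assumes a default RangeIndex, that the first row is never a continuation row, and that each record has at most one continuation row; the names cont, nan_idx, prev_idx and fixed are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["A short name", "Another", "A very large", "name", "other one", "My name is", "big"],
    "letter": ["a", "b", "c", np.nan, "d", "e", np.nan],
    "value": [1, 2, 3, np.nan, 4, 5, np.nan],
})

fixed = df.copy()

# Rows where every column except "name" is NaN (the continuation rows)
cont = fixed[["letter", "value"]].isna().all(axis="columns")
nan_idx = fixed.index[cont]
prev_idx = nan_idx - 1  # assumption: default RangeIndex, one continuation row per record

# Pass the continuation names as a NumPy array so str.cat pairs them
# by position instead of aligning on index labels
fixed.loc[prev_idx, "name"] = fixed.loc[prev_idx, "name"].str.cat(
    fixed.loc[nan_idx, "name"].to_numpy(), sep=" "
)

# Remove the NaN rows and renumber
fixed = fixed.drop(index=nan_idx).reset_index(drop=True)
print(fixed)
```

If a name can wrap over more than two lines, this positional approach would need extra handling, which is where a grouping approach is more robust.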
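You can try groupby.agg with join or first, depending on the column. The groups are created by checking where the columns letter and value are notna and then taking the cumsum: a row with a value in either column starts a new group, and the all-NaN rows that follow it stay in that same group.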
# axis must be passed by keyword: any(1) raises a TypeError in pandas 2.0+
print(df.groupby(df[['letter', 'value']].notna().any(axis=1).cumsum())
        .agg({'name': ' '.join, 'letter': 'first', 'value': 'first'})
)
                name letter  value
1       A short name      a    1.0
2            Another      b    2.0
3  A very large name      c    3.0
4          other one      d    4.0
5     My name is big      e    5.0
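To make the grouping key visible, here is a self-contained sketch of the same groupby approach split into steps. The intermediate names starts, group_id and result are made up for illustration, and the final reset_index and astype calls are an optional addition (not part of the answer above) to bring the result closer to df_expected from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["A short name", "Another", "A very large", "name", "other one", "My name is", "big"],
    "letter": ["a", "b", "c", np.nan, "d", "e", np.nan],
    "value": [1, 2, 3, np.nan, 4, 5, np.nan],
})

# True on rows that start a new record (letter or value is present)
starts = df[["letter", "value"]].notna().any(axis=1)

# Running count of record starts; a continuation row adds 0,
# so it inherits the group number of the row before it
group_id = starts.cumsum()
print(group_id.tolist())  # [1, 2, 3, 3, 4, 5, 5]

result = (
    df.groupby(group_id)
      .agg({"name": " ".join, "letter": "first", "value": "first"})
      .reset_index(drop=True)       # optional: back to a 0-based index
      .astype({"value": "int64"})   # optional: safe here because no NaN remains
)
print(result)
```

Because the key is a running count, a name that wraps over three or more lines would still collapse into a single row, as long as only the first line carries the letter and value.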