Pandas rule-based column merging
I have a dataset of PubMed articles. The DataFrame looks like this:
df = pd.DataFrame({"section_names":[["introduction","methods","section1","another section","discussion"],
["introduction","methods","discussion","other section","one more section","conclusion"]],
"sections":[[["intro text","another sentence"],["some text","some text", "more text"],["some text","some text"],["some text","some text"],["some text","some text"]],
[["intro text","another sentence"],["some text","some text"],["some text","more text","some text","more text"],["some text","some text"],["some text","some text"],["some text","some text"]]]})
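So basically, the section_names column contains the names of all sections in an article. The sections column holds the actual text for each of those names, as one list of sentences per section. As a first step, I wanted to put each section in its own column, so I did this:
df_col = pd.DataFrame([dict(zip(*pair)) for pair in zip(df['section_names'], df['sections'])])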
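To make the reshaping step concrete, here is a minimal sketch on the same toy data. It shows what one zipped record looks like and which columns the wide frame ends up with. The name `first_record` is only illustrative and not part of the original code.

```python
import pandas as pd

df = pd.DataFrame({
    "section_names": [
        ["introduction", "methods", "section1", "another section", "discussion"],
        ["introduction", "methods", "discussion", "other section",
         "one more section", "conclusion"],
    ],
    "sections": [
        [["intro text", "another sentence"], ["some text", "some text", "more text"],
         ["some text", "some text"], ["some text", "some text"],
         ["some text", "some text"]],
        [["intro text", "another sentence"], ["some text", "some text"],
         ["some text", "more text", "some text", "more text"],
         ["some text", "some text"], ["some text", "some text"],
         ["some text", "some text"]],
    ],
})

# zip(names, texts) pairs each section name with its list of sentences;
# dict(...) turns those pairs into one {name: sentences} record.
first_record = dict(zip(df["section_names"][0], df["sections"][0]))

# One record per article. pandas uses the union of all keys as columns
# and fills NaN wherever an article does not have that section.
df_col = pd.DataFrame(
    [dict(zip(*pair)) for pair in zip(df["section_names"], df["sections"])]
)

print(list(df_col.columns))
print(df_col.isna().sum().to_dict())  # number of missing sections per column
```

Every distinct section name becomes its own column, which is why the column count grows so quickly on real data.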
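The NaN values make sense, because those sections do not exist in that particular article, and every column has at least one non-NaN value. With many articles that use different section names, the number of columns grows dramatically. In the original dataset, I actually have about 10,000 columns.
What I want now is to merge columns and end up with at most 4 columns (introduction, methods, discussion, conclusion). I want to say something like: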
After a section named methods, merge all following sections up to discussion into methods; and after discussion, merge all following sections up to conclusion into discussion.
According to this rule, in our df, section1 and another section of the first article would be merged with methods. For the second article, other section and one more section should be merged with discussion.
How can I do this?
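One option is to build column groups based on the positions of the desired columns, then aggregate the row values of each group into a single list: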
desired_columns = ['introduction', 'methods', 'discussion', 'conclusion']
# df is the wide frame from the question (called df_col there)
new_df = df.groupby(df.columns.isin(desired_columns).cumsum(), axis=1).agg(
    lambda x: x.agg(
        # Flatten each row's non-NaN lists; an empty list becomes NaN
        lambda r: list(itertools.chain.from_iterable(r.dropna())) or np.nan,
        axis=1)
)
new_df.columns = desired_columns
new_df:
   introduction                    methods                                                                        discussion                                                                                conclusion
0  [intro text, another sentence]  [some text, some text, more text, some text, some text, some text, some text]  [some text, some text]                                                                    NaN
1  [intro text, another sentence]  [some text, some text]                                                         [some text, more text, some text, more text, some text, some text, some text, some text]  [some text, some text]
The column groups are created with:
df.columns.isin(desired_columns).cumsum()
which produces the following group ids, one per column:
[1 2 2 2 3 3 3 4]
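As a small illustration of how those group ids come about, the sketch below applies the same `isin(...).cumsum()` expression to a standalone `pd.Index` holding the eight column names of the toy frame. The names `is_anchor` and `group_ids` are illustrative only.

```python
import pandas as pd

columns = pd.Index([
    "introduction", "methods", "section1", "another section",
    "discussion", "other section", "one more section", "conclusion",
])
desired_columns = ["introduction", "methods", "discussion", "conclusion"]

# True exactly where a desired column sits, i.e. where a new group starts
is_anchor = columns.isin(desired_columns)

# Running count of anchors seen so far: every column inherits the id of
# the closest desired column to its left (or itself, if it is one)
group_ids = is_anchor.cumsum()

for gid, name in zip(group_ids, columns):
    print(gid, name)
```

Note that the ids follow the column order of the wide frame, so a column ends up in the group of whichever desired column precedes it in that frame.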
Complete working example:
import itertools

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "section_names": [
        ["introduction", "methods", "section1", "anothersection", "discussion"],
        ["introduction", "methods", "discussion", "othersection",
         "onemoresection", "conclusion"]],
    "sections": [
        [["introtext", "anothersentence"], ["sometext", "sometext", "moretext"],
         ["sometext", "sometext"], ["sometext", "sometext"],
         ["sometext", "sometext"]],
        [["introtext", "anothersentence"], ["sometext", "sometext"],
         ["sometext", "moretext", "sometext", "moretext"],
         ["sometext", "sometext"], ["sometext", "sometext"],
         ["sometext", "sometext"]]]
})
# One column per section name; NaN where an article lacks that section
df = pd.DataFrame(
    [dict(zip(*pair)) for pair in zip(df['section_names'], df['sections'])])
desired_columns = ['introduction', 'methods', 'discussion', 'conclusion']
# Note: groupby(..., axis=1) is deprecated as of pandas 2.1
new_df = df.groupby(df.columns.isin(desired_columns).cumsum(), axis=1).agg(
    lambda x: x.agg(
        # Flatten each row's non-NaN lists; an empty list becomes NaN
        lambda r: list(itertools.chain.from_iterable(r.dropna())) or np.nan,
        axis=1)
)
# Assumes all four desired columns are present, in this order
new_df.columns = desired_columns
print(new_df.to_string())
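The approach above groups by the column order of the wide frame, which is the order in which section names first appear across all articles, and it relies on `groupby(..., axis=1)`, which recent pandas versions deprecate. As an alternative sketch (not the method shown above), the rule can be applied per article directly on the original lists, so the wide 10,000-column frame is never built. The helper `merge_sections` is a hypothetical name, and dropping any section that appears before the first desired one is an assumption of this sketch.

```python
import pandas as pd

DESIRED = ["introduction", "methods", "discussion", "conclusion"]


def merge_sections(names, texts, desired=DESIRED):
    """Fold every other section into the most recent desired section.

    Assumption: sections that come before the first desired section
    are dropped, since the rule does not say where they belong.
    """
    merged = {}
    current = None  # desired section currently being extended
    for name, sentences in zip(names, texts):
        if name in desired:
            current = name
        if current is not None:
            merged.setdefault(current, []).extend(sentences)
    return merged


df = pd.DataFrame({
    "section_names": [
        ["introduction", "methods", "section1", "another section", "discussion"],
        ["introduction", "methods", "discussion", "other section",
         "one more section", "conclusion"],
    ],
    "sections": [
        [["intro text", "another sentence"], ["some text", "some text", "more text"],
         ["some text", "some text"], ["some text", "some text"],
         ["some text", "some text"]],
        [["intro text", "another sentence"], ["some text", "some text"],
         ["some text", "more text", "some text", "more text"],
         ["some text", "some text"], ["some text", "some text"],
         ["some text", "some text"]],
    ],
})

# columns=DESIRED fixes the column order and leaves NaN for absent sections
new_df = pd.DataFrame(
    [merge_sections(n, t) for n, t in zip(df["section_names"], df["sections"])],
    columns=DESIRED,
)
print(new_df.to_string())
```

Because each article is processed in its own section order, a stray section always lands in the desired section that precedes it within that article, regardless of how other articles are laid out.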