Pandas rule-based column merging

I have a dataset containing PubMed articles. The DataFrame looks like this:

df = pd.DataFrame({"section_names":[["introduction","methods","section1","another section","discussion"],
                                ["introduction","methods","discussion","other section","one  more section","conclusion"]],
               "sections":[[["intro text","another sentence"],["some text","some text", "more text"],["some text","some text"],["some text","some text"],["some text","some text"]],
                          [["intro text","another sentence"],["some text","some text"],["some text","more text","some text","more text"],["some text","some text"],["some text","some text"],["some text","some text"]]]})

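Basically, the section_names column holds the names of all sections in an article, and the sections column holds the actual text: one list of sentences for each name in section_names. As a first step, I wanted each section in its own column, so I did this: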

df_col = pd.DataFrame([dict(zip(*pair)) for pair in zip(df['section_names'], df['sections'])])

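The NaN values in df_col make sense, because those sections do not exist in the given article; every column has at least one non-NaN value. With many articles that use different section names, the number of columns grows sharply. In the original dataset I actually have about 10,000 columns.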
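To make the source of the NaN values concrete, here is a minimal sketch on made-up toy data (the names `wide`, `section_names` and `sections` as standalone lists, and the single-letter texts, are illustrative assumptions, not part of the real dataset). Each article contributes only the keys it has, so any section name missing from an article shows up as NaN in that row.

```python
import pandas as pd

# Hypothetical toy data: two short "articles" with different section names.
section_names = [["introduction", "methods", "section1"],
                 ["introduction", "discussion"]]
sections = [[["a"], ["b"], ["c"]],
            [["d"], ["e"]]]

# Same dict/zip construction as in the question: one dict per article.
wide = pd.DataFrame(
    [dict(zip(names, texts)) for names, texts in zip(section_names, sections)]
)

print(wide.columns.tolist())  # union of all section names, in order of first appearance
print(wide.isna().sum())      # number of articles lacking each section
```

Note that `dict(zip(...))` keeps only the last entry if an article repeats a section name, which may or may not matter for the real data.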

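What I want now is to merge the columns so that there are at most 4 of them (introduction, methods, discussion, conclusion). I want to express a rule like this: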

After a section named methods, merge all other sections up to discussion into methods; after discussion, merge all sections up to conclusion into discussion.

Under this rule, for the first article in our df, section1 and another section would be merged into methods. For the second article, other section and one more section should be merged into discussion.

How can I do this?

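One option is to build column group labels from the positions of the desired columns, then aggregate each group's values row by row into a single list: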

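# df here is the wide one-column-per-section frame (df_col in the question).
# Requires: import itertools; import numpy as np
desired_columns = ['introduction', 'methods', 'discussion', 'conclusion']
# Each desired column starts a new group; the columns after it share its label.
new_df = df.groupby(df.columns.isin(desired_columns).cumsum(), axis=1).agg(
    lambda x: x.agg(
        # Flatten the row's non-NaN lists into one list; an empty result becomes NaN.
        lambda r: list(itertools.chain.from_iterable(r.dropna()))
                  or np.nan,
        axis=1)
)
# Assumes every desired column occurs in df, in this order, with no column before the first.
new_df.columns = desired_columns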

new_df:

                     introduction                                                                        methods                                                                                discussion              conclusion
0  [intro text, another sentence]  [some text, some text, more text, some text, some text, some text, some text]                                                                    [some text, some text]                     NaN
1  [intro text, another sentence]                                                         [some text, some text]  [some text, more text, some text, more text, some text, some text, some text, some text]  [some text, some text]

The column group labels are created with:

df.columns.isin(desired_columns).cumsum()

which produces the following groups:

[1 2 2 2 3 3 3 4]
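The grouping trick can be checked in isolation. The sketch below assumes the column order that the dict/zip step produces for the two sample articles (order of first appearance); the names `columns`, `is_anchor` and `group_ids` are introduced here only for illustration.

```python
import pandas as pd

# Column order assumed from the question's sample data.
columns = pd.Index([
    "introduction", "methods", "section1", "another section",
    "discussion", "other section", "one  more section", "conclusion",
])
desired_columns = ["introduction", "methods", "discussion", "conclusion"]

# True wherever a desired column starts a new group.
is_anchor = columns.isin(desired_columns)
# Running count of anchors seen so far: every column inherits the
# label of the most recent desired column to its left.
group_ids = is_anchor.cumsum()

for name, gid in zip(columns, group_ids):
    print(gid, name)
```

Because the labels depend purely on column position, this only follows the merge rule if the wide frame's column order matches the section order inside each article.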

Complete working example:

import itertools

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "section_names": [
        ["introduction", "methods", "section1", "another section", "discussion"],
        ["introduction", "methods", "discussion", "other section",
         "one  more section", "conclusion"]], "sections": [
        [["intro text", "another sentence"], ["some text", "some text", "more text"],
         ["some text", "some text"], ["some text", "some text"],
         ["some text", "some text"]],
        [["intro text", "another sentence"], ["some text", "some text"],
         ["some text", "more text", "some text", "more text"],
         ["some text", "some text"], ["some text", "some text"],
         ["some text", "some text"]]]
})

df = pd.DataFrame(
    [dict(zip(*pair)) for pair in zip(df['section_names'], df['sections'])])

desired_columns = ['introduction', 'methods', 'discussion', 'conclusion']
new_df = df.groupby(df.columns.isin(desired_columns).cumsum(), axis=1).agg(
    lambda x: x.agg(
        lambda r: list(itertools.chain.from_iterable(r.dropna()))
                  or np.nan,
        axis=1)
)
new_df.columns = desired_columns
print(new_df.to_string())
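As a possible alternative, the rule can also be applied per article, directly on the original lists, without ever building the wide frame. This is only a sketch under stated assumptions: `merge_sections`, `DESIRED`, `raw` and `merged_df` are names made up here, and text that appears before the first desired section is simply dropped. It may be worth considering because it follows each article's own section order rather than the column order of the wide frame, and because it avoids `groupby(..., axis=1)`, which, as far as I know, is deprecated in recent pandas releases.

```python
import pandas as pd

DESIRED = ["introduction", "methods", "discussion", "conclusion"]


def merge_sections(names, texts, desired=DESIRED):
    """Fold each section's sentences into the most recent desired section."""
    merged = {}
    current = None  # no desired section seen yet
    for name, sentences in zip(names, texts):
        if name in desired:
            current = name  # a desired section starts a new target
        if current is None:
            continue  # assumption: ignore text before the first desired section
        merged.setdefault(current, []).extend(sentences)
    return merged


# Sample data from the question.
raw = pd.DataFrame({
    "section_names": [
        ["introduction", "methods", "section1", "another section", "discussion"],
        ["introduction", "methods", "discussion", "other section",
         "one  more section", "conclusion"],
    ],
    "sections": [
        [["intro text", "another sentence"], ["some text", "some text", "more text"],
         ["some text", "some text"], ["some text", "some text"],
         ["some text", "some text"]],
        [["intro text", "another sentence"], ["some text", "some text"],
         ["some text", "more text", "some text", "more text"],
         ["some text", "some text"], ["some text", "some text"],
         ["some text", "some text"]],
    ],
})

rows = [merge_sections(n, t) for n, t in zip(raw["section_names"], raw["sections"])]
# reindex fixes the column order and adds any desired column no article has.
merged_df = pd.DataFrame(rows).reindex(columns=DESIRED)
print(merged_df.to_string())
```

Since the merged result never has more than the four desired columns, this variant also sidesteps the 10,000-column intermediate frame mentioned in the question.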