Pandas rule based column merging

Question

I have a dataset of PubMed articles. The DataFrame looks like this:

df = pd.DataFrame({"section_names":[["introduction","methods","section1","another section","discussion"],
                                ["introduction","methods","discussion","other section","one more section","conclusion"]],
               "sections":[[["intro text","another sentence"],["some text","some text", "more text"],["some text","some text"],["some text","some text"],["some text","some text"]],
                          [["intro text","another sentence"],["some text","some text"],["some text","more text","some text","more text"],["some text","some text"],["some text","some text"],["some text","some text"]]]})

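So basically, the section_names column holds the names of all sections in an article. The sections column holds the actual text for each of those names, as one list of sentences per section. As a first step, I wanted to put each section into its own column, so I did this: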

df_col = pd.DataFrame([dict(zip(*pair)) for pair in zip(df['section_names'], df['sections'])])

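The NaN values make sense, because those sections do not occur in that particular article; every column has at least one non-NaN value. With many articles using different section names, the number of columns grows quickly. In the original dataset I actually end up with about 10,000 columns.

What I want now is to merge the columns down to at most 4 (introduction, methods, discussion, conclusion). I would like to state something like this: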

After a section named methods, merge all following sections up to discussion into methods, and after discussion, merge all following sections up to conclusion into discussion.

Under this rule, for the first article in our df, section1 and another section would be merged into methods. For the second article, other section and one more section should be merged into discussion.

How can I do this?

Answer 1

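One option is to build a column index from the positions of the desired columns, then aggregate each group's values into a single list per row: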

desired_columns = ['introduction', 'methods', 'discussion', 'conclusion']
new_df = df.groupby(df.columns.isin(desired_columns).cumsum(), axis=1).agg(
    lambda x: x.agg(
        lambda r: list(itertools.chain.from_iterable(r.dropna()))
                  or np.nan,
        axis=1)
)
new_df.columns = desired_columns

new_df:

                     introduction                                                                        methods                                                                                discussion              conclusion
0  [intro text, another sentence]  [some text, some text, more text, some text, some text, some text, some text]                                                                    [some text, some text]                     NaN
1  [intro text, another sentence]                                                         [some text, some text]  [some text, more text, some text, more text, some text, some text, some text, some text]  [some text, some text]

The column index is created with:

df.columns.isin(desired_columns).cumsum()

which produces the following groups:

[1 2 2 2 3 3 3 4]
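
To make the grouping step easier to follow, here is a small sketch that prints each column next to its group id. It assumes the column order that the example df ends up with (order of first appearance across articles); the names `columns`, `is_anchor` and `group_ids` are only illustrative.

```python
import pandas as pd

# Column order assumed to match the widened example frame
columns = pd.Index([
    "introduction", "methods", "section1", "another section",
    "discussion", "other section", "one more section", "conclusion",
])
desired_columns = ["introduction", "methods", "discussion", "conclusion"]

# True wherever a column is one of the four target sections
is_anchor = columns.isin(desired_columns)

# The running sum goes up by one at every target section, so every
# column shares an id with the closest target section to its left
group_ids = is_anchor.cumsum()

for name, gid in zip(columns, group_ids):
    print(gid, name)
```

Note that the ids come from the column order of the whole frame, not from the section order inside each individual article, so this only matches the rule when the two agree.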

Complete working example:

import itertools

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "section_names": [
        ["introduction", "methods", "section1", "another section", "discussion"],
        ["introduction", "methods", "discussion", "other section",
         "one more section", "conclusion"]],
    "sections": [
        [["intro text", "another sentence"], ["some text", "some text", "more text"],
         ["some text", "some text"], ["some text", "some text"],
         ["some text", "some text"]],
        [["intro text", "another sentence"], ["some text", "some text"],
         ["some text", "more text", "some text", "more text"],
         ["some text", "some text"], ["some text", "some text"],
         ["some text", "some text"]]]
})

df = pd.DataFrame(
    [dict(zip(*pair)) for pair in zip(df['section_names'], df['sections'])])

desired_columns = ['introduction', 'methods', 'discussion', 'conclusion']
new_df = df.groupby(df.columns.isin(desired_columns).cumsum(), axis=1).agg(
    lambda x: x.agg(
        lambda r: list(itertools.chain.from_iterable(r.dropna()))
                  or np.nan,
        axis=1)
)
new_df.columns = desired_columns
print(new_df.to_string())
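
As a possible alternative, the rule can also be applied per article before widening the frame. This is only a sketch, not part of the approach above: it avoids `groupby(..., axis=1)`, which as far as I know is deprecated in newer pandas releases (2.1 and later), and it does not depend on the column order of the wide frame. The helper `merge_sections` and the names `DESIRED`, `raw` and `merged_df` are made up for illustration, and dropping any section that appears before the first desired section is an assumption.

```python
import pandas as pd

DESIRED = ["introduction", "methods", "discussion", "conclusion"]


def merge_sections(names, texts, desired=DESIRED):
    """Fold every section into the most recent desired section seen so far."""
    merged = {}
    current = None  # no desired section seen yet
    for name, text in zip(names, texts):
        if name in desired:
            current = name
        if current is None:
            continue  # assumption: text before the first desired section is dropped
        merged.setdefault(current, []).extend(text)
    return merged


raw = pd.DataFrame({
    "section_names": [
        ["introduction", "methods", "section1", "another section", "discussion"],
        ["introduction", "methods", "discussion", "other section",
         "one more section", "conclusion"]],
    "sections": [
        [["intro text", "another sentence"], ["some text", "some text", "more text"],
         ["some text", "some text"], ["some text", "some text"],
         ["some text", "some text"]],
        [["intro text", "another sentence"], ["some text", "some text"],
         ["some text", "more text", "some text", "more text"],
         ["some text", "some text"], ["some text", "some text"],
         ["some text", "some text"]]],
})

# One dict per article; reindex guarantees all four columns in a fixed order,
# with NaN where an article has no such section
merged_df = pd.DataFrame(
    [merge_sections(n, t) for n, t in zip(raw["section_names"], raw["sections"])]
).reindex(columns=DESIRED)

print(merged_df.to_string())
```

Because the merge happens row by row, the wide frame with thousands of columns never has to be built.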
