How to get categories of words that share a unique 3-letter chunk from the columns of a pandas DataFrame in Python?

I have a DataFrame df that looks like this:

    Unnamed: 0       Characters Split           A    B    C    D    Set Names
0   FROKDUWJU        [FRO, KDU, WJU]            FRO  KDU  WJU  NaN  {WJU, KDU, FRO}
1   IDJWPZSUR        [IDJ, WPZ, SUR]            IDJ  WPZ  SUR  NaN  {SUR, WPZ, IDJ}
2   UCFURKIRODCQ     [UCF, URK, IRO, DCQ]       UCF  URK  IRO  DCQ  {UCF, URK, DCQ, IRO}
3   ORI              [ORI]                      ORI  NaN  NaN  NaN  {ORI}
4   PROIRKIQARTIBPO  [PRO, IRK, IQA, RTI, BPO]  PRO  IRK  IQA  RTI  {IQA, BPO, PRO, IRK, RTI}
5   QAZWREDCQIBR     [QAZ, WRE, DCQ, IBR]       QAZ  WRE  DCQ  IBR  {DCQ, QAZ, IBR, WRE}
6   PLPRUFSWURKI     [PLP, RUF, SWU, RKI]       PLP  RUF  SWU  RKI  {PLP, SWU, RKI, RUF}
7   FROIEUSKIKIR     [FRO, IEU, SKI, KIR]       FRO  IEU  SKI  KIR  {SKI, IEU, KIR, FRO}
8   ORIUWJZSRFRO     [ORI, UWJ, ZSR, FRO]       ORI  UWJ  ZSR  FRO  {UWJ, ORI, ZSR, FRO}
9   URKIFJVUR        [URK, IFJ, VUR]            URK  IFJ  VUR  NaN  {URK, VUR, IFJ}
10  RUFOFR           [RUF, OFR]                 RUF  OFR  NaN  NaN  {OFR, RUF}
11  IEU              [IEU]                      IEU  NaN  NaN  NaN  {IEU}
12  PIMIEU           [PIM, IEU]                 PIM  IEU  NaN  NaN  {PIM, IEU}

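The first column contains names. The Characters Split column holds each name split into 3-letter chunks, as a list. Columns A, B, C and D hold those 3-letter chunks individually. The Set Names column holds the same 3-letter chunks, but as a set.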
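For reference, the derived columns could be rebuilt from the names roughly as follows. This is only a minimal sketch, not necessarily how df was actually produced; the helper `split_chunks` and the three sample names are illustrative assumptions.

```python
import pandas as pd

# Three of the names from the question, used as a small sample.
df = pd.DataFrame({"Unnamed: 0": ["FROKDUWJU", "ORI", "PIMIEU"]})

def split_chunks(name, size=3):
    """Hypothetical helper: cut a name into consecutive chunks of `size` letters."""
    return [name[i:i + size] for i in range(0, len(name), size)]

# List of 3-letter chunks per name, then the same chunks as a set.
df["Characters Split"] = df["Unnamed: 0"].apply(split_chunks)
df["Set Names"] = df["Characters Split"].apply(set)

print(df)
```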

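Some of the 3-letter chunks occur in more than one name. For example, "FRO" appears in the names at index 0, 7 and 8. I want to group the names that share a 3-letter chunk into one category, preferably as a list. Is it possible to build such a category for each unique 3-letter chunk? What would be a suitable approach?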

df.to_dict() looks like this:

{'Unnamed: 0': {0: 'FROKDUWJU',
  1: 'IDJWPZSUR',
  2: 'UCFURKIRODCQ',
  3: 'ORI',
  4: 'PROIRKIQARTIBPO',
  5: 'QAZWREDCQIBR',
  6: 'PLPRUFSWURKI',
  7: 'FROIEUSKIKIR',
  8: 'ORIUWJZSRFRO',
  9: 'URKIFJVUR',
  10: 'RUFOFR',
  11: 'IEU',
  12: 'PIMIEU'},
 'Characters Split': {0: ['FRO', 'KDU', 'WJU'],
  1: ['IDJ', 'WPZ', 'SUR'],
  2: ['UCF', 'URK', 'IRO', 'DCQ'],
  3: ['ORI'],
  4: ['PRO', 'IRK', 'IQA', 'RTI', 'BPO'],
  5: ['QAZ', 'WRE', 'DCQ', 'IBR'],
  6: ['PLP', 'RUF', 'SWU', 'RKI'],
  7: ['FRO', 'IEU', 'SKI', 'KIR'],
  8: ['ORI', 'UWJ', 'ZSR', 'FRO'],
  9: ['URK', 'IFJ', 'VUR'],
  10: ['RUF', 'OFR'],
  11: ['IEU'],
  12: ['PIM', 'IEU']},
 'A': {0: 'FRO',
  1: 'IDJ',
  2: 'UCF',
  3: 'ORI',
  4: 'PRO',
  5: 'QAZ',
  6: 'PLP',
  7: 'FRO',
  8: 'ORI',
  9: 'URK',
  10: 'RUF',
  11: 'IEU',
  12: 'PIM'},
 'B': {0: 'KDU',
  1: 'WPZ',
  2: 'URK',
  3: nan,
  4: 'IRK',
  5: 'WRE',
  6: 'RUF',
  7: 'IEU',
  8: 'UWJ',
  9: 'IFJ',
  10: 'OFR',
  11: nan,
  12: 'IEU'},
 'C': {0: 'WJU',
  1: 'SUR',
  2: 'IRO',
  3: nan,
  4: 'IQA',
  5: 'DCQ',
  6: 'SWU',
  7: 'SKI',
  8: 'ZSR',
  9: 'VUR',
  10: nan,
  11: nan,
  12: nan},
 'D': {0: nan,
  1: nan,
  2: 'DCQ',
  3: nan,
  4: 'RTI',
  5: 'IBR',
  6: 'RKI',
  7: 'KIR',
  8: 'FRO',
  9: nan,
  10: nan,
  11: nan,
  12: nan},
 'Set Names': {0: {'FRO', 'KDU', 'WJU'},
  1: {'IDJ', 'SUR', 'WPZ'},
  2: {'DCQ', 'IRO', 'UCF', 'URK'},
  3: {'ORI'},
  4: {'BPO', 'IQA', 'IRK', 'PRO', 'RTI'},
  5: {'DCQ', 'IBR', 'QAZ', 'WRE'},
  6: {'PLP', 'RKI', 'RUF', 'SWU'},
  7: {'FRO', 'IEU', 'KIR', 'SKI'},
  8: {'FRO', 'ORI', 'UWJ', 'ZSR'},
  9: {'IFJ', 'URK', 'VUR'},
  10: {'OFR', 'RUF'},
  11: {'IEU'},
  12: {'IEU', 'PIM'}}}

You can explode 'Set Names', then groupby the exploded column and collect 'Unnamed: 0' into a list for each group:

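# explode: one row per (name, 3-letter chunk) pair
# groupby + apply(list): collect the names that contain each chunk
(df.explode('Set Names')
   .groupby('Set Names')
   ['Unnamed: 0'].apply(list)
)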

Output:

Set Names
BPO                          [PROIRKIQARTIBPO]
DCQ               [UCFURKIRODCQ, QAZWREDCQIBR]
FRO    [FROKDUWJU, FROIEUSKIKIR, ORIUWJZSRFRO]
IBR                             [QAZWREDCQIBR]
IDJ                                [IDJWPZSUR]
...                                        ...
WJU                                [FROKDUWJU]
WPZ                                [IDJWPZSUR]
WRE                             [QAZWREDCQIBR]
ZSR                             [ORIUWJZSRFRO]
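If a plain lookup table is more convenient than a Series, the grouped result can be converted with `to_dict()`. The snippet below is a self-contained sketch on a four-row subset of the question's data; the variable names `groups` and `categories` are just for illustration.

```python
import pandas as pd

# Four rows taken from the question's data.
df = pd.DataFrame({
    "Unnamed: 0": ["FROKDUWJU", "FROIEUSKIKIR", "ORIUWJZSRFRO", "IEU"],
    "Set Names": [
        {"FRO", "KDU", "WJU"},
        {"FRO", "IEU", "SKI", "KIR"},
        {"ORI", "UWJ", "ZSR", "FRO"},
        {"IEU"},
    ],
})

# One row per (name, chunk) pair, then the list of names for each chunk.
groups = (df.explode("Set Names")
            .groupby("Set Names")["Unnamed: 0"]
            .apply(list))

# Dictionary mapping each 3-letter chunk to the names that contain it.
categories = groups.to_dict()
print(categories["FRO"])
```

Exploding a set does not guarantee the order of the chunks within a row, but each list still follows the original row order, because a set holds every chunk at most once per name.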

If you want to filter the output to keep only the groups with a minimum number of items (here > 1):

# groups with a single name become None and are then removed by dropna()
(df.explode('Set Names')
   .groupby('Set Names')
   ['Unnamed: 0'].apply(lambda g: list(g) if len(g) > 1 else None)
   .dropna()
)

Output:

Set Names
DCQ               [UCFURKIRODCQ, QAZWREDCQIBR]
FRO    [FROKDUWJU, FROIEUSKIKIR, ORIUWJZSRFRO]
IEU                [FROIEUSKIKIR, IEU, PIMIEU]
ORI                        [ORI, ORIUWJZSRFRO]
RUF                     [PLPRUFSWURKI, RUFOFR]
URK                  [UCFURKIRODCQ, URKIFJVUR]
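An alternative way to apply the same filter, shown here only as a sketch, is to build all the lists first and then keep those whose length reaches a threshold using a boolean mask. The function name `shared_groups` and its `min_size` parameter are hypothetical, introduced for this example.

```python
import pandas as pd

def shared_groups(df, min_size=2):
    """Hypothetical helper: chunks shared by at least `min_size` names."""
    groups = (df.explode("Set Names")
                .groupby("Set Names")["Unnamed: 0"]
                .apply(list))
    # Keep only the chunks whose list of names is long enough.
    return groups[groups.map(len) >= min_size]

# Four rows taken from the question's data.
df = pd.DataFrame({
    "Unnamed: 0": ["UCFURKIRODCQ", "QAZWREDCQIBR", "URKIFJVUR", "ORI"],
    "Set Names": [
        {"UCF", "URK", "IRO", "DCQ"},
        {"QAZ", "WRE", "DCQ", "IBR"},
        {"URK", "IFJ", "VUR"},
        {"ORI"},
    ],
})

result = shared_groups(df)
print(result)
```

This avoids the intermediate None values and makes the threshold a parameter, at the cost of building the lists for groups that are later discarded.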