How to get categories of words containing a unique 3-letter set from the columns of a pandas dataframe in Python?
I have a dataframe df that looks like this:
Unnamed: 0 Characters Split A B C D Set Names
0 FROKDUWJU [FRO, KDU, WJU] FRO KDU WJU NaN {WJU, KDU, FRO}
1 IDJWPZSUR [IDJ, WPZ, SUR] IDJ WPZ SUR NaN {SUR, WPZ, IDJ}
2 UCFURKIRODCQ [UCF, URK, IRO, DCQ] UCF URK IRO DCQ {UCF, URK, DCQ, IRO}
3 ORI [ORI] ORI NaN NaN NaN {ORI}
4 PROIRKIQARTIBPO [PRO, IRK, IQA, RTI, BPO] PRO IRK IQA RTI {IQA, BPO, PRO, IRK, RTI}
5 QAZWREDCQIBR [QAZ, WRE, DCQ, IBR] QAZ WRE DCQ IBR {DCQ, QAZ, IBR, WRE}
6 PLPRUFSWURKI [PLP, RUF, SWU, RKI] PLP RUF SWU RKI {PLP, SWU, RKI, RUF}
7 FROIEUSKIKIR [FRO, IEU, SKI, KIR] FRO IEU SKI KIR {SKI, IEU, KIR, FRO}
8 ORIUWJZSRFRO [ORI, UWJ, ZSR, FRO] ORI UWJ ZSR FRO {UWJ, ORI, ZSR, FRO}
9 URKIFJVUR [URK, IFJ, VUR] URK IFJ VUR NaN {URK, VUR, IFJ}
10 RUFOFR [RUF, OFR] RUF OFR NaN NaN {OFR, RUF}
11 IEU [IEU] IEU NaN NaN NaN {IEU}
12 PIMIEU [PIM, IEU] PIM IEU NaN NaN {PIM, IEU}
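The first column contains some names. The Characters Split column contains each name split into 3-letter chunks, as a list. Columns A, B, C and D contain those 3-letter chunks individually. The Set Names column holds the same 3-letter chunks, but as a set.
Some of the 3-letter chunks are common to several names. For example, "FRO" appears in the names at index 0, 7 and 8. I would like to group the names that share a 3-letter chunk into one category, preferably as a list. Is it possible to get such a category for each unique 3-letter chunk? What would be a suitable approach?
df.to_dict() gives: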
{'Unnamed: 0': {0: 'FROKDUWJU',
1: 'IDJWPZSUR',
2: 'UCFURKIRODCQ',
3: 'ORI',
4: 'PROIRKIQARTIBPO',
5: 'QAZWREDCQIBR',
6: 'PLPRUFSWURKI',
7: 'FROIEUSKIKIR',
8: 'ORIUWJZSRFRO',
9: 'URKIFJVUR',
10: 'RUFOFR',
11: 'IEU',
12: 'PIMIEU'},
'Characters Split': {0: ['FRO', 'KDU', 'WJU'],
1: ['IDJ', 'WPZ', 'SUR'],
2: ['UCF', 'URK', 'IRO', 'DCQ'],
3: ['ORI'],
4: ['PRO', 'IRK', 'IQA', 'RTI', 'BPO'],
5: ['QAZ', 'WRE', 'DCQ', 'IBR'],
6: ['PLP', 'RUF', 'SWU', 'RKI'],
7: ['FRO', 'IEU', 'SKI', 'KIR'],
8: ['ORI', 'UWJ', 'ZSR', 'FRO'],
9: ['URK', 'IFJ', 'VUR'],
10: ['RUF', 'OFR'],
11: ['IEU'],
12: ['PIM', 'IEU']},
'A': {0: 'FRO',
1: 'IDJ',
2: 'UCF',
3: 'ORI',
4: 'PRO',
5: 'QAZ',
6: 'PLP',
7: 'FRO',
8: 'ORI',
9: 'URK',
10: 'RUF',
11: 'IEU',
12: 'PIM'},
'B': {0: 'KDU',
1: 'WPZ',
2: 'URK',
3: nan,
4: 'IRK',
5: 'WRE',
6: 'RUF',
7: 'IEU',
8: 'UWJ',
9: 'IFJ',
10: 'OFR',
11: nan,
12: 'IEU'},
'C': {0: 'WJU',
1: 'SUR',
2: 'IRO',
3: nan,
4: 'IQA',
5: 'DCQ',
6: 'SWU',
7: 'SKI',
8: 'ZSR',
9: 'VUR',
10: nan,
11: nan,
12: nan},
'D': {0: nan,
1: nan,
2: 'DCQ',
3: nan,
4: 'RTI',
5: 'IBR',
6: 'RKI',
7: 'KIR',
8: 'FRO',
9: nan,
10: nan,
11: nan,
12: nan},
'Set Names': {0: {'FRO', 'KDU', 'WJU'},
1: {'IDJ', 'SUR', 'WPZ'},
2: {'DCQ', 'IRO', 'UCF', 'URK'},
3: {'ORI'},
4: {'BPO', 'IQA', 'IRK', 'PRO', 'RTI'},
5: {'DCQ', 'IBR', 'QAZ', 'WRE'},
6: {'PLP', 'RKI', 'RUF', 'SWU'},
7: {'FRO', 'IEU', 'KIR', 'SKI'},
8: {'FRO', 'ORI', 'UWJ', 'ZSR'},
9: {'IFJ', 'URK', 'VUR'},
10: {'OFR', 'RUF'},
11: {'IEU'},
12: {'IEU', 'PIM'}}}
You can explode 'Set Names', then groupby the exploded column and aggregate 'Unnamed: 0' into a list for each group:
(df.explode('Set Names')
.groupby('Set Names')
['Unnamed: 0'].apply(list)
)
Output:
Set Names
BPO [PROIRKIQARTIBPO]
DCQ [UCFURKIRODCQ, QAZWREDCQIBR]
FRO [FROKDUWJU, FROIEUSKIKIR, ORIUWJZSRFRO]
IBR [QAZWREDCQIBR]
IDJ [IDJWPZSUR]
... ...
WJU [FROKDUWJU]
WPZ [IDJWPZSUR]
WRE [QAZWREDCQIBR]
ZSR [ORIUWJZSRFRO]
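For anyone who wants to reproduce this without the original file, here is a minimal self-contained sketch of the same explode/groupby idea. It assumes the 'Set Names' column can be rebuilt by slicing each name into consecutive 3-letter chunks, which matches the sample data above; the variable names `names` and `groups` are only illustrative.

```python
import pandas as pd

# Names copied from the 'Unnamed: 0' column of the question.
names = ["FROKDUWJU", "IDJWPZSUR", "UCFURKIRODCQ", "ORI", "PROIRKIQARTIBPO",
         "QAZWREDCQIBR", "PLPRUFSWURKI", "FROIEUSKIKIR", "ORIUWJZSRFRO",
         "URKIFJVUR", "RUFOFR", "IEU", "PIMIEU"]

df = pd.DataFrame({"Unnamed: 0": names})

# Assumption: each set is the name cut into consecutive 3-letter chunks.
df["Set Names"] = df["Unnamed: 0"].apply(
    lambda n: {n[i:i + 3] for i in range(0, len(n), 3)}
)

# One row per (name, chunk) pair, then collect the names for each chunk.
groups = (df.explode("Set Names")
            .groupby("Set Names")["Unnamed: 0"]
            .apply(list))

print(groups["FRO"])
```

Within each group the names keep their original row order, because explode repeats rows in place and groupby preserves the order of rows inside a group.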
If you want to filter the output to keep only groups with a minimum number of items (here, more than 1):
(df.explode('Set Names')
.groupby('Set Names')
['Unnamed: 0'].apply(lambda g: list(g) if len(g) > 1 else None)
.dropna()
)
Output:
Set Names
DCQ [UCFURKIRODCQ, QAZWREDCQIBR]
FRO [FROKDUWJU, FROIEUSKIKIR, ORIUWJZSRFRO]
IEU [FROIEUSKIKIR, IEU, PIMIEU]
ORI [ORI, ORIUWJZSRFRO]
RUF [PLPRUFSWURKI, RUFOFR]
URK [UCFURKIRODCQ, URKIFJVUR]
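As a possible variation on the filtering step, the lists can be built first and then selected with a boolean mask on their lengths, which avoids the None/dropna round trip. This is only a sketch under the same assumption as the question's data (sets rebuilt from consecutive 3-letter chunks); `shared` is a hypothetical name, and the final `to_dict()` is just one way to get a plain mapping from chunk to list of names.

```python
import pandas as pd

# Names copied from the 'Unnamed: 0' column of the question.
names = ["FROKDUWJU", "IDJWPZSUR", "UCFURKIRODCQ", "ORI", "PROIRKIQARTIBPO",
         "QAZWREDCQIBR", "PLPRUFSWURKI", "FROIEUSKIKIR", "ORIUWJZSRFRO",
         "URKIFJVUR", "RUFOFR", "IEU", "PIMIEU"]

df = pd.DataFrame({"Unnamed: 0": names})

# Assumption: each set is the name cut into consecutive 3-letter chunks.
df["Set Names"] = df["Unnamed: 0"].apply(
    lambda n: {n[i:i + 3] for i in range(0, len(n), 3)}
)

groups = (df.explode("Set Names")
            .groupby("Set Names")["Unnamed: 0"]
            .apply(list))

# Keep only the chunks shared by more than one name.
shared = groups[groups.map(len) > 1].to_dict()

for chunk, members in shared.items():
    print(chunk, members)
```

Changing the threshold in `groups.map(len) > 1` gives a different minimum group size.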