Dataframe value matching in Python
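I have been stuck on a problem for several days. I have two dataframes, shown below. An index is by definition a unique triplet of TEST, NAME and SEQUENCE. The goal is to get the 'index' value from the index file that matches:
one of the sequence values in config, e.g. [111,222,333] (it can be 111 or 222 or 333), together with the TEST and NAME.
The config file is what matters most; the goal is to find the corresponding index values. Anything that is not present in config should not appear in the output file. I want a final output that includes INDEX, TEST, NAME and SEQUENCE. So the final output will be a subset of the config file, but with only one SEQUENCE (rather than 3) and the corresponding TEST, NAME and INDEX. For example:
Example output file: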
index TEST NAME SEQUENCE
901922 A john 111
238394 C ashley 555
930293 B sam 444
I tried writing a for loop, but could not get the indexing to work, as shown below:
for x in range(0, config.shape[0]):
    find1 = eval(config.SEQUENCE[x])
    find1 = '|'.join(str(i) for i in find1)
    find1 = '(' + find1 + ')'
First dataframe: config
SEQUENCE TEST NAME
[111,222,333] A john
[222,444,888] B sam
[111,222,333] A ashley
[999,777,555] C ashley
[111,222,333] D john
[111,222,333] A john
G kelly
Second dataframe: index
index TEST NAME SEQUENCE
901922 A john 111
930293 B sam 444
238203 A ashley 888
238394 C ashley 555
483472 D john 777
901922 A john 111
264225 F greg 111
465126 A mary 555
554216 B peter 333
# Your DataFrame contains a column of strings that look like lists,
# but we want to work with a column of actual Python lists.
# Convert strings to lists with this:
import pandas as pd
from ast import literal_eval
# Assumption: the blank SEQUENCE (G, kelly) was read as NaN; literal_eval cannot parse it, so drop that row first
config = config.dropna(subset=['SEQUENCE']).reset_index(drop=True)
config['SEQUENCE'] = config['SEQUENCE'].apply(literal_eval)
# Split these newly formed lists into separate columns
split = pd.concat([pd.DataFrame(config.SEQUENCE.values.tolist()),
                   config[['TEST', 'NAME']]], axis=1)
split
0 1 2 TEST NAME
0 111 222 333 A john
1 222 444 888 B sam
2 111 222 333 A ashley
3 999 777 666 C ashley
4 111 222 333 D john
5 111 222 333 A john
# Melt or "unpivot" the DF so that each row holds only one sequence
melted = split.melt(id_vars=['TEST', 'NAME'],
value_name='SEQUENCE').drop('variable', axis=1)
melted
TEST NAME SEQUENCE
0 A john 111
1 B sam 222
2 A ashley 111
3 C ashley 999
4 D john 111
5 A john 111
6 A john 222
7 B sam 444
8 A ashley 222
9 C ashley 777
10 D john 222
11 A john 222
12 A john 333
13 B sam 888
14 A ashley 333
15 C ashley 666
16 D john 333
17 A john 333
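# Default behaviour of pd.merge (an inner join on all shared columns) gives us what we want!
# Note that the duplicate row arises from the duplicate in the DF named index.
# Note also that the output here reflects 666 (not 555) as the last value in the C/ashley row of config,
# which is why index 238394 is absent; with 555 that row would match as well.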
pd.merge(melted, index)
TEST NAME SEQUENCE index
0 A john 111 901922
1 A john 111 901922
2 B sam 444 930293
pd.merge(melted, index)['index'].unique()
array([901922, 930293])
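The melt-and-merge idea above can also be sketched end to end with `DataFrame.explode` (available from pandas 0.25) in place of the split and melt steps. This is a variation, not the code from the answer: the two frames are rebuilt from the question's sample data, the blank SEQUENCE is assumed to be a missing value, and the names `parsed`, `long_config` and `result` are made up for illustration.

```python
import pandas as pd
from ast import literal_eval

# Sample data copied from the question; SEQUENCE in config is stored as a string.
# Assumption: the blank SEQUENCE for (G, kelly) is a missing value (None/NaN).
config = pd.DataFrame({
    "SEQUENCE": ["[111,222,333]", "[222,444,888]", "[111,222,333]",
                 "[999,777,555]", "[111,222,333]", "[111,222,333]", None],
    "TEST": ["A", "B", "A", "C", "D", "A", "G"],
    "NAME": ["john", "sam", "ashley", "ashley", "john", "john", "kelly"],
})
index = pd.DataFrame({
    "index": [901922, 930293, 238203, 238394, 483472,
              901922, 264225, 465126, 554216],
    "TEST": ["A", "B", "A", "C", "D", "A", "F", "A", "B"],
    "NAME": ["john", "sam", "ashley", "ashley", "john",
             "john", "greg", "mary", "peter"],
    "SEQUENCE": [111, 444, 888, 555, 777, 111, 111, 555, 333],
})

# Drop rows without a SEQUENCE, then parse the strings into real lists.
parsed = config.dropna(subset=["SEQUENCE"]).copy()
parsed["SEQUENCE"] = parsed["SEQUENCE"].apply(literal_eval)

# One row per (TEST, NAME, single sequence value).
long_config = parsed.explode("SEQUENCE")
long_config["SEQUENCE"] = long_config["SEQUENCE"].astype("int64")

# Inner merge on the three shared columns, then remove repeated matches
# caused by duplicate rows in either frame.
result = (
    long_config.merge(index, on=["TEST", "NAME", "SEQUENCE"])
    .drop_duplicates()
    .reset_index(drop=True)
    [["index", "TEST", "NAME", "SEQUENCE"]]
)
print(result)
```

Because this sketch uses 555 for the C/ashley row, as in the question, and calls `drop_duplicates`, it should return one row per matching triplet rather than the repeated A/john row.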
One approach is to first left-join the two tables on test and name, and then drop the rows whose sequence from index is not found in the sequence list from config:
ind2 = index.set_index(['test', 'name'])
out = config.join(ind2, ['test', 'name'], 'left', lsuffix='_config', rsuffix='_index')
out['sequence_config'] = out.apply(lambda x: x['sequence_index'] in x['sequence_config'] if x['sequence_config'] is not None else False, axis=1)
out = out[out['sequence_config']].set_index('index').drop_duplicates().drop(
'sequence_config', axis=1).rename(columns={'sequence_index': 'sequence'})
This gives:
name test sequence
index
901922.0 john A 111.0
930293.0 sam B 444.0
238394.0 ashley C 555.0
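For the upper-case column names shown in the question, the same join-then-filter idea might look like the sketch below. It swaps `DataFrame.join` for `DataFrame.merge`, rebuilds the sample frames so that it runs on its own, and assumes a blank SEQUENCE should simply never match; `cfg`, `joined`, `mask` and `out` are illustrative names only.

```python
import pandas as pd
from ast import literal_eval

# Sample data copied from the question.
# Assumption: the blank SEQUENCE for (G, kelly) is a missing value (None/NaN).
config = pd.DataFrame({
    "SEQUENCE": ["[111,222,333]", "[222,444,888]", "[111,222,333]",
                 "[999,777,555]", "[111,222,333]", "[111,222,333]", None],
    "TEST": ["A", "B", "A", "C", "D", "A", "G"],
    "NAME": ["john", "sam", "ashley", "ashley", "john", "john", "kelly"],
})
index = pd.DataFrame({
    "index": [901922, 930293, 238203, 238394, 483472,
              901922, 264225, 465126, 554216],
    "TEST": ["A", "B", "A", "C", "D", "A", "F", "A", "B"],
    "NAME": ["john", "sam", "ashley", "ashley", "john",
             "john", "greg", "mary", "peter"],
    "SEQUENCE": [111, 444, 888, 555, 777, 111, 111, 555, 333],
})

# Parse SEQUENCE strings into lists; blank entries become empty lists.
cfg = config.copy()
cfg["SEQUENCE"] = cfg["SEQUENCE"].apply(
    lambda s: literal_eval(s) if isinstance(s, str) else []
)

# Left-join on TEST and NAME; the clashing SEQUENCE columns get suffixes.
joined = cfg.merge(index, on=["TEST", "NAME"], how="left",
                   suffixes=("_config", "_index"))

# Keep rows whose index-side sequence appears in the config-side list.
mask = [seq in seqs for seq, seqs in
        zip(joined["SEQUENCE_index"], joined["SEQUENCE_config"])]

out = (
    joined.loc[mask, ["index", "TEST", "NAME", "SEQUENCE_index"]]
    .rename(columns={"SEQUENCE_index": "SEQUENCE"})
    .drop_duplicates()
    # The left join introduces NaN and therefore floats; cast back to integers.
    .astype({"index": "int64", "SEQUENCE": "int64"})
    .reset_index(drop=True)
)
print(out)
```

Casting back to `int64` after filtering avoids the `901922.0`-style floats that appear once the left join has introduced missing values.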