Dataframe value matching in Python

Question

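I have been stuck on a problem for several days. I have 2 dataframes, shown below. The index dataframe by definition refers to a unique set of TEST, NAME, and SEQUENCE triples. The goal is to get the 'index' value from the index file for the row that matches:

one of the sequence values in config, e.g. [111,222,333] (the match can be 111 or 222 or 333), together with the test and the name.

The config file is what matters, and the goal is to find the corresponding index value. Anything that is not in config should not appear in the output file. I want a final output that includes INDEX, TEST, NAME, and SEQUENCE. So the final output will be a subset of the config file, but with only one SEQUENCE (instead of 3) and the corresponding TEST, NAME, and INDEX. For example:

Example output file: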

index   TEST    NAME    SEQUENCE
901922  A       john       111
238394  C       ashley     555
930293  B       sam        444

I tried writing a for loop, but I could not get the lookup against index to work, as shown below:

for x in range(0, config.shape[0]):
    find1=eval(config.SEQUENCE[x])
    find1='|'.join(str(i) for i in find1)  
    find1 = '(' + find1 + ')'

First dataframe: config

SEQUENCE    TEST    NAME
[111,222,333]   A   john
[222,444,888]   B   sam
[111,222,333]   A   ashley
[999,777,555]   C   ashley
[111,222,333]   D   john
[111,222,333]   A   john
                G   kelly

Second dataframe: index

index   TEST    NAME    SEQUENCE
901922  A       john       111
930293  B       sam        444
238203  A       ashley     888
238394  C       ashley     555
483472  D       john       777
901922  A       john       111
264225  F       greg       111
465126  A       mary       555
554216  B       peter      333

Answer 1

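# Your DataFrame contains a column of strings that look like lists,
# but we want to work with a column of actual Python lists.
# Convert strings to lists with this:
import pandas as pd
from ast import literal_eval
# literal_eval cannot parse missing values, so drop rows without a SEQUENCE
# first (this assumes the empty G/kelly entry is stored as NaN).
config = config.dropna(subset=['SEQUENCE']).reset_index(drop=True)
config['SEQUENCE'] = config['SEQUENCE'].apply(literal_eval)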

# Split these newly formed lists into separate columns
split = pd.concat([pd.DataFrame(config.SEQUENCE.values.tolist()), 
                   config[['TEST', 'NAME']]], axis=1)
split
     0    1    2 TEST    NAME
0  111  222  333    A    john
1  222  444  888    B     sam
2  111  222  333    A  ashley
3  999  777  666    C  ashley
4  111  222  333    D    john
5  111  222  333    A   john


# Melt or "unpivot" the DF so that each row holds only one sequence
melted = split.melt(id_vars=['TEST', 'NAME'], 
                    value_name='SEQUENCE').drop('variable', axis=1)
melted
   TEST    NAME  SEQUENCE
0     A    john       111
1     B     sam       222
2     A  ashley       111
3     C  ashley       999
4     D    john       111
5     A   john       111
6     A    john       222
7     B     sam       444
8     A  ashley       222
9     C  ashley       777
10    D    john       222
11    A   john       222
12    A    john       333
13    B     sam       888
14    A  ashley       333
15    C  ashley       666
16    D    john       333
17    A   john       333


# Default behaviour of pd.merge gives us what we want!
# Note that the duplicate row arises from the duplicate in the DF named index.
pd.merge(melted, index)
  TEST  NAME  SEQUENCE   index
0    A  john       111  901922
1    A  john       111  901922
2    B   sam       444  930293


pd.merge(melted, index)['index'].unique()
array([901922, 930293])
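
For comparison, here is a self-contained sketch of the same idea using DataFrame.explode (available in pandas 0.25 and later), which replaces the split and melt steps with a single call. It is only an illustration, not part of the original answer: the sample frames are typed in from the question, the variable names parsed, long_config, and result are made up for this example, and the empty G/kelly SEQUENCE is assumed to be a missing value.

```python
from ast import literal_eval

import pandas as pd

# Sample data typed in from the question. SEQUENCE is stored as a string,
# and the G/kelly row is assumed to have a missing SEQUENCE.
config = pd.DataFrame({
    "SEQUENCE": ["[111,222,333]", "[222,444,888]", "[111,222,333]",
                 "[999,777,555]", "[111,222,333]", "[111,222,333]", None],
    "TEST": ["A", "B", "A", "C", "D", "A", "G"],
    "NAME": ["john", "sam", "ashley", "ashley", "john", "john", "kelly"],
})
index = pd.DataFrame({
    "index": [901922, 930293, 238203, 238394, 483472,
              901922, 264225, 465126, 554216],
    "TEST": ["A", "B", "A", "C", "D", "A", "F", "A", "B"],
    "NAME": ["john", "sam", "ashley", "ashley", "john",
             "john", "greg", "mary", "peter"],
    "SEQUENCE": [111, 444, 888, 555, 777, 111, 111, 555, 333],
})

# Drop rows without a SEQUENCE, then parse the strings into real lists.
parsed = config.dropna(subset=["SEQUENCE"]).copy()
parsed["SEQUENCE"] = parsed["SEQUENCE"].apply(literal_eval)

# One row per (TEST, NAME, single SEQUENCE value).
long_config = parsed.explode("SEQUENCE")
# explode leaves an object column; cast so it matches index["SEQUENCE"].
long_config["SEQUENCE"] = long_config["SEQUENCE"].astype("int64")

# Inner merge on the three shared columns, then remove repeated rows
# and put the columns in the order the question asks for.
result = (
    long_config.merge(index, on=["TEST", "NAME", "SEQUENCE"])
    .drop_duplicates()
    .reset_index(drop=True)
)[["index", "TEST", "NAME", "SEQUENCE"]]
print(result)
```

With the config exactly as listed in the question (555 rather than 666 for C/ashley), the result should hold the three index values from the example output: 901922, 930293, and 238394.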

Answer 2

One way is to first left-join the two tables on test and name, and then remove the rows where the sequence from index is not found in the sequence list from config:

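# Assumes lowercase column names (test, name, sequence) and that
# config['sequence'] already holds Python lists (None where missing).
ind2 = index.set_index(['test', 'name'])
out = config.join(ind2, on=['test', 'name'], how='left', lsuffix='_config', rsuffix='_index')
# Overwrite sequence_config with a flag: True when the index-side sequence is in the config list
out['sequence_config'] = out.apply(lambda x: x['sequence_index'] in x['sequence_config'] if x['sequence_config'] is not None else False, axis=1)

out = out[out['sequence_config']].set_index('index').drop_duplicates().drop(
    'sequence_config', axis=1).rename(columns={'sequence_index': 'sequence'})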

This gives:

            name test  sequence
index                          
901922.0    john    A     111.0
930293.0     sam    B     444.0
238394.0  ashley    C     555.0
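
The values above are floats because the left join introduces NaN for config rows that have no partner in index (G/kelly). As a rough sketch of one way around that, the variant below swaps the left join for an inner merge, so unmatched rows never appear and the integer columns keep their dtype. The lowercase sample frames are hypothetical copies of the question's data, with config['sequence'] assumed to hold Python lists already, and the names keep and mask are made up for this example.

```python
import pandas as pd

# Hypothetical lowercase-column copies of the question's data.
# config["sequence"] already holds Python lists (None where missing).
config = pd.DataFrame({
    "sequence": [[111, 222, 333], [222, 444, 888], [111, 222, 333],
                 [999, 777, 555], [111, 222, 333], [111, 222, 333], None],
    "test": ["A", "B", "A", "C", "D", "A", "G"],
    "name": ["john", "sam", "ashley", "ashley", "john", "john", "kelly"],
})
index = pd.DataFrame({
    "index": [901922, 930293, 238203, 238394, 483472,
              901922, 264225, 465126, 554216],
    "test": ["A", "B", "A", "C", "D", "A", "F", "A", "B"],
    "name": ["john", "sam", "ashley", "ashley", "john",
             "john", "greg", "mary", "peter"],
    "sequence": [111, 444, 888, 555, 777, 111, 111, 555, 333],
})

# Inner merge on test/name: config rows with no partner in index are
# dropped, so no NaN is introduced and integers stay integers.
out = config.merge(index, on=["test", "name"],
                   suffixes=("_config", "_index"))

# Keep rows whose index-side sequence is one of the allowed config values.
keep = [
    isinstance(allowed, list) and seq in allowed
    for seq, allowed in zip(out["sequence_index"], out["sequence_config"])
]
mask = pd.Series(keep, index=out.index)

out = (
    out.loc[mask, ["index", "test", "name", "sequence_index"]]
    .rename(columns={"sequence_index": "sequence"})
    .drop_duplicates()
    .reset_index(drop=True)
)
print(out)
```

The trade-off is that an inner merge silently discards config rows that have no match at all, which is what the question asks for, whereas the left join keeps them around until the filter step.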
