Pandas: Create a new column with column name and cell of matching string
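I am searching a large spreadsheet with 300 columns and over 200,000 rows. I want to create a column that contains the column header and the matching cell value, something that looks like "Column||Value". I have the search terms and the join aggregator. I can get the row index name, but I'm struggling to get the matching column and the specific cell. Here is my code so far: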
import pandas as pd

df = pd.read_excel(r"Test_file")
mask = df.astype(str).applymap(lambda x: any(y in x for y in ['Chann', 'Midm'])).any(axis=1)
df['extract'] = df.loc[mask]  # This only gives me the index name. I would like the actual matched cell contents.
# df['extract2'] = <column name>  (pseudocode: the name of the matching column)
df['Match'] = df[['extract', 'extract2']].agg('||'.join, axis=1)
df = df.drop(['extract', 'extract2'], axis=1)
The final output should look like this:
[Image: Output]
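You can first create a mask for a specific column (I slightly edited your second line), then create a new 'Match' column with every value initialized to 'No Match', and finally overwrite the rows selected by the mask with the desired format ("Column || Value"). I implemented this in the following example code: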
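import pandas as pd
from IPython.display import display  # built in inside Jupyter; use print(df) in a plain script

def match_column(df, column_name):
    # True where the cell in column_name contains any of the search terms
    column_mask = df[column_name].astype(str).str.contains('Chann|Midm')
    # Start with 'No Match' everywhere
    df['Match'] = 'No Match'
    # Overwrite only the matching rows with "Column || Value"
    df.loc[column_mask, 'Match'] = column_name + ' || ' + df[column_name].astype(str)
    return df

df = pd.DataFrame({
    'Segment': ['Government', 'Government', 'Midmarket', 'Midmarket', 'Government', 'Channel Partners'],
    'Country': ['Canada', 'Germany', 'France', 'Canada', 'France', 'France']
})
display(df)
df = match_column(df, 'Segment')
display(df)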
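Output:
However, this only works for a single column. I don't know what output you want when there are matches in multiple columns (please specify if you can).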
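In case it helps to pin down the multi-column behaviour, here is one possible interpretation as a rough sketch: list every hit in the row, joined by a separator. The helper name `match_all_columns`, the `sep` parameter, and the `'; '` separator are my own assumptions, not something from the question, and the row-wise `apply` may be slow on 200,000 rows.

```python
import re
import pandas as pd

def match_all_columns(df, substrings, sep='; '):
    """Hypothetical helper: list every 'Column || Value' hit found in a row."""
    # Escape the search terms so they are matched literally, not as regex syntax
    pattern = '|'.join(re.escape(s) for s in substrings)
    text = df.astype(str)
    # Boolean frame: True where a cell contains any search term
    hits = text.apply(lambda col: col.str.contains(pattern, regex=True))
    # "Column || Value" for every cell, blanked out (NaN) where there is no hit
    labelled = text.apply(lambda col: str(col.name) + ' || ' + col).where(hits)
    out = df.copy()
    # Join the surviving labels per row; rows without any hit become 'No Match'
    out['Match'] = (
        labelled.apply(lambda row: sep.join(row.dropna()), axis=1)
        .replace('', 'No Match')
    )
    return out

df = pd.DataFrame({
    'Segment': ['Government', 'Government', 'Midmarket', 'Midmarket', 'Government', 'Channel Partners'],
    'Country': ['Canada', 'Germany', 'France', 'Canada', 'France', 'France']
})
result = match_all_columns(df, ['Chann', 'Midm', 'Fran'])
print(result)
```

With the sample data, a row such as Midmarket/France would carry both hits in its 'Match' cell, while the input frame itself is left unmodified because the helper works on a copy.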
Update:
If you want to take a list of columns as input and match on the first instance, you can use this instead:
import pandas as pd
from IPython.display import display  # built in inside Jupyter; use print(df) in a plain script

def match_first_column(df, column_list):
    substrings = ['Chann', 'Midm', 'Fran']
    df['Match'] = 'No Match'
    # iterate over rows
    for index, row in df.iterrows():
        # iterate over column names
        for column_name in column_list:
            # cast to str so non-string cells (numbers, NaN) do not raise
            column_value = str(row[column_name])
            # if a match is found
            if any(x in column_value for x in substrings):
                # add match string
                df.loc[index, 'Match'] = column_name + ' || ' + column_value
                # stop iterating and move to next row
                break
    return df

df = pd.DataFrame({
    'Segment': ['Government', 'Government', 'Midmarket', 'Midmarket', 'Government', 'Channel Partners'],
    'Country': ['Canada', 'Germany', 'France', 'Canada', 'France', 'France']
})
display(df)
column_list = df.columns.tolist()
display(match_first_column(df, column_list))
Output:
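Since the question mentions 300 columns and more than 200,000 rows, the nested `iterrows` loop may become slow. Below is a rough vectorized sketch of the same first-match idea. The function name `match_first_column_vectorized` and the `substrings` parameter are my own assumptions; it relies on NumPy's `argmax` returning the position of the first `True` in each row.

```python
import re
import numpy as np
import pandas as pd

def match_first_column_vectorized(df, column_list, substrings):
    """Hypothetical vectorized variant: label each row with its first matching column."""
    # Escape the search terms so they are matched literally
    pattern = '|'.join(re.escape(s) for s in substrings)
    text = df[column_list].astype(str)
    # Boolean array: True where a cell contains any search term
    hits = text.apply(lambda col: col.str.contains(pattern, regex=True)).to_numpy(dtype=bool)
    has_match = hits.any(axis=1)
    # Position of the first True per row (0 when the row has no hit; masked out below)
    first_pos = hits.argmax(axis=1)
    names = text.columns.to_numpy()[first_pos]
    values = text.to_numpy()[np.arange(len(text)), first_pos]
    labels = (
        pd.Series(names, index=df.index).astype(str)
        + ' || '
        + pd.Series(values, index=df.index).astype(str)
    )
    out = df.copy()
    # Keep the label only where the row really had a hit
    out['Match'] = labels.where(has_match, 'No Match')
    return out

df = pd.DataFrame({
    'Segment': ['Government', 'Government', 'Midmarket', 'Midmarket', 'Government', 'Channel Partners'],
    'Country': ['Canada', 'Germany', 'France', 'Canada', 'France', 'France']
})
result = match_first_column_vectorized(df, ['Segment', 'Country'], ['Chann', 'Midm', 'Fran'])
print(result)
```

The order of `column_list` decides which column wins when several columns match in the same row, just like in the loop version.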
You can try:
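# Assumes the helper columns 'extract' and 'extract2' from the question already exist in df
mask = df.astype(str).apply(lambda col: col.str.contains('Chann|Midm')).any(axis=1)
# Join the two helper columns row by row, for the matching rows only
df.loc[mask, 'Match'] = df.loc[mask, ['extract', 'extract2']].astype(str).agg('||'.join, axis=1)
df['Match'] = df['Match'].fillna('No Match')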