pandas: assigning values back to unknown column

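I have some dataframes whose columns hold string values (sentences). Each of these dataframes has column names that contain the word 'gold' combined with other words (e.g., df.columns: 'gold_data', 'dataset_gold', ... etc.), or the word 'labeled' combined with other words (e.g., df.columns: 'labeled_data', 'dataset_labeled', ... etc.), or both 'gold' and 'labeled' combined with other words.

Here is an example of what a dataframe looks like when both column names are present.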

import pandas as pd

df = pd.DataFrame({'gold_data':['hello the weather nice','this is interesting','the weather is good'],
                   'data2':['goodbye','the plant is green','the weather is sunny'],
                   'new_labeled_dataset':['hello','there is no food in the fridge','this weather amazing']})

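I tried to process the strings in a column depending on which columns are present, and to return a dataframe containing the rows of the original dataframe for which the condition is true, as follows.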

result = []

for index, entry in df.iterrows():
    if not any(df.columns.str.contains(pat='labeled')):
        text = entry.filter(regex='gold').squeeze()
    else:
        text = entry.filter(regex='labeled').squeeze()


    if len(text.split()) > 2:
        # assignment? = 'new_info:' + text (this is where I do not know how to assign back to the column that was processed)
        result.append(entry)


print(pd.DataFrame(result))

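So, what I mean is: if no column name contains 'labeled', take the text from the column containing the word 'gold'; otherwise take the text from the 'labeled' column. But since I do not know the full name of the column, I am not sure how to assign the processed text back to it. The desired output should be: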

                      gold_data                 data2                   augmented_new
0  new_info:this is interesting    the plant is green  there is no food in the fridge
1  new_info:the weather is good  the weather is sunny            this weather amazing

I tried to get the full name of the column and assign to it, but that is not correct either.

# df[col for col in df if 'gold' or 'labeled' in col] ='new_info:' + text
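One likely reason the attempt above fails: a bare comprehension is not valid inside `df[...]`, and `'gold' or 'labeled' in col` is parsed as `'gold' or ('labeled' in col)`, which is always truthy because `'gold'` is a non-empty string. A minimal sketch of the difference, reusing the example dataframe from above (the variable names `always_true` and `matching` are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'gold_data': ['hello the weather nice', 'this is interesting', 'the weather is good'],
    'data2': ['goodbye', 'the plant is green', 'the weather is sunny'],
    'new_labeled_dataset': ['hello', 'there is no food in the fridge', 'this weather amazing'],
})

# Parsed as `'gold' or ('labeled' in col)`: the non-empty string 'gold'
# is truthy, so every column passes the test.
always_true = [col for col in df.columns if 'gold' or 'labeled' in col]

# Each substring needs its own membership test.
matching = [col for col in df.columns if 'gold' in col or 'labeled' in col]

print(always_true)  # ['gold_data', 'data2', 'new_labeled_dataset']
print(matching)     # ['gold_data', 'new_labeled_dataset']
```

With the list of matching names in hand, a single name can be picked and used for the assignment.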

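If I understand correctly, you want to apply a string transformation to specific elements of a column that is selected by its name. If so, you can avoid iterating over every row manually and simply use pandas' apply() method on the retrieved column. Since you do not want to do this for all the strings, but only for strings of at least 3 words, you can filter them with loc. You can do it with the following code: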

# Choose which case applies
if not any(df.columns.str.contains(pat='labeled')):
    # Retrieve the name of the first column containing 'gold'
    chosen_col = next(col for col in df.columns if 'gold' in col)
else:
    # Retrieve the name of the first column containing 'labeled'
    chosen_col = next(col for col in df.columns if 'labeled' in col)
# Keep only the rows whose string has more than two words;
# .copy() avoids a possible SettingWithCopyWarning on the assignment below
df = df.loc[df[chosen_col].str.split().map(len) > 2].copy()
# Prefix every remaining string in the chosen column
df[chosen_col] = df[chosen_col].apply(lambda x: 'new_info:' + x)
print(df)

Since you provided two different dataframes, the results obtained with this code are:

             gold_data                 data2                      new_labeled_dataset
1  this is interesting    the plant is green  new_info:there is no food in the fridge
2  the weather is good  the weather is sunny            new_info:this weather amazing

For the last one:

                         gold_data                 data2                   augmented_new
0  new_info:hello the weather nice               goodbye                           hello
1     new_info:this is interesting    the plant is green  there is no food in the fridge
2     new_info:the weather is good  the weather is sunny            this weather amazing
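If the same steps are needed for several dataframes, they could be wrapped in a small helper. The following is only a sketch under some assumptions: the function name `prefix_long_entries` and its parameters are made up for illustration, it picks the first matching column when several match, and it uses vectorized string methods instead of apply(). It returns a new dataframe and leaves the input untouched.

```python
import pandas as pd

def prefix_long_entries(df, prefix='new_info:', min_words=3):
    """Hypothetical helper: keep rows whose target column has at least
    `min_words` words and prepend `prefix` to those strings."""
    labeled = [col for col in df.columns if 'labeled' in col]
    gold = [col for col in df.columns if 'gold' in col]
    if not labeled and not gold:
        raise ValueError("no column name contains 'labeled' or 'gold'")
    # Prefer a 'labeled' column; fall back to 'gold' otherwise.
    target = (labeled or gold)[0]
    # Word count per row, computed without a Python-level loop.
    mask = df[target].str.split().str.len() >= min_words
    out = df.loc[mask].copy()  # explicit copy, so the input is not modified
    out[target] = prefix + out[target]
    return out

df = pd.DataFrame({
    'gold_data': ['hello the weather nice', 'this is interesting', 'the weather is good'],
    'data2': ['goodbye', 'the plant is green', 'the weather is sunny'],
    'new_labeled_dataset': ['hello', 'there is no food in the fridge', 'this weather amazing'],
})

print(prefix_long_entries(df))
```

For the example dataframe this keeps rows 1 and 2 and prefixes the 'new_labeled_dataset' column, matching the first result above.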