pandas: assigning values back to unknown column

Question

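I have some dataframes whose columns contain string values (sentences). Each of these dataframes has column names that contain the word 'gold' combined with other words (e.g., df.columns: 'gold_data', 'dataset_gold', etc.), or the word 'labeled' combined with other words (e.g., df.columns: 'labeled_data', 'dataset_labeled', etc.), or both 'gold' and 'labeled' combined with other words.

Here is an example of what the dataframe looks like when both kinds of column names are present.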

import pandas as pd

df = pd.DataFrame({'gold_data':['hello the weather nice','this is interesting','the weather is good'],
                   'data2':['goodbye','the plant is green','the weather is sunny'],
                   'new_labeled_dataset':['hello','there is no food in the fridge','this weather amazing']})

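I am trying to process the strings in one column, chosen according to which columns exist, and return the rows of the original dataframe for which the condition is true, as follows.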

result = []

for index, entry in df.iterrows():
    if not any(df.columns.str.contains(pat='labeled')):
        text = entry.filter(regex='gold').squeeze()
    else:
        text = entry.filter(regex='labeled').squeeze()


    if len(text.split()) > 2:
        # assignment? = 'new_info:' + text  (this is where I do not know how to assign back to the column that was processed)
        result.append(entry)


print(pd.DataFrame(result))

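So, what I mean is: if no column name contains 'labeled', take the text from the column whose name contains 'gold'; otherwise take the text from the 'labeled' column. But since I do not know the full name of the column, I am not sure how to assign the processed text back to that column. The desired output should be: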

                      gold_data                 data2                   augmented_new
0  new_info:this is interesting    the plant is green  there is no food in the fridge
1  new_info:the weather is good  the weather is sunny            this weather amazing

I tried to get the full name of the column and assign to it, but that is not correct either.

# df[col for col in df if 'gold' or 'labeled' in col] ='new_info:' + text

Answer 1

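If I understand correctly, you want to apply a string transformation to specific elements of a column that is selected by its name. If so, you can avoid iterating over each row manually and simply call apply() on the retrieved column. Since you do not want to do this for all strings, but only for strings of at least 3 words, you can filter the rows with the loc indexer. You can do it with the following code: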

# Choose which case applies
if not any(df.columns.str.contains(pat='labeled')):
    # Retrieve the name of the first column containing 'gold'
    chosen_col = next(col for col in df.columns if 'gold' in col)
else:
    # Retrieve the name of the first column containing 'labeled'
    chosen_col = next(col for col in df.columns if 'labeled' in col)
# Keep only the rows whose text has more than 2 words;
# copy() so the assignment below does not act on a view of the original df
df = df.loc[df[chosen_col].str.split().map(len) > 2].copy()
# Prefix every string in the retrieved column
df[chosen_col] = df[chosen_col].apply(lambda x: 'new_info:' + x)
print(df)

Since you provided two different dataframes, the result of this code for the first one is:

             gold_data                 data2                      new_labeled_dataset
1  this is interesting    the plant is green  new_info:there is no food in the fridge
2  the weather is good  the weather is sunny            new_info:this weather amazing

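And for the last one (where the third column is named 'augmented_new', so the 'gold' column is the one processed):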

                         gold_data                 data2                   augmented_new
0  new_info:hello the weather nice               goodbye                           hello
1     new_info:this is interesting    the plant is green  there is no food in the fridge
2     new_info:the weather is good  the weather is sunny            this weather amazing
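
To run the same steps on several dataframes, the logic above could be wrapped in a small function. The following is only a sketch under a few assumptions: the name prefix_long_texts and its parameters prefix and min_words are made up for illustration, the first matching column is used when several names match, and the second dataframe is assumed to be the same data with the third column renamed to 'augmented_new'. It uses plain string concatenation on the column instead of apply(), which should give the same result here.

```python
import pandas as pd


def prefix_long_texts(df, prefix='new_info:', min_words=3):
    """Hypothetical helper: keep rows whose chosen text column has at least
    `min_words` words and prepend `prefix` to that column."""
    # Prefer a column whose name contains 'labeled'; fall back to 'gold'
    labeled_cols = df.columns[df.columns.str.contains('labeled')]
    gold_cols = df.columns[df.columns.str.contains('gold')]
    candidates = labeled_cols if len(labeled_cols) > 0 else gold_cols
    if len(candidates) == 0:
        raise ValueError("no column name contains 'labeled' or 'gold'")
    chosen_col = candidates[0]  # assumption: use the first match

    # Boolean mask of rows with enough words; copy so the input stays untouched
    mask = df[chosen_col].str.split().str.len() >= min_words
    out = df.loc[mask].copy()
    # Vectorized string concatenation instead of apply()
    out[chosen_col] = prefix + out[chosen_col]
    return out


df_labeled = pd.DataFrame({
    'gold_data': ['hello the weather nice', 'this is interesting', 'the weather is good'],
    'data2': ['goodbye', 'the plant is green', 'the weather is sunny'],
    'new_labeled_dataset': ['hello', 'there is no food in the fridge', 'this weather amazing'],
})
# Assumed second case: same data, but no column name contains 'labeled'
df_gold_only = df_labeled.rename(columns={'new_labeled_dataset': 'augmented_new'})

result_labeled = prefix_long_texts(df_labeled)
result_gold = prefix_long_texts(df_gold_only)
print(result_labeled)
print(result_gold)
```

As a side note on the attempt in the question: the condition 'gold' or 'labeled' in col is always truthy, because the non-empty string 'gold' is evaluated on its own before the in test, so it cannot be used to pick the column; each substring has to be tested against the column name separately.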

python

variable-assignment

pandas