pandas: assigning values back to an unknown column
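I have some dataframes whose columns hold string values (sentences). Each of these dataframes has column names that contain the word 'gold' combined with other words (e.g. df.columns: 'gold_data', 'dataset_gold', etc.), or the word 'labeled' combined with other words (e.g. df.columns: 'labeled_data', 'dataset_labeled', etc.), or both 'gold' and 'labeled' combined with other words.
Here is an example of what a dataframe looks like when both column names are present.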
import pandas as pd
df = pd.DataFrame({'gold_data': ['hello the weather nice', 'this is interesting', 'the weather is good'],
                   'data2': ['goodbye', 'the plant is green', 'the weather is sunny'],
                   'new_labeled_dataset': ['hello', 'there is no food in the fridge', 'this weather amazing']})
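I tried to process the strings in a column depending on which columns are present, and to return a dataframe containing the rows of the original dataframe for which a condition is true, as follows.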
result = []
for index, entry in df.iterrows():
    if not any(df.columns.str.contains(pat='labeled')):
        text = entry.filter(regex='gold').squeeze()
    else:
        text = entry.filter(regex='labeled').squeeze()
    if len(text.split()) > 2:
        # assignment? = 'new_info:' + text (this is where I do not know how to assign back to the column which was processed)
        result.append(entry)
print(pd.DataFrame(result))
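So what I mean is: if no column name contains 'labeled', take the text from the column whose name contains 'gold'; otherwise take the text from the 'labeled' column. But since I do not know the full name of the column, I am not sure how to assign the processed text back to it.
The desired output should be: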
   gold_data                     data2                 augmented_new
0  new_info:this is interesting  the plant is green    there is no food in the fridge
1  new_info:the weather is good  the weather is sunny  this weather amazing
I tried to get the full name of the column and assign to it, but that is not correct either.
# df[col for col in df if 'gold' or 'labeled' in col] ='new_info:' + text
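If I understand correctly, you want to apply a string transformation to specific elements of a column selected by its name. If so, you can avoid iterating over each row manually and simply use the pandas apply() method on the retrieved column. Since you do not want to do this for all the strings, but only for strings of at least 3 words, you can filter them with the loc indexer. You can do it with the following code: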
# Choose which case applies
if not any(df.columns.str.contains(pat='labeled')):
    # Retrieve the 'gold' column name
    chosen_col = next(filter(lambda x: 'gold' in x, df.columns))
else:
    # Retrieve the 'labeled' column name
    chosen_col = next(filter(lambda x: 'labeled' in x, df.columns))
# Filter rows; .copy() avoids a possible SettingWithCopyWarning on the assignment below
df = df.loc[df[chosen_col].str.split().map(len) > 2].copy()
# Transform all the strings in the retrieved column
df[chosen_col] = df[chosen_col].apply(lambda x: 'new_info:' + x)
print(df)
Since you provided two different dataframes, the results obtained with this code are:
   gold_data            data2                 new_labeled_dataset
1  this is interesting  the plant is green    new_info:there is no food in the fridge
2  the weather is good  the weather is sunny  new_info:this weather amazing
And for the last one:
   gold_data                        data2                 augmented_new
0  new_info:hello the weather nice  goodbye               hello
1  new_info:this is interesting     the plant is green    there is no food in the fridge
2  new_info:the weather is good     the weather is sunny  this weather amazing
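If you need the same logic for several dataframes, it may be convenient to wrap it in a function. The following is only a sketch: the helper name prefix_long_entries and its parameters prefix and min_words are hypothetical, and it assumes that when several column names match, the first one is the one you want. It uses the vectorized .str accessor instead of apply() and returns a new dataframe rather than overwriting df.

```python
import pandas as pd


def prefix_long_entries(df, prefix='new_info:', min_words=3):
    """Hypothetical helper: filter rows and prefix the text in the
    'labeled' column, or in the 'gold' column if no 'labeled' one exists."""
    labeled = [col for col in df.columns if 'labeled' in col]
    gold = [col for col in df.columns if 'gold' in col]
    # Prefer a 'labeled' column; fall back to 'gold' (assumption: first match wins)
    candidates = labeled or gold
    if not candidates:
        raise ValueError("no column name contains 'labeled' or 'gold'")
    chosen_col = candidates[0]
    # Keep only rows whose text has at least min_words whitespace-separated words
    mask = df[chosen_col].str.split().str.len() >= min_words
    out = df.loc[mask].copy()
    # Assign the processed text back to the column that was selected
    out[chosen_col] = prefix + out[chosen_col]
    return out


df = pd.DataFrame({'gold_data': ['hello the weather nice', 'this is interesting', 'the weather is good'],
                   'data2': ['goodbye', 'the plant is green', 'the weather is sunny'],
                   'new_labeled_dataset': ['hello', 'there is no food in the fridge', 'this weather amazing']})
result = prefix_long_entries(df)
print(result)
```

Because the function works on a copy, the original df is left unchanged, so you can call it on each dataframe without worrying about side effects.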