Change string values with new values contained in another data frame

Question

I have a csv with thousands of rows of sales data that looks like this:

pd.DataFrame({
    'Item_name': ['guacamole', 'morita', 'verde', 'pico', 'tomatillo'],
    'Inv_number': ['0001', '0002', '0003', '0004', '0005'],
    'Store_name': ['alex', 'pusateris', 'wholefoods', 'longos', 'metro']
})

The item names have now been changed to:

pd.DataFrame({
    'Item_name': ['Dip guacamole', 'morita Spicy', ' Salsa verde', 'Pico de Gallo', 'Roasted tomatillo']
})

What I want to achieve is to change the old names to the new names. I am using the following code for each item, but this will take forever!

sales_df['item_code']= sales_df['item_code'].replace({'Guacamole':'Dip Guacamole'})

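Is there a way to simplify this code? Perhaps by creating a list of the new names and iterating over the sales data?

Looking forward to your input.

Thanks!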

Answer 1

You can use the replace function with a dictionary that maps each old name to its new name:

dic = {'Guacamole':'Dip Guacamole', 'morita': 'morita Spicy'}
sales_df = sales_df.replace({"item_code": dic})
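
To avoid typing every pair by hand, the dictionary can be built from the two name lists. This is a minimal sketch that assumes the old and new names are listed in the same order (the question does not say so explicitly); `old_names`, `new_names` and `rename_map` are illustrative names, and the small `sales_df` is made-up sample data using the `Item_name` column from the question's data frame.

```python
import pandas as pd

# Old and new names, ASSUMED to be in matching order.
old_names = ['guacamole', 'morita', 'verde', 'pico', 'tomatillo']
new_names = ['Dip guacamole', 'morita Spicy', 'Salsa verde',
             'Pico de Gallo', 'Roasted tomatillo']

# Build the old -> new lookup in one step.
rename_map = dict(zip(old_names, new_names))

# Small illustrative sales table (hypothetical rows).
sales_df = pd.DataFrame({
    'Item_name': ['guacamole', 'pico', 'guacamole', 'water'],
    'Store_name': ['alex', 'longos', 'metro', 'nature'],
})

# replace() matches whole cell values; names missing from the
# mapping (here 'water') are left unchanged.
sales_df['Item_name'] = sales_df['Item_name'].replace(rename_map)
print(sales_df)
```

Note that the match is exact and case-sensitive, so 'Guacamole' and 'guacamole' would need separate keys.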

Answer 2

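If the inventory number stays the same, you should use it as an index. I would create a mapping between the index and the names and apply it to the old table: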

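# Series of new names indexed by inventory number
name_dict = new_df.set_index("Inv_number")["Item_name"].drop_duplicates()
# Look up each row's inventory number to get its new name
old_df["new_names"] = old_df["Inv_number"].map(name_dict)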
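
A self-contained sketch of this idea follows. It assumes the table of new names also carries an `Inv_number` column, which the question's sample does not show, so `new_df` here is hypothetical. Rows whose inventory number has no new name fall back to the old name.

```python
import pandas as pd

old_df = pd.DataFrame({
    'Item_name': ['guacamole', 'morita', 'verde'],
    'Inv_number': ['0001', '0002', '0003'],
})

# ASSUMPTION: the new names come with their inventory numbers.
new_df = pd.DataFrame({
    'Item_name': ['Dip guacamole', 'morita Spicy'],
    'Inv_number': ['0001', '0002'],
})

# One name per inventory number; map() needs a unique index.
name_by_inv = (new_df.drop_duplicates('Inv_number')
                     .set_index('Inv_number')['Item_name'])

# Look up the new name; keep the old one where no match exists.
old_df['new_names'] = (old_df['Inv_number'].map(name_by_inv)
                                           .fillna(old_df['Item_name']))
print(old_df)
```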

Answer 3

Use fuzzy string matching here.

# Python env: pip install thefuzz
# Anaconda env: conda install thefuzz

from thefuzz import process

THRESHOLD = 90  # reject all values below this score (%)

# df: your original dataframe
# df1: your new names
df['Item_name_new'] = \
    df['Item_name'].apply(lambda x: process.extractOne(x, df1['Item_name'],
                              score_cutoff=THRESHOLD)).str[0]
print(df)

# Output
   Item_name Inv_number  Store_name      Item_name_new
0  guacamole       0001        alex      Dip guacamole
1     morita       0002   pusateris       morita Spicy
2      verde       0003  wholefoods        Salsa verde
3       pico       0004      longos      Pico de Gallo
4  tomatillo       0005       metro  Roasted tomatillo
5      water       0006      nature               None
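
If installing thefuzz is not an option, a plain substring check may be enough for data like the sample, where each new name contains the old one. This is a different and much cruder technique than fuzzy matching, sketched here under that assumption; `find_new_name` is a hypothetical helper, not part of pandas or thefuzz.

```python
import pandas as pd

df = pd.DataFrame({'Item_name': ['guacamole', 'morita', 'verde',
                                 'pico', 'tomatillo', 'water']})
df1 = pd.DataFrame({'Item_name': ['Dip guacamole', 'morita Spicy',
                                  ' Salsa verde', 'Pico de Gallo',
                                  'Roasted tomatillo']})


def find_new_name(old_name, candidates):
    """Return the first candidate that contains old_name
    (case-insensitive), stripped of whitespace, else None."""
    needle = old_name.lower()
    for candidate in candidates:
        if needle in candidate.lower():
            return candidate.strip()
    return None


candidates = df1['Item_name'].tolist()
df['Item_name_new'] = [find_new_name(name, candidates)
                       for name in df['Item_name']]
print(df)
```

This returns the first hit only, so it can pick the wrong name when one old name is a substring of several new names.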

python

data-analysis

pandas