NLP classification labels have many similarities; replace them so there is only one

Question

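I have been trying to use the fuzzywuzzy library in Python to find the percentage similarity between the strings in my labels. The problem is that even after a find and replace, many strings are still very similar to one another.

I'm wondering if anyone here has a method for cleaning up labels. As an example, I have these labels that look nearly identical: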

 'Cable replaced',
 'Cable replaced.',
 'Camera is up and recording',
 'Chat closed due to inactivity.',
 'Closing as duplicate',
 'Closing as duplicate.',
 'Closing duplicate ticket.',
 'Closing ticket.',

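Ideally, I would like to find these variants and replace them with one common string, so that there is only a single instance of 'closing as duplicate'. Any ideas or suggestions are much appreciated.

To give a more thorough example, here is what I am trying to do: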

import fuzzywuzzy
from fuzzywuzzy import fuzz, process
import chardet

res = h['resolution'].unique()
res.sort()
res

'All APs are up and stable hence resoling TT  Logs are updated in WL',
'Asset returned to IT hub closing ticket.',
'Auto Resolved - No reply from requester', 'Cable replaced',
'Cable replaced.', 'Camera is up and recording',
'Chat closed due to inactivity.', 'Closing as duplicate',
'Closing as duplicate.', 'Closing duplicate ticket.',
'Closing ticket.', 'Completed', 'Connection to IDF restored',

Oh, look at that. Let's see if we can find strings like 'cable replaced'.

# get the top 10 closest matches to "cable replaced"
matches = fuzzywuzzy.process.extract("cable replaced", res, limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)

# take a look at them
matches

[('cable replaced', 100),
 ('cable replaced.', 100),
 ('replaced cable', 100),
 ('replaced scanner cable', 78),
 ('replaced scanner cable.', 78),
 ('scanner cable replaced', 78),
 ('battery replaced', 73),
 ('replaced', 73),
 ('replaced battery', 73),
 ('replaced battery.', 73)]
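
For context, the 100 scores for 'cable replaced.' and 'replaced cable' seem to follow from how token_sort_ratio preprocesses its inputs. My reading of the library's behavior is that it lowercases both strings, replaces non-alphanumeric characters with spaces, sorts the tokens, and only then compares them. Here is a rough stdlib approximation of that idea. simple_token_sort_ratio is a hypothetical name, and difflib stands in for fuzzywuzzy's scorer, so scores may not match exactly:

```python
import re
from difflib import SequenceMatcher


def simple_token_sort_ratio(a, b):
    """Approximate token-sort similarity on a 0-100 scale (illustrative only)."""

    def normalize(s):
        # Lowercase, turn runs of non-alphanumerics into spaces, then sort tokens
        tokens = re.sub(r"[^a-z0-9]+", " ", s.lower()).split()
        return " ".join(sorted(tokens))

    return round(100 * SequenceMatcher(None, normalize(a), normalize(b)).ratio())


# Word order and the trailing period no longer matter
print(simple_token_sort_ratio("cable replaced", "replaced cable."))  # 100
# A different noun keeps the score under a 90 cutoff
print(simple_token_sort_ratio("cable replaced", "battery replaced"))
```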

Hmm, maybe I should create a function that replaces strings with a similarity score of 90 or higher.

# function to replace rows in the provided column of the provided dataframe
# that match the provided string above the provided ratio with the provided string
def replace_matches_in_column(df, column, string_to_match, min_ratio = 90):
    # get a list of unique strings
    strings = df[column].unique()
    
    # get the top 10 closest matches to our input string
    matches = fuzzywuzzy.process.extract(string_to_match, strings, 
                                         limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)

    # keep only the matches with a ratio >= min_ratio
    close_matches = [match[0] for match in matches if match[1] >= min_ratio]

    # get the rows of all the close matches in our dataframe
    rows_with_matches = df[column].isin(close_matches)

    # replace all rows with close matches with the input string
    df.loc[rows_with_matches, column] = string_to_match
    
    # let us know the function's done
    print("All done!")

# use the function we just wrote to replace close matches to "cable replaced" with "cable replaced"
replace_matches_in_column(df=h, column='resolution', string_to_match="cable replaced")

# get all the unique values in the 'resolution' column
res = h['resolution'].unique()

# sort them alphabetically and then take a closer look
res.sort()
res

'auto resolved - no reply from requester', 'battery replaced',
       'cable replaced', 'camera is up and recording',
       'chat closed due to inactivity.', 'check ok',

Great! Now I have only one instance of 'cable replaced'. Let's verify:

# get the top 10 closest matches to "cable replaced"
matches = fuzzywuzzy.process.extract("cable replaced", res, limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)

# take a look at them
matches

[('cable replaced', 100),
 ('replaced scanner cable', 78),
 ('replaced scanner cable.', 78),
 ('scanner cable replaced', 78),
 ('battery replaced', 73),
 ('replaced', 73),
 ('replaced battery', 73),
 ('replaced battery.', 73),
 ('replaced.', 73),
 ('hardware replaced', 71)]

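Yes! That looks good. This example works well, but as you can see, it is fairly manual. Ideally, I would like to automate this for all the strings in my resolution column. Any ideas?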

Answer 1

Using the function from this link, you can build a mapping like this:

from fuzzywuzzy import fuzz


def replace_similars(input_list):
    # Replace, in place, every later string that is at least 90% similar
    # to an earlier string with that earlier string
    for i in range(len(input_list)):
        for j in range(i + 1, len(input_list)):
            if fuzz.ratio(input_list[i], input_list[j]) >= 90:
                input_list[j] = input_list[i]


def generate_mapping(input_list):
    new_list = input_list[:]  # copy list
    replace_similars(new_list)

    mapping = {}
    for i in range(len(input_list)):
        mapping[input_list[i]] = new_list[i]

    return mapping

Let's see how to use it:

# Let's assume items in labels are unique.
# If they are not unique, it will work anyway but will be slower.
labels = [
    "Cable replaced",
    "Cable replaced.",
    "Camera is up and recording",
    "Chat closed due to inactivity.",
    "Closing as duplicate",
    "Closing as duplicate.",
    "Closing duplicate ticket.",
    "Closing ticket.",
    "Completed",
    "Connection to IDF restored",
]

mapping = generate_mapping(labels)


# Print to see mapping
print("\n".join(["{:<50}: {}".format(k, v) for k, v in mapping.items()]))

Output:

Cable replaced                                    : Cable replaced
Cable replaced.                                   : Cable replaced
Camera is up and recording                        : Camera is up and recording
Chat closed due to inactivity.                    : Chat closed due to inactivity.
Closing as duplicate                              : Closing as duplicate
Closing as duplicate.                             : Closing as duplicate
Closing duplicate ticket.                         : Closing duplicate ticket.
Closing ticket.                                   : Closing ticket.
Completed                                         : Completed
Connection to IDF restored                        : Connection to IDF restored
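
Note that with a cutoff of 90 only the pairs that differ by a trailing period are merged; 'Closing duplicate ticket.' keeps its own entry. Before settling on a cutoff, it can help to list which pairs actually clear it. Below is a minimal sketch, assuming stdlib difflib as a stand-in for fuzz.ratio (scores can differ slightly from fuzzywuzzy's). similarity and similar_pairs are hypothetical helper names:

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    # Stand-in for fuzz.ratio: similarity score on a 0-100 scale
    return round(100 * SequenceMatcher(None, a, b).ratio())


def similar_pairs(labels, threshold=90):
    # Score every unordered pair once and keep those at or above the threshold
    scored = ((similarity(a, b), a, b) for a, b in combinations(labels, 2))
    return sorted((p for p in scored if p[0] >= threshold), reverse=True)


sample = ["Cable replaced", "Cable replaced.", "Closing ticket."]
for score, a, b in similar_pairs(sample, threshold=90):
    print(score, repr(a), repr(b))
```

Lowering the threshold and rerunning shows which extra pairs would be merged, before anything in the dataframe is touched.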

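You can build this mapping from h['resolution'].unique() and then use it to update the h['resolution'] column. Since I don't have your dataframe, I couldn't try it. Based on this, I think you can use the following: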

for k, v in mapping.items():
    if k != v:
        h.loc[h['resolution'] == k, 'resolution'] = v
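
One thing that may help before any fuzzy matching: many of the variants in the question differ only by case, extra spaces, or a trailing period, and those can be collapsed with plain string normalization. That shrinks the list the pairwise comparison has to handle, which matters because it compares every pair, so the cost grows quadratically with the number of unique labels. Below is a minimal sketch, where normalize_label is a hypothetical helper and the set of punctuation to strip is an assumption:

```python
import re


def normalize_label(label):
    # Lowercase, collapse runs of whitespace, and drop trailing punctuation
    label = re.sub(r"\s+", " ", label.strip().lower())
    return label.rstrip(".!?,;: ")


labels = ["Cable replaced", "Cable replaced.", "Closing as duplicate",
          "Closing as duplicate.", "Closing  ticket."]

# Map each raw label to its normalized form; exact duplicates collapse here
normalized = {label: normalize_label(label) for label in labels}
print(sorted(set(normalized.values())))
```

The normalized values could then be passed to generate_mapping, and the resulting dict applied in one step with something like h['resolution'].map(mapping), assuming every value in the column appears as a key.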

python

automation

text

nlp

machine-learning