Merge related words in NLP
I want to define a new word that contains the count values from two (or more) different words. For example:
Words Frequency
0 mom 250
1 2020 151
2 the 124
3 19 82
4 mother 81
... ... ...
10 London 6
11 life 6
12 something 6
I would like to define mother as mom + mother:
Words Frequency
0 mother 331
1 2020 151
2 the 124
3 19 82
... ... ...
9 London 6
10 life 6
11 something 6
This would be an alternative way of defining groups of words that share a meaning (at least, it is for me).
Any suggestions would be appreciated.
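A minimal sketch of the row-merging itself, assuming the table above lives in a pandas DataFrame with columns Words and Frequency; the variant-to-canonical mapping is a hand-written example:

import pandas as pd

df = pd.DataFrame({'Words': ['mom', '2020', 'the', '19', 'mother'],
                   'Frequency': [250, 151, 124, 82, 81]})

# hand-written mapping of variants to a canonical word
canonical = {'mom': 'mother'}

# replace each variant with its canonical form, then sum the counts
df['Words'] = df['Words'].replace(canonical)
merged = (df.groupby('Words', as_index=False)['Frequency'].sum()
            .sort_values('Frequency', ascending=False)
            .reset_index(drop=True))
print(merged)  # mother now carries 250 + 81 = 331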
matthewreagan/WebstersEnglishDictionary
The idea would be to use this dictionary to identify similar words.
In short: run some knowledge-discovery algorithm that extracts knowledge based on English grammar.
Here is a thesaurus (18 MB).
Here is an excerpt from the thesaurus; you could try to match word alternates through some algorithm of your own.
{"word": "ma", "key": "ma_1", "pos": "noun", "synonyms": ["mamma", "momma", "mama", "mammy", "mummy", "mommy", "mom", "mum"]}
For a quick fix using an external API, here is the link: they allow you to do much more with the API, such as getting synonyms, finding multiple definitions, finding rhyming words, and more.
Update: October 21, 2020
I decided to build a Python module to handle the tasks that I outlined in this answer. The module is called wordhoard and can be downloaded from pypi.
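For a quick look at what the module does, here is a sketch of pulling synonyms with wordhoard; the Synonyms class and find_synonyms() call are from memory of the package's README and may differ in the version you install:

# sketch, assuming wordhoard's documented Synonyms interface
from wordhoard import Synonyms

synonym = Synonyms(search_string='mother')
results = synonym.find_synonyms()
print(results)  # e.g. a list containing 'mom', 'mum', 'mama', ...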
I had attempted to use Word2vec and WordNet in projects where I needed to determine the frequency of a keyword (e.g. healthcare) and a keyword's synonyms (e.g. wellness program, preventive medicine). I found that most of the NLP libraries didn't produce the results I needed, so I decided to build my own dictionary with custom keywords and synonyms. This approach has been used for both analyzing and classifying text in multiple projects.
I'm sure that someone versed in NLP technology might have a more robust solution, but the one below is similar to solutions I have used time and time again.
I coded my answer to match the word-frequency data you had in your question, but it can be modified to use any keyword and synonym dataset.
import string

# Python dictionary
# I manually created these word relationships - primary_word: synonyms
word_relationship = {"father": ['dad', 'daddy', 'old man', 'pa', 'pappy', 'papa', 'pop'],
                     "mother": ["mamma", "momma", "mama", "mammy", "mummy", "mommy", "mom", "mum"]}

# This input text is from various poems about mothers and fathers
input_text = 'The hand that rocks the cradle also makes the house a home. It is the prayers of the mother ' \
             'that keeps the family strong. When I think about my mum, I just cannot help but smile; The beauty of ' \
             'her loving heart, the easy grace in her style. I will always need my mom, regardless of my age. She ' \
             'has made me laugh, made me cry. Her love will never fade. If I could write a story, It would be the ' \
             'greatest ever told. I would write about my daddy, For he had a heart of gold. For my father, my friend, ' \
             'This to me you have always been. Through the good times and the bad, Your understanding I have had.'

# convert the input text to lowercase and split the words on whitespace
wordlist = input_text.lower().split()

# remove all punctuation from the wordlist
remove_punctuation = [''.join(ch for ch in s if ch not in string.punctuation)
                      for s in wordlist]

# list for word frequencies
wordfreq = []

# count the frequency of each word
for w in remove_punctuation:
    wordfreq.append(remove_punctuation.count(w))

word_frequencies = dict(zip(remove_punctuation, wordfreq))

word_matches = []

# loop through the dictionaries
for word, frequency in word_frequencies.items():
    for keyword, synonyms in word_relationship.items():
        match = [x for x in synonyms if word == x]
        if word == keyword or match:
            match = ' '.join(map(str, match))
            # append the keyword (mother), synonym (mom) and frequency to a list
            word_matches.append([keyword, match, frequency])

# used to hold the final keywords and frequencies
final_results = {}

# list comprehension to obtain the primary keywords and their frequencies
synonym_matches = [(entry[0], entry[2]) for entry in word_matches]

# iterate synonym_matches and total the frequency count for each keyword
# (dict.get avoids carrying a running count across unrelated keywords)
for item in synonym_matches:
    final_results[item[0]] = final_results.get(item[0], 0) + item[1]

print(final_results)
# output
{'mother': 3, 'father': 2}
Other approaches
Below are some other approaches and their out-of-the-box output.
NLTK WORDNET
In this example, I looked up the synonyms for the word 'mother.' Note that WordNet does not have the synonyms 'mom' or 'mum' linked to the word mother, yet both of these words are within my sample text above. Also note that the word 'father' is listed as a synonym for 'mother.'
from nltk.corpus import wordnet

synonyms = []
word = 'mother'
for synonym in wordnet.synsets(word):
    for item in synonym.lemmas():
        if word != synonym.name() and len(synonym.lemma_names()) > 1:
            synonyms.append(item.name())

print(synonyms)
['mother', 'female_parent', 'mother', 'fuss', 'overprotect', 'beget', 'get', 'engender', 'father', 'mother', 'sire', 'generate', 'bring_forth']
PyDictionary
In this example, I looked up the synonyms for the word 'mother' using PyDictionary, which queries synonym.com. The synonyms in this example include the words 'mom' and 'mum.' This example also includes additional synonyms that WordNet did not produce.
However, PyDictionary also produced a synonym list for 'mum' that has nothing to do with the word 'mother.' It seems that PyDictionary pulled this list from the adjective section of the page instead of the noun section. It is hard for a computer to distinguish between the adjective mum and the noun mum.
from PyDictionary import PyDictionary
dictionary_mother = PyDictionary('mother')
print(dictionary_mother.getSynonyms())
# output
[{'mother': ['mother-in-law', 'female parent', 'supermom', 'mum', 'parent', 'mom', 'momma', 'para I', 'mama', 'mummy', 'quadripara', 'mommy', 'quintipara', 'ma', 'puerpera', 'surrogate mother', 'mater', 'primipara', 'mammy', 'mamma']}]
dictionary_mum = PyDictionary('mum')
print(dictionary_mum.getSynonyms())
# output
[{'mum': ['incommunicative', 'silent', 'uncommunicative']}]
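One way to guard against this adjective/noun confusion, sketched below, is to keep only candidate synonyms that WordNet knows as nouns; this is a heuristic, not a complete fix, and the printed result is what I would expect rather than a verified output.

from nltk.corpus import wordnet

def noun_synonyms_only(candidates):
    # keep only candidates that have at least one noun sense in WordNet
    return [w for w in candidates
            if wordnet.synsets(w.replace(' ', '_'), pos=wordnet.NOUN)]

print(noun_synonyms_only(['incommunicative', 'silent', 'mum']))  # expected: ['mum']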
Some other possible approaches are using the Oxford Dictionary API or querying thesaurus.com. Both of these methods also have pitfalls. For instance, the Oxford Dictionary API requires an API key and a paid subscription based on query numbers, and thesaurus.com is missing potential synonyms that could be useful in grouping words.
https://www.thesaurus.com/browse/mother
synonyms: mom, parent, ancestor, creator, mommy, origin, predecessor, progenitor, source, child-bearer, forebearer, procreator
Update
Producing a precise synonym list for each potential word in your corpus is hard and will require a multi-pronged approach. The code below uses WordNet and PyDictionary to create a superset of synonyms. Like all the other answers, this combined approach also leads to some over-counting of word frequencies. I've been trying to reduce this over-counting by combining key and value pairs within my final dictionary of synonyms. The latter problem is much harder than I anticipated and might require me to open my own question to solve. In the end, I think that based on your use case you need to determine which approach works best and will likely need to combine several approaches.
Thanks for posting this question, because it allowed me to look at other methods for solving a complex problem.
from string import punctuation
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from PyDictionary import PyDictionary

input_text = """The hand that rocks the cradle also makes the house a home. It is the prayers of the mother
that keeps the family strong. When I think about my mum, I just cannot help but smile; The beauty of
her loving heart, the easy grace in her style. I will always need my mom, regardless of my age. She
has made me laugh, made me cry. Her love will never fade. If I could write a story, It would be the
greatest ever told. I would write about my daddy, For he had a heart of gold. For my father, my friend,
This to me you have always been. Through the good times and the bad, Your understanding I have had."""


def normalize_textual_information(text):
    # split text into tokens by white space
    token = text.split()

    # remove punctuation from each token
    table = str.maketrans('', '', punctuation)
    token = [word.translate(table) for word in token]

    # remove any tokens that are not alphabetic
    token = [word.lower() for word in token if word.isalpha()]

    # filter out English stop words
    stop_words = set(stopwords.words('english'))

    # you could add additional stops like this
    stop_words.add('cannot')
    stop_words.add('could')
    stop_words.add('would')
    token = [word for word in token if word not in stop_words]

    # filter out any short tokens
    token = [word for word in token if len(word) > 1]
    return token


def generate_word_frequencies(words):
    # list to hold word frequencies
    word_frequencies = []

    # loop through the tokens and generate a word count for each token
    for word in words:
        word_frequencies.append(words.count(word))

    # aggregate the words and word_frequencies into tuples and convert them into a dictionary
    word_frequencies = dict(zip(words, word_frequencies))

    # sort the frequency of the words from low to high
    sorted_frequencies = {key: value for key, value in
                          sorted(word_frequencies.items(), key=lambda item: item[1])}
    return sorted_frequencies


def get_synonyms_internet(word):
    dictionary = PyDictionary(word)
    synonym = dictionary.getSynonyms()
    return synonym


words = normalize_textual_information(input_text)

all_synsets_1 = {}
for word in words:
    for synonym in wordnet.synsets(word):
        if word != synonym.name() and len(synonym.lemma_names()) > 1:
            for item in synonym.lemmas():
                if word != item.name():
                    all_synsets_1.setdefault(word, []).append(str(item.name()).lower())

all_synsets_2 = {}
for word in words:
    word_synonyms = get_synonyms_internet(word)
    for synonym in word_synonyms:
        if word != synonym and synonym is not None:
            all_synsets_2.update(synonym)

word_relationship = {**all_synsets_1, **all_synsets_2}

frequencies = generate_word_frequencies(words)

word_matches = []
word_set = {}
duplication_check = set()

for word, frequency in frequencies.items():
    for keyword, synonym in word_relationship.items():
        match = [x for x in synonym if word == x]
        if word == keyword or match:
            match = ' '.join(map(str, match))
            if match not in word_set or match not in duplication_check or word not in duplication_check:
                duplication_check.add(word)
                duplication_check.add(match)
                word_matches.append([keyword, match, frequency])

# used to hold the final keywords and frequencies
final_results = {}

# list comprehension to obtain the primary keywords and their frequencies
synonym_matches = [(entry[0], entry[2]) for entry in word_matches]

# iterate synonym_matches and total the frequency count for each keyword
for item in synonym_matches:
    final_results[item[0]] = final_results.get(item[0], 0) + item[1]

# do something with the final results
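For instance, the merged counts could be printed highest first:

# print the merged keyword counts, highest first
for keyword, count in sorted(final_results.items(), key=lambda kv: kv[1], reverse=True):
    print(keyword, count)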
Another quirky way to solve this problem is to use the good old PyDictionary library. You can use the dictionary.getSynonyms() function to loop through all the words in your list and group them. All the available synonyms that are listed will be covered and mapped to one group. That allows you to assign a final variable and sum up the synonyms. In your example, you pick the final word as mother, which shows the final count of its synonyms; a sketch of that loop follows.
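A minimal sketch of that grouping loop, assuming PyDictionary's getSynonyms() return format of [{word: [synonyms]}] and using the question's example counts; the printed result depends on what synonym.com actually lists:

from PyDictionary import PyDictionary

counts = {'mom': 250, 'mother': 81}  # example counts from the question
merged = {}

for word, count in counts.items():
    # treat the word itself plus its listed synonyms as one group
    group = {word}
    synonyms = PyDictionary(word).getSynonyms()
    if synonyms:  # getSynonyms() returns a list of {word: [synonyms]} dicts
        for entry in synonyms:
            for syns in entry.values():
                group.update(syns)
    # fold the count into an existing key if any group member was already seen
    key = next((k for k in merged if k in group), word)
    merged[key] = merged.get(key, 0) + count

print(merged)  # e.g. {'mom': 331} if 'mom' is listed among mother's synonyms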
This is a hard problem, and the best solution depends on the use case you are trying to solve. It is hard because combining words requires understanding the semantics of those words. You can combine mom and mother together because they are semantically related.
One way to identify whether two words are semantically related is by using distributed word embeddings (vectors) like word2vec, GloVe, fastText, and so on. You can find the cosine similarity between one word's vector and the vectors of all the other words, pick the top 5 closest words, and create a new combined word.
Example using word2vec:
# Load a pretrained word2vec model
import gensim.downloader as api

model = api.load('word2vec-google-news-300')

# example word list, inferred from the output below
words = ['mom', 'mother', 'london', 'life', 'teach', 'teacher']
vectors = [model.get_vector(w) for w in words]

for i, w in enumerate(vectors):
    first_best_match = model.cosine_similarities(vectors[i], vectors).argsort()[::-1][1]
    second_best_match = model.cosine_similarities(vectors[i], vectors).argsort()[::-1][2]
    print(f"{words[i]} + {words[first_best_match]}")
    print(f"{words[i]} + {words[second_best_match]}")
Output:
mom + mother
mom + teacher
mother + mom
mother + teacher
london + mom
london + life
life + mother
life + mom
teach + teacher
teach + mom
teacher + teach
teacher + mother
You can try putting a threshold on the cosine similarity and selecting only those pairs whose cosine similarity is greater than that threshold.
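A sketch of such a threshold filter, reusing the model, words and vectors from the example above (the 0.5 cutoff is an arbitrary example value):

threshold = 0.5  # arbitrary example cutoff
merge_candidates = []
for i in range(len(words)):
    sims = model.cosine_similarities(vectors[i], vectors)
    for j in range(len(words)):
        if i != j and sims[j] > threshold:
            merge_candidates.append((words[i], words[j], float(sims[j])))

print(merge_candidates)  # only pairs whose similarity clears the cutoff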
One problem with semantic similarity is that words can be semantically opposite and still score as similar (man - woman), while on the other hand a pair like (man - king) is semantically similar because the two refer to the same kind of entity.
What you want to achieve is semantic textual similarity.
I would like to recommend the Tensorflow Universal Sentence Encoder.
For example:
#@title Load the Universal Sentence Encoder's TF Hub module
from absl import logging

import tensorflow as tf
import tensorflow_hub as hub
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import re
import seaborn as sns

module_url = "https://tfhub.dev/google/universal-sentence-encoder/4" #@param ["https://tfhub.dev/google/universal-sentence-encoder/4", "https://tfhub.dev/google/universal-sentence-encoder-large/5"]
model = hub.load(module_url)
print("module %s loaded" % module_url)


def embed(input):
    return model(input)


def plot_similarity(labels, features, rotation):
    corr = np.inner(features, features)
    sns.set(font_scale=1.2)
    g = sns.heatmap(
        corr,
        xticklabels=labels,
        yticklabels=labels,
        vmin=0,
        vmax=1,
        cmap="YlOrRd")
    g.set_xticklabels(labels, rotation=rotation)
    g.set_title("Semantic Textual Similarity")


def run_and_plot(messages_):
    message_embeddings_ = embed(messages_)
    plot_similarity(messages_, message_embeddings_, 90)


messages = [
    "Mother",
    "Mom",
    "Mama",
    "Dog",
    "Cat"
]

run_and_plot(messages)
The example is written in Python, but I have also created an example of loading the model into JVM-based languages.
You can generate word-embedding vectors and use some clustering algorithm to group related words. At the end you need to tune the algorithm's hyperparameters to achieve highly accurate results.
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
import numpy as np
import spacy
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Load the large English model
nlp = spacy.load("en_core_web_lg")
tokens = nlp("dog cat banana apple teaching teacher mom mother mama mommy berlin paris")

# Generate word embedding vectors
vectors = np.array([token.vector for token in tokens])
vectors.shape
# (12, 300)
Let's visualize our embedding space in 3 dimensions using principal component analysis:
pca_vecs = PCA(n_components=3).fit_transform(vectors)
pca_vecs.shape
# (12, 3)

fig = plt.figure(figsize=(6, 6))
ax = fig.add_subplot(111, projection='3d')
xs, ys, zs = pca_vecs[:, 0], pca_vecs[:, 1], pca_vecs[:, 2]
_ = ax.scatter(xs, ys, zs)

# label each point with its token text
for x, y, z, label in zip(xs, ys, zs, tokens):
    ax.text(x + 0.3, y, z, str(label))

plt.show()
Let's cluster the words using the DBSCAN algorithm:
model = DBSCAN(eps=5, min_samples=1)
model.fit(vectors)

for word, cluster in zip(tokens, model.labels_):
    print(word, '->', cluster)
Output:
dog -> 0
cat -> 0
banana -> 1
apple -> 2
teaching -> 3
teacher -> 3
mom -> 4
mother -> 4
mama -> 4
mommy -> 4
berlin -> 5
paris -> 6
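To tie this back to the original question, the DBSCAN labels can drive the merge itself: sum the frequency counts per cluster and keep the most frequent member as the group's name. A sketch with hypothetical per-word counts:

# hypothetical frequency counts for some of the clustered words
counts = {'mom': 250, 'mother': 81, 'mama': 10, 'mommy': 5}

# sum counts per cluster, keeping the most frequent member as the group name
cluster_totals = {}
for token, cluster in zip(tokens, model.labels_):
    word = token.text
    if word in counts:
        name, total = cluster_totals.get(cluster, (word, 0))
        if counts[word] > counts.get(name, 0):
            name = word
        cluster_totals[cluster] = (name, total + counts[word])

merged = {name: total for name, total in cluster_totals.values()}
print(merged)  # {'mom': 346} for the mom/mother/mama/mommy cluster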