Python - calculate orthographic similarity between words of a list
I need to calculate the orthographic similarity (edit/Levenshtein distance) between the words in a given corpus.
As Kirill suggested below, I tried the following:
import csv, itertools, Levenshtein
import numpy as np
# import the list of words from csv file
path = '/Users/my path'
file = path + 'file.csv'
with open(file, 'rb') as f:
    reader = csv.reader(f)
    wordlist = list(reader)
wordlist = np.array(wordlist)  # make it a np array
wordlist2 = wordlist[:,0]  # subset the first column of the imported list
for a, b in itertools.product(wordlist, wordlist):
    if a < b:
        print(a, b, Levenshtein.distance(a, b))
However, the following error pops up:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I understand the ambiguity in the code, but can someone help me figure out how to fix it? Thanks!
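The ValueError most likely arises because iterating over a 2-D NumPy array yields whole rows, so `a < b` compares two arrays element-wise and `if` cannot reduce the result to a single boolean. A minimal sketch of one way around it, assuming the words sit in the first column of the CSV: skip NumPy and pull that column out as plain strings. The sample data and the `first_column` helper are hypothetical, and `io.StringIO` stands in for the real file.

```python
import csv
import io
import itertools

def first_column(lines):
    """Return the first cell of every non-empty CSV row as a plain str."""
    return [row[0] for row in csv.reader(lines) if row]

# Hypothetical stand-in for the CSV file: the word is in column 0.
sample = io.StringIO("little,1\nMary,2\nhad,3\n")
words = sorted(set(first_column(sample)))

# Plain strings compare to a single bool, so pairing them is unambiguous.
pairs = list(itertools.combinations(words, 2))
print(pairs)
```

Each pair can then be passed to `Levenshtein.distance(a, b)`, which expects two strings rather than arrays.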
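By definition, edit distance can only be computed between two strings: it measures how many edits it takes to turn one string into the other. You can compare the words pairwise, which takes n*(n-1)/2 comparisons (where n is the number of unique words in the corpus). Here is how: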
>>> import itertools, Levenshtein
>>> words = sorted(set('little Mary had a little lamb'.split()))
>>> for a, b in itertools.product(words, words):
...     if a < b:
...         print(a, b, Levenshtein.distance(a, b))
...
Mary a 3
Mary had 3
Mary lamb 3
Mary little 6
a had 2
a lamb 3
a little 6
had lamb 3
had little 6
lamb little 5
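The `a < b` filter over `itertools.product` visits all n*n ordered pairs and keeps fewer than half of them; `itertools.combinations` yields the n*(n-1)/2 unordered pairs directly. A sketch of that variant follows. The pure-Python `levenshtein` helper is a hypothetical stand-in for `Levenshtein.distance`, included only so the example runs without the third-party package.

```python
import itertools

def levenshtein(s, t):
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(t) + 1))  # distances from '' to each prefix of t
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to ''
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

words = sorted(set('little Mary had a little lamb'.split()))
pairs = list(itertools.combinations(words, 2))

# n unique words give n*(n-1)/2 unordered pairs.
n = len(words)
assert len(pairs) == n * (n - 1) // 2

for a, b in pairs:
    print(a, b, levenshtein(a, b))
```

Because `words` is sorted and deduplicated, the pairs come out in the same order as in the session above.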
Here is the code I arrived at with Kirill's help.
import csv
import itertools, Levenshtein
# open the newline-separated list of words
path = '/Users/your path'
file = path + 'wordlists.txt'
output = path + 'ortho_similarities.txt'
# read the words, stripping whitespace and dropping duplicates
with open(file) as f:
    words = sorted(set(s.strip() for s in f))
# the following loop takes all possible pairwise combinations
# of the words in the list words, calculates the LD,
# and writes everything to a csv file
# (Python 3: text mode with newline=''; Python 2 would use 'wb')
with open(output, 'w', newline='') as f:
    writer = csv.writer(f, delimiter=",", lineterminator="\n")
    for a, b in itertools.product(words, words):
        if a < b:
            writer.writerow([a, b, Levenshtein.distance(a, b)])