Compare each element of CSV file to every element of a different CSV file, and find the most similar elements
I have two CSV files that I need to compare. The first is called SAP.csv, and the second is called SAPH.csv.
SAP.csv has these cells:
Notification Description
5000000001 Detailed Inspection of Masts (2100mm) (3
5000000002 Ceremonial Awnings-Survey and Load Test
5000000003 HPA-Carry out 4000 hour service routine
5000000004 UxE 8 in Number Temperature Probs for C
5000000005 Overhaul valves
...whereas SAPH.csv has these cells:
Notification Description
4000000015 Detailed Inspection of Masts (2100mm) (3
4000000016 Ceremonial Awnings-Survey and Load Test
4000000017 HPA-Carry out 8000 hour service routine
4000000018 UxE 8 in Number Temperature Probs for C
4000000019 Represerve valves
4000000020 STW System
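They are similar, but some rows, such as the fourth row (HPA-Carry out 4000 hour service routine vs. HPA-Carry out 8000 hour service routine), are slightly different.
I want to compare each value of SAP.csv with every value of SAPH.csv, then find the most similar row using cosine similarity, so that the output looks like this (the similarity percentages here are only examples, not actual values):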
Description
Detailed Inspection of Masts (2100mm) (3 - 100%
Ceremonial Awnings-Survey and Load Test - 100%
HPA-Carry out 4000 hour service routine - 85%
UxE 8 in Number Temperature Probs for C - 90%
Overhaul valves - 0%
Edit after the answer was posted
runfile('C:/Users/andrew.stillwell2/.spyder-py3/Estimating Test.py', wdir='C:/Users/andrew.stillwell2/.spyder-py3')
Traceback (most recent call last):
  File "", line 1, in <module>
    runfile('C:/Users/andrew.stillwell2/.spyder-py3/Estimating Test.py', wdir='C:/Users/andrew.stillwell2/.spyder-py3')
  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 786, in runfile
    execfile(filename, namespace)
  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "C:/Users/andrew.stillwell2/.spyder-py3/Estimating Test.py", line 31, in <module>
    similarity_score = similar(job, description) # Get their similarity
  File "C:/Users/andrew.stillwell2/.spyder-py3/Estimating Test.py", line 14, in similar
    similarity = 1-textdistance.Cosine(qval=2).distance(a, b)
  File "C:\ProgramData\Anaconda3\lib\site-packages\textdistance\algorithms\base.py", line 173, in distance
    return self.maximum(*sequences) - self.similarity(*sequences)
  File "C:\ProgramData\Anaconda3\lib\site-packages\textdistance\algorithms\base.py", line 176, in similarity
    return self(*sequences)
  File "C:\ProgramData\Anaconda3\lib\site-packages\textdistance\algorithms\token_based.py", line 175, in __call__
    return intersection / pow(prod, 1.0 / len(sequences))
ZeroDivisionError: float division by zero
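Second edit, now that the issue above has been resolved
So the original request only had two outputs: the description and the similarity score.
The description comes from SAP.
The similarity comes from the textdistance calculation.
Could the solution be modified to output the following?
Notification (this is the 10-digit number from the SAP file)
Description (as it is currently)
Similarity (as it is currently)
Notification (this number comes from the SAPH file and is the one that produced the similarity score)
So an example output row would look like this:
80000115360 Additional materials FWD rope guard 86.24% 7123456789
This would run along columns A, B, C, D:
A and B come from SAP
C is calculated
D comes from SAPH
Edit 3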
  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 786, in runfile
    execfile(filename, namespace)
  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "C:/Users/andrew.stillwell2/.spyder-py3/Est Test 2.py", line 16, in <module>
    SAP = pd.read_csv('H:\Documents/Python/Import into Python/SAP/SAP.csv', dtype={'Notification':'string'})
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 702, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 429, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 895, in __init__
    self._make_engine(self.engine)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1122, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1853, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 490, in pandas._libs.parsers.TextReader.__cinit__
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\dtypes\common.py", line 2017, in pandas_dtype
    dtype))
TypeError: data type 'string' not understood
Post edit 4 - 25/10/20
Hi, so I got the same error as before:
  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 786, in runfile
    execfile(filename, namespace)
  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "C:/Users/andrew.stillwell2/.spyder-py3/Est Test 2.py", line 16, in <module>
    SAP = pd.read_csv('H:\Documents/Python/Import into Python/SAP/SAP.csv', dtype={'Notification':'string'}, delimiter=",", engine="python")
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 702, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 435, in _read
    data = parser.read(nrows)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1139, in read
    ret = self._engine.read(nrows)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 2421, in read
    data = self._convert_data(data)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 2487, in _convert_data
    clean_conv, clean_dtypes)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1705, in _convert_to_ndarrays
    cvals = self._cast_types(cvals, cast_type, c)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1808, in _cast_types
    copy=True, skipna=True)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\dtypes\cast.py", line 623, in astype_nansafe
    dtype = pandas_dtype(dtype)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\dtypes\common.py", line 2017, in pandas_dtype
    dtype))
TypeError: data type 'string' not understood
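I took your point about the delimiter, so I uploaded a CSV file to repl.it, and it looks as though "," is the delimiter.
So I amended the code to suit. When I did this on repl.it, it worked.
Here is the code I am using: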
import textdistance
import pandas as pd

def similar(a, b): # adapted from here:
    similarity = 1-textdistance.Cosine(qval=2).distance(a, b)
    return similarity * 100

# Read the CSVs
SAP = pd.read_csv('H:\Documents/Python/Import into Python/SAP/SAP.csv', dtype={'Notification':'string'}, delimiter=",", engine="python" )
SAPH = pd.read_csv('H:\Documents/Python/Import into Python/SAP/SAP_History.csv', dtype={'Notification':'string'}, delimiter=",", engine="python" )

# Create a pandas dataframe to store the output. The 'Description' column is populated with the values of SAP['Description']
scores = pd.DataFrame(SAP['Description'], columns = ['Notification (SAP)','Description', 'Similarity', 'Notification (SAPH)'])

# Temporary variables to store the highest similarity score
highest_score = 0
desc = 0

# Iterate through SAP['Description']
for job in SAP['Description']:
    highest_score = 0 # Reset highest_score at each iteration
    for description in SAPH['Description']: # Iterate through SAPH['Description']
        similarity_score = similar(job, description) # Get their similarity
        if(similarity_score > highest_score): # Check if the similarity is higher than the already saved similarity. If so, update highest_score with the new values
            highest_score = similarity_score
            desc = str(description)
        if(similarity_score == 100): # If it's a perfect match, don't bother continuing to search.
            break
    # Update the dataframe 'scores' with highest_score and the other values
    print(SAPH['Description'][SAPH['Description'] == desc])
    scores['Notification (SAP)'][scores['Description'] == job] = SAP['Notification'][SAP['Description'] == job]
    scores['Similarity'][scores['Description'] == job] = f'{highest_score}%'
    scores['Notification (SAPH)'][scores['Description'] == job] = SAPH['Notification'][SAPH['Description'] == desc]

print(scores)

# Output it to Scores.csv without the index column
with open('./Scores.csv', 'w') as file:
    file.write(scores.__repr__())
I am running Spyder (Python 3.7).
@George_Pipas's answer to this question demonstrates an example using the library textdistance (I'm quoting part of his answer here):
A solution is to work with the textdistance library. I will provide an example of Cosine Similarity
import textdistance
1-textdistance.Cosine(qval=2).distance('Apple', 'Appel')
and we get:
0.5
Therefore, we can create a similarity-finding function:
def similar(a, b):
    similarity = 1-textdistance.Cosine(qval=2).distance(a, b)
    return similarity
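Depending on the similarity, this outputs a number closer to 1 if a and b are more similar, and a number closer to 0 if they aren't. So if a == b, the output will be 1, but if a != b, the output will generally be less than 1.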
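To make the 0.5 above less of a black box, here is a minimal pure-Python sketch of a bigram cosine score. It follows the formula visible in the question's traceback (intersection / pow(prod, 1.0 / len(sequences))) for the two-string case, but it is an illustration written for this answer, not textdistance's actual code; the names bigrams and bigram_cosine are made up here, and returning 0.0 for strings without any bigram is an assumption.

```python
from collections import Counter
from math import sqrt


def bigrams(text):
    """Count the overlapping two-character substrings of text."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def bigram_cosine(a, b):
    """Shared bigrams divided by the geometric mean of the two bigram totals."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    shared = sum((grams_a & grams_b).values())  # multiset intersection
    total_a, total_b = sum(grams_a.values()), sum(grams_b.values())
    if total_a == 0 or total_b == 0:
        return 0.0  # assumption: strings too short to contain a bigram score 0
    return shared / sqrt(total_a * total_b)


# 'Apple' -> Ap, pp, pl, le and 'Appel' -> Ap, pp, pe, el share 2 of 4 bigrams
print(bigram_cosine('Apple', 'Appel'))  # 0.5
```

Working on character pairs rather than whole words is what lets near-identical descriptions, such as the 4000 hour and 8000 hour routines, still score highly.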
To get a percentage, simply multiply the output by 100. Like this:
def similar(a, b): # adapted from here:
    similarity = 1-textdistance.Cosine(qval=2).distance(a, b)
    return similarity * 100
The CSV files can be read easily using pandas:
# Read the CSVs
SAP = pd.read_csv('SAP.csv')
SAPH = pd.read_csv('SAPH.csv')
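If the Notification numbers need to stay as text rather than being parsed as integers, note that the 'string' dtype alias used in the question's later edits is only understood by pandas 1.0 and later; on older releases, dtype={'Notification': str} is the usual substitute. As a dependency-free illustration of the same idea, the standard library's csv.DictReader returns every field as a string. The inline sample data and the read_rows helper are assumptions made for this sketch, and it presumes a comma-delimited file with a header row.

```python
import csv
import io

# Inline stand-in for SAP.csv (assumes comma-delimited data with a header row)
SAP_CSV = """Notification,Description
5000000001,Detailed Inspection of Masts (2100mm) (3
5000000005,Overhaul valves
"""


def read_rows(handle):
    """Return the rows of a CSV file as a list of dicts; every value stays a str."""
    return list(csv.DictReader(handle))


rows = read_rows(io.StringIO(SAP_CSV))
print(rows[0]['Notification'])        # 5000000001 (still text, not an int)
print(type(rows[0]['Notification']))  # <class 'str'>
```

For a file on disk, the same helper could be called as read_rows(open('SAP.csv', newline='')), ideally inside a with block.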
We create another pandas dataframe to store the results we will compute:
# Create a pandas dataframe to store the output. The column 'SAP' is populated with the values of SAP['Description']
scores = pd.DataFrame({'SAP': SAP['Description']}, columns = ['SAP', 'SAPH', 'Similarity'])
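Now, we iterate through SAP['Description'] and SAPH['Description'], compare each element of the first with every element of the second, compute their similarity, and save the highest to scores.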
# Temporary variable to store both the highest similarity score, and the 'SAPH' value the score was computed with
highest_score = {"score": 0, "description": ""}

# Iterate through SAP['Description']
for job in SAP['Description']:
    highest_score = {"score": 0, "description": ""} # Reset highest_score at each iteration
    for description in SAPH['Description']: # Iterate through SAPH['Description']
        similarity_score = similar(job, description) # Get their similarity
        if(similarity_score > highest_score['score']): # Check if the similarity is higher than the already saved similarity. If so, update highest_score with the new values
            highest_score['score'] = similarity_score
            highest_score['description'] = description
        if(similarity_score == 100): # If it's a perfect match, don't bother continuing to search.
            break
    # Update the dataframe 'scores' with highest_score
    # (.loc avoids chained assignment, which pandas may apply to a temporary copy)
    scores.loc[scores['SAP'] == job, 'SAPH'] = highest_score['description']
    scores.loc[scores['SAP'] == job, 'Similarity'] = highest_score['score']
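Here is the breakdown:
- A temporary variable highest_score is created to store the highest computed score.
- We then iterate through SAP['Description'] and, inside that loop, iterate through SAPH['Description']. This allows us to compare each value of SAP['Description'] (job) with every value of SAPH['Description'] (description).
- While iterating through SAPH['Description'], we:
  - Compute the similarity score of job and description.
  - If it is higher than the score saved in highest_score, update highest_score accordingly; otherwise, carry on.
  - If similarity_score equals 100, we know it's a perfect match and don't have to keep looking, so we break out of the loop.
- Outside the SAPH['Description'] loop, now that job has been compared with every element of SAPH['Description'] (or a perfect match has been found), we save the values to scores.
This is repeated for every element of SAP['Description'].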
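The steps above can also be packaged as a small helper that works on plain lists. This is a sketch, not the code linked from this answer: best_match and word_overlap are hypothetical names, and word_overlap (a word-set overlap percentage) is only a stand-in scorer so that the example runs without textdistance. The similar function defined earlier could be passed in its place.

```python
def word_overlap(a, b):
    """Stand-in scorer: percentage of distinct words the two strings share."""
    words_a, words_b = set(a.split()), set(b.split())
    if not words_a or not words_b:
        return 0.0
    return 100 * len(words_a & words_b) / len(words_a | words_b)


def best_match(job, candidates, score_fn):
    """Return (index, score) of the candidate most similar to job."""
    best_index, best_score = None, 0.0
    for index, candidate in enumerate(candidates):
        score = score_fn(job, candidate)
        if score > best_score:
            best_index, best_score = index, score
        if score == 100:  # perfect match, stop searching
            break
    return best_index, best_score


saph = [
    'Detailed Inspection of Masts (2100mm) (3',
    'Ceremonial Awnings-Survey and Load Test',
    'HPA-Carry out 8000 hour service routine',
    'UxE 8 in Number Temperature Probs for C',
    'Represerve valves',
    'STW System',
]

index, score = best_match('HPA-Carry out 4000 hour service routine', saph, word_overlap)
print(saph[index], round(score, 2))
```

Returning the index rather than the matched text makes it straightforward to look up other columns of the matched row, such as its Notification number.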
Here is what scores looks like once that's done:
SAP SAPH Similarity
0 Detailed Inspection of Masts (2100mm) (3 Detailed Inspection of Masts (2100mm) (3 100
1 Ceremonial Awnings-Survey and Load Test Ceremonial Awnings-Survey and Load Test 100
2 HPA-Carry out 4000 hour service routine HPA-Carry out 8000 hour service routine 94.7368
3 UxE 8 in Number Temperature Probs for C UxE 8 in Number Temperature Probs for C 100
4 Overhaul valves Represerve valves 53.4522
Then, after outputting it to a CSV file:
# Output it to Scores.csv without the index column (0, 1, 2, 3... far left in scores above). Remove index=False if you want to keep the index column.
scores.to_csv('Scores.csv', index=False)
...Scores.csv looks like this:
SAP,SAPH,Similarity
Detailed Inspection of Masts (2100mm) (3,Detailed Inspection of Masts (2100mm) (3,100
Ceremonial Awnings-Survey and Load Test,Ceremonial Awnings-Survey and Load Test,100
HPA-Carry out 4000 hour service routine,HPA-Carry out 8000 hour service routine,94.73684210526315
UxE 8 in Number Temperature Probs for C,UxE 8 in Number Temperature Probs for C,100
Overhaul valves,Represerve valves,53.45224838248488
View the full code, and run and edit it online
Note that textdistance and pandas are the libraries required for this. If you don't already have them, install them using:
pip install textdistance pandas
Notes:
- You can round the percentage by replacing f'{highest_score}%' with the following: f'{round(highest_score, NUMBER_OF_PLACES_TO_ROUND_TO)}%'
- Here's a formatted version, and here's the code
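For the four-column layout requested in the question's second edit (Notification from SAP, Description, Similarity, Notification from SAPH), one option is to collect one row per SAP entry and write the rows with the csv module. The sketch below assumes the best match has already been found for each row; the matches list, the write_scores helper and the two-decimal rounding are illustrative choices, not taken from the linked code.

```python
import csv
import io

# (SAP notification, SAP description, best score, SAPH notification of the best match)
# Values are taken from the sample data and scores shown above.
matches = [
    ('5000000003', 'HPA-Carry out 4000 hour service routine', 94.73684210526315, '4000000017'),
    ('5000000005', 'Overhaul valves', 53.45224838248488, '4000000019'),
]


def write_scores(rows, handle, places=2):
    """Write the four-column result, formatting the score as a rounded percentage."""
    writer = csv.writer(handle, lineterminator='\n')
    writer.writerow(['Notification (SAP)', 'Description', 'Similarity', 'Notification (SAPH)'])
    for sap_id, description, score, saph_id in rows:
        writer.writerow([sap_id, description, f'{round(score, places)}%', saph_id])


buffer = io.StringIO()  # swap for open('Scores.csv', 'w', newline='') to write a file
write_scores(matches, buffer)
print(buffer.getvalue())
```

Keeping the notification numbers as strings throughout avoids any risk of them being reformatted as numbers on the way out.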
Edit: (in response to the problem mentioned in the comments)
Here is an error-catching version of the similarity function:
def similar(a, b): # adapted from here:
    try:
        similarity = 1-textdistance.Cosine(qval=2).distance(a, b)
        return similarity * 100
    except ZeroDivisionError:
        print('There was an error. Here are the values of a and b that were passed')
        print(f'a: {repr(a)}')
        print(f'b: {repr(b)}')
        exit()
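If stopping the whole script is too drastic, an alternative is to skip inputs that cannot be scored. The sketch below rests on an assumption about what triggers the error (blank cells, which pandas reads as NaN, or text too short to contain a bigram), so treat it as a starting point rather than a diagnosis. safe_similar is a hypothetical wrapper, and score_fn stands for the similar function above.

```python
def safe_similar(a, b, score_fn, min_length=2):
    """Return score_fn(a, b), or 0.0 when the inputs cannot be scored.

    Assumption: non-string values and strings shorter than min_length
    are treated as having no similarity to anything.
    """
    for value in (a, b):
        if not isinstance(value, str) or len(value.strip()) < min_length:
            return 0.0
    try:
        return score_fn(a, b)
    except ZeroDivisionError:
        return 0.0


def always_fails(a, b):
    """Stand-in scorer that reproduces the error for demonstration."""
    return 1 / 0


print(safe_similar('Overhaul valves', float('nan'), always_fails))  # 0.0
print(safe_similar('Overhaul valves', 'STW System', always_fails))  # 0.0
print(safe_similar('abc', 'abc', lambda a, b: 100.0))               # 100.0
```

Scoring such rows as 0 keeps them in the output, where they are easy to spot and review afterwards.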