How to calculate the normalized editex similarity between two strings from separate columns
I am trying to calculate the normalized editex similarity between two strings using Python. So far I have used this code to get the raw editex distance, which works fine:
new_df["EdxScore"] = new_df.apply(lambda x: textdistance.editex(x[0], x[1]), axis=1)
I have read the documentation here: https://anhaidgroup.github.io/py_stringmatching/v0.3.x/Editex.html
But when I try:
new_df["EdxScore"] = new_df.apply(lambda x: textdistance.editex.get_sim_score(x[0],x[1]), axis=1)
I get the error:
AttributeError: ("'Editex' object has no attribute 'get_sim_score'", 'occurred at index 0')
I am not entirely sure what is going wrong here, so any help would be much appreciated!
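It turns out I had not read the documentation properly or defined the parameters to use. The page linked in the question documents py_stringmatching, whose Editex class provides get_sim_score, while my code calls textdistance, which exposes a different set of methods.
For clarity, I have pasted the relevant parts of the textdistance documentation below:
All algorithms have 2 interfaces: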
Class with algorithm-specific params for customizing.
Class instance with default params for quick and simple usage.
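As a rough illustration of the two interfaces, the sketch below builds a customized Editex object and compares it with the ready-made textdistance.editex instance. The sample strings are made up, and the mismatch_cost keyword is an assumption based on the Editex constructor as I understand it, so check it against your installed version.

```python
import textdistance

# Interface 1: the class, instantiated with algorithm-specific parameters.
# mismatch_cost=3 is an arbitrary illustrative value (assumed keyword name).
custom_editex = textdistance.Editex(mismatch_cost=3)

# Interface 2: the module-level instance that uses the default parameters.
default_editex = textdistance.editex

a, b = "Nelson", "Neilsen"  # hypothetical sample strings

custom_score = custom_editex.normalized_similarity(a, b)
default_score = default_editex.normalized_similarity(a, b)

print("custom :", custom_score)
print("default:", default_score)
```

Both objects expose the same methods, so switching from the default instance to a customized one should only require changing the object you call.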
All algorithms have some common methods:
.distance(*sequences) – calculate distance between sequences.
.similarity(*sequences) – calculate similarity for sequences.
.maximum(*sequences) – maximum possible value for distance and similarity. For any sequence: distance + similarity == maximum.
.normalized_distance(*sequences) – normalized distance between sequences. The return value is a float between 0 and 1, where 0 means equal, and 1 totally different.
.normalized_similarity(*sequences) – normalized similarity for sequences. The return value is a float between 0 and 1, where 0 means totally different, and 1 equal.
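Applied to the original problem, the methods listed above suggest replacing get_sim_score with normalized_similarity. The following is a minimal sketch, assuming a DataFrame with two string columns; the column names name_a and name_b and the sample values are placeholders, not taken from the original data.

```python
import pandas as pd
import textdistance

# Hypothetical sample data; the column names are placeholders.
new_df = pd.DataFrame({
    "name_a": ["Robert", "Nelson", "Smith"],
    "name_b": ["Robert", "Neilsen", "Jones"],
})

# normalized_similarity returns a float between 0 and 1,
# where 1 means the two strings are equal.
new_df["EdxScore"] = new_df.apply(
    lambda row: textdistance.editex.normalized_similarity(row["name_a"], row["name_b"]),
    axis=1,
)

print(new_df)
```

Selecting the columns by label rather than by x[0] and x[1] avoids depending on column order.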
The most common init arguments:
qval – q-value for split sequences into q-grams. Possible values:
1 (default) – compare sequences by chars.
2 or more – transform sequences to q-grams.
None – split sequences by words.
as_set – for token-based algorithms:
True – t and ttt is equal.
False (default) – t and ttt is different.
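The qval and as_set arguments mainly affect token-based algorithms, so the sketch below uses textdistance.Jaccard rather than Editex to show their effect. Choosing Jaccard here is purely for illustration and is not part of the original question.

```python
import textdistance

# as_set=True ignores how often a token repeats, so "t" and "ttt" compare as equal.
as_set_score = textdistance.Jaccard(as_set=True).normalized_similarity("t", "ttt")

# as_set=False (the default) counts repetitions, so the two strings differ.
multiset_score = textdistance.Jaccard(as_set=False).normalized_similarity("t", "ttt")

# qval=None splits the input into words instead of characters,
# so word order no longer matters for this token-based measure.
word_score = textdistance.Jaccard(qval=None).normalized_similarity(
    "hello world", "world hello"
)

print(as_set_score, multiset_score, word_score)
```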