Pandas: Query using Levenshtein Distance
Given the following dataset:
name;sex;city;age
john;male;newyork;20
jack;male;newyork;21
mary;female;losangeles;45
maryanne;female;losangeles;48
eric;male;san francisco;26
jenny;female;boston2;30
mattia;na;BostonDynamics;50
and the constraints:
source = "john"
max_dist = 2
My goal is to obtain a list of all name values whose Levenshtein distance from source is <= max_dist. Is it possible to do this with the pandas.DataFrame.query() method, or does it have to be done some other way?
You would do it a different way.
import pandas as pd
import editdistance  # install first: pip install editdistance
from io import StringIO  # Python 3 location; the Python 2 StringIO module is gone

source = "john"
max_dist = 2

s = StringIO("""name;sex;city;age
john;male;newyork;20
jack;male;newyork;21
mary;female;losangeles;45
maryanne;female;losangeles;48
eric;male;san francisco;26
jenny;female;boston2;30
mattia;na;BostonDynamics;50""")
df = pd.read_csv(s, sep=';')

# Keep only the rows whose name is within max_dist edits of source
df[df["name"].apply(lambda x: editdistance.eval(source, x) <= max_dist)]
   name   sex     city  age
0  john  male  newyork   20
# The same filter, returned as a plain list of names
df[df["name"].apply(lambda x: editdistance.eval(source, x) <= max_dist)]["name"].tolist()
['john']
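If you would rather not install a third-party package, or you still want query() in the pipeline, here is a minimal sketch. It makes two assumptions that are not part of the answer above: the levenshtein helper is a hand-written dynamic-programming implementation (not the editdistance library), and dist is a hypothetical temporary column added only so that query() has something to filter on. query() itself does not compute the distance; it only compares the precomputed column against max_dist via the @ syntax.

```python
from io import StringIO

import pandas as pd


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


data = StringIO("""name;sex;city;age
john;male;newyork;20
jack;male;newyork;21
mary;female;losangeles;45
maryanne;female;losangeles;48
eric;male;san francisco;26
jenny;female;boston2;30
mattia;na;BostonDynamics;50""")
df = pd.read_csv(data, sep=";")

source = "john"
max_dist = 2

# Option 1: boolean mask, same idea as the answer above
mask = df["name"].map(lambda x: levenshtein(source, x) <= max_dist)
names_mask = df.loc[mask, "name"].tolist()

# Option 2: precompute the distance as a column, then filter with query()
names_query = (
    df.assign(dist=df["name"].map(lambda x: levenshtein(source, x)))
      .query("dist <= @max_dist")["name"]
      .tolist()
)

print(names_mask)
print(names_query)
```

Both options do the same per-row Python work, so the query() variant is mostly a matter of taste; for large frames a C-backed library such as editdistance will likely be faster than the pure-Python helper.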