Apply function to pandas series given varying arguments

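Original question

I want to compute the Levenshtein distance between several strings, one set held in a Series and the other in a list. I tried map, zip, and so on, but I only got the desired result with a for loop and apply. Is there a way to improve the style and, especially, the speed?

Here is what I tried. It does what it should, but it is slow on large Series.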

import pandas as pd
import stringdist

strings = ['Hello', 'my', 'Friend', 'I', 'am']
s = pd.Series(data=strings, index=strings)
c = ['me', 'mine', 'Friend']
df = pd.DataFrame()
for w in c:
    df[w] = s.apply(lambda x: stringdist.levenshtein(x, w))

## Result: ##
        me  mine  Friend
Hello    4     5       6
my       1     3       6
Friend   5     4       0
I        2     4       6
am       2     4       6

Solution

Thanks to @Dames and @molybdenum42, I can post the solution I used right below the question. For more insight, see their great answers below.

from itertools import product

import numpy as np
import pandas as pd
import stringdist

strings = ['Hello', 'my', 'Friend', 'I', 'am']
s = pd.Series(data=strings, index=strings)
c = ['me', 'mine', 'Friend']

word_combinations = np.array(list(product(s.values, c)))
vectorized_levenshtein = np.vectorize(stringdist.levenshtein)
result = vectorized_levenshtein(word_combinations[:, 0],
                                word_combinations[:, 1])
result = result.reshape((len(s), len(c)))
df = pd.DataFrame(result, columns=c, index=s)

This produces the desired data frame.
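The reshape to (len(s), len(c)) works because itertools.product varies its last argument fastest, so the flat results are already in row-major order. A minimal sketch of that ordering, using flat positions as stand-in results (the names words, candidates, pairs and grid are illustrative, not from the solution):

```python
import itertools

import numpy as np

words = ['Hello', 'my', 'Friend', 'I', 'am']   # plays the role of s.values
candidates = ['me', 'mine', 'Friend']          # plays the role of c

# product() varies its last argument fastest:
# ('Hello', 'me'), ('Hello', 'mine'), ('Hello', 'Friend'), ('my', 'me'), ...
pairs = list(itertools.product(words, candidates))

# Stand-in "results": the flat position of each pair.
flat = np.arange(len(pairs))
grid = flat.reshape((len(words), len(candidates)))

# Row i, column j of the grid points back at the pair (words[i], candidates[j]).
for i, word in enumerate(words):
    for j, candidate in enumerate(candidates):
        assert pairs[grid[i, j]] == (word, candidate)
```

Reshaping to (len(c), len(s)) instead would only line up by accident when both inputs have the same length.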

Setup:

import stringdist
import pandas as pd
import numpy as np
import itertools

s = pd.Series(data=['Hello', 'my', 'Friend'],
              index=['Hello', 'my', 'Friend'])
c = ['me', 'mine', 'Friend']

Options

  1. Option: simple one-liner
df = pd.DataFrame({w: s.apply(lambda x: stringdist.levenshtein(x, w)) for w in c})
  2. Option: np.fromfunction (thanks to @baccandr)
@np.vectorize
def lavdist(a, b):
    # a indexes the rows (words in s), b indexes the columns (words in c)
    return stringdist.levenshtein(s.iloc[a], c[b])

df = pd.DataFrame(np.fromfunction(lavdist, (len(s), len(c)), dtype=int),
                  columns=c, index=s)
  3. Option: see @molybdenum42
word_combinations = np.array(list(itertools.product(s.values, c)))
vectorized_levenshtein = np.vectorize(stringdist.levenshtein)
result = vectorized_levenshtein(word_combinations[:,0], word_combinations[:,1])
df = pd.DataFrame([word_combinations[:,0], word_combinations[:,1], result]).T
df = df.set_index([0,1])[2].unstack()
  4. (Best) option: modified option 3
word_combinations = np.array(list(itertools.product(s.values, c)))
vectorized_levenshtein = np.vectorize(stringdist.levenshtein)
result = vectorized_levenshtein(word_combinations[:,0], word_combinations[:,1])
result = result.reshape((len(s), len(c)))
df = pd.DataFrame(result, columns=c, index=s)
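One way to gain confidence in these options is to check that a column-by-column apply and the product/reshape approach give the same numbers. A minimal sketch, assuming a pure-Python levenshtein helper as a stand-in for stringdist.levenshtein (the helper and the names by_column and by_reshape are illustrative, not from the answer):

```python
import itertools

import numpy as np
import pandas as pd


def levenshtein(a, b):
    """Dynamic-programming edit distance; stand-in for stringdist.levenshtein."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


strings = ['Hello', 'my', 'Friend', 'I', 'am']
s = pd.Series(data=strings, index=strings)
c = ['me', 'mine', 'Friend']

# Column-by-column version: one Series.apply per candidate word.
by_column = pd.DataFrame({w: s.apply(lambda x, w=w: levenshtein(x, w)) for w in c})

# product + np.vectorize + reshape version.
pairs = list(itertools.product(s.tolist(), c))
left = np.array([a for a, _ in pairs], dtype=object)
right = np.array([b for _, b in pairs], dtype=object)
flat = np.vectorize(levenshtein)(left, right)
by_reshape = pd.DataFrame(flat.reshape((len(s), len(c))), columns=c, index=s.tolist())

# Compare the raw values of both frames.
same = np.array_equal(by_column.to_numpy(), by_reshape.to_numpy())
print(same)
```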

Performance test:

import timeit
from Levenshtein import distance
import pandas as pd
import numpy as np
import itertools

s = pd.Series(data=['Hello', 'my', 'Friend'],
              index=['Hello', 'my', 'Friend'])
c = ['me', 'mine', 'Friend']

test_code0 = """
df = pd.DataFrame()
for w in c:
    df[w] = s.apply(lambda x: distance(x, w))
"""

test_code1 = """
df = pd.DataFrame({w:s.apply(lambda x: distance(x, w)) for w in c})
"""

test_code2 = """
@np.vectorize
def lavdist(a, b):
    return distance(s.iloc[a], c[b])

df = pd.DataFrame(np.fromfunction(lavdist, (len(s), len(c)), dtype=int),
                  columns=c, index=s)
"""

test_code3 = """
word_combinations = np.array(list(itertools.product(s.values, c)))
vectorized_levenshtein = np.vectorize(distance)
result = vectorized_levenshtein(word_combinations[:,0], word_combinations[:,1])
df = pd.DataFrame([word_combinations[:,1], word_combinations[:,1], result])
df = df.set_index([0,1])[2] #.unstack() produces error
"""

test_code4 = """
word_combinations = np.array(list(itertools.product(s.values, c)))
vectorized_levenshtein = np.vectorize(distance)
result = vectorized_levenshtein(word_combinations[:,0], word_combinations[:,1])
result = result.reshape((len(s), len(c)))
df = pd.DataFrame(result, columns=c, index=s)
"""

test_setup = "from __main__ import distance, s, c, pd, np, itertools"

print("test0", timeit.timeit(test_code0, number = 1000, setup = test_setup))
print("test1", timeit.timeit(test_code1, number = 1000, setup = test_setup))
print("test2", timeit.timeit(test_code2, number = 1000, setup = test_setup))
print("test3", timeit.timeit(test_code3, number = 1000, setup = test_setup))
print("test4", timeit.timeit(test_code4, number = 1000, setup = test_setup))

Results

# results
# test0 1.3671939949999796
# test1 0.5982696900009614
# test2 0.3246431229999871
# test3 2.0100400850005826
# test4 0.23796007100099814
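These timings come from a 3x3 input, so it may be worth repeating the comparison on something larger before drawing conclusions. A rough sketch of how that could look, assuming a pure-Python levenshtein stand-in and synthetic words (the helper, the generated data and the function names per_column and product_reshape are all illustrative; absolute numbers will differ with a compiled distance function):

```python
import itertools
import timeit

import numpy as np
import pandas as pd


def levenshtein(a, b):
    """Dynamic-programming edit distance; stand-in for a compiled implementation."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


# Synthetic, deterministic inputs: 200 series entries, 20 unique candidates.
s = pd.Series([f"item{i * 7919 % 1000}" for i in range(200)])
c = [f"term{i}" for i in range(20)]


def per_column():
    # One Series.apply per candidate word.
    return pd.DataFrame({w: s.apply(lambda x, w=w: levenshtein(x, w)) for w in c})


def product_reshape():
    # All pairs at once, then reshape into (len(s), len(c)).
    pairs = np.array(list(itertools.product(s.tolist(), c)), dtype=object)
    flat = np.vectorize(levenshtein)(pairs[:, 0], pairs[:, 1])
    return pd.DataFrame(flat.reshape((len(s), len(c))), columns=c, index=s.tolist())


t_apply = timeit.timeit(per_column, number=3)
t_reshape = timeit.timeit(product_reshape, number=3)
print("per_column     ", t_apply)
print("product_reshape", t_reshape)
```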

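With itertools you can at least get all the required combinations. With a vectorized version of stringdist.levenshtein (made with numpy.vectorize()) you can then get the desired result without writing an explicit loop, although I haven't tested the performance of the vectorized levenshtein function.

The code could look like this: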

import stringdist
import numpy as np
import pandas as pd
import itertools

s = pd.Series(["Hello", "my","Friend"])
c = ['me', 'mine', 'Friend']

word_combinations = np.array(list(itertools.product(s.values, c)))
vectorized_levenshtein = np.vectorize(stringdist.levenshtein)
result = vectorized_levenshtein(word_combinations[:,0], word_combinations[:,1])

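At this point you have the results in a numpy array, one entry for each possible combination of your two initial arrays. If you want to get it into the shape from your example, some pandas trickery is needed: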

df = pd.DataFrame([word_combinations[:,0], word_combinations[:,1], result]).T

### initially looks like: ###
#         0       1  2
# 0   Hello      me  4
# 1   Hello    mine  5
# 2   Hello  Friend  6
# 3      my      me  1
# 4      my    mine  3
# 5      my  Friend  6
# 6  Friend      me  5
# 7  Friend    mine  4
# 8  Friend  Friend  0

df = df.set_index([0,1])[2].unstack()

### Now looks like: ###
# 1      Friend me mine
# 0
# Friend      0  5    4
# Hello       6  4    5
# my          6  1    3
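Again, I haven't tested the performance of this approach, so I'd recommend checking it, though it should be faster than iterating.

Edit: user @Dames has a better suggestion for getting the result into the desired shape: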

result = result.reshape((len(s), len(c)))
df = pd.DataFrame(result, columns=c, index=s)
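As a further variation, np.vectorize follows NumPy broadcasting rules, so the product and reshape steps can in principle be skipped by passing a column vector and a row vector. A minimal sketch, assuming a pure-Python levenshtein helper as a stand-in for stringdist.levenshtein (the helper and the names rows and cols are illustrative). Keep in mind that np.vectorize is essentially a Python-level loop provided for convenience, so any speedup is not guaranteed and should be measured:

```python
import numpy as np
import pandas as pd


def levenshtein(a, b):
    """Dynamic-programming edit distance; stand-in for stringdist.levenshtein."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


s = pd.Series(["Hello", "my", "Friend"])
c = ['me', 'mine', 'Friend']

rows = s.to_numpy(dtype=object)[:, None]    # shape (len(s), 1)
cols = np.array(c, dtype=object)[None, :]   # shape (1, len(c))

# Broadcasting pairs every row entry with every column entry.
result = np.vectorize(levenshtein)(rows, cols)   # shape (len(s), len(c))
df = pd.DataFrame(result, index=s.to_numpy(dtype=object), columns=c)
print(df)
```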