TypeError: expected string or bytes-like object on Pandas using Fuzzy matching
Background
I have a df
import pandas as pd
import nltk
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
df = pd.DataFrame({'ID': [1, 2, 3],
                   'Text': ['This num dogs and cats is (111)888-8780 and other',
                            'dont block cow 23 here',
                            'cat two num: dog and cows here']
                   })
I also have a list
word_list = ['dog', 'cat', 'cow']
and a function that is supposed to fuzzy-match the Text column of df against word_list
def fuzzy(row, word_list):
    tweet = row[0]
    fuzzy_match = []
    for word in word_list:
        token_words = nltk.word_tokenize(tweet)
        for token in range(0, len(token_words) - 1):
            fuzzy_fx = process.extract(word_list[word], token_words[token], limit=100, scorer = fuzz.ratio)
            fuzzy_match.append(fuzzy_fx[0])
    return pd.Series([fuzzy_match], index = ['Fuzzy_Match'])
Then I join the result back onto df
df_fuzz = df.join(df.apply(lambda x: fuzzy(x, word_list), axis = 1))
But I get an error
TypeError: expected string or bytes-like object
Desired output
The output I want is 1) a new column Fuzzy_Match holding the output of the fuzzy function
   ID  Text                                                 Fuzzy_Match
0  1   This num dogs and cats is (111)888-8780 and other    output of fuzzy 1
1  2   dont block cow 23 here                               output of fuzzy 2
2  3   cat two num: dog and cows here                       output of fuzzy 3
Question
What do I need to do to get my desired output?
This should work:
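def fuzzy(row, word_list):
    # Read the Text column by label; row[0] in the question is the integer ID.
    tweet = row['Text']
    fuzzy_match = []
    # Tokenize once per row rather than once per word.
    token_words = nltk.word_tokenize(tweet)
    for word in word_list:
        # Pass the word itself as the query and the full token list as the choices.
        fuzzy_fx = process.extract(word, token_words, limit=100, scorer = fuzz.ratio)
        # Results are sorted by score, so the first entry is the best match.
        fuzzy_match.append(fuzzy_fx[0])

    return pd.Series([fuzzy_match], index = ['Fuzzy_Match'])

df_fuzz = df.join(df.apply(lambda x: fuzzy(x, word_list), axis = 1))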
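process.extract() expects a list of choices as its second argument, not a single token. Also note that in the question row[0] picks the integer ID column rather than Text, so nltk.word_tokenize receives a number, which is most likely what raises the TypeError. You can read more about process.extract() here:
python fuzzywuzzy's process.extract(): how does it work?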
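If it helps to see the two problems in isolation, here is a minimal, dependency-free sketch. It is not the fuzzywuzzy implementation: `simple_ratio` and `best_match` are hypothetical helpers, `difflib.SequenceMatcher` is used as a rough stand-in for `fuzz.ratio`, and `str.split()` stands in for `nltk.word_tokenize`, so the scores you get from the real libraries may differ.

```python
import re
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Similarity score from 0 to 100 (assumed stand-in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def best_match(word, choices):
    """Score `word` against every item in `choices` and keep the best pair."""
    scored = ((choice, simple_ratio(word, choice)) for choice in choices)
    return max(scored, key=lambda pair: pair[1])


# One row of the question's DataFrame, written as a plain dict.
row = {'ID': 1, 'Text': 'This num dogs and cats is (111)888-8780 and other'}

# Problem 1: regex-based tokenizing of the integer ID raises a TypeError,
# just as tokenizing row[0] did in the question.
try:
    re.findall(r'\w+', row['ID'])
    id_error = None
except TypeError as exc:
    id_error = exc

# Problem 2: each word is compared against the whole list of tokens,
# not against one token at a time.
tokens = row['Text'].split()
matches = [best_match(word, tokens) for word in ['dog', 'cat', 'cow']]

print(type(id_error).__name__)  # TypeError
print(matches[0])               # ('dogs', 86)
print(matches[1])               # ('cats', 86)
```

The point of the sketch is only the shape of the call: one query string, one collection of candidate strings, and a string (not the ID) going into the tokenizer.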