Comparing multiple columns for a single row
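I am grouping columns and identifying the rows that have differing values within each group. For example, I can group columns A, B, C, D and drop column A because it differs (row 2 is 2.1). Likewise, I can group columns E, F, G, H and drop column G because row 0 differs (it is Blue).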
  |  A  | B |  C   |   D    |   E   |   F   |   G   |   H
--|-----|---|------|--------|-------|-------|-------|------
0 | 1.0 | 1 | 1 in | 1 inch | Red   | Red   | Blue  | Red
1 | 2.0 | 2 | 2 in | 2 inch | Green | Green | Green | Green
2 | 2.1 | 2 | 2 in | 2 inch | Blue  | Blue  | Blue  | Blue
Here is what I have tried so far to compare the values:
import difflib
text1 = '1.0'
text2 = '1 in'
text3 = '1 inch'
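# SequenceMatcher compares exactly two sequences; a fourth positional argument
# would be read as the autojunk flag, so text3 is left out of this call.
output = str(int(difflib.SequenceMatcher(None, text1, text2).ratio()*100))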
output: '28'
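This does not compare well when a number is followed by a unit of measurement such as inches or millimeters. I then tried spacy.load('en_core_web_sm'), which worked better, but it is still not there. Is there a way to compare a set of values like 1.0, 1, 1 in, 1 inch?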
For columns that contain only strings, you can use pandas df.equals()
to compare two DataFrames or Series (columns).
# Example
df.E.equals(df.F)
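You can use this function to compare many columns against a single column, which I call the main or template column; it should be the column that holds the "correct" values.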
def col_compare(main_col, *to_compare):
    '''Compare each listed column against a reference column.

    Inputs:
    * main_col: the reference column name (e.g. 'A')
    * to_compare: any number of column names to check (e.g. 'B', 'C')

    Relies on a DataFrame named df being defined in the enclosing scope.
    '''
    # Columns to compare, as a list
    to_compare = list(to_compare)
    # Names of the columns that differ from main_col
    results = []
    # Compare each column with the template column
    for col in to_compare:
        if not df[main_col].equals(df[col]):
            results.append(col)
    print(f'Main Column: {main_col}')
    print(f'Compared to: {to_compare}')
    return f"The columns that have different values from {main_col} are {results}"
For example:
`col_compare('E', 'F', 'G', 'H')`
output:
Main Column: E
Compared to: ['F', 'G', 'H']
The columns that have different values from E are ['G']
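Columns A, B, C and D hold the numbers you want to compare, but some values are followed by a piece of text. One option is to extract the numbers into new columns used only for the comparison; you can drop them afterwards.
You can use the following code to create a new column for each column that mixes numbers and strings:
import re
# [0-9]+ picks up the first run of digits, e.g. '2 in' -> 2
df['C_num'] = df.C.apply(lambda x: int(re.search(r'[0-9]+', x).group()))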
Then use the col_compare function above to run the comparison between the numeric columns.
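The extraction step can also be sketched without pandas, which makes the idea easier to test in isolation. In the sketch below, first_number and differing_columns are hypothetical helper names (not part of pandas or of the answer above), the columns are plain lists, and the unit text is ignored completely, so '1 in' and '1 mm' would count as equal.

```python
import re

# Hypothetical helpers for illustration; the names are not from any library.
_NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")

def first_number(value):
    """Return the first number found in value as a float, or None."""
    match = _NUMBER.search(str(value))
    return float(match.group()) if match else None

def differing_columns(columns, main_col):
    """Return the names of columns whose numeric values differ from main_col."""
    reference = [first_number(v) for v in columns[main_col]]
    return [
        name
        for name, values in columns.items()
        if name != main_col and [first_number(v) for v in values] != reference
    ]

# The numeric group from the question, written as a dict of lists.
columns = {
    "A": [1.0, 2.0, 2.1],
    "B": [1, 2, 2],
    "C": ["1 in", "2 in", "2 in"],
    "D": ["1 inch", "2 inch", "2 inch"],
}
print(differing_columns(columns, "B"))  # ['A']
```

With a DataFrame, the same helper could be applied per column with something like df[col].map(first_number) before calling col_compare. Parsing to float rather than int keeps 2.1 distinct from 2, which is what singles out column A here.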
I found an answer to my question. Crystal L recommended that I use FuzzyMatch, which I found useful. Here is the documentation: https://www.datacamp.com/community/tutorials/fuzzy-string-python Below are a few things I tried:
import numpy as np

# Function to compare length and similar characters
def levenshtein_ratio_and_distance(s, t, ratio_calc=False):
    """ levenshtein_ratio_and_distance:
        Calculates the Levenshtein distance between two strings.
        If ratio_calc = True, the function computes the
        Levenshtein distance ratio of similarity between two strings.
        For all i and j, distance[i,j] will contain the Levenshtein
        distance between the first i characters of s and the
        first j characters of t.
    """
    # Initialize matrix of zeros
    rows = len(s) + 1
    cols = len(t) + 1
    distance = np.zeros((rows, cols), dtype=int)

    # Fill the first column and first row with the prefix lengths
    for i in range(1, rows):
        distance[i][0] = i
    for k in range(1, cols):
        distance[0][k] = k

    # Iterate over the matrix to compute the cost of deletions, insertions and/or substitutions
    for col in range(1, cols):
        for row in range(1, rows):
            if s[row-1] == t[col-1]:
                cost = 0  # If the characters are the same in the two strings in a given position [i,j] then the cost is 0
            else:
                # In order to align the results with those of the Python Levenshtein package, if we choose to calculate the ratio
                # the cost of a substitution is 2. If we calculate just distance, then the cost of a substitution is 1.
                if ratio_calc:
                    cost = 2
                else:
                    cost = 1
            distance[row][col] = min(distance[row-1][col] + 1,       # Cost of deletions
                                     distance[row][col-1] + 1,       # Cost of insertions
                                     distance[row-1][col-1] + cost)  # Cost of substitutions

    # The bottom-right cell holds the result for the full strings
    final = distance[rows-1][cols-1]
    if ratio_calc:
        # Computation of the Levenshtein Distance Ratio
        Ratio = ((len(s)+len(t)) - final) / (len(s)+len(t))
        return Ratio
    else:
        # print(distance) # Uncomment if you want to see the matrix showing how the algorithm computes the cost of deletions,
        # insertions and/or substitutions
        # This is the minimum number of edits needed to convert string s to string t
        return "The strings are {} edits away".format(final)
Str1= '1 mm'
Str2= '1 in'
Distance = levenshtein_ratio_and_distance(Str1.lower(),Str2.lower())
print(Distance)
Ratio = levenshtein_ratio_and_distance(Str1.lower(),Str2.lower(),ratio_calc = True)
print(Ratio)
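As a cross-check on the function above that does not need NumPy, here is a minimal sketch of the same recurrence written with memoized recursion. The names edit_distance and similarity_ratio are made up for this illustration, and the recursion is only meant for short strings such as the measurement values in the question.

```python
from functools import lru_cache

def edit_distance(s, t, sub_cost=1):
    """Levenshtein distance with a configurable substitution cost."""
    @lru_cache(maxsize=None)
    def d(i, j):
        # d(i, j) is the distance between s[:i] and t[:j].
        if i == 0:
            return j
        if j == 0:
            return i
        cost = 0 if s[i - 1] == t[j - 1] else sub_cost
        return min(
            d(i - 1, j) + 1,         # deletion
            d(i, j - 1) + 1,         # insertion
            d(i - 1, j - 1) + cost,  # match or substitution
        )
    return d(len(s), len(t))

def similarity_ratio(s, t):
    """Ratio in [0, 1] using a substitution cost of 2, as in the function above."""
    total = len(s) + len(t)
    if total == 0:
        return 1.0  # assumption: two empty strings count as identical
    return (total - edit_distance(s, t, sub_cost=2)) / total

print(edit_distance("1 mm", "1 in"))     # 2
print(similarity_ratio("1 mm", "1 in"))  # 0.5
```

For '1 mm' and '1 in' the two trailing characters differ, so the plain distance is 2; with substitutions costing 2 the weighted distance is 4, and the ratio is (8 - 4) / 8 = 0.5.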
import Levenshtein as lev
Str1= '1 mm'
Str2= '1 in'
Distance = lev.distance(Str1.lower(),Str2.lower())
print(Distance)
Ratio = lev.ratio(Str1.lower(),Str2.lower())
print(Ratio)
# pip install fuzzywuzzy
from fuzzywuzzy import fuzz
Str1= '2 inches'
Str2= '1 mm'
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())
Token_Sort_Ratio = fuzz.token_sort_ratio(Str1,Str2)
Token_Set_Ratio = fuzz.token_set_ratio(Str1,Str2)
print(Ratio)
print(Partial_Ratio)
print(Token_Sort_Ratio)
print(Token_Set_Ratio)
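All of the ratio functions above score two strings at a time, while the question is about a group of values such as 1.0, 1, 1 in, 1 inch. One way to cover a whole group is to score every pair. The sketch below uses the standard library's difflib.SequenceMatcher as a stand-in for the fuzzy-matching packages, so the exact scores may differ from theirs; pairwise_ratios and lowest_ratio are hypothetical names chosen for this example.

```python
from difflib import SequenceMatcher
from itertools import combinations

def pairwise_ratios(values):
    """Return (a, b, score) for every pair, with score as an integer percentage."""
    texts = [str(v).lower() for v in values]
    return [
        (a, b, round(SequenceMatcher(None, a, b).ratio() * 100))
        for a, b in combinations(texts, 2)
    ]

def lowest_ratio(values):
    """Score of the least similar pair in the group."""
    return min(score for _, _, score in pairwise_ratios(values))

row = ["1.0", "1", "1 in", "1 inch"]
for a, b, score in pairwise_ratios(row):
    print(f"{a!r} vs {b!r}: {score}")
print("lowest:", lowest_ratio(row))
```

The lowest pairwise score gives a single number per row that can be checked against a threshold. Character-level similarity still treats '1 in' and '1 mm' as fairly close, so for measurements it probably works best combined with the numeric extraction described earlier.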