Comparing multiple columns for a single row

Question

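I am grouping columns and identifying, within each group, the rows that have different values. For example: I can group columns A, B, C, D and drop column A because it differs (row 2 is 2.1). Likewise, I can group columns E, F, G, H and drop column G because it differs in the first row (row 0 is Blue).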

      A |  B |  C   |   D    |   E   |   F   |   G  |   H
  | ---------------------------------------------------------|
0 | 1.0 |  1 | 1 in | 1 inch | Red   |  Red  | Blue |  Red
  | ---------------------------------------------------------|
1 | 2.0 |  2 | 2 in | 2 inch | Green | Green | Green| Green
  | ---------------------------------------------------------|
2 | 2.1 |  2 | 2 in | 2 inch | Blue  |  Blue | Blue |  Blue

Here is what I have tried so far to compare the values:

import difflib
text1 = '1.0'
text2 = '1 in'
text3 = '1 inch'
output = str(int(difflib.SequenceMatcher(None, text1, text2, text3).ratio()*100))

output: '28'

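This does not do a good job of comparing numbers followed by a unit of measurement such as inches or mm. I then tried spacy.load('en_core_web_sm'), which works better, but it is still not there. Is there any way to compare a set of values like 1.0, 1, 1 in, 1 inch?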

Answer 1

For columns that contain only strings, you can use pandas df.equals() to compare two DataFrames or Series (columns).

#Example    
df.E.equals(df.F)

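You can use this function to compare many columns against a single column, which I call the main or template column. It should be the column that holds the "correct" values.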

def col_compare(main_col, *to_compare):
  '''Compares each column from a list to another column
  Inputs: 
  * main_col: enter the column name (e.g. 'A')
  * to_compare: enter as many column names as you want (e.g. 'B', 'C') '''
  # Columns to compare to list
  to_compare = list(to_compare)
  # List to store results
  results = []

  # Compare columns from the list with the template column
  for col in to_compare:
    if not df[main_col].equals(df[col]):
      results.append(col)
  
  print(f'Main Column: {main_col}')
  print(f'Compared to: {to_compare}')
  return f"The columns that have different values from {main_col} are {results}"

For example:

`col_compare('E', 'F', 'G', 'H')`

output:
Main Column: E
Compared to: ['F', 'G', 'H']
The columns that have different values from E are ['G']
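The function above reads a global `df`. To try the idea on the table from the question, something like the following sketch could be used. The values for columns E to H are copied from the question; `differing_columns` is a hypothetical variant (not part of the answer above) that takes the DataFrame as an argument and returns a list instead of a formatted string.

```python
import pandas as pd

# Columns E-H copied from the table in the question
df = pd.DataFrame({
    "E": ["Red", "Green", "Blue"],
    "F": ["Red", "Green", "Blue"],
    "G": ["Blue", "Green", "Blue"],
    "H": ["Red", "Green", "Blue"],
})

def differing_columns(frame, main_col, *to_compare):
    """Return the columns in to_compare whose values differ from main_col."""
    return [col for col in to_compare if not frame[main_col].equals(frame[col])]

result = differing_columns(df, "E", "F", "G", "H")
print(result)
```

Returning a list makes it straightforward to pass the result on, for instance to `df.drop(columns=result)`.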

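For columns A, B, C and D, you have numbers to compare, but some are followed by string fragments. One option is to extract the numbers into new columns used only for the comparison; you can drop them afterwards. You can use the following code to create a new column for each column that contains both numbers and strings: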

import re
df['C_num'] = df.C.apply( lambda x: int(re.search('[0-9]+', x).group() ) )

Then use the col_compare function above to run the comparison between the numeric columns.
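A slightly more defensive version of the extraction step might look like the sketch below. `leading_number` is a hypothetical helper, not something from pandas or from the answer above. It assumes each value starts with a number and that the unit text can be ignored, which is a simplification: '1 in' and '1 mm' would both become 1.0.

```python
import re

# Matches an integer or decimal number at the start of the text
_NUMBER = re.compile(r"^\s*(\d+(?:\.\d+)?)")

def leading_number(value):
    """Return the leading number of value as a float, or None if there is none."""
    match = _NUMBER.match(str(value))
    return float(match.group(1)) if match else None

# Rows 0 and 2 of columns A-D from the question
row0 = [1.0, 1, "1 in", "1 inch"]
row2 = [2.1, 2, "2 in", "2 inch"]
print([leading_number(v) for v in row0])
print([leading_number(v) for v in row2])
```

Because it accepts any value and returns None instead of raising when nothing matches, it could be passed straight to a column, e.g. `df['C_num'] = df.C.apply(leading_number)`.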

Answer 2

I found an answer to my question. Crystal L recommended FuzzyMatch to me, and I found it useful. Here is the documentation: https://www.datacamp.com/community/tutorials/fuzzy-string-python Below are a few things I tried:

import numpy as np

# Function to compare length and similar characters
def levenshtein_ratio_and_distance(s, t, ratio_calc = False):
    """ levenshtein_ratio_and_distance:
        Calculates levenshtein distance between two strings.
        If ratio_calc = True, the function computes the
        levenshtein distance ratio of similarity between two strings
        For all i and j, distance[i,j] will contain the Levenshtein
        distance between the first i characters of s and the
        first j characters of t
    """
    # Initialize matrix of zeros
    rows = len(s)+1
    cols = len(t)+1
    distance = np.zeros((rows,cols),dtype = int)

    # Populate the first column and first row with the indices of each character of both strings
    for i in range(1, rows):
        distance[i][0] = i
    for k in range(1, cols):
        distance[0][k] = k

    # Iterate over the matrix to compute the cost of deletions,insertions and/or substitutions    
    for col in range(1, cols):
        for row in range(1, rows):
            if s[row-1] == t[col-1]:
                cost = 0 # If the characters are the same in the two strings in a given position [i,j] then the cost is 0
            else:
                # In order to align the results with those of the Python Levenshtein package, if we choose to calculate the ratio
                # the cost of a substitution is 2. If we calculate just distance, then the cost of a substitution is 1.
                if ratio_calc == True:
                    cost = 2
                else:
                    cost = 1
            distance[row][col] = min(distance[row-1][col] + 1,      # Cost of deletions
                                 distance[row][col-1] + 1,          # Cost of insertions
                                 distance[row-1][col-1] + cost)     # Cost of substitutions
    # The bottom-right cell holds the distance between the two full strings
    if ratio_calc == True:
        # Computation of the Levenshtein Distance Ratio
        Ratio = ((len(s)+len(t)) - distance[rows-1][cols-1]) / (len(s)+len(t))
        return Ratio
    else:
        # print(distance) # Uncomment if you want to see the matrix showing how the algorithm computes the cost of deletions,
        # insertions and/or substitutions
        # This is the minimum number of edits needed to convert string s to string t
        return "The strings are {} edits away".format(distance[rows-1][cols-1])

Str1= '1 mm'
Str2= '1 in'
Distance = levenshtein_ratio_and_distance(Str1.lower(),Str2.lower())
print(Distance)
Ratio = levenshtein_ratio_and_distance(Str1.lower(),Str2.lower(),ratio_calc = True)
print(Ratio)

import Levenshtein as lev
Str1= '1 mm'
Str2= '1 in'
Distance = lev.distance(Str1.lower(),Str2.lower())
print(Distance)
Ratio = lev.ratio(Str1.lower(),Str2.lower())
print(Ratio)

# pip install fuzzywuzzy 
from fuzzywuzzy import fuzz
Str1= '2 inches'
Str2= '1 mm'

Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())
Token_Sort_Ratio = fuzz.token_sort_ratio(Str1,Str2)
Token_Set_Ratio = fuzz.token_set_ratio(Str1,Str2)
print(Ratio)
print(Partial_Ratio)
print(Token_Sort_Ratio)
print(Token_Set_Ratio)
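To connect these ratios back to the original table, one possible approach is to score every value in a row against the others and flag the one that matches least. The sketch below uses the standard library's `difflib.SequenceMatcher` as a stand-in for `fuzz.ratio` so that it runs without extra packages; scores may differ slightly from fuzzywuzzy. `similarity` and `least_similar` are hypothetical helper names, and the row is assumed to hold at least two values.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity of two values as a rounded percentage (0-100)."""
    return round(SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio() * 100)

def least_similar(values):
    """Return the index of the value with the lowest average similarity to the rest."""
    averages = []
    for i, value in enumerate(values):
        others = [similarity(value, other) for j, other in enumerate(values) if j != i]
        averages.append(sum(others) / len(others))
    return averages.index(min(averages))

# Row 0 of columns E-H from the question
colors = ["Red", "Red", "Blue", "Red"]
outlier = least_similar(colors)
print(outlier, colors[outlier])
```

This works reasonably for the color columns. For values such as 1.0, 1, 1 in and 1 inch, character-based ratios stay low even when the numbers agree, so comparing the extracted numbers is likely to be more reliable there.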
