In Python, I am comparing dataframe columns containing strings to decide whether each row should pass or fail. How can I stop rows from passing when they should fail?

I have 20+ test cases that check a CSV for data anomalies caused by data entry. This test case (#15) compares the salutation and addressee against marital status.

# Test case  15
# Compares MrtlStat to  PrimAddText and PrimSalText
df = data[data['MrtlStat'].str.contains("Widow|Divorced|Single")]
df = df[df['PrimAddText'].str.contains("AND|&", na=False)]
data_15 = df[df['PrimSalText'].str.contains("AND|&", na=False)]

# Adds row to list of failed data
ids = data_15.index.tolist()

# Keep track of data that failed test case 15 
for i in ids:
  data.at[i,'Test Case Failed']+=', 15'

If MrtlStat contains Widow, Divorced, or Single, and PrimAddText or PrimSalText contains AND or &, the row should fail the test. As written, the test only works when both PrimSalText and PrimAddText contain AND or &.

Table showing data that passes but should fail:

PrimAddText             | PrimSalText                  | MrtlStat
Mrs. Judith Elfrank     | Mr. & Mrs. Elfrank & Michael | Widowed
Mr. & Mrs.Karl Magnusen | Mr. Magnusen                 | Widowed

Table showing data that fails as expected:

PrimAddText        | PrimSalText                  | MrtlStat
Mr. & Mrs. Elfrank | Mr. & Mrs. Elfrank & Michael | Widowed

How can I adjust the test so that it also works when only one column (PrimSalText or PrimAddText) contains AND or &?

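Filtering the data sequentially ANDs every filter together, so a row must match all of them. Instead, combine the conditions into a single one (using & and |). A good way to do this is numpy.where: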

import pandas as pd
import numpy as np

# construct data
data = pd.DataFrame({
    'PrimAddText': ['Mrs. Judith Elfrank', 'Mr. & Mrs.Karl Magnusen', 'Mr. & Mrs. Elfrank'],
    'PrimSalText': ['Mr. & Mrs. Elfrank & Michael', 'Mr. Magnusen', 'Mr. & Mrs. Elfrank & Michael'],
    'MrtlStat': ['Widowed', 'Widowed', 'Widowed']
})

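# Case 15 - build each condition as a boolean mask
# (na=False treats missing values as "no match")
is_unmarried = data['MrtlStat'].str.contains("Widow|Divorced|Single", na=False)
add_has_and = data['PrimAddText'].str.contains("AND|&", na=False)
sal_has_and = data['PrimSalText'].str.contains("AND|&", na=False)

# a row fails (False) if it is unmarried AND either text column contains AND or &
data['Status_case15'] = np.where(is_unmarried & (add_has_and | sal_has_and), False, True)

# filter failing rows
data.loc[~data['Status_case15']]

# count passing rows
data['Status_case15'].sum()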
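
To tie this back to the 'Test Case Failed' bookkeeping in the question, the combined mask can also be used directly with .loc in place of the loop. This is only a minimal sketch: it assumes the 'Test Case Failed' column already exists and starts out as empty strings, the helper name flag_case_15 is hypothetical, and the fourth 'Married' row is made up here to show a passing case.

```python
import pandas as pd

def flag_case_15(data: pd.DataFrame) -> pd.DataFrame:
    """Append ', 15' to 'Test Case Failed' for rows that fail test case 15."""
    # na=False makes missing values count as "no match" instead of NaN
    unmarried = data['MrtlStat'].str.contains("Widow|Divorced|Single", na=False)
    add_has_and = data['PrimAddText'].str.contains("AND|&", na=False)
    sal_has_and = data['PrimSalText'].str.contains("AND|&", na=False)
    failed = unmarried & (add_has_and | sal_has_and)
    # Update only the failing rows, without an explicit Python loop
    data.loc[failed, 'Test Case Failed'] += ', 15'
    return data

data = pd.DataFrame({
    'PrimAddText': ['Mrs. Judith Elfrank', 'Mr. & Mrs.Karl Magnusen',
                    'Mr. & Mrs. Elfrank', 'Mr. & Mrs. Elfrank'],
    'PrimSalText': ['Mr. & Mrs. Elfrank & Michael', 'Mr. Magnusen',
                    'Mr. & Mrs. Elfrank & Michael', 'Mr. & Mrs. Elfrank'],
    # The last row is an invented passing case (assumption, not from the question)
    'MrtlStat': ['Widowed', 'Widowed', 'Widowed', 'Married'],
    # Assumed starting state of the tracking column
    'Test Case Failed': ['', '', '', ''],
})

data = flag_case_15(data)
print(data['Test Case Failed'].tolist())
```

The three widowed rows are flagged, including the two that slipped through the sequential filter, while the married row is left untouched.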

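Your code effectively puts an AND condition between the second and third filters. You can separate them and capture the result of each, then merge the two results at the end: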

# Test case  15
# Compares MrtlStat to  PrimAddText and PrimSalText
df = data[data['MrtlStat'].str.contains("Widow|Divorced|Single")]
data_15_A = df[df['PrimAddText'].str.contains("AND|&", na=False)]
data_15_B = df[df['PrimSalText'].str.contains("AND|&", na=False)]

# Adds rows to list of failed data
# (union keeps each index once, even when both columns match)
ids = data_15_A.index.union(data_15_B.index).tolist()

# Keep track of data that failed test case 15 
for i in ids:
  data.at[i,'Test Case Failed']+=', 15'
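
One thing to watch when merging the two results: a row that matches in both columns appears in both data_15_A and data_15_B. The sketch below uses the three sample rows from the question to compare plain list concatenation with Index.union, which lists each failing row once. The variable name concatenated is only illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    'PrimAddText': ['Mrs. Judith Elfrank', 'Mr. & Mrs.Karl Magnusen', 'Mr. & Mrs. Elfrank'],
    'PrimSalText': ['Mr. & Mrs. Elfrank & Michael', 'Mr. Magnusen', 'Mr. & Mrs. Elfrank & Michael'],
    'MrtlStat': ['Widowed', 'Widowed', 'Widowed'],
})

df = data[data['MrtlStat'].str.contains("Widow|Divorced|Single", na=False)]
data_15_A = df[df['PrimAddText'].str.contains("AND|&", na=False)]
data_15_B = df[df['PrimSalText'].str.contains("AND|&", na=False)]

# Row 2 matches in both columns, so concatenating the lists repeats it
concatenated = data_15_A.index.tolist() + data_15_B.index.tolist()

# Index.union keeps each row label once
ids = data_15_A.index.union(data_15_B.index).tolist()

print(concatenated)  # [1, 2, 0, 2]
print(ids)           # [0, 1, 2]
```

With the repeated label, the loop would append ', 15' to the same row twice; the union avoids that.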