How to flag an anomaly in a data frame (row wise)?
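I'm new to Python, and I want to flag sporadic numbers that are obviously different from the rest of their row.
Put simply, flag the numbers that don't seem to belong in each row. Numbers in the 100s and 100000s are considered 'off the rest'.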
import pandas as pd
# Initialise the data as a dict of lists.
data = {'A': ['R1', 'R2', 'R3', 'R4', 'R5'],
        'B': [12005, 18190, 1021, 13301, 31119],
        'C': [11021, 19112, 19021, 15, 24509],
        'D': [10022, 19910, 19113, 449999, 25519],
        'E': [14029, 29100, 39022, 24509, 412271],
        'F': [52119, 32991, 52883, 69359, 57835],
        'G': [41218, 52991, 1021, 69152, 79355],
        'H': [43211, 7672991, 56881, 211, 77342],
        'J': [31211, 42901, 53818, 62158, 69325],
        }
# Create DataFrame
df = pd.DataFrame(data)
# Print summary statistics.
df.describe()
I'm trying to do something exactly like this:
# I need help with step 1
# my code/pseudocode
# step 1: identify the values in each row that don't belong to the group
# (`mask` below is a placeholder for the boolean DataFrame step 1 should produce; it is not defined yet)
# step 2: flag the identified values and export to Excel
style_df = mask.applymap(lambda x: "background-color: yellow" if x else "")  # flags the values that meet the criteria
with pd.ExcelWriter("flagged_data.xlsx", engine="openpyxl") as writer:
    df.style.apply(lambda x: style_df, axis=None).to_excel(writer, index=False)
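I think you should define more precisely what you consider 'off the rest'. That matters a lot when working with data.
For example, do you want to flag outliers within the distribution of column B? You could simply compute the quartiles of that distribution and collect, in some kind of dictionary, the values that fall below the lowest quartile or above the highest one. But you would obviously need more than the 5 rows you showed.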
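The quartile idea can be sketched roughly as follows. This is only an illustration, not a definitive method: it assumes the common 1.5 × IQR fences (Tukey's rule) instead of the raw quartiles, and the helper name `iqr_outliers` and the parameter `k` are made up for this example. With only five values the quartiles are not very meaningful.

```python
import pandas as pd

def iqr_outliers(col: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean Series marking values outside [Q1 - k*IQR, Q3 + k*IQR].

    Hypothetical helper; k=1.5 is the conventional Tukey fence, an assumption here.
    """
    q1 = col.quantile(0.25)
    q3 = col.quantile(0.75)
    iqr = q3 - q1
    return (col < q1 - k * iqr) | (col > q3 + k * iqr)

# Column B from the question.
b = pd.Series([12005, 18190, 1021, 13301, 31119], name="B")
flags = iqr_outliers(b)

# Collect the flagged values in a dictionary keyed by column name.
outliers = {b.name: b[flags].tolist()}
print(outliers)
```

Note that this works per column, while the question asks for a per-row comparison, so the two can disagree about what counts as an outlier.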
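There is also a whole field devoted to identifying outliers with machine learning. The assumptions you use to define what should be considered 'off the rest' are very important.
If you want to learn more about the details of outlier detection, read this:
https://towardsdatascience.com/a-brief-overview-of-outlier-detection-techniques-1e0b2c19e561
If you don't need machine-learning outlier detection or a Hampel filter, and you already know the limits for your filter, you can simply do: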
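def higlight_outliers(s):
    # force to numeric and coerce strings (e.g. the labels in column A) to NaN
    s = pd.to_numeric(s, errors='coerce')
    # True where the value falls outside the known limits; NaN compares as False
    indexes = (s < 1500) | (s > 1000000)
    return ['background-color: yellow' if v else '' for v in indexes]
# apply the function to each row (axis=1) and export the styled frame
styled = df.style.apply(higlight_outliers, axis=1)
styled.to_excel("flagged_data.xlsx", index=False)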
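If the limits are not known in advance, one possible row-wise rule is to compare every value with the median of its own row and flag anything that is off by more than some factor. The sketch below is only one assumption about what 'off the rest' means: the factor of 5 and the names `factor`, `mask`, and `styles` are chosen for illustration and are not taken from the question.

```python
import numpy as np
import pandas as pd

data = {'A': ['R1', 'R2', 'R3', 'R4', 'R5'],
        'B': [12005, 18190, 1021, 13301, 31119],
        'C': [11021, 19112, 19021, 15, 24509],
        'D': [10022, 19910, 19113, 449999, 25519],
        'E': [14029, 29100, 39022, 24509, 412271],
        'F': [52119, 32991, 52883, 69359, 57835],
        'G': [41218, 52991, 1021, 69152, 79355],
        'H': [43211, 7672991, 56881, 211, 77342],
        'J': [31211, 42901, 53818, 62158, 69325]}
df = pd.DataFrame(data)

# Work on the numeric columns only (column A holds row labels).
num = df.select_dtypes("number")

# Ratio of every value to the median of its own row.
row_median = num.median(axis=1)
ratio = num.div(row_median, axis=0)

# Assumed rule: flag values more than `factor` times above or below the row median.
factor = 5
mask = (ratio > factor) | (ratio < 1 / factor)

# Expand the mask to the full frame (column A is never flagged) and
# turn it into a same-shaped DataFrame of CSS strings for Styler.
full_mask = mask.reindex(columns=df.columns, fill_value=False).astype(bool)
styles = pd.DataFrame(np.where(full_mask, "background-color: yellow", ""),
                      index=df.index, columns=df.columns)

# Print the flagged cells as (row label, column, value).
for row, col in zip(*np.where(mask)):
    print(df.loc[row, 'A'], mask.columns[col], num.iat[row, col])
```

The `styles` frame can then be passed to the export step from the question, for example `df.style.apply(lambda _: styles, axis=None).to_excel("flagged_data.xlsx", index=False)`. A different factor, or a different reference such as the row mean, will flag different cells, so the threshold should be checked against real data.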
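I used two conditions here: one checks for values less than 1000 and the other for values greater than 99999. Based on these conditions, the code highlights the outliers in red.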
# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('pandas_conditional.xlsx', engine='xlsxwriter')
# Convert the dataframe to an XlsxWriter Excel object.
df.to_excel(writer, sheet_name='Sheet1')
# Get the xlsxwriter workbook and worksheet objects.
workbook = writer.book
worksheet = writer.sheets['Sheet1']
# Add a format. Light red fill with dark red text.
format1 = workbook.add_format({'bg_color': '#FFC7CE',
                               'font_color': '#9C0006'})
# Zero-based cell range of the numeric data: row 0 is the header,
# column 0 is the index and column 1 is 'A', so B..J are columns 2..9.
first_row = 1
first_col = 2
last_row = len(df)
last_col = 9
worksheet.conditional_format(first_row, first_col, last_row, last_col,
                             {'type': 'cell',
                              'criteria': '<',
                              'value': 1000,
                              'format': format1})
worksheet.conditional_format(first_row, first_col, last_row, last_col,
                             {'type': 'cell',
                              'criteria': '>',
                              'value': 99999,
                              'format': format1})
# Close the Pandas Excel writer and output the Excel file.
# (writer.save() was removed in pandas 2.0; close() also writes the file.)
writer.close()