How to search for duplicates and then highlight them in a dataframe just like excel (worksheets in that case) does?

I want to highlight some values in a dataframe. I need to compare two columns across different dataframes (df1 and df2), then highlight the duplicated values and show them in the first dataframe, df1.

To give you an idea, in Excel you can achieve this with the COUNTIF formula; here is a video:

https://www.youtube.com/watch?v=VhECzNIQTIY
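For context, the COUNTIF approach in the video counts, for each value in one column, how many times it appears in the other column, and highlights the non-zero matches. A rough pandas sketch of that count, with made-up frames df1 and df2 and a Name column just for illustration, could look like this:

import pandas as pd

df1 = pd.DataFrame({'Name': ['David', 'Sue', 'Mary']})
df2 = pd.DataFrame({'Name': ['Sue', 'Mary', 'Joe', 'Jack']})

# COUNTIF-like count: how many times each Name of df1 appears in df2
counts = df1['Name'].map(df2['Name'].value_counts()).fillna(0).astype(int)
# Boolean mask of the names that appear at least once in df2
is_duplicate = counts > 0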

Is there a way to do this with pandas, or with Python in general?

Thanks!


Update.

The code is below:

import pandas as pd

#Reading raw data from a csv file
DataOrigin = pd.read_csv('RAWDATA.csv')
#Sorting raw data per interesting columns
DataOriginSorted = DataOrigin.sort_values(['srcip','attack','dstip'])
#Reading some columns of historical data and sorting them
Historicaldata2 = pd.read_excel('Historicaldata.xlsx', sheet_name=1, usecols = ['Source_IP','Ticket','Customer_Notification','Hostname','Service_desk_ticket'])
Historicaldata2Sorted = Historicaldata2.sort_values(['Source_IP','Ticket'])
#Creating a MultiIndex variable with sorted raw data
index = pd.MultiIndex.from_frame(DataOriginSorted)
Sorted_DataOrigin = pd.DataFrame(index=index)
#Making a count of events per source IP and exporting it as a csv for the code to work (rename column operation)
Daily_IncidentsIPS = pd.crosstab(DataOrigin.srcip,DataOrigin.attack).to_csv('ControlFile1.csv')
Daily_IncidentsIPS = pd.read_csv('ControlFile1.csv').rename(columns = {'srcip': 'Source_IP'}, inplace = False )
#Merging 2 dataframes to find coincident data, exporting them to a csv for the next operations to take place, and using only the interesting columns
Historical2vsSortedOrigin = Historicaldata2Sorted.merge(Daily_IncidentsIPS,left_on='Source_IP',right_on='Source_IP', how='inner').to_csv('ControlFile2.csv')
Historical2vsSortedOrigin = pd.read_csv('ControlFile2.csv', usecols = ['Ticket','Hostname','Source_IP','Customer_Notification','Service_desk_ticket'])
#Searching for duplicated data between two interesting dataframes
duplicated = Daily_IncidentsIPS['Source_IP'].isin(Historical2vsSortedOrigin['Source_IP'])
#Creating a rule to color the rows where the duplicated values are present
def row_styler(row):
    return ['background-color: yellow' if duplicated[row.name] else ''] * len(row)

#Creating a MultiIndex variable to show the data as I want it
index2 = pd.MultiIndex.from_frame(Historical2vsSortedOrigin)
IncidentMatching = pd.DataFrame(index=index2)
#Saving 3 interesting dataframes in an excel file, highlighting the results of previous "search for duplicated" operation
#Raw string so the backslashes in the Windows path are not treated as escape sequences
writer = pd.ExcelWriter(r'C:\Users\myuser\Documents\Spyder\Results_IPS.xlsx', engine='xlsxwriter')
Daily_IncidentsIPS.style.apply(row_styler, axis=1).to_excel(writer, sheet_name='Sheet1')
Sorted_DataOrigin.to_excel(writer, sheet_name='Sheet2')
IncidentMatching.to_excel(writer, sheet_name='Sheet3')
writer.save()
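As a side note, the csv round-trips above (ControlFile1.csv and ControlFile2.csv) are only used to flatten the crosstab and merge results back into plain columns; a rough in-memory equivalent, assuming the same column names as in the script, could look like this:

# Count of events per source IP, flattened back to ordinary columns without a csv round-trip
Daily_IncidentsIPS = (pd.crosstab(DataOrigin.srcip, DataOrigin.attack)
                        .reset_index()
                        .rename(columns={'srcip': 'Source_IP'}))

# Merge and keep only the interesting columns, again without an intermediate csv
Historical2vsSortedOrigin = (Historicaldata2Sorted
                             .merge(Daily_IncidentsIPS, on='Source_IP', how='inner')
                             [['Ticket', 'Hostname', 'Source_IP', 'Customer_Notification', 'Service_desk_ticket']])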

You can use the styling system in pandas:

import pandas as pd

# Some mock data
df1 = pd.DataFrame({
    'Name': ['David', 'Sue', 'Mary'],
    'Location': ['San Francisco', 'New York', 'Boston']
})

df2 = pd.DataFrame({
    'Name': ['Sue', 'Mary', 'Joe', 'Jack']
})

# Now determine what Name is duplicated across the frames
duplicated = df1['Name'].isin(df2['Name'])

The next step depends on what you want to highlight. If you only want to highlight the duplicated values (as Excel does):

def col_styler(col):
    if col.name != 'Name':
        return [''] * len(col)

    return duplicated.map({
        True: 'background-color: yellow',
        False: ''
    })

df1.style.apply(col_styler)

Output (df1 with the duplicated values in the Name column highlighted in yellow):

If you want to highlight the entire row instead:

def row_styler(row):
    return ['background-color: yellow' if duplicated[row.name] else ''] * len(row)

df1.style.apply(row_styler, axis=1)

Output (df1 with the whole rows of the duplicated names highlighted in yellow):
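Either styled frame can also be written straight to an Excel file, as in the update above. A minimal sketch, assuming xlsxwriter (or openpyxl) is installed and using a placeholder file name:

# Write the highlighted frame to Excel; the output path here is just an example
df1.style.apply(row_styler, axis=1).to_excel('highlighted.xlsx', engine='xlsxwriter', sheet_name='Sheet1')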