How to search for duplicates and then highlight them in a dataframe, just like Excel (worksheets, in that case) does?
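I want to highlight some values in a dataframe. I need to compare two columns from different dataframes (df1 and df2), then highlight the duplicated values and show them in the first dataframe, df1.
To give you an idea, in Excel you can achieve this with the COUNTIF formula; here is a video:
https://www.youtube.com/watch?v=VhECzNIQTIY
Is there any way to do this with pandas, or with Python in general?
Thanks!
Update.
The code is as follows: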
import pandas as pd
# Loading raw data from a CSV file
DataOrigin = pd.read_csv('RAWDATA.csv')
#Sorting raw data per interesting columns
DataOriginSorted = DataOrigin.sort_values(['srcip','attack','dstip'])
# Loading some columns of historical data and sorting them
Historicaldata2 = pd.read_excel('Historicaldata.xlsx', sheet_name=1, usecols = ['Source_IP','Ticket','Customer_Notification','Hostname','Service_desk_ticket'])
Historicaldata2Sorted = Historicaldata2.sort_values(['Source_IP','Ticket'])
# Creating a MultiIndex from the sorted raw data
index = pd.MultiIndex.from_frame(DataOriginSorted)
Sorted_DataOrigin = pd.DataFrame(index=index)
# Counting events per source IP and exporting them as a CSV for the code to work (rename column operation)
Daily_IncidentsIPS = pd.crosstab(DataOrigin.srcip,DataOrigin.attack).to_csv('ControlFile1.csv')
Daily_IncidentsIPS = pd.read_csv('ControlFile1.csv').rename(columns = {'srcip': 'Source_IP'}, inplace = False )
# Merging the two dataframes to find coincident data, exporting the result to a CSV for the next operations, and keeping only the interesting columns
Historical2vsSortedOrigin = Historicaldata2Sorted.merge(Daily_IncidentsIPS,left_on='Source_IP',right_on='Source_IP', how='inner').to_csv('ControlFile2.csv')
Historical2vsSortedOrigin = pd.read_csv('ControlFile2.csv', usecols = ['Ticket','Hostname','Source_IP','Customer_Notification','Service_desk_ticket'])
#Searching for duplicated data between two interesting dataframes
duplicated = Daily_IncidentsIPS['Source_IP'].isin(Historical2vsSortedOrigin['Source_IP'])
#Creating a rule to color the rows where the duplicated values are present
def row_styler(row):
    return ['background-color: yellow' if duplicated[row.name] else ''] * len(row)
# Creating a MultiIndex to show the data as I want it
index2 = pd.MultiIndex.from_frame(Historical2vsSortedOrigin)
IncidentMatching = pd.DataFrame(index=index2)
#Saving 3 interesting dataframes in an excel file, highlighting the results of previous "search for duplicated" operation
# Raw string so that the backslashes in the Windows path are not treated as escape sequences
writer = pd.ExcelWriter(r'C:\Users\myuser\Documents\Spyder\Results_IPS.xlsx', engine='xlsxwriter')
Daily_IncidentsIPS.style.apply(row_styler, axis=1).to_excel(writer, sheet_name='Sheet1')
Sorted_DataOrigin.to_excel(writer, sheet_name='Sheet2')
IncidentMatching.to_excel(writer, sheet_name='Sheet3')
# close() writes the file; ExcelWriter.save() was removed in pandas 2.0
writer.close()
You can use the styling system in pandas:
import pandas as pd

# Some mock data
df1 = pd.DataFrame({
    'Name': ['David', 'Sue', 'Mary'],
    'Location': ['San Francisco', 'New York', 'Boston']
})
df2 = pd.DataFrame({
    'Name': ['Sue', 'Mary', 'Joe', 'Jack']
})
# Now determine which names are duplicated across the frames
duplicated = df1['Name'].isin(df2['Name'])
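Before styling, it may help to look at the mask itself, since both styler functions in this answer consume it. The sketch below rebuilds the same mock frames and also computes a per-row match count, which is a rough analogue of Excel's COUNTIF. The name `match_count` is an assumption introduced here for illustration; it is not part of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Name': ['David', 'Sue', 'Mary'],
    'Location': ['San Francisco', 'New York', 'Boston']
})
df2 = pd.DataFrame({
    'Name': ['Sue', 'Mary', 'Joe', 'Jack']
})

# Boolean mask aligned with df1's index: True where the name also occurs in df2
duplicated = df1['Name'].isin(df2['Name'])
print(duplicated.tolist())  # [False, True, True]

# COUNTIF-like count: how many times each df1 name occurs in df2
# (match_count is a hypothetical name used only in this sketch)
match_count = (
    df1['Name']
    .map(df2['Name'].value_counts())  # names missing from df2 become NaN
    .fillna(0)
    .astype(int)
)
print(match_count.tolist())  # [0, 1, 1]
```

A count greater than zero corresponds to a `True` entry in the mask, so either one could drive the highlighting.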
The next step depends on what you want to highlight. If you only want to highlight the duplicated values (like Excel does):
def col_styler(col):
    # Leave every column other than 'Name' unstyled
    if col.name != 'Name':
        return [''] * len(col)
    return duplicated.map({
        True: 'background-color: yellow',
        False: ''
    })
df1.style.apply(col_styler)
Output: the Name cells for Sue and Mary are highlighted in yellow.
If you want to highlight the entire row:
def row_styler(row):
    # row.name is the row's index label, used to look up the mask
    return ['background-color: yellow' if duplicated[row.name] else ''] * len(row)
df1.style.apply(row_styler, axis=1)
Output: the entire rows for Sue and Mary are highlighted in yellow.
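Both stylers above read the module-level `duplicated` Series and look rows up by index label, so they assume df1 still has the index it had when the mask was built. A minimal sketch of a self-contained variant follows. `highlight_matches` and its parameters are hypothetical names introduced here; they are not part of pandas or of the original answer. The function returns a DataFrame of CSS strings with the same shape as its input, which is the form `Styler.apply` expects when called with `axis=None`.

```python
import pandas as pd

def highlight_matches(df, column, values, whole_row=False,
                      css='background-color: yellow'):
    """Return a DataFrame of CSS strings with the same shape as df.

    Hypothetical helper: marks the cells in `column` (or entire rows when
    whole_row=True) whose value in `column` also appears in `values`.
    """
    mask = df[column].isin(values)
    # Start with no styling anywhere
    styles = pd.DataFrame('', index=df.index, columns=df.columns)
    target_cols = list(df.columns) if whole_row else [column]
    styles.loc[mask, target_cols] = css
    return styles

df1 = pd.DataFrame({
    'Name': ['David', 'Sue', 'Mary'],
    'Location': ['San Francisco', 'New York', 'Boston']
})
df2 = pd.DataFrame({
    'Name': ['Sue', 'Mary', 'Joe', 'Jack']
})

cell_styles = highlight_matches(df1, 'Name', df2['Name'])
row_styles = highlight_matches(df1, 'Name', df2['Name'], whole_row=True)
print(cell_styles)
print(row_styles)
```

Assuming Jinja2 is installed (the pandas Styler depends on it), this could be applied as `df1.style.apply(highlight_matches, axis=None, column='Name', values=df2['Name'])`, and the resulting Styler can then be written with `to_excel`, as in the question's code.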