Colour code the plot based on the two data frame values

Question

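I want to colour-code a scatter plot based on two data frame columns. Each distinct value of df[1] should get its own colour. Among the points that share a df[1] value, the opacity of that colour should vary with df[2], so that the highest df[2] value in the group is 100% opaque and the lowest is the least opaque of the group.

The code is as follows: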

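import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

def func():
    ...

df = pd.read_csv(PATH + file, sep=",", header=None)

# initial guesses for the fit parameters
b = 2.72
a = 0.00000009

popt, pcov = curve_fit(func, df[2], df[5]/df[4], p0=[a,b])

# one-standard-deviation errors of the fitted parameters
perr = np.sqrt(np.diag(pcov))

# scatter plot responsible for the data points in the figure
plt.scatter(df[1], df[5]/df[4]/df[2])

# line plot responsible for the curve in the figure
plt.plot(df[1], func(df[2], *popt)/df[2], "r")

plt.legend(loc="upper left")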

Here is the sample dataset:

**df[0],df[1],df[2],df[3],df[4],df[5],df[6]**

file_name_1_i1,31,413,36120,10,9,10
file_name_1_i2,31,1240,60488,10,25,27
file_name_1_i3,31,2769,107296,10,47,48
file_name_1_i4,31,8797,307016,10,150,150
file_name_2_i1,34,72,10868,11,9,10
file_name_2_i2,34,6273,250852,11,187,196
file_name_3_i1,36,84,29568,12,9,10
file_name_3_i2,36,969,68892,12,25,26
file_name_3_i3,36,6545,328052,12,150,151
file_name_4_i1,69,116,40712,13,25,26
file_name_4_i2,69,417,80080,13,47,48
file_name_4_i2,69,1313,189656,13,149,150
file_name_4_i4,69,3009,398820,13,195,196
file_name_4_i5,69,22913,2855044,13,3991,4144
file_name_5_i1,85,59,48636,16,47,48
file_name_5_i2,85,163,64888,15,77,77
file_name_5_i3,85,349,108728,16,103,111
file_name_5_i4,85,1063,253180,14,248,248
file_name_5_i5,85,2393,526164,15,687,689
file_name_5_i6,85,17713,3643728,15,5862,5867
file_name_6_i1,104,84,75044,33,137,138
file_name_6_i2,104,455,204792,28,538,598
file_name_6_i3,104,1330,513336,31,2062,2063
file_name_6_i4,104,2925,1072276,28,3233,3236
file_name_6_i5,104,6545,2340416,28,7056,7059
...

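So the x-axis would be df[1], i.e. 31, 31, 31, 31, 34, 34, ..., and the y-axis is df[5], df[4], df[2], i.e. 9, 10, 413. Each distinct value of df[1] needs a new colour. It would be fine for the colour cycle to repeat after, say, 6 unique colours. Within each colour, the opacity needs to vary with the value of df[2] (even though the y-axis is df[5], df[4], df[2]). The highest value gets the darkest version of the colour and the lowest gets the lightest version.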

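And the scatter plot:

The colour-coding solution I want looks roughly like this:

I have about 200 entries in the csv file.

Would it be more advantageous to use NumPy in this case?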

Answer 1

Let me know if this works for you or if I have misunderstood something:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# not needed for you; the column labels below are integers because
# the question reads the file with header=None
# df = pd.read_csv('~/Documents/tmp.csv')

# largest df[2] value within each df[1] group
max_2 = pd.DataFrame(df.groupby(1)[2].max())

no_unique_colors = 3
color_set = [np.random.random(3) for _ in range(no_unique_colors)]
# assign colors to the unique df[1] values in cyclic order
max_2['colors'] = [color_set[i % no_unique_colors] for i in range(max_2.shape[0])]

# build one RGBA list per row: group color plus alpha = df[2] / group maximum of df[2]
colors = [list(max_2.loc[df1, 'colors']) + [float(df[2].iloc[i]) / max_2.loc[df1, 2]] for i, df1 in enumerate(df[1])]
# repeat three times so that the df[2], df[4] and df[5] points of a row share the same opacity
colors = [x for x in colors for _ in range(3)]

plt.scatter(df[1].values.repeat(3), df[[2, 4, 5]].values.reshape(-1), c=colors)
plt.show()
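
The same idea can also be written without the Python-level loop. The following is only a minimal sketch under a few assumptions: the helper name `rgba_per_row`, the constants `SAMPLE` and `BASE_COLORS`, and the six RGB triples are made up for illustration, and only four rows of the question's sample data are used. It assigns one colour per distinct df[1] value, cycling after six colours, and sets alpha to df[2] divided by the group maximum of df[2].

```python
import io

import numpy as np
import pandas as pd

# Four rows copied from the question's sample data (no header row).
SAMPLE = """\
file_name_1_i1,31,413,36120,10,9,10
file_name_1_i4,31,8797,307016,10,150,150
file_name_2_i1,34,72,10868,11,9,10
file_name_2_i2,34,6273,250852,11,187,196
"""

# Six base colours as RGB triples; the values are arbitrary placeholders.
BASE_COLORS = np.array([
    (0.9, 0.1, 0.1),
    (0.1, 0.5, 0.9),
    (0.1, 0.7, 0.2),
    (0.9, 0.6, 0.0),
    (0.6, 0.2, 0.8),
    (0.0, 0.0, 0.0),
])


def rgba_per_row(df, group_col=1, value_col=2, base_colors=BASE_COLORS):
    """Return an (n, 4) RGBA array: colour by group, alpha = value / group maximum."""
    # factorize numbers the distinct group values 0, 1, 2, ... in order of appearance
    codes, _ = pd.factorize(df[group_col])
    # the modulo makes the colour cycle repeat after len(base_colors) groups
    rgb = base_colors[codes % len(base_colors)]
    # transform("max") broadcasts each group's maximum back onto its rows
    group_max = df.groupby(group_col)[value_col].transform("max")
    alpha = (df[value_col] / group_max).to_numpy()
    return np.column_stack([rgb, alpha])


df = pd.read_csv(io.StringIO(SAMPLE), header=None)
colors = rgba_per_row(df)
```

The resulting array has one RGBA row per data row and can be passed as `c=colors` to `plt.scatter`. If three y values are plotted per row as above, it would have to be repeated accordingly, for example with `np.repeat(colors, 3, axis=0)`.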

Answer 2

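Well, what do you know. I understood the task completely differently. I thought the point was to set the alpha level for each df[1] value based on all of its df[2], df[4], and df[5] values. Oh well, since I have already done the work, why not post it?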

from matplotlib import pyplot as plt
import pandas as pd
from itertools import cycle
from matplotlib.colors import to_rgb

# read the data; column numbers will be generated automatically
df = pd.read_csv("data.txt", sep=",", header=None)

# our figure with the ax object
fig, ax = plt.subplots(figsize=(10, 10))
# definition of the colors
sc_color = cycle(["tab:orange", "red", "blue", "black"])

# get groups of the same df[1] value; they will also be sorted at the same time
dfgroups = df.iloc[:, [2, 4, 5]].groupby(by=df[1])

# plot each group with a different colour
for groupkey, groupval in dfgroups:
    # create group dataframe with df[1] value as x and df[2], df[4], and df[5] values as y
    groupval = groupval.melt(var_name="x", value_name="y")
    groupval.x = groupkey

    # get min and max y for the normalization
    y_high = groupval.y.max()
    y_low = groupval.y.min()
    # read out r, g, and b values of the next color in the cycle
    r, g, b = to_rgb(next(sc_color))
    # create a colour array with nonlinear normalized alpha levels
    # between 0.19 (group maximum) and 0.99 (group minimum), so that all data points are visible
    group_color = [(r, g, b, 0.19 + 0.8 * ((y_high - val) / (y_high - y_low))**7) for val in groupval.y]
    # and plot
    ax.scatter(groupval.x, groupval.y, c=group_color)

plt.show()

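Sample output:

There are two main problems here. One is that alpha in a scatter plot does not accept an array (at least in the Matplotlib version used here; releases from 3.4 on accept an array-like alpha). The color argument does, however, hence the detour of reading out the RGB values and building an RGBA array with the alpha levels added.
The other is that your data are spread over a rather wide range. A linear normalization makes variation near the lowest values invisible. There is surely room for optimization; I like this suggestion.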
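
One possible alternative to the power-law normalization in the code above is a logarithmic one. The sketch below is not the linked suggestion and not the code used for the plot; the function names `log_alpha` and `with_alpha`, the alpha range of 0.2 to 1.0, and the plain red RGB triple are assumptions chosen for illustration. It maps the largest value of a group to the most opaque colour, as the question asks, and it avoids a division by zero when a group contains a single value.

```python
import math


def log_alpha(values, lo=0.2, hi=1.0):
    """Map values to alphas in [lo, hi] on a log scale; the largest value gets hi."""
    logs = [math.log(v) for v in values]  # assumes every value is > 0
    low, high = min(logs), max(logs)
    if high == low:
        # single point or all-equal group: nothing to normalize
        return [hi] * len(logs)
    return [lo + (hi - lo) * (x - low) / (high - low) for x in logs]


def with_alpha(rgb, alphas):
    """Attach each alpha to the same (r, g, b) triple, giving RGBA tuples."""
    r, g, b = rgb
    return [(r, g, b, a) for a in alphas]


# df[2] values of the df[1] == 36 group in the question's sample data
group = [84, 969, 6545]
rgba = with_alpha((1.0, 0.0, 0.0), log_alpha(group))
```

With a linear normalization the middle value 969 would end up close to the lowest alpha, whereas on the log scale it lands well inside the range. The list of RGBA tuples can be passed as `c=` to `ax.scatter` in the same way as `group_color` above.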
