How to plot several cumulative distribution functions from data in a CSV file in Python?

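I am trying to write a Python script that reads a CSV file in which the first row holds the sample names and each sample's data sits in the column below its name, for example: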

sample1,sample2,sample3
343.323,234.123,312.544

I am trying to plot the cumulative distribution function of every sample in the dataset on the same axes, using the following code:

import matplotlib.pyplot as plt
import numpy as np
import csv


def isfloat(value):
    '''make sure sample values are floats
    (problem with different number of values per sample)'''
    try:
      float(value)
      return True
    except ValueError:
      return False

def createCDFs (dataset):
    '''create a dictionary with sample name as key and data for each
    sample as one list per key'''
    dataset = dataset
    num_headers = len(list(dataset))
    dict_CDF = {}
    for a in dataset.keys():
        dict_CDF["{}".format(a)]= 1. * np.arange(len(dataset[a])) / (len(dataset[a]) - 1)
    return dict_CDF

def getdata ():
    '''retrieve data from a CSV file - file must have sample names in first row
    and data below'''

    with open('file.csv') as csvfile:
        reader = csv.DictReader(csvfile, delimiter = ',' )
        #create a dict that has sample names as key and associated ages as lists
        dataset = {}
        for row in reader:
            for column, value in row.iteritems():
                if isfloat(value):
                    dataset.setdefault(column, []).append(value)
                else:
                    break
        return dataset

x = getdata()
y = createCDFs(x)

#plot data
for i in x.keys():
    ax1 = plt.subplot(1,1,1)
    ax1.plot(x[i],y[i],label=str(i))


plt.legend(loc='upper left')
plt.show()

This gives the output below, which displays only one sample correctly (Sample1 in Figure 1A).

Figure 1A. Only one CDF is displaying correctly (Sample1). B. Expected output

Each sample has a different number of values, and I think that is where my problem lies.

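This has been bugging me because I think the solution should be fairly simple. Any help or suggestions would be appreciated. I just want to know how to display the data correctly. The data can be found here. The expected output is shown in Figure 1B.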

Here is a simpler approach. It depends, of course, on whether you are willing to use pandas. I used this method to compute the cumulative distribution.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# read the CSV; the first row supplies the column (sample) names
data_req = pd.read_csv("yourfilepath")

# plot with matplotlib
# note that you have to drop the NaNs in each column (shorter samples are
# padded with NaN) so that x and y have matching lengths per sample
for col in data_req.columns:
    # sort each column on its own; a sort done inside DataFrame.apply can be
    # re-aligned to the original row index when the frame is rebuilt
    values = data_req[col].dropna().sort_values()
    y = np.linspace(0., 1., len(values))
    plt.plot(values, y, label=col)

plt.legend(loc='upper left')
plt.show()

This finally gives the plot you are after.
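If you would rather stay with the csv module from the question, the same idea can be sketched without pandas. This is only a minimal sketch under a few assumptions: the helper names read_columns and ecdf are made up for illustration, the inline demo text stands in for file.csv, and the CDF heights use the i/n convention rather than np.linspace(0, 1, n) (both are common). The points that matter are converting each cell to float, skipping empty cells instead of breaking out of the row, and sorting each sample before pairing it with its CDF heights.

```python
import csv
import io

import numpy as np


def read_columns(csvfile):
    """Return {sample name: sorted float array} from an open CSV file."""
    columns = {}
    for row in csv.DictReader(csvfile):
        for name, value in row.items():
            try:
                number = float(value)
            except (TypeError, ValueError):
                # Empty or missing cell: this sample is shorter than the
                # others, so skip the cell but keep reading the row.
                continue
            columns.setdefault(name, []).append(number)
    # A CDF needs the x values in ascending order.
    return {name: np.sort(values) for name, values in columns.items()}


def ecdf(sorted_values):
    """Empirical CDF heights i/n for i = 1..n, one per sorted value."""
    n = len(sorted_values)
    return np.arange(1, n + 1) / n


# Hypothetical inline data standing in for 'file.csv'; sample2 is shorter.
demo = io.StringIO("sample1,sample2\n3.0,10.0\n1.0,5.0\n2.0,\n")
data = read_columns(demo)
cdfs = {name: ecdf(values) for name, values in data.items()}
```

To draw the curves, loop over data and call plt.plot(data[name], cdfs[name], label=name) for each sample, then plt.legend() and plt.show(). Because every sample gets its own x and y arrays of equal length, samples with different numbers of values no longer interfere with each other.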