Find the median from a CSV file using Python

I have a CSV file named 'salaries.csv' with the following contents:

City,Job,Salary
Delhi,Doctors,500
Delhi,Lawyers,400
Delhi,Plumbers,100
London,Doctors,800
London,Lawyers,700
London,Plumbers,300
Tokyo,Doctors,900
Tokyo,Lawyers,800
Tokyo,Plumbers,400
Lawyers,Doctors,300
Lawyers,Lawyers,400
Lawyers,Plumbers,500
Hong Kong,Doctors,1800
Hong Kong,Lawyers,1100
Hong Kong,Plumbers,1000
Moscow,Doctors,300
Moscow,Lawyers,200
Moscow,Plumbers,100
Berlin,Doctors,800
Berlin,Plumbers,900
Paris,Doctors,900
Paris,Lawyers,800
Paris,Plumbers,500
Paris,Dog catchers,400

I need to print the median salary for each profession. I tried some code, but it produces an error.

My code is:

from StringIO import StringIO
import sqlite3
import csv
import operator #from operator import itemgetter, attrgetter

data = open('sal.csv', 'r').read()
string = ''.join(data)
f = StringIO(string)
reader = csv.reader(f)
conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('''create table data (City text, Job text, Salary real)''')
conn.commit()
count = 0

for e in reader:
    if count==0:
        print ""
    else:
        e[0]=str(e[0])
        e[1]=str(e[1])
        e[2] = float(e[2])
        c.execute("""insert into data values (?,?,?)""", e)
        count=count+1
        conn.commit()

labels = []
counts = []
count = 0
c.execute('''select count(Salary),Job from data group by Job''')

for row in c:
      for i in row:
            if count==0:
               counts.append(i)
               count=count+1
           else:
                count=0
      labels.append(i)

c.execute('''select Salary,Job from data order by Job''')

count = 1
count1 = 1
temp = 0
pri = 0
lis = []

for row in c:
      lis.append(row)
for cons in counts:
      if cons%2 == 0:
         pri = cons/2
     else:
         pri = (cons+1)/2
     if count1 == 1:
        for li in lis:
              if count == pri:
                  print "Median is ",li
        count = count + 1
        count = 0
        temp = pri+cons
     else:
        for li in lis:
              if count == temp:
                  print "Median is",li
              count = count+1
              count = 0
              temp = temp + pri
       count1 = count1 + 1

However, it fails with this error:

IndentationError('expected an indented block', ('', 28, 2, 'if count==0:\n'))

How do I fix the error?

You can use a defaultdict to collect all the salaries for each profession, and then just take the median.

import csv
from collections import defaultdict
from statistics import median

with open("C:/Users/jimenez/Desktop/a.csv", "r") as f:
    d = defaultdict(list)
    reader = csv.reader(f)
    next(reader)  # skip the "City,Job,Salary" header row
    for row in reader:
        # row[1] is the job, row[2] is the salary
        d[row[1]].append(float(row[2]))

for k, v in d.items():
    # statistics.median averages the two middle values when the count is even
    print("{} median is {}".format(k, median(v)))
    print("{} average is {}".format(k, sum(v) / len(v)))

Output

Doctors median is 800.0
Doctors average is 787.5
Lawyers median is 700.0
Lawyers average is 628.5714285714286
Plumbers median is 450.0
Plumbers average is 475.0
Dog catchers median is 400.0
Dog catchers average is 400.0

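If your problem is computing the median, then rather than inserting everything into an SQL database and scrambling it around, just read all the rows, group the salaries into one list per profession, and take the median from each list. That reduces your hundred-line script to: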

import csv

professions = {}

with open("sal.csv") as data:
    reader = csv.reader(data)
    next(reader)  # skip the "City,Job,Salary" header row
    for city, profession, salary in reader:
        professions.setdefault(profession.strip(), []).append(int(salary.strip()))

for profession, salaries in sorted(professions.items()):
    # Middle element of the sorted list (the upper middle one when the count is even)
    print("{}: {}".format(profession, sorted(salaries)[len(salaries) // 2]))

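(Adjust the index by 1 as needed to get the proper median from the sorted salaries: for an even number of values, the conventional median is the average of the two middle ones.)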
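
The index adjustment mentioned above can be spelled out in a small helper. This is only a sketch: `median_of` is a hypothetical name that does not appear in any of the answers, and it assumes the usual convention of averaging the two middle values when the count is even (the standard library's `statistics.median` follows the same convention).

```python
def median_of(values):
    """Return the median of a non-empty sequence of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2:  # odd count: there is a single middle element
        return ordered[mid]
    # even count: average the two middle elements
    return (ordered[mid - 1] + ordered[mid]) / 2


# Plumbers' salaries from the question's CSV (eight values, so an even count)
plumbers = [100, 300, 400, 500, 1000, 100, 900, 500]
print(median_of(plumbers))  # 450.0
```

Plain indexing with `len(salaries) // 2` would return 500 for this list, which is why the result can differ from what pandas reports.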

If you use pandas (http://pandas.pydata.org), this is easy:

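import pandas as pd
# The file's own header row supplies the column names City, Job, Salary
df = pd.read_csv('test.csv')
# numeric_only=True skips the text column City
df.groupby('Job').median(numeric_only=True)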

#               Salary
# Job                 
# Doctors          800
# Dog catchers     400
# Lawyers          700
# Plumbers         450

If you want the mean instead of the median,

df.groupby('Job').mean(numeric_only=True)

#                   Salary
# Job                     
# Doctors       787.500000
# Dog catchers  400.000000
# Lawyers       628.571429
# Plumbers      475.000000