Get counts of strings in each CSV column using Python
I have a CSV file like this:
Header1,Header2,Header3,Header4
AA,12,ABCS,A1
BDDV,34,ABCS,BB2
ABCS,5666,gf,KK0
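A column may contain only letters/words, only numbers, or a mix of both. I have several such files, and the columns are not necessarily the same in each file. I want the count of each element in every column that contains only letters and no digits.
The output I want is:
Header1- [('AA', 1), ('BDDV', 1), ('ABCS', 1)]
Header3- [('ABCS', 2), ('gf', 1)]
Here, although both columns contain 'ABCS', I want them counted separately for each column.
I can get the counts by hard-coding the column number, like this: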
import csv
import collections

count_number = collections.Counter()
with open('filename.csv') as input_file:
    r = csv.reader(input_file, delimiter=',')
    headers = next(r)
    for row in r:
        # the column index (1) is hard-coded here
        count_number[row[1]] += 1
print(count_number.most_common())
But I am confused about how to do this for every column.
This can be done with one Counter per header:
#!/usr/bin/env python
from collections import Counter, defaultdict
import csv

header_counter = defaultdict(Counter)
with open('filename.csv') as input_file:
    r = csv.reader(input_file, delimiter=',')
    # read headers
    headers = next(r)
    for row in r:
        # zip each row with headers to know where to count
        for header, val in zip(headers, row):
            # count only values that contain no digits
            if not any(map(str.isdigit, val)):
                header_counter[header][val] += 1
for k, v in header_counter.items():
    print(k, v)
Output:
Header1 Counter({'AA': 1, 'BDDV': 1, 'ABCS': 1})
Header3 Counter({'ABCS': 2, 'gf': 1})
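To make the filter in this answer easier to reason about, here is a small sketch that applies the same `not any(map(str.isdigit, val))` test to a handful of cells from the sample file. The helper name `has_no_digits` is hypothetical and only used for illustration.

```python
def has_no_digits(value):
    """Return True when no character of value is a digit (hypothetical helper)."""
    return not any(map(str.isdigit, value))


# Cells taken from the sample CSV in the question.
values = ['AA', '12', 'ABCS', 'A1', 'gf', 'KK0']
kept = [v for v in values if has_no_digits(v)]
print(kept)  # ['AA', 'ABCS', 'gf']
```

Note that this test works per value, not per column, and it only rejects digits: an empty cell or a value such as 'A-B' would still be counted, whereas `str.isalpha` would reject both. Which behaviour is wanted depends on the data.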
Only a partial solution (you still need to filter out the columns that contain digits, in a second iteration over the CSV reader).
import csv
import collections

with open('filename.csv') as input_file:
    r = csv.reader(input_file, delimiter=',')
    headers = next(r)
    # one Counter per column
    count_number = [collections.Counter() for _ in range(len(headers))]
    for row in r:
        for i, val in enumerate(row):
            count_number[i][val] += 1
print([cr.most_common() for cr in count_number])
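One possible way to finish this partial solution is sketched below. Instead of a second iteration, it records in the same pass whether a column has seen any value containing a digit, then drops those columns at the end. The function name `count_letter_only_columns` is hypothetical, the sample data is fed through `io.StringIO` rather than a real file, and the sketch assumes no row has more cells than the header row.

```python
import csv
import io
from collections import Counter


def count_letter_only_columns(lines):
    """Return {header: [(value, count), ...]} for columns without any digits.

    `lines` is any iterable of CSV lines, e.g. an open file object.
    """
    reader = csv.reader(lines, delimiter=',')
    headers = next(reader)
    counters = [Counter() for _ in headers]
    has_digit = [False] * len(headers)
    for row in reader:
        for i, val in enumerate(row):
            counters[i][val] += 1
            # flag the whole column once any value contains a digit
            if any(ch.isdigit() for ch in val):
                has_digit[i] = True
    return {
        header: counter.most_common()
        for header, counter, flagged in zip(headers, counters, has_digit)
        if not flagged
    }


# Sample data from the question, used here instead of 'filename.csv'.
sample = io.StringIO(
    "Header1,Header2,Header3,Header4\n"
    "AA,12,ABCS,A1\n"
    "BDDV,34,ABCS,BB2\n"
    "ABCS,5666,gf,KK0\n"
)
result = count_letter_only_columns(sample)
for header, counts in result.items():
    print(header + '-', counts)
# Header1- [('AA', 1), ('BDDV', 1), ('ABCS', 1)]
# Header3- [('ABCS', 2), ('gf', 1)]
```

Because the function takes any iterable of lines, the same call should work on each of several files by passing the object returned by `open(...)`, so the column layout can differ from file to file.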