Python 从 .csv 文件中提取最常见名称的程序

Question

我创建了一个程序，可以生成 5000 个随机名称、ssn、城市、地址和电子邮件，并将它们存储在 fakeprofile.csv 文件中。我正在尝试从文件中提取最常见的名称。我能够让程序在语法上正常工作，但无法提取常用名称。这是代码：

import re
import statistics
file_open = open('fakeprofile.csv').read()
frequent_names = re.findall('[A-Z][a-z]*', file_open)
print(frequent_names)

文件中的示例：

Alicia Walters 419-52-4141 Yorkstad 66616 Schultz Extensions Suite 225
Reynoldsmouth, VA 72465 stevenserin@stein.biz
Nicole Duffy 212-38-9009 West Timothy 51077 Phillips Ports Apt. 314
Hubbardville, IN 06723 kaitlinthomas@bennett-carter.com
Stephanie Lewis 442-20-1279 Jacquelineshire 650 Gutierrez Forge Apt. 839
West Christianbury, TN 13654 ukelley@gmail.com
Michael Harris 108-81-3733 East Toddberg 14387 Douglas Mission Suite 038
Garciaview, WI 58624 kshields@yahoo.com
Aaron Moreno 171-30-7715 Port Taraburgh 56672 Wagner Path
Lake Christopher, VA 37884 lucasscott@nguyen.info
Alicia Zimmerman 286-88-9507 Barberstad 5365 Heath Extensions Apt. 731
South Randyburgh, NJ 79367 daniellewebb@yahoo.com
Brittney Mcmillan 334-44-0321 Lisahaven PSC 3856, Box 2428
APO AE 03215 kevin95@hotmail.com
Amanda Perkins 327-31-6610 Perryville 8750 Hurst Harbor Apt. 929

示例输出：

', 'Lake', 'Brianna', 'P', 'A', 'Michael', 'Smith', 'Harveymouth', 'Patricia', 'Tunnel', 'West', 'William', 'G', 'A', 'Charles', 'Perkins', 'Lake', 'Marie', 'Lisa', 'Overpass', 'Suite', 'Kennedymouth', 'C', 'A', 'Barbara', 'Perez', 'Billyshire', 'Joshua', 'Village', 'Cindymouth', 'W', 'I', 'Curtis', 'Simmons', 'North', 'Mitchellport', 'Gordon', 'Crest', 'Suite', 'Jacksonburgh', 'C', 'O', 'Cameron', 'Berg', 'South', 'Dean', 'Christina', 'Coves', 'Williamton', 'T', 'N', 'Maria', 'Williams', 'North', 'Judith', 'Carson', 'Overpass', 'Apt', 'West', 'Amandastad', 'N', 'M', 'Hannah', 'Dennis', 'Rodriguezmouth', 'P', 'S', 'C', 'Box', 'A', 'P', 'O', 'A', 'E', 'Laura', 'Richardson', 'Lake', 'Kayla', 'Johnson', 'Place', 'Suite', 'Port', 'Jennifermouth', 'N', 'H', 'John', 'Lawson', 'Hintonhaven', 'Thomas', 'Via', 'Mossport', 'N', 'J', 'Jennifer', 'Hill', 'East', 'Phillip', 'P', 'S', 'C', 'Box', 'A', 'P', 'O', 'A', 'E', 'Cody', 'Jackson', 'Lake', 'Jessicamouth', 'Snyder', 'Ways', 'Apt', 'New', 'Stacey', 'M', 'E', 'Ryan', 'Friedman', 'Shahburgh', 'Jerry', 'Pike', 'Suite', 'Toddfort', 'N', 'V', 'Kathleen', 'Fox', 'Ferrellmouth', 'P', 'S', 'C', 'Box', 'A', 'P', 'O', 'A', 'P', 'Michael', 'Thompson', 'Port', 'Jessica', 'Boone', 'Spurs', 'Suite', 'Port', 'Ashleyland', 'C', 'O', 'Christopher', 'Marsh', 'North', 'Catherine', 'Scott', 'Trail', 'Apt', 'Baileyburgh', 'F', 'L', 'Richard', 'Rangel', 'New', 'Anna', 'Ray', 'Drive', 'Apt', 'Nunezland', 'I', 'A', 'Connor', 'Stanton', 'Troyshire', 'Rodgers', 'Hill', 'West', 'Annmouth', 'N', 'H', 'James', 'Medina',

我的问题是无法提取大多数经常找到的名字以及避免使用大写字母。相反，我提取了所有名称（包括不必要的大写字母），上面看到的是提取的所有名称的一小部分样本。我注意到名字总是在输出中的奇数行，我正在尝试捕获那些奇数行中最常见的名字。

fakeprofile.csv 文件由此程序创建：

import csv
import faker
from faker import Faker
fake = Faker()
name = fake.name(); print(name)
ssn = fake.ssn(); print(ssn)
city = fake.city(); print(city)
address = fake.address(); print(address)
email = fake.email(); print(email)
profile = fake.simple_profile()
for i,j in profile.items():
    print('{}: {}'.format(i,j))
print('Name: {}, SSN: {}, City: {}, Address: {}, Email: {}'.format(name,ssn,city,address,email))
with open('fakeprofile.csv', 'w') as f:
    for i in range(0,5001):
        print(f'{fake.name()} {fake.ssn()} {fake.city()} {fake.address()} {fake.email()}', file=f)

Answer 1

我认为如果您使用 pandas 库进行 CSV 操作（收集需求信息），然后应用 python 集合，例如 counter(df ['name'] ) ，或者您能否向我们提供有关 CSV 文件的更多信息。

谢谢

Answer 2

所以您遇到的主要问题是您使用的正则表达式会捕获每个字母。你对奇行第一世界感兴趣

你可以在这些行上做一些事情：

# either use a dict to count or a list to transform as counter.
dico_count = {}
with open('fakeprofile.csv') as file_open:  # use of context manager

    line_number = 1
    for line in file_open: #iterates all the lines

        if line_number % 2 != 0 : # odd line

            spt = line.strip().split()
            dico_count[spt[0]] = dico_count.get(spt[0], 0) + 1

frequent_name_counter = [(k,v) for k,v in sorted(dico_count.items(), key=lambda x: x[1], reverse=True)]

Answer 3

这是否达到了您想要的效果？

import collections, re

# Read in all lines into a list
with open('fakeprofile.csv') as f:
    lines = f.readlines()
# Throw out every other line
lines = [line for i, line in enumerate(lines) if i%2 == 0]
# Keep only first word of each line
names = [line.split()[0] for line in lines]
# Find most common names
n = 3
frequent_names = collections.Counter(names).most_common(n)
# Display most common names
for name, count in frequent_names:
    print(name, count)

要进行计数，它使用 collections.Counter together with its most_common() 方法。

Python 从 .csv 文件中提取最常见名称的程序

Python program that extracts most frequently found names from .csv file

csv

names

extraction

python-3.x