How to browse a csv in Python by intervals of the first letter of each line

I have a csv containing a large amount of data. When I start the web scraping, I get:

TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

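To limit the amount of data the web scraping has to process, I would like to split the script below into several scripts, each one browsing a single interval of the csv file: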

import csv

from Bio import Entrez

# Placeholder address: NCBI asks Entrez users to identify themselves.
Entrez.email = "your.name@example.org"

# Get the data from the csv containing the pmid list by author:
with open("D:/Nancy/Pèse-Savants/Excercice Covid-19/Exercice 3/pmid_par_auteur.csv", 'r', encoding='utf-8') as f:
    # Separate the authors' list from the pmids' list into 2 columns:
    with open("pmid_par_auteur_uniformise.csv", "w", encoding='utf-8') as fu:
        csv_f = csv.reader(f, delimiter=';')
        for ligne in csv_f:
            fu.write(ligne[0] + '\n')

auteur_pmid_doi = []

# Clean up the data encoded in 'utf-8'
with open("pmid_par_auteur_uniformise.csv", encoding='utf-8') as fu:
    csv_fu = csv.reader(fu)

    # The loop must stay inside the `with` block: the reader is lazy and
    # the file is closed as soon as the block ends.
    for ligne in csv_fu:
        ligne[1] = ligne[1].replace("'", " ")
        ligne[1] = ligne[1].replace("[", " ")
        ligne[1] = ligne[1].replace("]", " ")
        ligne[1] = [pmid.strip() for pmid in ligne[1].split(" , ")]

        # Get the DOI of each pmid for each author who wrote on Covid-19
        pmid_doi = []

        for pmid in ligne[1]:
            handle = None
            try:
                handle = Entrez.esummary(db="pubmed", id=pmid)
                record = Entrez.read(handle)
                record = record[0]['DOI']
            except (IndexError, KeyError):
                print('Missing DOI')
            else:
                pmid_doi.append([pmid, record])
            finally:
                # Handles are a finite resource, so close each one to avoid
                # exhausting the supply with a large dataset.
                if handle is not None:
                    handle.close()

        # The temporaries are rebound on every iteration, so no explicit `del` is needed.
        auteur_pmid_doi.append([ligne[0], pmid_doi])

auteur_pmid_doi

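Each script would run over one such interval of the data:

How can I browse the rows of the csv by intervals of this kind?

I have added a link to my csv. Thanks in advance for your help.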

pmid_par_auteur_uniformise.csv

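Assuming the file is sorted, the following will:

  • Group the rows by their starting letter.
  • Collect the rows of five groups at a time.
  • Call process() with those rows.
  • The example process() will:
    • Convert each pmids string into a list of strings.
    • Count the rows.
    • Print the first and last rows of the group.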
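
As a small illustration of the grouping and batching steps, here is a sketch that runs on a handful of in-memory rows instead of the real file. The sample rows and the helper name batches_by_first_letter are made up for this example. Note that itertools.groupby only merges adjacent rows that share a key, which is why the file has to be sorted.

```python
import ast
import itertools

def batches_by_first_letter(rows, letters_per_batch=5):
    """Yield (letters, rows) for every run of `letters_per_batch` starting letters."""
    letters, batch = '', []
    for key, group in itertools.groupby(rows, key=lambda row: row[0][0]):
        letters += key          # key is the single first character of the name
        batch.extend(group)
        if len(letters) == letters_per_batch:
            yield letters, batch
            letters, batch = '', []
    if batch:                   # leftover letters that did not fill a batch
        yield letters, batch

# Hypothetical rows in the same shape as the csv: a name, then a stringified list.
sample = [
    ['Abel', "['1']"],
    ['Adams', "['2', '3']"],
    ['Baker', "['4']"],
    ['Clark', "['5']"],
]

result = list(batches_by_first_letter(sample, letters_per_batch=2))

# ast.literal_eval turns the stringified list back into a real list of strings.
pmids = ast.literal_eval(sample[1][1])
```

With two letters per batch, the sample yields one batch for A and B together and a final, shorter batch for C.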

Code:

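import csv
import itertools
import ast

def process(rows):
    # Turn each pmids string, e.g. "['32397911']", into a real list of strings.
    rows = [[name,ast.literal_eval(pmids)] for name,pmids in rows]
    print(f' -> {len(rows)} row(s)')
    print(f'    first row: {rows[0]}')
    print(f'    last row:  {rows[-1]}')

# utf-8-sig drops a leading byte order mark if the file has one;
# newline='' is what the csv module expects for files it reads.
with open('pmid_par_auteur_uniformise.csv',encoding='utf-8-sig',newline='') as f:
    r = csv.reader(f)
    rows = []
    # groupby only merges adjacent rows, hence the requirement that the file is sorted.
    for i,(key,group) in enumerate(itertools.groupby(r,key=lambda x: x[0][0])):
        rows.extend(list(group))
        print(key,end='')
        if i % 5 == 4:  # every fifth starting letter
            process(rows)
            rows = []
    if rows:  # leftover letters that did not fill a group of five
        process(rows)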
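
If you would rather have one script per fixed interval (say F to J), a variation is to filter the rows by their first character instead of counting groups. The sketch below is only one way to do it: the helper name rows_in_interval and the in-memory stand-in for the file are assumptions made for the example, not part of the original data.

```python
import csv
import io

def rows_in_interval(reader, first, last):
    """Yield rows whose name starts with a character from `first` to `last`, inclusive."""
    for row in reader:
        # Skip blank lines and empty names, then compare by code point.
        if row and row[0] and first <= row[0][0] <= last:
            yield row

# Hypothetical stand-in for pmid_par_auteur_uniformise.csv.
fake_file = io.StringIO(
    'Abel,"[\'1\']"\n'
    'Eaton,"[\'2\', \'3\']"\n'
    'Fabre,"[\'4\']"\n'
    'Kane,"[\'5\']"\n'
)

selected = list(rows_in_interval(csv.reader(fake_file), 'F', 'J'))
```

The comparison is by code point, so an interval such as 'A' to 'E' does not include lowercase or accented initials. This version also does not require the file to be sorted.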

Output:

ABCDE -> 3124 row(s)
    first row: ['A Aljabali Alaa A', ['32397911']]
    last row:  ['Eğrilmez Sait', ['32366061', '32328919', '32099313', '31486610', '31245968', '31045617', '30202612', '29697353', '29109898', '28761568', '28405481', '28169512', '27849319', '27758983', '27447356', '27028890', '26887566', '26801653', '26654385', '26558208', '26401170', '26208680', '26159181', '25955827', '25247377', '25111119', '24621171', '24367973', '24322806', '23970986', '23684361', '23205897', '23061415', '21598429', '21453232', '21414053', '21353409', '21332970', '20490802', '20385953', '20164799', '19958116', '26649479', '19237782', '18404071', '18209647', '18039348', '17641704', '15696768', '15305561', '15177606', '15177597', '15050248', '12908533', '12780405', '12752051', '12662987']]
FGHIJ -> 2087 row(s)
    first row: ['Fabbricatore Davide', ['32383763', '31898206', '31564087', '29754460', '29460403', '29252600', '28775965']]
    last row:  ['Jüni Peter', ['32450456', '32396180', '32385067', '32385063', '32333878', '32294317', '32293511', '32241376', '32215640', '32199780', '32139280', '32139222', '32006758', '32006156', '31920002', '31857278', '31857277', '31854112', '31851302', '31845894', '31841136', '31707794', '31696762', '31693078', '31672177', '31648781', '31589276', '31570258', '31537275', '31525083', '31497854', '31488373', '31488372', '31476244', '31462531', '31434508', '31410968', '31397487', '31379378', '31368907', '31329852', '31269364', '31217143', '31204678', '31197439', '31164366', '31132298', '31084961', '31056295', '30975683', '30888959', '30852547', '30846254', '30833323', '30789921', '30703644', '30689825', '30667361', '30601734', '30596995', '30592349', '30566213', '30560696', '30424891', '30356345', '30354650', '30354532', '30347031', '30291678', '30215374', '30182362', '30170848', '30166073', '30165632', '30165437', '30165435', '30153988', '30146969', '30107514', '30044478', '29992264', '29916872', '29912740', '29885826', '29850808', '29794879', '29786535', '29785878', '29742109', '29628287', '29606865', '29487111', '29478826', '29467161', '29277234', '29251754', '29228059', '29205157', '29162610', '29155984', '29130845', '29113968', '29097450', '29045581', '29038228', '29020259', '28967416', '28948934', '28886622', '28850362', '28827257', '28796809', '28790165', '28781251', '28742627', '28732814', '28699595', '28671552', '28611089', '28601820', '28566364', '28536005', '28528767', '28472484', '28430920', '28425755', '28330794', '28329389', '28253938', '28213601', '28213600', '28185702', '28079554', '28067197', '28029055', '28027351', '28017369', '28003290', '27998831', '27923461', '27884241', '27753599', '27733354', '27677503', '27665852', '27578808', '27497359', '27479866', '27478115', '27437661', '27389906', '27372195', '27318845', '27296200', '27289296', '27252878', '27179724', '27125947', '27078262', '27033859', '31997951', '26997557', '26979080', '26916479', 
'26896474', '26823484', '26762519', '26741741', '26700531', '26690319', '26655339', '26649651', '26606735', '26585615', '26490760', '26453687', '26428025', '26408014', '26376691', '26373562', '26352574', '26334160', '26324049', '26210282', '26208006', '26205445', '26196758', '26142466', '26071600', '26043895', '26040806', '26010634', '26007299', '25979551', '25934823', '25910501', '25875821', '25858975', '25794671', '25794517', '25791214', '25634905', '25623431', '25572026', '25551539', '25546177', '25529190', '25524605', '25495124', '25494429', '25489846', '25433627', '25423953', '25416325', '25330508', '25229835', '25208215', '25189359', '25187201', '25184244', '25182248', '25176289', '25173601', '25173535', '25173339', '25169183', '25163691', '25112661', '25042419', '25042271', '25011716', '24958760', '24958153', '24919052', '24882698', '24847017', '24755380', '24738641', '24711124', '24694729', '24682843', '24676282', '24631113', '24602961', '24552862', '24531331', '24429160', '24332419', '24206920', '24132187', '24064474', '24064377', '24039795', '23993323', '23968698', '23946263', '23909727', '23822782', '23793972', '23759706', '23747228', '23723742', '23702009', '23514285', '23487519', '23386662', '23370065', '23339812', '23277909', '23169986', '23152242', '23045205', '23008508', '22995882', '22945832', '22924638', '22922416', '22910755', '22868835', '22846347', '22759453', '22739992', '22726632', '22711083', '22645184', '22625186', '22607867', '22580250', '22456025', '22447805', '22440496', '22362513', '22361598', '22319063', '22302840', '22301368', '22285579', '22238228', '22093210', '22078420', '22075451', '22056618', '22027687', '22008217', '21959221', '21931648', '21930465', '21904996', '21878462', '21851904', '21768536', '21700254', '21646500', '21641358', '21632168', '21596229', '21396782', '21385807', '21362706', '21356042', '26063638', '21330239', '21296599', '21224324', '21205944', '21161860', '21042932', '20884434', '20870808', '20853471', 
'20847017', '20807617', '20639294', '20633818', '26061467', '20562074', '20506333', '20464751', '20461793', '20298923', '20152241', '20152233', '20142179', '20091539', '19950329', '19930626', '19889649', '19821404', '19821403', '19821302', '19821296', '19819375', '19778775', '19736281', '19736154', '19679616', '19620501', '19370423', '19284063', '19204314', '19074491', '19036745', '18804739', '18765162', '18757996', '18534034', '18512273', '18502079', '18414453', '18316340', '18272504', '18050181', '17968921', '17903638', '17869634', '17868802', '17726091', '17707588', '17696267', '17606174', '17606172', '17602184', '17438317', '17321312', '16979535', '16824829', '16717169', '16704569', '16255025', '16125589', '16105989', '15947376', '15911545', '15897534', '15649954', '15641050', '15582059', '15513969', '15485938', '15122753', '15087341', '14960422', '12912727', '12814907', '12654410', '12574052', '12456259', '12435252', '12111917', '12039807', '12038917', '11914306']]
KLMNO -> 3147 row(s)
    first row: ['Kaakinen Markus', ['32320402', '32306232', '32301734', '31895044', '31815340', '31761930', '30465150', '30176935', '29892700', '29623505', '29048938', '27441787', '27094352', '26620915', '26563678']]
    last row:  ['Ozog David M', ['32452977', '32446829', '32442698', '32291807', '32246972', '32224709', '31820478', '31797796', '31743247', '31592926', '31567612', '31449081', '31335426', '31335419', '31268498', '30528311', '30528310', '30322300', '30235390', '29381548', '28975212', '28658462', '28522039', '28522038', '27984327', '27336945', '27051812', '26845540', '26547045', '26504503', '26458039', '25946625', '25738444', '24891062', '24664987', '24196328', '23157724', '23069917', '22269028', '22093099', '22055283', '21931055', '21864935', '21518098', '21457398', '20855672', '20384757', '20227579', '17254036', '16918570']]
PQRST -> 2954 row(s)
    first row: ['Paakkari Leena', ['32438595', '32302535', '31654998', '31410386', '31297559', '30753409', '30578457', '30412226', '29510702', '28673131', '27655781', '26630180', '24609436']]
    last row:  ['Tánczos Krisztián', ['32453702', '29953455', '29264668', '27597981', '27288610', '26543848', '25608924', '24818123', '24457113', '24012232']]
UVWXY -> 1509 row(s)
    first row: ['Ubaldi Filippo M', ['32404170', '32354663', '32038484', '31824427', '31756248', '31403619', '31174207', '30848112', '30739329', '28895679', '28362681', '27827818', '27310263', '26921622', '26728489', '26531067', '25985139', '24884585', '21665542', '16580372']]
    last row:  ['Yılmaz Aydın', ['32299200', '32029697', '31799089', '31615316', '31130132', '30683026', '30582673', '29151304', '29135403', '29052061', '28940588', '28653494', '28621292', '28393719', '27704696', '27481085', '27442525', '27412127', '26885104', '26556895', '26523900', '26482979', '26422878', '26374581', '26281327', '26257957', '26107228', '25619495', '25492957', '25492815', '24976998', '24814084', '24506753', '23949189', '23431310', '23377781', '23030749', '22609980', '22320975', '22087531', '22019748', '21851414', '21334917', '21038140', '20981186', '20954282', '20689268', '20517728', '20332658', '19803278', '17845896', '16650973', '16304288', '16217989', '15143428', '14971870']]
ZdvÁÇ -> 741 row(s)
    first row: ['Zabetakis Ioannis', ['32438620', '32340775', '32270798', '32224958', '31816871', '31540159', '31137500', '30721934', '30669323', '30619728', '30381909', '30319088', '29882848', '29757226', '29494487', '29135918', '28714908', '28119955', '27109548', '24973582', '24735421', '24128590', '24084786', '23957417', '23480708', '23433838', '25212344', '22087726', '26047447', '16390205']]
    last row:  ['Çinier Göksel', ['32462219', '32406873', '32338313', '32250347', '32222434', '32147660', '32147654', '32035356', '31846583', '31764010', '31707766', '31670716', '31582673', '31542896', '31483310', '31339201', '31204510', '31139269', '31038781', '30961362', '30928819', '30815636', '30808220', '30694809', '30230925', '30174880', '30149941', '30075884', '30024391', '30022507', '29894304', '29848928', '29523425', '29487682', '29451310', '29339686', '29191504', '28971172', '28898454', '28864320', '28838153', '28595215', '28595209', '28592959', '28424447', '28401800', '28169085', '27641906', '27608320', '27581673', '27414730', '27341666', '26946973', '26778640', '26295613']]
ÖØÜĆČ -> 12 row(s)
    first row: ['Özdemir Vural', ['32319847', '32316827', '32223589', '32105560', '32027574', '31990612', '31855503', '31794335', '31794294', '31199695', '31094658', '31066623', '30789303', '30707659', '30362880', '30281399', '30260734', '30036157', '30004300', '29791250', '29432059', '29431577', '29293405', '29083982', '29064337', '28655746', '28622116', '28253085', '27726641', '27631187', '27310474', '27211534', '26793622', '26785082', '26684591', '26645377', '26484977', '26430925', '26345196', '26270647', '26161545', '26066837', '26061584', '25970399', '25748435', '25656538', '25353263', '25000304', '24955641', '24795761', '24795460', '24766116', '24730382', '24649998', '24521341', '24456465', '24456464', '26120345', '27447251', '24048056', '27442201', '23765483', '23574338', '23531886', '23301640', '23258262', '23249198', '23249197', '23249193', '23249192', '23194449', '22987569', '22545073', '22523528', '22401659', '22279516', '22279515', '22220951', '22198458', '21848419', '21848418', '21490881', '21476845', '21400375', '21399751', '20977184', '20808949', '20547595', '20526970', '20455752', '19882466', '19470561', '19290811', '19290807', '19214141', '19207026', '19040373', '19040370', '18708948', '18570104', '18480690', '18266561', '18075468', '17963681', '17924827', '17716237', '17559347', '17481382', '17481379', '17474081', '17439540', '17429316', '17286540', '17224711', '17184207', '17093467', '16900136', '16544144', '16433578', '15738749', '14965233', '14582457', '12054061', '11864722']]
    last row:  ['Čivljak Rok', ['32426118', '32118371', '31661701', '30801727', '29164144', '27625226', '26638539', '26538030', '26494527', '26012149', '25801665', '25634680', '25274934', '24392752', '24192278', '23941014', '21294322', '23120878', '18773823', '15936084', '15330131']]
İŚŞŠ -> 10 row(s)
    first row: ['İrkeç Murat', ['32366061', '32287143', '31844978', '31232743', '30605936', '30471351', '29944505', '29554816', '29196953', '29135602', '28598963', '28553957', '28390092', '27874294', '27790127', '27775457', '27467041', '27513901', '27452505', '27055218', '26257227', '26187885', '26035419', '25811727', '25686056', '25642816', '25603441', '25370397', '25264994', '25203662', '25119963', '25069074', '25069002', '24967185', '24803156', '24790882', '24774884', '24767236', '24646901', '24627252', '24401152', '24216677', '24145558', '23806084', '23730902', '23635854', '23564611', '23561604', '23378724', '23377585', '23323578', '23084388', '22885886', '22799438', '22415150', '21575121', '21174000', '20813741', '20595897', '20544681', '20299976', '19878104', '19825835', '19681791', '19618994', '19491966', '19396777', '19158563', '19085375', '18700919', '18675411', '18580271', '18524193', '18471650', '18414106', '18352875', '18216575', '18083592', '17760635', '17522853', '17519661', '17413955', '17300569', '17198020', '17102682', '17068451', '16849641', '16280977', '16075220', '15968156', '15068429', '14746595', '14746583', '14566648', '14533030', '12791215', '12789598', '12789579', '12738551', '12695716', '12648019', '12427231', '12072719', '12065849', '12027104', '11821216', '11820668']]
    last row:  ['Šín Robin', ['32434337', '32304370', '31974532', '31971245']]
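
The groups after Y (ZdvÁÇ, ÖØÜĆČ, İŚŞŠ) show that the file is ordered by code point, so lowercase and accented initials land after Z. If you would prefer to count Á with A, one option is to group on a folded key. This is a sketch under the assumption that stripping combining accents is acceptable for these names; the helper name base_letter is made up. Because groupby only merges adjacent rows, the rows have to be re-sorted with the same key first.

```python
import itertools
import unicodedata

def base_letter(name):
    """Return the uppercase first letter of `name` without combining accents."""
    # NFD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize('NFD', name[0])
    return decomposed[0].upper()

names = [
    'Zabetakis Ioannis',
    'Çinier Göksel',
    'A Aljabali Alaa A',
    'Özdemir Vural',
    'Šín Robin',
]

# Sort by the folded key so that equal keys become adjacent for groupby.
names.sort(key=lambda n: (base_letter(n), n))
groups = {key: list(group) for key, group in itertools.groupby(names, key=base_letter)}
```

Letters that have no canonical decomposition, such as Ø, are left unchanged and still form their own group.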