Python's regex findall does not return all matches of Unicode text
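I have a Unicode text containing a list of journals, with some details for each journal. I only want to retrieve the journal names.
My text is large and looks like this: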
6) 6. ACROSS LANGUAGES AND CULTURES Semiannual ISSN: 1585-1923
AKADEMIAI KIADO ZRT, BUDAFOKI UT 187-189-A-3, BUDAPEST, HUNGARY,
H-1117 Social Sciences Citation Index Arts & Humanities Citation Index
7) 7. ACTA ANALYTICA-INTERNATIONAL PERIODICAL FOR PHILOSOPHY IN THE
ANALYTICAL TR ADITION Quarterly ISSN: 0353-5150 SPRINGER, 233 SPRING
ST, NEW YORK, USA, NY, 10013 Arts & Humanities Citation Index 8) 8.
ACTA ARCHAEOLOGICA Annual ISSN: 0065-101X WILEY, 111 RIVER ST,
HOBOKEN, USA, NJ, 07030-5774 Arts & Humanities Citation Index 9) 9.
ACTA BOREALIA Semiannual ISSN: 0800-3831 ROUTLEDGE JOURNALS, TAYLOR &
FRANCIS LTD, 2-4 PARK SQUARE, MILTON PARK, ABINGDON, ENGLAND, OXON,
OX14 4RN Arts & Humanities Citation Index 10) 10. ACTA CLASSICA Annual
ISSN: 0065-1141 UNIV FREE STATE, DEPT ENG CLASSICAL LANG, PO BOX 339,
BLOEMFONTEIN, SOUTH AFRICA, 9300 Arts & Humanities Citation Index 11)
11. ACTA HISTORICA TALLINNENSIA Annual ISSN: 1406-2925 ESTONIAN ACADEMY PUBLISHERS, 6 KOHTU, TALLINN, ESTONIA, 10130 Arts & Humanities
Citation Index 12) 12. ACTA HISTRIAE Tri-annual ISSN: 1318-0185
4 تاریخ انتشار: 89/2/62 پژوهشگاه و شبکه آزمایشگاهی 98/3 :Code
UNIV PRIMORSKA, SCI RES CENTRE KOPER, GARIBALDIJEVA 1, KOPER,
SLOVENIA, CAPODISTRIA, SI-6000 Social Sciences Citation Index Arts &
Humanities Citation Index 13) 13. ACTA KOREANA Semiannual ISSN:
1520-7412 ACADEMIA KOREANA KEIMYUNG UNIV, 1095 DALGUBEOLDAERO,
DALSEO-GU, DAEGU, SOUTH KOREA, 704-701 Arts & Humanities Citation
Index Current Contents - Arts & Humanities 14) 14. ACTA LINGUISTICA
HUNGARICA Quarterly ISSN: 1216-8076 AKADEMIAI KIADO ZRT, BUDAFOKI UT
187-189-A-3, BUDAPEST, HUNGARY, H-1117 Social Sciences Citation Index
Arts & Humanities Citation Index 15)15. ACTA LITERARIA Semiannual
ISSN: 0717-6848 UNIV CONCEPCION, FAC HUMANIDADES ARTE, CASILLA 160-C,
CORREO 3, CONCEPCION, CHILE, 00000 Arts & Humanities Citation Index
16) 16. ACTA MUSICOLOGICA Semiannual ISSN: 0001-6241 INT MUSICOLOGICAL
SOC, BOX 561, BASEL, SWITZERLAND, CH-4001 Arts & Humanities Citation
Index Current Contents - Arts & Humanities 17) 17. ACTA ORIENTALIA
ACADEMIAE SCIENTIARUM HUNGARICAE Quarterly ISSN: 1588-2667 AKADEMIAI
KIADO ZRT, BUDAFOKI UT 187-189-A-3, BUDAPEST, HUNGARY, H-1117 Arts &
Humanities Citation Index 5 تاریخ انتشار: 89/2/62 پژوهشگاه و
شبکه آزمایشگاهی 98/3 :Code Current Contents - Arts & Humanities 18)
18. ACTA PHILOSOPHICA Semiannual ISSN: 1121-2179 FABRIZIO SERRA EDITORE, PO BOX NO,1, SUCC NO. 8, PISA, ITALY, I-56123 Arts &
Humanities Citation Index Current Contents - Arts & Humanities
I want the matches to return:
ACROSS LANGUAGES AND CULTURES Semiannual
ACTA ANALYTICA-INTERNATIONAL PERIODICAL FOR PHILOSOPHY IN THE
ANALYTICAL TR ADITION Quarterly
ACTA ARCHAEOLOGICA Annual
etc.
I have tried this (https://regex101.com/r/eyafNd/1), and on the regex101 website it seems to work.
regex = r"^(\d+\)\s*\d+\.\s+)(.*?) ISSN"
l = re.findall(regex,txt,re.IGNORECASE)
print(len(l))
print(l)
The result is a list with only 1 match, as shown below:
[('6) 6. ', 'ACROSS LANGUAGES AND CULTURES Semiannual')]
Any help would be appreciated.
CS
Maybe take a look at this regex:
(?<=\d\.\s).+?(?=\sISSN)
import re

# txt is the journal list text from the question
regex = r"(?<=\d\.\s).+?(?=\sISSN)"
l = re.findall(regex, txt, re.I)
print(len(l))
print(l)
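This starts matching right after digit + dot + whitespace and stops right before whitespace + ISSN. I can confirm that when I run your code with this pattern on your text, I get the following output list: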
['ACROSS LANGUAGES AND CULTURES Semiannual', 'ACTA ANALYTICA-INTERNATIONAL PERIODICAL FOR PHILOSOPHY IN THE ANALYTICAL TR ADITION Quarterly', 'ACTA ARCHAEOLOGICA Annual'...]
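A likely reason the original pattern finds only one entry is the leading `^`: without `re.MULTILINE` it matches only at the very start of the string, and most entries in this text do not begin a line anyway. The sketch below contrasts the two patterns on a shortened stand-in for the question's text. The sample `txt`, the `re.DOTALL` flag, and the whitespace normalization are assumptions added for illustration; they matter only if the real text contains line breaks inside a journal name, as the wrapped sample suggests.

```python
import re

# Shortened stand-in for the question's text, keeping its line breaks.
txt = (
    "6) 6. ACROSS LANGUAGES AND CULTURES Semiannual ISSN: 1585-1923\n"
    "AKADEMIAI KIADO ZRT, BUDAPEST, HUNGARY Arts & Humanities Citation Index\n"
    "7) 7. ACTA ANALYTICA-INTERNATIONAL PERIODICAL FOR PHILOSOPHY IN THE\n"
    "ANALYTICAL TR ADITION Quarterly ISSN: 0353-5150 SPRINGER, 233 SPRING\n"
    "ST, NEW YORK, USA, NY, 10013 Arts & Humanities Citation Index 8) 8.\n"
    "ACTA ARCHAEOLOGICA Annual ISSN: 0065-101X WILEY, 111 RIVER ST\n"
)

# The question's pattern: without re.MULTILINE, ^ matches only at the very
# start of the string, so at most one entry can be found.
anchored = re.findall(r"^(\d+\)\s*\d+\.\s+)(.*?) ISSN", txt, re.IGNORECASE)
print(len(anchored))

# The lookaround pattern from the answer, plus re.DOTALL so that . can cross
# the line break inside a wrapped journal name.
pattern = re.compile(r"(?<=\d\.\s).+?(?=\sISSN)", re.IGNORECASE | re.DOTALL)

# Collapse internal whitespace (including newlines) to single spaces.
titles = [" ".join(m.split()) for m in pattern.findall(txt)]
for title in titles:
    print(title)
```

If the text is already a single line, `re.DOTALL` and the normalization step change nothing, so they are safe to keep either way.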