Parsing a specific region of a text file, comparing it to a list of strings, then generating a new list composed of the matches

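I am trying to do the following:

  1. Read through a specific section of a text file (the start and end points are known)
  2. While reading those lines, check whether a word matches one of the words in a list I have
  3. If a match is detected, add that specific word to a new list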

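I have been able to read through the text and pull out the other data I need from it, but so far I have not been able to do the above.

I tried to implement an example I found, but I could not get it to read correctly.

I also tried adapting the following: https://www.geeksforgeeks.org/python-finding-strings-with-given-substring-in-list/ but I had no success with that either.

Here is some of my code: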

import re
from itertools import islice
import os

# list of all countries
oneCountries = "Afghanistan, Albania, Algeria, Andorra, Angola, Antigua & Deps, Argentina, Armenia, Australia, Austria, Azerbaijan, Bahamas, Bahrain, Bangladesh, Barbados, Belarus, Belgium, Belize, Benin, Bhutan, Bolivia, Bosnia Herzegovina, Botswana, Brazil, Brunei, Bulgaria, Burkina, Burma, Burundi, Cambodia, Cameroon, Canada, Cape Verde, Central African Rep, Chad, Chile, China, Republic of China, Colombia, Comoros, Democratic Republic of the Congo, Republic of the Congo, Costa Rica,, Croatia, Cuba, Cyprus, Czech Republic, Danzig, Denmark, Djibouti, Dominica, Dominican Republic, East Timor, Ecuador, Egypt, El Salvador, Equatorial Guinea, Eritrea, Estonia, Ethiopia, Fiji, Finland, France, Gabon, Gaza Strip, The Gambia, Georgia, Germany, Ghana, Greece, Grenada, Guatemala, Guinea, Guinea-Bissau, Guyana, Haiti, Holy Roman Empire, Honduras, Hungary, Iceland, India, Indonesia, Iran, Iraq, Republic of Ireland, Israel, Italy, Ivory Coast, Jamaica, Japan, Jonathanland, Jordan, Kazakhstan, Kenya, Kiribati, North Korea, South Korea, Kosovo, Kuwait, Kyrgyzstan, Laos, Latvia, Lebanon, Lesotho, Liberia, Libya, Liechtenstein, Lithuania, Luxembourg, Macedonia, Madagascar, Malawi, Malaysia, Maldives, Mali, Malta, Marshall Islands, Mauritania, Mauritius, Mexico, Micronesia, Moldova, Monaco, Mongolia, Montenegro, Morocco, Mount Athos, Mozambique, Namibia, Nauru, Nepal, Newfoundland, Netherlands, New Zealand, Nicaragua, Niger, Nigeria, Norway, Oman, Ottoman Empire, Pakistan, Palau, Panama, Papua New Guinea, Paraguay, Peru, Philippines, Poland, Portugal, Prussia, Qatar, Romania, Rome, Russian Federation, Rwanda, St Kitts & Nevis, St Lucia, Saint Vincent & the Grenadines, Samoa, San Marino, Sao Tome & Principe, Saudi Arabia, Senegal, Serbia, Seychelles, Sierra Leone, Singapore, Slovakia, Slovenia, Solomon Islands, Somalia, South Africa, Spain, Sri Lanka, Sudan, Suriname, Swaziland, Sweden, Switzerland, Syria, Tajikistan, Tanzania, Thailand, Togo, Tonga, Trinidad & Tobago, Tunisia, Turkey, Turkmenistan, Tuvalu, Uganda, Ukraine, United Arab Emirates, United Kingdom, United States, Uruguay, Uzbekistan, Vanuatu, Vatican City, Venezuela, Vietnam, Yemen, Zambia, Zimbabwe"
countries = oneCountries.split(",")

path = "C:/Users/me/Desktop/read.txt"
thefile = open(path, errors='ignore')

countryParsing = False
for line in thefile:
    line = line.strip()
#    if line.startswith("Submitting Author:"):
#    if re.match(r"Submitting Author:", line):
#        print("blahblah1")
#        countryParsing = True
#        if countryParsing == True:
#            print("blahblah2")
#            
#            res = [x for x in line if re.search(countries, x)]
#            print("blah blah3: " + str(res))
#    elif re.match(r"Running Head:", line):
#        countryParsing = False
#    if countryParsing == True:
#        res = [x for x in line if re.search(countries, x)]
#        print("blah blah4: " + str(res))


#        for x in countries:
#            if x in thefile:
#                print("a country is: " + x)
#        if any(s in line for s in countries):
#            listOfAuthorCountries = listOfAuthorCountries + s + ", "
#    if re.match(f"Submitting Author:, line"):

The commented-out lines are versions of the code that I tried but could not get to work.

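As requested, here is a sample of the text file I am trying to get data from. I modified it to remove sensitive information, but in this particular case the "new list" should end up with a number of "France" entries appended: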

    txt above....
Submitting Author:

    asdf, asdf  (proxy)
    France
    asdfasdf
    blah blah
    asdfasdf

    asdf, Provence-Alpes-Côte d'Azu 13354
    France

    blah blah
    France
    asdf
Running Head:
    ...more text below

Based on the three things you said you want to accomplish and on what I understand from your code (which may not be what you want), I suggest:

# list of all countries
countries = "Afghanistan, Albania, Algeria, Andorra, Angola, Antigua & Deps, Argentina, Armenia, Australia, Austria, Azerbaijan, Bahamas, Bahrain, Bangladesh, Barbados, Belarus, Belgium, Belize, Benin, Bhutan, Bolivia, Bosnia Herzegovina, Botswana, Brazil, Brunei, Bulgaria, Burkina, Burma, Burundi, Cambodia, Cameroon, Canada, Cape Verde, Central African Rep, Chad, Chile, China, Republic of China, Colombia, Comoros, Democratic Republic of the Congo, Republic of the Congo, Costa Rica, Croatia, Cuba, Cyprus, Czech Republic, Danzig, Denmark, Djibouti, Dominica, Dominican Republic, East Timor, Ecuador, Egypt, El Salvador, Equatorial Guinea, Eritrea, Estonia, Ethiopia, Fiji, Finland, France, Gabon, Gaza Strip, The Gambia, Georgia, Germany, Ghana, Greece, Grenada, Guatemala, Guinea, Guinea-Bissau, Guyana, Haiti, Holy Roman Empire, Honduras, Hungary, Iceland, India, Indonesia, Iran, Iraq, Republic of Ireland, Israel, Italy, Ivory Coast, Jamaica, Japan, Jonathanland, Jordan, Kazakhstan, Kenya, Kiribati, North Korea, South Korea, Kosovo, Kuwait, Kyrgyzstan, Laos, Latvia, Lebanon, Lesotho, Liberia, Libya, Liechtenstein, Lithuania, Luxembourg, Macedonia, Madagascar, Malawi, Malaysia, Maldives, Mali, Malta, Marshall Islands, Mauritania, Mauritius, Mexico, Micronesia, Moldova, Monaco, Mongolia, Montenegro, Morocco, Mount Athos, Mozambique, Namibia, Nauru, Nepal, Newfoundland, Netherlands, New Zealand, Nicaragua, Niger, Nigeria, Norway, Oman, Ottoman Empire, Pakistan, Palau, Panama, Papua New Guinea, Paraguay, Peru, Philippines, Poland, Portugal, Prussia, Qatar, Romania, Rome, Russian Federation, Rwanda, St Kitts & Nevis, St Lucia, Saint Vincent & the Grenadines, Samoa, San Marino, Sao Tome & Principe, Saudi Arabia, Senegal, Serbia, Seychelles, Sierra Leone, Singapore, Slovakia, Slovenia, Solomon Islands, Somalia, South Africa, Spain, Sri Lanka, Sudan, Suriname, Swaziland, Sweden, Switzerland, Syria, Tajikistan, Tanzania, Thailand, Togo, Tonga, Trinidad & Tobago, Tunisia, Turkey, Turkmenistan, Tuvalu, Uganda, Ukraine, United Arab Emirates, United Kingdom, United States, Uruguay, Uzbekistan, Vanuatu, Vatican City, Venezuela, Vietnam, Yemen, Zambia, Zimbabwe"
# split on commas, then strip the whitespace around each name
countries = [c.strip() for c in countries.split(",")]

filename = "read.txt"
my_other_list = []
toParse = False
# the with-block closes the file once the loop is done
with open(filename, errors='ignore') as filehandle:
    for line in filehandle:
        line = line.strip()
        if line.startswith("Submitting Author:"):
            # start of the region to scan
            toParse = True
        elif line.startswith("Running Head:"):
            # end of the region to scan
            toParse = False
        elif toParse:
            # substring test: append every country name found in this line
            for c in countries:
                if c in line:
                    my_other_list.append(c)
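
One thing to be aware of: `c in line` is a substring test, so a line reading "Nigeria" would also append "Niger", and "Romania" would also append "Oman". If each country sits on its own line, as it does in the sample, comparing the whole stripped line against a set avoids that. Below is a minimal sketch under that assumption; the function name `countries_in_region`, its parameters, and the shortened sample and country list are hypothetical and only for illustration.

```python
def countries_in_region(lines, countries,
                        start="Submitting Author:", end="Running Head:"):
    """Return lines between the start and end markers that exactly equal a known country."""
    known = set(countries)  # set membership is an exact, whole-string comparison
    matches = []
    parsing = False
    for raw in lines:
        line = raw.strip()
        if line.startswith(start):
            parsing = True
        elif line.startswith(end):
            parsing = False
        elif parsing and line in known:
            matches.append(line)
    return matches


# Shortened, made-up sample in the same shape as the file in the question.
sample = """txt above....
Submitting Author:

    asdf, asdf  (proxy)
    France
    Nigeria
Running Head:
    France
"""

print(countries_in_region(sample.splitlines(), ["France", "Niger", "Nigeria"]))
# ['France', 'Nigeria']
```

The same function should work on a real file by passing the open file object as `lines`, since iterating a file yields its lines.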

Edit summary

  1. Adjusted the code to handle the provided text sample.

  2. Fixed the country list (there were originally two commas after Costa Rica).
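
The double comma matters more than it might look: splitting on it produces an empty string, and an empty string is a substring of every string, so a check like `c in line` would then report a match on every line. A small illustration, using a shortened stand-in for the full list:

```python
raw = "Costa Rica,, Croatia"

entries = [c.strip() for c in raw.split(",")]
print(entries)             # ['Costa Rica', '', 'Croatia']

# The empty string is "in" any string, so it would match every line.
print("" in "blah blah")   # True

# Dropping empty entries guards against stray commas in the list.
cleaned = [c for c in entries if c]
print(cleaned)             # ['Costa Rica', 'Croatia']
```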

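I think your main problem is that in oneCountries the country names are separated by a comma plus a space, but you are splitting on the comma only. For example, the second entry of countries is " Albania", with a leading space. You need to change: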

oneCountries.split(",")

to:

oneCountries.split(", ")

After that, it looks like there is enough useful material in your commented-out code to achieve what you want.
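
To see the difference, here is a quick comparison on a shortened list (the three-name string is just a stand-in for the full one). Note that splitting on ", " relies on every separator being exactly a comma followed by one space; stripping each entry after a plain comma split is a slightly more forgiving alternative.

```python
one_countries = "Afghanistan, Albania, Algeria"

by_comma = one_countries.split(",")
by_comma_space = one_countries.split(", ")
stripped = [c.strip() for c in one_countries.split(",")]

print(by_comma)        # ['Afghanistan', ' Albania', ' Algeria']
print(by_comma_space)  # ['Afghanistan', 'Albania', 'Algeria']
print(stripped)        # ['Afghanistan', 'Albania', 'Algeria']

# Because each line of the file is stripped, the entry with a leading
# space is never found in it.
line = "    Albania\n".strip()
print(by_comma[1] in line)        # False
print(by_comma_space[1] in line)  # True
```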