Parsing a Python list into a pandas.DataFrame with keywords
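I have a list of countries that should become a DataFrame. The problem is that every word of each country name, and every data value, is a separate element of the list. Example: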
[
'Viet',
'Nam',
'0',
'12.3',
'0',
'Brunei',
'Darussalam',
'12',
'1.1',
'0',
'Bosnia',
'and',
'Herzegovina',
'2',
'2.1',
'0',
'Not',
'applicable',
'Turkey',
'4',
'4.3',
'0',
'Only',
'partial',
'coverage'
...
]
How can I convert it to:
[
['Viet Nam', '0', '12.3', '0'],
['Brunei Darussalam', '12', '1.1', ...],
...
]
or a pd.DataFrame:
             country coef1 coef2 grade
0           Viet Nam     0  12.3     0
1  Brunei Darussalam    12   1.1     0
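Note: some country names are a single word, such as China or France, and some are three or more words, such as Republic of Korea. Also, the series of numbers is sometimes followed by a remark.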
Try this, where data_in is the data you want to parse and countries is a list of all the countries in the world:
import re

import pandas as pd

# Abbreviated as in the original; the full list must include every country in data_in.
countries = ["Afghanistan", "Albania", "Algeria", "Andorra", "Angola", "Antigua and Barbuda", "Argentina", "Armenia"]  # ...

data_in = [
    'Viet', 'Nam', '0', '12.3', '0', 'Brunei', 'Darussalam', '12', '1.1', '0', 'Bosnia', 'and', 'Herzegovina', '2', '2.1', '0', 'Not', 'applicable', 'Turkey', '4', '4.3', '0'
]

data_out = []


def is_country(elem):
    """Return True if elem is the first word of a known country name."""
    word = elem.lower()
    return any(country.lower().split()[0] == word for country in countries)


def is_num(elem):
    """Return True if elem contains at least one digit."""
    return re.search(r'\d', elem) is not None


idx = 0
while idx < len(data_in):
    if not is_country(data_in[idx]):
        # Skip remarks and any other stray words.
        idx += 1
        continue

    # Collect the words of the country name up to the first number.
    name_parts = []
    while idx < len(data_in) and not is_num(data_in[idx]):
        name_parts.append(data_in[idx])
        idx += 1

    # Collect the numeric values that follow the name.
    # idx now points at the next unread element, so it is not advanced again.
    elements = []
    while idx < len(data_in) and is_num(data_in[idx]):
        elements.append(data_in[idx])
        idx += 1

    data_out.append([" ".join(name_parts)] + elements)

df = pd.DataFrame(data_out, columns=['country', 'coef1', 'coef2', 'grade'])
print(df)

pandas.DataFrame output (with the complete countries list):
                  country coef1 coef2 grade
0                Viet Nam     0  12.3     0
1       Brunei Darussalam    12   1.1     0
2  Bosnia and Herzegovina     2   2.1     0
3                  Turkey     4   4.3     0
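The values in the output above are still strings, since they come straight from the input list. If numeric columns are needed, a minimal sketch of the conversion with pd.to_numeric could look like this. It assumes the column names country, coef1, coef2, and grade from the question, and the two sample rows are copied from the question's expected result:

```python
import pandas as pd

# Rows as produced by the parsing step; every value is still a string.
rows = [
    ["Viet Nam", "0", "12.3", "0"],
    ["Brunei Darussalam", "12", "1.1", "0"],
]
df = pd.DataFrame(rows, columns=["country", "coef1", "coef2", "grade"])

# Convert the three value columns. errors="coerce" turns any text that
# cannot be parsed as a number into NaN instead of raising.
value_cols = ["coef1", "coef2", "grade"]
df[value_cols] = df[value_cols].apply(pd.to_numeric, errors="coerce")

print(df.dtypes)
```

Whether coercing bad values to NaN is acceptable depends on the data; dropping errors="coerce" makes the conversion fail loudly instead.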
Nonstandard solution, but it works
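As an alternative sketch (not the method used in the answer above), the country names can be matched as whole word sequences, longest first, so that remark words are never mistaken for part of a name. The COUNTRIES list, the NUMBER pattern, and the parse function are hypothetical names introduced for illustration, and the list is shortened to the countries in the example:

```python
import re

import pandas as pd

# Hypothetical, shortened list; a real one must cover every name in the data.
COUNTRIES = ["Viet Nam", "Brunei Darussalam", "Bosnia and Herzegovina", "Turkey", "China"]

# Strict numeric test: the whole token must be an integer or a decimal.
NUMBER = re.compile(r"-?\d+(\.\d+)?")


def parse(tokens, countries=COUNTRIES):
    """Group tokens into [country, value, value, ...] rows."""
    names = {tuple(c.lower().split()): c for c in countries}
    longest = max(len(key) for key in names)
    rows, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first so multi-word names win.
        for n in range(min(longest, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in names:
                break
        else:
            i += 1  # not the start of a known country: a remark or stray word
            continue
        i += n
        # Collect the run of numbers that follows the name.
        values = []
        while i < len(tokens) and NUMBER.fullmatch(tokens[i]):
            values.append(tokens[i])
            i += 1
        rows.append([names[key], *values])
    return rows


data_in = [
    'Viet', 'Nam', '0', '12.3', '0', 'Brunei', 'Darussalam', '12', '1.1', '0',
    'Bosnia', 'and', 'Herzegovina', '2', '2.1', '0', 'Not', 'applicable',
    'Turkey', '4', '4.3', '0', 'Only', 'partial', 'coverage',
]
rows = parse(data_in)
df = pd.DataFrame(rows, columns=['country', 'coef1', 'coef2', 'grade'])
print(df)
```

The trade-off is that a country missing from the list is silently skipped together with its numbers, so the list has to be complete for the data at hand.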