read_table in pandas: how to get input from text into a DataFrame

Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Jacksonville (Jacksonville State University)[2]
Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
Arizona[edit]
Flagstaff (Northern Arizona University)[6]
Tempe (Arizona State University)
Tucson (University of Arizona)

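This is my text. I need to create a DataFrame with one column for the state name and another for the town name. I know how to remove the university names, but how do I tell pandas that every [edit] line starts a new state?

Expected output DataFrame: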

Alabama Auburn
Alabama Florence 
Alabama Jacksonville
Alaska  Fairbanks 
Arizona Flagstaff
Arizona Tempe
Arizona Tucson  

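I am not sure whether I can use read_table, and if I can, how? I did import everything into a DataFrame, but the states and the cities end up in the same column. I also tried a list, but the problem is the same.

I need something that works like this: if a line contains [edit], then it is the state for every line after it, up to the next [edit] line.

Maybe pandas can do it, but you can easily do it yourself in plain Python.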

data = '''Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Jacksonville (Jacksonville State University)[2]
Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
Arizona[edit]
Flagstaff (Northern Arizona University)[6]
Tempe (Arizona State University)
Tucson (University of Arizona)'''

# ---

result = []

state = None

for line in data.split('\n'):

    if line.endswith('[edit]'):
        # remember new state
        state = line[:-6] # without `[edit]`
    else:
        # add state, city to result
        city = line.split(' (', 1)[0]  # text before ' (University ...)', keeps multi-word town names whole
        result.append( [state, city] )

# --- display ---

for state, city in result:
    print(state, city)

If you read from a file, then:

result = []

state = None

with open('your_file') as f:
    for line in f:
        line = line.strip() # remove '\n'

        if line.endswith('[edit]'):
            # remember new state
            state = line[:-6] # without `[edit]`
        else:
            # add state, city to result
            city = line.split(' (', 1)[0]  # text before ' (University ...)', keeps multi-word town names whole
            result.append( [state, city] )

# --- display ---

for state, city in result:
    print(state, city)

Now you can use result to create a DataFrame.

With pandas, you can do the following:
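As a minimal sketch of that last step, a list of [state, city] pairs can be passed straight to the DataFrame constructor. The short `result` list below stands in for the one built by the loop, and the column names `state` and `town` are an assumption chosen to match the expected output.

```python
import pandas as pd

# Stand-in for the `result` list built by the loop: [state, city] pairs
result = [
    ['Alabama', 'Auburn'],
    ['Alabama', 'Florence'],
    ['Alaska', 'Fairbanks'],
]

# Column names are an assumption; use whatever fits your data
df = pd.DataFrame(result, columns=['state', 'town'])
print(df)
```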

import pandas as pd
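# Recent pandas versions reject sep='\n'; the default tab separator keeps
# each line whole, assuming the file contains no tab characters.
df = pd.read_table('data', header=None, names=['town'])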
df['is_state'] = df['town'].str.contains(r'\[edit\]')
df['groupno'] = df['is_state'].cumsum()
df['index'] = df.groupby('groupno').cumcount()
df['state'] = df.groupby('groupno')['town'].transform('first')
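# regex=True is needed in pandas 2.0+, where str.replace matches literally by default
df['state'] = df['state'].str.replace(r'\[edit\]', '', regex=True)
df['town'] = df['town'].str.replace(r' \(.+$', '', regex=True)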
df = df.loc[~df['is_state']]
df = df[['state','town']]

which yields

     state          town
1  Alabama        Auburn
2  Alabama      Florence
3  Alabama  Jacksonville
5   Alaska     Fairbanks
7  Arizona     Flagstaff
8  Arizona         Tempe
9  Arizona        Tucson

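Here is a breakdown of what the code is doing. After loading the text file into a DataFrame, use str.contains to identify the state rows. Then use cumsum to take a cumulative sum of the True/False values, where True counts as 1 and False as 0, so every row gets the number of the state block it belongs to.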

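# Recent pandas versions reject sep='\n'; the default tab separator keeps
# each line whole, assuming the file contains no tab characters.
df = pd.read_table('data', header=None, names=['town'])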
df['is_state'] = df['town'].str.contains(r'\[edit\]')
df['groupno'] = df['is_state'].cumsum()
#                                               town is_state  groupno
# 0                                    Alabama[edit]     True        1
# 1                    Auburn (Auburn University)[1]    False        1
# 2           Florence (University of North Alabama)    False        1
# 3  Jacksonville (Jacksonville State University)[2]    False        1
# 4                                     Alaska[edit]     True        2
# 5    Fairbanks (University of Alaska Fairbanks)[2]    False        2
# 6                                    Arizona[edit]     True        3
# 7       Flagstaff (Northern Arizona University)[6]    False        3
# 8                 Tempe (Arizona State University)    False        3
# 9                   Tucson (University of Arizona)    False        3

Now, for each groupno value, we can assign a unique integer to each row in the group:

df['index'] = df.groupby('groupno').cumcount()
#                                               town is_state  groupno  index
# 0                                    Alabama[edit]     True        1      0
# 1                    Auburn (Auburn University)[1]    False        1      1
# 2           Florence (University of North Alabama)    False        1      2
# 3  Jacksonville (Jacksonville State University)[2]    False        1      3
# 4                                     Alaska[edit]     True        2      0
# 5    Fairbanks (University of Alaska Fairbanks)[2]    False        2      1
# 6                                    Arizona[edit]     True        3      0
# 7       Flagstaff (Northern Arizona University)[6]    False        3      1
# 8                 Tempe (Arizona State University)    False        3      2
# 9                   Tucson (University of Arizona)    False        3      3

Likewise, for each groupno value, we can find the state by selecting the first town in each group:

df['state'] = df.groupby('groupno')['town'].transform('first')
#                                               town is_state  groupno  index          state
# 0                                    Alabama[edit]     True        1      0  Alabama[edit]
# 1                    Auburn (Auburn University)[1]    False        1      1  Alabama[edit]
# 2           Florence (University of North Alabama)    False        1      2  Alabama[edit]
# 3  Jacksonville (Jacksonville State University)[2]    False        1      3  Alabama[edit]
# 4                                     Alaska[edit]     True        2      0   Alaska[edit]
# 5    Fairbanks (University of Alaska Fairbanks)[2]    False        2      1   Alaska[edit]
# 6                                    Arizona[edit]     True        3      0  Arizona[edit]
# 7       Flagstaff (Northern Arizona University)[6]    False        3      1  Arizona[edit]
# 8                 Tempe (Arizona State University)    False        3      2  Arizona[edit]
# 9                   Tucson (University of Arizona)    False        3      3  Arizona[edit]

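We basically have the desired DataFrame; all that is left is to tidy up the result. We can use str.replace to remove [edit] from the states and to remove everything from the first parenthesis onward from the towns: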
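# regex=True is needed in pandas 2.0+, where str.replace matches literally by default
df['state'] = df['state'].str.replace(r'\[edit\]', '', regex=True)
df['town'] = df['town'].str.replace(r' \(.+$', '', regex=True)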

Remove the rows where town is actually a state:

df = df.loc[~df['is_state']]

Finally, keep only the desired columns:

df = df[['state','town']]
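To try these steps without a file on disk, here is a self-contained sketch that runs them on the sample text from the question. It swaps read_table for a DataFrame built from `text.splitlines()`, which is an assumption made only to keep the example runnable; the variable name `text` is likewise hypothetical.

```python
import pandas as pd

text = '''Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Jacksonville (Jacksonville State University)[2]
Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
Arizona[edit]
Flagstaff (Northern Arizona University)[6]
Tempe (Arizona State University)
Tucson (University of Arizona)'''

# One row per line of text, in a single 'town' column
df = pd.DataFrame({'town': text.splitlines()})

# Flag the state rows and number the block each row belongs to
df['is_state'] = df['town'].str.contains(r'\[edit\]')
df['groupno'] = df['is_state'].cumsum()

# The first row of each block is the state line; broadcast it to the block
df['state'] = df.groupby('groupno')['town'].transform('first')

# Strip '[edit]' from states and everything from ' (' onward from towns
df['state'] = df['state'].str.replace(r'\[edit\]', '', regex=True)
df['town'] = df['town'].str.replace(r' \(.+$', '', regex=True)

# Drop the state rows and keep only the two columns of interest
df = df.loc[~df['is_state'], ['state', 'town']]
print(df)
```

The original row labels (1, 2, 3, 5, 7, 8, 9) are kept; call `df.reset_index(drop=True)` if a continuous index is preferred.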