Extract only the digits from a column and split them into different columns

Question

I have a very large DataFrame in which a typical row looks like this:

>>>ID    name    year    location
0  341   Dali    1995   {{"{\"latitude\":\"9.4714611480000004\",\"longitude\":\"4.3520187860000004\"}","{\"latitude\":\"9.4720611479999999\",\"longitude\":\"4.3520187860000004\"}}
...

I would like to split the location column into several columns that contain only the numbers, and get rid of "latitude", "longitude", and all the symbols in between.

I thought of first extracting only the digits, like this:

df['location'] = df['location'].str.extract(r'(\d+)', expand=False)

But for some reason this gave me the location column as integers only.

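I don't want to use split, because the symbols in between are inconsistent: sometimes the sequence is {{"{" and sometimes it is just "{"{", and I can't really keep track of every variation that might occur. The number of digits also differs between rows.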

The result I want should look like this:

>>>ID    name    year    lat                 long                     lat1          long1 ....
0  341   Dali    1995    9.4714611480000004  4.3520187860000004 9.4720611479999999 4.3520187860000004

Edit: I also tried this:

df['location'] = df['location'].str.replace(r'\D', '')

This kept the digits but gave me one very long number, without keeping the "." and without any space between the numbers.

Answer 1

I used regular-expression matching to extract the latitude and longitude values. This can be done with the following code.

import re
import pandas as pd

df = pd.DataFrame({
    'ID': [341,321],
    'name':['Dali','daLi'],
    'year':[1995, 1996],
    'location':['{{"{\"latitude\":\"9.4714611480000004\",\"longitude\":\"4.3520187860000004\"}","{\"latitude\":\"9.4720611479999999\",\"longitude\":\"4.3520187860000004\"}}',
                '{{"{\"latitude\":\"9.4714611480000004\",\"longitude\":\"4.3520187860000004\"}","{\"latitude\":\"9.4720611479999999\",\"longitude\":\"4.3520187860000004\"}}']
})
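
Before the solution, a quick check on a single string may help explain why the two attempts in the question behave as described. This is only a sketch on a shortened cell (one latitude/longitude pair); the variable names are illustrative and not part of the original post.

```python
import re

# One cell from the sample data, shortened to a single pair (raw string keeps the backslashes).
s = r'{{"{\"latitude\":\"9.4714611480000004\",\"longitude\":\"4.3520187860000004\"}"}}'

# r'(\d+)' stops at the first non-digit, so the first match is only the integer part.
# str.extract returns just this first match, which is why the column looked like integers.
first_match = re.search(r'(\d+)', s).group(1)

# Removing every non-digit also removes the dots and glues all the numbers together.
digits_only = re.sub(r'\D', '', s)

# Allowing an optional fractional part and collecting all matches keeps each number intact.
numbers = re.findall(r'\d+(?:\.\d+)?', s)

print(first_match)
print(digits_only)
print(numbers)
```

So the pattern needs to allow the decimal point, and every match has to be collected rather than only the first one.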

Solution

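# Find every decimal number in each location string (kept as strings so no digits are lost)
df_new = df.location.apply(lambda x: re.findall(r"\d+\.?\d*", x))
# One column per number; the four names assume exactly two latitude/longitude pairs per row
df_new = pd.DataFrame(df_new.to_list(), columns=['lat1','long1','lat2','long2'])
# Put the extracted columns next to ID, name and year
pd.concat([df.iloc[:,0:3], df_new], axis=1)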

Output

    ID  name    year    lat1                long1               lat2                long2
0   341 Dali    1995    9.4714611480000004  4.3520187860000004  9.4720611479999999  4.3520187860000004
1   321 daLi    1996    9.4714611480000004  4.3520187860000004  9.4720611479999999  4.3520187860000004
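
The fixed column list above works when every row holds exactly two pairs. If rows can hold a different number of coordinates, one possible variation is to generate the column names from the widest row. The sketch below uses shortened, made-up coordinate values, assumes the numbers always alternate latitude then longitude, and allows an optional leading minus sign; none of that is confirmed by the question.

```python
import pandas as pd

# Hypothetical sample: the first row has two pairs, the second only one.
df = pd.DataFrame({
    'ID': [341, 321],
    'name': ['Dali', 'daLi'],
    'year': [1995, 1996],
    'location': [
        r'{{"{\"latitude\":\"9.47\",\"longitude\":\"4.35\"}","{\"latitude\":\"9.48\",\"longitude\":\"4.36\"}}',
        r'{{"{\"latitude\":\"1.5\",\"longitude\":\"2.5\"}}',
    ],
})

# Find every decimal number in each cell; rows may yield lists of different lengths.
found = df['location'].str.findall(r'-?\d+(?:\.\d+)?')

# Build a frame from the lists; shorter rows are padded with missing values.
coords = pd.DataFrame(found.tolist(), index=df.index)

# Name the columns lat1, long1, lat2, long2, ... (assumes lat/long alternate).
coords.columns = [
    f"{'lat' if i % 2 == 0 else 'long'}{i // 2 + 1}"
    for i in range(coords.shape[1])
]

# Replace the raw location column with the extracted coordinate columns.
result = pd.concat([df.drop(columns='location'), coords], axis=1)
print(result)
```

The values stay as strings here, which preserves all the digits shown in the question; converting them with astype(float) would round them to double precision.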
