How to split a pandas series with multiple options?
I have a pandas DataFrame with a string column. I want to separate the city name from the rest of each string.
Here is my MWE:
import numpy as np
import pandas as pd
data = """\
2930 Beverly Glen Circle Los Angeles
435 S. La Cienega Blvd. Los Angeles
12224 Ventura Blvd. Studio City
9570 Wilshire Blvd. Beverly Hills
26025 Pacific Coast Hwy. Malibu""".split('\n')
df = pd.DataFrame(data)
print(df)
cities = ['Los Angeles', 'Studio City', 'Beverly Hills','Malibu']
pat = '|'.join([r'(.*)\s({city})' for city in cities])
df = df[0].str.extract(pat,expand=True)
df
How can I get the following output?
                                      0                      addr           city
0  2930 Beverly Glen Circle Los Angeles  2930 Beverly Glen Circle    Los Angeles
1   435 S. La Cienega Blvd. Los Angeles   435 S. La Cienega Blvd.    Los Angeles
2       12224 Ventura Blvd. Studio City       12224 Ventura Blvd.    Studio City
3     9570 Wilshire Blvd. Beverly Hills       9570 Wilshire Blvd.  Beverly Hills
4       26025 Pacific Coast Hwy. Malibu  26025 Pacific Coast Hwy.         Malibu
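You should move the alternatives into a single capturing group (and use an f-string so the city names are actually interpolated into the pattern):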
import pandas as pd
data = """\
2930 Beverly Glen Circle Los Angeles
435 S. La Cienega Blvd. Los Angeles
12224 Ventura Blvd. Studio City
9570 Wilshire Blvd. Beverly Hills
26025 Pacific Coast Hwy. Malibu""".split('\n')
df = pd.DataFrame(data)
print(df)
cities = ['Los Angeles', 'Studio City', 'Beverly Hills','Malibu']
c = '|'.join(cities)  # alternation of all city names
pat = fr'(.*?)\s({c})'  # group 1: address (lazy), group 2: one of the cities
df = df[0].str.extract(pat,expand=True)
print(df)
Output:
                          0              1
0  2930 Beverly Glen Circle    Los Angeles
1   435 S. La Cienega Blvd.    Los Angeles
2       12224 Ventura Blvd.    Studio City
3       9570 Wilshire Blvd.  Beverly Hills
4  26025 Pacific Coast Hwy.         Malibu
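To see why the pattern in the question returns nothing useful, it may help to compare the two patterns with the standard `re` module. This is only a sketch: the variable names `broken` and `fixed` are made up here, and the sample string is one row from the question's data.

```python
import re

cities = ['Los Angeles', 'Studio City', 'Beverly Hills', 'Malibu']

# Pattern from the question: there is no f prefix, so "{city}" stays
# literal text, and every alternative adds its own pair of groups.
broken = re.compile('|'.join([r'(.*)\s({city})' for city in cities]))

# Pattern from the answer: one address group, then one city group
# that contains the whole alternation.
fixed = re.compile(r'(.*?)\s(' + '|'.join(cities) + r')')

sample = '12224 Ventura Blvd. Studio City'

n_broken_groups = broken.groups      # 8 groups, hence 8 columns in extract
n_fixed_groups = fixed.groups        # 2 groups: address and city
broken_match = broken.search(sample) # None: the text never contains "{city}"
addr, city = fixed.search(sample).groups()
print(n_broken_groups, n_fixed_groups, broken_match, addr, city, sep=' | ')
```

Since `Series.str.extract` returns one column per capture group, the question's pattern yields eight all-NaN columns, while the fixed pattern yields the two wanted ones.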
You can try Series.str.split, splitting on the whitespace that is immediately followed by a city name:
pat = '|'.join([rf'\s(?={city})' for city in cities])
df1 = df[0].str.split(pat, expand=True).rename(columns={0: 'addr', 1: 'city'})
df = pd.concat([df[0], df1], axis=1)
Alternatively, you can use Series.str.extract with named groups, which become the column names:
pat = r'(?P<addr>.*)?\s' + r'(?P<city>' + '|'.join(cities) + r')'
df = pd.concat([df[0], df[0].str.extract(pat, expand=True)], axis=1)
Result:
# print(df)
                                      0                      addr           city
0  2930 Beverly Glen Circle Los Angeles  2930 Beverly Glen Circle    Los Angeles
1   435 S. La Cienega Blvd. Los Angeles   435 S. La Cienega Blvd.    Los Angeles
2       12224 Ventura Blvd. Studio City       12224 Ventura Blvd.    Studio City
3     9570 Wilshire Blvd. Beverly Hills       9570 Wilshire Blvd.  Beverly Hills
4       26025 Pacific Coast Hwy. Malibu  26025 Pacific Coast Hwy.         Malibu
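If the city list might contain regex metacharacters (a name such as "St. Helena", which is not in the question's data), it may be safer to escape each name before joining. The sketch below uses a hypothetical helper, `build_city_pattern`, and assumes the city always ends the string; neither is part of the answers above.

```python
import re

def build_city_pattern(cities):
    """Hypothetical helper: regex string with named groups 'addr' and 'city'."""
    # re.escape makes characters such as '.' match literally.
    alternatives = '|'.join(re.escape(c) for c in cities)
    # Lazy .*? stops the address at the earliest point where a city follows;
    # $ requires the city to be the last thing in the string (an assumption).
    return rf'^(?P<addr>.*?)\s+(?P<city>{alternatives})$'

pat = build_city_pattern(['Los Angeles', 'Studio City', 'Beverly Hills', 'Malibu'])
m = re.search(pat, '9570 Wilshire Blvd. Beverly Hills')
addr, city = m.group('addr'), m.group('city')

# With escaping, the '.' in a city name no longer matches any character.
escaped_only = re.search(build_city_pattern(['St. Helena']), '1 Main St StXHelena')
print(addr, city, escaped_only, sep=' | ')
```

The resulting string can be passed to `df[0].str.extract(pat)` in the same way as the patterns above; the named groups then become the `addr` and `city` columns.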