How do I select multiple elements from a string delimited by ";" and "," in a DataFrame?
Example:
Row 1 of the DataFrame: name 1, age 1, country 1; name 2, age 2, country 2; name 3, age 3, country 3
Row 2 of the DataFrame: name a, age a, country a; name b, age b, country b; name c, age c, country c
I want to select only the countries from each row and put them in a new column of the same DataFrame:
country 1, country 2, country 3
country a, country b, country c
I tried the following, but it only returns the last country of the last school in each row:
df["countries"] = df["school_info"].apply(lambda x: str(x).split(",")[-1].strip())
Output:
country 3
country c
Thanks!
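OK, now I understand what you are asking for.
- Build a temporary list of tuples, one tuple for each row you want to end up with.
- Use explode() to expand the list into rows.
- Pick the values out of each row's tuple to form columns. For the purposes of the example I have picked out all the components and left the original encoded string in place.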
import pandas as pd

data = """name 1, age 1, country 1; name 2, age 2, country 2; name 3, age 3, country 3
name a, age a, country a; name b, age b, country b; name c, age c, country c"""
df = pd.DataFrame({"school_info": data.split("\n")})
# df["data_tuple"] = df["school_info"].apply(lambda s: [tuple(t.split(",")) for t in s.split(";")])
df = (
    df.assign(
        # build a list of tuples: ";" separates schools, each tuple is (name, age, country)
        data_tuple=lambda dfa: dfa["school_info"].apply(
            lambda s: [tuple(p.strip() for p in t.split(",")) for t in s.split(";")]
        )
    )
    # explode the list and pick out each element of the resulting tuple
    .explode("data_tuple")
    .assign(
        name=lambda dfa: dfa["data_tuple"].apply(lambda t: t[0]),
        age=lambda dfa: dfa["data_tuple"].apply(lambda t: t[1]),
        country=lambda dfa: dfa["data_tuple"].apply(lambda t: t[2]),
    )
    .drop(columns="data_tuple")  # this was a temporary construct, drop it
)
print(df.to_string(index=False))
Output:
school_info name age country
name 1, age 1, country 1; name 2, age 2, country 2; name 3, age 3, country 3 name 1 age 1 country 1
name 1, age 1, country 1; name 2, age 2, country 2; name 3, age 3, country 3 name 2 age 2 country 2
name 1, age 1, country 1; name 2, age 2, country 2; name 3, age 3, country 3 name 3 age 3 country 3
name a, age a, country a; name b, age b, country b; name c, age c, country c name a age a country a
name a, age a, country a; name b, age b, country b; name c, age c, country c name b age b country b
name a, age a, country a; name b, age b, country b; name c, age c, country c name c age c country c
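The exploded frame above has one row per school, whereas the question asks for one comma-joined value per original row. A minimal sketch of how the exploded values might be folded back, assuming the data sits in a `school_info` column as above and the new column is called `countries` as in the question (the intermediate names `schools` and `country` are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({"school_info": [
    "name 1, age 1, country 1; name 2, age 2, country 2; name 3, age 3, country 3",
    "name a, age a, country a; name b, age b, country b; name c, age c, country c",
]})

# One entry per school: split each string on ";" and explode the resulting lists.
schools = df["school_info"].str.split(";").explode()

# The country is taken to be the last comma-separated field of each school.
country = schools.str.split(",").str[-1].str.strip()

# explode() repeats the original row index, so grouping on that index
# rebuilds exactly one joined string per original row.
df["countries"] = country.groupby(level=0).agg(", ".join)
print(df["countries"].tolist())
```

This relies on the country always being the last field of each school entry, which holds for the sample data but is an assumption about the real data.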
If your rows are in a single column named school_info:
df["school_info"].apply(lambda r: ', '.join([c.split(",")[-1].strip() for c in r.split(";")]))
Input:
data = [["name 1, age 1, country 1; name 2, age 2, country 2; name 3, age 3, country 3"],
["name a, age a, country a; name b, age b, country b; name c, age c, country c"]]
df = pd.DataFrame(data, columns=['school_info'])
Output:
0 country 1, country 2, country 3
1 country a, country b, country c
Name: school_info, dtype: object
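The expression above returns a Series and does not change the DataFrame. To get the new column the question asks for, the result could be assigned back. A small sketch, where `extract_countries` is a hypothetical helper name and `countries` is the column name used in the question's own attempt:

```python
import pandas as pd

def extract_countries(info):
    """Return the last comma-separated field of every ';'-separated entry."""
    entries = str(info).split(";")
    return ", ".join(entry.split(",")[-1].strip() for entry in entries)

data = [["name 1, age 1, country 1; name 2, age 2, country 2; name 3, age 3, country 3"],
        ["name a, age a, country a; name b, age b, country b; name c, age c, country c"]]
df = pd.DataFrame(data, columns=["school_info"])

# Store the joined countries next to the original string.
df["countries"] = df["school_info"].apply(extract_countries)
print(df["countries"].tolist())
```

The `str(info)` call mirrors the `str(x)` in the question's attempt. It keeps the helper from raising on non-string cells, though a missing value would then come back as the text "nan", so real data may need an explicit check.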