Extract the numbers in a string from a column in a pandas DataFrame
I need to do feature extraction on the 'Amenities' column of the DataFrame house_data.
The Amenities column contains the following data:
house_data['Amenities']
3 3 beds 1 bath
4 1 bed 1 bath 1 parking
5 3 beds 1 bath
6 2 beds 2 baths 2 parking
7 3 beds 1 bath 2 parking
...
2096 3 beds 2 baths 1 parking 419m
2097 4 beds 1 bath 2 parking
2098 3 beds 2 baths 2 parking
2099 2 beds 2 baths 1 parking
2100 3 beds 2 baths 1 parking 590m
Name: Amenities, Length: 1213, dtype: object
I need to extract the number of bedrooms, bathrooms, and parking spaces and store them in 3 separate columns.
house_data["bedrooms"] = house_data["Amenities"].str.extract(r"(\d*\.?\d+)", expand=True)
3 3
4 1
5 3
6 2
7 3
..
2096 3
2097 4
2098 3
2099 2
2100 3
Name: bedrooms, Length: 1213, dtype: object
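The code above extracts only the first number in each string. How can I extract the numbers that represent the baths/parking counts and store them in separate columns?
You can try this: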
df = df['Amenities'].str.split(r'[a-zA-Z ]+', expand=True).drop(columns=[3, 4])
print(df)
   0  1  2
0  3  1
1  1  1  1
2  3  1
3  2  2  2
4  3  1  2
5  3  2  1
6  4  1  2
7  3  2  2
8  2  2  1
9  3  2  1
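The split approach depends on position, so a row that lists parking but no bath count would shift values into the wrong column. A minimal sketch of a keyword-based alternative follows, assuming every amenity is written as a number followed by its keyword. The frame name `df`, the column names, and the sample rows (copied from the question) are illustrative assumptions.

```python
import pandas as pd

# Sample rows copied from the question; `df` is a stand-in for the real frame.
df = pd.DataFrame({
    "Amenities": [
        "3 beds 1 bath",
        "1 bed 1 bath 1 parking",
        "2 beds 2 baths 2 parking",
        "3 beds 2 baths 1 parking 419m",
    ]
})

# One pattern per feature: digits, optional whitespace, then the keyword.
patterns = {
    "bedrooms": r"(\d+)\s*beds?\b",
    "bathrooms": r"(\d+)\s*baths?\b",
    "parking": r"(\d+)\s*parking\b",
}

for column, pattern in patterns.items():
    # With a single capture group, expand=False returns a Series.
    # Rows where the keyword is absent get NaN.
    df[column] = df["Amenities"].str.extract(pattern, expand=False)

print(df)
```

Because each column is matched by its own keyword, the order of the amenities in the string does not matter, and a trailing value such as `419m` is simply ignored.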
We can use named groups with Series.str.extract here:
regex = r'(?P<beds>\d)\sbeds?\s(?P<bath>\d+)\sbaths?\s?(?P<parking>\d)?'
df = pd.concat([df, df['Amenities'].str.extract(regex)], axis=1)
                       Amenities  beds  bath  parking
0                  3 beds 1 bath     3     1      NaN
1         1 bed 1 bath 1 parking     1     1        1
2                  3 beds 1 bath     3     1      NaN
3       2 beds 2 baths 2 parking     2     2        2
4        3 beds 1 bath 2 parking     3     1        2
5  3 beds 2 baths 1 parking 419m     3     2        1
6        4 beds 1 bath 2 parking     4     1        2
7       3 beds 2 baths 2 parking     3     2        2
8       2 beds 2 baths 1 parking     2     2        1
9  3 beds 2 baths 1 parking 590m     3     2        1