Extract words from a text column in a DataFrame
I need to create a new column from another column.
The dataset is created by this code (I have only included a few rows):
import pandas as pd

new_dataframe = pd.DataFrame({
    "Name": ['John', 'Lukas', 'Bridget', 'Carol', 'Madison'],
    "Notes": ["__ years old. NA", "__ years old. NA",
              "__ years old. NA", "__ years old. Old account.",
              "__ years old. New VIP account."],
    "Status": [True, False, True, True, True]})
which produces the following:
Name     Notes                           Status
John     23 years old. NA                True
Lukas    52 years old. NA                False
Bridget  64 years old. NA                True
Carol    31 years old. Old account       True
Madison  54 years old. New VIP account.  True
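I need to create two new columns containing the age information, in the following formats:
- __ years old (three words): e.g. 23 years old;
- __ (digits only): e.g. 23
In the end I should have: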
Name     Notes                           Status  L_Age         S_Age
John     23 years old. NA                True    23 years old  23
Lukas    52 years old. NA                False   52 years old  52
Bridget  64 years old. NA                True    64 years old  64
Carol    31 years old. Old account       True    31 years old  31
Madison  54 years old. New VIP account.  True    54 years old  54
I don't know how to extract the first three words, and then only the first one, to create the new columns. I tried
new_dataframe.loc[new_dataframe.Notes == '', 'L_Age'] = new_dataframe.Notes.str.split()[:3]
new_dataframe.loc[new_dataframe.Notes == '', 'S_Age'] = new_dataframe.Notes.str.split()[0]
but this fails (ValueError: Must have equal len keys and value when setting with an iterable).
Any help would be appreciated.
IIUC:
def get_first_n_words(txt, n):
    # Split on single spaces and keep the first n words.
    l = txt.split(' ')
    assert(len(l)>=n)
    return ' '.join(l[:n])

new_dataframe['L_Age'] = new_dataframe['Notes'].apply(lambda x: get_first_n_words(x, 3))
new_dataframe['S_Age'] = new_dataframe['Notes'].apply(lambda x: get_first_n_words(x, 1))
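For comparison, the same idea can be sketched with pandas string methods instead of apply. The attempt in the question fails in part because `.str.split()[:3]` slices the Series itself (its first three rows) rather than each list of words; going through the `.str` accessor again slices per row. The frame `df` below is a hypothetical sample with the ages filled in from the expected output, and the trailing `rstrip(".")` is an assumption made so that `L_Age` matches that output exactly.

```python
import pandas as pd

# Hypothetical sample: ages filled in from the expected output in the question.
df = pd.DataFrame({
    "Name": ["John", "Carol", "Madison"],
    "Notes": ["23 years old. NA",
              "31 years old. Old account.",
              "54 years old. New VIP account."],
})

# Series in which each element is the list of words of one note.
words = df["Notes"].str.split()

# .str[:3] slices every list; a plain [:3] would pick the first three rows.
# rstrip(".") drops the period that stays attached to "old." after splitting.
df["L_Age"] = words.str[:3].str.join(" ").str.rstrip(".")

# .str[0] takes the first word of every list (still a string, not an int).
df["S_Age"] = words.str[0]

print(df[["Name", "L_Age", "S_Age"]])
```

Like the function above, this relies on the age always being the first word of the note.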
You can use this pattern to extract the information and join it:
pattern = r'^(?P<L_Age>(?P<S_Age>\d+) years? old)'
new_dataframe = new_dataframe.join(new_dataframe.Notes.str.extract(pattern))
Output:
   Name     Notes                           Status  L_Age         S_Age
0  John     23 years old. NA                True    23 years old  23
1  Lukas    52 years old. NA                False   52 years old  52
2  Bridget  64 years old. NA                True    64 years old  64
3  Carol    31 years old. Old account       True    31 years old  31
4  Madison  54 years old. New VIP account.  True    54 years old  54
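Note that `str.extract` returns strings, so `S_Age` above holds text rather than numbers. If numeric ages are wanted, a conversion along these lines might help. This is only a sketch: the frame `df` and the "no age given" row are hypothetical, added to show what happens when a note does not match the pattern.

```python
import pandas as pd

# Hypothetical sample, including one note that does not match the pattern.
df = pd.DataFrame({
    "Notes": ["23 years old. NA",
              "no age given",
              "54 years old. New VIP account."],
})

# Raw string so that \d is passed to the regex engine unchanged.
pattern = r'^(?P<L_Age>(?P<S_Age>\d+) years? old)'

# Named groups become column names; non-matching rows get missing values.
df = df.join(df["Notes"].str.extract(pattern))

# Convert the digit strings to numbers; the nullable Int64 dtype keeps
# integers even when some rows are missing.
df["S_Age"] = pd.to_numeric(df["S_Age"]).astype("Int64")

print(df)
```

Using the nullable `Int64` dtype rather than plain `int` avoids an error when some notes contain no age.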