Extract age from text in R
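I have a .csv file in which one column contains book descriptions scraped from the web, and I have imported it into R for further analysis. My goal is to extract the age of the main character from this column in R, so what I have in mind is this:
- Match strings such as "age" and "-year-old" with a regular expression
- Copy the sentences containing these strings into a new column (so that I can make sure the sentence is not something like "In the middle ages 50 people lived in xy")
- Extract the numbers (and, if possible, some number words) from this column into a new column.
The resulting table (or maybe data.frame) would then look like this: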
|Description                                                           |Sentence                          |Age
|YY is a novel by Mr. X about a boy. The 12-year-old boy is named Dave.|The 12-year-old boy is named Dave.|12
It would be great if you could help me, as my R skills are still very limited and I have not found a solution to this problem!
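You could try the following:
library(stringr)
description <- "YY is a novel by Mr. X about a boy. The 12-year-old boy is named Dave. Dave is happy."
# Grab the first sentence that contains a number, then drop the leading ". "
# (backslashes must be doubled inside an R string)
sentence <- str_extract(description, pattern = "\\.[^.]*[0-9]+[^.]*\\.") %>%
  str_replace("^\\. ", "")
sentence
# [1] "The 12-year-old boy is named Dave."
# Pull the first run of digits out of that sentence (returned as a string)
age <- str_extract(sentence, pattern = "[0-9]+")
age
# [1] "12"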
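Since the descriptions sit in a column rather than in a single string, the same two calls can be applied to the whole column at once, because stringr functions are vectorised. A minimal sketch, assuming a data frame called `books` with a `Description` column (both names are made up here; in practice the data would come from `read.csv()`):

```r
library(stringr)

# Hypothetical stand-in for the imported .csv file
books <- data.frame(
  Description = c(
    "YY is a novel by Mr. X about a boy. The 12-year-old boy is named Dave. Dave is happy.",
    "ZZ is a story about a cat. The cat sleeps all day."
  ),
  stringsAsFactors = FALSE
)

# First sentence containing a digit, with the leading ". " removed
books$Sentence <- str_replace(
  str_extract(books$Description, "\\.[^.]*[0-9]+[^.]*\\."),
  "^\\. ", ""
)

# First run of digits in that sentence; NA where no sentence matched
books$Age <- as.numeric(str_extract(books$Sentence, "[0-9]+"))

books[, c("Sentence", "Age")]
```

Rows without a sentence containing a digit end up with `NA` in both new columns. Note that the pattern requires a preceding period, so an age mentioned in the very first sentence of a description would be missed.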
Here is another option, in case the string contains numbers or descriptions other than the age but you only want the age.
library(stringr)
description <- "YY is a novel by Mr. X about a boy. The boy is 5 feet tall. The 12-year-old boy is named Dave. Dave is happy. Dave lives at 42 Washington street."
# Split on periods and keep only the piece that mentions "-year-old"
sentences <- str_split(description, "\\.")[[1]]
sentence <- sentences[grepl("-year-old", sentences)]
sentence
# [1] " The 12-year-old boy is named Dave"
# The lookahead (?=-year-old) matches digits only when "-year-old" follows them
age <- as.numeric(str_extract(description, "\\d+(?=-year-old)"))
age
# [1] 12
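Here we use the string "-year-old" to tell us which sentence to pull, and then we extract the number that is immediately followed by that string, which is the age.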
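The question also asks about ages written out as words. One way this could be sketched, assuming a small hand-made lookup table (`number_words` and `extract_age` are illustrative names, and the table only covers the words listed), is to let the lookahead accept either digits or one of the known words:

```r
library(stringr)

# Hypothetical lookup table; extend it with whatever words occur in the data
number_words <- c(one = 1, two = 2, three = 3, four = 4, five = 5, six = 6,
                  seven = 7, eight = 8, nine = 9, ten = 10, eleven = 11,
                  twelve = 12)

extract_age <- function(text) {
  # Digits or a known number word, directly followed by "-year-old"
  pattern <- paste0("(?i)\\b(\\d+|",
                    paste(names(number_words), collapse = "|"),
                    ")(?=-year-old)")
  token <- str_to_lower(str_extract(text, pattern))

  # Number words are looked up in the table, digit strings are converted directly
  is_word <- token %in% names(number_words)
  age <- rep(NA_real_, length(token))
  age[is_word] <- number_words[token[is_word]]
  age[!is_word] <- as.numeric(token[!is_word])
  age
}

extract_age(c("The 12-year-old boy is named Dave.",
              "A twelve-year-old girl finds a map.",
              "The boy is 5 feet tall."))
```

The function is vectorised, so it can be applied directly to a whole description column. Compound words such as "twenty-one-year-old" are not handled by this simple table and would be read as 1, so they would need extra treatment.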