How to break character vectors up at defined terms?
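This is my first time using rvest to scrape data from a website.
It gives me a character vector, which I am trying to split up and convert into a data frame with columns.
How can I turn this vector:

    char.vector <- c("John DoeTeacherSpeaksEnglishJapaneseRateUSD 10Video Intro","JaneTutorSpeaksJapaneseFrenchRateUSD 15Video Intro")

...into this data frame with columns: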
Name | Role | English | Japanese | French | Rate_USD |
---|---|---|---|---|---|
John Doe | Teacher | 1 | 1 | 0 | 10 |
Jane | Tutor | 0 | 1 | 1 | 15 |
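Splitting by spaces or by character position is problematic. Is there a way to create a vector containing all the words I want to split on and use it as the split argument?

    split.vector <- c("Teacher", "Tutor", "Speaks", "English", "Japanese", "French", "Rate", "Video")

My code and URL:

    EN.char <- read_html("https://www.italki.com/teachers/english") %>%
      html_nodes(".teacher-card") %>%
      html_text()
    EN.char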
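A vector of terms can serve as the split argument once it is collapsed into a single regular expression. The sketch below is not part of the original code and runs only on the two sample strings from the question. It assumes the terms are the English words that appear in the scraped text, that none of them occurs inside a name, and that the role is always the first term matched.

```r
# Sample strings copied from the question
char.vector <- c(
  "John DoeTeacherSpeaksEnglishJapaneseRateUSD 10Video Intro",
  "JaneTutorSpeaksJapaneseFrenchRateUSD 15Video Intro"
)
split.vector <- c("Teacher", "Tutor", "Speaks", "English", "Japanese",
                  "French", "Rate", "Video")
languages <- c("English", "Japanese", "French")

# Collapse the terms into one alternation pattern: "Teacher|Tutor|..."
pattern <- paste(split.vector, collapse = "|")

# Text left over once every term is cut out (empty pieces dropped)
pieces <- lapply(strsplit(char.vector, pattern), function(p) p[nzchar(p)])

# The terms actually found in each string, in order of appearance
found <- regmatches(char.vector, gregexpr(pattern, char.vector))

# One 0/1 column per language (assumption: a language is "spoken" if matched)
lang_flags <- sapply(languages, function(lang) {
  as.integer(vapply(found, function(f) lang %in% f, logical(1)))
})

result <- data.frame(
  Name = vapply(pieces, `[`, character(1), 1),   # first leftover piece
  Role = vapply(found, `[`, character(1), 1),    # assumed to be the first term
  lang_flags,
  Rate_USD = as.integer(sub("USD ", "", vapply(pieces, `[`, character(1), 2))),
  stringsAsFactors = FALSE
)
result
```

On real page text this approach is fragile, since any name containing one of the terms would be split in the wrong place. Selecting the individual HTML elements avoids that.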
Because the number of languages can vary from entry to entry, I have shamelessly adapted the method used by @akrun in their answer, whereby `read.dcf` is used to map out all the languages present and to put NA where a language is not present for a given entry. After having read this article, I see the following:
The DCF rules as implemented in R are:
- A database consists of one or more records, each with one or more named fields. Not every record must contain each field, a field may appear only once in a record.
- Regular lines start with a non-whitespace character.
- Regular lines are of form tag:value, i.e., have a name tag and a value for the field, separated by : (only the first : counts). The value can be empty (=whitespace only).
- Lines starting with whitespace are continuation lines (to the preceding field) if at least one character in the line is non-whitespace.
- Records are separated by one or more empty (=whitespace only) lines.
To resolve the malformed-line error, I needed to satisfy rule 3 and insert a ":" so that every line is a tag:value pair.
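As a small illustration of that fix, the sketch below feeds two made-up language labels to `read.dcf`. The labels are hypothetical stand-ins for the text of one teacher card. A bare label such as "English" has no colon, so appending ":1" turns it into a tag with the value "1".

```r
# Hypothetical language labels taken from a single teacher card
langs <- c("English", "Japanese")

# "English" alone is not a tag:value line; "English:1" is
dcf_lines <- paste0(langs, ":1")

# read.dcf accepts a connection, so wrap the lines in a textConnection
con <- textConnection(dcf_lines)
record <- read.dcf(con)
close(con)

# A one-row character matrix with one column per tag
record
```

Each label becomes a column name, which is what later allows entries with different languages to be lined up against each other.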
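Following @akrun's example, I wrapped `read.dcf` in an `as.data.frame` call. I have left in the perhaps unnecessary `if` statement as a guard against entries with no languages; given the nature of the website, I doubt the languages would ever be missing.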
I then use `replace_na()` to switch the NAs to "0" and convert all the language columns to integer.
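The NA handling can be tried in isolation on a toy data frame. The values below are hypothetical and mimic what row-binding two records with different languages might give. Because `read.dcf` returns character values, the replacement is the string "0" rather than the number 0, which keeps the types consistent before the conversion to integer.

```r
library(dplyr)
library(tidyr)

# Hypothetical result of row-binding two records with different languages
language_df <- data.frame(
  English  = c("1", NA),
  Japanese = c("1", "1"),
  French   = c(NA, "1"),
  stringsAsFactors = FALSE
)

# Fill the gaps with "0", then convert every column to integer
language_df <- language_df %>%
  mutate(across(everything(), ~ as.integer(replace_na(.x, "0"))))

language_df
```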
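I then generate a second data frame containing all the other required details, where each field has exactly one entry per record (row). Because I know the rows match and are in the same order, I use `cbind()` to join the data frames column-wise.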
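Since `cbind()` pairs rows purely by position, it may be worth checking that both data frames have the same number of rows before joining them. This is a minimal sketch on made-up values, and the guard is an addition for illustration rather than part of the original code.

```r
# Hypothetical details and language flags for the same two teachers
details_df <- data.frame(
  name = c("John Doe", "Jane"),
  role = c("Teacher", "Tutor"),
  rate_usd = c("10", "15"),
  stringsAsFactors = FALSE
)
language_df <- data.frame(
  English  = c(1L, 0L),
  Japanese = c(1L, 1L),
  French   = c(0L, 1L)
)

# Stop early if an entry was dropped from either side
stopifnot(nrow(details_df) == nrow(language_df))

results <- cbind(details_df, language_df)
results
```

If an entry ever had no languages and contributed no row to the language data frame, the row counts would differ and the check would stop with an error before the join.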
N.B. Because of my location, the USD rate is actually displayed in GBP.
R:

    library(purrr)
    library(dplyr)
    library(stringr)
    library(rvest)

    page <- read_html("https://www.italki.com/teachers/english")
    entries <- page %>% html_elements(".teacher-card")

    # One row per teacher card, one column per language found on the page
    language_df <- map_dfr(entries, ~ {
      # Append ":1" so that each language becomes a valid DCF tag:value line
      new <- .x %>%
        html_elements("p + div > div") %>%
        html_text() %>%
        paste0(., ":1")

      if (length(new) > 0) {
        as.data.frame(read.dcf(textConnection(new)))
      } else {
        NULL
      }
    }) %>%
      # read.dcf returns character values, so fill NA with "0" before converting
      mutate(across(.cols = everything(), ~ as.integer(tidyr::replace_na(.x, "0"))))

    # Name, role and rate: exactly one value per card
    details_df <- map_dfr(entries, ~
      data.frame(
        name = .x %>% html_element("div > .overflow-hidden") %>% html_text(),
        role = .x %>% html_element("span + div > .text-tiny") %>% html_text(),
        rate_usd = .x %>% html_element(".flex-1:nth-child(1) > div > span") %>% html_text()
      ))

    # Rows are in the same order in both data frames, so join column-wise
    results <- cbind(details_df, language_df)
Results: