Compare and refine strings in separate columns with the refinr package
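I spend a lot of my time merging two data frames on country, city, name, or party columns. This is where the refinr package, an R port of OpenRefine, comes in handy. However, I haven't figured out how to compare and harmonize strings across two equivalent columns in different data frames, the way I can with refinr on a single vector. I'm not that experienced in R, so this may sound a bit vague. Maybe my example makes things clearer.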
library(tidyverse)
library(refinr)
# I would like to add the values (and the correctly spelled names) of this example df...
df1 <- tribble(
~uid, ~name, ~value,
"A", "Red", 13,
"A", "violet", 145,
"B", "Blue", 3,
"B", "yellow", 56,
"C", "yellow-purple", 789,
"C", "green", 17
)
# ...to the following df
df2 <- tribble(
~uid, ~name,
"A", "red",
"B", "blu",
"C", "YellowPurple",
"C", "green"
)
# An exact join only matches "green"; the other three rows get NA for value
df3 <- left_join(df2, df1, by = c("uid", "name"))
# While the following is the desired outcome:
# A tibble: 4 x 3
#   uid   name          value
#   <chr> <chr>         <dbl>
# 1 A     Red              13
# 2 B     Blue              3
# 3 C     yellow-purple   789
# 4 C     green            17
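key_collision_merge() and n_gram_merge() work on the strings of a single vector. My question is: can I compare and change strings across two columns instead of one?
If that is possible, it would save me a lot of time!
Thanks in advance.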
You can try:
library(refinr)
library(tidyverse)
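df1 %>%
  # Stack both data frames; id is "1" for df1 rows and "2" for df2 rows
  bind_rows(df2, .id = "id") %>%
  # Cluster spelling variants across the combined name column
  mutate(key = key_collision_merge(name)) %>%
  split(.$id) %>%
  # Braces keep magrittr from also passing the list as an extra argument
  { inner_join(x = select(.[[1]], -id),
               y = select(.[[2]], uid, key),
               by = c("uid", "key")) }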
# A tibble: 3 x 4
uid name value key
<chr> <chr> <dbl> <chr>
1 A Red 13. Red
2 C yellow-purple 789. YellowPurple
3 C green 17. green
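However, neither of the two refinr functions recognizes "blu" as "blue". You can handle this particular string with a gsub by adding the line mutate(name = gsub("^blu$", "blue", name)) before the key is computed. The pattern is anchored so that a value already spelled "blue" is not turned into "bluee".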
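To see why "red" and "YellowPurple" match while "blu" does not, it can help to build a key by hand. The sketch below is only a simplified stand-in for the idea of key collision, not refinr's actual fingerprinting; simple_key is a hypothetical helper introduced for illustration, and the rule it applies (lower-case, then drop everything except letters and digits) is an assumption.

```r
library(dplyr)
library(tibble)

df1 <- tribble(
  ~uid, ~name,           ~value,
  "A",  "Red",           13,
  "A",  "violet",        145,
  "B",  "Blue",          3,
  "B",  "yellow",        56,
  "C",  "yellow-purple", 789,
  "C",  "green",         17
)

df2 <- tribble(
  ~uid, ~name,
  "A",  "red",
  "B",  "blu",
  "C",  "YellowPurple",
  "C",  "green"
)

# Hypothetical helper (not part of refinr): lower-case the string and
# remove every character that is not a letter or digit.
simple_key <- function(x) gsub("[^a-z0-9]", "", tolower(x))

# Add the key to both data frames and join on uid plus key.
matched <- df1 %>%
  mutate(key = simple_key(name)) %>%
  inner_join(
    df2 %>% mutate(key = simple_key(name)) %>% select(uid, key),
    by = c("uid", "key")
  )

matched
```

Differences in case and punctuation disappear in the key, so three rows match. "blu" and "blue" still produce different keys, which is why that row needs either a manual correction or a distance-based method.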
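I'm not sure this is the best use of refinr, which is mainly meant for harmonizing the spelling of words within a single column. What you want to do looks like a fuzzy join, and there is an R package for that: fuzzyjoin. A usage example could be: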
library(tidyverse)
library(fuzzyjoin)
df1 <- tribble(
~uid, ~name, ~value,
"A", "Red", 13,
"A", "violet", 145,
"B", "Blue", 3,
"B", "yellow", 56,
"C", "yellow-purple", 789,
"C", "green", 17
)
# The data frame to be completed with the values from df1
df2 <- tribble(
~uid, ~name,
"A", "red",
"B", "blu",
"C", "YellowPurple",
"C", "green"
)
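df3 <- df2 %>%
  # Fuzzy left join on the shared columns; rows match when their
  # soundex codes agree
  stringdist_left_join(df1,
                       by = c("uid", "name"),
                       distance_col = "dist",
                       method = "soundex") %>%
  # Keep df2's uid, but take the spelling of name from df1
  select(uid = uid.x, name = name.y, value)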
df3
# A tibble: 4 x 3
uid name value
<chr> <chr> <dbl>
1 A Red 13
2 B Blue 3
3 C yellow-purple 789
4 C green 17
I used the soundex algorithm, but there are other methods, all based on the stringdist package.
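As an illustration of one of those other methods, here is a minimal sketch using the Jaro-Winkler distance (method = "jw") instead of soundex. The threshold max_dist = 0.2 and the use of ignore_case = TRUE are assumptions chosen for this small example, not recommended defaults; on real data the threshold would need tuning.

```r
library(dplyr)
library(tibble)
library(fuzzyjoin)

df1 <- tribble(
  ~uid, ~name,           ~value,
  "A",  "Red",           13,
  "A",  "violet",        145,
  "B",  "Blue",          3,
  "B",  "yellow",        56,
  "C",  "yellow-purple", 789,
  "C",  "green",         17
)

df2 <- tribble(
  ~uid, ~name,
  "A",  "red",
  "B",  "blu",
  "C",  "YellowPurple",
  "C",  "green"
)

res <- df2 %>%
  stringdist_left_join(df1,
                       by = c("uid", "name"),
                       method = "jw",       # Jaro-Winkler distance, between 0 and 1
                       max_dist = 0.2,      # assumed threshold for this example
                       ignore_case = TRUE   # compare names case-insensitively
  ) %>%
  # Keep df2's uid, but take the spelling of name from df1
  select(uid = uid.x, name = name.y, value)

res
```

Unlike soundex, which only says whether two codes agree, a continuous distance lets you decide how strict the match should be. A stricter threshold reduces false matches at the cost of missing more distant spellings.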