R Tidyverse - Counting the number of times a word appears in a list by group

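I am currently working on the following:

I have two dataframes. One dataframe contains many inventors per company, and I want to know how often their names appear in the other dataframe within the same company. The company identifier (df_itemnumber_rounded) has the same name and exists in both dataframes.

Example:

The first dataframe consists of: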

df_itemnumber_rounded <- c("df_2002_77", "df_2002_77", "df_2002_77", "df_2002_78",
                           "df_2002_767")

assignees_split <- c("DSM IP ASSETS BV","DSM NV,GIST BROCADES NV","INRA INST NAT RECH AGRONOMIQUE","DE FORENEDE BRYGGERIER AS", "FORENEDE BRYGGERIER" )

The second dataframe consists of:

df_itemnumber_rounded <- c("df_2002_77", "df_2002_77", "df_2002_77", "df_2002_77", "df_2002_77")

citedp_assignee <- c("LANGEJAN AREND","PELLETIER RENE FRANCOIS ROGER ","LESAFFRE &amp; CIE","GIST BROCADES NV ", "DISTILLERS CO YEAST LTD ")

The goal is to have the first dataframe with a new variable ("count") that shows how often each name appears within the same company.

Like this:

df_itemnumber_rounded assignees_split count
df_2002_77 DSM IP ASSETS BV 0
df_2002_77 GIST BROCADES NV 1

I tried to do this with str_detect and sum, but I don't know how to do it by group instead of having it run over the whole dataframe.

counts <- test_distinct_cleaned %>% 
  group_by(df_itemnumber_rounded,assignees_split) %>% 
mutate(counts=map_int(tolower(test_distinct_cleaned$assignees_split),~sum(str_detect(tolower(match_with_cleaned$citedp_assignee),.x))))

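However, it takes very long and never seems to finish. I already tried the above solution without the group_by function on a smaller df, and it counted all occurrences of a name, not just those within the same company. So I am not sure whether this works with group_by, or whether there is a faster way. The first df has 17,000 rows, and the second one, which I match against, has over 150,000 rows...

Below is a real example of the data:

Dataframe 1 contains the assignees.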

    structure(list(df_itemnumber_rounded = c("df_2012_2175", "df_2012_2175", 
"df_2012_2175", "df_2012_2175", "df_2012_2175", "df_2012_2175", 
"df_2012_2175", "df_2002_4897", "df_2002_4897", "df_2012_9460", 
"df_2012_9460", "df_2012_9460", "df_2012_9460", "df_2016_6247", 
"df_2016_6247", "df_2016_6247", "df_2016_6248", "df_2016_6248", 
"df_2016_6248", "df_2016_6248", "df_2016_6248", "df_2016_6248", 
"df_2012_9459", "df_2012_9459", "df_2016_14178", "df_2016_14178", 
"df_2016_14689", "df_2016_14689", "df_2016_15814", "df_2016_15814", 
"df_2012_2360", "df_2012_2360", "df_2012_2360", "df_2012_2360", 
"df_2012_2360", "df_2012_2360", "df_2012_2360", "df_2012_8944", 
"df_2012_8944", "df_2012_8944", "df_2012_3604", "df_2012_3604", 
"df_2012_3604", "df_2012_3604", "df_2012_4967", "df_2012_4967", 
"df_2012_4967", "df_2012_7883", "df_2012_7883", "df_2012_7883", 
"df_2012_4836", "df_2012_4836", "df_2012_4836", "df_2002_18770", 
"df_2012_1305", "df_2012_2576", "df_2012_10710", "df_2012_5541", 
"df_2012_5578", "df_2012_5635", "df_2012_6044", "df_2012_6219", 
"df_2012_6317", "df_2012_6488", "df_2012_6500", "df_2012_6613", 
"df_2012_6615", "df_2012_6679", "df_2012_6928", "df_2012_6977", 
"df_2012_7489", "df_2012_7552", "df_2012_7667", "df_2012_8017", 
"df_2012_8302", "df_2012_8555", "df_2012_8809", "df_2012_9085", 
"df_2012_9171", "df_2012_9298", "df_2012_9813", "df_2012_10236", 
"df_2012_10437", "df_2012_10532", "df_2012_10602", "df_2012_11037", 
"df_2012_11070", "df_2012_11183", "df_2012_11606", "df_2012_12362", 
"df_2012_12618", "df_2016_678", "df_2016_790", "df_2016_1079", 
"df_2016_1414", "df_2016_1539", "df_2016_2074", "df_2016_2167", 
"df_2016_2314", "df_2016_2769"), `Publication Number` = c("WO2006046567A1", 
"WO2006046567A1", "WO2006046567A1", "WO2006046567A1", "WO2006046567A1", 
"WO2006046567A1", "WO2006046567A1", "DE3149931A1", "DE3149931A1", 
"WO2013002007A1", "WO2013002007A1", "WO2013002007A1", "WO2013002007A1", 
"WO2016114276A1", "WO2016114276A1", "WO2016114276A1", "WO2016114277A1", 
"WO2016114277A1", "WO2016114277A1", "WO2016114277A1", "WO2016114277A1", 
"WO2016114277A1", "JP2013005781A", "JP2013005781A", "WO2017094654A1", 
"WO2017094654A1", "JP2017112924A", "JP2017112924A", "WO2017169107A1", 
"WO2017169107A1", "WO2006070828A1", "WO2006070828A1", "WO2006070828A1", 
"WO2006070828A1", "WO2006070828A1", "WO2006070828A1", "WO2006070828A1", 
"JP2012183063A", "JP2012183063A", "JP2012183063A", "WO2007097088A1", 
"WO2007097088A1", "WO2007097088A1", "WO2007097088A1", "WO2009017116A1", 
"WO2009017116A1", "WO2009017116A1", "WO2011145670A1", "WO2011145670A1", 
"WO2011145670A1", "WO2008153118A1", "WO2008153118A1", "WO2008153118A1", 
"JP2013066497A", "JP2011030577A", "JP2011142922A", "JP2012105673A", 
"JP2009213393A", "JP2009225740A", "JP2009254247A", "AU2008297027A1", 
"JP2010130902A", "JP2010136658A", "JP2010207213A", "JP2010207217A", 
"JP2010220529A", "JP2010220536A", "JP2010252640A", "JP2011036129A", 
"JP2011030517A", "JP2011135841A", "JP2011142890A", "JP2011206047A", 
"JP2012024081A", "JP2012055235A", "JP2012105572A", "JP2012147775A", 
"JP2012213373A", "WO2012147465A1", "JP2012244965A", "JP2013042751A", 
"WO2013073628A1", "JP2013143938A", "JP2013150602A", "JP2013165707A", 
"JP2013255490A", "JP2013243970A", "JP2014000055A", "JP2014057537A", 
"EP2737810A1", "JP2014128251A", "JP2014217315A", "WO2014192826A1", 
"JP2015015926A", "WO2015029605A1", "JP2015053920A", "WO2015064748A1", 
"JP2015116187A", "JP2015100294A", "WO2015098744A1"), assignees_split = structure(c(19L, 
15L, 17L, 1L, 4L, 5L, 18L, 20L, 16L, 21L, 11L, 12L, 14L, 21L, 
12L, 14L, 21L, 12L, 15L, 2L, 3L, 8L, 21L, 14L, 21L, 14L, 21L, 
14L, 21L, 14L, 21L, 15L, 17L, 6L, 7L, 9L, 10L, 22L, 15L, 16L, 
23L, 13L, 15L, 16L, 23L, 15L, 16L, 23L, 15L, 16L, 24L, 15L, 16L, 
25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 
25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 
25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 
25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L), .Label = c(" FURUKUBO S ", 
" IBUSUKI D ", " ICHIMURA A ", " IZUMI T ", " KAKUDO Y ", " KIMURA M ", 
" MAKI H ", " MIYAO Y", " NAGAO K ", " NAKAHARA K", " SUNTORY BEVERAGE&FOOD LTD ", 
" SUNTORY FOOD & BEVERAGE CO LTD ", " SUNTORY HOLDINGS CO LTD ", 
" SUNTORY HOLDINGS LTD", " SUNTORY HOLDINGS LTD ", " SUNTORY LTD", 
" SUNTORY LTD ", " TAKAOKA S", "KOMATSU MFG CO LTD ", "SUN CHEM CORP ", 
"SUNTORY BEVERAGE & FOOD LTD ", "SUNTORY FOOD & BEVERAGE CO LTD ", 
"SUNTORY HOLDING LTD ", "SUNTORY HOLDINGS CO LTD ", "SUNTORY HOLDINGS LTD"
), class = "factor")), row.names = c(NA, -100L), class = c("tbl_df", 
"tbl", "data.frame"))

Dataframe 2 contains the assignees of the cited patents, hence cited_assignees.

structure(list(df_itemnumber_rounded = c("df_2012_2175", "df_2012_2175", 
"df_2012_2175", "df_2012_2175", "df_2012_2175", "df_2012_2175", 
"df_2002_4897", "df_2012_9460", "df_2012_9460", "df_2012_9460", 
"df_2012_9460", "df_2012_9460", "df_2012_9460", "df_2012_9460", 
"df_2016_6247", "df_2016_6247", "df_2016_6247", "df_2016_6247", 
"df_2016_6247", "df_2016_6247", "df_2016_6248", "df_2016_6248", 
"df_2016_6248", "df_2016_6248", "df_2016_6248", "df_2016_6248", 
"df_2012_9459", "df_2012_9459", "df_2016_14178", "df_2016_14178", 
"df_2016_14178", "df_2016_14178", "df_2016_14178", "df_2016_14178", 
"df_2016_14178", "df_2016_14178", "df_2016_14178", "df_2016_14178", 
"df_2016_14178", "df_2016_14178", "df_2016_14178", "df_2016_14689", 
"df_2016_14689", "df_2016_14689", "df_2016_14689", "df_2016_14689", 
"df_2016_14689", "df_2016_15814", "df_2016_15814", "df_2016_15814", 
"df_2016_15814", "df_2016_15814", "df_2012_2360", "df_2012_2360", 
"df_2012_2360", "df_2012_2360", "df_2012_2360", "df_2012_2360", 
"df_2012_2360", "df_2012_2360", "df_2012_2360", "df_2012_2360", 
"df_2012_2360", "df_2012_2360", "df_2012_8944", "df_2012_3604", 
"df_2012_3604", "df_2012_3604", "df_2012_3604", "df_2012_3604", 
"df_2012_3604", "df_2012_3604", "df_2012_3604", "df_2012_3604", 
"df_2012_4967", "df_2012_4967", "df_2012_4967", "df_2012_4967", 
"df_2012_4967", "df_2012_4967", "df_2012_4967", "df_2012_4967", 
"df_2012_4967", "df_2012_4967", "df_2012_4967", "df_2012_4967", 
"df_2012_4967", "df_2012_4967", "df_2012_4967", "df_2012_4967", 
"df_2012_4967", "df_2012_4967", "df_2012_4967", "df_2012_4967", 
"df_2012_7883", "df_2012_7883", "df_2012_7883", "df_2012_7883", 
"df_2012_7883", "df_2012_4836"), citedp_assignee = structure(c(40L, 
23L, 48L, 6L, 15L, 13L, 12L, 30L, 1L, 52L, 53L, 58L, 56L, 3L, 
52L, 52L, 16L, 66L, 52L, 67L, 16L, 66L, 52L, 68L, 52L, 51L, 63L, 
3L, 60L, 27L, 45L, 52L, 9L, 73L, 33L, 25L, 52L, 50L, 62L, 73L, 
64L, 2L, 59L, 28L, 28L, 52L, 46L, 4L, 42L, 1L, 42L, 43L, 31L, 
70L, 32L, 57L, 39L, 14L, 35L, 11L, 56L, 32L, 32L, 55L, 17L, 22L, 
22L, 56L, 56L, 5L, 36L, 56L, 41L, 21L, 20L, 29L, 18L, 19L, 47L, 
7L, 63L, 4L, 38L, 44L, 65L, 56L, 44L, 26L, 34L, 65L, 69L, 10L, 
24L, 8L, 71L, 49L, 54L, 61L, 72L, 37L), .Label = c("  ", "ABURADA MASAKI,TOKIWA PHYTOCHEMICAL CO LTD ", 
"ASAHI BREWERIES LTD", "ASAHI BREWERIES LTD ", "ASAHI BREWERIES LTD,JIBIKI MAKIKO ", 
"CERESTAR HOLDING BV ", "CHOYA UMESHU CO LTD ", "CHUKO SHUZO KK", 
"CREAGRI INC,CREA ROBERTO,MATTEUZZI FRANCESCO,ASTORE STEFANO,MILIONI IVANO ", 
"DEZAINAA FOODS KYOKAI KK ", "FUKUI HISASHI ", "GULF OIL CORP", 
"HOKKAIDO WAIN KK", "IKEDA SHOKKEN KK ", "INSUCHI PUROBUREMU NADEJINOSUT ", 
"ITO EN LTD ", "ITOEN KK", "JAPAN MAIZE PROD ", "KANEBO LTD ", 
"KANEBO LTD,HASEGAWA T CO LTD ", "KIRIN BREWERY", "KIRIN BREWERY ", 
"KIRIN BREWERY,JAPAN MAIZE PROD ", "KOHJIN CO ", "LION CORP ", 
"MANNS WINE CO LTD ", "MARUNAKA SHOKUHIN KK ", "MARUZEN PHARMA ", 
"MATSUSHITA ELECTRIC IND CO LTD ", "MATSUTANI KAGAKU KOGYO KK ", 
"MEIJI DAIRIES CORP ", "MEIJI MILK PROD CO LTD ", "MILLER BREWING INTERNATIONAL INC,LUSK LANCE T,RYDER DAVID S ", 
"MITSUBISHI CHEM IND ", "MITSUBISHI HEAVY IND LTD,KADO TAKASHI,ISOZAKI TOSHIKAZU ", 
"MIYAGI PREFECTURE,MIYAGIKEN SHUZO KYODO KUMIAI ", "NAT TAX ADMINISTRATION AGENCY ", 
"NIPPON BEET SUGAR MFG ", "OGAWA &amp; CO LTD ", "OKURA SYUZO KK ", 
"OZEKI KK ", "PEPSICO INC ", "PEPSICO INC,LEE THOMAS,NATTRESS LAURA ANN,RIHA WILLIAM", 
"POLA CHEM IND INC ", "RIVERSON KK ", "SAN EI GEN FFI INC", "SAN EI GEN FFI INC ", 
"SAPPORO BREWERIES ", "SAPPORO BREWERIES,OONO MASAO,SANO TOMOHIRO ", 
"SHOWA PHARM CHEM IND ", "SUNTORY HOLDINGS LTD", "SUNTORY HOLDINGS LTD ", 
"SUNTORY HOLDINGS LTD,IDO YOSHIHIRO,KOMINE TETSUYA ", "SUNTORY HOLDINGS LTD,KAGEYAMA NORIHIKO,INUI TAKAKO,TAKAGI DAISUKE ", 
"SUNTORY LTD", "SUNTORY LTD ", "SUNTORY LTD,KAGEYAMA NORIHIKO,NAKAHARA KOICHI,INUI TAKAKO,TAKAOKA SEISUKE,NAGAMI KENZO ", 
"SUNTORY LTD,WATANABE TOKUTOMI,DAIDO HIROMI,YOSHIHIRO AKIRA ", 
"SYMRISE GMBH &amp; CO KG ", "TAISHO PHARMA CO LTD ", "TAKARA HOLDINGS INC ", 
"TAKARA SHOKUHIN KK ", "TAKARA SHUZO CO ", "TAKASAGO PERFUMERY CO LTD", 
"TOKYO SHIBAURA ELECTRIC CO ", "TOYO SEIKAN KAISHA LTD ", "TROPICANA PROD INC", 
"TROPICANA PROD INC ", "TSUKISHIMA KIKAI CO ", "UNILEVER PLC,UNILEVER NV,LEVER HINDUSTAN LTD ", 
"WOONGJIN FOODS CO LTD ", "YAKULT HONSHA KK,KUMAMOTOKEN KAJITSU NOGYO KYOD", 
"YAMADA YASUYUKI "), class = "factor")), row.names = c(NA, -100L
), class = c("tbl_df", "tbl", "data.frame"))

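Here is a possible solution. Note that your assignee and citedp variables have whitespace at the beginning/end, which you probably don't want to be considered in your string search:

library(tidyverse)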

df1 <- data.frame(df_itemnumber_rounded = c("df_2002_77", "df_2002_77", "df_2002_77", "df_2002_78", "df_2002_767"),
                  assignees_split = c("DSM IP ASSETS BV","DSM NV,GIST BROCADES NV","INRA INST NAT RECH AGRONOMIQUE","DE FORENEDE BRYGGERIER AS", "FORENEDE BRYGGERIER" ))


df2 <- data.frame(citedp_assignee = c("LANGEJAN AREND","PELLETIER RENE FRANCOIS ROGER ","LESAFFRE &amp; CIE","GIST BROCADES NV ", "DISTILLERS CO YEAST LTD "))

The solution below gives you the full information on which assignees appear in the citedp variable:

df1 |> 
  mutate(is_match = str_detect(tolower(assignees_split),
                               paste0(tolower(trimws(df2$citedp_assignee, which = "both")),
                                      collapse = "|"))) |>
  group_by(df_itemnumber_rounded, assignees_split) |> 
  summarize(counts = sum(is_match))

which gives:

# A tibble: 5 x 3
# Groups:   df_itemnumber_rounded [3]
  df_itemnumber_rounded assignees_split                counts
  <chr>                 <chr>                           <int>
1 df_2002_767           FORENEDE BRYGGERIER                 0
2 df_2002_77            DSM IP ASSETS BV                    0
3 df_2002_77            DSM NV,GIST BROCADES NV             1
4 df_2002_77            INRA INST NAT RECH AGRONOMIQUE      0
5 df_2002_78            DE FORENEDE BRYGGERIER AS           0
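
To see how the match is made, it can help to look at the pattern that `paste0(..., collapse = "|")` builds. The following is a minimal sketch on the example data; the object names `cited` and `pattern` are only for illustration. It assumes the names contain no regex metacharacters, since they are used as regular expressions here.

```r
library(stringr)

# Cited names from the example, trimmed and lowercased
cited <- c("LANGEJAN AREND", "PELLETIER RENE FRANCOIS ROGER ",
           "LESAFFRE &amp; CIE", "GIST BROCADES NV ", "DISTILLERS CO YEAST LTD ")
pattern <- paste0(tolower(trimws(cited, which = "both")), collapse = "|")

# One alternation: a string matches if it contains ANY of the cited names
pattern

# TRUE only for the assignee string that contains "gist brocades nv"
str_detect(tolower(c("DSM IP ASSETS BV", "DSM NV,GIST BROCADES NV")), pattern)
```

Because the pattern pools every cited name, the match is made against all of df2, not only against the rows that share the same company identifier.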

If you only want to get the IDs with at least one match, you can do:

df1 |> 
  mutate(is_match = str_detect(tolower(assignees_split),
                               paste0(tolower(trimws(df2$citedp_assignee, which = "both")),
                                      collapse = "|"))) |>  
  group_by(df_itemnumber_rounded, assignees_split) |> 
  summarize(counts = sum(is_match)) |> 
  filter(!all(counts == 0))

which gives:

# A tibble: 3 x 3
# Groups:   df_itemnumber_rounded [1]
  df_itemnumber_rounded assignees_split                counts
  <chr>                 <chr>                           <int>
1 df_2002_77            DSM IP ASSETS BV                    0
2 df_2002_77            DSM NV,GIST BROCADES NV             1
3 df_2002_77            INRA INST NAT RECH AGRONOMIQUE      0

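Another attempt, now that the TO has provided some example data.

The idea is as follows:

  • Note: I named the two data frames df1 (assignees) and df2 (companies).
  • We first do some basic cleaning of the character columns, e.g. converting everything to lowercase and removing whitespace at the beginning/end.
  • Then we go through each row/assignee in df1 (hence rowwise) and count how often this assignee name appears in df2 for the same company identifier.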

library(tidyverse)

df1 <- df1 |> 
  mutate(assignee_new = tolower(trimws(assignees_split, "both")))

df2 <- df2 |> 
  mutate(citedp_new = tolower(trimws(citedp_assignee, "both")))


df1 |> 
  rowwise() |>  
  mutate(assignee_count = length(str_which(df2$citedp_new[df2$df_itemnumber_rounded == df_itemnumber_rounded],
                                           assignee_new))) |> 
  ungroup() |> 
  print(n = 100)

which gives (only the first 20 rows are shown):

# A tibble: 100 x 5
    df_itemnumber_rounded `Publication Number` assignees_split                    assignee_new                   assignee_count
    <chr>                 <chr>                <fct>                              <chr>                                   <int>
  1 df_2012_2175          WO2006046567A1       "KOMATSU MFG CO LTD "              komatsu mfg co ltd                          0
  2 df_2012_2175          WO2006046567A1       " SUNTORY HOLDINGS LTD "           suntory holdings ltd                        0
  3 df_2012_2175          WO2006046567A1       " SUNTORY LTD "                    suntory ltd                                 0
  4 df_2012_2175          WO2006046567A1       " FURUKUBO S "                     furukubo s                                  0
  5 df_2012_2175          WO2006046567A1       " IZUMI T "                        izumi t                                     0
  6 df_2012_2175          WO2006046567A1       " KAKUDO Y "                       kakudo y                                    0
  7 df_2012_2175          WO2006046567A1       " TAKAOKA S"                       takaoka s                                   0
  8 df_2002_4897          DE3149931A1          "SUN CHEM CORP "                   sun chem corp                               0
  9 df_2002_4897          DE3149931A1          " SUNTORY LTD"                     suntory ltd                                 0
 10 df_2012_9460          WO2013002007A1       "SUNTORY BEVERAGE & FOOD LTD "     suntory beverage & food ltd                 0
 11 df_2012_9460          WO2013002007A1       " SUNTORY BEVERAGE&FOOD LTD "      suntory beverage&food ltd                   0
 12 df_2012_9460          WO2013002007A1       " SUNTORY FOOD & BEVERAGE CO LTD " suntory food & beverage co ltd              0
 13 df_2012_9460          WO2013002007A1       " SUNTORY HOLDINGS LTD"            suntory holdings ltd                        2
 14 df_2016_6247          WO2016114276A1       "SUNTORY BEVERAGE & FOOD LTD "     suntory beverage & food ltd                 0
 15 df_2016_6247          WO2016114276A1       " SUNTORY FOOD & BEVERAGE CO LTD " suntory food & beverage co ltd              0
 16 df_2016_6247          WO2016114276A1       " SUNTORY HOLDINGS LTD"            suntory holdings ltd                        3
 17 df_2016_6248          WO2016114277A1       "SUNTORY BEVERAGE & FOOD LTD "     suntory beverage & food ltd                 0
 18 df_2016_6248          WO2016114277A1       " SUNTORY FOOD & BEVERAGE CO LTD " suntory food & beverage co ltd              0
 19 df_2016_6248          WO2016114277A1       " SUNTORY HOLDINGS LTD "           suntory holdings ltd                        3
 20 df_2016_6248          WO2016114277A1       " IBUSUKI D "                      ibusuki d                                   0
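
One caveat: `str_which()` interprets `assignee_new` as a regular expression, so a name containing characters such as `.`, `(` or `+` may match more (or less) than intended. Below is a hedged sketch of literal matching with stringr's `fixed()`; the vectors `cited` and `name` are made-up values, not taken from the data above.

```r
library(stringr)

# Hypothetical names for illustration only
cited <- c("a.b holdings", "axb holdings", "other co")
name  <- "a.b holdings"

# As a regex, "." matches any character, so the first two names both match
length(str_which(cited, name))         # 2

# With fixed(), the name is compared literally and only the exact text matches
length(str_which(cited, fixed(name)))  # 1
```

If the real names can contain such characters, wrapping the pattern in `fixed()` inside the `mutate()` call is probably the safer choice.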

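I tested the rowwise approach by increasing df1 to 10,000 rows and df2 to 150,000 rows (of course, your real-life patterns might be more complex), and it ran fine, finishing in about 20 seconds on my machine: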

df1_new <- df1 |> 
  slice(rep(1:n(), each = 100))

df2_new <- df2 |> 
  slice(rep(1:n(), each = 1500))

t1 <- Sys.time()

df1_new |> 
  rowwise() |>  
  mutate(assignee_count = length(str_which(df2_new$citedp_new[df2_new$df_itemnumber_rounded == df_itemnumber_rounded],
                                           assignee_new))) |> 
  ungroup()

Sys.time() - t1

Time difference of 20.36699 secs
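
A fair share of that time is likely spent re-evaluating `df2$df_itemnumber_rounded == df_itemnumber_rounded` for every row of df1. One possible alternative, sketched here on small made-up data, is to split the cited names by company identifier once and then look up each group by name. The data and the helper name `count_in_company` are hypothetical, and `fixed()` is used for literal matching, which differs from the regex matching above. I have not timed this on the real data.

```r
library(stringr)

# Small made-up data with the same column names as above
df1 <- data.frame(
  df_itemnumber_rounded = c("id_1", "id_1", "id_2", "id_3"),
  assignee_new = c("suntory ltd", "kirin brewery", "suntory ltd", "asahi")
)
df2 <- data.frame(
  df_itemnumber_rounded = c("id_1", "id_1", "id_1", "id_2"),
  citedp_new = c("suntory ltd", "suntory ltd,watanabe t",
                 "asahi breweries ltd", "kirin brewery")
)

# Split the cited names by company identifier once, up front
cited_by_id <- split(df2$citedp_new, df2$df_itemnumber_rounded)

# Hypothetical helper: count the cited names of one company that contain `name`
count_in_company <- function(id, name) {
  cited <- cited_by_id[[id]]
  if (is.null(cited)) return(0L)  # this company has no cited assignees
  sum(str_detect(cited, fixed(name)))
}

# Apply the helper to each (identifier, name) pair of df1
df1$assignee_count <- unname(mapply(count_in_company,
                                    df1$df_itemnumber_rounded,
                                    df1$assignee_new))
df1
```

Here the first row gets a count of 2, because "suntory ltd" occurs in two cited entries of `id_1`, and all other rows get 0.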