Mapply for multiple arguments

Question

I have a large dataset with about 20 million observations. I want to compute the Jaccard index between TitleAbstract.x1 and TitleAbstract.y1 for each row.

Here is a two-observation sample:

    structure(list(Patent = c(6326004L, 6514936L), TitleAbstract.x = c("mechanical multiplier purpose speed steering control hydrostatic system invention concerned improvement control system hydrostatic drive vehicle comprising pair hydrostatic pumps output adjustable moving arm attached servo valve controlling displacement said pumps, pump powering respective hydraulic motor drives respective ground engaging means said vehicle. improvement present invention mechanically controls speed steering functions system. comprises pair adjusting means, one communicating pumps, comprising frame adjacent pump, first crank mounted centrally frame, first end first crank drivingly linked arm; second crank mounted centrally frame, first end second crank drivingly linked second end first crank third crank mounted centrally frame, first end third crank drivingly linked second end first crank second end third crank drivingly linked steering linkage means. improved arrangement includes tying means drivingly mounted adjacent second end second cranks linking movement thereof.", 
"mechanical multiplier purpose speed steering control hydrostatic system invention concerned improvement control system hydrostatic drive vehicle comprising pair hydrostatic pumps output adjustable moving arm attached servo valve controlling displacement said pumps, pump powering respective hydraulic motor drives respective ground engaging means said vehicle. improvement present invention mechanically controls speed steering functions system. comprises pair adjusting means, one communicating pumps, comprising frame adjacent pump, first crank mounted centrally frame, first end first crank drivingly linked arm; second crank mounted centrally frame, first end second crank drivingly linked second end first crank third crank mounted centrally frame, first end third crank drivingly linked second end first crank second end third crank drivingly linked steering linkage means. improved arrangement includes tying means drivingly mounted adjacent second end second cranks linking movement thereof."
), cited = c(4261928L, 4261928L), TitleAbstract.y = c("antiviral methods using fragments human rhinovirus receptor (icam-1) ", 
"antiviral methods using human rhinovirus receptor (icam-1) method substantially inhibiting initiation spread infection rhinovirus coxsackie virus host cells expressing major human rhinovirus receptor (icam-1), comprising step contacting virus soluble polypeptide comprising hrv binding site domains ii icam-1; polypeptide capable binding virus reducing infectivity thereof; contact conditions permit virus bind polypeptide."
), Jaccard = c(0, 0.00909090909090909)), row.names = c(NA, -2L
), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x7f9c8f801778>, sorted = "cited", .Names = c("Patent", 
"TitleAbstract.x", "cited", "TitleAbstract.y", "Jaccard"))

Following previous posts, I used a homemade equation to calculate the Jaccard index and wrapped it in a function, then ran it with mapply, but I got an error: 'this is not a function'.

Jaccard_Index <- function(x,y)
{
  return(mapply(length(intersect(unlist(strsplit(df$TitleAbstract.x1, "\\s+")),unlist(strsplit(df$TitleAbstract.y1, "\\s+")))) / length(union(unlist(strsplit(df$TitleAbstract.x1, "\\s+")),unlist(strsplit(df$TitleAbstract.y1, "\\s+"))))))
}

mapply(Jaccard_Index,df$TitleAbstract.x1,df$TitleAbstract.y1)

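I tried changing TitleAbstract.x1 and TitleAbstract.y1 to x and y, but I still get the same error.

This may be a newbie question, but could anyone help me write the correct function?

I also have two further questions:

Q2: How can I use parallel and mcapply to speed up the process?

Q3: What are R's limits in terms of memory and speed, and would you recommend a different approach (e.g. running Python from bash) for long-running, memory-intensive processes?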

Edit

I have uploaded the correct dataset; I had to update my RStudio to avoid truncating the dataset.

Answer 1

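I simplified your dataset a bit. You could use stringdist() from the package of the same name, but that does not compute the Jaccard index on a word-by-word basis, so I fixed your Jaccard_Index() instead. The version below uses mapply(); if you want to parallelize it, replace that call with mcmapply() from the parallel package.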
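On the error itself: mapply() takes a function as its first argument (FUN). In the question's Jaccard_Index(), the inner mapply() call is handed the numeric result of length(...) / length(...) in that slot, so R cannot resolve it to a function. Here is a small sketch of the same failure, using made-up inputs. The exact wording of the error message may differ between R versions and locales.

```r
# mapply() expects a function first; a number in that slot is rejected
res <- tryCatch(
    mapply(0.5, c("a b", "c d"), c("a c", "d e")),
    error = function(e) conditionMessage(e)
)
print(res)

# Passing a function first, then the vectors, works element by element
out <- mapply(function(x, y) paste(x, y), c("a", "b"), c("c", "d"),
              USE.NAMES = FALSE)
print(out)
# [1] "a c" "b d"
```

The fix is therefore to keep the function body free of mapply() and of df$..., and to let the outer mapply() feed one x and one y at a time.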

df <- data.frame(
Patent=1:3, 
TitleAbstract.x1=c(
"methods testing oligonucleotide arrays methods testing oligonucleotide",
"isolation cellular material microscopic visualization method microdissection",
"support method determining analyte method producing support method producing"), 
TitleAbstract.y1=c(
"support method determining analyte method producing support method producing",
"method utilizing convex geometry laser capture microdissection process",
"methods testing oligonucleotide arrays methods testing oligonucleotide"),
stringsAsFactors=FALSE)


Jaccard_Index <- function(x, y) {
    # Accept either a single string or an already split vector of words
    if (length(x) == 1) {
        x <- strsplit(x, "\\s+")[[1]]
    }
    if (length(y) == 1) {
        y <- strsplit(y, "\\s+")[[1]]
    }
    # Shared distinct words divided by all distinct words
    length(intersect(x, y)) / length(union(x, y))
}

# Splitting the strings outside the loop appears to be quicker
df$TitleAbstract.x1 <- strsplit(df$TitleAbstract.x1, "\\s+")
df$TitleAbstract.y1 <- strsplit(df$TitleAbstract.y1, "\\s+")

mapply(Jaccard_Index, df$TitleAbstract.x1, df$TitleAbstract.y1, USE.NAMES=FALSE)
# [1] 0.0000000 0.1538462 0.0000000
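
For Q2, a minimal sketch of the mcmapply() swap mentioned above. The helper name jaccard_words, the toy input, and the choice of two cores are assumptions for illustration, not part of the original data. mcmapply() relies on forking, which is not available on Windows, so the sketch falls back to a single core there.

```r
library(parallel)

# Hypothetical helper: Jaccard index of two word vectors
jaccard_words <- function(x, y) {
    length(intersect(x, y)) / length(union(x, y))
}

# Small made-up input, split into words once, up front
x_words <- strsplit(c("a b c", "a b", "x y"), "\\s+")
y_words <- strsplit(c("a b d", "c d", "x y"), "\\s+")

# Forking is unavailable on Windows, where mc.cores must be 1
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

res <- mcmapply(jaccard_words, x_words, y_words,
                mc.cores = n_cores, USE.NAMES = FALSE)
print(res)
# [1] 0.5 0.0 1.0
```

Whether this is actually faster depends on the data: each call is very cheap, so the cost of distributing the work can outweigh the gain on small inputs. It is worth timing on a sample before running all 20 million rows.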
