How do I match elements in rows in two separate data frames, for differing values and counts?

Question

I have two data frames, and I have reproduced the real data below. The first data frame looks like this:

FirstDataFrame <- data.frame("GroupID"   = c(1902, 1905, 1905, 1905, 1906, 1906, 1914, 1914, 1932, 1932, 1964, 1964, 1964), 
                         "SubjectID" = c(24626, 13300, 14126, 2619, 914, 872, 13325, 12539, 12597, 13314, 13343, 1723, 13333),
                         "Age"       = c(17, 13, 16, 17, 5, 9, 8, 14, 10, 13, 7, 14, 16))

The second data frame looks like this, where the value in each Age column is a count:

SecondDataFrame <- data.frame("OtherID" = c(1, 2, 3, 4, 5, 6),
                          "Age5" =c(0, 0, 0, 11, 12, 57),
                          "Age6"= c(0, 0, 0, 12, 8, 52),
                          "Age7" = c(0, 0, 0, 12, 9, 42),
                          "Age8" = c(0, 0, 0, 9, 11, 50),
                          "Age9" = c(0, 0, 0, 12, 7, 46),
                          "Age10" = c(0, 0, 0, 12, 11, 41),
                          "Age11" = c(19, 0, 0, 9, 8, 42),
                          "Age12" = c(14, 0, 0, 13, 12, 39),
                          "Age13" = c(54, 78, 83, 0, 3, 13),
                          "Age14" = c(69, 101, 145, 0, 0, 0),
                          "Age15" = c(59, 114, 128, 0, 0, 0),
                          "Age16" = c(77, 127, 107, 0, 0, 0),
                          "Age17" = c(61, 91, 82, 0, 0, 0))

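Each AgeX column in SecondDataFrame corresponds to one specific single year of age, as the column name indicates.

My objective is, for each GroupID in FirstDataFrame:

1. Extract the age values (maybe into a vector?). Note that some ages within a GroupID may be the same. For example, I may have two 14-year-olds.
2. In SecondDataFrame, find the OtherID row with the highest frequency of matches to the ages in the GroupID, where the row has a count of at least 1 for each of those ages (or at least 2, if an age is doubled up). That is, find the OtherID row (or perhaps vector; I have considered setting up one vector per OtherID) with the highest frequency of column (vector index?) matches.
3. Within the GroupID, assign the selected OtherID to each SubjectID that meets this criterion.
4. For each match to an OtherID, decrease the count in the associated AgeX column by 1.
5. Repeat within the GroupID until every SubjectID has an age match from SecondDataFrame.
6. Loop to the next GroupID.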

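As you can see in FirstDataFrame, I have GroupIDs whose subjects cannot all be assigned to the same OtherID in SecondDataFrame. I also have differing numbers of subjects in each GroupID.

To complicate matters further, the OtherID age columns have no clean cut-off: it is not the case that every OtherID with a non-zero Age11 count has non-zero counts in all of the columns Age5 to Age10, or in all of the columns Age12 to Age17.

I have cleaned the data so that the AgeX counts in SecondDataFrame cover at least the number of subjects of that age in FirstDataFrame. The minimum and maximum ages in FirstDataFrame and SecondDataFrame have been set to match exactly.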
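The capacity condition described above can be checked with a short sketch like the following. The names `ages`, `needed` and `available` are illustrative only, and only the Age column of FirstDataFrame is needed here. Note that passing this check is necessary but not sufficient for every GroupID to fit into a single OtherID.

```r
# Data copied from the question (only Age is needed from FirstDataFrame).
FirstDataFrame <- data.frame(
  Age = c(17, 13, 16, 17, 5, 9, 8, 14, 10, 13, 7, 14, 16))
SecondDataFrame <- data.frame(
  OtherID = 1:6,
  Age5  = c(0, 0, 0, 11, 12, 57),   Age6  = c(0, 0, 0, 12, 8, 52),
  Age7  = c(0, 0, 0, 12, 9, 42),    Age8  = c(0, 0, 0, 9, 11, 50),
  Age9  = c(0, 0, 0, 12, 7, 46),    Age10 = c(0, 0, 0, 12, 11, 41),
  Age11 = c(19, 0, 0, 9, 8, 42),    Age12 = c(14, 0, 0, 13, 12, 39),
  Age13 = c(54, 78, 83, 0, 3, 13),  Age14 = c(69, 101, 145, 0, 0, 0),
  Age15 = c(59, 114, 128, 0, 0, 0), Age16 = c(77, 127, 107, 0, 0, 0),
  Age17 = c(61, 91, 82, 0, 0, 0))

ages <- 5:17

# Number of subjects of each age, with explicit zeros for unused ages.
needed <- as.integer(table(factor(FirstDataFrame$Age, levels = ages)))

# Total count available for each age, summed over all OtherID rows.
available <- colSums(SecondDataFrame[, paste0("Age", ages)])

# TRUE if every age has at least as many slots as subjects.
all(available >= needed)
```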

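How do I ensure the maximum number of matches and decrement the counts appropriately? I have found some questions/answers about obtaining the maximum number of matches. However:

They do a simple test of one vector against another, and/or
They do not decrement the counts in the matched vector; they only test whether elements are present (or absent), or how many values in one vector match values in another.

I could use nested for() loops, but I am confused about how to do the frequency matching and decrement the counts. I was thinking I need to start matching at the minimum age in the GroupID and work upward through the ages in the GroupID, and that is where I am stuck.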

Edit: The final FirstDataFrame would look like:

FirstDataFrame <- data.frame("GroupID"   = c(1902, 1905, 1905, 1905, 1906, 1906, 1914, 1914, 1932, 1932, 1964, 1964, 1964),
                         "SubjectID" = c(24626, 13300, 14126, 2619, 914, 872, 13325, 12539, 12597, 13314, 13343, 1723, 13333),
                         "Age"       = c(17, 13, 16, 17, 5, 9, 8, 14, 10, 13, 7, 14, 16),
                         "OtherID"   = c(2, 3, 3, 3, 6, 6, 6, 3, 6, 6, 6, 3, 3))

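However, the OtherID would also be chosen probabilistically. For example, the three teenagers in GroupID 1905 also have some probability of ending up in OtherID 1 or 2.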

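Correspondingly, each match reduces the matching age cell in SecondDataFrame by 1. So for GroupID 1905 assigned to OtherID 3, that OtherID would end up with the counts Age13 = 82, Age16 = 106, Age17 = 81, each 1 less than the original count, because each child matched uses up one available match.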

Answer 1

This is a long and tricky question, and I am not sure I have addressed all of it.

Here is how I would approach it. First split FirstDataFrame by GroupID, which gives a list:

split_df <- split(FirstDataFrame, FirstDataFrame$GroupID)
split_df
#$`1902`
#  GroupID SubjectID Age
#1    1902     24626  17
#
#$`1905`
#  GroupID SubjectID Age
#2    1905     13300  13
#3    1905     14126  16
#4    1905      2619  17
#
#$`1906`
#  GroupID SubjectID Age
#5    1906       914   5
#6    1906       872   9
#
#$`1914`
#  GroupID SubjectID Age
#7    1914     13325   8
#8    1914     12539  14
# ...

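Now I will focus on a single case, and afterwards we loop over all of them with a for loop. I pick the second element of the list, 1905. First extract the ages of that group; then I want to turn them into a frequency vector (rather than a vector of ages). I don't know a better way, so here is an inelegant solution: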

i = 2
ages <- split_df[[i]]$Age
ages
#[1] 13 16 17

ind_ages <- ages - 4 # "Indexize" ages: age 5 becomes 1, 6 becomes 2, ..., 17 becomes 13
ind_ages
#[1]  9 12 13

freq <- tabulate(ind_ages, nbins = 13)
freq
#[1] 0 0 0 0 0 0 0 0 1 0 0 1 1

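The frequency vector has length 13, with ones in positions 9, 12 and 13 and zeros everywhere else. The length is chosen so that it lines up with columns 2 to 14 of SecondDataFrame.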
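One possible alternative to the index shift, offered only as a sketch: treat the ages as a factor whose levels are the full age range 5 to 17, so that table() reports a zero for every age that does not occur. This assumes the age range is known up front, as it is here.

```r
ages <- c(13, 16, 17)   # ages of GroupID 1905, from the question

# Counting against fixed levels 5..17 yields one entry per age column,
# including zeros for ages that are absent from this group.
freq <- as.integer(table(factor(ages, levels = 5:17)))
freq
```

This gives the same vector as tabulate(ages - 4, nbins = 13), without having to remember the offset.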

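Now you can derive a way to randomly assign an OtherID to these children. One possibility is to use the multinomial likelihood: given a set of probabilities of landing in each bin, what is the chance of getting 3 balls that fall into bins 9, 12 and 13?

For each row in SecondDataFrame, we can compute the proportion of each age (and treat it as a probability):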

props <- apply(SecondDataFrame[,2:14], 1, function (x) x/sum(x))
props
#            [,1]      [,2]      [,3]      [,4]       [,5]       [,6]
#Age5  0.00000000 0.0000000 0.0000000 0.1222222 0.14814815 0.14736842
#Age6  0.00000000 0.0000000 0.0000000 0.1333333 0.09876543 0.13684211
#Age7  0.00000000 0.0000000 0.0000000 0.1333333 0.11111111 0.11052632
#Age8  0.00000000 0.0000000 0.0000000 0.1000000 0.13580247 0.13157895
#Age9  0.00000000 0.0000000 0.0000000 0.1333333 0.08641975 0.11842105
#Age10 0.00000000 0.0000000 0.0000000 0.1333333 0.13580247 0.10789474
#Age11 0.05428571 0.0000000 0.0000000 0.1000000 0.09876543 0.11052632
#Age12 0.04000000 0.0000000 0.0000000 0.1444444 0.14814815 0.10263158
#Age13 0.15142857 0.1529412 0.1522936 0.0000000 0.03703704 0.03421053
#Age14 0.19714286 0.1980392 0.2660550 0.0000000 0.00000000 0.00000000
#Age15 0.16857143 0.2235294 0.2348624 0.0000000 0.00000000 0.00000000
#Age16 0.21714286 0.2490196 0.1963303 0.0000000 0.00000000 0.00000000
#Age17 0.17142857 0.1764706 0.1504587 0.0000000 0.00000000 0.00000000

Again using apply(), we can compute the likelihood of the three children appearing in each row (note that in props the rows have become columns).

likelihood <- apply(props, 2, function (x) dmultinom(freq, size = sum(freq), prob = x))
likelihood
#[1] 0.03382111 0.04032567 0.02699215 0.00000000 0.00000000 0.00000000

prob_OtherID <- likelihood / sum(likelihood)
prob_OtherID
#[1] 0.3344025 0.3987156 0.2668819 0.0000000 0.0000000 0.0000000

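The probability that the children belong to OtherID 1 is 33.4%, to OtherID 2 is 39.9%, and so on. These are simply the likelihoods rescaled to sum to 1. This way of calculating only works when the number of children is small. If a group has, say, 100+ children, this code can break because of numerical issues.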
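A common way around the numerical issue is to work on the log scale. The sketch below is not part of the approach above: `assignment_probs` is a hypothetical helper, and the example counts are made up. It relies on the `log = TRUE` argument of dmultinom() and on the fact that dmultinom() normalizes `prob` internally, so raw counts can be passed directly.

```r
# Hypothetical helper: convert per-row multinomial log-likelihoods into
# assignment probabilities without underflow.
# freq:   age frequencies of the group (one entry per age column)
# counts: numeric matrix, one row per OtherID, one column per age
assignment_probs <- function(freq, counts) {
  loglik <- apply(counts, 1, function(x) {
    if (sum(x) == 0) return(-Inf)          # an empty row cannot host anyone
    dmultinom(freq, prob = x, log = TRUE)  # -Inf if a needed age has count 0
  })
  if (all(loglik == -Inf)) return(rep(NA_real_, length(loglik)))
  w <- exp(loglik - max(loglik))           # largest weight becomes exactly 1
  w / sum(w)
}

# Made-up example with a very large group.
freq   <- c(2000, 2000, 2000)
counts <- rbind(A = c(90000, 5000, 5000),
                B = c(5000, 90000, 5000))

naive <- apply(counts, 1, function(x) dmultinom(freq, prob = x))
naive / sum(naive)               # both likelihoods underflow to 0, giving NaN
assignment_probs(freq, counts)   # still well defined on the log scale
```

Subtracting the maximum log-likelihood before exponentiating leaves the ratios unchanged, so for small groups the result agrees with likelihood / sum(likelihood).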

Now use sample() to pick an OtherID for the children, and update the list.

chosenID <- sample(SecondDataFrame$OtherID, size = 1, prob = prob_OtherID)
split_df[[i]]$OtherID <- chosenID

Finally, go to the corresponding row of SecondDataFrame and subtract this group's age frequencies from the age counts:

SecondDataFrame[SecondDataFrame$OtherID == chosenID, 2:14] <- 
    SecondDataFrame[SecondDataFrame$OtherID == chosenID, 2:14] - freq

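Now put all of this into a for loop and the job is done! A few more caveats. First, in this example the for loop breaks at i = 4, because no row in SecondDataFrame contains both 8-year-olds and 14-year-olds: the likelihood is zero for every row, so prob_OtherID is NaN and sample() fails. Second, this algorithm does not guarantee that you will be able to assign everyone an OtherID, because as the counts in SecondDataFrame go down, you become more and more likely to run into a problem like the one at i = 4. Maybe you will be lucky and fill them all without an error, or maybe the capacity is much larger than the number of subjects, in which case you are fine. Otherwise you will have to find a way around this problem.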
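A minimal sketch of such a loop follows. It goes beyond the steps above in two places, both of which are my assumptions rather than requirements from the question: a row is only considered if it has enough remaining count for every age in the group, and a group that fits in no single row is placed one subject at a time. The names `assign_groups`, `age_freq` and `fits` are hypothetical.

```r
FirstDataFrame <- data.frame(
  GroupID   = c(1902, 1905, 1905, 1905, 1906, 1906, 1914, 1914, 1932, 1932, 1964, 1964, 1964),
  SubjectID = c(24626, 13300, 14126, 2619, 914, 872, 13325, 12539, 12597, 13314, 13343, 1723, 13333),
  Age       = c(17, 13, 16, 17, 5, 9, 8, 14, 10, 13, 7, 14, 16))
SecondDataFrame <- data.frame(
  OtherID = c(1, 2, 3, 4, 5, 6),
  Age5  = c(0, 0, 0, 11, 12, 57),   Age6  = c(0, 0, 0, 12, 8, 52),
  Age7  = c(0, 0, 0, 12, 9, 42),    Age8  = c(0, 0, 0, 9, 11, 50),
  Age9  = c(0, 0, 0, 12, 7, 46),    Age10 = c(0, 0, 0, 12, 11, 41),
  Age11 = c(19, 0, 0, 9, 8, 42),    Age12 = c(14, 0, 0, 13, 12, 39),
  Age13 = c(54, 78, 83, 0, 3, 13),  Age14 = c(69, 101, 145, 0, 0, 0),
  Age15 = c(59, 114, 128, 0, 0, 0), Age16 = c(77, 127, 107, 0, 0, 0),
  Age17 = c(61, 91, 82, 0, 0, 0))

assign_groups <- function(first, second, ages = 5:17) {
  age_cols <- paste0("Age", ages)
  counts <- as.matrix(second[, age_cols])  # remaining capacity per OtherID
  first$OtherID <- NA_real_

  # Age frequencies of the given rows of `first`, aligned with age_cols.
  age_freq <- function(rows)
    as.integer(table(factor(first$Age[rows], levels = ages)))
  # Which OtherID rows still have enough count for every requested age?
  fits <- function(freq) apply(counts, 1, function(x) all(x >= freq))

  for (g in unique(first$GroupID)) {
    rows <- which(first$GroupID == g)
    # Assumption: keep the group together if any row fits it, otherwise
    # fall back to placing each subject on its own.
    units <- if (any(fits(age_freq(rows)))) list(rows) else as.list(rows)
    for (u in units) {
      freq <- age_freq(u)
      ok <- which(fits(freq))
      if (length(ok) == 0) next            # no capacity left: stays NA
      loglik <- apply(counts[ok, , drop = FALSE], 1,
                      function(x) dmultinom(freq, prob = x, log = TRUE))
      w <- exp(loglik - max(loglik))       # log scale avoids underflow
      pick <- ok[sample.int(length(ok), 1, prob = w)]
      first$OtherID[u] <- second$OtherID[pick]
      counts[pick, ] <- counts[pick, ] - freq   # use up the matched slots
    }
  }
  second[age_cols] <- as.data.frame(counts)
  list(first = first, second = second)
}

set.seed(1)   # the assignment is random; fix the seed to reproduce a run
result <- assign_groups(FirstDataFrame, SecondDataFrame)
result$first
```

With the data from the question, GroupIDs 1914 and 1964 fit in no single row, so under this sketch their subjects are placed individually and may end up in different OtherIDs.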
