Merge the content of two tables (looking for Matlab or Pseudo Code)

Question

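This question is not only for MATLAB users. If you know how to answer it in pseudocode, feel free to leave an answer!

I have two tables, Ta and Tb, with different numbers of rows and different numbers of columns. The contents are all cell text, but in the future they might also contain cell numbers.

I want to merge the contents of these tables according to the following set of rules:

If Tb(i*,j*) is empty, take the value of Ta(i,j), and vice versa.
If both are available, take the value of Ta(i,j) (and, optionally, check whether they are the same).

The tricky part, however, is that we have no unique row keys, only unique column keys. Note that I distinguished between i* and i above. The reason is that a row in Ta may sit at a different index than the corresponding row in Tb, and the same holds for the columns j* and j. The implications are:

We first need to identify which row of Ta corresponds to which row of Tb, and vice versa. We can do that by trying to cross-match on any of the columns the tables share. However, we might not find a match (in which case we do not merge the row with any other row).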
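A minimal sketch of that cross-matching step, assuming the table contents have been pulled out as plain cell arrays (for example with table2cell(Ta) and Ta.Properties.VariableNames). The names dataA, dataB, namesA, namesB and score are illustrative only:

```matlab
% Example data, stored as column names plus a cell array of text.
namesA = {'A','B','C'};
dataA  = {'a1','b1','c1'; 'a2','b2','c2'};
namesB = {'B','C','D'};
dataB  = {'b2*','c2','d2'; 'b3','c3','d3'; 'b4','c4','d4'};

% Columns present in both tables, and their positions in each table.
[shared, ia, ib] = intersect(namesA, namesB);

% score(i,j) = number of shared columns in which row i of Ta
% and row j of Tb hold identical text.
score = zeros(size(dataA,1), size(dataB,1));
for i = 1:size(dataA,1)
    for j = 1:size(dataB,1)
        score(i,j) = sum(strcmp(dataA(i,ia), dataB(j,ib)));
    end
end
```

Here score(2,1) is the only nonzero entry: row 2 of Ta and row 1 of Tb agree in column C but not in column B ('b2' versus 'b2*').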

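Question

How can we merge the contents of these two tables in the most efficient way?

Here are some resources that explain the problem in more detail:

1. A Matlab example to play with: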

Ta = cell2table({...
     'a1', 'b1', 'c1'; ...
     'a2', 'b2', 'c2'}, ...
      'VariableNames', {'A','B', 'C'})
Tb = cell2table({...
     'b2*', 'c2', 'd2'; ...
     'b3', 'c3', 'd3'; ...
     'b4', 'c4', 'd4'}, ...
      'VariableNames', {'B','C', 'D'})

The resulting table Tc should look like this:

Tc = cell2table({...
    'a1' 'b1' 'c1'   ''; ...
    'a2' 'b2' 'c2' 'd2'; ...
    ''   'b3' 'c3' 'd3'; ...
    ''   'b4' 'c4' 'd4'}, ...
     'VariableNames', {'A', 'B','C', 'D'})

2. Possible first step

I tried the following:

Tc = outerjoin(Ta, Tb, 'MergeKeys', true)

This runs fine, but the problem is that it does not stack rows that look similar. For example, the command above yields:

 A        B       C       D  
____    _____    ____    ____
''      'b2*'    'c2'    'd2'
''      'b3'     'c3'    'd3'
''      'b4'     'c4'    'd4'
'a1'    'b1'     'c1'    ''  
'a2'    'b2'     'c2'    ''

Here the rows

''      'b2*'    'c2'    'd2'
'a2'    'b2'     'c2'    ''

should be merged into one:

'a2'    'b2'     'c2'    'd2'

So do we need one more step that stacks these two rows together?
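One way to sketch that extra stacking step, applied to the two rows above written as plain cell rows. The names rowA, rowB, merged and conflict are made up for illustration, and an empty char is assumed to mean "missing":

```matlab
rowA = {'a2', 'b2',  'c2', ''};    % row that came from Ta
rowB = {'',   'b2*', 'c2', 'd2'};  % row that came from Tb

% Start from Ta's values and fall back to Tb wherever Ta has nothing.
merged = rowA;
useB = cellfun(@isempty, rowA);
merged(useB) = rowB(useB);

% Optional check: cells filled on both sides whose text differs.
bothSet  = ~cellfun(@isempty, rowA) & ~cellfun(@isempty, rowB);
conflict = bothSet & ~strcmp(rowA, rowB);
```

For these rows, merged holds 'a2', 'b2', 'c2', 'd2', and conflict flags only the second column, where Ta has 'b2' and Tb has 'b2*'.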

3. Example of an obstacle

If we have something like:

Ta = 
     A        B       C       
    ____    _____    ____
    'a1'    'b1'     'c1' 
    'a2'    'b2'     'c2'

Tb = 
     A        B       C       
    ____    _____    ____
    'a1'    'b2'     'c3'

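then the question is whether the row in Tb should be merged with row 1 or row 2 of Ta, whether all rows should be merged, or whether it should simply stay a separate row. Ideas on how to handle such cases are welcome too.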
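To make the ambiguity concrete, the sketch below counts matching fields between the Tb row and each Ta row. Flagging a tie as ambiguous is only one possible policy, assumed here for illustration; the names nMatch and isAmbiguous are hypothetical:

```matlab
dataA = {'a1','b1','c1'; 'a2','b2','c2'};   % rows of Ta
rowB  = {'a1','b2','c3'};                   % the single row of Tb

% Number of fields in which each Ta row agrees with the Tb row.
nMatch = zeros(size(dataA,1), 1);
for i = 1:size(dataA,1)
    nMatch(i) = sum(strcmp(dataA(i,:), rowB));
end

% A tie for the best score means no Ta row is a clear partner.
isAmbiguous = sum(nMatch == max(nMatch)) > 1;
```

Both Ta rows agree with the Tb row in exactly one field, so a plain count cannot choose between them. A caller could then keep the Tb row separate or fall back to a finer measure.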

Answer 1

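Here is a conceptual answer that may help you:

Define a 'scoring function' that tells you how well each row of Tb matches a row in Ta.
Fill Tc with Ta.
For each row in Ta, determine the best match in Tb. If the match quality is above your criterion, treat the best match as a successful match.
If a successful match is found, 'consume' it (use the information from Tb to enrich the corresponding row in Tc where needed).
Continue until the end of Ta; whatever has not been consumed from Tb can now be 'appended' to Tc.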
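The procedure above could be sketched as follows. This is only one reading of it, under stated assumptions: the tables are held as cell arrays of text plus column-name lists, the score is the number of identical shared fields, and minScore = 1 is an arbitrary criterion. All variable names are hypothetical.

```matlab
namesA = {'A','B','C'};
dataA  = {'a1','b1','c1'; 'a2','b2','c2'};
namesB = {'B','C','D'};
dataB  = {'b2*','c2','d2'; 'b3','c3','d3'; 'b4','c4','d4'};
minScore = 1;   % assumed criterion: at least one shared field agrees

namesC = union(namesA, namesB);            % column keys of the result
[~, colA] = ismember(namesA, namesC);      % where Ta columns land in Tc
[~, colB] = ismember(namesB, namesC);      % where Tb columns land in Tc
[~, ia, ib] = intersect(namesA, namesB);   % shared columns, used for scoring

% Fill Tc with Ta; columns Ta does not have stay empty.
dataC = repmat({''}, size(dataA,1), numel(namesC));
dataC(:, colA) = dataA;

usedB = false(size(dataB,1), 1);           % Tb rows already consumed
for i = 1:size(dataA,1)
    % Score every unconsumed Tb row against Ta row i.
    score = -inf(size(dataB,1), 1);
    for j = find(~usedB)'
        score(j) = sum(strcmp(dataA(i,ia), dataB(j,ib)));
    end
    [best, jBest] = max(score);
    if ~isempty(best) && best >= minScore
        % Consume the match: Tb only fills cells that Ta left empty.
        fromB = repmat({''}, 1, numel(namesC));
        fromB(colB) = dataB(jBest,:);
        rowC = dataC(i,:);
        gaps = cellfun(@isempty, rowC);
        rowC(gaps) = fromB(gaps);
        dataC(i,:) = rowC;
        usedB(jBest) = true;
    end
end

% Append whatever was not consumed from Tb.
rest = repmat({''}, sum(~usedB), numel(namesC));
rest(:, colB) = dataB(~usedB,:);
dataC = [dataC; rest];
```

In MATLAB, cell2table(dataC, 'VariableNames', namesC) turns the result back into a table. The loop is greedy: an early Ta row can consume a Tb row that a later Ta row would have matched better.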

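Room for improvement:

Notes on matching

Try consuming Ta instead of Tb, or use a more sophisticated heuristic to determine the order of consumption (for example, compute all 'distances' and optimize the matching against a cost function).

Note that these refinements are only needed if the matching in your basic solution produces a lot of false positives.

Notes on the definition of match quality

I would suggest starting very simple: for example, if you have 4 fields, just count the number of fields that match, or check whether all non-empty fields match.

If you want to go further, consider evaluating how far apart values are (e.g. the MSE) or how far apart texts are (e.g. the Levenshtein distance).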
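As an illustration of a text distance, here is a small sketch of the Levenshtein distance between two strings using the standard dynamic-programming table. The normalisation into a similarity between 0 and 1 is an assumption added for convenience and requires at least one non-empty string:

```matlab
s = 'b2*';
t = 'b2';

% D(i+1,j+1) = edit distance between the first i chars of s
% and the first j chars of t.
D = zeros(numel(s)+1, numel(t)+1);
D(:,1) = (0:numel(s))';
D(1,:) = 0:numel(t);
for i = 1:numel(s)
    for j = 1:numel(t)
        cost = double(s(i) ~= t(j));          % 0 if the characters match
        D(i+1,j+1) = min([D(i,j+1) + 1, ...   % deletion
                          D(i+1,j) + 1, ...   % insertion
                          D(i,j) + cost]);    % substitution or match
    end
end
dist = D(end,end);

% Assumed normalisation: 1 means identical, 0 means nothing in common.
similarity = 1 - dist / max(numel(s), numel(t));
```

With this measure 'b2*' and 'b2' are one edit apart, so a threshold on similarity could count them as a near match where an exact comparison would not.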

Answer 2

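Here is a function that attempts to do the job. You pass in the two tables, a threshold that decides whether two rows get merged, and a logical flag that says whether you prefer to take the value from the first table when a merge conflict comes up. I did not prepare for edge cases, but see what it gives you: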

TkeepAll=mergeTables(Tb,Ta,1,true)
TmergeSome=mergeTables(Tb,Ta,0.25,true)
TmergeAll=mergeTables(Tb,Ta,-1,true)

The function is as follows:

function Tmerged=mergeTables(Ta,Tb,threshold,preferA)
%% parameters
% Ta and Tb are the two tables to merge
% threshold=0.25; minimal ratio of identical values in rows for merge.
%   example: you have one row in table A with 3 values, but you only have two
%   values for the same columns in data B. if one of the values is identical
%   and one isn't, you have ratio of 1/2 aka 0.5, which passes a threshold of
%   0.25
% preferA=true; which to take when there is merge conflict
%% see how well rows fit to each other
% T1 is the table with fewer rows
if size(Ta,1)<=size(Tb,1)
    T1=Ta;
    T2=Tb;
    prefer1=preferA;
else
    T1=Tb;
    T2=Ta;
    prefer1=~preferA;
end
[commonVar1,commonVar2]=ismember(T1.Properties.VariableNames,...
    T2.Properties.VariableNames);
commonVar1=find(commonVar1);
commonVar2(commonVar2==0)=[];
% fit is a table with the size of N rows T1 by M rows T2, with values
% describing what ratio of identical items between each row in
% table 1 (shorter) and each row in table 2 (longer), among all not-missing
% points
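fit=zeros(size(T1,1),size(T2,1)); % preallocate so fit exists even if a table has no rows
for ii=1:size(T1,1) %rows of T1
    for jj=1:size(T2,1) %rows of T2
        fit(ii,jj)=sum(ismember(T1{ii,commonVar1},T2{jj,commonVar2}))/length(commonVar1);
    end
end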
%% pair rows according to fit
% match has two columns, first one has T1 row number and second one has the
% matching T2 row number
unpaired1=true(size(T1,1),1);
unpaired2=true(size(T2,1),1);
count=0;
match=[];
maxv=max(fit,[],2);
[~,order]=sort(maxv,'descend');
order=order';
for ii=order %1:size(T1,1)
    [maxv,maxi]=max(fit,[],2);
    if maxv(ii)>threshold
        count=count+1;
        match(count,1)=ii;
        match(count,2)=maxi(ii);
        unpaired1(ii)=false;
        unpaired2(match(count,2))=false;
        fit(:,match(count,2))=nan; %exclude paired row from next pairing
    end
end

%% prepare new variables
% first variables common to the two tables
Nrows=sum(unpaired1)+sum(unpaired2)+size(match,1);
namesCommon={};
namesCommon(1:length(commonVar1))={T1.Properties.VariableNames{commonVar1}};
for vari=1:length(commonVar1)
    if isempty(match)
        mergedData={};
    else
        if prefer1
            mergedData=T1{match(:,1),commonVar1(vari)}; %#ok<*NASGU>
        else
            mergedData=T2{match(:,2),commonVar2(vari)};
        end
    end
    data1=T1{unpaired1,commonVar1(vari)};
    data2=T2{unpaired2,commonVar2(vari)};
    eval([namesCommon{vari},'=[data1;mergedData;data2];']);
end
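% variables only in 1
% rows must follow the order used for the common variables:
% unpaired T1 rows first, then the matched rows in match order
if isempty(match)
    rows1=find(unpaired1);
else
    rows1=[find(unpaired1);match(:,1)];
end
uncommonVar1=1:size(T1,2);
uncommonVar1(commonVar1)=[];
names1={};
names1(1:length(uncommonVar1))={T1.Properties.VariableNames{uncommonVar1}};
for vari=1:length(uncommonVar1)
    data1=T1{rows1,uncommonVar1(vari)};
    tmp=repmat({''},Nrows-size(data1,1),1);
    eval([names1{vari},'=[data1;tmp];']);
end
% variables only in 2
% same idea: matched rows in match order, then the unpaired T2 rows
if isempty(match)
    rows2=find(unpaired2);
else
    rows2=[match(:,2);find(unpaired2)];
end
uncommonVar2=1:size(T2,2);
uncommonVar2(commonVar2)=[];
names2={};
names2(1:length(uncommonVar2))={T2.Properties.VariableNames{uncommonVar2}};
for vari=1:length(uncommonVar2)
    data2=T2{rows2,uncommonVar2(vari)};
    tmp=repmat({''},Nrows-size(data2,1),1);
    eval([names2{vari},'=[tmp;data2];']);
end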
%% collect variables to a table
names=sort([namesCommon,names1,names2]);
str='table(';
for vari=1:length(names)
    str=[str,names{vari},','];
end
str=[str(1:end-1),');'];
Tmerged=eval(str);
