DataflowAssert 未通过 TableRow 测试

Question

我们不知道为什么当运行这个简单的测试时，DataflowAssert 失败：

  @Test
  @Category(RunnableOnService.class)
  public void testTableRow() throws Exception {
      Pipeline p = TestPipeline.create();
      PCollection<TableRow> pCollectionTable1 = p.apply("a",Create.of(TABLEROWS_ARRAY_1));
      PCollection<TableRow> pCollectionTable2 = p.apply("b",Create.of(TABLEROWS_ARRAY_2));
      PCollection<TableRow> joinedTables = Table.join(pCollectionTable1, pCollectionTable2);
      DataflowAssert.that(joinedTables).containsInAnyOrder(TABLEROW_TEST);
      p.run();
  }

我们遇到以下异常：

    Sep 25, 2015 10:42:50 AM com.google.cloud.dataflow.sdk.testing.DataflowAssert$TwoSideInputAssert$CheckerDoFn processElement 
SEVERE: DataflowAssert failed expectations.
 java.lang.AssertionError: 
   Expected: iterable over [<{id=x}>] in any order
     but: Not matched: <{id=x}>
    at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
    at org.junit.Assert.assertThat(Assert.java:865)
    at org.junit.Assert.assertThat(Assert.java:832)
    at ...

为了简化 DataflowAssert 测试，我们硬编码了 Table.join 的输出以匹配 DataflowAssert，具有：

private static final TableRow TABLEROW_TEST = new TableRow()
        .set("id", "x");


static PCollection<TableRow> join(PCollection<TableRow> pCollectionTable1,
        PCollection<TableRow> pCollectionTable2) throws Exception {

    final TupleTag<String> pCollectionTable1Tag = new TupleTag<String>();
    final TupleTag<String> pCollectionTable2Tag = new TupleTag<String>();

    PCollection<KV<String, String>> table1Data = pCollectionTable1
            .apply(ParDo.of(new ExtractTable1DataFn()));
    PCollection<KV<String, String>> table2Data = pCollectionTable2
            .apply(ParDo.of(new ExtractTable2DataFn()));

    PCollection<KV<String, CoGbkResult>> kvpCollection = KeyedPCollectionTuple
            .of(pCollectionTable1Tag, table1Data).and(pCollectionTable2Tag, table2Data)
            .apply(CoGroupByKey.<String> create());

    PCollection<KV<String, String>> resultCollection = kvpCollection
            .apply(ParDo.named("Process join")
                    .of(new DoFn<KV<String, CoGbkResult>, KV<String, String>>() {
                        private static final long serialVersionUID = 0;

                        @Override
                        public void processElement(ProcessContext c) {
                            // System.out.println(c);
                            KV<String, CoGbkResult> e = c.element();
                            String key = e.getKey();
                            String value = null;
                            for (String table1Value : c.element().getValue().getAll(pCollectionTable2Tag)) {

                                for (String table2Value : c.element().getValue().getAll(pCollectionTable2Tag)) {
                                    value = table1Value + "," + table2Value;
                                }
                            }
                            c.output(KV.of(key, value));
                        }
                    }));

    PCollection<TableRow> formattedResults = resultCollection.apply(
            ParDo.named("Format join").of(new DoFn<KV<String, String>, TableRow>() {
                private static final long serialVersionUID = 0;

                public void processElement(ProcessContext c) {
                    TableRow row = new TableRow().set("id", "x");
                    c.output(row);                      
                }
            }));

    return formattedResults;
}

有谁知道我们哪里做错了吗？

Answer 1

我认为错误消息是在告诉您实际集合包含的该元素的副本多于预期。

Expected: iterable over [<{id=x}>] in any order
 but: Not matched: <{id=x}>

这是 hamcrest，表示您想要对单个元素进行迭代，但实际集合中有一个项目不匹配。由于来自 "format join" 的所有项目都具有相同的值，这使得它比应该的更难阅读。

具体来说，这是我运行以下测试时产生的消息，该测试检查具有两份 row 的集合是否恰好包含一份 row:

@Category(RunnableOnService.class)
@Test
public void testTableRow() throws Exception {
  Pipeline p = TestPipeline.create();

  TableRow row = new TableRow().set("id", "x");

  PCollection<TableRow> rows = p.apply(Create.<TableRow>of(row, row));
  DataflowAssert.that(rows).containsInAnyOrder(row);

  p.run();
}

为了用你的代码得到那个结果，我不得不利用你只迭代表 2 中的条目这一事实。具体来说：

// Use these as the input tables.
table1 = [("keyA", "A1a"), ("keyA", "A1b]
table2 = [("keyA", "A2a"), ("keyA", "A2b"), ("keyB", "B2")]

// The CoGroupByKey returns
[("keyA", (["A1a", "A1b"], ["A2a", "A2b"])),
 ("keyB", ([], ["B2"]))]

// When run through "Process join" this produces.
// For details on why see the next section.
["A2b,A2b",
 "B2,B2"]

// When run through "Format join" this becomes the following.
[{id=x}, {id=x}]

请注意，"Process join" 的 DoFn 可能不会产生如下评论的预期结果：

String key = e.getKey();
String value = null;
// NOTE: Both table1Value and table2Value iterate over pCollectionTable2Tag
for (String table1Value : c.element().getValue().getAll(pCollectionTable2Tag)) {
    for (String table2Value : c.element().getValue().getAll(pCollectionTable2Tag)) {
        // NOTE: this updates value, and doesn't output it. So for each
        // key there will be a single output with the *last* value
        // rather than one for each pair.
        value = table1Value + "," + table2Value;
    }
}
c.output(KV.of(key, value));

DataflowAssert 未通过 TableRow 测试

DataflowAssert doesn't pass TableRow test

google-cloud-dataflow