How to join two files via Cascading

Question

Let's see what we have. The first file [Interface Class]:

list arrayList
list linkedList

The second file [Class countOfInstance]:

arrayList 120
linkedList 4

I want to join these two files on the key [Class] and get the instance count for each interface:

list 124

And the code:

public class Main
{
  public static void main( String[] args )
  {
    String docPath = args[ 0 ];
    String wcPath = args[ 1 ];
    String stopPath = args[ 2 ];

    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, Main.class );
    AppProps.setApplicationName( properties, "Part 1" );
    AppProps.addApplicationTag( properties, "lets:do:it" );
    AppProps.addApplicationTag( properties, "technology:Cascading" );
    FlowConnector flowConnector = new Hadoop2MR1FlowConnector( properties );

    // create source and sink taps
    Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
    Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

    Fields stop = new Fields( "class" );
    Tap classTap = new Hfs( new TextDelimited( true, "\t" ), stopPath );

    // specify a regex operation to split the "document" text lines into a token stream
    Fields token = new Fields( "token" );
    Fields text = new Fields( "interface" );
    RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
    Fields fieldSelector = new Fields( "interface", "class" );
    Pipe docPipe = new Each( "token", text, splitter, fieldSelector );

    // define "ScrubFunction" to clean up the token stream
    Fields scrubArguments = new Fields( "interface", "class" );
    docPipe = new Each( docPipe, scrubArguments, new ScrubFunction( scrubArguments ), Fields.RESULTS );

    Fields text1 = new Fields( "amount" );
    // RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
    Fields fieldSelector1 = new Fields( "class", "amount" );
    Pipe stopPipe = new Each( "token1", text1, splitter, fieldSelector1 );
    Pipe tokenPipe = new CoGroup( docPipe, token, stopPipe, text, new InnerJoin() );
    tokenPipe = new Each( tokenPipe, text, new RegexFilter( "^$" ) );

    // determine the word counts
    Pipe wcPipe = new Pipe( "wc", tokenPipe );
    wcPipe = new Retain( wcPipe, token );
    wcPipe = new GroupBy( wcPipe, token );
    wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef().setName( "wc" ).addSource( docPipe, docTap ).addSource( stopPipe, classTap ).addTailSink( wcPipe, wcTap );

    // write a DOT file and run the flow
    Flow wcFlow = flowConnector.connect( flowDef );
    wcFlow.writeDOT( "dot/wc.dot" );
    wcFlow.complete();
  }
}

[I decided to solve this problem step by step and leave the final result here for others. So, the first step (not finished yet):]

Answer 1

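I would load the two files into two Map objects, iterate over the keys, and sum the counts per interface. Then you can write the result back to a file.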

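  // requires: import java.util.HashMap; import java.util.Map;
  // class name -> interface name (first file)
  Map<String,String> nameToType = new HashMap<String,String>();
  // class name -> instance count (second file)
  Map<String,Integer> nameToCount = new HashMap<String,Integer>();
  //fill Maps from file here
  // interface name -> total instance count
  Map<String,Integer> result = new HashMap<String,Integer>();
  for (String name: nameToType.keySet())
  {
        String type = nameToType.get(name);
        // look the count up by class name, not by interface name
        Integer count = nameToCount.get(name);
        if (count == null)
            continue; // class is missing from the second file

        if (!result.containsKey(type))
            result.put(type,0);
        result.put(type, result.get(type) + count);
   }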
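
Here is a minimal sketch of the same idea with the parsing step filled in. It assumes every line holds two whitespace-separated columns and that neither file has a header row. The class name `InterfaceCounts` and the method `countPerInterface` are made up for illustration, and this is plain Java rather than a Cascading flow.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class InterfaceCounts {

    /**
     * Sums instance counts per interface.
     *
     * @param mappingLines lines of the first file: "interface class"
     * @param countLines   lines of the second file: "class count"
     */
    static Map<String, Integer> countPerInterface(List<String> mappingLines,
                                                  List<String> countLines) {
        // class name -> instance count
        Map<String, Integer> classToCount = new HashMap<>();
        for (String line : countLines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length < 2) {
                continue; // skip blank or incomplete lines
            }
            classToCount.put(parts[0], Integer.parseInt(parts[1]));
        }

        // interface name -> total count, in order of first appearance
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String line : mappingLines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length < 2) {
                continue;
            }
            Integer count = classToCount.get(parts[1]);
            if (count == null) {
                continue; // behaves like an inner join: unmatched classes are dropped
            }
            result.merge(parts[0], count, Integer::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> mapping = Arrays.asList("list arrayList", "list linkedList");
        List<String> counts = Arrays.asList("arrayList 120", "linkedList 4");
        System.out.println(countPerInterface(mapping, counts)); // prints {list=124}
    }
}
```

To run it against real files, the two lists could come from `Files.readAllLines(Paths.get(path))`. This only works while both files fit in memory on one machine; for larger inputs the join would still have to be done in Cascading itself.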

java

cascading

bigdata