Create customized Pattern for my data-set in mallet

Question

I am using Mallet 2.0.7 in Java to mine tweets. According to the documentation, for topic modeling I have to read the dataset with CsvIterator.

Reader fileReader = new InputStreamReader(new FileInputStream(new File(args[0])), "UTF-8");
instances.addThruPipe(new CsvIterator(fileReader, Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$"),
                                      3, 2, 1)); // data, label, name fields

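My dataset looks like this: row,x,location,username,hashtaghs,text,retweets,date,favorites,numberOfComment

I added the x column to serve as the label. To start, I want to run the algorithm on the text column (column 6) only, and add another column later. I wrote the pattern below, but it does not work as expected: it captures everything from column 6 to the end of the line. How should I change the regular expression in the pattern?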

Reader fileReader = new InputStreamReader(new FileInputStream(new File(filePath)), "UTF-8");
instances.addThruPipe(new CsvIterator(fileReader,
        Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(\\S*)[\\s,]*(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$"),
        6, 2, 1)); // data, label, name fields

Answer 1

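Look up the regular expression documentation to learn what each element of the pattern means. The original pattern splits the whole line into three groups: everything from the start up to the first separator (whitespace or a comma), everything up to the second separator, and then everything else. Strictly speaking, \S also matches a comma, so each \S* group really runs up to the next whitespace character.

The new pattern does the same thing, but captures six groups. The last group is still (.*), which is why you get everything from the text field to the end of the line.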
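To see this without Mallet, the six-group pattern can be run against a made-up row using plain java.util.regex. This is only a sketch: the class name, the group helper, and the sample row are hypothetical, and the fields are separated by single spaces to keep the groups easy to follow.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SixGroupDemo {
    // The six-group pattern from the question. In Java source every regex
    // backslash has to be doubled inside the string literal.
    static final Pattern SIX_GROUPS = Pattern.compile(
            "^(\\S*)[\\s,]*(\\S*)[\\s,]*(\\S*)[\\s,]*(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$");

    // Returns the text captured by the given group, or null if the line does not match.
    static String group(String line, int index) {
        Matcher m = SIX_GROUPS.matcher(line);
        return m.matches() ? m.group(index) : null;
    }

    public static void main(String[] args) {
        // Hypothetical row: row x location username hashtaghs text retweets date favorites numberOfComment
        String line = "1 x London alice #nlp hello world 3 2014-01-01 5 0";
        for (int i = 1; i <= 6; i++) {
            System.out.println("group " + i + ": " + group(line, i));
        }
        // The last line printed is: group 6: hello world 3 2014-01-01 5 0
    }
}
```

Group 6 swallows the text and every field after it, because nothing in the pattern says where the text field ends.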

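I would recommend a few fixes:

If a field is not relevant, such as the label, you can just pass 0 to indicate that it does not exist. You do not need to add a dummy field.
Anything inside () is a capturing group. If you do not want to include a field, do not capture it: remove the parentheses but keep the pattern.
The original pattern works because we can make assumptions about the name and label fields: they contain no commas or whitespace, and everything after them is text. Grabbing a field from the middle of a line takes more care, because you have to find the end of the text field. I strongly recommend using tab-separated fields, assuming no field contains a tab.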
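If the file has to stay comma-separated, the last two points might be combined roughly as follows. This is a sketch under a strong assumption that is not stated in the question: only the text field may contain commas, so the four fields after it can be used to find where the text ends. The class name, helper, and sample row are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommaFieldDemo {
    // Group 1 = row (name), group 2 = text (data).
    // Skipped fields keep their sub-pattern but lose the parentheses.
    // Assumption: no field other than text contains a comma.
    static final Pattern ROW_AND_TEXT = Pattern.compile(
            "^(\\d+),[^,]*,[^,]*,[^,]*,[^,]*,(.*),[^,]*,[^,]*,[^,]*,[^,]*$");

    // Returns {row, text}, or null if the line does not match.
    static String[] rowAndText(String line) {
        Matcher m = ROW_AND_TEXT.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        // Hypothetical row whose text field contains a comma.
        String line = "7,x,London,alice,#nlp,hello, world,3,2014-01-01,5,0";
        String[] parts = rowAndText(line);
        System.out.println(parts[0] + " -> " + parts[1]); // prints "7 -> hello, world"
    }
}
```

With a pattern shaped like this, the CsvIterator arguments would be 2, 0, 1 (data group 2, no label, name group 1). If any other field can contain a comma, the match becomes ambiguous, which is the reason for preferring tab-separated input.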

Try something like this (untested):

// row,x,location,username,hashtaghs,text,retweets,date,favorites,numberOfComment
// Assumes the file has been converted to tab-separated fields.
// Group 1 = row (name), group 2 = text (column 6, data); label 0 means "no label".
Reader fileReader = new InputStreamReader(new FileInputStream(new File(filePath)), "UTF-8");
instances.addThruPipe(new CsvIterator(fileReader,
        Pattern.compile("^(\\d+)\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t([^\t]*)\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*$"),
        2, 0, 1)); // data, label, name fields
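It is easy to miscount the repeated [^\t]* pieces in a pattern like this, so one option is to build it from the column positions and check it against a sample row before handing it to CsvIterator. The helper below is hypothetical and not part of Mallet; it assumes tab-separated rows whose first column is a numeric row id.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TabPatternBuilder {
    // Hypothetical helper: builds a line regex for tab-separated rows.
    // Group 1 is the first column (digits only), group 2 is dataColumn (1-based).
    static Pattern build(int totalColumns, int dataColumn) {
        if (dataColumn < 2 || dataColumn > totalColumns) {
            throw new IllegalArgumentException("dataColumn must be in 2..totalColumns");
        }
        StringBuilder sb = new StringBuilder("^(\\d+)");
        for (int col = 2; col <= totalColumns; col++) {
            sb.append("\\t");
            // Capture only the requested column; the others are matched but not captured.
            sb.append(col == dataColumn ? "([^\\t]*)" : "[^\\t]*");
        }
        return Pattern.compile(sb.append("$").toString());
    }

    // Returns the captured data column, or null if the line does not match.
    static String dataField(Pattern pattern, String line) {
        Matcher m = pattern.matcher(line);
        return m.matches() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        // Ten columns in the dataset; text is column 6.
        Pattern pattern = build(10, 6);
        // Hypothetical sample row.
        String line = String.join("\t",
                "7", "x", "London", "alice", "#nlp", "hello, world", "3", "2014-01-01", "5", "0");
        System.out.println(dataField(pattern, line)); // prints "hello, world"
    }
}
```

The resulting Pattern could then be passed to CsvIterator with 2, 0, 1 as the data, label, and name groups.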

Tags: java, regex, mallet, topic-modeling