Extract the text around reference numbers in a research document

I want to extract the text around reference numbers.
For example, given this text:

The sociological assumption is a constraint on the trust in the underlying social graph: the graph needs to have strong trust as evidenced, for example, by face to face interaction demonstrating social nodes knowledge of each other [10, 11]. While the first assumption has been questioned recently in [8], where it is shown that even the honest subgraph may have some cuts that disrupt the algorithmic property on which Sybil defenses are based, the trust, though being a crucial requirement for these designs, was not considered carefully. Even worse, these defense [10, 11, 2, 4] — when verified against real-world networks — have considered samples of online social graphs, which are known to possess weaker value of rust.

Here I want to extract the text that cites reference [8], and likewise the text that cites [10], [11], [2], and [4].

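You haven't actually given an example of the output you want to collect, and you haven't shown any code you have tried.

Anyway, assuming you want all of the text that precedes a citation, your regular expression would look something like this: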

(.*?)\[(.*?)\]

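This captures two groups: the first is the text and the second is the citation. You can apply it to your text with the following code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Group 1 is the text before a citation; group 2 is the bracket contents.
// Backslashes are doubled because this is a Java string literal.
private static final Pattern pattern = Pattern.compile("(.*?)\\[(.*?)\\]");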

public static void extract(String input) {
    Matcher matcher = pattern.matcher(input);

    while (matcher.find()) {
        String text = matcher.group(1);
        String citation = matcher.group(2);

        System.out.println("The text: '" + text + "'\n\thas citation(s): " + citation);
    }
}

For the input you provided, this collects the following:

The text: 'The sociological assumption is a constraint on the trust in the underlying social graph: the graph needs to have strong trust as evidenced, for example, by face to face interaction demonstrating social nodes knowledge of each other '
    has citation(s): 10, 11
The text: '. While the first assumption has been questioned recently in '
    has citation(s): 8

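After reading your comment, it seems you want to find the citations that may appear in any given sentence. Since a sentence ends with a period and may contain several citations, you need to process the text in two steps: split it into sentences, then find the citations in each one.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static void main(String[] args) {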
    String input = "...";

    List<CitedSentence> citations = new ArrayList<CitedSentence>();
    for (String sentence : convertToSentences(input)) {
        citations.addAll(findCitations(sentence));
    }

    for (CitedSentence citation : citations) {
        System.out.println(citation);
    }
}

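// Splits on periods, swallowing the whitespace around them.
public static String[] convertToSentences(String input) {
    return input.split("\\s*\\.\\s*");
}

// Matches one bracketed citation list and captures its contents.
private static final Pattern pattern = Pattern.compile("\\[(.*?)\\]");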
public static List<CitedSentence> findCitations(String sentence) {
    Matcher matcher = pattern.matcher(sentence);
    List<CitedSentence> result = new ArrayList<CitedSentence>();

    while (matcher.find()) {
        String citation = matcher.group(1);

        for (String currentCitation : citation.split(",")) {
            result.add(new CitedSentence(sentence, currentCitation.trim()));
        }
    }

    return result;
}

static class CitedSentence {
    String sentence, citation;

    public CitedSentence(String sentence, String citation) {
        this.sentence = sentence;
        this.citation = citation;
    }

    public String toString() {
        return "[" + citation + "]: " + sentence;
    }
}

When I run this, it produces the following:

[10]: The sociological assumption is a constraint on the trust in the underlying social graph: the graph needs to have strong trust as evidenced, for example, by face to face interaction demonstrating social nodes knowledge of each other [10, 11]
[11]: The sociological assumption is a constraint on the trust in the underlying social graph: the graph needs to have strong trust as evidenced, for example, by face to face interaction demonstrating social nodes knowledge of each other [10, 11]
[8]: While the first assumption has been questioned recently in [8], where it is shown that even the Copyright is held by the author/owner(s)

I only used part of the sample text.

This should work:
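
If the goal is to look up every sentence that cites a given reference, the per-sentence results can be grouped by reference number. The following is only a minimal sketch under some assumptions: the class name CitationIndex and the method index are made up for illustration, citations are assumed to be comma-separated numbers in square brackets, and sentences are split naively on a period followed by whitespace, so abbreviations such as "e.g." would be split in the wrong place.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CitationIndex {
    // Matches a bracketed citation list such as [8] or [10, 11, 2, 4].
    private static final Pattern CITATION = Pattern.compile("\\[([0-9, ]+)\\]");

    // Maps each reference number to the sentences that cite it,
    // keyed in order of first appearance (hypothetical helper).
    static Map<String, List<String>> index(String text) {
        Map<String, List<String>> byReference = new LinkedHashMap<>();
        // Naive sentence split: a period followed by whitespace or end of input.
        for (String sentence : text.split("\\.(\\s+|$)")) {
            Matcher matcher = CITATION.matcher(sentence);
            while (matcher.find()) {
                for (String number : matcher.group(1).split(",")) {
                    String key = number.trim();
                    if (key.isEmpty()) {
                        continue;
                    }
                    List<String> sentences =
                            byReference.computeIfAbsent(key, k -> new ArrayList<>());
                    // Avoid listing a sentence twice if it cites the same reference twice.
                    if (!sentences.contains(sentence)) {
                        sentences.add(sentence);
                    }
                }
            }
        }
        return byReference;
    }

    public static void main(String[] args) {
        String text = "Strong trust is assumed [10, 11]. This was questioned in [8]. "
                + "These defenses [10, 11, 2, 4] used sampled graphs.";
        index(text).forEach((ref, sentences) ->
                System.out.println("[" + ref + "]: " + sentences));
    }
}
```

Calling index(text).get("8") then returns just the sentences that cite reference 8, which seems closest to what the question asks for.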

 (^.|[.?!])+([^?!.]{0,})\[[0-9, ]{0,}?\]
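
To try this pattern from Java, the backslashes have to be doubled inside the string literal. Here is a small sketch of how it might be applied; the class name CitationRegexDemo and the method matches are hypothetical names chosen for this example, not part of the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CitationRegexDemo {
    // The pattern from above, with backslashes doubled for a Java string literal.
    private static final Pattern PATTERN =
            Pattern.compile("(^.|[.?!])+([^?!.]{0,})\\[[0-9, ]{0,}?\\]");

    // Returns the full text of every match found in the input.
    static List<String> matches(String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = PATTERN.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        for (String match : matches("A b [1]. C d [2], e.")) {
            System.out.println("'" + match + "'");
        }
    }
}
```

Two things to keep in mind: every match after the first begins with the terminator of the previous sentence, and because the middle group is greedy, a sentence with several citations produces a single match that ends at its last citation.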