How to get the surrounding text of a node?

Question

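I'm playing with Nutch. I'm trying to write something that, among other things, detects specific nodes in the DOM structure and extracts text data from around the node, for example text from parent nodes, sibling nodes, and so on. I researched and read some examples, then tried to write a plugin that does this for image nodes. Some of the code: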

    if("img".equalsIgnoreCase(nodeName) && nodeType == Node.ELEMENT_NODE){
            String imageUrl = "No Url"; 
            String altText = "No Text";
            String imageName = "No Image Name"; //For the sake of simpler code, default values set to
                                                //avoid nullpointerException in findMatches method

            NamedNodeMap attributes = currentNode.getAttributes();
            List<String>ParentNodesText = new ArrayList<String>();
            ParentNodesText = getSurroundingText(currentNode);

            //Analyze the attributes values inside the img node. <img src="xxx" alt="myPic"> 
            for(int i = 0; i < attributes.getLength(); i++){
                Attr attr = (Attr)attributes.item(i);   
                if("src".equalsIgnoreCase(attr.getName())){
                    imageUrl = getImageUrl(base, attr);
                    imageName = getImageName(imageUrl);
                }
                else if("alt".equalsIgnoreCase(attr.getName())){
                    altText = attr.getValue().toLowerCase();
                }
            }

  private List<String> getSurroundingText(Node currentNode){

    List<String> SurroundingText = new ArrayList<String>();
    while(currentNode  != null){
        if(currentNode.getNodeType() == Node.TEXT_NODE){
            String text = currentNode.getNodeValue().trim();
            SurroundingText.add(text.toLowerCase());
        }

        if(currentNode.getPreviousSibling() != null && currentNode.getPreviousSibling().getNodeType() == Node.TEXT_NODE){
            String text = currentNode.getPreviousSibling().getNodeValue().trim();
            SurroundingText.add(text.toLowerCase());
        }
        currentNode = currentNode.getParentNode();
    }   
    return SurroundingText;
}

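This doesn't seem to work properly. The img tag is detected and the image name and URL are retrieved, but nothing more. The getSurroundingText method looks ugly; I tried but couldn't improve it. I'm not clear on where and how to extract text that might be related to the image. Any help?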

Answer 1

You're on the right track. That said, take a look at this sample HTML:

<div>
   <span>test1</span>
   <img src="http://example.com" alt="test image" title="awesome title">
   <span>test2</span>
</div>

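In your case, I think the problem lies with the siblings of the img node. You are looking at the direct siblings, and you might expect that in the example above these would be the span nodes. In fact they are dummy text nodes (the whitespace between the tags), so when you ask for a sibling of img you get one of these empty nodes with no actual text.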

If we rewrite the HTML above as: <div><span>test1</span><img src="http://example.com" alt="test image" title="awesome title"><span>test2</span></div> then the siblings of img are the span nodes you want.
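To make the difference concrete, here is a minimal sketch that parses both layouts with the JDK's standard XML parser and reports what sits directly before the img node. It is not Nutch code: the class name `SiblingDemo` and the helper `describePreviousSibling` are made up for illustration, and the img tag is written as self-closing so that the snippet is well-formed XML.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

class SiblingDemo {
    // Hypothetical helper: parses a well-formed snippet and describes
    // the node immediately before the first <img> element.
    static String describePreviousSibling(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Node img = doc.getElementsByTagName("img").item(0);
        Node prev = img.getPreviousSibling();
        if (prev == null) {
            return "none";
        }
        if (prev.getNodeType() == Node.TEXT_NODE) {
            // Indentation and line breaks between tags become text nodes.
            return prev.getNodeValue().trim().isEmpty() ? "whitespace-only text" : "text";
        }
        return "element <" + prev.getNodeName() + ">";
    }

    public static void main(String[] args) throws Exception {
        String indented = "<div>\n   <span>test1</span>\n"
                + "   <img src=\"http://example.com\" alt=\"test image\"/>\n"
                + "   <span>test2</span>\n</div>";
        String compact = "<div><span>test1</span>"
                + "<img src=\"http://example.com\" alt=\"test image\"/>"
                + "<span>test2</span></div>";
        System.out.println(describePreviousSibling(indented)); // whitespace-only text
        System.out.println(describePreviousSibling(compact));  // element <span>
    }
}
```

The same check (`getNodeType() == Node.TEXT_NODE` plus an empty trimmed value) can be used in your plugin to spot these filler nodes.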

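I assume that in the example above you want to get both "test1" and "test2". In that case you need to keep moving until you find a Node.ELEMENT_NODE and then get the text inside that node. A good practice is not to grab whatever you find, but to limit your scope to p, span, and div to improve accuracy.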
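Here is a rough sketch of that idea, using only the standard org.w3c.dom API. The names `NearbyText`, `nearestElement`, `surroundingText`, and `surroundingTextOfFirstImg` are invented for this example, and the input is parsed with the JDK XML parser (so img is self-closed) rather than through Nutch. Tag names are compared in lower case, on the assumption that some HTML parsers report element names in upper case.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

class NearbyText {
    // Only these tags are treated as carriers of relevant text.
    private static final Set<String> ALLOWED =
            new HashSet<>(Arrays.asList("p", "span", "div"));

    // Walks siblings in one direction, skipping text and comment nodes,
    // and returns the first element found (or null if there is none).
    static Node nearestElement(Node start, boolean forward) {
        Node n = forward ? start.getNextSibling() : start.getPreviousSibling();
        while (n != null && n.getNodeType() != Node.ELEMENT_NODE) {
            n = forward ? n.getNextSibling() : n.getPreviousSibling();
        }
        return n;
    }

    // Collects the text of the nearest previous and next element siblings,
    // keeping only siblings whose tag is in ALLOWED.
    static List<String> surroundingText(Node node) {
        List<String> result = new ArrayList<>();
        for (boolean forward : new boolean[] {false, true}) {
            Node sibling = nearestElement(node, forward);
            if (sibling != null
                    && ALLOWED.contains(sibling.getNodeName().toLowerCase(Locale.ROOT))) {
                String text = sibling.getTextContent().trim();
                if (!text.isEmpty()) {
                    result.add(text);
                }
            }
        }
        return result;
    }

    // Convenience wrapper for trying the sketch outside of Nutch.
    static List<String> surroundingTextOfFirstImg(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return surroundingText(doc.getElementsByTagName("img").item(0));
    }

    public static void main(String[] args) throws Exception {
        String html = "<div>\n   <span>test1</span>\n"
                + "   <img src=\"http://example.com\" alt=\"test image\"/>\n"
                + "   <span>test2</span>\n</div>";
        System.out.println(surroundingTextOfFirstImg(html)); // [test1, test2]
    }
}
```

In your plugin you could call something like `surroundingText(currentNode)` in place of getSurroundingText, and repeat it on the parent node if the immediate siblings yield nothing.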
