How do I check if selected text on a web page consists only of words in JavaScript?

Question

In vanilla JavaScript, I'm trying to determine whether the text a user has selected on a web page consists entirely of whole words (ignoring symbols).

For example, suppose we have the following text somewhere on a web page.

Hello, a text for the example! (When selected all)

should result in ['Hello', 'a', 'text', 'for', 'the', 'example']

However,

Hello, a text for the example! (Leaving out the first three letters)

should result in ['a', 'text', 'for', 'the', 'example'], because Hello was not fully selected as a word.

So far I have a getSelectionText function that returns all of the selected text.

function getSelectionText() {
    var text = "";
    if (window.getSelection) {
        text = window.getSelection().toString();
    } else if (document.selection && document.selection.type !== "Control") {
        text = document.selection.createRange().text;
    }
    return text;
}

// Just adding the function as listeners.
document.onmouseup = document.onkeyup = function() {
    console.log(getSelectionText());
};

Is there a good way to adjust my function so that it works as described above?

Answer 1

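The main obstacle to achieving your goal is telling the program what a "word" actually is.

One way is to have a complete dictionary of all English words.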

const setOfAllEnglishWords = new Set([
  "Hello",
  "a",
  "text",
  "for",
  "the",
  "example"
  // ... many many more
]);

const selection = "lo, a text for the example!";
const result = selection
  .replace(/[^A-Za-z0-9\s]/g, " ") // replace anything that is not a letter, digit or whitespace with a space, so "a,b" becomes two words
  .split(/\s+/)                    // split text into words by using 1 or more whitespace as the break point
  .filter(word => setOfAllEnglishWords.has(word));

console.log(result);

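This could take a lot of memory. According to a quick Google search, the Oxford English Dictionary has about 218632 words. With an average word length of 4.5 letters and JS storing 2 bytes per character, that gives us 218632 * (4.5 * 2) = 1967688 B ≈ 1.97 MB, which could take up to 1 minute to download on a slow 3G connection.

A better approach might be to build the word dictionary yourself on each page load, by collecting all the unique words on the page.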
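The estimate can be reproduced with a few lines of code. This is only a rough sketch: the function name `estimateDictionaryBytes` is made up for illustration, and the word count, average word length and bytes per character are the approximate figures quoted above, not measured values.

```javascript
// Hypothetical helper: rough in-memory size of a word list.
// All three inputs are approximations taken from the paragraph above.
function estimateDictionaryBytes(wordCount, averageWordLength, bytesPerChar) {
  return wordCount * averageWordLength * bytesPerChar;
}

const bytes = estimateDictionaryBytes(218632, 4.5, 2);
console.log(bytes);                            // 1967688
console.log((bytes / 1e6).toFixed(2) + " MB"); // "1.97 MB"
```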

function getSetOfWordsOnPage() {
  const walk = document.createTreeWalker(
    document.body,
    NodeFilter.SHOW_TEXT
  );

  const dict = new Set();
  let n;
  while ((n = walk.nextNode())) {
    for (const word of n.textContent
      .replace(/[^A-Za-z0-9\s]/g, " ") // use a space so "a,b" splits into two words
      .split(/\s+/)
      .map(word => word.trim())
      .filter(word => !!word)) {
      dict.add(word);
    }
  }
  return dict;
}

const setOfWordsOnThePage = getSetOfWordsOnPage();

function getSelectionText() {
  if (window.getSelection) {
    return window.getSelection().toString();
  } else if (document.selection && document.selection.type !== "Control") {
    return document.selection.createRange().text;
  }
  return "";
}

// Just adding the function as listeners.
document.querySelector("#button").addEventListener("click", () => {
  const result = getSelectionText()
    .replace(/[^A-Za-z0-9\s]/g, " ") // replace punctuation with a space so "a,b" splits into two words
    .split(/\s+/) // split text into words
    .filter(word => setOfWordsOnThePage.has(word));
  console.log(result);
});

<button id="button">Show result</button>
<p>this is some text</p>
<p>again this is a text!!!!!</p>
<p>another,example,of,a,sentence</p>

Maybe we can go a step further. Do we even need to remember the words? It seems that defining "a word is text surrounded by spaces" is enough.
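To make that definition concrete, here is a minimal sketch. The helper name `extractWords` is made up for illustration, and it assumes that punctuation separates words just like spaces do, using the same letters-and-digits character class as the snippets above.

```javascript
// Hypothetical helper: treat every maximal run of letters/digits as a word.
function extractWords(text) {
  // match() returns null when nothing matches, so fall back to an empty array.
  return text.match(/[A-Za-z0-9]+/g) || [];
}

console.log(extractWords("Hello, a text for the example!"));
// ["Hello", "a", "text", "for", "the", "example"]

console.log(extractWords("lo, a text for the example!"));
// ["lo", "a", "text", "for", "the", "example"]
```

Notice that the second call still returns lo, so this definition by itself says nothing about whether the first and last words were selected completely.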

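In addition, as the OP pointed out in the comments below, the solutions above have a bug: they accept a partially selected word whenever the selected part happens to be a valid word itself.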

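To avoid the unnecessary overhead of remembering the words on the page, and to fix the bug with partially selected valid words, we can check the contents of the nodes at the left and right ends of the selection (the anchor and focus nodes, for a selection made left to right) and drop the words at those ends if the nodes contain other text that was not selected.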

The assumption we make here is that any text selection can contain at most 2 partially selected words, one at each end of the selection.

Note: the approach below also handles capitalization by assuming that THIS, tHiS and this are all the same word.

function removePunctuation(string) {
  return string.replace(/[^A-Za-z0-9\s]/g, " ");
}

function splitIntoWords(string) {
  return removePunctuation(string)
    .split(/\s+/)
    .map(word => word.toLowerCase().trim())
    .filter(word => !!word);
}

function getSelectedWords() {
  const selection = window.getSelection();
  const words = splitIntoWords(selection.toString());
  if (selection.rangeCount === 0) {
    return words; // nothing is selected
  }

  // Use the range's start/end containers rather than anchorNode/focusNode:
  // the anchor is where the user started selecting, so it is the rightmost
  // node when the selection is made backwards.
  const range = selection.getRangeAt(0);

  const startingWords = splitIntoWords(range.startContainer.textContent);
  if (words[0] !== startingWords[0]) {
    words.shift(); // remove the start since it's not a whole word
  }

  const endingWords = splitIntoWords(range.endContainer.textContent);
  if (words[words.length - 1] !== endingWords[endingWords.length - 1]) {
    words.pop(); // remove the end since it's not a whole word
  }

  return words;
}

// Just adding the function as listeners.
document.querySelector("#button").addEventListener("click", () => {
  console.log(getSelectedWords());
});

<button id="button">Show result</button>
<p><div>this is</div> <div>some text</div></p>
<p><span>again</span><span> </span><span>this</span><span> </span><span>is</span><span> </span><span>a</span> <span>text</span><span>!!!!!</span></p>
<p>another,example,of,a,sentence</p>

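Note: this code will still break if a word is split across multiple HTML elements (for example, with one part of the word in one span and the rest in another). That case breaks our definition of a word. To handle it you would also need some kind of dictionary to test whether a word is valid, essentially combining the last two solutions above.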
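One more limitation seems worth mentioning: getSelectedWords only compares against the first and last word of the boundary nodes, so a selection that starts or ends in the middle of a longer text node appears to lose its outer words even when they were fully selected. A possible refinement is to compare character offsets instead. The sketch below is not part of the solution above. The helper name `wholeWordsInRange` is made up for illustration, and it assumes the whole selection lies inside a single text node, in which case `text`, `start` and `end` could come from `range.startContainer.textContent`, `range.startOffset` and `range.endOffset` of `window.getSelection().getRangeAt(0)`.

```javascript
// Hypothetical helper: returns the words of `text` that lie entirely
// inside the half-open character range [start, end).
function wholeWordsInRange(text, start, end) {
  const words = [];
  const wordPattern = /[A-Za-z0-9]+/g;
  let match;
  while ((match = wordPattern.exec(text)) !== null) {
    const wordStart = match.index;
    const wordEnd = wordStart + match[0].length;
    // Keep the word only if the selection covers all of it.
    if (wordStart >= start && wordEnd <= end) {
      words.push(match[0]);
    }
  }
  return words;
}

const text = "Hello, a text for the example!";

console.log(wholeWordsInRange(text, 0, text.length));
// ["Hello", "a", "text", "for", "the", "example"]

console.log(wholeWordsInRange(text, 3, text.length)); // first three letters left out
// ["a", "text", "for", "the", "example"]

console.log(wholeWordsInRange(text, 9, 17)); // only "text for" selected
// ["text", "for"]
```

Selections that span several nodes would need extra handling, for example walking the text nodes between the start and end containers.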
