How to consider "Long Stopword List" from http://www.ranks.nl/stopwords?

Question

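I would like to use R to remove all stopwords from my text. The list of stopwords I want to remove is at http://www.ranks.nl/stopwords, under the section "Long Stopword List" (a very long version of the list). I am using the tm package. Can anyone help me? Thanks!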

Answer 1

You can copy the list (after selecting it in your browser) and paste it into this expression in R:

LONGSWS <- " <paste into this position> "

Put the cursor of your editor or IDE console between the two quotes before pasting. Then do this:

sw.vec <- scan(text=LONGSWS, what="")
#Read 474 items
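As a rough illustration of what that scan call does, here is a minimal sketch that uses a short stand-in string instead of the real pasted list. The six words are hypothetical placeholders, not the actual contents of the ranks.nl list, and quote = "" is an added precaution (not part of the call above) so that apostrophes in words such as "ain't" are never treated as quote characters.

```r
# Hypothetical stand-in for the pasted text; the real list is much longer.
LONGSWS <- "a able about
ain't aren't because"

# what = "" asks scan for character data; fields are split on any whitespace,
# including the line break. quote = "" switches off quote handling (assumption:
# added here as a precaution for words containing apostrophes).
sw.vec <- scan(text = LONGSWS, what = "", quote = "", quiet = TRUE)

# Inspect the resulting character vector, one stopword per element.
print(sw.vec)
print(length(sw.vec))
```

The result is an ordinary character vector, which is the form removeWords expects for its word list.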

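The scan function needs to be told the input type by giving an example of it to the what argument; for character input, "" is enough. You should then be able to apply the code you offered in your comment: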

 tm_map(text, removeWords, sw.vec)

You didn't provide an example text object. Passing a plain character vector does not work:

 tm_map("test of my text", removeWords, sw.vec )
#Error in UseMethod("tm_map", x) : 
#  no applicable method for 'tm_map' applied to an object of class "character"

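So we need to assume you have an object of a suitable class (a corpus) to put in the first argument position of tm_map. Using the example from the ?tm_map help page, which works on the crude corpus that ships with tm: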

> data("crude")   # load the example corpus used on the ?tm_map help page
> res <- tm_map(crude, removeWords, sw.vec )
> str(res)
List of 20
 $ 127:List of 2
  ..$ content: chr "Diamond Shamrock Corp said \neffective today   cut  contract prices  crude oil \n1.50 dlrs  barrel.\n    The re"| __truncated__
  ..$ meta   :List of 15
  .. ..$ author       : chr(0) 
  .. ..$ datetimestamp: POSIXlt[1:1], format: "1987-02-26 17:00:56"
  .. ..$ description  : chr ""
  .. ..$ heading      : chr "DIAMOND SHAMROCK (DIA) CUTS CRUDE PRICES"
  .. ..$ id           : chr "127"
  .. ..$ language     : chr "en"
  .. ..$ origin       : chr "Reuters-21578 XML"
  .. ..$ topics       : chr "YES"
  .. ..$ lewissplit   : chr "TRAIN"
  .. ..$ cgisplit     : chr "TRAINING-SET"
   # ----------------snipped remainder of long output.
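If all you have is a plain character vector, as in the failing call above, one possible way forward is to wrap it in a corpus first. The following is only a sketch under stated assumptions: the two example sentences and the four-word sw.vec are hypothetical stand-ins (the real sw.vec would come from the scan call), and the stripWhitespace step is an optional extra to tidy the gaps left behind by removed words.

```r
library(tm)

# Hypothetical stand-in for the scanned stopword vector.
sw.vec <- c("of", "my", "the", "a")

# Hypothetical example texts; VectorSource treats each element as one document.
txt <- c("test of my text", "the crude oil price")
docs <- VCorpus(VectorSource(txt))

# tm_map now has a corpus to dispatch on, so removeWords can be applied.
cleaned <- tm_map(docs, removeWords, sw.vec)

# Optional: collapse the runs of spaces left where words were removed.
cleaned <- tm_map(cleaned, stripWhitespace)

# Pull the cleaned documents back out as a plain character vector.
result <- vapply(cleaned,
                 function(d) paste(as.character(d), collapse = " "),
                 character(1), USE.NAMES = FALSE)
result <- trimws(result)
print(result)
```

Note that removeWords matches case-sensitively, so if the stopword list is lowercase you may want to lowercase the text first, for example with tm_map(docs, content_transformer(tolower)).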


text

r

tm