Some characters in stopwords_tr do not appear as Turkish characters

Question

stopwords_tr <- data.frame(word = stopwords::stopwords("tr",source="stopwords-iso"), stringsAsFactors = FALSE)
stopwords_tr

Some characters in stopwords_tr are not Turkish. For example:

1   acaba
2   acep
3   adamakıllı
4   adeta
5   ait
6   altmýþ   <- should be: altmış
7   altmış
8   altý     <- should be: altı

I am looking for a way to fix them.

stopwords_tr$word<-gsub("ý","ı",stopwords_tr$word)

The result did not change. I also tried the following, without success.

Encoding(stopwords_tr$word) <- "WINDOWS-1254"
Encoding(stopwords_tr$word) <- "LATIN-5"
Encoding(stopwords_tr$word) <- "UTF-8"

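Another interesting thing:

When I double-click stopwords_tr in RStudio to view it, the character appears as "ý". In the console, it looks like "y".

Is there a parameter for setting the encoding? Thanks, everyone.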

Answer 1

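If you are sure this is a bug, I think the best way to resolve it is to get it fixed at the original source: post an issue at https://github.com/stopwords-iso/stopwords-iso/issues or https://github.com/stopwords-iso/stopwords-tr/issues (I'm not sure which is better; try one, and if you pick the wrong one, they will tell you!)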

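But do check that it really is an error. I don't know Turkish, but when I Googled "altmýþ", I found it on several pages that looked Turkish to me, for example https://greatsong.net/PAROLES-ISMAIL-YK,ISTEMIYORUM-SENI,101646494.html. It is probably an encoding error, but if it is a common one, maybe you really do want it in the list.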
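If it is an encoding error, one plausible cause (an assumption on my part, not something confirmed in this thread) is text written in Windows-1254 being read as Latin-1: the two encodings share most byte values, but the bytes that mean "ı" and "ş" in Windows-1254 mean "ý" and "þ" in Latin-1. The sketch below undoes that mix-up by converting each word back to its single-byte form and re-reading those bytes as Windows-1254. The helper name `fix_tr_mojibake` and the sample vector are hypothetical.

```r
# Hypothetical helper: assumes the damage is Windows-1254 text misread as Latin-1.
fix_tr_mojibake <- function(x) {
  x <- enc2utf8(x)
  # Recover the original single bytes; words that already contain
  # characters outside Latin-1 (such as a correct "ı") become NA here.
  bytes <- iconv(x, from = "UTF-8", to = "latin1")
  # Re-interpret those bytes as Windows-1254 (Turkish).
  fixed <- iconv(bytes, from = "CP1254", to = "UTF-8")
  # Keep the original word wherever the round trip was not possible.
  ifelse(is.na(fixed), x, fixed)
}

# Sample words written with Unicode escapes so the script file's own
# encoding does not matter.
words <- c("acaba", "adamak\u0131ll\u0131", "altm\u00fd\u00fe", "alt\u00fd")
fixed_words <- fix_tr_mojibake(words)
print(fixed_words)
```

Words that are already correct pass through unchanged, so the helper can be applied to the whole column, e.g. `stopwords_tr$word <- fix_tr_mojibake(stopwords_tr$word)`. It would also rewrite any word that legitimately contains "ý", "þ" or "ð", which is why checking the source first still matters.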

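About the display issue: it sounds like you are on Windows. R on Windows has trouble displaying non-native characters. You probably don't have Icelandic installed, so it fails to display words like altmýþ.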
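Because the console and the viewer can render the same string differently, it may help to look at the stored code points instead of the rendered text. The sketch below uses a sample word of my own (not taken from the package) and then replaces the characters by Unicode escape rather than by a typed literal; a literal "ý" saved in a script with a different encoding than the data is one possible reason a `gsub()` call appears to do nothing, though that is only a guess here.

```r
# Sample word standing in for one entry of stopwords_tr$word.
word <- enc2utf8("altm\u00fd\u00fe")

# Show what is actually stored, independent of how the console draws it.
code_points <- utf8ToInt(word)
print(sprintf("U+%04X", code_points))

# Replace by code point: U+00FD (ý) -> U+0131 (ı), U+00FE (þ) -> U+015F (ş).
fixed <- gsub("\u00fd", "\u0131", word, fixed = TRUE)
fixed <- gsub("\u00fe", "\u015f", fixed, fixed = TRUE)
print(fixed)
```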

Answer 2

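I followed @user2554330's advice. However, I reported it at a different address from the ones he gave: I contacted the creator of stopwords-tr (Kenneth Benoit). The problem stemmed from a mis-encoded data source. I also noticed duplicate words and reported them. Together we resolved the character problem, and stopwords-tr has been updated at the following address: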

(Fix Turkish #16)

https://github.com/quanteda/stopwords/pull/16

devtools::install_github("quanteda/stopwords", ref = "fix-tr")

stopwords::stopwords("tr", source = "stopwords-iso")

The "Turkish Stopwords" now appear to be correctly encoded. Regards.

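To confirm the update on your own machine, a quick check for leftover mis-encoded characters and duplicates might look like the sketch below. The function name `check_stopwords` and the sample vector are hypothetical; in practice `words` would be the result of `stopwords::stopwords("tr", source = "stopwords-iso")`.

```r
# Hypothetical checker for a character vector of stop words.
check_stopwords <- function(words) {
  words <- enc2utf8(words)
  # Latin-1 letters that take the place of Turkish ı, ş, ğ, İ, Ş, Ğ
  # when Windows-1254 bytes are misread as Latin-1.
  suspicious_class <- "[\u00fd\u00fe\u00f0\u00dd\u00de\u00d0]"
  list(
    suspicious = words[grepl(suspicious_class, words)],
    duplicated = unique(words[duplicated(words)])
  )
}

# Small made-up sample with one mis-encoded word and one duplicate.
sample_words <- c("acaba", "altm\u00fd\u00fe", "altm\u0131\u015f", "acaba")
report <- check_stopwords(sample_words)
print(report)
```

Both elements of the returned list should be empty for a clean list.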