R - How to simplify this text clean-up of special characters?
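I suspect there is a way to simplify this text pre-processing, but I could not find out how to combine all of these character replacements into a single line. I would like to avoid all the repetition in my current solution (see below):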
Encoding(posts2$caption_clean) <- "UTF-8"
posts2$caption_clean <- iconv(posts2$caption_clean, "latin1", "UTF-8")
# each call removes one stray character plus the rest of its token;
# the backslash must be doubled inside an R string ("\\S", not "\S")
posts2$caption_clean <- gsub("Ã\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("â\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("ð\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("Â\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("å\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("Ð\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("Ñ\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("Ù\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("Ø\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("Ú\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("ì\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("Õ\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("ã\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("Û\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("ë\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("ê\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("追\\S*","",posts2$caption_clean)
Does anyone know how I could simplify this?
Thanks!
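# list every target character once inside a character class [...],
# then let \\S* consume the rest of the token that follows it
# (wrapping whole patterns in () inside [] does not create alternatives;
# the brackets would treat "(", ")", "S" and "*" as literal characters)
regex <- "[Ãâð]\\S*"
string <- "Ã x â x ð y "
gsub(regex, "", string)
Result:
[1] " x  x  y "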
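Applied to the data frame column from the question, the same idea might look like the sketch below. It is only an illustration under some assumptions: bad_starts holds just a subset of the characters from the question, the posts2 data frame here is a made-up stand-in, and the characters are written as \u escapes so the script does not depend on the encoding of the source file.

```r
# Characters that begin an unwanted token (subset of the question's list).
# \u escapes: U+00C3 = Ã, U+00E2 = â, U+00F0 = ð, U+00C2 = Â
bad_starts <- c("\u00c3", "\u00e2", "\u00f0", "\u00c2")

# Build a single pattern: one character class holding all start characters,
# followed by \\S* to swallow the remaining non-whitespace characters.
# (None of these characters is special inside [...]; a "]", "^" or "-"
# would need extra care.)
pattern <- paste0("[", paste(bad_starts, collapse = ""), "]\\S*")

# Hypothetical stand-in for the posts2 data frame from the question.
posts2 <- data.frame(
  caption_clean = c("caf\u00c3\u00a9 time",
                    "plain text",
                    "\u00e2\u20ac\u2122s great"),
  stringsAsFactors = FALSE
)

# One gsub() call replaces the whole chain of repeated calls.
posts2$caption_clean <- gsub(pattern, "", posts2$caption_clean)
print(posts2$caption_clean)
```

Keeping the characters in a vector means a newly discovered stray character only requires one more element in bad_starts rather than another gsub() line. The removed tokens leave their surrounding spaces behind, so a follow-up trimws() or whitespace collapse may be wanted.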