Remove stop words from multi-lingual text
I am running text and sentiment analysis on multilingual text files from the healthcare industry, and I want to remove the stop words of all languages at once. I don't want to write out every language's name in the code to remove its stop words. Is there a quick way to do this?
Here is my code (the total number of files is 596):
library(tm)

files <- list.files(path = getwd(), pattern = "txt", all.files = FALSE,
                    full.names = TRUE, recursive = TRUE)
txt <- list()
for (i in seq_along(files))  # 596 files
try({
    txt[[i]] <- readLines(files[i], warn = FALSE)
    filename <- trimws(txt[[i]])
    corpus <- iconv(filename, to = "UTF-8")
    corpus <- Corpus(VectorSource(corpus))
    # Clean text
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeNumbers)
    cleanset <- tm_map(corpus, removeWords, stopwords("english"))
    cleanset <- tm_map(cleanset, removeWords, stopwords("spanish"))
    cleanset <- tm_map(cleanset, content_transformer(tolower))
    cleanset <- tm_map(cleanset, stripWhitespace)
    # Replace newlines with spaces, then trim and collapse remaining whitespace
    cleanset <- tm_map(cleanset, content_transformer(function(x) gsub("\n", " ", x)))
    cleanset <- tm_map(cleanset, content_transformer(function(x) gsub("^\\s+|\\s+$", "", x)))
    cleanset <- tm_map(cleanset, content_transformer(function(x) gsub("[ \t]+", " ", x)))
}, silent = TRUE)
Use spaCy, which has more than 15 language models with stop words; for R there is spacyr.
I want to remove stopwords from all the languages at once.
Combine the results of each stopwords(cc) call and pass them to a single tm_map(corpus, removeWords, allStopwords) call.
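A minimal sketch of that combining approach, assuming the tm and stopwords packages are installed (the language codes shown are just examples):

```r
library(tm)
library(stopwords)

# Merge the per-language lists into one deduplicated vector
allStopwords <- unique(c(stopwords("en"), stopwords("es"), stopwords("fr")))

corpus <- Corpus(VectorSource("the patient y el medico"))
# A single removeWords pass over the combined list
cleanset <- tm_map(corpus, removeWords, allStopwords)
```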
I don't want to write the name of every language in the code to remove the stopwords
You can use stopwords_getlanguages() to get the list of all supported languages and run the removal as a loop. See the examples at https://www.rdocumentation.org/packages/stopwords/versions/2.3
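Sketched as a loop over every language a given source supports (assuming the stopwords and tm packages; "snowball" is one of the available sources, and the combined list should be deduplicated before the single removeWords call):

```r
library(tm)
library(stopwords)

# All language codes the snowball source provides, e.g. "en", "es", "fr", ...
langs <- stopwords_getlanguages(source = "snowball")

# Collect every language's stop words into one deduplicated vector
allStopwords <- unique(unlist(
  lapply(langs, function(l) stopwords(language = l, source = "snowball"))
))

corpus <- Corpus(VectorSource("el paciente and the doctor"))
cleanset <- tm_map(corpus, removeWords, allStopwords)
```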
For what it's worth, I think this (using stop words from all languages) is a bad idea. A stop word in one language may be a high-information word in another. For example, skimming https://github.com/stopwords-iso/stopwords-es/blob/master/stopwords-es.txt I spotted "embargo", "final", "mayor", "salvo", "sea", which are not in the English stop word list but can carry information.
Of course, it depends on what you do with the data after removing all those words.
But if you are searching for drug names or other keywords, just do that on the raw data and don't remove stop words at all.