What are some of the data preparation steps or techniques one needs to follow when dealing with multi-lingual data?

I am working on multilingual word-embedding code, and I need to train my data in English and test it in Spanish. I will be using Facebook's MUSE library for the word embeddings. I am looking for a way to preprocess the data for both languages in the same way. I have looked into diacritics restoration to deal with the accents.
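For the accents, one simple way to treat both languages uniformly is to strip diacritics with Python's standard unicodedata module; a minimal sketch of that idea (strip_accents is just an illustrative helper name) would be:

import unicodedata

def strip_accents(text):
    # Decompose each character into its base character plus combining marks,
    # then drop the combining marks (e.g. "canción" -> "cancion").
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents('canción época'))  # -> 'cancion epoca'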

I cannot come up with a way to carefully remove stop words and punctuation, or decide whether I should lemmatize at all.

How can I preprocess both languages uniformly to create a vocabulary that I can later use with the MUSE library?

Hi Chandana, I hope you are doing well. I would look into using the spaCy library https://spacy.io/api/doc; the man who created it has a YouTube video in which he discusses the implementation of NLP in other languages. Right below is a rough spaCy sketch for Spanish, and further down you will find code that will lemmatize and remove stop words. As far as punctuation goes, you can always set specific characters, such as accent marks, to be ignored.

Personally I use KNIME, which is free and open source, to do preprocessing. You will have to install the NLP extensions, but what is nice is that they have different extensions for different languages, which you can install here: https://www.knime.com/knime-text-processing The Stop Word Filter (since 2.9) and the Snowball stemmer node can be applied to Spanish; make sure to select the correct language in the node dialog. Unfortunately, there is no part-of-speech tagger node for Spanish so far.
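As a rough illustration of the spaCy route for the Spanish side (this is only a sketch, and it assumes the small Spanish model es_core_news_sm has already been downloaded with python -m spacy download es_core_news_sm):

import spacy

# Assumes the Spanish model is installed: python -m spacy download es_core_news_sm
nlp = spacy.load("es_core_news_sm")

def preprocess_es(text):
    # Lemmatize, lowercase, and drop stop words, punctuation, and whitespace tokens.
    doc = nlp(text)
    return [tok.lemma_.lower() for tok in doc
            if not tok.is_stop and not tok.is_punct and not tok.is_space]

print(preprocess_es("Las canciones eran muy hermosas."))
# -> roughly ['canción', 'hermoso']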

import gensim
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

# Create functions to lemmatize, stem, and preprocess

# turn beautiful, beautifully, beautified into the stem beauti
def lemmatize_stemming(text):
    stemmer = PorterStemmer()
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# parse docs into individual words, ignoring words that are 3 letters or shorter
# and stopwords: him, her, them, for, there, etc., since "their" is not a topic.
# then append the tokens to a list
def preprocess(text):
    result = []
    newStopWords = ['your_stopword1', 'your_stop_word2']  # add any extra stop words of your own here
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and token not in newStopWords and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result
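A quick, purely illustrative check of the preprocess function above might look like the following. Note that gensim's built-in STOPWORDS list is English only, so for the Spanish side you would need to supply your own Spanish stop-word list or lean on the spaCy sketch above.

# Illustrative usage only
print(preprocess("The beautiful gardens were maintained for them"))
# -> something like ['beauti', 'garden', 'maintain']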

Hope this helps; let me know if you have any questions :)