Remove special characters and numbers from an Arabic text file with Python

Question

I want to keep only Arabic characters, with no digits. I got this regular expression from GitHub.

    import os
    import re

    generalPath = "C:/Users/Desktop/Code/dataset/"
    outputPath = "C:/Users/Desktop/Code/output/"
    files = os.listdir(generalPath)

    for onefile in files:
        # relative or absolute file path, e.g.:
        localPath = generalPath + onefile
        localOutputPath = outputPath + onefile
        print(localPath)
        print(localOutputPath)
        # input is read as bytes and decoded by hand; output uses the platform default encoding
        with open(localPath, 'rb') as infile, open(localOutputPath, 'w') as outfile:
            data = infile.read().decode('utf-8')
            new_data = re.sub(r'[^0-9\u0600-\u06ff\u0750-\u077f\ufb50-\ufbc1\ufbd3-\ufd3f\ufd50-\ufd8f\ufd50-\ufd8f\ufe70-\ufefc\uFDF0-\uFDFD]+', ' ', data)
            outfile.write(new_data)

With this code I get the following error:

    Traceback (most recent call last):
      File ".\cleanText.py", line 23, in <module>
        outfile.write(new_data)
      File "C:\ProgramData\Anaconda3\lib\encodings\cp1252.py", line 19, in encode
        return codecs.charmap_encode(input,self.errors,encoding_table)[0]
    UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-2: character maps to <undefined>

My Arabic text has diacritics, and I want to keep them.

Answer 1

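Your program is trying to use the CP1252 encoding (the default on your Windows setup) instead of UTF-8. The traceback shows the failure in outfile.write, so the output file is the one that needs an explicit encoding. Specify UTF-8 when opening the files, as shown below. Also, since the input is a text file, you can open it with 'r' instead of 'rb' to read it, which makes the manual .decode('utf-8') unnecessary.

    with open(localPath, 'r', encoding='utf8') as infile, open(localOutputPath, 'w', encoding='utf8') as outfile: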
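To see why the explicit encoding matters, here is a minimal sketch. It assumes a short sample string and a temporary file (both hypothetical, not taken from the question): encoding Arabic text as CP1252 fails because that code page has no Arabic letters, while a UTF-8 round trip through a file keeps the text, diacritics included.

```python
import os
import tempfile

# Hypothetical sample: an Arabic word written with diacritics.
text = "مَرْحَبًا"

# CP1252 has no Arabic letters, so encoding raises UnicodeEncodeError.
try:
    text.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# With an explicit UTF-8 encoding the text survives a write/read round trip.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="utf-8") as outfile:
        outfile.write(text)
    with open(path, "r", encoding="utf-8") as infile:
        round_trip = infile.read()
```

Passing encoding to both open calls means the script no longer depends on the platform default.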

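As for your regex, if you only want to remove digits, you can use

    data = re.sub(r'[0-9]+', '', data)

You don't need to list the whole Arabic alphabet as the characters to keep. But it looks like you have strings such as "(1/6)". To remove all parentheses and slashes as well, use:

    data = re.sub(r'[0-9\(\)/]+', '', data)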
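One caveat: [0-9] matches only ASCII digits, and Arabic text may also contain Arabic-Indic digits (U+0660 to U+0669) or Extended Arabic-Indic digits (U+06F0 to U+06F9). The following is a sketch of one way to remove those too while leaving letters and diacritics untouched. The name clean_arabic and the sample string are illustrative assumptions, not part of the original code.

```python
import re

# ASCII digits, Arabic-Indic digits, Extended Arabic-Indic digits,
# parentheses and slashes. Diacritics (U+064B to U+0652) are not listed,
# so they are left in place.
UNWANTED = re.compile(r"[0-9\u0660-\u0669\u06F0-\u06F9()/]+")


def clean_arabic(text):
    """Return text with digits, parentheses and slashes removed."""
    return UNWANTED.sub("", text)


# Hypothetical sample, written with escapes to avoid bidi display confusion:
# a vocalized word, then "(1/6)", then the Arabic-Indic digits 1 2 3,
# then another vocalized word.
sample = (
    "\u0628\u0650\u0633\u0652\u0645\u0650 (1/6) "
    "\u0661\u0662\u0663 \u0627\u0644\u0644\u0651\u064e\u0647"
)
cleaned = clean_arabic(sample)
```

The spaces around the removed tokens stay in the result. If that matters, a follow-up such as re.sub(r' +', ' ', cleaned) would collapse them.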


python

regex

arabic

arabic-support

python-re