How to remove tashkeel from a string in Lua?

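I'm writing a simple function that should remove tashkeel (diacritics) from Arabic text. The replacement technique works for English but not for Arabic. Any suggestions?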

Lua code:

function replacePartOfString(arg,old,new)
  local zzz = arg.gsub(arg, old, new) 
  return zzz
end

function wordLengthIgnoringTashkeel(arg)
  local tashkeelArray = {"َ","ً","ُ","ٌ","ِ","ٍ","ْ","َ"}

  local tempWord = arg

  print("tempWord Before"..tempWord)
  for x=1,#tashkeelArray do
      replacePartOfString(tempWord,tashkeelArray[x],"")
  end
  print("tempWord After"..tempWord)
end

result

tempWord Beforeاليَوْمَ
tempWord Afterاليَوْمَ

expected result

tempWord Beforeاليَوْمَ
tempWord Afterاليوم

This works:

function replacePartOfString(arg,old,new) 
  return arg.gsub(arg, old, new) 
end

function wordLengthIgnoringTashkeel(arg)
  local tashkeelArray = {"َ","ً","ُ","ٌ","ِ","ٍ","ْ","َّ"}
  local tempWord = arg
  for x=1,#tashkeelArray do
      tempWord = replacePartOfString(tempWord,tashkeelArray[x],"")
  end
  return #tempWord
end
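
The fix comes down to one point: Lua strings are immutable, so gsub returns a new string instead of changing its argument, and the loop in the question discarded that return value. Below is a minimal sketch of the same idea, assuming Lua 5.3 or later (for utf8.len) and UTF-8 encoded text. The names TASHKEEL and stripTashkeel are illustrative and do not come from the posts here, and the marks are written as byte escapes so the code does not depend on invisible characters surviving copy and paste. Note that # counts bytes, not characters, so it over-counts Arabic text.

```lua
-- Tashkeel marks as UTF-8 byte escapes (illustrative table, not from the posts).
local TASHKEEL = {
  "\217\139", -- U+064B fathatan
  "\217\140", -- U+064C dammatan
  "\217\141", -- U+064D kasratan
  "\217\142", -- U+064E fatha
  "\217\143", -- U+064F damma
  "\217\144", -- U+0650 kasra
  "\217\145", -- U+0651 shadda
  "\217\146", -- U+0652 sukun
}

-- Hypothetical helper: returns a copy of s with the marks above removed.
local function stripTashkeel(s)
  for _, mark in ipairs(TASHKEEL) do
    -- gsub returns the new string (plus a match count); without the
    -- reassignment the result is thrown away and s stays unchanged.
    s = s:gsub(mark, "")
  end
  return s
end

-- ya, fatha, waw, sukun, meem
local word = "\217\138\217\142\217\136\217\146\217\133"
local bare = stripTashkeel(word)

print(#word, #bare)   -- byte lengths: 10 and 6
print(utf8.len(bare)) -- code points: 3
```

If the goal is the letter count that the function name suggests, utf8.len(bare) is probably closer to what is wanted than #bare.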

function wordLengthIgnoringTashkeel(arg)
  local tashkeelArray = {"َ","ً","ُ","ٌ","ِ","ٍ","ْ","̶"}
  local tempWord = arg

  print("tempWord Before"..tempWord)
  for x=1,#tashkeelArray do
    tempWord = string.gsub(tempWord, tashkeelArray[x],"")
  end
  print("tempWord After"..tempWord)
end

wordLengthIgnoringTashkeel("يَوْمو")

This code may help you; it worked for me. For a single file:

perl  -CS -pe 's/[\x{064B}-\x{0650}]|[\x{0618}-\x{061A}]|[\x{0652}-\x{0653}]|[\x{0652}-\x{0653}]+//g' < "$f" > "$f.txt" ;

For all files in a folder:

for f in *.txt; do 

perl  -CS -pe 's/[\x{064B}-\x{0650}]|[\x{0618}-\x{061A}]|[\x{0652}-\x{0653}]|[\x{0652}-\x{0653}]+//g' < "$f" > "$f.txt" ;

done

Regards
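
For anyone who wants the same range-based approach in Lua rather than Perl, here is a rough sketch. It assumes Lua 5.3 or later (utf8.codes and utf8.char) and valid UTF-8 input. The code point ranges are copied from the Perl command; the names RANGES, inRanges and stripByRange are made up for illustration.

```lua
-- Code point ranges taken from the Perl one-liner.
local RANGES = {
  {0x064B, 0x0650},
  {0x0618, 0x061A},
  {0x0652, 0x0653},
}

-- True if the code point cp falls inside any of the ranges.
local function inRanges(cp)
  for _, r in ipairs(RANGES) do
    if cp >= r[1] and cp <= r[2] then
      return true
    end
  end
  return false
end

-- Hypothetical helper: rebuilds s, skipping every code point in RANGES.
-- utf8.codes raises an error if s is not valid UTF-8.
local function stripByRange(s)
  local out = {}
  for _, cp in utf8.codes(s) do
    if not inRanges(cp) then
      out[#out + 1] = utf8.char(cp)
    end
  end
  return table.concat(out)
end

-- ya, fatha, waw, sukun, meem
print(stripByRange("\217\138\217\142\217\136\217\146\217\133"))
```

As in the Perl command, shadda (U+0651) sits outside these ranges and is left in place; adding {0x0651, 0x0651} to RANGES would remove it as well.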