How to combine two similar functions that convert between hiragana and katakana?
I have two functions that convert between katakana and hiragana, and they look almost identical:
katakana_minus_hiragana = 0x30a1 - 0x3041  # KATAKANA LETTER A - HIRAGANA A

def is_hirgana(char):
    return 0x3040 < ord(char[0]) and ord(char[0]) < 0x3097

def is_katakana(char):
    return 0x30a0 < ord(char[0]) and ord(char[0]) < 0x30f7

def hiragana_to_katakana(hiragana_text):
    katakana_text = ""
    max_len = 0
    for i, char in enumerate(hiragana_text):
        if is_hirgana(char):
            katakana_text += chr(ord(char) + katakana_minus_hiragana)
            max_len += 1
        else:
            break
    return katakana_text, max_len

def katakana_to_hiragana(katakana_text):
    hiragana_text = ""
    max_len = 0
    for i, char in enumerate(katakana_text):
        if is_katakana(char):
            hiragana_text += chr(ord(char) - katakana_minus_hiragana)
            max_len += 1
        else:
            break
    return hiragana_text, max_len
Is there a way to simplify hiragana_to_katakana() and katakana_to_hiragana() into a duck-typed function or a super/meta function?
For example, something like:
def convert_hk_kh(text, charset_range, offset):
    charset_start, charset_end = charset_range
    output_text = ""
    max_len = 0
    for i, char in enumerate(text):
        if charset_start < ord(char[0]) and ord(char[0]) < charset_end:
            output_text += chr(ord(char) + offset)
            max_len += 1
        else:
            break
    return output_text, max_len

def katakana_to_hiragana(katakana_text):
    return convert_hk_kh(katakana_text, (0x30a0, 0x30f7), -katakana_minus_hiragana)

def hiragana_to_katakana(hiragana_text):
    return convert_hk_kh(hiragana_text, (0x3040, 0x3097), katakana_minus_hiragana)
Are there other Pythonic ways to simplify these two very similar functions?
EDITED
Also, https://github.com/olsgaard/Japanese_nlp_scripts seems to do the same thing with str.translate. Is that more efficient? More Pythonic?
I would do it like this:
KATAKANA_HIRGANA_SHIFT = 0x30a1 - 0x3041  # KATAKANA LETTER A - HIRAGANA A

def shift_chars_prefix(text, amount, condition):
    output = ''
    for char in text:
        if not condition(char):
            break
        output += chr(ord(char) + amount)
    # Return the converted prefix and the number of characters converted.
    # (Returning the loop index instead would be off by one when the loop
    # never breaks, and undefined for empty input.)
    return output, len(output)

def katakana_to_hiragana(text):
    return shift_chars_prefix(text, -KATAKANA_HIRGANA_SHIFT, lambda c: '\u30a0' < c < '\u30f7')

def hiragana_to_katakana(text):
    return shift_chars_prefix(text, KATAKANA_HIRGANA_SHIFT, lambda c: '\u3040' < c < '\u3097')
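The same prefix-only behaviour can also be written with itertools.takewhile, which makes the "stop at the first non-matching character" rule explicit. This is only a sketch of one possible variant: the names KANA_SHIFT and shift_prefix are made up for illustration, and the inclusive bounds U+3041..U+3096 and U+30A1..U+30F6 are assumed to match the exclusive comparisons used above.

```python
from itertools import takewhile

# Hypothetical name; same value as the shift constant used above.
KANA_SHIFT = 0x30A1 - 0x3041  # KATAKANA LETTER A - HIRAGANA LETTER A


def shift_prefix(text, amount, first, last):
    """Shift the leading run of characters in [first, last] by `amount`.

    Returns the converted prefix and its length, like the functions above.
    """
    prefix = ''.join(takewhile(lambda c: first <= c <= last, text))
    return ''.join(chr(ord(c) + amount) for c in prefix), len(prefix)


# Conversion stops at the first non-hiragana character.
print(shift_prefix('ひらがなABC', KANA_SHIFT, '\u3041', '\u3096'))
# An empty string yields an empty prefix of length 0.
print(shift_prefix('', KANA_SHIFT, '\u3041', '\u3096'))
```

Taking the length from the prefix itself avoids having to maintain a separate counter.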
If you don't need to return the length of the replaced prefix, you can also use a regular expression:
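import re

KATAKANA_HIRGANA_SHIFT = 0x30a1 - 0x3041  # KATAKANA LETTER A - HIRAGANA A

def shift_by(n):
    # Build a replacement callback that shifts every matched character by n.
    def replacer(match):
        return ''.join(chr(ord(c) + n) for c in match.group(0))
    return replacer

# Note: re.sub returns the whole string with only the leading kana run
# converted; the rest of the text is kept unchanged.
def katakana_to_hiragana(text):
    # Katakana sits above hiragana, so shift down.
    return re.sub(r'^[\u30a1-\u30f6]+', shift_by(-KATAKANA_HIRGANA_SHIFT), text)

def hiragana_to_katakana(text):
    # Shift up to reach the katakana block.
    return re.sub(r'^[\u3041-\u3096]+', shift_by(KATAKANA_HIRGANA_SHIFT), text)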
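If the length is still wanted, a regular expression can supply it too: match the leading run once and read the length from the match object. The following is a minimal sketch under that assumption; HIRAGANA_PREFIX, KATAKANA_PREFIX, KANA_SHIFT and convert_prefix are illustrative names, not part of the code above.

```python
import re

# Hypothetical names; the ranges are the same ones used in the regex above.
KANA_SHIFT = 0x30A1 - 0x3041  # KATAKANA LETTER A - HIRAGANA LETTER A
HIRAGANA_PREFIX = re.compile(r'[\u3041-\u3096]*')
KATAKANA_PREFIX = re.compile(r'[\u30a1-\u30f6]*')


def convert_prefix(text, pattern, amount):
    """Convert the leading kana run and return (converted_prefix, length)."""
    # pattern.match anchors at the start; with `*` it always succeeds,
    # possibly matching the empty string.
    match = pattern.match(text)
    converted = ''.join(chr(ord(c) + amount) for c in match.group(0))
    return converted, match.end()


print(convert_prefix('カタカナ123', KATAKANA_PREFIX, -KANA_SHIFT))
print(convert_prefix('ひらがな', HIRAGANA_PREFIX, KANA_SHIFT))
```

Unlike the re.sub version, this returns only the converted prefix, which matches the return value of the functions in the question.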
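Here is a function that converts each kind of kana to the other kind. Unlike the functions given, it does not stop when it encounters non-kana; it simply passes those characters through unchanged.
Note that conversion between kana types is not quite as simple as this. For example, in hiragana a long "e" is written as ええ or えい (e.g. おねえ "older sister", せんせい "teacher"), whereas in katakana one uses the chōonpu (オネー, センセー). There are also kana characters outside the ranges you use.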
# Relies on katakana_minus_hiragana, is_hiragana() and is_katakana()
# as defined in the question.
def switch_kana_type(kana_text):
    """Replace each kind of kana with the other kind. Other characters are
    passed through unchanged."""
    output_text = ''
    for c in kana_text:
        if is_hiragana(c):  # Note typo fix of "is_hirgana"
            output_text += chr(ord(c) + katakana_minus_hiragana)
        elif is_katakana(c):
            output_text += chr(ord(c) - katakana_minus_hiragana)
        else:
            output_text += c
    return output_text, len(output_text)
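Since the question also asks about str.translate: the same pass-through conversion can be sketched with a translation table, which is a dict mapping code points to code points. This is a rough sketch, not the linked repository's code; the table names and switch_kana_with_translate are invented for illustration. The per-character work then happens inside str.translate rather than in a Python loop, which is likely faster on long strings, though that has not been measured here.

```python
# Hypothetical names throughout; ranges match is_hiragana/is_katakana above.
KANA_SHIFT = 0x30A1 - 0x3041  # KATAKANA LETTER A - HIRAGANA LETTER A

# str.translate accepts a dict mapping code points (ints) to code points.
HIRAGANA_TO_KATAKANA = {cp: cp + KANA_SHIFT for cp in range(0x3041, 0x3097)}
KATAKANA_TO_HIRAGANA = {cp: cp - KANA_SHIFT for cp in range(0x30A1, 0x30F7)}
SWITCH_KANA = {**HIRAGANA_TO_KATAKANA, **KATAKANA_TO_HIRAGANA}


def switch_kana_with_translate(text):
    """Swap hiragana and katakana; characters not in the table pass through."""
    return text.translate(SWITCH_KANA)


print(switch_kana_with_translate('ひらがな and カタカナ'))
# The long-vowel caveat still applies: a plain offset shift never
# produces the chōonpu (ー).
print('せんせい'.translate(HIRAGANA_TO_KATAKANA))
```

The last call gives センセイ rather than センセー, which is the limitation described above: a fixed offset maps character to character and knows nothing about spelling conventions.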