Removing Non-English Subheadings and Paragraphs

Question

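Hello, I have a script that removes subheadings and paragraphs, but I am unable to remove paragraphs that have non-English subheadings and words.

For example (original text):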

=== Personal finance ===
Protection against unforeseen personal events, as well as events in the wider economies
Transference of family wealth across generations (bequests and inheritance)

=== Corporate finance ===
Corporate finance deals with the sources of funding and the capital structure of corporations and the actions that managers take to increase the value of the firm to the shareholders.

== External links ==
Business acronyms and abbreviations
Business acronyms

== Kūrybinės Industrijos ==
Kūrybinės industrijos apima sritį ekonominių veiksnių, susitelkusių ties žinių ir informacijos generavimu arba tyrimu.

What I get from my code (result):

Protection against unforeseen personal events, as well as events in the wider economies
Transference of family wealth across generations (bequests and inheritance)

Corporate finance deals with the sources of funding and the capital structure of corporations and the actions that managers take to increase the value of the firm to the shareholders.

Kūrybinės industrijos apima sritį ekonominių veiksnių, susitelkusių ties žinių ir informacijos generavimu arba tyrimu.

This is what I want to achieve (expected result):

Protection against unforeseen personal events, as well as events in the wider economies
Transference of family wealth across generations (bequests and inheritance)

Corporate finance deals with the sources of funding and the capital structure of corporations and the actions that managers take to increase the value of the firm to the shareholders.

Here is the script:

import re

# Matches wiki-style section titles such as "== Title ==" or "=== Title ===".
section_title_re = re.compile(r"^=+\s+.*\s+=+$")

content = []
skip = False
with open('asd.text', 'r') as f1:  # file that contains the original text
    for l in f1.read().splitlines():
        line = l.strip()

        # Start skipping at the "External links" section.
        if "== external links ==" in line.lower():
            skip = True
            continue

        # Any other section title ends the skipped region; the title line itself is dropped.
        if section_title_re.match(line):
            skip = False
            continue
        if skip:
            continue
        content.append(line)

content = '\n'.join(content) + '\n'
with open('NoRef.text', 'w') as f2:  # write the result to a new file
    f2.write(content + "\n")

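Question: So far, my code is able to remove paragraphs under subheadings with a known name, such as "External Links".

But how do I remove the non-English subheadings and their paragraphs?

Thanks.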

Answer 1

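If you only want to detect whether a string contains non-English characters, that is simple: just try to decode it as ASCII. If that fails, it contains characters with codes above 127: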

try:
    utxt = txt.decode('ascii')
except UnicodeDecodeError:
    # txt contains non-"English" (non-ASCII) characters
    ...
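
The snippet above relies on `decode`, which is available on Python 2 byte strings (the question is tagged python-2.7) and on Python 3 `bytes` objects. As a minimal sketch, assuming Python 3 and ordinary text strings, the same test could be written with `encode` instead. The helper name `is_ascii` is made up for illustration and does not come from the original answer.

```python
def is_ascii(text):
    """Return True if every character in `text` has a code below 128."""
    try:
        # Encoding fails as soon as a character outside ASCII is found.
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    for title in ("External links", "Kūrybinės Industrijos"):
        print(title, is_ascii(title))
```

Note that this only looks at characters, so a heading in another language that happens to use plain ASCII letters would still pass.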

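If you want to detect whether it contains non-English words, that is a much harder problem, and you should ask yourself whether you want to accept badly written English words such as englich woerds badli writed. If you want to go down that road, good luck...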
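
One way the character-level check might be combined with the loop from the question is sketched below. It assumes Python 3, and it assumes that testing only the section title is good enough: a section is dropped when its title is in a list of known unwanted names or contains any character with a code above 127. The names `filter_sections` and `unwanted_titles` are hypothetical. Body text is deliberately not tested, since English paragraphs can legitimately contain accented letters or typographic quotes.

```python
import re

# Captures the title text between the "=" markers, e.g. "== Title ==".
SECTION_TITLE_RE = re.compile(r"^=+\s+(.*?)\s+=+$")


def filter_sections(text, unwanted_titles=("external links",)):
    """Drop section titles, plus the bodies of unwanted or non-ASCII sections."""
    kept = []
    skip = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        match = SECTION_TITLE_RE.match(line)
        if match:
            title = match.group(1)
            # Non-ASCII means at least one character code above 127.
            non_ascii = any(ord(ch) > 127 for ch in title)
            skip = title.lower() in unwanted_titles or non_ascii
            continue  # the title line itself is never kept
        if not skip:
            kept.append(line)
    return "\n".join(kept) + "\n"


if __name__ == "__main__":
    sample = (
        "=== Personal finance ===\n"
        "Protection against unforeseen personal events\n"
        "\n"
        "== External links ==\n"
        "Business acronyms\n"
        "\n"
        "== Kūrybinės Industrijos ==\n"
        "Kūrybinės industrijos apima sritį\n"
    )
    print(filter_sections(sample))
```

With the sample input, only the "Personal finance" paragraph survives. A non-English heading spelled entirely with ASCII letters would not be caught by this approach; that case falls under the harder word-level problem described above.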

python

wikipedia

non-english

wikipedia-api

python-2.7