How to strip SGML tags from a text file using Python?

Question

I recently came across Standard Generalized Markup Language (SGML). I have acquired a corpus in SGML format from the EMILLE/CIIL Corpus. Here is the documentation for this corpus:

EMILLE Corpus Documentation

I only want to extract the text from the files. According to the documentation, the corpus is encoded and marked up as follows:

The text is encoded as two-byte Unicode text. For more information on Unicode. The texts are marked up in SGML using level 1 CES-compliant markup. Each file also includes a full header, which specifies the provenance of the text.

I am having trouble stripping these tags. I tried using regular expressions and Beautiful Soup, but neither worked. Here is a sample text file. The language I want to keep is Punjabi.

Answer 1

Try the following:

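from bs4 import BeautifulSoup
import requests

# Assuming this is the URL where the file is
html = requests.get('http://www.lancaster.ac.uk/fass/projects/corpus/emille/MANUAL.htm').content

# Name the parser explicitly so the result does not depend on which parsers are installed
bsObj = BeautifulSoup(html, 'html.parser')

# Collect every <p> element in the document
textData = bsObj.find_all('p')

# Print the text of each paragraph with its tags removed
for item in textData:
    print(item.get_text())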
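The snippet above fetches a web page, whereas the question is about local corpus files. Below is a minimal sketch of the same parser-based idea applied to a string of markup. It swaps in the standard library's html.parser instead of Beautiful Soup so that it has no third-party dependency. The names TextExtractor and extract_text are hypothetical, and the element name cesHeader for the file header is an assumption about the CES markup that should be checked against the actual files.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect character data, ignoring tags and skipped elements."""

    def __init__(self, skip=("cesheader",)):
        super().__init__(convert_charrefs=True)
        # HTMLParser lowercases tag names, so the names are stored lowercased.
        # "cesheader" is an assumed name for the header element.
        self.skip = set(skip)
        self.depth = 0      # greater than 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.skip:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.skip and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if not self.depth:
            self.chunks.append(data)


def extract_text(markup):
    """Return the text content of `markup` with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any buffered character data
    return " ".join("".join(parser.chunks).split())
```

Called as extract_text(markup) on the decoded contents of one file, this should return the body text only, provided the header really is wrapped in the assumed element.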

Answer 2

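Alternatively, you can use a simple regular expression. If the data is a string containing tags that start with < and end with >, everything between those two characters is discarded. You can then collapse runs of whitespace into a single space and strip the leading and trailing whitespace from the data.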

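import re

# `data` is assumed to already hold the file contents as a string
data = re.sub(r'<.*?>', '', data)   # remove everything from < up to the next >
data = re.sub(r'\s+', ' ', data)    # collapse runs of whitespace into one space
data = data.strip()                 # trim leading and trailing whitespace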
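As a rough end-to-end sketch of this regex approach applied to a file, the code below decodes the input, strips the tags, and writes the result as UTF-8. It assumes the "two-byte Unicode" mentioned in the documentation means UTF-16 with a byte order mark, which may not hold for every file. The names strip_tags and convert_file are hypothetical. Two small changes from the lines above: the pattern <[^>]*> is used so that tags spanning several lines are also matched, and each tag is replaced by a space so that text from adjacent elements does not run together.

```python
import re
from pathlib import Path

TAG_RE = re.compile(r"<[^>]*>")   # [^>] also matches newlines, unlike "." by default
SPACE_RE = re.compile(r"\s+")


def strip_tags(data):
    """Remove anything between < and >, then normalise whitespace."""
    data = TAG_RE.sub(" ", data)
    data = SPACE_RE.sub(" ", data)
    return data.strip()


def convert_file(src, dst, encoding="utf-16"):
    """Read `src` in the assumed encoding and write tag-free UTF-8 to `dst`."""
    text = Path(src).read_text(encoding=encoding)
    Path(dst).write_text(strip_tags(text), encoding="utf-8")
```

Note that this keeps the text of the file header as well, since only the tags themselves are removed; if the header is unwanted, it has to be cut out before the tags are stripped.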

python

regex

unicode

sgml

beautifulsoup