How to sanitise a block of text in Python 3 with no external modules?

I was recently set a HackerRank task, and I could not properly strip the tags out of a block of text in Python 3 without breaking the text itself.

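Two example inputs were provided (below), and the challenge was to clean them so that they become safe, plain blocks of text. The time to complete the challenge is over, but I am confused about how I got something so simple wrong. Any help on how I should have approached it would be much appreciated.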

Test input one

It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy. <script>
var y=window.prompt("Hello")
window.alert(y)
</script>Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage.

Test input two

In-text references or citations are used to acknowledge the work or ideas of others. They are placed next to the text that you have paraphrased or quoted, enabling the reader to differentiate between your writing and other people’s work.  The full details of your in-text references, <script language="JavaScript">
document.write("Page. Last update:" + document.lastModified); </script>When quoting directly from the source include the page number if available and place quotation marks around the quote, e.g. 
The World Health Organisation defines driver distraction ‘as when some kind of triggering event external to the driver results in the driver shifting attention away from the driving task’.

Test suggested output 1

It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy. Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage.

Test suggested output 2

  In-text references or citations are used to acknowledge the work or ideas of others. They are placed next to the text that you have paraphrased or quoted, enabling the reader to differentiate between your writing and other people’s work. The full details of your in-text references, When quoting directly from the source include the page number if available and place quotation marks around the quote, e.g. The World Health Organisation defines driver distraction ‘as when some kind of triggering event external to the driver results in the driver shifting attention away from the driving task’.

Thanks in advance!

Edit (using @YakovDan's sanitize function): Code:

def sanitize(inp_str):

    ignore_flag =False
    close_tag_count = 0


    out_str =""
    for c in inp_str:
        if not ignore_flag:
           if c == '<':
               close_tag_count=2
               ignore_flag=True
           else:
               out_str+=c
        else:
            if c == '>':
                close_tag_count-=1

            if close_tag_count == 0:
                ignore_flag=False


    return out_str

inp=input()
print(sanitize(inp))

Input:

It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy. <script>
 var y=window.prompt("Hello")
 window.alert(y)
 </script>Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage.

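Output:

It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy.

What the output should be:

It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy. Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage.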

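In general, regular expressions are the wrong tool for parsing HTML tags (see here), but one is adequate for this job because the tags are simple. It will fail on irregular input (tags with no closing tag, and so on).

That said, for these two examples you can use this regex: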

<.*?>.*?<\s*?\/.*?>

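Implemented in Python:

import re

s = "..."  # replace with one of your long strings
# Raw string so that \s and \/ reach the regex engine unchanged;
# re.DOTALL lets '.' match the newlines inside the <script> block.
r = re.sub(r'<.*?>.*?<\s*?\/.*?>', '', s, flags=re.DOTALL)
print(r)

This gives the expected result (too long to reproduce here!).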
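To make the behaviour of that pattern concrete, here is a minimal sketch on a shortened sample rather than the full test input. The names `PAIRED_TAG` and `strip_paired_tags` are made up for illustration. The last two calls show the kind of irregular input mentioned above: an unclosed tag either swallows the text up to the next closing tag or is left in place.

```python
import re

# Opening tag, then anything (lazily), then a closing tag such as </script>.
# re.DOTALL lets '.' span the newlines inside the script body.
PAIRED_TAG = re.compile(r'<.*?>.*?<\s*?\/.*?>', flags=re.DOTALL)


def strip_paired_tags(text):
    """Remove each <tag>...</tag> span, including the text between the tags."""
    return PAIRED_TAG.sub('', text)


sample = ('still in their infancy. <script>\n'
          'var y=window.prompt("Hello")\n'
          'window.alert(y)\n'
          '</script>Contrary to popular belief')
print(strip_paired_tags(sample))

# Irregular input: <br> has no closing tag, so it pairs up with </script>
# and the text " b " in between is dropped as well.
print(strip_paired_tags('a <br> b <script>x</script> c'))

# With no closing tag anywhere, nothing matches and the tag is left in place.
print(strip_paired_tags('a <b> c'))
```

So the pattern is only a reasonable fit when every tag in the input is known to come as a matched open/close pair.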

Here is a way to do it without regular expressions.

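def sanitize(inp_str):
    ignore_flag = False   # True while we are inside a tag pair being dropped
    close_tag_count = 0   # number of '>' still expected before copying resumes

    out_str = ""
    for c in inp_str:
        if not ignore_flag:
            if c == '<':
                # Expect two '>': one ends the opening tag, one ends the closing tag.
                close_tag_count = 2
                ignore_flag = True
            else:
                out_str += c
        else:
            if c == '>':
                close_tag_count -= 1

            if close_tag_count == 0:
                ignore_flag = False

    return out_str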

This should do it (depending on the assumptions about the tags).
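One thing that may explain the truncated output in the question: `input()` returns a single line, while the sample inputs span several lines, so only the text before the first newline would ever reach `sanitize`. Below is a minimal sketch, assuming the challenge supplies the whole text on standard input (an assumption, since the task statement is not shown). It reads everything with `sys.stdin.read()` before sanitising. The helper `read_all_and_sanitize` and its `stream` parameter are hypothetical names used only for this illustration.

```python
import io
import sys


def sanitize(inp_str):
    """Drop everything from a '<' up to the second '>' that follows it."""
    out_chars = []
    ignoring = False
    closes_left = 0
    for c in inp_str:
        if not ignoring:
            if c == '<':
                closes_left = 2  # '>' of the opening tag, then '>' of the closing tag
                ignoring = True
            else:
                out_chars.append(c)
        elif c == '>':
            closes_left -= 1
            if closes_left == 0:
                ignoring = False
    return ''.join(out_chars)


def read_all_and_sanitize(stream=sys.stdin):
    # stream.read() returns every line; input() would stop at the first newline.
    return sanitize(stream.read())


# io.StringIO stands in for standard input so the sketch runs on its own.
demo = io.StringIO('before <script>\nvar y = 1\n</script>after\n')
print(read_all_and_sanitize(demo), end='')
```

In a real submission the call would simply be `print(read_all_and_sanitize())`. The scanner still assumes that each `<` opens a matched pair of tags and that no `>` appears inside the script body.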