Convert HTML code in .txt files into plain text

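I have a folder with hundreds of .txt files that contain HTML code. All file names and file paths are stored in a .csv file. I would like to convert the HTML code in each .txt file into plain text and save the file again.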

I read that html2text is a Python script that does what I need.

Could you help me with how to proceed?

main.py

from csv import DictReader
import requests
from bs4 import BeautifulSoup
import html2text

with open('Test.csv', 'r') as read_obj:
    csv_dict_reader = DictReader(read_obj)
    for row in csv_dict_reader:
        r = requests.get(row['FilePath'])
        content = r.content
        h = html2text.HTML2Text()

Test.csv

| FilePath,File | 
| -------- | 

| file:///C:/Users/UserUser/Desktop/Files/FirstFile.txt,FirstFile| 

| file:///C:/Users/UserUser/Desktop/Files/SecondFile.txt,SecondFile| 

Updated answer:

After some discussion in the comments below, my original answer won't cut it.

The structure of the file Test.csv is not something that DictReader from the csv module can parse. This is easily solved by writing a simple file parser.
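To see why DictReader struggles with this layout, here is a small sketch that feeds it the same lines from an in-memory string. The `sample` variable is only a stand-in for Test.csv, copied from the layout in the question. The pipe characters and spaces end up inside the header names, so a lookup like `row['FilePath']` would raise a KeyError.

```python
import io
from csv import DictReader

# Stand-in for Test.csv, copied from the layout shown in the question.
sample = (
    "| FilePath,File | \n"
    "| -------- | \n"
    "\n"
    "| file:///C:/Users/UserUser/Desktop/Files/FirstFile.txt,FirstFile| \n"
)

reader = DictReader(io.StringIO(sample))
rows = list(reader)

# The header names keep the pipes and spaces, so 'FilePath' is not a key.
print(reader.fieldnames)
print('FilePath' in reader.fieldnames)
```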

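The part below the two functions has not changed much. Instead of looping over the result of DictReader from the csv module, we loop over the result of the function readcsv.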

Updated code:

import html2text


h = html2text.HTML2Text()
h.ignore_links = True
h.bypass_tables = False

def cleanline(instring: str) -> list:
    """
    removes the offending crap and returns a list of strings
    """
    return instring.replace('|', '').replace('file:///', '').strip().split(',')


def readcsv(filename: str) -> list:
    """
    read the CSV file and create a list of dict items based on it.
    the result will be similar to what DictReader from the CSV module did, 
    but tailored to the specific file formatting that you are processing.
    """
    result = []
    with open(filename) as csv_infile:
        # get headers & clean the line
        header_list = cleanline(csv_infile.readline())
        
        # skip the line "| -------- |" by just reading the line and not processing it
        # note that this is not actually needed as the logic below
        # only handles lines that contain a comma character
        csv_infile.readline()

        # process the rest of the lines
        for line in csv_infile:
            # the check below is to check if it's an empty line or not
            # (by looking for the comma separator)
            if ',' in line:
                # basically I use the header_list to turn the current line
                # into a dict and add it to the result list
 
                # set/reset values
                line_list = cleanline(line)
                line_dict = {}
 
                # use the index to get the header from the headerline
                for index, item in enumerate(line_list):
                    line_dict[header_list[index]] = item
                result.append(line_dict)
    return result


for row in readcsv('Test.csv'):
    print(row)
    infilename = row['FilePath']
    # create a filename based on the File column
    outfilename = f"{row['File']}.txt"
    with open(infilename) as html_infile:
        # handle() expects a single string, so read the whole file at once
        text = h.handle(html_infile.read())
    with open(outfilename, 'w') as html_outfile:
        html_outfile.write(text)
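The question asks to save each file again, so if the converted text should replace the original file rather than go to a new one, a variation like the following might work. This is only a sketch: `convert_in_place` is a hypothetical helper name, the converter is passed in as any function that maps a string to a string, and UTF-8 is an assumed encoding that may not match your files.

```python
from pathlib import Path
from typing import Callable


def convert_in_place(path: str, convert: Callable[[str], str]) -> None:
    """Read a file as one string, convert it, and write it back to the same path."""
    target = Path(path)
    # UTF-8 is an assumption; adjust it to whatever your files actually use.
    html = target.read_text(encoding="utf-8")
    target.write_text(convert(html), encoding="utf-8")
```

Inside the loop above it would be called as `convert_in_place(row['FilePath'], h.handle)`. Since this overwrites the source files, it is worth trying it on a copy of the folder first.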

Original answer:

You are missing the last part, from the docs.

Note the change in the assignment of the content variable from content = r.content to content = r.text.

I also added a print statement, so you can see the difference between content and text.

from csv import DictReader
import requests
import html2text


with open('Test.csv', 'r') as read_obj:
    csv_dict_reader = DictReader(read_obj)
    for row in csv_dict_reader:
        r = requests.get(row['FilePath'])
        content = r.text
        print(content)
        h = html2text.HTML2Text()
        h.ignore_links = True
        h.bypass_tables = False
        text = h.handle(content)
        print(text)
        # edit to save the converted text to a file
        # for the filename I'm using the url, with some stripping
        # you need to test this code though as I wrote it on mobile
        with open(row['FilePath'].replace('/', '_').replace(':', ''), 'w') as outfile:
            outfile.write(text)
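As a small illustration of the r.content versus r.text difference: in requests, `r.content` is the raw response body as bytes, while `r.text` is the same body decoded to a str, and html2text works on text. The sketch below mimics that with a hand-made byte string instead of a real response. The names `raw` and `decoded` are illustrative, and UTF-8 is an assumed encoding.

```python
# Hand-made stand-in for a response body; no HTTP request is involved.
raw = "<p>Café</p>".encode("utf-8")  # comparable to r.content (bytes)
decoded = raw.decode("utf-8")        # comparable to r.text (str)

print(type(raw).__name__, type(decoded).__name__)
print(raw)
print(decoded)
```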

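The answer above was based on the wrong assumption that HTTP requests were being made. Below is the adjusted answer after the interaction in the comments.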

from csv import DictReader
import html2text


h = html2text.HTML2Text()
h.ignore_links = True
h.bypass_tables = False

with open('Test.csv', 'r') as read_obj:
    csv_dict_reader = DictReader(read_obj)
    for row in csv_dict_reader:
        infilename = row['FilePath']
        outfilename = row['File']  # I'm using this as it seems this column is what this is meant for
        with open(infilename) as html_infile:
            # handle() expects a single string, so read the whole file at once
            text = h.handle(html_infile.read())
        with open(outfilename, 'w') as html_outfile:
            html_outfile.write(text)
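Note that open() needs a plain filesystem path, so if the FilePath column still holds file:/// URIs as in the question, the prefix has to be removed first. A minimal sketch, assuming Windows-style drive-letter paths like the ones shown in the question; `uri_to_path` is a hypothetical helper name, and it also decodes percent-escapes such as %20 in case a path contains spaces.

```python
from urllib.parse import unquote


def uri_to_path(uri: str) -> str:
    """Strip a leading 'file:///' and decode percent-escapes such as %20."""
    prefix = "file:///"
    if uri.startswith(prefix):
        uri = uri[len(prefix):]
    return unquote(uri)


print(uri_to_path("file:///C:/Users/UserUser/Desktop/Files/FirstFile.txt"))
```

In the loop this would become `infilename = uri_to_path(row['FilePath'])`. For POSIX-style paths the leading slash would need to be kept, so this sketch is not a general file URI parser.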