Convert HTML code in .txt files into plain text
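I have a folder with several hundred .txt files that contain HTML code. All file names and file paths are stored in a .csv file.
I would like to convert the HTML code in each .txt file into plain text and save the file again.
I have read that html2text is a Python script that can do what I need.
Can you help me with how to proceed?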
main.py
from csv import DictReader
import requests
from bs4 import BeautifulSoup
import html2text

with open('Test.csv', 'r') as read_obj:
    csv_dict_reader = DictReader(read_obj)
    for row in csv_dict_reader:
        r = requests.get(row['FilePath'])
        content = r.content
        h = html2text.HTML2Text()
Test.csv
| FilePath,File |
| -------- |
| file:///C:/Users/UserUser/Desktop/Files/FirstFile.txt,FirstFile|
| file:///C:/Users/UserUser/Desktop/Files/SecondFile.txt,SecondFile|
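Updated answer:
After some discussion in the comments below, my original answer doesn't cut it.
The file Test.csv is not structured in a way that DictReader from the csv module can parse. This is easily solved by writing a simple file parser.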
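To see why DictReader struggles with this file, here is a small sketch that feeds the same lines to it from an in-memory string. The io.StringIO wrapper and the variable names are my own, purely for illustration. The pipe characters and spaces end up inside the column names, so a lookup like row['FilePath'] would raise a KeyError.

```python
import csv
import io

# The same lines as Test.csv, held in memory for illustration.
sample = (
    "| FilePath,File |\n"
    "| -------- |\n"
    "| file:///C:/Users/UserUser/Desktop/Files/FirstFile.txt,FirstFile|\n"
)

reader = csv.DictReader(io.StringIO(sample))
rows = list(reader)

# The header keys keep the pipes and spaces, so 'FilePath' is not a key.
print(reader.fieldnames)
has_filepath_key = 'FilePath' in rows[0]
print(has_filepath_key)
```

Note that the separator line "| -------- |" is also returned as a data row, which is another reason to parse the file by hand.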
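The part below the two functions has not changed much. Instead of iterating over the result of DictReader from the csv module, we iterate over the result of the function readcsv.
Updated code: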
import html2text

h = html2text.HTML2Text()
h.ignore_links = True
h.bypass_tables = False


def cleanline(instring: str) -> list:
    """
    strips the pipe characters and the file:/// prefix
    and returns a list of strings
    """
    return instring.replace('|', '').replace('file:///', '').strip().split(',')


def readcsv(filename: str) -> list:
    """
    read the CSV file and create a list of dict items based on it.
    the result will be similar to what DictReader from the CSV module did,
    but tailored to the specific file formatting that you are processing.
    """
    result = []
    with open(filename) as csv_infile:
        # get headers & clean the line
        header_list = cleanline(csv_infile.readline())
        # skip the line "| -------- |" by just reading the line and not processing it
        # note that this is not actually needed as the logic below
        # only handles lines that contain a comma character
        csv_infile.readline()
        # process the rest of the lines
        for line in csv_infile:
            # the check below is to check if it's an empty line or not
            # (by looking for the comma separator)
            if ',' in line:
                # basically I use the header_list to turn the current line
                # into a dict and add it to the result list
                # set/reset values
                line_list = cleanline(line)
                line_dict = {}
                # use the index to get the header from the headerline
                for index, item in enumerate(line_list):
                    line_dict[header_list[index]] = item
                result.append(line_dict)
    return result


for row in readcsv('Test.csv'):
    print(row)
    infilename = row['FilePath']
    # create a filename based on the File column
    outfilename = f"{row['File']}.txt"
    with open(infilename) as html_infile:
        # handle() expects one string, so use read() rather than readlines()
        text = h.handle(html_infile.read())
    with open(outfilename, 'w') as html_outfile:
        html_outfile.write(text)
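The question asks to save each file again, while the loop above writes a new file named after the File column into the current directory. If overwriting the original is what you want, something like the following might work. This is only a sketch: convert_in_place is a hypothetical helper of mine, and the UTF-8 encoding is an assumption you should check against your files.

```python
from pathlib import Path
from typing import Callable


def convert_in_place(path: str, convert: Callable[[str], str],
                     encoding: str = "utf-8") -> None:
    """Read a file as one string, convert it, and overwrite the same file."""
    target = Path(path)
    # Read the whole file first, then overwrite it with the converted text.
    html = target.read_text(encoding=encoding)
    target.write_text(convert(html), encoding=encoding)
```

Inside the loop you would call it as convert_in_place(row['FilePath'], h.handle). Try it on a copy of the folder first, since the original HTML is lost once the file is overwritten.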
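Original answer:
You missed the last part, from the docs.
Note the change in the assignment of the content variable from content = r.content to content = r.text.
I also added a print statement so you can see the difference between content and text.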
from csv import DictReader
import requests
import html2text

with open('Test.csv', 'r') as read_obj:
    csv_dict_reader = DictReader(read_obj)
    for row in csv_dict_reader:
        r = requests.get(row['FilePath'])
        content = r.text
        print(content)
        h = html2text.HTML2Text()
        h.ignore_links = True
        h.bypass_tables = False
        text = h.handle(content)
        print(text)
        # edit to save the converted text to a file
        # for the filename I'm using the url, with some stripping
        # you need to test this code though as I wrote it on mobile
        with open(row['FilePath'].replace('/', '_').replace(':', ''), 'w') as outfile:
            outfile.write(text)
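The switch from r.content to r.text matters because, in requests, content is the raw body as bytes while text is the decoded str, and as far as I know handle() wants a str. A minimal illustration without any network access, using a made-up byte string standing in for r.content and assuming UTF-8:

```python
# Hypothetical stand-in for r.content: the undecoded body as bytes.
raw_content = "<p>Caf\u00e9</p>".encode("utf-8")

# Roughly what r.text provides: the same body decoded into a str.
decoded_text = raw_content.decode("utf-8")

print(type(raw_content).__name__)
print(type(decoded_text).__name__)
```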
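The answer above was based on the wrong assumption that HTTP requests were involved. Below is the adjusted answer after the exchange in the comments.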
from csv import DictReader
import html2text

h = html2text.HTML2Text()
h.ignore_links = True
h.bypass_tables = False

with open('Test.csv', 'r') as read_obj:
    csv_dict_reader = DictReader(read_obj)
    for row in csv_dict_reader:
        infilename = row['FilePath']
        outfilename = row['File']  # I'm using this as it seems this column is what this is meant for
        with open(infilename) as html_infile:
            # handle() expects one string, so use read() rather than readlines()
            text = h.handle(html_infile.read())
        with open(outfilename, 'w') as html_outfile:
            html_outfile.write(text)
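One thing to watch with this version: the FilePath values in Test.csv are file:/// URLs, and open() wants a plain path. A rough sketch of a converter follows. file_url_to_path is a hypothetical helper name of mine, and it assumes Windows-style URLs with a drive letter as in the question; on Linux or macOS you would need to keep the leading slash.

```python
from urllib.parse import unquote


def file_url_to_path(url: str) -> str:
    """Turn a file:/// URL with a Windows drive letter into a plain path."""
    prefix = "file:///"
    if url.startswith(prefix):
        url = url[len(prefix):]
    # Undo percent-encoding such as %20 for spaces.
    return unquote(url)
```

You could then write infilename = file_url_to_path(row['FilePath']) before opening the file.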