BeautifulSoup4 with Python3 - How to separate and write my output data in different files with a rule?

Question

Here is my code:

from bs4 import BeautifulSoup
import requests

getUrl= 'https://ta.wikipedia.org/wiki/அலெக்சா இணையம்'
url = getUrl
content = requests.get(url).content
soup = BeautifulSoup(content,'lxml')
heading = soup.title
refError = soup.findAll ('span', { 'class' : "error mw-ext-cite-error"})
for error in refError:
    err_str = str(error)
    err_str=err_str.replace("<span", heading.text+"~ <span").replace(" - தமிழ் விக்கிப்பீடியா", "")
    print(err_str)

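This is my output data. Each record starts with the page name followed by ~ and ends with the closing </span> tag.

For example (remember, this is a single line):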

அல்த்தாய் பிரதேசம்~ <span class="error mw-ext-cite-error" dir="ltr"
lang="ta" xml:lang="ta">பிழை காட்டு: Invalid <code>&lt;ref&gt;</code>
tag; name "2010Census" defined multiple times with different
content</span>

Just before this closing </span> tag, the output always ends with one of the following reference-error messages, depending on the Wikipedia page:

not used in prior text

or

குறிச்சொல்லுக்கு உரையேதும் வழங்கப்படவில்லை

or

defined multiple times with different content

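If I run this code for 1000 getUrl values (page names), I get 1000 output records. Now I want to group the pages that have the same error message into .txt files, like this:

pages with ref-error message --> not used in prior text.txt
pages with ref-error message --> குறிச்சொல்லுக்கு.txt
pages with ref-error message --> defined multiple times with different content.txt

How can I do this?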

Answer 1

Here is one way to solve the problem. Try the following source code:

from bs4 import BeautifulSoup
import requests
import re

# Url of the webpage to be scraped
getUrl= 'https://ta.wikipedia.org/wiki/அலெக்சா இணையம்'
url = getUrl
content = requests.get(url).content

# Patterns to be checked
pattern1 = re.compile(r'not used in prior text')
pattern2 = re.compile(r'குறிச்சொல்லுக்கு உரையேதும் வழங்கப்படவில்லை')
pattern3 = re.compile(r'defined multiple times with different content')

# Respective Error files
error_file1 = open("not_used_in_prior_text.txt", "w", encoding="utf-8")
error_file2 = open("குறிச்சொல்லுக்கு_உரையேதும்_வழங்கப்படவில்லை.txt", "w", encoding = "utf-8")
error_file3 = open("defined_multiple_times_with_different_content.txt", "w", encoding = "utf-8")
error_file4 = open("Anomalous_Errors.txt","w", encoding = "utf-8")

soup = BeautifulSoup(content,'lxml')
heading = soup.title
refError = soup.findAll ('span', { 'class' : "error mw-ext-cite-error"})

# Check each error against the patterns and save it in the matching file.
# A newline is appended so that every record stays on its own line.
for error in refError:
    err_str = str(error)
    err_str=err_str.replace("<span", heading.text+"~ <span").replace(" - தமிழ் விக்கிப்பீடியா", "")
    if pattern1.search(err_str):
        error_file1.write(err_str + "\n")
    elif pattern2.search(err_str):
        error_file2.write(err_str + "\n")
    elif pattern3.search(err_str):
        error_file3.write(err_str + "\n")
    else:
        # Anything that matches none of the known messages
        error_file4.write(err_str + "\n")
    print(err_str)

# Close the files
error_file1.close()
error_file2.close()
error_file3.close()
error_file4.close()

Edited source code 2

from bs4 import BeautifulSoup 
import requests 
import re 

# Url of the webpage to be scraped 
getUrl= 'https://ta.wikipedia.org/wiki/அலெக்சா இணையம்' 
url = getUrl 
content = requests.get(url).content 

# Patterns to be checked 
pattern1 = re.compile(r'not used in prior text') 
pattern2 = re.compile(r'குறிச்சொல்லுக்கு உரையேதும் வழங்கப்படவில்லை') 
pattern3 = re.compile(r'defined multiple times with different content') 

# Respective Error files 
error_file1 = open("not_used_in_prior_text.txt", "w", encoding="utf-8") 
error_file2 = open("குறிச்சொல்லுக்கு_உரையேதும்_வழங்கப்படவில்லை.txt", "w", encoding = "utf-8") 
error_file3 = open("defined_multiple_times_with_different_content.txt", "w", encoding = "utf-8") 
error_file4 = open("Anomalous_Errors.txt","w", encoding = "utf-8") 

soup = BeautifulSoup(content,'lxml') 

heading = soup.title.text
heading = heading.replace(" - தமிழ் விக்கிப்பீடியா", "")
print(heading) # you can comment this line out
refError = soup.findAll ('span', { 'class' : "error mw-ext-cite-error"}) 

# Check each error against the patterns and save it in the matching file.
# Only the text of the error is used here, without the surrounding <span> markup.
for error in refError:
    err_str = error.text
    print_error = heading+" ~ "+err_str
    if pattern1.search(err_str):
        error_file1.write(print_error + "\n")
    elif pattern2.search(err_str):
        error_file2.write(print_error + "\n")
    elif pattern3.search(err_str):
        error_file3.write(print_error + "\n")
    else:
        # Anything that matches none of the known messages
        error_file4.write(print_error + "\n")
    print(print_error)

# Close the files 
error_file1.close() 
error_file2.close() 
error_file3.close() 
error_file4.close()
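
The code above handles a single page, and opening the files with mode "w" truncates them on every run. For the 1000-page case in the question, one possible approach is to collect all (page title, error text) pairs first and write each group once at the end. The following is only a minimal sketch of that idea using the standard library; the names classify, group_records and write_groups are hypothetical helpers, not part of BeautifulSoup or requests, and the file names are simply the ones used above.

```python
from collections import defaultdict
from pathlib import Path

# Substring to look for -> output file name (same rules as the patterns above).
ERROR_FILES = {
    "not used in prior text": "not_used_in_prior_text.txt",
    "குறிச்சொல்லுக்கு உரையேதும் வழங்கப்படவில்லை": "குறிச்சொல்லுக்கு_உரையேதும்_வழங்கப்படவில்லை.txt",
    "defined multiple times with different content": "defined_multiple_times_with_different_content.txt",
}
FALLBACK_FILE = "Anomalous_Errors.txt"


def classify(message):
    """Return the output file name for one cite-error message (hypothetical helper)."""
    for needle, file_name in ERROR_FILES.items():
        if needle in message:
            return file_name
    return FALLBACK_FILE


def group_records(records):
    """Group (page_title, message) pairs by the file they belong in."""
    grouped = defaultdict(list)
    for title, message in records:
        grouped[classify(message)].append(f"{title} ~ {message}")
    return dict(grouped)


def write_groups(grouped, out_dir="."):
    """Write every group to its own UTF-8 file, one record per line."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for file_name, lines in grouped.items():
        (out / file_name).write_text("\n".join(lines) + "\n", encoding="utf-8")
```

To use it, run the scraping part once per page, append (heading, error.text) to a list for every span found, and then call write_groups(group_records(records)) after the last page has been processed.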


python

wikipedia

beautifulsoup

python-3.x