Sorting out HTML Table in Python (counting rows correctly)

Question

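I would like to count the entries in the columns Change, Status, and Req._Type by value. For example, NEW OBJECT occurs twice.

The Change column has the following values: NEW OBJECT, OBJECT DELETED, Attribute "Object Text" Changed, Attribute "Object Heading" Changed

The Status column has the values: In Review (or other arbitrary values)

The Req._Type column has the following values: functional Req., Info., Überschrift (or other arbitrary values)

Attempted solution (repl.it has a nice online IDE):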

#!/usr/bin/python

import re

from bs4 import BeautifulSoup

with open('Test2.html', 'rb') as f:

    contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    strings = soup.find_all(string=re.compile('NEW OBJECT'))
    strings2 = soup.find_all(string=re.compile('OBJECT DELETED'))
    strings3 = soup.find_all(string=re.compile('Attribute "Object Text" Changed'))
    strings4 = soup.find_all(string=re.compile('Attribute "Object Heading" Changed'))
    strings5 = soup.find_all(string=re.compile("Info."))
    strings6 = soup.find_all(string=re.compile("functional Req."))
    strings7 = soup.find_all(string=re.compile('Überschrift'))
    strings8 = soup.find_all(string=re.compile('In Review'))
    print(len(strings))
    print(len(strings2))
    print(len(strings3))
    print(len(strings4))
    print(len(strings5))
    print(len(strings6))
    print(len(strings7))
    print(len(strings8))
    #strings3 = soup.find_all(string=re.compile('Changed'))
    #print(strings3)

    #for txt in strings3:
        #print(' '.join(txt.split()))

    #for tag in soup.find_all('th'):
    #    print(f'{tag.name}: {tag.text}')
    
    for tag in soup.find_all('td'):
        new = f'{tag.text}'
        if(new.find('Info.') != -1):
            print ("Found!")
            #print(soup.select('b:nth-of-type(3)'))
        else:
            print ("Not found!")

Corresponding output:

2
1
1
0
3
1
0
3
Not found!
Not found!
Not found!
Not found!
Not found!
Not found!
Not found!
Not found!
Not found!
Not found!
Found!
Not found!
Not found!
Found!
Not found!
Not found!
Not found!
Not found!
Not found!
Not found!
Found!
Not found!
Not found!
Not found!
Not found!
Not found!
Not found!
Not found!
Not found!

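My attempted solution is not dynamic and does not match on columns. It uses find_all to search the whole document for matching expressions and counts them, which is not optimal.

a) How can this be made dynamic, so that only the columns mentioned are considered and we get a counter for each value category in those three columns? In the given example, the "Info." value is incorrectly found 3 times, although it should only be found twice, which is the correct answer. This needs to be done for every value of the three columns.

b) How can I output the counters for filter combinations: NEW OBJECT and functional Req. (=0), NEW OBJECT and Info. (=1), OBJECT DELETED and functional Req. (=1)? I tried different approaches from here, but could not get them to work.

c) Optional question: the Status and Req._Type columns can have different values depending on how the table is defined, so the values can change and are not fixed. Can we determine these values (by collecting the unique values into an array or list) and then count how often each unique value occurs in the affected column?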

Answer 1

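I'm not sure I understand your second and third questions (and, as SO policy requires, you should post each question separately), but here is a way to solve the first one, and it may help you with the others as well.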

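from io import StringIO

import pandas as pd

ht = """[your html]"""
targets = ['Change', 'Status', 'Req._Type']
# read_html returns one DataFrame per <table>; index 1 picks the second table.
# Wrapping the string in StringIO avoids the deprecation warning that newer
# pandas versions emit for literal HTML input.
df = pd.read_html(StringIO(ht))[1]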

for target in targets:
    print(df[target].value_counts())
    print('---')

Output:

NEW OBJECT                         2
Attribute "Object Text" Changed    1
OBJECT DELETED                     1
Name: Change, dtype: int64
---
In Review    3
Name: Status, dtype: int64
---
Info.              2
functional Req.    1
Name: Req._Type, dtype: int64
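
For the combination counts in (b) and the non-fixed values in (c), the same DataFrame can probably be reused. The following is only a rough sketch under assumptions: the original HTML is not shown here, so `df` is a hand-built stand-in whose rows are made up to match the counts above, and `count_matching` is a hypothetical helper, not part of pandas. In practice `df` would be the frame returned by `pd.read_html` in the snippet above.

```python
import pandas as pd

# Hypothetical stand-in for the parsed table; the rows are invented so that
# the per-column counts agree with the value_counts() output shown above.
df = pd.DataFrame(
    {
        "Change": [
            "NEW OBJECT",
            "NEW OBJECT",
            "OBJECT DELETED",
            'Attribute "Object Text" Changed',
        ],
        "Status": ["In Review", None, "In Review", "In Review"],
        "Req._Type": ["Info.", None, "functional Req.", "Info."],
    }
)


def count_matching(frame, conditions):
    """Count rows where every column in `conditions` equals the given value."""
    mask = pd.Series(True, index=frame.index)
    for column, value in conditions.items():
        # Missing cells compare as False, so they never count as a match.
        mask &= frame[column].eq(value)
    return int(mask.sum())


# (b) Counters for specific filter combinations.
print(count_matching(df, {"Change": "NEW OBJECT", "Req._Type": "functional Req."}))  # 0
print(count_matching(df, {"Change": "NEW OBJECT", "Req._Type": "Info."}))  # 1
print(count_matching(df, {"Change": "OBJECT DELETED", "Req._Type": "functional Req."}))  # 1

# (b) All combinations at once; rows with a missing value are left out.
table = pd.crosstab(df["Change"], df["Req._Type"])
print(table)

# (c) Nothing is hard-coded: the distinct values are read from the data.
for column in ["Status", "Req._Type"]:
    print(column, list(df[column].dropna().unique()))
```

Since `value_counts()` and `crosstab` both derive their categories from whatever is in the column, new Status or Req._Type values should show up without any code changes.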
