BeautifulSoup: get the text between tags for one line
I have a bunch of HTML documents produced by a GCOV branch and line coverage tool.
The files look like this:
<tr>
<td align="right" class="lineno"><pre>224</pre></td>
<td align="right" class="linebranch"><span class="takenBranch" title="Branch 1 taken 329 times">✓</span><span class="notTakenBranch" title="Branch 2 not taken">✗</span><span class="notTakenBranch" title="Branch 4 not taken">✗</span><span class="takenBranch" title="Branch 5 taken 329 times">✓</span><br/><span class="notTakenBranch" title="Branch 6 not taken">✗</span><span class="takenBranch" title="Branch 7 taken 329 times">✓</span></td>
<td align="right" class="linecount coveredLine"><pre>329</pre></td>
<td align="left" class="src coveredLine"><pre> line of C++ code</pre></td>
</tr>
<tr>
<td align="right" class="lineno"><pre>225</pre></td>
<td align="right" class="linebranch"></td>
<td align="right" class="linecount uncoveredLine"><pre></pre></td>
<td align="left" class="src uncoveredLine"><pre> another line of C++ code;</pre></td>
</tr>
I want to extract the text "(another) line of C++ code", ideally together with the line number, so that the output looks like this:
224 line of C++ code
225 another line of C++ code
I tried using BeautifulSoup, but it does not produce the requested output. My code looks like this:
from itertools import islice
import codecs
import glob
from ntpath import join
import os
from bs4 import BeautifulSoup

lineNo = "<td align=\"right\" class=\"lineNo\"><pre>"
linetextCovered = "<td align=\"left\" class=\"src coveredLine\"><pre>"
linetextNotCovered = "<td align=\"left\" class=\"src uncoveredLine\"><pre>"

open('Output.txt', 'w').close() #Erase any content of Output.txt file
for filepath in glob.iglob('path/To/Reports/*.html'):
    with codecs.open(os.path.join(filepath), "r") as inputFile, open('Output.txt',"a") as outputFile:
        for num, line in enumerate(inputFile, 1):
            if lineNo in line:
                inputSoup = BeautifulSoup(line)
                text = inputSoup.getText()
                outputFile.write("".join(islice(text, 1) + "\t"))
            if linetextCovered or linetextNotCovered in line:
                inputSoup = BeautifulSoup(line)
                text = inputSoup.getText()
                outputFile.write("".join(islice(text, 4)))
                outputFile.write("\n")
print("Done")
But the output looks like this:
/* L
a:li
{
colo
text
}
What am I doing wrong?
Any help is much appreciated.
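Two details of the posted loop appear to explain the output. The condition `linetextCovered or linetextNotCovered in line` is parsed as `linetextCovered or (linetextNotCovered in line)`, and a non-empty string is always truthy, so every line of the file passes the test, CSS included; `islice(text, 4)` then keeps only the first four characters of each. In addition, the `lineNo` marker contains `class="lineNo"` with a capital N while the report uses `class="lineno"`, so that check never matches. The small sketch below reproduces both effects; `css_line` is a made-up stand-in for a line from the report's style block, not taken from the real files.

```python
# Marker strings shaped like the ones in the question (note the capital N in "lineNo").
line_no_marker = '<td align="right" class="lineNo"><pre>'
covered_marker = '<td align="left" class="src coveredLine"><pre>'
uncovered_marker = '<td align="left" class="src uncoveredLine"><pre>'

css_line = "a:link { color: red; }"  # hypothetical line from the report's <style> block
report_line = '<td align="right" class="lineno"><pre>224</pre></td>'

# `A or B in line` parses as `A or (B in line)`; a non-empty string A is truthy,
# so the condition holds for every line, including CSS.
always_true = bool(covered_marker or uncovered_marker in css_line)

# Testing each marker separately gives the intended result.
intended = any(m in css_line for m in (covered_marker, uncovered_marker))

# The substring test is case-sensitive, so "lineNo" never matches "lineno".
case_match = line_no_marker in report_line

print(always_true, intended, case_match)  # True False False
```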
You can do it like this:
from bs4 import BeautifulSoup
html = '''
<tr>
<td align="right" class="lineno"><pre>224</pre></td>
<td align="right" class="linebranch"><span class="takenBranch" title="Branch 1 taken 329 times">✓</span><span class="notTakenBranch" title="Branch 2 not taken">✗</span><span class="notTakenBranch" title="Branch 4 not taken">✗</span><span class="takenBranch" title="Branch 5 taken 329 times">✓</span><br/><span class="notTakenBranch" title="Branch 6 not taken">✗</span><span class="takenBranch" title="Branch 7 taken 329 times">✓</span></td>
<td align="right" class="linecount coveredLine"><pre>329</pre></td>
<td align="left" class="src coveredLine"><pre> line of C++ code</pre></td>
</tr>
<tr>
<td align="right" class="lineno"><pre>225</pre></td>
<td align="right" class="linebranch"></td>
<td align="right" class="linecount uncoveredLine"><pre></pre></td>
<td align="left" class="src uncoveredLine"><pre> another line of C++ code;</pre></td>
</tr>
'''
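# Parse the whole document once and walk the table rows,
# instead of feeding BeautifulSoup one physical line at a time.
for tr in BeautifulSoup(html, 'html.parser').find_all('tr'):
    # Matching on a single class also finds cells with several classes,
    # e.g. class="src coveredLine".
    lineno = tr.find('td', {'class': 'lineno'}).text.strip()
    src = tr.find('td', {'class': 'src'}).text.strip()
    print(lineno, src)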
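To apply the same idea to the report files on disk rather than an inline string, something along these lines might work. It is only a sketch: the helper names `extract_rows` and `write_report` are made up for illustration, the files are assumed to be UTF-8 encoded, and rows without both a `lineno` and a `src` cell are assumed to be safe to skip.

```python
import glob

from bs4 import BeautifulSoup


def extract_rows(html):
    """Return (line number, source text) pairs for each table row in html."""
    rows = []
    for tr in BeautifulSoup(html, "html.parser").find_all("tr"):
        lineno = tr.find("td", class_="lineno")
        src = tr.find("td", class_="src")
        if lineno is None or src is None:
            continue  # assumption: rows lacking either cell are not source lines
        rows.append((lineno.get_text().strip(), src.get_text().strip()))
    return rows


def write_report(pattern, out_path):
    """Parse every file matching pattern and write one tab-separated line per row."""
    # Opening with "w" truncates the output file, so no separate erase step is needed.
    with open(out_path, "w", encoding="utf-8") as out:  # assumption: UTF-8 reports
        for path in sorted(glob.glob(pattern)):
            with open(path, encoding="utf-8") as report:
                for lineno, src in extract_rows(report.read()):
                    out.write(f"{lineno}\t{src}\n")
```

Calling `write_report('path/To/Reports/*.html', 'Output.txt')` would then collect the rows of all reports into a single file, using the path and file name from the question.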