How can I clean my code content once bs4 scrapes the code snippet?
I'm trying to scrape all of the content inside the code tags, but my output looks odd at code_snippet = soup.find('code'), because it prints different data for each entry, like this:
<code class="language-plaintext highlighter-rouge">backend/src</code>
None
hh2019/09/22/dragonteaser19-rms/
<code>What do?
list [p]ending requests
list [f]inished requests
[v]iew result of request
[a]dd new request
[q]uit
Choice? [pfvaq]
</code>
None
hh2019/01/02/exploiting-math-expm1-v8/
<code class="language-plaintext highlighter-rouge">nc 35.246.172.142 1</code>
None
hh2018/12/23/xmas18-white-rabbit/
<code class="MathJax_Preview">n</code>
None
hh2018/12/02/pwn2win18-tpm20/
<code>Welcome to my trusted platform. Tell me what do you want:
hh2018/05/21/rctf18-stringer/
<code class="language-plaintext highlighter-rouge">calloc</code>
None
However, printing soup = BeautifulSoup(content['value'], "html.parser") returns the correct data, including the pre > code blocks. I'm only interested in the content inside those tags, which looks like this:
<h3 id="overview">Overview</h3>
<p>The challenge shipped with several cave templates.
A user can build a cave from an existing template and populate it with treasures in random positions.
For caves created by the gamebot, the treasures are flags.
Any user can visit a cave by providing a program written in a custom programming language.
The program has to navigate around the cave.
If it terminates on a treasure, the treasure’s contents will be printed.</p>
<p>I was drawn to this challenge because the custom programming language is compiled to machine code using LLVM, and then executed.
It seemed like a fun place to look for bugs.</p>
<p>The challenge ships the backend’s source code in <code class="language-plaintext highlighter-rouge">backend/src</code>, some program samples in <code class="language-plaintext highlighter-rouge">backend/samples</code>, and the prebuilt binaries in <code class="language-plaintext highlighter-rouge">backend/build</code>.
The <code class="language-plaintext highlighter-rouge">backend/build/SaarlangCompiler</code> executable is a standalone compiler for the language.
It’s useful for testing, but it is not used in the challenge.
The actual server is <code class="language-plaintext highlighter-rouge">backend/build/SchlossbergCaveServer</code>.
It binds to the local port 9081, and it is exposed to other teams through a nginx reverse proxy on port 9080.
I will use port 9081 in examples and exploits so that they can be tested locally without nginx.</p>
<h3 id="api-interactions">API interactions</h3>
<p>The APIs are defined in <code class="language-plaintext highlighter-rouge">backend/src/api.cpp</code>.
We will take a look at some typical API interactions.
I will prettify JSON responses for your convenience.</p>
<p>First, we need to register a user:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ curl -c cookies -X POST -H 'Content-Type: application/json' \
-d '{"username": "abiondo", "password": "secret"}' \
http://localhost:9081/api/users/register
{
"username": "abiondo"
}
</code></pre></div></div>
I want to scrape every <pre *><code> block and clean it with code_snippet.get_text(), but I'm not sure what I'm missing here. I'm using an asyncio + feedparser + bs4 scraper, and at some point it gives me the wrong data:
# entries comes from feedparser (feedparser.parse(...).entries)
for entry in entries:
    print(entry['link'])
    for content in entry['content']:
        soup = BeautifulSoup(content['value'], "html.parser")
        code_snippet = soup.find('code')  # find() returns only the first <code>, or None
        print(soup)
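For what it's worth, soup.find('code') returns only the first <code> element in each content block, and returns None when a block contains no <code> at all, which is why the output above mixes single tags with None. A minimal standalone illustration, using a made-up HTML string:

from bs4 import BeautifulSoup

html = '<p>intro</p><pre><code>first</code></pre><pre><code>second</code></pre>'
block = BeautifulSoup(html, 'html.parser')

print(block.find('code'))        # <code>first</code> -- only the first match
print(block.find_all('code'))    # [<code>first</code>, <code>second</code>]
print(block.find('blockquote'))  # None -- tag not present in this block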
You could try soup.findAll("code", {"class": "language-plaintext"}) and read .text from each match.

You can use CSS selectors via soup.select() with pre > code. This will select every <code> tag that sits directly under a <pre>:
import requests
from bs4 import BeautifulSoup

url = 'https://abiondo.me/2020/03/22/saarctf20-schlossberg/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for code in soup.select('pre > code'):
    print(code.get_text())
    print('-' * 80)
Prints:
$ curl -c cookies -X POST -H 'Content-Type: application/json' \
-d '{"username": "abiondo", "password": "secret"}' \
http://localhost:9081/api/users/register
{
"username": "abiondo"
}
--------------------------------------------------------------------------------
$ curl -b cookies -X POST -H 'Content-Type: application/json' \
-d '{"name": "MyFancyCave", "template": 1}' \
http://localhost:9081/api/caves/rent
{
"created": 1584867401,
"id": "1584867401_1345632849",
"name": "MyFancyCave",
"owner": "abiondo",
"template_id": 1,
"treasure_count": 0,
"treasures": []
}
--------------------------------------------------------------------------------
$ curl -b cookies -X POST -H 'Content-Type: application/json' \
-d '{"cave_id": "1584867401_1345632849", "names": [ \
"SAAR{OneFancyFlagOneFancyFlag00000000}", \
"SAAR{TwoFancyFlagsTwoFancyFlags000000}"]}' \
http://localhost:9081/api/caves/hide-treasures
{
"created": 1584867401,
"id": "1584867401_1345632849",
"name": "MyFancyCave",
"owner": "abiondo",
"template_id": 1,
"treasure_count": 2,
"treasures": [
{
"name": "SAAR{OneFancyFlagOneFancyFlag00000000}",
"x": 645,
"y": 97
},
{
"name": "SAAR{TwoFancyFlagsTwoFancyFlags000000}",
"x": 505,
"y": 14
}
]
}
--------------------------------------------------------------------------------
...and so on.
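If you want to keep the asyncio + feedparser pipeline from the question instead of fetching the page with requests, the same selector works on each entry's content. A rough sketch under that assumption; the feed URL is hypothetical and the plain feedparser.parse() call stands in for whatever async fetching you already have:

import feedparser
from bs4 import BeautifulSoup

# Hypothetical feed URL -- replace with the feed you are actually parsing.
feed = feedparser.parse('https://abiondo.me/feed.xml')

for entry in feed.entries:
    print(entry['link'])
    for content in entry['content']:
        soup = BeautifulSoup(content['value'], 'html.parser')
        # select() returns every match, unlike find(), which stops at the first one.
        for code in soup.select('pre > code'):
            print(code.get_text())
            print('-' * 80)

get_text() already strips the markup, so beyond an optional .strip() there is usually no further cleanup to do.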