Allowing certain URLs and denying the rest with robots.txt

I need to allow only certain specific directories and deny the rest. As I understand it, you should declare the Allow rules first and then disallow everything else. Is this setup correct?

Allow: /word-lists/words-that-start-with/letter/z/
Allow: /word-lists/words-that-end-with/letter/z/
Disallow: /word-lists/words-that-start-with/letter/
Disallow: /word-lists/words-that-end-with/letter/

Your snippet looks fine; just don't forget to add a User-agent line at the top.
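
For example, a complete file applying those rules to all crawlers might look like the sketch below (the wildcard User-agent: * is an assumption; name specific bots instead if you only want to target particular crawlers):

User-agent: *
Allow: /word-lists/words-that-start-with/letter/z/
Allow: /word-lists/words-that-end-with/letter/z/
Disallow: /word-lists/words-that-start-with/letter/
Disallow: /word-lists/words-that-end-with/letter/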

The order of the allow/disallow keywords doesn't matter at the moment, but it's up to the client to make the right call. See the Order of precedence for group-member records section in our robots.txt documentation.

[...] for allow and disallow directives, the most specific rule based on the length of the [path] entry will trump the less specific (shorter) rule.
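
Applied to the snippet above: the Allow path /word-lists/words-that-start-with/letter/z/ is longer, and therefore more specific, than the Disallow path /word-lists/words-that-start-with/letter/, so under this precedence a URL such as /word-lists/words-that-start-with/letter/z/zebra (a made-up page) would be allowed regardless of which line appears first.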

The original RFC did state that clients should evaluate the rules in the order they are found, but I don't remember any crawler actually doing that; instead they err on the safe side and follow the most restrictive rule.

To evaluate if access to a URL is allowed, a robot must attempt to match the paths in Allow and Disallow lines against the URL, in the order they occur in the record. The first match found is used. If no match is found, the default assumption is that the URL is allowed.
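
To see the order-dependence the RFC describes, here is a minimal sketch using Python's urllib.robotparser, which still evaluates rules in file order and uses the first match; the z/zebra and a/apple URLs are made-up test cases:

from urllib import robotparser

# The snippet from the question, with a wildcard User-agent added.
ROBOTS_TXT = """\
User-agent: *
Allow: /word-lists/words-that-start-with/letter/z/
Allow: /word-lists/words-that-end-with/letter/z/
Disallow: /word-lists/words-that-start-with/letter/
Disallow: /word-lists/words-that-end-with/letter/
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# First match wins for this parser, so the Allow lines must come before
# the broader Disallow lines for the z/ directories to stay crawlable.
print(parser.can_fetch("*", "/word-lists/words-that-start-with/letter/z/zebra"))  # True
print(parser.can_fetch("*", "/word-lists/words-that-start-with/letter/a/apple"))  # False

A length-based client such as Googlebot would reach the same two verdicts even with the lines reordered, which is why the snippet in the question is safe either way.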