robots.txt in CodeIgniter - allow view/function
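I have read a bit about robots.txt, and I read that I should disallow all folders in my web application. However, I would like to allow bots to read the main page and one view (the URL is, for example, www.mywebapp/searchresults; it is a CodeIgniter route, called from application/controller/function).
For example, the folder structure is: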
-index.php (should be readable by bots)
-application
  -controllers
    -controller (here is a function which loads a view)
  -views
-public
Should I create robots.txt like this:
User-agent: *
Disallow: /application/
Disallow: /public/
Allow: /application/controllers/function
or use the route, like this:
User-agent: *
Disallow: /application/
Disallow: /public/
Allow: /www.mywebapp/searchresults
Or maybe use the view?
User-agent: *
Disallow: /application/
Disallow: /public/
Allow: /application/views/search/index.php
Thanks!
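You don't block the view file, because crawlers cannot access that file directly. You need to block the URL that is used to access your view.
The robots.txt file must be placed in the document root of the host. It will not work in any other location.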
If your host is www.example.com, it needs to be accessible at http://www.example.com/robots.txt
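To illustrate that rules are matched against URL paths rather than files on disk, here is a minimal sketch using Python's standard `urllib.robotparser`. The host `www.example.com`, the `/searchresults` route, and the helper name `is_allowed` are assumptions made for the example; Python's parser is only a stand-in and real crawlers may resolve rules differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for the app in the question: rules are URL paths,
# relative to the host root, not paths of PHP files inside the project.
ROBOTS_TXT = """\
User-agent: *
Disallow: /application/
Disallow: /public/
Allow: /searchresults
"""


def is_allowed(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for path in ("/", "/searchresults", "/public/images/logo.png"):
        url = "http://www.example.com" + path
        print(path, is_allowed(ROBOTS_TXT, url))
```

The home page and the route stay crawlable because no `Disallow` prefix matches them, while anything under `/public/` is blocked.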
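To remove directories or individual pages of your website, you can place a robots.txt file at the root of your server. When creating your robots.txt file, keep the following in mind: when deciding which pages to crawl on a particular host, Googlebot will obey the first record in the robots.txt file with a User-agent starting with "Googlebot". If no such entry exists, it will obey the first entry with a User-agent of "*". Additionally, Google has added flexibility to the robots.txt file standard through the use of asterisks. Disallow patterns may include "*" to match any sequence of characters, and patterns may end in "$" to indicate the end of a name.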
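The group selection described above can be sketched with Python's standard `urllib.robotparser`. This is only an approximation: the standard-library parser matches agent names by substring rather than with Googlebot's own logic, and the bot name `OtherBot`, the host, and the `/public/` rule are assumptions for the example.

```python
from urllib.robotparser import RobotFileParser

# Two groups: one specific to Googlebot, one catch-all.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /listings

User-agent: *
Disallow: /public/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

BASE = "http://www.example.com"

# Googlebot follows its own group, so the catch-all rules do not apply to it.
googlebot_listings = parser.can_fetch("Googlebot", BASE + "/listings/1")
googlebot_public = parser.can_fetch("Googlebot", BASE + "/public/app.css")

# Any other crawler falls back to the "*" group.
other_listings = parser.can_fetch("OtherBot", BASE + "/listings/1")
other_public = parser.can_fetch("OtherBot", BASE + "/public/app.css")

if __name__ == "__main__":
    print(googlebot_listings, googlebot_public, other_listings, other_public)
```

Note the consequence: once a crawler has its own group, rules you want it to obey must be repeated inside that group.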
To remove all pages under a particular directory (for example, listings), you'd use the following robots.txt entry:
User-agent: Googlebot
Disallow: /listings
To remove all files of a specific file type (for example, .gif), you'd use the following robots.txt entry:
User-agent: Googlebot
Disallow: /*.gif$
To remove dynamically generated pages, you'd use this robots.txt entry:
User-agent: Googlebot
Disallow: /*?
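The three patterns above can be tried out with a small matcher. This is a rough sketch of the "*" and "$" semantics as described in this answer, not Googlebot's actual implementation; the function names are hypothetical, and Python's own `urllib.robotparser` is not used here because it does not interpret these wildcards.

```python
import re


def robots_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into a compiled regex.

    "*" matches any sequence of characters; a trailing "$" anchors the
    pattern to the end of the URL. Without "$" the pattern is a prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
    return re.compile(body + (r"\Z" if anchored else ""))


def pattern_matches(pattern: str, path: str) -> bool:
    """Return True if the URL path (including any query string) matches."""
    return robots_pattern_to_regex(pattern).match(path) is not None


if __name__ == "__main__":
    print(pattern_matches("/listings", "/listings/42"))
    print(pattern_matches("/*.gif$", "/images/logo.gif"))
    print(pattern_matches("/*.gif$", "/images/logo.gif?size=2"))
    print(pattern_matches("/*?", "/search?q=shoes"))
```

The third call shows why the "$" matters: a `.gif` URL with a query string no longer ends in `.gif`, so the anchored pattern does not cover it.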
Option 2: Meta tags
Another standard, which can be more convenient for page-by-page use, involves adding a <META> tag to an HTML page to tell robots not to index the page. This standard is described at http://www.robotstxt.org/wc/exclusion.html#meta.
To prevent all robots from indexing a page on your site, you'd place the following meta tag into the <HEAD> section of your page:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
To allow other robots to index the page on your site, preventing only Google's robots from indexing the page, you'd use the following tag:
<META NAME="GOOGLEBOT" CONTENT="NOINDEX, NOFOLLOW">
To allow robots to index the page on your site but instruct them not to follow outgoing links, you'd use the following tag:
<META NAME="ROBOTS" CONTENT="NOFOLLOW">
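If you want to check which of these tags a rendered page actually sends, a short script can read them back. The following is a sketch using Python's standard `html.parser`; the class and function names are made up for the example, and it only looks at `ROBOTS` and `GOOGLEBOT` meta tags.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots"> / <meta name="googlebot">."""

    def __init__(self):
        super().__init__()
        # Maps the lower-cased meta name to a set of lower-cased directives.
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names, so <META NAME=...> works.
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        if name not in ("robots", "googlebot"):
            return
        content = attr.get("content") or ""
        tokens = {t.strip().lower() for t in content.split(",") if t.strip()}
        self.directives.setdefault(name, set()).update(tokens)


def robots_directives(html: str) -> dict:
    """Return the robots meta directives found in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    page = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></head></html>'
    print(robots_directives(page))
```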
Further reference:
https://www.elegantthemes.com/blog/tips-tricks/how-to-create-and-configure-your-robots-txt-file
Answering my own old question:
When we want bots to read certain pages, we need to use our URL (route), so in this case:
Allow: /searchresults
In some cases, we can also exclude certain pages with an HTML tag (added to the page's <HEAD>):
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
When we want to block some folders, for example images:
Disallow: /public/images