Thursday 2 April 2015

What is Robots.txt?


Robots.txt is a text file webmasters create to instruct robots (typically search engine crawlers) how to crawl and index pages on their website. It is the best-known part of the Robots Exclusion Protocol (REP).

The Robots Exclusion Protocol (REP) is a group of web standards that regulate web robot behavior and search engine indexing. The REP consists of the following:
  • The original REP from 1994, extended in 1997, defining crawler directives for robots.txt. Some search engines support extensions such as URI patterns (wildcards).
  • Its extension from 1996 defining indexer directives (REP tags) for use in the robots meta element, also known as the "robots meta tag." Search engines also support additional REP tags via an X-Robots-Tag HTTP header, which lets webmasters apply them to non-HTML resources such as PDF documents or images (illustrated below).
  • The microformat rel-nofollow from 2005, defining how search engines should handle links where the A element's REL attribute contains the value "nofollow."
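
To make the indexer directives and the rel-nofollow microformat concrete, here is roughly what they look like in practice (the file and URL names are only placeholders):

<meta name="robots" content="noindex, follow">   (robots meta tag in an HTML page's head)
X-Robots-Tag: noindex   (HTTP response header, e.g. sent with a PDF)
<a href="http://www.example.com/page.html" rel="nofollow">a link</a>   (rel-nofollow on a single link)
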
Structure of a Robots.txt File
The structure of a robots.txt file is pretty simple (and barely flexible) – it is simply a list of user agents and the files and directories disallowed for each. Basically, the syntax is as follows:

User-agent:
Disallow:

“User-agent:” names the search engine crawler the rules apply to, and “Disallow:” lists the files and directories to be excluded from indexing. In addition to “User-agent:” and “Disallow:” entries, you can include comment lines – just put the # sign at the beginning of the line:

# All user agents are disallowed to see the /temp directory.
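
Combined with the directives it describes, such a comment might look like this (the /temp directory is only an example path):

# All user agents are disallowed to see the /temp directory.
User-agent: *
Disallow: /temp/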

Block all web crawlers from all content
User-agent: *
Disallow: /

Block a specific web crawler from a specific folder
User-agent: Googlebot
Disallow: /no-google/

Block a specific web crawler from a specific web page
User-agent: Googlebot
Disallow: /no-google/blocked-page.html

Sitemap Parameter
User-agent: *
Disallow:
Sitemap: http://www.example.com/none-standard-location/sitemap.xml

Important Rules
  • In most cases, meta robots with the parameters "noindex, follow" should be employed as a way to restrict crawling or indexation.
  • It is important to note that malicious crawlers are likely to completely ignore robots.txt and as such, this protocol does not make a good security mechanism.
  • Only one "Disallow:" line is allowed for each URL.
  • Each subdomain on a root domain uses separate robots.txt files.
  • Google and Bing accept two specific regular expression characters for pattern exclusion (* and $) – see the example after this list.
  • The filename of robots.txt is case sensitive. Use "robots.txt", not "Robots.TXT."
  • Spacing is not an accepted way to separate query parameters. For example, "/category/ /product page" would not be honored by robots.txt.
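
As a rough illustration of the wildcard support mentioned above (the paths and the .pdf extension are only placeholders, and exact behaviour varies by search engine):

User-agent: Googlebot
# Block any URL that ends in .pdf
Disallow: /*.pdf$
# Block any directory whose name starts with "private"
Disallow: /private*/
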
Things you should avoid
If you don't format your robots.txt file properly, some or all files of your website might not get indexed by search engines. To avoid this, do the following:
  1. Don't use comments in the robots.txt file
    Although comments are allowed in a robots.txt file, they might confuse some search engine spiders.

    "Disallow: support # Don't index the support directory" might be misinterepreted as "Disallow: support#Don't index the support directory".

  2. Don't use white space at the beginning of a line (note the leading spaces before each directive in the first example below). For example, don't write

         User-agent: *
         Disallow: /support

    but

    User-agent: *
    Disallow: /support

  3. Don't change the order of the commands. If you want your robots.txt file to work, don't mix up the order. Don't write

    Disallow: /support
    User-agent: *

    but

    User-agent: *
    Disallow: /support

  4. Don't use more than one directory in a Disallow line. Do not use the following

    User-agent: *
    Disallow: /support /cgi-bin/ /images/

    Search engine spiders cannot understand that format. The correct syntax for this is

    User-agent: *
    Disallow: /support
    Disallow: /cgi-bin/
    Disallow: /images/

  5. Be sure to use the right case. The file names on your server are case sensitive. If the name of your directory is "Support", don't write "support" in the robots.txt file.

  6. Don't list all files. If you want a search engine spider to ignore all files in a particular directory, you don't have to list them all. For example:

    User-agent: *
    Disallow: /support/orders.html
    Disallow: /support/technical.html
    Disallow: /support/helpdesk.html
    Disallow: /support/index.html

    You can replace this with

    User-agent: *
    Disallow: /support

  7. The "Allow" command is not part of the original standard

    "Allow" was not defined in the original robots exclusion protocol, and not every crawler understands it (major search engines such as Google and Bing support it as an extension). To stay on the safe side, only mention files and directories that you don't want to be indexed. All other files will be indexed automatically if they are linked on your site.
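
    For crawlers that do support the Allow extension, a typical combination (the paths here are only placeholders) might look like this:

    User-agent: Googlebot
    Disallow: /support/
    Allow: /support/public-faq.html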
