I am a huge fan of the free WordPress security plugin, Wordfence. I received this email yesterday regarding robots.txt and thought I’d publish it here for everyone’s reference:
Dear WordPress Publisher,
There is a subtle vulnerability that has emerged in the robots.txt standard and how Google indexes data. The issue may cause the existence of administrative areas on your site to be exposed by Google, even if you specifically tell Google to not index those areas.
Google respects the rules in robots.txt, but it also indexes robots.txt itself. With a specially crafted search query, a hacker can find administrative areas on websites that publishers have told Google to exclude from the index. A possible solution is to leave the admin areas out of robots.txt and instead use HTTP basic authentication to prevent those areas from being indexed.
As a brief reminder, robots.txt is a way to tell crawlers like Google which pages you don’t want indexed on your website. You can read more about how robots.txt works here.
It has become common practice among website administrators to list administrative areas in their robots.txt, telling Google that they don’t want those areas indexed.
Unfortunately, Google indexes robots.txt itself, which creates a resource for hackers trying to find sites with an admin area they might be able to exploit.
Hackers can craft a Google query that contains the URL of a known vulnerable application’s admin area to find vulnerable sites.
The following examples, posted to the Full Disclosure security mailing list, demonstrate this attack:
The query specifies that “robots.txt” must be in the URL, that the file type must be a text file, and that the file must contain a “Disallow” statement that tries to prevent Google from indexing an admin area, an area that stores backup files (which would allow download of the entire website), and an area called “password”.
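To see what a query like this could surface on your own site, you can scan your robots.txt for Disallow entries that hint at sensitive areas. The following is a minimal Python sketch, not something from the original email: the function name exposed_paths, the keyword list, and the sample file are all assumptions chosen for illustration.

```python
# Hypothetical keyword list for illustration; it is not taken from the
# queries posted to the mailing list. Adjust it to match your own site.
SENSITIVE_HINTS = ("admin", "backup", "password")


def exposed_paths(robots_txt, hints=SENSITIVE_HINTS):
    """Return Disallow paths in robots_txt that contain any hint word."""
    flagged = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        # Skip blank lines and every directive other than Disallow.
        if not sep or field.strip().lower() != "disallow":
            continue
        path = value.strip()
        if any(hint in path.lower() for hint in hints):
            flagged.append(path)
    return flagged


# Made-up robots.txt content used only to exercise the function.
sample = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /backups/   # nightly dumps
Disallow: /tmp/
Allow: /wp-admin/admin-ajax.php
"""

print(exposed_paths(sample))  # ['/wp-admin/', '/backups/']
```

Any path this reports is one that anyone who can read your robots.txt can also see, so it is a candidate for removal from the file and for protection by authentication instead.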
This is a catch-22 situation: if you don’t use robots.txt to prevent indexing of your admin areas, Google will expose the fact that they exist by indexing whatever is publicly visible. If you do include the admin areas in robots.txt, Google indexes robots.txt, which exposes their existence anyway.
Naturally, security through obscurity is not a good policy. Your admin areas should be protected using authentication. But using a crafted Google query to find vulnerable systems has long been a part of the hacker’s toolkit. So if you want to remove yourself from a potential target list, you should consider how to solve this issue.
My advice is the following:
If you have any directories where you store backups or sensitive data, don’t include them in your robots.txt. Instead, add a layer of “basic authentication” using an .htaccess file in addition to the existing authentication you have. So if you have a web-based login screen, create an .htaccess file that puts basic authentication in front of that login screen, which keeps Google from indexing it.
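As a rough illustration of what such an .htaccess file might contain, here is a small Python sketch that renders the standard Apache basic-authentication directives (AuthType, AuthName, AuthUserFile, Require). The helper name render_htaccess, the realm text, and the password file path are assumptions made for this example. It also assumes an Apache server that allows authentication overrides in .htaccess, and a password file created separately, for example with Apache’s htpasswd utility.

```python
def render_htaccess(realm, htpasswd_path):
    """Return .htaccess text that requires HTTP basic authentication.

    realm is the label shown in the browser's login prompt;
    htpasswd_path is the absolute path to an existing password file.
    """
    if '"' in realm:
        raise ValueError("realm must not contain double quotes")
    lines = [
        "AuthType Basic",
        f'AuthName "{realm}"',
        f"AuthUserFile {htpasswd_path}",
        "Require valid-user",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical realm and path, for illustration only.
print(render_htaccess("Restricted area", "/home/example/.htpasswd"), end="")
```

Placed in the directory you want to protect, the resulting file makes the server ask for a username and password before serving anything from it, so a crawler that cannot log in has nothing to index. Keeping the password file outside the web root is the safer choice.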
Expect Google to respond to this in the coming weeks or months with a change in how it indexes the content of robots.txt.
Wordfence Creator & Feedjit Inc. CEO.
PS: If you aren’t already a member you can subscribe to our WordPress Security and Product Updates mailing list here. You’re welcome to republish this email in part or in full provided you mention that the source is www.wordfence.com. If you would like to get Wordfence for your WordPress website, simply go to your “Plugin” menu, click “add new” and search for “wordfence”.