How to Optimize Your Robots.txt File

Your Robots.txt file is your “anti sitemap”. Instead of telling Google and other search engines about the content you want them to find and index, a Robots.txt file tells them what pages on your site you don’t want them to find. In this video you’ll learn how to block search engine spiders from accessing the pages on your site that you don’t want people to be able to find via a search engine.

Video Transcript

Hey, what’s up everybody, it’s Brian Dean from Quick Sprout. In this video I’m going to show you how to optimize your robots.txt files for SEO.

First of all, what is a robots.txt file? It’s basically the opposite of your sitemap. A sitemap exists to tell search engines what pages you want them to crawl and index. And a robots file is the opposite: it tells them what pages you don’t want them to crawl. You’re actually looking at a robots.txt file right now. I’m going to walk you through what each part of the robots file means and how you can modify it for your site.

The first thing you want to do is check to see if you have one, and if you do, look at what settings are already there. To do that, head into your favorite browser. Type in your site name followed by /robots.txt. If you see something that looks like this then you have a robots file. If not, then you don’t.
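
As a rough illustration of where that file lives, here is a minimal Python sketch that builds the robots.txt address for a site. The function name robots_url and the example.com domain are made up for this example, and it assumes https when no scheme is given.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(site: str) -> str:
    """Return the robots.txt URL for a site given as a domain or a full URL."""
    # Assumption for this sketch: default to https when no scheme is given.
    if "://" not in site:
        site = "https://" + site
    parts = urlsplit(site)
    # robots.txt sits at the root of the host, whatever page was passed in.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("example.com"))
print(robots_url("https://example.com/blog/some-post"))
```

Both calls print the same address, because the file is looked up at the root of the domain rather than next to any particular page.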

If you want to modify this file you’ll need to install a plugin in WordPress. Head over to WordPress. Hover over plugins and click on add new. Then, under the search field put in robots. Click on search plugins. You want to install this one, WP Robots Txt. I’ve already installed it. You just install it and activate it.

Once that’s done, hover over settings. Click on reading. Then, scroll down. Now you have this new section to the reading area called robots.txt content. This is where you can modify your robots file.

I do want to give a warning that you usually don’t want to mess around with this too much. The only reason you would want to change your robots.txt content is if there are pages on your site that you don’t want Google to crawl and index, or if there are duplicate content issues on your site and you want certain pages to be blocked. So, if you have two identical pages you would choose one and add it here to the robots.txt file. Then, search engines wouldn’t crawl it. Later on I’m going to talk about another step you can take to make sure that those pages don’t get indexed.

Let me walk you through the language of the robots.txt file. User agent represents the bot that you’re speaking to. In this case it has a little asterisk there, and that speaks to all bots: Googlebot, Yahoo, Bing, every search engine. Under disallow you’re basically telling them you don’t want this page to be crawled and you don’t want this page to be crawled. So, disallow means that you don’t want that page to be crawled by the spiders.
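
To see how a User-agent line with an asterisk and a pair of Disallow lines play out, here is a small sketch using Python’s standard urllib.robotparser module. The two blocked paths and the example.com URLs are placeholders, not the rules shown in the video.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: one group for all bots, with two disallowed paths.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether user_agent may crawl url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The asterisk group applies to every bot, so both are refused here.
print(is_allowed(ROBOTS_TXT, "Googlebot", "https://example.com/wp-admin/options.php"))  # False
print(is_allowed(ROBOTS_TXT, "bingbot", "https://example.com/wp-admin/options.php"))    # False
# Paths that are not listed stay crawlable.
print(is_allowed(ROBOTS_TXT, "Googlebot", "https://example.com/blog/"))                 # True
```

Note that Disallow matches by path prefix, so /wp-admin/ also covers everything underneath it.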

You can change the user agent and add specific ones, and I’m going to teach you later on in the video how to find the bots that are visiting your site and to see which ones you might want to block. But, in general, you want to keep this like it is.

Let’s say you want to add another user agent that you may want specific rules for. So, say in general you don’t want any spiders to crawl these two pages. You can add a new user agent, like Googlebot, and then you can add a new disallow. You can disallow them from accessing, let’s say, Google page. Now, you don’t want them for some reason to crawl this page. That’s just Google. Bing is okay to crawl that page. That would depend on how your site is set up. You can also disallow certain file types. You could say *.jpeg and that would block Google from accessing any JPEG images on your site.
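
Here is a hedged sketch of a Googlebot-specific group next to the general one, again with urllib.robotparser. The paths are invented. One detail worth knowing: a crawler that finds a group naming it generally follows only that group, so the general rules are repeated under Googlebot here. As far as I know, Python’s parser only does prefix matching and does not understand wildcards, so the helper pattern_matches (a made-up name) is a simplified stand-in for Google-style patterns where * matches any run of characters and a trailing $ anchors the end.

```python
import re
from urllib.robotparser import RobotFileParser

# Hypothetical rules: everyone is kept out of two pages, Googlebot out of a third.
ROBOTS_TXT = """\
User-agent: *
Disallow: /page-one/
Disallow: /page-two/

User-agent: Googlebot
Disallow: /page-one/
Disallow: /page-two/
Disallow: /google-page/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("Googlebot", "https://example.com/google-page/"))  # False
print(parser.can_fetch("bingbot", "https://example.com/google-page/"))    # True


def pattern_matches(pattern: str, path: str) -> bool:
    """Simplified wildcard match: * is any run of characters, trailing $ ends the path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with ".*" where the * wildcards were.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start of the path, like a robots.txt prefix rule.
    return re.match(regex, path) is not None


print(pattern_matches("/*.jpeg$", "/images/photo.jpeg"))  # True
print(pattern_matches("/*.jpeg$", "/images/photo.png"))   # False
```

A rule like this only covers files ending in .jpeg; other image types such as .jpg or .png would each need their own line.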

Like I said, in general you don’t want to mess too much with this. You should probably keep it pretty much like it is. But, if you want to add a user agent, you may want to look at what bots are accessing your site that you’d rather keep out.

To do that, head over to your cPanel. You have to log in to your hosting plan’s cPanel, and many hosts offer cPanel. Then, scroll down to where it says logs and click on latest visitors. Then it’ll show the domains you have in your hosting plan. Pick the one that you want to check out, and click on the magnifying glass button.

Then, under user agent you want to sort this. Because typically most visitors to your site, obviously, are just people. So, these user agents will be browsers like Mozilla and Chrome. You want to sort and see.

In this case Zemanta aggregator is a bot, and maybe some other bots. You’d want to take note of any that are visiting your site. Especially keep an eye out for things like Ahrefs, Majestic SEO, and Open Site Explorer.
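
If you have the raw access log rather than the cPanel screen, the same sorting can be sketched in Python. Everything below is an assumption for illustration: the sample lines are made up, the log is assumed to end each line with the user agent in double quotes, and the BOT_HINTS keywords are a crude guess at what a bot’s name might contain.

```python
import re
from collections import Counter

# Made-up log lines; a real access log may use a different layout.
SAMPLE_LOG = [
    '203.0.113.5 - - [10/Oct/2013:13:55:36 +0000] "GET / HTTP/1.1" 200 2326 "-" "Mozilla/5.0 (Windows NT 6.1) Chrome/30.0"',
    '203.0.113.9 - - [10/Oct/2013:13:56:01 +0000] "GET /blog/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Windows NT 6.1) Chrome/30.0"',
    '198.51.100.7 - - [10/Oct/2013:13:57:12 +0000] "GET /feed/ HTTP/1.1" 200 840 "-" "Zemanta Aggregator"',
    '198.51.100.8 - - [10/Oct/2013:13:58:40 +0000] "GET / HTTP/1.1" 200 2326 "-" "Mozilla/5.0 (compatible; AhrefsBot/5.0)"',
]

# The user agent is assumed to be the last double-quoted field on the line.
USER_AGENT_RE = re.compile(r'"([^"]*)"\s*$')

# Crude keyword hints; plenty of bots will not match these.
BOT_HINTS = ("bot", "crawler", "spider", "aggregator")


def count_user_agents(lines):
    """Count how many requests each user agent string made."""
    counts = Counter()
    for line in lines:
        match = USER_AGENT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def likely_bots(counts):
    """Return user agents whose name contains one of the hint keywords, sorted."""
    return sorted(ua for ua in counts if any(h in ua.lower() for h in BOT_HINTS))


counts = count_user_agents(SAMPLE_LOG)
for agent in likely_bots(counts):
    print(counts[agent], agent)
```

The keyword filter only narrows the list; it is still worth reading through the remaining user agents by eye.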

If you don’t want other people to reverse engineer your site you can make it a little more difficult if you block these bots. So, if you didn’t want Zemanta to be visiting your site you could choose Zemanta aggregator. Go back to your robots.txt file. Then, user agent, and you could disallow your entire site or certain sections of your site. If you want to disallow the entire site you do that. So, that’s the way to find specific user agents that you can block.
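
A minimal sketch of what that added group could look like, checked with urllib.robotparser. The helper block_rule and the existing rules are hypothetical, and “Zemanta Aggregator” is simply the name as it appears in the log; the exact token a given bot responds to should come from that bot’s own documentation.

```python
from urllib.robotparser import RobotFileParser


def block_rule(user_agent: str, path: str = "/") -> str:
    """Build a robots.txt group that disallows path for one user agent."""
    return f"User-agent: {user_agent}\nDisallow: {path}\n"


# Hypothetical existing file, plus a new group blocking the whole site for one bot.
existing = "User-agent: *\nDisallow: /wp-admin/\n"
updated = existing + "\n" + block_rule("Zemanta Aggregator")
print(updated)

parser = RobotFileParser()
parser.parse(updated.splitlines())

print(parser.can_fetch("Zemanta Aggregator", "https://example.com/blog/"))  # False
print(parser.can_fetch("Googlebot", "https://example.com/blog/"))           # True
```

Passing a path such as "/private/" instead of the default "/" would block just that section rather than the entire site.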

Now, I just want to say again, some bots won’t pay attention to your robots.txt file. They’ll just ignore your instructions and crawl and index the pages anyway. So, you also want to add another layer of security to the pages that you don’t want indexed by making them noindex and nofollow.

I’m going to show you why. Here is the robots.txt file from about.com. As you can see, they disallow this library no search area of their site. But, when you search for this in Google there are over 4,900 results. So, Google is still indexing the page, but they’re not crawling it.

That’s a distinction between a robots.txt file and a noindex tag, which is a meta tag you can add to your pages. Because if there are links pointing to these pages Google will index them, even if they’re blocked by the robots.txt file. Because the robots file just tells them not to crawl it. They have this link in their index, but they’ve never crawled it. So, they don’t know what’s on it. They have no idea what content is on this page.

A more effective way to make sure these URLs don’t even appear in the index at all is to also add the noindex tag. To do that, head back to your WordPress dashboard. You want to install a plugin. So, click on add new. It’s called GD Press. So, put in GD Press here. Click on search plugins. It’s the first result.

Once that’s installed, you can set this on any page or post that you edit in WordPress. I’m going to show you an example right now. Uncheck the use global meta tag settings, then you can choose noindex, follow. That’ll make sure that search engines don’t index that page and the robots.txt file will make sure they don’t crawl it. This is like two layers of security for pages that you really, really don’t want to appear in search engines.
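
To check what a page actually sends, here is a sketch that reads the robots meta tag out of some HTML with Python’s standard html.parser. The sample markup is an assumption about what a noindex, follow setting would produce, not output captured from the plugin, and the names RobotsMetaParser and robots_directives are made up.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None when the attribute has no value.
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            for directive in attr_map.get("content", "").split(","):
                directive = directive.strip().lower()
                if directive:
                    self.directives.add(directive)


def robots_directives(html: str) -> set:
    """Return the set of robots meta directives found in the HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page head with a noindex, follow setting.
PAGE = '<html><head><meta name="robots" content="noindex, follow"></head><body>Hi</body></html>'

found = robots_directives(PAGE)
print("noindex" in found)  # True
```

A page with no robots meta tag returns an empty set, which search engines treat as the default of index and follow.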

Finally, you can actually edit your robots.txt file within Google Webmaster Tools. You can check to see if you’re setting it up right and to see how search engines, or at least Google, are interpreting the information in your robots.txt file. Log in to Google Webmaster Tools and click on your site. Then, under crawl click on blocked URLs.

As you can see, they’ve found the robots.txt file on my site. You want to see if they’ve actually found it, which is typically found at /robots.txt like I said. Then, under blocked URLs, there are 33 blocked URLs.

If you want to make changes you can add the same information you did using the plugin. So, you’d put user agent Googlebot. Then disallow, and whatever you want to block. Then you could test the changes and it would tell you how well it worked.
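
The same kind of test can be approximated offline before you touch the live file. This sketch uses urllib.robotparser, which may not interpret every rule exactly as Google does, so treat it as a rough first check only. The draft rules, the URL list, and the function name blocked_urls are all invented for the example.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt: str, user_agent: str, urls: list) -> list:
    """Return the URLs that the draft robots.txt would block for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Hypothetical draft and a handful of pages that matter.
DRAFT = """\
User-agent: Googlebot
Disallow: /private/
"""

URLS = [
    "https://example.com/",
    "https://example.com/private/report.html",
    "https://example.com/blog/",
]

print(blocked_urls(DRAFT, "Googlebot", URLS))

# A draft that is too broad blocks everything, including the homepage.
TOO_BROAD = "User-agent: *\nDisallow: /\n"
print(len(blocked_urls(TOO_BROAD, "Googlebot", URLS)))  # 3
```

If the homepage or other key pages show up in the blocked list, the draft is blocking more than intended.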

This is another way to make sure your robots.txt file is set up correctly. Because a mistake in that file can result in your entire site not being crawled by Google. So, this is obviously something that you want to get right. That’s why you want to make only the changes that are absolutely necessary to your robots.txt file.

That’s all there is to optimizing your robots.txt file for SEO. Thanks for watching this video, and I’ll see you in the next one.