What Is robots.txt?
robots.txt is a plain-text file placed in the root of a website that tells search engine crawlers which URLs they are allowed to request. When a crawler such as Googlebot visits a site, it checks https://www.example.co.za/robots.txt first to see which areas it should and should not fetch.
The file uses a simple syntax. A typical example reads: User-agent: * followed by Disallow: /admin/. The first line names the crawler the rule applies to (the asterisk means all crawlers), and the second line tells it not to request anything under the /admin/ path. A robots.txt file also commonly points crawlers to the sitemap with a line such as Sitemap: https://www.example.co.za/sitemap.xml.
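To make the syntax concrete, here is a minimal sketch that feeds the example rules above to Python's standard-library urllib.robotparser and asks what a crawler may fetch. The /admin/login and /products/ paths are assumptions chosen for illustration, and this parser is a simplified implementation that may differ from Googlebot in edge cases such as wildcard patterns.

```python
from urllib.robotparser import RobotFileParser

# The example rules described above, as they would appear in the file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Sitemap: https://www.example.co.za/sitemap.xml
"""


def build_parser(text):
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Hypothetical URLs, used only to show how the rules are applied.
print(parser.can_fetch("Googlebot", "https://www.example.co.za/admin/login"))  # False
print(parser.can_fetch("Googlebot", "https://www.example.co.za/products/"))    # True
print(parser.site_maps())  # ['https://www.example.co.za/sitemap.xml']
```

Because the rule group names User-agent: *, it applies to Googlebot as well as to any other crawler that has no group of its own.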
robots.txt follows the Robots Exclusion Protocol, a long-standing convention (published as RFC 9309 in 2022) that well-behaved crawlers respect voluntarily. It is not a security mechanism, only a set of instructions that a crawler is free to ignore.
Why robots.txt Matters
robots.txt helps you manage crawl budget and steer crawlers away from low-value areas of your site. By disallowing internal search results, admin areas, faceted-navigation URLs, or staging paths, you keep crawlers focused on the pages that actually matter for ranking. On a large South African site with many thousands of URLs, this can improve how efficiently Google crawls and refreshes your important content.
Just as important is what robots.txt does not do. Blocking a URL in robots.txt prevents crawling, but it does not prevent indexing. If other pages link to a blocked URL, Google may still list it in search results, usually with no description because it could not read the page. This is the single most misunderstood point about the file.
This is where the difference between robots.txt and the noindex directive becomes critical. To stop a page from appearing in search results, you must let crawlers access it and place a <meta name="robots" content="noindex"> tag on the page (or send the equivalent X-Robots-Tag: noindex HTTP header). If you both disallow the page in robots.txt and add a noindex tag, Google cannot crawl the page and so never sees the noindex tag, which means the page may remain indexed. The two tools solve different problems: robots.txt controls crawling, noindex controls indexing.
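The conflict between a disallow rule and a noindex tag is easy to miss, so it can help to check for it programmatically. The following is a minimal sketch, not a complete audit tool: it assumes you already have the robots.txt text and the page HTML as strings, and the function name noindex_status and its return labels are hypothetical.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaFinder(HTMLParser):
    """Collects the directives from every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").lower() == "robots":
            content = attr.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",")]


def noindex_status(robots_txt, url, html, user_agent="Googlebot"):
    """Report whether a noindex tag on the page can actually be seen.

    Returns one of three hypothetical labels:
      "conflict"        - noindex is present but the URL is disallowed
      "noindex-visible" - noindex is present and the URL is crawlable
      "no-noindex"      - the page carries no noindex directive
    """
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())

    finder = RobotsMetaFinder()
    finder.feed(html)

    if "noindex" not in finder.directives:
        return "no-noindex"
    if rules.can_fetch(user_agent, url):
        return "noindex-visible"
    return "conflict"
```

A result of "conflict" is the situation described above: the crawler is told not to fetch the page, so the noindex instruction on it goes unread. This sketch only inspects the meta tag and does not look at HTTP headers.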
How to Use robots.txt (and Avoid Common Mistakes)
Place the file at the domain root, keep the rules deliberate, and test before you publish. Google Search Console includes a robots.txt report that shows how Google reads your file. Always include a reference to your XML sitemap so crawlers can find your full list of pages.
Watch for the frequent errors that damage rankings. The most serious is accidentally disallowing the entire site with Disallow: /, which can happen when a staging configuration is pushed live and can cause pages to drop out of search results as Google stops crawling them. Other mistakes include blocking CSS or JavaScript files that Google needs to render the page, trying to hide private data with robots.txt (the file is publicly readable, so every path you list in it is visible to anyone), and relying on robots.txt to deindex content instead of using noindex.
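Several of these mistakes can be caught by a simple pre-deployment check. The sketch below is one possible approach using Python's standard-library parser; the function name audit_robots, the sample asset paths, and the wording of the warnings are all assumptions, and the parser may not match Googlebot's behaviour for wildcard rules.

```python
from urllib.robotparser import RobotFileParser


def audit_robots(robots_txt, site, must_be_crawlable, user_agent="Googlebot"):
    """Return a list of warnings for a robots.txt file given as text.

    site is the origin without a trailing slash, and must_be_crawlable
    is a list of paths (pages, CSS, JavaScript) that should stay fetchable.
    """
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())

    warnings = []
    if not rules.can_fetch(user_agent, site + "/"):
        warnings.append("homepage is disallowed (possible site-wide Disallow: /)")
    for path in must_be_crawlable:
        if not rules.can_fetch(user_agent, site + path):
            warnings.append(f"blocked: {path}")
    if not rules.site_maps():
        warnings.append("no Sitemap line found")
    return warnings


# A staging-style file that blocks everything, used here as a bad example.
staging_file = "User-agent: *\nDisallow: /\n"
for warning in audit_robots(
    staging_file,
    "https://www.example.co.za",
    ["/assets/site.css", "/assets/app.js"],  # hypothetical asset paths
):
    print(warning)
```

Running a check like this in a deployment pipeline is a design choice rather than a requirement; the robots.txt report in Google Search Console remains the authoritative view of how Google reads the live file.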
robots.txt is one piece of technical SEO, alongside canonical tags and your sitemap; related reading covers canonical tags and SEO. For a full check of your crawlability and indexation, see our SEO audit and SEO services.
FAQ
Does robots.txt stop a page from appearing in Google?
No. robots.txt only asks crawlers not to fetch a URL. A disallowed page can still appear in search results without a description if other pages link to it. To keep a page out of the index, allow it to be crawled and use a noindex meta tag instead.
Where should the robots.txt file be located?
robots.txt must sit in the root directory of your domain, for example https://www.example.co.za/robots.txt. Crawlers only look for it there, so placing it in a subfolder means it will be ignored.
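As a small illustration of the root-only rule, this sketch derives the robots.txt location from any page URL by keeping the scheme and host and discarding the path. The function name robots_url and the sample page URL are assumptions for the example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt URL that governs the given page URL."""
    parts = urlsplit(page_url)
    # Keep scheme and host; replace the path and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://www.example.co.za/blog/some-post?page=2"))
# https://www.example.co.za/robots.txt
```

Because the host is part of the result, a subdomain resolves to its own robots.txt rather than sharing the file on the main domain.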