Robots.txt Explained: The Backend File That Controls How Search Engines See Your Website

Figure: Robots.txt crawl sequence. Googlebot requests yoursite.com/robots.txt and acts on the HTTP status (200 OK: read and obey the directives; 5xx error: block all crawling). For each target URL it then checks whether robots.txt blocks it: a blocked page goes uncrawled but can still be indexed via external links, while an unblocked page is crawled, its noindex/meta tags are read, and its content is indexed.

If you run a website and have never opened your robots.txt file, you’re flying blind on one of the most fundamental parts of technical SEO. This tiny text file, sitting quietly in your site’s root directory, is the first thing Googlebot reads before it crawls a single page. Get it wrong, and you could accidentally hide your best content from search engines. Get it right, and you direct crawl resources exactly where they matter most.

I’ve audited hundreds of Singapore business websites over the years. A surprising number of them have robots.txt files that are either misconfigured, copy-pasted from a random tutorial, or completely empty. Some are actively sabotaging their own rankings without knowing it.

This guide goes deep into how robots.txt works, how to write one properly, and the specific mistakes I see Singapore site owners making repeatedly. Whether you’re running an e-commerce store, a professional services firm, or a content-heavy portal, this is the backend primer you need.

What Robots.txt Actually Does (And What It Doesn’t)

A robots.txt file is a plain text file that lives at the root of your domain. When you type yoursite.com.sg/robots.txt into a browser, you should see it. If you see a 404 error, you don’t have one, and that’s a problem we’ll address below.

The file follows a standard called the Robots Exclusion Protocol. It gives crawl directives to search engine bots, telling them which URLs they’re allowed to access and which ones to skip. Think of it like the “Authorised Personnel Only” sign at a hawker centre’s kitchen entrance. Well-behaved visitors (Googlebot, Bingbot) will respect the sign. A rat won’t.

The Critical Distinction: Blocking Crawling vs. Blocking Indexing

This is where most people get confused, and where the real damage happens. Robots.txt controls crawling, not indexing. These are two completely different things.

Crawling means a bot visits and reads your page. Indexing means Google stores that page in its database and can show it in search results. If you block a URL in robots.txt, Googlebot won’t crawl it. But if another website links to that blocked URL, Google can still index it. You’ll see it appear in search results with the frustrating message: “A description for this result is not available because of this site’s robots.txt.”

If your goal is to keep a page out of search results entirely, you need a noindex meta tag or an X-Robots-Tag HTTP header. And here’s the catch: if you block the page in robots.txt, Googlebot can’t crawl it, which means it can’t see the noindex tag. The robots.txt block defeats the noindex, and the page can stay in the index. I’ve seen this exact mistake on at least a dozen Singapore sites in the past year alone.
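The interaction between the two signals can be sketched as a small decision function. This is a simplified illustration in Python, not Google’s actual logic; the function name and its three boolean inputs are assumptions made for the example.

```python
def indexing_outcome(disallowed: bool, has_noindex: bool, has_inbound_links: bool) -> str:
    """Simplified, hypothetical model of what can happen to a single URL."""
    if disallowed:
        # The page is never fetched, so any noindex tag on it goes unseen.
        if has_inbound_links:
            return "may be indexed without a description"
        return "not crawled; unlikely to be indexed"
    if has_noindex:
        # Crawlable and marked noindex: the tag is read and honoured.
        return "crawled and kept out of the index"
    return "crawled and eligible for indexing"


# The common mistake: Disallow plus noindex on a page that other sites link to.
print(indexing_outcome(disallowed=True, has_noindex=True, has_inbound_links=True))
```

The first branch is the trap: once the URL is disallowed, the value of has_noindex no longer changes the result.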

How the Crawl Sequence Works

Here’s the exact order of operations when Googlebot visits your site:

Googlebot sends an HTTP request to yoursite.com.sg/robots.txt
If the file returns a 200 status, Googlebot reads and obeys the directives inside
If the file returns a 404, Googlebot assumes everything is open for crawling
If the file returns a 5xx server error, Googlebot temporarily treats the entire site as disallowed for as long as the error persists (this is the dangerous one)
Only after processing robots.txt does Googlebot begin crawling your actual pages

That fourth point is worth pausing on. If your server is having issues and robots.txt returns a 500 error, Google will stop crawling your entire site. I’ve seen a Singapore SaaS company lose 60% of their indexed pages over two weeks because their hosting provider had intermittent server errors that affected robots.txt delivery. They didn’t notice until organic traffic cratered.
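As a rough sketch, the status handling in that sequence can be written as a lookup. The function name and the returned labels are made up for illustration, and only the three cases described above are modelled.

```python
def robots_txt_policy(status: int) -> str:
    """Map the robots.txt HTTP status to the crawl behaviour described above."""
    if status == 200:
        return "obey directives"
    if status == 404:
        return "crawl everything"
    if 500 <= status <= 599:
        return "treat whole site as disallowed"
    # Redirects and other 4xx codes are outside what this article covers.
    return "not covered here"


for code in (200, 404, 500, 503):
    print(code, "->", robots_txt_policy(code))
```

Monitoring the status code of robots.txt itself, not just your pages, is what would have caught the SaaS example early.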

Why Robots.txt Matters for SEO Performance in Singapore

You might think this is just housekeeping. It’s not. A properly configured robots.txt file directly impacts three things that affect your rankings.

Crawl Budget Optimisation

Google allocates a finite crawl budget to every website. For a 50-page brochure site, this rarely matters. But if you’re running a Singapore e-commerce store with 10,000 product pages, filtered category views, and internal search result URLs, your crawl budget becomes a real constraint.

Every time Googlebot wastes a crawl on your /search?q=red+shoes&size=42&sort=price URL, that’s a crawl it didn’t spend on your actual product page. Robots.txt lets you block these low-value, auto-generated URLs so your crawl budget goes to the pages that actually drive revenue.

On a recent audit for a Singapore fashion retailer, we found that 73% of Googlebot’s crawl activity was going to faceted navigation URLs. After updating robots.txt to block those paths, their important product pages were being recrawled 3x more frequently. New products started appearing in search results within days instead of weeks.
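If you have access to server logs, a figure like that 73% is straightforward to estimate. The sketch below assumes you have already extracted the request paths for Googlebot hits into a list; the sample paths and the function name are invented for illustration.

```python
from urllib.parse import urlsplit


def parameterised_share(paths: list[str]) -> float:
    """Fraction of crawled paths that carry a query string."""
    if not paths:
        return 0.0
    with_query = sum(1 for path in paths if urlsplit(path).query)
    return with_query / len(paths)


# Hypothetical Googlebot request paths pulled from an access log.
googlebot_hits = [
    "/products/red-shoes",
    "/search?q=red+shoes&size=42&sort=price",
    "/category/shoes?colour=red",
    "/products/blue-shoes",
]
share = parameterised_share(googlebot_hits)
print(f"{share:.0%} of Googlebot hits went to parameterised URLs")
```

A query string is only a proxy for low-value URLs, so treat the number as a prompt to look closer rather than a verdict.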

Preventing Duplicate Content Dilution

Many Singapore websites, especially those on older CMS platforms, generate duplicate versions of pages without the site owner realising it. Print-friendly versions, session ID URLs, HTTP and HTTPS variants, trailing slash variations. Each one splits your ranking signals.

While canonical tags are the primary solution for duplicate content, robots.txt serves as a useful first layer of defence. Blocking crawlers from accessing known duplicate paths prevents them from even discovering the duplicates in the first place.

Protecting Staging and Development Environments

If you’re testing a site redesign at staging.yoursite.com.sg, that subdomain needs its own robots.txt file with a full disallow, and ideally password protection as well, since a disallow alone won’t keep staging URLs out of the index if something links to them. I cannot tell you how many times I’ve found staging sites ranking in Google for Singapore businesses, sometimes outranking the live site. Each subdomain is treated as a separate website by search engines, so each one needs its own robots.txt.

Robots.txt Syntax: A Technical Breakdown

The syntax is deceptively simple. Four directives handle 95% of what you’ll ever need. But precision matters. One wrong character and your directive does nothing, or worse, blocks something you didn’t intend to block.

User-agent

This specifies which crawler the following rules apply to. The wildcard * means “all crawlers.” You can also target specific bots:

User-agent: Googlebot
User-agent: Bingbot
User-agent: GPTBot

That last one is increasingly relevant. If you don’t want OpenAI’s crawler training on your content, you’d add a specific block for GPTBot. Same for other AI crawlers like Google-Extended, CCBot, and anthropic-ai. This is a growing concern for content-heavy Singapore publishers.

Disallow

Tells the specified bot not to crawl a given path. The path is always relative to the root domain.

Disallow: /admin/
Disallow: /checkout/
Disallow: /internal-reports/

A blank Disallow directive means “disallow nothing,” which effectively allows everything:

User-agent: *
Disallow:

Allow

Overrides a Disallow rule for a specific path within a blocked directory. This is where you get granular control:

User-agent: *
Disallow: /resources/
Allow: /resources/public-guide.pdf

This blocks the entire /resources/ folder but lets crawlers access one specific PDF. Google resolves conflicts between Allow and Disallow by specificity: the rule with the longest matching path wins. If two matching rules are equally specific, Allow (the less restrictive rule) takes precedence.
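The precedence rule is easier to see in code. This is a minimal sketch that handles plain path prefixes only (no wildcards); the function name and the (directive, path) tuple format are assumptions for the example, not part of any library.

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Longest matching rule wins; on a tie in length, Allow wins."""
    best_len = -1
    allowed = True  # no matching rule means the URL is crawlable
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty Disallow matches nothing
        is_allow = directive.lower() == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


rules = [("Disallow", "/resources/"), ("Allow", "/resources/public-guide.pdf")]
for path in ("/resources/public-guide.pdf", "/resources/internal.pdf", "/about/"):
    print(path, "->", "allowed" if is_allowed(path, rules) else "blocked")
```

Note that the order of the lines in the file plays no part here, which is why the Allow line can sit below the Disallow it overrides.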

Sitemap

This directive points crawlers to your XML sitemap. It’s not tied to any User-agent block, so it can go anywhere in the file; the top or bottom keeps it easy to find:

Sitemap: https://www.yoursite.com.sg/sitemap.xml

You can list multiple sitemaps if you have them (for example, a separate sitemap for blog posts and products). This is one of the simplest things you can do to help search engines discover your content faster.

Wildcard Patterns

Two special characters give you pattern-matching power:

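* matches any sequence of characters. Disallow: /*?sort= blocks any URL whose query string starts with sort=. It does not catch sort= further along the query string (as in /search?q=red+shoes&sort=price); to cover that case, add a second rule, Disallow: /*&sort=
$ indicates the end of a URL. Disallow: /*.pdf$ blocks URLs that end in .pdf, but not /guide.pdf?download=1, which the same rule without the $ would also catch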

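These wildcards were not part of the original 1994 protocol. They began as search engine extensions and were later written into RFC 9309, the 2022 standard for the Robots Exclusion Protocol. Most major crawlers support them, but not all do.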
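One way to reason about a wildcard rule is to translate it into a regular expression and try it against sample paths. The helper names below are hypothetical, and this is a simplified sketch of the matching idea rather than any crawler’s real implementation.

```python
import re


def rule_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into an equivalent regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal chunk, then join the chunks with "any characters".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern: str, path: str) -> bool:
    # Rules match from the start of the path, hence match() not search().
    return rule_to_regex(pattern).match(path) is not None


checks = [
    ("/*?sort=", "/shoes?sort=price"),
    ("/*?sort=", "/search?q=red&sort=price"),
    ("/*&sort=", "/search?q=red&sort=price"),
    ("/*.pdf$", "/files/guide.pdf"),
    ("/*.pdf$", "/files/guide.pdf?download=1"),
]
for pattern, path in checks:
    print(f"{pattern:10} {path:32} {matches(pattern, path)}")
```

The second check is the one worth noticing: /*?sort= does not match when sort= is not the first query parameter.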

Step-by-Step: Creating Your Robots.txt File

Here’s exactly how to create and deploy a robots.txt file, whether you’re on WordPress or a custom-built site.

For WordPress Sites

WordPress generates a virtual robots.txt file by default. If you’re using Yoast SEO or Rank Math, you can edit it directly from the plugin’s settings panel. In Rank Math, go to General Settings > Edit robots.txt. In Yoast, go to Tools > File Editor.

Here’s a solid starting template for a typical Singapore WordPress site:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/
Disallow: /*?s=
Disallow: /*?replytocom=
Disallow: /tag/
Disallow: /author/

Sitemap: https://www.yoursite.com.sg/sitemap_index.xml

Note the Allow: /wp-admin/admin-ajax.php line. Many WordPress themes and plugins use admin-ajax.php for front-end functionality. Blocking it can break how Google renders your pages.

For Custom or Non-WordPress Sites

Open a plain text editor. Notepad on Windows, TextEdit on Mac (set to plain text mode), or VS Code. Never use Word or Google Docs.
Write your directives following the syntax above.
Save the file as robots.txt (all lowercase, .txt extension).
Upload it to your site’s root directory via FTP, SFTP, or your hosting panel’s file manager. It must be accessible at yoursite.com.sg/robots.txt, not in any subfolder.
Verify it’s accessible by visiting the URL in your browser.

Testing Before You Go Live

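Never deploy a robots.txt change without testing it first. Google Search Console has a robots.txt report under Settings > Crawling, which shows the robots.txt files Google has found for your site, when each was last fetched, and any warnings or errors. It reports on the live file only, so to check a proposed file against specific URLs before deploying, use a separate tester. Once the file is live, the URL Inspection tool will tell you whether a given URL is blocked by robots.txt.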

Screaming Frog also lets you test robots.txt rules locally before uploading. If you’re making changes to a high-traffic site, test in Screaming Frog first, then deploy during low-traffic hours.
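For a quick local check without any extra tooling, Python’s standard library ships a robots.txt parser. One caveat: it applies rules in file order and does not implement Google-style wildcards or longest-match precedence, so the sketch below sticks to plain path prefixes. The proposed rules and the expected results are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# A proposed file, tested in memory before anything is uploaded.
PROPOSED = """\
User-agent: *
Disallow: /admin/
Disallow: /checkout/
"""

parser = RobotFileParser()
parser.parse(PROPOSED.splitlines())

# URL -> whether we expect it to be crawlable (hypothetical expectations).
expectations = {
    "https://www.yoursite.com.sg/admin/settings": False,
    "https://www.yoursite.com.sg/checkout/": False,
    "https://www.yoursite.com.sg/products/red-shoes": True,
}

failures = [
    url
    for url, expected in expectations.items()
    if parser.can_fetch("Googlebot", url) != expected
]
print(failures or "all expectations met")
```

Keeping a list of must-crawl and must-block URLs like this turns every future robots.txt change into a quick regression test.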

Real Mistakes I’ve Fixed on Singapore Websites

Theory is useful. But seeing actual mistakes drives the point home. Here are patterns I encounter regularly when auditing Singapore business sites.

Mistake 1: The “Disallow Everything” Leftover

A web developer sets Disallow: / during development to keep the staging site out of Google. The site launches. Nobody removes the directive. The business owner wonders why they have zero organic traffic six months later.

I’ve seen this happen to a Singapore legal firm that spent $15,000 on a new website. Their developer forgot to update robots.txt after launch. For four months, Google couldn’t crawl a single page. Always check robots.txt as part of your launch checklist.

Mistake 2: Blocking CSS and JavaScript

This was considered acceptable practice years ago. It’s now actively harmful. Google needs to access your CSS and JS files to render your page the way a real user sees it. If Googlebot can’t load your stylesheets, it can’t evaluate your page layout, mobile responsiveness, or Core Web Vitals properly.

Check your robots.txt right now. If you see lines blocking /wp-content/themes/ or /wp-content/plugins/, remove them immediately.

Mistake 3: Disallowing Pages That Need Noindex Instead

A Singapore property portal wanted to keep their agent login pages out of Google. They added Disallow: /agent-login/ to robots.txt. Problem solved, right? No. Another property site linked to their login page. Google indexed it anyway, showing that embarrassing “no description available” snippet in search results.

The fix: remove the Disallow, add a <meta name="robots" content="noindex"> tag to the page’s HTML. Let Googlebot crawl it, see the noindex tag, and properly remove it from the index.
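To confirm the tag actually made it into the rendered HTML, you can parse the page and look for it. This is a minimal sketch using Python’s built-in HTML parser; the class and function names are hypothetical, and it checks only the meta tag, not the X-Robots-Tag header.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Records whether a robots (or googlebot) meta tag contains noindex."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            directives = [d.strip().lower() for d in attr.get("content", "").split(",")]
            if "noindex" in directives:
                self.noindex = True


def has_noindex(html: str) -> bool:
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.noindex


page = '<html><head><meta name="robots" content="noindex"></head><body></body></html>'
print(has_noindex(page))
```

Run it against the HTML your server really returns, since a tag added in a template can be stripped or overridden by a plugin.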

Mistake 4: Forgetting Subdomain Separation

Your robots.txt at www.yoursite.com.sg/robots.txt has zero effect on blog.yoursite.com.sg or shop.yoursite.com.sg. Each subdomain needs its own file. I regularly find Singapore businesses with well-configured robots.txt on their main domain but completely unprotected staging or blog subdomains being crawled and indexed.
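The rule of thumb is that the governing robots.txt is always derived from the scheme and host of the page itself. A small sketch, with a hypothetical helper name:

```python
from urllib.parse import urlsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt location that governs the given page URL."""
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


for url in (
    "https://www.yoursite.com.sg/services/",
    "https://blog.yoursite.com.sg/post/robots-guide",
    "https://shop.yoursite.com.sg/cart/",
):
    print(url, "->", robots_url(url))
```

Three hosts, three different files: a rule in one of them says nothing about the other two.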

Mistake 5: Using Robots.txt for Security

Robots.txt is publicly readable. Anyone can type /robots.txt after your domain and see exactly which directories you’re trying to hide. Malicious actors actually use robots.txt as a roadmap to find sensitive areas of a site. Never rely on it for security. Use proper authentication, firewalls, and access controls instead.

Advanced Robots.txt Strategies

Once you’ve got the basics right, there are a few advanced techniques worth knowing.

Blocking AI Crawlers

With the rise of AI training crawlers, many Singapore content publishers are adding specific blocks:

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

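This won’t affect your Google search rankings, because none of these tokens controls the main search crawlers such as Googlebot or Bingbot. They are not all the same kind of thing, though: Google-Extended is a control token that Google’s existing crawlers check rather than a separate bot, and ChatGPT-User fetches pages in response to a user’s request rather than for training. Vendors also rename their crawlers from time to time, so confirm the current tokens in each vendor’s documentation. Either way, the block signals that you don’t consent to your content being used for AI model training.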

Crawl-Delay for Smaller Servers

Some Singapore businesses run on shared hosting that can’t handle aggressive crawling. The Crawl-delay directive tells compatible bots to wait a specified number of seconds between requests:

User-agent: Bingbot
Crawl-delay: 10

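Note that Googlebot does not honour Crawl-delay, and the crawl rate setting that used to live in Google Search Console has been retired. If Googlebot is overloading your server, Google’s documented approach is to temporarily return 500, 503, or 429 responses on the affected URLs, which makes it slow down. Bing documents support for Crawl-delay; support among other crawlers varies, so check each bot’s documentation.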
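If you write your own scripts that crawl a site, you can read the same directive with Python’s standard-library parser. The bot name SomeOtherBot below is invented to show the case where no delay applies.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: Bingbot
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# crawl_delay() returns the delay in seconds, or None when none applies.
bing_delay = parser.crawl_delay("Bingbot")
other_delay = parser.crawl_delay("SomeOtherBot")
print(bing_delay, other_delay)
```

Because the rule sits in a Bingbot group, it applies to that bot only; a delay for every compliant crawler would go under User-agent: *.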

Combining Robots.txt with Your Sitemap Strategy

Your robots.txt and XML sitemap should work as a pair. The sitemap tells search engines what to crawl. Robots.txt tells them what not to crawl. If a URL appears in your sitemap but is blocked by robots.txt, that’s a conflict. Google Search Console will flag these as errors. Audit both files together, not in isolation.
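That cross-check is easy to automate. The sketch below parses a sitemap and reports any URL whose path starts with a disallowed prefix. It assumes plain prefix rules (no wildcards) and a standard urlset sitemap; the function name and sample data are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_conflicts(sitemap_xml: str, disallowed_prefixes: list[str]) -> list[str]:
    """Return sitemap URLs whose path falls under a disallowed prefix."""
    root = ET.fromstring(sitemap_xml)
    conflicts = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        path = urlsplit(url).path or "/"
        if any(path.startswith(prefix) for prefix in disallowed_prefixes):
            conflicts.append(url)
    return conflicts


sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.yoursite.com.sg/products/red-shoes</loc></url>
  <url><loc>https://www.yoursite.com.sg/checkout/</loc></url>
</urlset>"""
print(sitemap_conflicts(sample, ["/checkout/", "/cart/"]))
```

Any URL this returns should either come out of the sitemap or have its Disallow rule removed.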

How to Audit Your Current Robots.txt File

Here’s a quick audit you can do right now. It takes about 10 minutes.

Visit yoursite.com.sg/robots.txt in your browser. Confirm it loads with a 200 status code.
Check that it references your XML sitemap. If not, add the Sitemap directive.
Look for any Disallow: / that might be blocking your entire site.
Confirm you’re not blocking CSS, JS, or image directories.
Cross-reference blocked paths with your sitemap. No URL should appear in both.
Open Google Search Console. Go to Settings > Crawling and review the robots.txt report for any flagged issues.
Check each subdomain separately. They each need their own file.

If you find issues during this audit, fix them and then request a re-crawl of your robots.txt through Google Search Console. Google caches your robots.txt file and typically refreshes it about once a day, but you can speed this up.
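Several of the checks in that audit can be scripted against the text of the file. This is a rough sketch, not a full validator: the function name and the list of path hints are assumptions, and every finding is a prompt to review rather than proof of a problem (a Disallow: / under an AI crawler group, for instance, may be deliberate).

```python
def audit_robots_txt(text: str) -> list[str]:
    """Flag a few common robots.txt problems in the given file contents."""
    findings = []
    has_sitemap = False
    asset_hints = (".css", ".js", "/wp-content/themes/", "/wp-content/plugins/")

    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "sitemap":
            has_sitemap = True
        elif field == "disallow":
            if value == "/":
                findings.append("Disallow: / blocks a whole user-agent group")
            if any(hint in value.lower() for hint in asset_hints):
                findings.append(f"Disallow: {value} may block CSS or JavaScript")

    if not has_sitemap:
        findings.append("No Sitemap directive found")
    return findings


leftover = "User-agent: *\nDisallow: /\nDisallow: /wp-content/plugins/\n"
for finding in audit_robots_txt(leftover):
    print("-", finding)
```

Run it on the file for each subdomain, since each one is audited separately.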

Get Your Technical SEO Foundation Right

Robots.txt is one small file, but it sits at the very top of the technical SEO hierarchy. Every crawl session starts with it. Every indexing decision is influenced by it. And every mistake in it cascades through your entire site’s search performance.

If you’ve read this far and realised your robots.txt needs work, that’s a good sign. It means you’re paying attention to the details that actually move rankings. Most of your competitors aren’t.

If you’d like a second pair of eyes on your robots.txt file, or you want a full technical SEO audit that covers crawlability, indexation, and site architecture, reach out to us at Best SEO. We’ll tell you exactly what’s working, what’s broken, and what to fix first. No fluff, just a clear action plan.