Crawl budget is one of the most consequential technical SEO concepts for large or rapidly growing websites. When Googlebot allocates a finite number of page visits to your site, every wasted crawl on a low-value URL is a missed opportunity to get a high-value page indexed, refreshed, or ranked. For smaller sites, crawl budget rarely constrains performance. For sites with tens of thousands of URLs, it can be the bottleneck between publishing content and seeing it rank.
This guide explains what crawl budget is, how Google calculates it, how to check it, and how to optimise it, with a focus on the internal linking decisions that most directly influence what Googlebot chooses to crawl.
Definition
Crawl budget is "the set of URLs that Google can and wants to crawl" on a given website within a given time period. It is determined by two factors: crawl capacity (how fast and often Googlebot can crawl without overloading the server) and crawl demand (how much Googlebot wants to crawl specific URLs based on perceived importance and freshness). This definition comes directly from Google's developer documentation.
Google has been explicit about which sites need to manage crawl budget. Their documentation states that crawl budget management matters for sites with more than 1 million pages, or sites with at least 10,000 pages that update daily.[1] If your site has fewer than 1,000 pages and updates infrequently, crawl budget is unlikely to be a limiting factor for your SEO.
That said, understanding crawl budget is useful for any site, because the practices that optimise it (clean URL structure, fast server response, minimal duplicate content, strong internal linking) are also SEO best practices that improve performance regardless of site size.
| Site size | Crawl budget priority | Typical impact |
|---|---|---|
| Under 1,000 pages | Low | Rarely a constraint |
| 1,000–10,000 pages | Medium | Worth auditing; some waste likely |
| 10,000+ pages with frequent updates | High | Direct impact on indexing speed and rankings |
Crawl budget is determined by two signals that Google evaluates for every website: crawl capacity and crawl demand.
Crawl capacity is the maximum rate at which Googlebot will crawl your site without causing server degradation. It is determined by your server's response time, error rate, and stability. Fast, reliable servers get a higher crawl capacity; slow or frequently erroring servers see Googlebot back off. You have direct control over this through hosting quality, CDN configuration, image optimisation, and server-side performance work. See: Core Web Vitals and server performance.
Crawl demand is Google's assessment of how much it wants to crawl each URL on your site. High-demand pages are those with strong signals of importance and freshness: they receive many internal and external links, they have been recently updated, and they generate click traffic in Search. Low-demand pages are those with few links, thin content, no external authority, and no update history. Demand is the factor most directly influenced by internal linking.
Gary Illyes of Google has stated (as quoted by Ahrefs) that "Google's crawling process is highly focused on removing duplication because 60% of the internet is duplicate."[2] Duplicate content is one of the biggest crawl budget drains. Every time Googlebot visits a parameter-generated duplicate of a page it has already seen, that is a crawl wasted on content with no unique value.
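To get a rough sense of how many crawled URLs are parameter-generated duplicates, you can group them by a normalised form. The sketch below is illustrative only: the `IGNORED_PARAMS` set and the example URLs are assumptions, and which parameters are safe to ignore depends entirely on your own site.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical set of parameters assumed not to change page content.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid", "sort"}


def normalise(url: str) -> str:
    """Drop ignored query parameters and the fragment from a URL."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def duplicate_clusters(urls):
    """Group URLs by normalised form and keep groups with more than one member."""
    groups = defaultdict(list)
    for url in urls:
        groups[normalise(url)].append(url)
    return {key: members for key, members in groups.items() if len(members) > 1}


# Example URLs (made up) as they might appear in a crawl export.
crawled = [
    "https://example.com/shoes",
    "https://example.com/shoes?utm_source=newsletter",
    "https://example.com/shoes?sort=price",
    "https://example.com/shoes?colour=blue",
]
clusters = duplicate_clusters(crawled)
```

Each cluster with several members is a candidate for a canonical tag or a crawl restriction; the `colour=blue` URL stays separate here because that parameter is not in the ignored set.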
Google Search Console (GSC) provides two reports that give you a practical picture of your crawl budget situation.
Navigate to Settings (gear icon) in GSC, then "Crawl stats". This report shows the total number of requests from Googlebot over the past 90 days, broken down by response code, file type, and purpose. Key metrics to monitor:
The Page indexing report (formerly Index Coverage, under "Indexing > Pages" in GSC) shows you how many pages are indexed vs excluded, and why excluded pages were not indexed. High counts of "Crawled - currently not indexed" can indicate crawl budget waste on low-quality pages. High counts of "Discovered - currently not indexed" suggest Googlebot has found the URL but has not prioritised crawling it, often because of insufficient internal link signals.
The most effective optimisations do one of two things: increase crawl capacity, or reduce the crawls wasted on low-value URLs.
Identify URL categories that provide no unique value to searchers and should not consume crawl budget:
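Use robots.txt to disallow crawling of URL patterns that should never be accessed by crawlers. Use noindex meta tags for pages that users can access but that should not be indexed. Note that Googlebot must still crawl a page to see its noindex tag, so noindex keeps a page out of the index without saving the crawl, and the tag is never seen on a URL that is blocked in robots.txt. Use canonical tags to consolidate parameter variants. See: canonical tags explained.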
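Before deploying robots.txt rules, it can help to test them against a sample of URLs. The following is a minimal sketch using Python's standard-library `urllib.robotparser`; the rules and URLs are hypothetical examples. The standard-library parser does plain prefix matching and does not implement the wildcard patterns that Googlebot supports, so treat it as a sanity check rather than a model of Googlebot's behaviour.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules; the paths are examples, not recommendations for a real site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /cart/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def crawlable(url: str, agent: str = "Googlebot") -> bool:
    """Return True if the parsed rules allow `agent` to fetch `url`."""
    return parser.can_fetch(agent, url)


sample_urls = [
    "https://example.com/products/blue-shoes",
    "https://example.com/search?q=shoes",
    "https://example.com/cart/checkout",
]
results = {url: crawlable(url) for url in sample_urls}
```

With these rules, only the product URL is reported as crawlable; the internal search and cart URLs are blocked.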
Each redirect hop Googlebot follows uses crawl capacity and time. A chain of 301 redirects (A to B to C) costs two extra requests before Googlebot reaches the final destination. Audit for redirect chains in your site crawl, and update all links (internal and in sitemaps) to point directly to the final destination URL. See: fixing broken internal links and redirect chains.
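If your crawler exports redirects as source and target pairs, chains can be found by following each source to its final destination. This is a small sketch under that assumption; the `redirects` mapping and the function names are made up for illustration.

```python
def resolve(url, redirects, max_hops=10):
    """Follow redirects from `url`; return (final_url, hops taken)."""
    seen = {url}
    hops = 0
    while url in redirects:
        url = redirects[url]
        hops += 1
        if url in seen or hops > max_hops:
            raise ValueError(f"redirect loop or overly long chain at {url}")
        seen.add(url)
    return url, hops


def find_chains(redirects):
    """Map each source that needs more than one hop to its final destination."""
    chains = {}
    for source in redirects:
        final, hops = resolve(source, redirects)
        if hops > 1:
            chains[source] = final
    return chains


# Hypothetical redirect export: /a -> /b -> /c is a chain, /old -> /new is not.
redirects = {"/a": "/b", "/b": "/c", "/old": "/new"}
chains = find_chains(redirects)
```

Here `chains` contains only `/a`, which should be repointed straight to `/c`, both in the redirect rule and in any internal links or sitemap entries that still reference `/a`.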
Soft 404s (pages that return a 200 HTTP status but display a "not found" or empty content response) are particularly damaging. Googlebot wastes a crawl, receives no useful content, and may suppress indexing of other pages based on the quality signal. Audit for soft 404s in GSC's Pages report under "Not found (404)" and "Soft 404" categories.
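Outside GSC, a crude heuristic can flag likely soft 404s in your own crawl data. The sketch below assumes you already have each page's status code and body text; the phrase list and the length threshold are arbitrary assumptions, and results should be reviewed by hand.

```python
# Assumed phrases and threshold; tune both for your own templates.
NOT_FOUND_PHRASES = ("page not found", "no results found", "nothing was found")
MIN_BODY_CHARS = 200


def looks_like_soft_404(status: int, body: str) -> bool:
    """Flag 200 responses whose body is near-empty or reads like an error page."""
    if status != 200:
        return False  # a real 404 or 410 is not a soft 404
    text = body.strip().lower()
    if len(text) < MIN_BODY_CHARS:
        return True
    return any(phrase in text for phrase in NOT_FOUND_PHRASES)


# Made-up responses for illustration.
empty_page = looks_like_soft_404(200, "<h1>Page not found</h1>")
real_404 = looks_like_soft_404(404, "Page not found")
normal_page = looks_like_soft_404(200, "real content " * 50)
```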
Googlebot adjusts its crawl rate based on server health. Pages loading over 500ms consistently signal a strained server. Work with your hosting provider or implement CDN caching, browser caching, and image compression to bring average response times below 200ms for important pages.
This is where crawl budget optimisation intersects most directly with internal link architecture. Internal links are the primary mechanism through which Googlebot discovers pages and judges their relative importance. Pages that receive many internal links are crawled more frequently. Pages that receive few or no links are deprioritised or missed entirely.
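A simple way to see which pages are starved of internal links is to count inlinks from a crawl of your own site. The sketch below assumes a link graph in the form of a dictionary mapping each page to the pages it links to; the graph shown is invented for illustration.

```python
from collections import Counter


def inlink_counts(link_graph):
    """Count distinct pages linking to each page, ignoring self-links."""
    counts = Counter({page: 0 for page in link_graph})
    for source, targets in link_graph.items():
        for target in set(targets):
            if target != source:
                counts[target] += 1
    return counts


# Hypothetical internal link graph: page -> pages it links to.
link_graph = {
    "/": ["/guides", "/pricing"],
    "/guides": ["/guides/crawl-budget", "/"],
    "/guides/crawl-budget": ["/guides"],
    "/pricing": ["/"],
    "/guides/old-post": ["/guides"],
}
counts = inlink_counts(link_graph)
orphans = sorted(page for page, n in counts.items() if n == 0)
```

In this example `/guides/old-post` links out but receives no internal links, so it is reported as an orphan.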
Key actions:
Your sitemap tells Googlebot which pages exist and (optionally) when they were last updated. Ensure your sitemap contains only canonical, indexable, 200-status URLs. Remove noindex pages, redirect URLs, and parameter variants from the sitemap. A sitemap that accurately represents your best content steers crawl budget towards those pages.
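The filtering rule above can be sketched as a small sitemap generator. This example assumes each page record carries `url`, `status`, `noindex`, `canonical`, and an optional `lastmod` field; those field names and the sample records are hypothetical, not the output format of any particular crawler.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def sitemap_eligible(page):
    """Keep only 200-status, indexable pages that are their own canonical."""
    return (
        page["status"] == 200
        and not page["noindex"]
        and page["canonical"] == page["url"]
    )


def build_sitemap(pages):
    """Return sitemap XML containing only the eligible pages."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        if not sitemap_eligible(page):
            continue
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = page["url"]
        if page.get("lastmod"):
            ET.SubElement(url, "lastmod").text = page["lastmod"]
    return ET.tostring(urlset, encoding="unicode")


# Made-up crawl records: one good page, one redirect, one noindex, one variant.
base = "https://example.com"
pages = [
    {"url": f"{base}/guides/crawl-budget", "status": 200, "noindex": False,
     "canonical": f"{base}/guides/crawl-budget", "lastmod": "2024-05-01"},
    {"url": f"{base}/old-guide", "status": 301, "noindex": False,
     "canonical": f"{base}/old-guide"},
    {"url": f"{base}/thank-you", "status": 200, "noindex": True,
     "canonical": f"{base}/thank-you"},
    {"url": f"{base}/shoes?sort=price", "status": 200, "noindex": False,
     "canonical": f"{base}/shoes"},
]
sitemap_xml = build_sitemap(pages)
```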
A simple diagnostic metric is the crawl score: the ratio of indexed pages to daily crawl requests.
Crawl score = Indexed pages / Daily Googlebot requests
A score of 1-3 is considered healthy: Googlebot is visiting each page roughly every 1-3 days. A score above 10 means Googlebot is crawling infrequently relative to your indexed pages, and freshness signals are slow to update.[3]
Example: A site with 5,000 indexed pages and 1,500 daily crawl requests has a crawl score of about 3.3, just above the 1-3 band and still close to healthy. A site with 20,000 indexed pages and 800 daily crawls has a score of 25 (problematic). The second site should aggressively reduce low-value URLs and improve internal linking to raise crawl demand for its most important pages.
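The calculation is easy to script against figures taken from GSC. The sketch below reproduces the two examples above; the function name and the `INFREQUENT_THRESHOLD` constant are illustrative, with the threshold of 10 taken from the guideline in this section.

```python
INFREQUENT_THRESHOLD = 10  # scores above this suggest infrequent crawling


def crawl_score(indexed_pages: int, daily_requests: float) -> float:
    """Indexed pages divided by average daily Googlebot requests."""
    if daily_requests <= 0:
        raise ValueError("daily_requests must be positive")
    return indexed_pages / daily_requests


site_a = crawl_score(5_000, 1_500)   # roughly 3.3
site_b = crawl_score(20_000, 800)    # 25.0

site_a_infrequent = site_a > INFREQUENT_THRESHOLD
site_b_infrequent = site_b > INFREQUENT_THRESHOLD
```

Only the second site crosses the threshold. The score assumes crawl requests are spread evenly across indexed pages, which is rarely true in practice, so read it as a rough indicator rather than a precise revisit interval.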
The fastest way to improve crawl demand for your best pages is to fix your internal link architecture. Linki analyses your complete internal link graph to identify:
By fixing these issues, you concentrate Googlebot's crawl capacity on the pages that matter most, accelerating indexing of new content and improving recrawl frequency for your most important existing pages.
Crawl budget is the number of URLs Googlebot will crawl on your site within a given period. Google defines it as "the set of URLs that Google can and wants to crawl." It is determined by crawl capacity (how fast Googlebot can crawl without overloading your server) together with crawl demand (how much Google wants to crawl specific URLs based on their importance and freshness signals).
To check your crawl budget in GSC, go to Settings (gear icon) and click "Crawl stats". This report shows total Googlebot requests over 90 days, broken down by response code and file type. To check crawl demand signals, use the "Pages" report under "Indexing" to identify "Discovered - currently not indexed" pages, which indicate URLs Googlebot has found but not prioritised for crawling.
A crawl score of 1-3 (indexed pages divided by daily crawl requests) is considered healthy, indicating Googlebot revisits each page roughly every 1-3 days. A score above 10 suggests Googlebot is crawling infrequently, and freshness signals may be slow to update. Sites with scores above 10 should focus on reducing low-value URLs and strengthening internal links to high-priority pages.
Internal linking affects crawl budget significantly. Internal links are the primary mechanism through which Googlebot discovers pages and judges their crawl priority. Pages with many internal inlinks receive higher crawl demand and are revisited more frequently. Pages with zero or very few internal links (orphan and near-orphan pages) generate minimal crawl demand and may be crawled rarely or missed entirely. Improving internal link distribution is one of the most direct ways to optimise crawl budget for large sites.
Crawl budget affects rankings indirectly. It does not directly influence ranking algorithms, but pages that are crawled infrequently receive delayed indexing of updates, slower discovery of new content, and reduced freshness signals. For sites publishing time-sensitive content or frequent updates, poor crawl budget management translates into slower ranking improvements.
Sources