Written by Rocío Rodríguez
Table of contents
- 1 What is a sitemap file?
- 2 How do search engine crawlers find our site’s pages?
- 3 Types of sitemaps
- 4 Improving the site’s indexing
- 4.1 Sitemap XML tags
- 4.2 Page priority within a website
- 4.3 Update frequency for each page
- 4.4 Modified date
- 4.5 Accessible URLs
- 4.6 Pages returning response codes other than 200
- 4.7 Pages with parameters and other session identifiers
- 4.8 Sitemap update
- 4.9 Multilingual sites
- 4.10 Size of the sitemap
- 4.11 Administering several sitemaps
- 4.12 Irrelevant pages
- 5 Check that your sitemap contains the correct pages
- 6 Submitting the sitemap
- 7 Common errors to avoid
What is a sitemap file?
A sitemap is a file containing a list of all the relevant pages of a website, along with some additional information about them: how often they are updated, when they were last modified, or the importance of a particular URL relative to the other pages of the site. The update frequency of a page’s content gives Google a hint on how often that page should be crawled.
The purpose of this file is to help search engines find and index your website’s pages. In general, crawlers index every page they find, unless they run into instructions that contain some kind of rule preventing them from doing so.
Sitemaps can have various formats, but the most common one uses XML. Sitemap files can be created manually, or generated by third-party tools such as XML Sitemap Generator, as well as plugins specifically developed for certain popular content management systems, like WordPress or Drupal.
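As a reference, a minimal XML sitemap follows the sitemaps.org protocol: a `<urlset>` root element containing one `<url>` entry per page, each with at least a `<loc>` tag. The URLs below are illustrative:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
  </url>
  <url>
    <loc>https://example.com/about.html</loc>
  </url>
</urlset>
```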
The creation of a sitemap file is not mandatory; it is, however, recommended. All webmasters should consider the perks of generating one on websites that do not yet have it. It’s important to keep in mind all of Google’s requirements and guidelines when creating a sitemap, to prevent any kind of error or problem. And if we still get a warning or an error after creating and uploading one, we must analyse it in detail and solve the underlying issue, so that Google can access and process the file correctly.
How do search engine crawlers find our site’s pages?
Search engines discover new pages through links, both internal and external. For example, if we create a new landing page and it doesn’t have any inbound links directing to it, neither internal nor external, Google won’t be able to find it, and therefore, it won’t be able to index it either. Nevertheless, there are also instances of pages being properly linked, but buried so deep within the website’s hierarchy that it becomes extremely difficult for crawlers to reach them.
A sitemap file makes finding and discovering new pages much easier for search engines. We must not assume, though, that the inclusion of pages in this file guarantees their crawling and indexing. If we have pages with poor content, their URLs might be included and submitted through the sitemap, but the search engine might choose not to index them. This typically happens to pages like tags, URLs containing two or fewer products, etc.
Pages that are neither found nor crawled won’t be added to Google’s index, and therefore won’t be returned in the search results for relevant user queries.
Types of sitemaps
There are several types of sitemaps that help search engines discover multimedia content, as well as other content that may be harder for them to analyse and process.
Video sitemaps
Video sitemaps allow us to inform search engines about the video content we have on our site. This is information search engine crawlers wouldn’t be able to identify properly with their regular crawling techniques. This way, we improve our website’s visibility for search queries made from Google Videos.
A sitemap video entry can specify the duration, category and recommended age classification for the posted video content.
You can also include video content URLs in an already existing regular sitemap, without creating a specific one for your videos.
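As a sketch, a video sitemap entry uses the Google video sitemap namespace; the URLs, title and duration below are illustrative:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
  <url>
    <loc>https://example.com/videos/recipe-tutorial.html</loc>
    <video:video>
      <video:thumbnail_loc>https://example.com/thumbs/tutorial.jpg</video:thumbnail_loc>
      <video:title>Example video title</video:title>
      <video:description>Short description of the video.</video:description>
      <video:content_loc>https://example.com/media/tutorial.mp4</video:content_loc>
      <!-- optional: video length in seconds -->
      <video:duration>180</video:duration>
    </video:video>
  </url>
</urlset>
```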
Image sitemaps
This type of sitemap will significantly improve our visibility for searches made from Google Images, helping our images get crawled and indexed. Again, this is the sort of information search engines wouldn’t be able to identify using their regular crawling techniques.
A sitemap image entry can include the subject, type and license for the uploaded picture.
We can either use an independent sitemap file for images, or add image URLs to an already existing file.
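For instance, adding image URLs to an existing sitemap entry might look like this, using the Google image sitemap namespace (all URLs are illustrative):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>https://example.com/recipes/almond-chicken.html</loc>
    <image:image>
      <image:loc>https://example.com/images/almond-chicken-1.jpg</image:loc>
    </image:image>
    <image:image>
      <image:loc>https://example.com/images/almond-chicken-2.jpg</image:loc>
    </image:image>
  </url>
</urlset>
```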
Websites that would most likely benefit from an image sitemap file could be, for example, tourist sites, recipe sites, or online shops.
If we search Google Images for an “almond chicken” recipe, we’ll see a great number of results depicting the dish, each image with its own URL. In results like these, the recipe images have all been included in the sitemap file.
This practice makes it easier for search engines to crawl and index images, which in turn improves our visibility for specific queries in Google Images.
News sitemaps
This type of sitemap is mostly used to speed up the discovery of news articles by search engine crawlers.
It’s slightly different from the one containing regular pages: it has specific tags like <news:keywords> or <news:title>. The <news:title> tag is mandatory and must include the exact title or headline that appears on the website. The <news:keywords> tag is not mandatory, but we recommend including it too. Search engine crawlers use the terms it contains to classify the news article, which helps it obtain better rankings for the related search queries we want to be more visible for.
We do not recommend including more than 8 keywords in this tag, though. Keep in mind, too, that the order in which these keywords appear in no way determines their importance; they all receive the same relevance weight.
There is also the <news:stock_tickers> tag, which is used for financial news.
News sitemap files cannot contain more than 1,000 URLs, or include news articles older than 24 hours, counting from the moment they are published. The articles themselves, however, can continue to appear in Google News for up to 30 days.
Here’s what a news sitemap syntax is like:
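A minimal news sitemap entry, assuming an illustrative publication name, URL and dates, might look like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>https://example.com/news/article.html</loc>
    <news:news>
      <news:publication>
        <news:name>Example News</news:name>
        <news:language>en</news:language>
      </news:publication>
      <news:publication_date>2024-05-14T08:00:00+01:00</news:publication_date>
      <!-- must match the headline exactly as it appears on the page -->
      <news:title>Example headline exactly as published</news:title>
      <!-- optional: terms used to classify the article -->
      <news:keywords>recipes, almond chicken</news:keywords>
    </news:news>
  </url>
</urlset>
```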
This contributes to improving our rankings, because if search engines discover our page shortly after it’s published, we have a better chance of positioning our news for current search trends just in time, at their peak audience.
Google has some strict guidelines for generating news sitemaps, which must be met. We recommend reviewing these requirements if you intend to include this type of sitemap on your site.
Improving the site’s indexing
Sitemap XML tags
A sitemap contains a series of tags, some of which are optional: <lastmod>, <changefreq> and <priority>.
These optional tags, which we will explore shortly, are going to allow us to provide search engines with some relevant data regarding our pages, which will contribute to improve their crawling and indexing process.
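As a sketch, here is a single sitemap entry with the optional tags marked (the URL and values are illustrative):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/blog/post.html</loc>
    <!-- the three tags below are optional -->
    <lastmod>2024-05-14</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```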
Page priority within a website
The <priority> tag indicates to search engines the importance of a URL relative to the remaining pages of the website, and its value ranges from 0.0 to 1.0. This suggestion doesn’t affect your Google rankings.
This is what Google’s documentation says on this matter: “this value doesn’t affect the comparison between your website and others; it simply allows you to inform search engines about pages you consider to be more important for crawlers”.
Update frequency for each page
The <changefreq> tag is also optional and indicates how often a page changes. These are the values accepted by this tag:
- always (documents that change each time they are accessed)
- hourly
- daily
- weekly
- monthly
- yearly
- never (should be used for archived URLs)
This tag’s content should be taken for what it is: a mere suggestion rather than an absolute directive, which means crawlers may take its information into consideration or ignore it. For example, it’s not rare to run into cases where a page set to <changefreq>hourly</changefreq> is crawled daily. Likewise, search engine crawlers may crawl a page set to <changefreq>yearly</changefreq> more frequently.
Modified date
As you’ve probably figured out from its name, the <lastmod> tag indicates the date on which the content was last updated. The date must be written in W3C date and time format.
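The W3C format accepts either a plain date or a date with time and timezone offset. For instance (values illustrative):

```xml
<lastmod>2024-05-14</lastmod>
<lastmod>2024-05-14T09:30:00+01:00</lastmod>
```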
Accessible URLs
All sitemap URLs must be accessible, i.e. any search engine bot should be able to reach them. Thus, we must not include in our sitemap file URLs that have been blocked by robots.txt or by the HTML meta robots tag.
To expand on this, keep in mind that it’s incoherent for Google to find, on the one hand, pages listed in the sitemap file so that it crawls and indexes them, and on the other, a “noindex” value in the meta robots tag within the HTML of those same pages. We must avoid this at all costs if we want optimal saturation results.
Pages returning response codes other than 200
All URLs included in a sitemap must return a 200 OK status code. 4xx status codes must be avoided, and the same goes for redirected URLs (301, 302, etc.).
Pages with parameters and other session identifiers
We must avoid including URLs with any kind of session ID, as they’re duplicates of the original page. This way, we focus the search engines’ crawling on relevant URLs, while also reducing the crawling of duplicate content that has no SEO value.
URLs with parameters should also be excluded. These pages usually have the same content as their original versions, at least partially, if not entirely, but ordered in a different manner through a filter chosen by the user: price, colour, brand, etc. So these URLs would have identical or very similar information. Including them in a sitemap file would result in search engines crawling duplicate pages, as well as irrelevant URLs that we don’t want them to take into account, much less index.
Sitemap update
A sitemap file must be updated periodically, so that it always contains the newest URLs of our website. The content we have on our website should be consistent with the content we tell search engines to index.
The update frequency will depend on the website type (news outlet, blog, online shop, etc.) and how often we publish new content. For a news outlet, for example, the sitemap should update daily, because it’s best to include the URLs of the latest news and articles as they are published. For an online shop, the update frequency won’t need to be as high. Nevertheless, given that an e-commerce site changes products all the time (some get discontinued, others get added, etc.), we should make sure the sitemap file keeps up with our content modifications.
Using plugins helps us keep our sitemap updated automatically. Manual updates are a perfectly valid option too, but it usually becomes a somewhat complex and arduous process.
Multilingual sites
If your website is multilingual, each language should have its own sitemap file in its respective root directory and Search Console property, whenever possible.
Size of the sitemap
A sitemap file cannot exceed 50 MB (52,428,800 bytes) uncompressed, and it shouldn’t contain more than 50,000 URLs. By complying with these limits, we ensure our web server isn’t overburdened serving very large files.
If your website has more than 50,000 URLs, or the file exceeds the 50 MB limit, you’ll have to distribute your content across several sitemap files, which you can manage through a single sitemap index.
Google recommends using the gzip format for compressing the sitemap, instead of zip.
Administering several sitemaps
Simplify your sitemap administration by using a sitemap index. The sitemap index file allows you to submit all your sitemaps at the same time, which makes the process considerably easier.
This is usually quite useful for large websites, such as media outlets that upload news sitemaps for each month of the year. It’s also good practice for other sites, which may not be as large but have many different sitemap files.
We may also come across sitemaps with a very slow response time, a situation that can affect the indexing of the pages they contain. In this case, it’s best to split one big sitemap into several files; managing them all from a single sitemap index is very easy.
Here’s an example of how you can organise the pages for an online fashion shop. They can be distributed as follows:
- Sitemap 1: category pages (men, women, dresses, trousers, etc.)
- Sitemap 2: brand pages (Diesel, Desigual, Pedro del Hierro, Bimba y Lola, etc.)
- Sitemap 3: blog posts
- Sitemap 4: blog tags
- Sitemap 5: products
- Sitemap 6: products
- Sitemap 7: products
- Sitemap 8: images
This way we can control the indexing of each sitemap separately, and see if there are any specific problems in any sitemap type.
The sitemap index file can list up to 50,000 sitemaps, and it cannot reference other sitemap indices, only sitemap files.
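Following the fashion-shop example above, a sitemap index is a sketch like this: a `<sitemapindex>` root element listing one `<sitemap>` entry per file (file names and dates are illustrative):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-categories.xml</loc>
    <lastmod>2024-05-14</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-products-1.xml</loc>
    <lastmod>2024-05-14</lastmod>
  </sitemap>
</sitemapindex>
```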
Check that your sitemap contains the correct pages
Before submitting our sitemap, we must make sure that it only contains relevant URLs of our site, that is, those we want Google to crawl and index. To do this, we can use tools like Screaming Frog, with which we can download our current sitemap file.
When the tool has finished crawling all URLs, we should focus on the “Status Code” column. Finding status codes other than 200 OK is a bad sign. Redirected pages, whether temporary or permanent, don’t belong in a sitemap; we recommend removing them from the file. The same applies to non-existent pages, namely those returning 404, 410, etc.
Another indicator we recommend checking is “Status”, which tells you whether a page is prevented from being crawled by the robots.txt file. We must check whether this rule is correct or has been added to the file by mistake. It could be the case of a page we didn’t want crawlers to find at some point, but now want indexed. If our robots.txt file is correct, we should remove all the blocked URLs from the sitemap; conversely, if some of these URLs are being blocked by mistake, we should remove the rule from robots.txt to facilitate their crawling.
It’s also important to check the “Meta Robots” column, to identify which pages have the “noindex” rule implemented. Any URLs implementing this rule should not be included in our sitemap file. It would be incoherent to submit in our sitemap the same URLs we’re blocking search engines’ access to.
Take all these recommendations into account: by correcting every possible error, you’ll get better saturation results.
Once we’ve fixed all issues, we must re-submit our sitemap.
Submitting the sitemap
You can add the sitemap file to the root directory of your web server, e.g. at https://mydomain.com/sitemap.xml. Once created, it’s important to submit it to Google. From Search Console we can add, submit and test the sitemap by going to Index > Sitemaps in the sidebar menu. We will also be able to monitor submitted files and detect any errors or warnings the tool might point out.
Another option is to include the sitemap file URL to our website’s robots.txt file. To do this, we simply need to include:
- Sitemap: https://example.com/sitemap_location.xml
However, this option should be considered a last resort. Anyone can access your site’s robots.txt file, and it’s better to avoid exposing information that could be used against you.
The saturation index is the ratio between the pages we submit to Google and the pages it ultimately indexes. We can access this data in Google Search Console, under Index > Sitemaps in the sidebar.
This value rarely reaches 100%. However, we should strive to get as close to it as possible, as that means almost all pages submitted through the sitemap are being crawled and indexed. The more relevant pages get indexed, the higher the chance of them being returned by the search engine for suitable user queries. Pages not indexed by Google won’t be found by users entering queries, which results in loss of traffic, rankings, etc.
Common errors to avoid
- Submitting an empty sitemap: generating and submitting a sitemap that doesn’t contain URLs we want crawlers to find won’t help improve our SEO.
- Exceeding the allowed size: if our sitemap exceeds 50 MB uncompressed, we should create a sitemap index file and divide it into several sitemaps.
- Including an incorrect date: we must make sure that all dates have the correct W3C date and time format (including the time stamp is optional).
- Invalid URLs: URLs containing odd characters or disallowed symbols, such as quotation marks or spaces, as well as URLs with the wrong protocol (HTTP instead of HTTPS).
- Duplicate tags: to solve this problem we must remove duplicate tags and re-submit the sitemap.
- Too many URLs: make sure the sitemap doesn’t have more than 50,000 URLs, and if it does, divide it into several files, keeping in mind that each file shouldn’t have more than 50,000 URLs.
- Incomplete URLs: URLs included in the sitemap should always be absolute. For example, www.mydomain.com would be incorrect, because it omits the HTTP/HTTPS protocol.
- Submitting individual sitemaps: place all your sitemap files under a common sitemap index file.
- Wrong tags: make sure all sitemap tags are written correctly. Avoid typos like <news:languaje> instead of <news:language>; they could make the sitemap return multiple errors, preventing search engines from processing it successfully.