Written by Ramón Saquete
Over time, Google has greatly improved its indexing of JavaScript and AJAX. In the beginning, it didn’t index anything, nor did it follow links appearing within content loaded through these technologies. Then, little by little, it began indexing some implementations and improving its capabilities. Nowadays, it can index many different implementations and follow links loaded through AJAX or the Fetch API. Nevertheless, there will still be cases where it may fail to do so.
To analyse the cases in which Google could end up not indexing our site, we first need to understand the concept of Client Side Rendering (CSR). It means that the HTML is painted client-side with JavaScript, usually making heavy use of AJAX. Originally, websites always painted the HTML server-side (Server Side Rendering, or SSR), but for some time now CSR has become popular, with the arrival of JavaScript frameworks like Angular, React and Vue. However, CSR negatively affects indexing and website rendering performance, and consequently SEO.
As we have explained before, to ensure indexing in all search engines and situations, besides achieving good performance, the best solution is to use a universal framework, since with these measures we end up with what is called Hybrid Rendering. It consists of painting the website on the server on the first load, and then on the client through JavaScript and AJAX as the user navigates through the links that follow. In reality, though, there are more situations where the term Hybrid Rendering is also valid.
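As a rough idea of what hybrid rendering looks like in practice, here is a minimal sketch assuming a Node.js server with Express (our own simplified example, not code from any particular framework): the same rendering function paints the full page on the server for the first load, and is also exposed as a fragment endpoint that the client-side JavaScript can request on subsequent navigation.

```javascript
// Hybrid rendering sketch (simplified assumption: Node.js + Express installed).
const express = require('express');
const app = express();

// Shared rendering function for a page's main content (hypothetical data).
function renderProduct(id) {
  return '<h1>Product ' + id + '</h1><p>Description of product ' + id + '.</p>';
}

// First load: the full document is painted on the server (SSR).
app.get('/product/:id', (req, res) => {
  res.send('<!DOCTYPE html><html><head><title>Product ' + req.params.id +
    '</title></head><body><main id="content">' + renderProduct(req.params.id) +
    '</main><script src="/app.js"></script></body></html>');
});

// Subsequent navigation: the client requests only this fragment via AJAX
// and paints it with JavaScript (CSR).
app.get('/fragment/product/:id', (req, res) => {
  res.send(renderProduct(req.params.id));
});

app.listen(3000);
```

A universal framework automates this duplication for us, running the same components on both the server and the client.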
Sometimes, the development company uses CSR and doesn’t offer us the option of using a universal framework. This CSR-based web development will get us into trouble, to a greater or lesser degree, depending on the crawler and its ranking algorithms. In this post, we are going to analyse what these problems with Google’s crawler are and how to solve them.
CSR issues during the initial load of a page
First, we are going to analyse the indexing problems that occur when we access a URL from outside the website and the HTML is rendered client-side with JavaScript.
Issues as a result of slow rendering
Google’s indexing process goes through the following steps:
- Crawling: Googlebot requests a URL from the server.
- First wave of indexing: the content painted on the server is indexed right away, and new links to crawl are collected.
- Rendering: the HTML painted client-side is generated by running the JavaScript. This process is computationally expensive (it can happen immediately, or it can even take several days, while Google waits for the resources it needs to do it).
- Second wave of indexing: with the HTML painted client-side, the remaining content is indexed and new links to crawl are collected.
Besides the fact that pages take longer to be fully indexed, which delays the indexing of subsequent pages linked from them, if the rendering of a page is slow, Googlebot’s renderer can leave parts of it unpainted. In tests we ran using the “Fetch as Google” option in Google Search Console, the screenshot it generates does not paint anything that takes longer than 5 seconds to be displayed. It does, however, generate the HTML that takes longer than those 5 seconds. To understand why, keep in mind that Google Search Console’s renderer first builds the HTML by running the JavaScript with Googlebot’s renderer, and only then paints the page’s pixels. The first task is the one that matters for indexing, and the one we refer to with the term CSR. In Google Search Console, though, we only see the HTML generated during the first wave of indexing, not the one generated by Googlebot’s renderer.
In Google Search Console we cannot see the HTML painted by the JavaScript that Googlebot runs and uses in the last indexing phase; to see it, we have to use the mobile friendly test: https://search.google.com/test/mobile-friendly. In the tests we conducted, when the HTML rendering took more than 19 seconds, nothing got indexed. While this is a long time, it can be exceeded in some cases, especially if we use AJAX intensively. In those cases Google’s renderer, like any renderer, has to wait for the following steps to occur:
- The HTML is downloaded and processed to request the linked files and build the DOM.
- The CSS is downloaded and processed to request the linked files and build the CSSOM.
- The JavaScript is downloaded, compiled and run, in order to launch the AJAX request(s).
- Each AJAX request is placed in a request queue, together with the other requested files, waiting its turn to be sent.
- The AJAX request is dispatched and has to travel over the network to the server.
- The server answers the request over the network and, finally, we have to wait for the JavaScript to run in order to paint the content into the page’s HTML template (a minimal sketch of this client-side flow is shown after this list).
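To make the last steps more concrete, here is a minimal, illustrative client-side rendering sketch; the /api/article/123 endpoint and the #content container are hypothetical names, not taken from any real site. Nothing can be painted until the script has been downloaded, compiled and run, and the AJAX response has travelled back.

```javascript
// Minimal CSR sketch (illustrative only; endpoint and markup are hypothetical).
document.addEventListener('DOMContentLoaded', () => {
  // The AJAX request can only be queued and sent once this script has been
  // downloaded, compiled and executed.
  fetch('/api/article/123')
    .then((response) => response.json())
    .then((article) => {
      // Only after the response comes back over the network is the HTML
      // template finally painted client-side.
      document.querySelector('#content').innerHTML =
        '<h1>' + article.title + '</h1><p>' + article.body + '</p>';
    });
});
```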
The request and download times in the process we have just described depend on the network and server load at that moment. Moreover, Googlebot only uses HTTP/1.1, which is slower than HTTP/2, because requests are dealt with one after another rather than all at the same time. Since HTTP/2 requires both the client and the server to support it, Googlebot will use HTTP/1.1 even if our server supports HTTP/2. To summarise, this means Googlebot waits for each request to finish before launching the next one, and it might not try to parallelise certain requests by opening several connections the way browsers do (we don’t know exactly how it handles this). So we are in a situation where we could exceed the 19 seconds we estimated earlier.
Imagine, for example, that between images, CSS, JavaScript and AJAX requests, more than 200 requests are launched, each one taking 100 ms: that already adds up to around 20 seconds for the requests alone. If the AJAX requests are sent to the back of the queue, we will most likely exceed the time required for their content to be indexed.
On the other hand, because of these CSR performance issues, we will get a worse score for the FCP (First Contentful Paint) metric in PageSpeed, worse WPO overall and, as a consequence, worse rankings.
Indexing issues
When indexing content painted client-side, Googlebot can run into the following issues, which will prevent the JavaScript-generated HTML from being indexed:
- The page uses a version of JavaScript the crawler doesn’t recognise.
- The page uses a JavaScript API not supported by Googlebot (at the moment, we know Web Sockets, WebGL, WebVR, IndexedDB and WebSQL are not supported; more information at https://developers.google.com/search/docs/guides/rendering). A defensive sketch for this case is shown after the list.
- JavaScript files are blocked by robots.txt.
- JavaScript files are served through HTTP while the website uses HTTPS.
- There are JavaScript errors.
- If the application requests the user’s permission to do something, and the rendering of the main content depends on the answer, the content won’t get painted, because Googlebot denies any permission it is asked for by default.
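As a defensive pattern for the unsupported-API case mentioned in the list, a sketch along these lines (the endpoint and element names are hypothetical) keeps the main content renderable even when an API such as WebSocket is missing:

```javascript
// Defensive sketch: APIs Googlebot doesn't support are used only as an
// enhancement, never as a requirement for painting the main content.
function renderMainContent(text) {
  document.querySelector('#content').textContent = text;
}

// The main, indexable content is painted unconditionally.
renderMainContent('Main content that must be indexable.');

// Optional live updates are wrapped in feature detection, so a missing API
// never throws an error and never blocks rendering for the crawler.
if ('WebSocket' in window) {
  const socket = new WebSocket('wss://example.com/live-updates'); // hypothetical
  socket.addEventListener('message', (event) => renderMainContent(event.data));
}
```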
To find out whether we are suffering from any of these issues, we should use Google’s mobile friendly test. Like Google Search Console, it shows us a screenshot of how the page is painted on the screen, but it also shows us the HTML code generated by the renderer (as mentioned earlier), the JavaScript error log, and the JavaScript features the renderer cannot interpret yet. We should use this tool to test a representative URL of each page template on our website, to make sure the whole site is indexable.
We must keep in mind that, in the HTML generated by this tool, all metadata (including the canonical URL) will be ignored by Googlebot, as it only takes that information into account when it is painted on the server.
Now, let’s see what happens when we use a link to navigate, once we’re already on the website, and the HTML is painted client-side.
Indexing issues
Unlike CSR on the initial load, navigating to the next page by switching the main content via JavaScript is faster than with SSR. But we will have indexing issues if:
- Links don’t have a valid URL returning 200 OK in their href attribute.
- The server returns an error when the URL is accessed directly, without JavaScript, or with JavaScript enabled and all caches cleared. Be careful with this: if we navigate to the page by clicking a link, it may seem to work, because it is loaded by JavaScript. Even when accessing it directly, if the website uses a Service Worker, it can simulate a correct response by loading from its cache. But Googlebot is a stateless crawler, so it doesn’t take into account any Service Worker cache, or any other JavaScript storage technology like Local Storage or Session Storage, and it will get the error.
Moreover, for the website to be accessible, the URL has to be changed with JavaScript using the history API, as in the sketch below.
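A minimal sketch of this kind of navigation, under our own assumptions (a data-enhanced attribute on the links and a /fragment/ endpoint that returns only the main content), could look like this: every link keeps a real URL in its href that the server answers directly with 200 OK, JavaScript only intercepts the click, and the history API keeps the address bar in sync.

```javascript
// Illustrative link enhancement (attribute and endpoint names are hypothetical).
async function paintFragment(url) {
  const response = await fetch('/fragment' + url); // load only the main content
  document.querySelector('#content').innerHTML = await response.text();
}

document.querySelectorAll('a[data-enhanced]').forEach((link) => {
  link.addEventListener('click', (event) => {
    event.preventDefault();                  // stop the full page load
    const url = link.getAttribute('href');   // a real, crawlable URL
    paintFragment(url);
    history.pushState({ url }, '', url);     // update the URL with the history API
  });
});

// Back/forward buttons: repaint the content stored for the previous URL.
window.addEventListener('popstate', (event) => {
  if (event.state && event.state.url) {
    paintFragment(event.state.url);
  }
});
```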
What happens with the fragments now that Google can index AJAX?
Fragments are a part of a URL that can appear at the end, preceded by a hash #. For example:
http://www.humanlevel.com/blog.html#example
This kind of URL never reaches the server; it is managed client-side only. This means that when the above URL is requested, the server receives the request for “http://www.humanlevel.com/blog.html”, and in the client, the browser scrolls to the fragment of the document being referred to. This is the common and originally intended use of these URLs, commonly known as HTML anchors. An anchor, in reality, is any link (the “a” tag in HTML comes from anchor). However, back in the old days, fragments were also used to modify URLs through JavaScript on AJAX-loaded pages, with the intention of letting the user navigate through their browsing history. It was implemented this way because, at the time, the fragment was the only part of the URL that could be modified with JavaScript, so developers took advantage of it and used it in a way it wasn’t intended for. This changed with the arrival of the history API, which made it possible to modify the entire URL through JavaScript.
Back when Google couldn’t index AJAX, if a URL changed its content through AJAX based on the fragment, we knew Google was only going to index the URL and the content without taking the fragment into account. So what happens to pages with fragments now that Google can index AJAX? The behaviour is exactly the same: if we link to a page with a fragment, and it changes its content when accessed through that fragment, Google will index the content ignoring the fragment, and the popularity will go to the URL without the fragment, because Google trusts that the fragment will be used as an anchor and not to change the content, as it should be.
However, Google does currently index URLs with a hashbang (#!). This is implemented by simply adding the exclamation mark or bang, and Google makes it work to maintain backwards compatibility with an obsolete specification for making AJAX indexable. This practice, however, is not recommended, because now it should be implemented with the history API and, besides, Google could stop indexing hashbang URLs at any time.
Blocking the indexing of partial responses through AJAX
When an AJAX request is sent to the URL of a REST or GraphQL API, the response is a JSON or a piece of a page that we don’t want indexed. Therefore, we should block the indexing of the URLs these requests are directed to.
Back in the day we could block them using robots.txt, but since Googlebot’s renderer came to exist, we cannot block any resource used to paint HTML.
Currently, Google is a little smarter and doesn’t usually try to index JSON responses, but if we want to make sure they don’t get indexed, the universal solution applicable to all search engines is to make all URLs used with AJAX accept only requests made through the POST method, since crawlers don’t use it. When a GET request reaches the server, it should return a 404 error. In terms of programming, this doesn’t force us to remove the parameters from the URL’s QueryString.
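A sketch of this POST-only idea, assuming a Node.js API built with Express (our own example, not a prescribed implementation), could look like this:

```javascript
// POST-only API endpoint sketch (hypothetical Express application).
const express = require('express');
const app = express();

// The page's own JavaScript calls this endpoint with the POST method,
// e.g. fetch('/api/products?page=2', { method: 'POST' }).
app.post('/api/products', (req, res) => {
  res.json({ items: ['Product 1', 'Product 2'] }); // partial content for AJAX
});

// Crawlers request URLs with GET, so they receive a 404 and nothing is indexed.
app.get('/api/products', (req, res) => {
  res.status(404).send('Not found');
});

app.listen(3000);
```

Note that the query string can stay in the URL; only the HTTP method changes.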
There is also the possibility of adding the HTTP header “X-Robots-Tag: noindex” (invented by Google) to the AJAX responses, or making these responses return a 404 or 410 status. If we use these techniques on content loaded directly from the HTML, it won’t get indexed, just as if we had blocked it through the robots.txt file. However, given that it is the JavaScript that paints the response onto the page, Google doesn’t establish a relationship between this response and the JavaScript that paints it, so it does exactly what we expect: it doesn’t index the partial response and it fully indexes the generated HTML. Be careful, though: this behaviour may change some day, and if it does, the content we load through AJAX with this technique applied could stop being indexed.
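Alternatively, the same hypothetical endpoint could return the header mentioned above; a minimal sketch, again assuming Express:

```javascript
// Sketch: marking an AJAX response as non-indexable with the X-Robots-Tag header.
const express = require('express');
const app = express();

app.get('/api/products', (req, res) => {
  res.set('X-Robots-Tag', 'noindex');              // Google-specific header
  res.json({ items: ['Product 1', 'Product 2'] }); // partial content for AJAX
});

app.listen(3000);
```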
Conclusion
Google can now index JavaScript and AJAX, but it inevitably implies a higher cost than indexing HTML that has already been rendered on the server. This means that SSR is, and will continue to be, the best option for quite some time. If you have no alternative but to deal with a fully or partially CSR website, you now know how to handle it.