Written by Jose Vicente
The Web’s ecosystem needs links in order to create relationships between different pieces of content. This is fundamental for websites: both users and search engine bots need crawlable links to navigate through the different content pages, and bots additionally need those links to establish relationships and hierarchy between all of a website’s content.
What are orphan pages?
When a page is disconnected from the rest because it doesn’t get linked from any other page on the website, we’re talking about an orphan page. Orphan pages don’t have other pages transferring authority or relevance to them through link anchor texts.
Although search engine bots are capable of reaching this content through links coming from other domains, they won’t be able to adequately situate them within the internal website architecture these pages belong to.
Are orphan pages an issue for SEO?
Orphan pages are a problem for our website’s SEO, not so much because of their disconnection from the rest of the domain’s content, but because of the kind of pages that tend to end up in this state. If a page has plenty of quality content and loses links from all other pages, it will still be a quality page in and of itself, but its individual ranking will suffer from the loss of:
- General authority inherited from the rest of the pages.
- Relevance contributed by link anchor texts.
- Regular analysis by SEO tools: if crawlers can’t reach these pages, we can’t detect their errors with the typical crawling methods.
The real problem lies in pages that become orphaned or are published without links from other pages. These can be:
- One-off campaigns in Google Ads or other advertising platforms acting as entry points, which in many cases are not integrated within a website’s main navigation.
- Obsolete pages becoming unpublished: the CMS unlinks them from the categories they appear in, but doesn’t actually delete them.
- Incorrect CMS-generated URLs disconnected from the architecture.
- Pages created for testing purposes that are not removed afterwards, in many cases ending up as sources of duplicate content.
If these pages keep appearing and accumulating, in the end this ‘thin content’ could become a real issue for our website’s organic rankings.
Thus, orphan pages can generate problems when trying to get SEO traffic for the following two reasons:
- Lack of links from the website itself, which will damage its rankings.
- Thin content generation if the content is low-quality.
How to detect orphan pages
Once it’s clear that orphan pages are not great for our website’s organic rankings, we need to learn how to detect them. Running a crawl with Screaming Frog or any other SEO crawler won’t give us any information on its own: these pages don’t have links, so these tools can’t reach them. We have to find this unlinked content in other ways. Let’s explore them:
- XML sitemap file: if our content management system can generate an XML sitemap, it will list all the pages the CMS has created, whether they are linked internally or not.
- Google Search Console: this tool can show us pages generating impressions but no SEO traffic. It’s an interesting way to discover pages that, while not linked from anywhere, are still displayed in the results from time to time.
- Google Analytics: look for pages receiving visits from traffic sources other than the organic channel.
- Importing URLs with external links: pages linked from other domains, whether or not they bring in traffic. We can analyse these with Google Search Console, or other tools like Majestic or Ahrefs.
- Server logs: requesting the logs from our server gives us a list of all the HTML URLs that have been requested.
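Several of these sources boil down to producing a flat list of URLs. As an illustration, and assuming a standard sitemaps.org-style sitemap, a sketch in Python using only the standard library could flatten one like this (the example sitemap and URLs are made up):

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def urls_from_sitemap(xml_text):
    """Extract every <loc> entry from a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", NS)]

# Minimal illustrative sitemap; a real one would come from the CMS.
sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/orphan-page</loc></url>
</urlset>"""

print(urls_from_sitemap(sitemap))
# ['https://example.com/', 'https://example.com/orphan-page']
```

The same idea applies to the other sources: each one just needs to be reduced to a plain list of URLs before the comparison.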
This way we will get all possible URLs, including those bringing in traffic or receiving authority from other domains. If we put all these URLs in one list and compare it to the list obtained through a full crawl of our website, we can identify the orphan pages: they appear in the full list, but not in the crawl.
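The comparison itself can be sketched in a few lines of Python; the lists and URLs below are illustrative, and in practice each list would be exported from the sources and the crawler mentioned above:

```python
def find_orphans(all_known, crawled):
    """URLs present in the combined sources but never reached by the crawl."""
    norm = lambda u: u.strip().rstrip("/")  # ignore trailing-slash differences
    return sorted({norm(u) for u in all_known} - {norm(u) for u in crawled})

# Illustrative data; real lists come from the sitemap, Search Console,
# Analytics, backlink tools and server logs vs. a full site crawl.
all_known = [
    "https://example.com/",
    "https://example.com/products",
    "https://example.com/old-campaign",
]
crawled = ["https://example.com/", "https://example.com/products"]

print(find_orphans(all_known, crawled))
# ['https://example.com/old-campaign']
```

The set difference does exactly what the Excel comparison would: anything in the combined list that the crawler never reached is a candidate orphan page.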
Although this is a relatively simple task we could carry out in an Excel spreadsheet, it’s quite laborious. Crawlers like Screaming Frog and Ryte can help us enormously here.
With Screaming Frog
Screaming Frog allows us to import the sitemap file to be crawled, in order to discover all the pages of our website. We only need to go to Configuration > Spider > Basic and scroll down this tab until we reach the “XML Sitemaps” block. Once we’re there, we will:
- Check “Crawl Linked XML Sitemaps”.
- Check “Crawl These Sitemaps”, and enter the URLs of our sitemap or sitemaps in the dialog below; together they should list all the URLs our website contains.
Moreover, Screaming Frog lets us import URLs from Google Search Console and Google Analytics. To do this, we go once again to the Configuration menu, this time down to API Access: Google Analytics and Google Search Console are the entries we’re interested in. We can connect both to Screaming Frog and, in both cases, select “Crawl New URLs Discovered” in the “General” tab. This way, Screaming Frog will crawl every URL it encounters in the data provided by these two tools.
Once everything is set up, we only have to crawl our website with the URLs from these three sources and, when it’s done, run a Crawl Analysis. After the analysis has finished, applying the “Orphan URLs” filter in the Sitemaps, Analytics and Search Console tabs will give us our website’s orphan URLs, so we’ll know which URLs were detected as orphaned through each data source.
With Ryte
Ryte also has an option that lets us import our XML sitemap. After comparing the crawl with this list of URLs, we will see our orphan pages in Website Success > Links > Pages without incoming links.
In this case, the tool doesn’t allow us to get URLs from other sources, but in most scenarios this will be enough to detect pages without links, as long as our XML sitemap is correctly implemented.
How to fix orphan pages
We will fix our pages without links in different ways, depending on their capacity to generate traffic.
- Irrelevant or thin content pages: if we detect unnecessary pages on our site, the recommendation is to deindex them, remove them from the CMS inventory, or simply unpublish them. In any case, they must return a 410 or 404 status when someone attempts to access their content.
- Relevant pages for our website: in this case, we want these pages to stop being orphans. We’ll have to incorporate them into our website’s information architecture and link them from all related sections and content. This will increase their potential to attract traffic.
- Halfway between the previous two scenarios are orphan pages that receive inbound links from other domains but have low-quality content: here we should consider improving their content and incorporating them into our information architecture. If we cannot improve them, our best bet is to 301-redirect these pages to the most similar content within our site, in order to make the most of their authority.
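As a rough summary of the decision logic in the three cases above, here is a small Python sketch; the function and its return strings are purely illustrative, not part of any tool:

```python
def action_for_orphan(quality_content, has_inbound_links):
    """Illustrative encoding of the three cases above."""
    if quality_content:
        # Relevant page: reintegrate it into the information architecture.
        return "link internally from related sections"
    if has_inbound_links:
        # Low quality but earning authority: improve it, or salvage the links.
        return "improve content, else 301-redirect to similar content"
    # Thin content with nothing to lose: remove it cleanly.
    return "remove and return 404/410"

print(action_for_orphan(quality_content=False, has_inbound_links=True))
```

Whatever form the decision takes, the key point is that every orphan URL ends up either linked, redirected, or returning an error status, never left in limbo.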
In summary, we want to incorporate into our information architecture those pages that are relevant for our website, by including the appropriate links pointing to them from other sections. If this is not possible, we will remove said content in the most search-engine-friendly way we can.
These are relatively simple on-page actions that will help us make the most of our website’s traffic acquisition potential.