Channel: DeepCrawl – State of Digital

10 Ways To Save Time and Identify Major Site Errors with DeepCrawl


I first met the DeepCrawl team at Brighton SEO, where they were promoting their relatively new web tool to conference attendees. I was lucky enough to be given a demonstration, and I have been using the tool ever since. I cannot cover every aspect of DeepCrawl in this post, but I would like to highlight the areas I found particularly useful.

1) Easy to Use

The DeepCrawl tool allows you to crawl a site easily. The first step is to click Add Project, as shown below.

Deep Crawl Project

DeepCrawl can crawl sites with up to 5 million URLs in total. The crawl is also customisable: users are able to select a crawl speed of between 1 and 20 URLs per second (this is in the advanced settings).

CrawlRate

It is also possible to select exactly when you crawl your own site. Many people choose to crawl overnight or at the weekend, as it is faster than during peak hours.

Setting up new project

The tool allows the user to include or exclude certain pages, for example by choosing whether or not to adhere to nofollow links.

In my case, I wanted DeepCrawl to crawl my entire site and also check the sitemaps and organic landing pages, so I ticked "Universal crawl". I clicked "Save" and moved to the next section, where the tool confirmed the analytics. It picked up the UA ID, and I then pressed "Save" to begin the crawl of the site.

2) Simplification

The tool simplifies what may seem complicated to many, especially those not familiar with the technical aspects of SEO. The overview report highlights many aspects of the site that need to be addressed from a top level.

Overview of Deep Crawl

The overview report also shows the number of crawls you have processed on your site. It shows the number of unique pages and the crawl depth, which is especially important for large sites.

3) Identify Indexation

The report clearly shows the number of URLs that have been crawled, as well as the number of unique and duplicate pages on a site. What I like about the tool is that it also shows the canonicalized pages and the no-followed pages. I find it particularly useful that the tool clearly surfaces the site's errors.

Indexation

One element that is a clear USP is that DeepCrawl highlights the changes from one crawl to the next, which makes it very easy to see what has changed and what has not. In the example above, the cells in green are from one crawl and the cells in red are from the second crawl of the site.

Once users have run more than one crawl, they can see the trend at the bottom of the dashboard on the Overview tab.

webcrawl depth

4) Identify Content

The DeepCrawl tool clearly shows the meta titles and descriptions on the content tab. It also tells the user if the metadata on the site is over the recommended length. The content tab also shows duplicate body content, as well as missing H1 tags and multiple H1s on a page. The report also identifies whether there are valid Twitter Cards and Open Graph tags; the latter is something I have not seen before in a crawling tool.

Content Overview

5) Clearly see internal broken links

My site was hacked twice last year. Since then, I have done a lot of work to try and resolve this. I had to do a complete reinstallation of the site, which meant many URLs went from sitename/date/post-name to sitename/uncategorized/post-name.

Internal Broken Links

I knew I had internal broken links and have therefore been going through my site slowly to resolve them. This tool has helped to identify the internal broken links, which I will be addressing as I go through my posts. All internal links, external links and redirected links are highlighted on the validation page.

6) Assigning tasks to others

This is the aspect of the tool I really like, and it is crucial for project management. The tech team may identify several areas of the site that should be amended. However, due to limited resources (time and money), rectifying these errors may not always be possible straight away. It is therefore best to identify the tasks that can be actioned, give them a realistic date, and assign them to the dedicated personnel. The issues can then be seen in the projects section. It is also possible to export the tasks and discuss them with your clients.

Task Reporting

7) Page Level Detail

I found a few duplicate page titles on my blog; this was mostly due to a pagination issue with the site (e.g. page-2/page-3). With larger ecommerce sites, the page level detail is a useful aspect of the tool, as it is easy to see a page's errors at a detailed level.

Below is a screengrab of the page level detail. DeepRank is scored out of 10: it is a measure of authority based on the number of links in and out of a page, as well as the number of clicks from the home page. When you combine that with GA data such as site visits, you get an even better idea of which pages to prioritise fixing, because they have a lot of authority from search engines and are heavily accessed by your users.

A DeepRank score of 10 is the highest, and this page scores 3 out of 10. The tick marks show the page is indexable and that it is a unique page.

Duplicate Page Titles - Detail
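
As a rough illustration of combining DeepRank with analytics data, here is a minimal sketch in Python. It assumes you have exported a page list to a CSV with hypothetical "url", "deeprank" and "sessions" columns; it simply surfaces the pages where high authority and high traffic coincide, so you know what to fix first.

import csv

def prioritise(report_path):
    """Rank exported pages by DeepRank multiplied by organic sessions."""
    with open(report_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    for row in rows:
        # Hypothetical columns: "deeprank" (0-10) and "sessions" from Google Analytics.
        row["priority"] = float(row["deeprank"]) * float(row["sessions"])
    return sorted(rows, key=lambda r: r["priority"], reverse=True)

# Hypothetical export file name; the top of this list is where to start fixing.
for page in prioritise("duplicate_title_pages.csv")[:20]:
    print(round(page["priority"]), page["url"])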

 

8) Schedule reports

The ability to schedule reports is very useful, especially if you have a busy work calendar and might forget without reminders. The report will be emailed to you once it is complete. It is important to have regular reports so you can monitor the progress and changes made to the site: once you see a decrease in duplicate page titles, or in any other issue flagged at the beginning of the project, you can demonstrate the progress made. This is particularly important if your client is asking to see the ROI of SEO.

9) Integration with analytics

When setting up a project, it is possible to integrate the tool with your own analytics at the click of a button. This means no more exporting your own data and trying to match it manually with crawler data such as broken pages. This makes our job in SEO that much easier.

How does DeepCrawl do this?

It crawls the website’s architecture level by level from the root domain. Then it compares the pages discovered in the architecture to the URLs you are submitting in your sitemaps. Finally, DeepCrawl compares all of these URLs to the Organic Landing Pages in your Google Analytics account.

This is another great USP of DeepCrawl as this feature allows users to find some of the gaps in their site such as:

  1. Sitemaps URLs which aren’t linked internally
  2. Linked URLs which aren’t in your Sitemaps
  3. URLs which generate entry visits but aren’t linked, sometimes referred to as orphaned pages or ghost URLs
  4. Linked URLs or URLs in Sitemaps which don’t generate traffic – perhaps they can be disallowed or deleted

By integrating DeepCrawl with your analytics, it can give you an indication as to how important the pages are based on site visits, bounce rate, time on site etc. and therefore which you should probably fix first to have maximum impact.
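
If you ever need to reproduce this gap analysis outside the tool, a minimal sketch is below. It assumes you have already exported three plain-text URL lists (the file names are hypothetical): one from your XML sitemaps, one from a crawl that follows internal links, and one of organic landing pages from your analytics package.

def load_urls(path):
    """Read one URL per line, ignoring blank lines."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

sitemap_urls = load_urls("sitemap_urls.txt")      # URLs submitted in XML sitemaps
linked_urls = load_urls("crawled_urls.txt")       # URLs discovered via internal links
landing_urls = load_urls("ga_landing_pages.txt")  # organic landing pages from analytics

print("In sitemaps but not linked internally:", len(sitemap_urls - linked_urls))
print("Linked but missing from sitemaps:", len(linked_urls - sitemap_urls))
print("Orphaned - traffic but never linked or listed:",
      len(landing_urls - (linked_urls | sitemap_urls)))
print("Linked or listed but generating no traffic:",
      len((linked_urls | sitemap_urls) - landing_urls))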

DC Google Analytics

10) Crawls Before the Site is Live

You may have had a client site where you wanted to run a crawl before it went live, but could not because it sat behind a secure wall. Fortunately, DeepCrawl can help: it allows you to crawl the site behind the secure wall and will run a report highlighting any errors. This is particularly useful because you can compare the test site to the live site to see what the differences are, and whether there is anywhere you might lose traffic if you pushed the site live as it is. This means you can identify any errors before the site goes live, saving you hours of time and making you look good in front of the client – another bonus!

Conclusion

DeepCrawl is a great tool. It clearly shows any issues with your site and makes the complicated and "techie" aspects of the site easy for anyone to understand. If you have difficulty explaining the more technical aspects of a site to the rest of your team, this tool will save you time and simplify the issues, making them easy for your colleagues to understand. At £50 a month to crawl 100,000 URLs, it is certainly a good deal.

Post from Jo Turnbull


New Year Resolution: Crawl Quarterly


In the new year, if you are in charge of clients, are in-house, or just in charge of a site of some kind, I implore you to set a new year’s resolution to crawl your site regularly. Most of the issues that I see with clients from a technical perspective can be solved with a regular crawl and time to fix the small issues that arise.

Call this a new mission, but I am determined to get everyone to crawl their sites regularly. I recently spoke on this topic at the State of Search conference in Dallas, TX. Many of the people attending my session knew everything in my presentation, but there are more people out there who find crawling confusing or daunting, and it's those people I want to reach. Crawling is no longer just for "technical" people – there are so many tools that have been developed to help make crawling automated, fun, and actionable. It's just a matter of finding the right tool, making it a regular action item, and focusing on the right things.

This is the shortened version of my talk … the action items if you will …

Finding the Right Crawler

There are a number of crawlers on the market, and there is one for you or your client. I created a graphic that shows the right crawling tool for your specific situation. The graphic reflects my professional opinion, but if you want more information, check out my full deck below. I suggest you test all of the tools to find the right one for you.

Crawler Scale by Kate Morris and Outspoken Media

I know I am missing some of the crawlers on the market. The nice people at Site Condor came by after my presentation to say hi and gave me a look at their crawler a few weeks later. This space is moving constantly so don’t get married to one tool. Keep your mind and eyes open – and share any thoughts you have for new features with your favorite tool. I know each company welcomes ideas, so share away!

Making a Regular Schedule

You might be thinking that it’s as simple as a recurring calendar invite, but to reap the full rewards of a regular crawl schedule, there are a few things to keep in mind.

Intervals

If nothing changes about your site, a crawl should be performed quarterly. Start in January and then do one at the top of every quarter. But things do change, and it's recommended that you crawl the site before and after each major site change and define action items from the post-change crawl.

Withstanding Staff Changes

If you just set a calendar item for your own calendar, or have someone else do it, it will get lost over time with staff changes. Talk to company (internal or client) development teams to add this into their standing processes. Also ensure that the marketing team adds it to their process for any site revisions and marketing reviews.

Resources

The final thing to consider ties in with building crawls into standing processes so they withstand staff changes: make sure you have the resources to fix any issues the crawl uncovers. This means development or marketing time each quarter and after every major site change. Your action items can be anything from changes to robots.txt to getting a copywriter to fill in missing meta descriptions.

Focusing on the Right Metrics

Doing a site crawl can be daunting given the amount of information you get back. We use a crawler every time we do an SEO audit at Outspoken, but there are a few facets that are key to any audit/crawl.

  • Errors – Naturally, the server (5XX) and Not Found errors (4XX) are going to be of the highest priority. They are issues for users and search crawlers.
  • Redirects – Redirects should never happen within your site. They are going to accumulate over time, but should be fixed as soon as possible. It's link equity you can control, so try to keep any of it from passing through a redirect.
  • Duplicated Titles – These are the first warning sign of duplicate content and canonical issues. Check out titles for any that are missing, too short or long, or duplicated.
  • XML Sitemaps – Finally, use a crawler to check your XML sitemap to ensure that you are only giving search crawlers the best URLs to crawl.

Below is my presentation in full for those that are interested in all the tools reviewed, the metrics to watch, the recommended timeline, and other fun uses of crawlers.

Post from Kate Morris

The First Pillar of SEO: Technology


In my previous post for State of Digital I wrote about my ‘Three Pillars’ approach to SEO: Technology, Relevance, and Authority. Together these three pillars create a holistic view of SEO that should take all aspects of a website into account. Additionally, the three pillars map to the three main processes in web search engines: crawling, indexing, and ranking.

I want to elaborate further on each of the three pillars, starting with the first: technology.

The technological aspect of SEO is something many practitioners are, by their own admission, only passingly familiar with. It’s also one of the aspects of SEO that intrudes in the domain of web developers and server administrators, which means that for many marketing-focused SEOs it’s not something they can easily get their hands dirty with.

Yet it can be a tremendously important aspect of good SEO, especially for large-scale complicated websites. Whilst the average WordPress site won’t need a lot of technical SEO fixes applied to it (hopefully), large news publishers and enterprise-level ecommerce platforms are a different story altogether.

Why this is the case is something that becomes evident when you understand the purpose of technical SEO which, in my model, is crawl efficiency. For me the technology pillar of SEO is about making sure search engines can crawl your content as easily as possible, and crawl only the right content.

Crawl Efficiency

When the technological foundations of a website are suboptimal, the most common way this affects the site’s SEO is by causing inefficiencies in crawling. This is why good technical SEO is so fundamental: before a search engine can rank your content, it first needs to have crawled it.

A site's underlying technology impacts, among many other things, the way pages are generated, the HTTP status codes that it serves, and the code it sends across the web to the crawler. These all influence how a web crawler engages with your website. Don't assume that your site does these things correctly out of the box; many web developers know the ins and outs of their trade very well and know exactly what goes into building a great user-focused website, but can be oblivious to how their site is served to web crawlers.

When it comes to technical SEO, the adage “focus on your users and SEO will take care of itself” is proven entirely erroneous. A website can be perfectly optimised for a great user experience, but the technology that powers it can make it impossible for search engines to come to grips with the site.

In my SEO audit checklists, there are over 35 distinct aspects of technical SEO I look for. Below I summarise three of the most important ones, and show how they lead to further investigations on a whole range of related technical issues.

Crawl Errors

When analysing a new website, the first place many SEOs will look (myself included) is the Crawl Errors report in Google Webmaster Tools. It still baffles me how often this report is neglected, as it provides such a wealth of data for SEOs to work with.

When something goes wrong with the crawling of your website, Google will tell you in the Crawl Errors report. This is first-line information straight from the horse’s mouth, so it’s something you’ll want to pay attention to. But the fact this data is automatically generated from Google’s toolset is also the reason we’ll want to analyse it in detail, and not just take it at face value. We need to interpret what it means for the website in question, so we can propose the most workable solution.

Google Webmaster Tools Crawl Errors report

In the screenshot above we see more than 39,000 Not Found errors on a single website. This may look alarming at first glance, but we need to place that in the right context.

You’ll want to know how many pages the website actually has that you want Google to crawl and index. Many SEOs first look at the XML sitemap as a key indicator of how many indexable pages the site has:

Google Webmaster Tools Sitemaps report

It's evident that we're dealing with a pretty substantial website, and the 39k Not Found errors now seem a little less apocalyptic amidst a total of over 300k pages. Still, at over 11% of the site's total pages, the 39,000 Not Found errors present a significant level of crawl inefficiency. Google will spend too much time crawling URLs that simply don't exist.

But what about URLs that are not in the sitemap and which are discovered through regular web crawls? Never assume the sitemap is an exhaustive list of URLs on a site – I’ve yet to find an automatically generated XML sitemap that is 100% accurate and reliable.

So let’s look further and see how many pages on this site Google has actually indexed:

Google Webmaster Tools Index Status report

The plot thickens. We have 39k Not Found errors emerging from 329k URLs in the XML sitemap and the regular web crawl, which in turn has resulted in over 570k URLs in Google’s index. But this too doesn’t yet paint the entire picture: the back-end CMS that runs this website reports over 800k unique pages for Google to crawl and index.

So by analysing one single issue – crawl errors – we’ve ended up with four crucial data points: 39k Not Found errors, 329k URLs in the XML sitemap, 570k indexed URLs, and 800K unique indexable pages. The latter three will each result in additional issues being discovered, which leads me to the next aspect to investigate: the XML sitemap.

But before we move on, we need to recommend a fix for the Not Found errors. We’ll want to get the full list of crawlable URLs that result in a 404 Not Found error, which in this case Google Webmaster Tools cannot provide; you can only download the first 1000 URLs.

This is where SEO crawlers like Screaming Frog and DeepCrawl come in. Run a crawl on the site with your preferred tool and extract the list of discovered 404 Not Found URLs. For extra bonus points, run that list through a link analysis tool like Majestic to find the 404 errors that have inbound links, and prioritise these for fixing.
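
As a rough sketch of that last step, assume the crawler's 404 list has been exported to a text file and the link data to a CSV (file names and column headings here are hypothetical, not any tool's actual export format):

import csv

# One 404 URL per line, exported from your crawler.
with open("not_found_urls.txt", encoding="utf-8") as f:
    not_found = [line.strip() for line in f if line.strip()]

# CSV with "url" and "referring_domains" columns from a link analysis tool.
with open("inbound_links.csv", newline="", encoding="utf-8") as f:
    link_counts = {row["url"]: int(row["referring_domains"]) for row in csv.DictReader(f)}

# 404 URLs that still attract inbound links should be redirected first.
for url in sorted(not_found, key=lambda u: link_counts.get(u, 0), reverse=True):
    if link_counts.get(url, 0):
        print(link_counts[url], url)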

XML Sitemaps

No matter how well a website is structured and how easy it is to navigate to any page, I never assume the site doesn’t need an XML sitemap. Some SEO ranking correlation studies show a positive correlation between the presence of an XML sitemap and higher rankings, but this is likely not a direct causal effect; the presence of an (error-free) XML sitemap is a sign of a website that has been subjected to proper SEO efforts, where the sitemap is just one of many things the optimisers have addressed.

Nonetheless I always recommend having an error-free XML sitemap, because we know search engines use it to seed their crawlers. Including a URL in your XML sitemap doesn’t guarantee it’ll be indexed, but it certainly increases its chances, and it ensures that the bulk of your site’s crawl budget is used on the right pages.

Again, Google Webmaster Tools is the first place to start, specifically the Sitemaps report:

Google Webmaster Tools Sitemap Errors report

Here we see that every single sitemap submitted by this site has one or more errors. As this is an old website that has gone through many different iterations and upgrades, this is not surprising. Still, when we see a sitemap with 288,000 warnings, it seems obvious there's a major issue at hand.

Fortunately Google Webmaster Tools provides more details about what errors exactly it finds in each of these sitemaps:

Google Webmaster Tools Sitemap Errors report detail

There are several issues with this sitemap, but the most important one is that it has thousands upon thousands of URLs that are blocked by robots.txt, preventing Google from crawling them.

Now because we have a number of earlier established data points, namely that out of 800k unique pages only 570k are actually in Google’s index, this number of 288k blocked URLs makes sense. It’s obvious that there is a bit of excessive robots.txt blocking going on that prevents Google from crawling and indexing the entire site.

We can then identify which robots.txt rule is the culprit. We take one of the example URLs provided in the sitemap errors report, and put that in the robots.txt tester in Webmaster Tools:

Google Webmaster Tools Robots.txt Tester

Instantly it’s obvious what the problem with the XML sitemap is: it includes URLs that belong to the separate iPad-optimised version of the site, which are not meant for Google’s web crawlers but that instead are intended for the website’s companion iPad app.

And by using the robots.txt tester we’re now also aware that the robots.txt file itself has issues: there are 18 errors reported in Webmaster Tools, which we’ll need to investigate further to see how that impacts on site crawling and indexing.

Load Speed

While discussing XML sitemaps above, I referenced 'crawl budget'. This is the concept that Google will only spend a certain amount of time crawling your website before it terminates the process and moves on to a different site.

It’s a perfectly logical idea, which is why I believe that it still applies today. After all, Google doesn’t want to waste endless CPU cycles on crawling infinite URL loops on poorly designed websites, so it makes sense to assign a time period to a web crawl before it expires.

Moreover, beyond the intuitive sensibility of crawl budgets, we see that when we optimise the ease with which a site can be crawled, the performance of that site in search results tends to improve. This all comes back to crawl efficiency; optimising how web crawlers interact with your website to ensure the right content is crawled and no time is wasted on the wrong URLs.

As crawl budget is a time-based metric, that means a site’s load speed is a factor. The faster a page can be loaded, the more pages Google can crawl before the crawl budget expires and the crawler process ends.

And, as we know, load speed is massively important for usability as well, so you tick off multiple boxes by addressing one technical SEO issue.

As before, we'll want to start with Webmaster Tools, specifically the Crawl Stats report:

Google Webmaster Tools Crawl Stats report

At first glance you’d think that these three graphs will give you all you need to know about the crawl budget Google has set aside for your website. You know how many pages Google crawls a day, how long each page takes to load, and how many kilobytes cross the ether for Google to load all these pages. A few back-of-a-napkin calculations will tell you that, using the reported average numbers, the average page is 25 kilobytes in size.
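
The napkin maths is simply the reported kilobytes downloaded per day divided by the pages crawled per day; the figures below are invented to match the 25KB average discussed here:

pages_crawled_per_day = 5_000  # hypothetical Crawl Stats daily average
kilobytes_per_day = 125_000    # hypothetical Crawl Stats daily average

print(kilobytes_per_day / pages_crawled_per_day, "KB per page")  # 25.0 KB per page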

But 1.5 seconds load time for 25KB seems a bit sluggish, and even a cursory glance at the website will reveal that 25KB is a grossly inaccurate number. So it seems that at last we’ve exhausted the usefulness of Google Webmaster Tools, at least when it comes to load speed issues.

We can turn to Google Analytics next and see what that says about the site’s load speed:

Google Analytics Site Speed report

There we go; a much better view of the average load speed, based on a statistically significant sample size as well (over 44k pages). The average load speed is a whopping 24 seconds. Definitely something worth addressing; not just for search engines, but also for users that have to wait nearly half a minute for a page to finish loading. That might be perfectly fine in the old days of 2400 baud modems, but it’s simply unacceptable these days.

But this report doesn’t tell us what the actual problems are. We only know that the average load speed is too slow. We need to provide actionable recommendations, so we need to dig a bit deeper.

There are many different load speed measurement tools that all do a decent job, but my favourite is without a doubt WebPageTest.org. It allows you to select a nearby geographic location and a browser version, so that you get an accurate reflection of how your users' browsers will load your website.

What I like most about WebPageTest.org is the visual waterfall view it provides, showing you exactly where the bottlenecks are:

WebPageTest.org Waterfall report

The screenshot above is just the first slice of the total waterfall view, and immediately we can see a number of potential issues. There are a large number of JS and CSS files being downloaded, and some take over a full second to load. The rest of the waterfall view makes for equally grim reading, with load times on dozens of JS and image files well over the one second mark.

Then there are a number of external plugins and advertising platforms being loaded, which further extend the load speed, until the page finally completes after 21 seconds. This is not far off the 24 second average reported in Google Analytics.

It’s blatantly obvious something needs to be done to fix this, and the waterfall view will give you a number of clear recommendations to make, such as minimising JS and CSS and using smaller images. As a secondary data set you can use Google’s PageSpeed Insights for further recommendations:

Google PageSpeed Insights report

When it comes to communicating my load speed recommendations to the client I definitely prefer WebPageTest.org, as the waterfall view is such a great way to visualise the loading process and identify pain points.

Down The Rabbit Hole

At this stage we’ve only ticked off three of the 35 technical SEO aspects on my audit checklist, but already we’ve identified a number of additional issues relating to index levels and robots.txt blocking. As we go on through the audit checklist we’ll find more and more issues to address, each of which will need to be analysed in detail so that we can provide the most effective recommendation that will address the problem.

In the end, technical SEO for me boils down to making sure your site can be crawled as efficiently as possible. Once you’re confident of that, you can move on to the next stage: what search engines do with the pages they’ve crawled.

The indexing process is what allows search engines to make sense of what they find, and that’s where my Relevance pillar comes in, which will be the topic of my next article.

Post from Barry Adams

Find and Fix Common Crawl Optimisation Issues


When I analyse websites for technical SEO issues, the biggest factor for me is always crawl optimisation – i.e. ensuring that when a search engine like Google crawls the site, only the right pages are crawled and Googlebot doesn’t waste much time on crawling pages that won’t end up in the index anyway.

If a site has too many crawlable pages that aren’t being indexed, you are wasting Google’s crawl budget. Crawl budget, for those who don’t know, is the amount of pages Google will crawl on your site – or, as some believe (myself included) the set amount of time Google will spend trying to crawl your site – before it gives up and goes away.

So if your site has a lot of crawl waste, there is a strong likelihood that not all pages on your site will be crawled by Google. And that means that when you change pages or add new pages to your site, Google might not be able to find them any time soon. The negative repercussions for your SEO efforts should be evident.

How do you find out if your site has crawl optimisation issues? Google’s Search Console won’t tell you much, but fortunately there are tools out there that can help. My preferred tool to identify crawl optimisation problems is DeepCrawl. With a DeepCrawl scan of your site, you can very quickly see if there are crawl efficiency issues:

DeepCrawl crawl report

The screenshot above is from DeepCrawl’s main report page for a site crawl. As is evident here, this site has a huge crawl optimisation issue: out of nearly 200k pages crawled, over 150k are not indexable for various reasons. But they’re still crawlable. And that means Google will waste an awful lot of time crawling URLs on this site that will never end up in its index – a total waste of Google’s time, and therefore a dangerous issue to have on your website.

Optimising crawl budget is especially important on larger websites where more intricate technical SEO elements can come into play, such as pagination, sorted lists, URL parameters, etc.

Today I’ll discuss a few common crawl optimisation issues, and show you how to handle them effectively in ways that hopefully won’t cause your web developers a lot of hassle.

Accurate XML Sitemaps

One of the things I like to do when analysing a site is take the site’s XML sitemap and run it through Screaming Frog. While the Search Console report on a sitemap can give you good information, nothing is quite as informative as actually crawling the sitemap with a tool and seeing what happens.

Recently when analysing a website, the Search Console report showed that only a small percentage of the submitted URLs were actually included in Google’s index. I wanted to find out why, so I downloaded the XML sitemap and ran a Screaming Frog crawl. This was the result:

XML Sitemap with 301 redirects

As it turns out, over 90% of the URLs in the XML sitemap resulted in a 301 redirect. With several thousand URLs in the sitemap, this presented quite a waste of crawl budget. Google will take the URLs from the sitemap to seed its crawlers with, which will then have to do double the work – retrieve the original URL, receive a 301-redirect HTTP header, and then retrieve the redirect’s destination URL. This times several thousand URLs, and the waste should be obvious.

Upon looking at the redirected URLs in Screaming Frog, the root issue was clear very quickly: the sitemap contained URLs without a trailing slash, and the website was configured to redirect these to URLs with the trailing slash.

So this URL http://www.example.com/category/product then redirected to this URL: http://www.example.com/category/product/. The fix is simple: ensure that the XML sitemap contains only URLs with the trailing slash.

The key lesson here is to make sure that your XML sitemaps contain only final destination URLs, and that there’s no waste of crawl budget with redirects or non-indexable pages in your sitemap.
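
If you want a quick spot-check without running a full crawl, a short script can fetch the sitemap and flag every URL that doesn't answer with a 200 directly. This is a minimal sketch that assumes a single standard XML sitemap and the Python requests library:

import xml.etree.ElementTree as ET
import requests

SITEMAP_URL = "http://www.example.com/sitemap.xml"  # hypothetical sitemap location
NAMESPACE = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

sitemap = ET.fromstring(requests.get(SITEMAP_URL, timeout=30).content)
for loc in sitemap.findall(".//sm:loc", NAMESPACE):
    url = loc.text.strip()
    # allow_redirects=False so we see the 301/302 itself, not its destination.
    status = requests.head(url, allow_redirects=False, timeout=30).status_code
    if status != 200:
        print(status, url)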

Paginated & Sorted Lists

A common issue on many ecommerce websites, as well as news publishers that have a great amount of content, is paginated listings. As users of the web, this is something we have become almost desensitised to: endless lists of products or articles which, in the end, don’t make finding what you’re looking for any easier; we end up using the site’s internal search function more often than not.

For SEO, paginated listings can cause an array of problems, especially when you combine them with different ways to sort the lists. For example, take an ecommerce website that in one of its main categories has 22 pages worth of products.

22 Pages

Now, this large list of products can be sorted in various different ways: by price, by size, by colour, by material, and by name. That gives us five ways to sort 22 pages of products. Each of these sortings generates a different set of 22 pages of content, each with their own slightly different URL.

Then add in the complication of additive filters – so-called faceted navigation, a very common feature on many ecommerce sites. Each of these will generate anywhere from one to 22 additional URLs, as each filtered list can also be sorted in five different ways. A handful of filters makes the number of crawlable pages grow exponentially; you can see how one product category can easily result in millions of URLs.
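
To make that concrete, here is a rough upper-bound calculation using the numbers from this example plus a hypothetical set of five filters with ten values each:

pages_per_listing = 22   # paginated pages in the example category
sort_orders = 5          # price, size, colour, material, name
facets = 5               # hypothetical number of filter types
values_per_facet = 10    # hypothetical values per filter

# Each facet is either unset or set to one of its values, and facets combine freely.
filter_combinations = (values_per_facet + 1) ** facets
crawlable_urls = filter_combinations * sort_orders * pages_per_listing
print(f"{crawlable_urls:,} potentially crawlable URLs")  # 17,715,610

In practice many filtered lists will run to fewer than 22 pages, but even so the combinations quickly climb into the millions.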

Obviously we’ll want to minimise the number of pages Google has to crawl to find all the products on the site. There are several approaches to do this effectively:

Increase the default number of products/articles per page. Few things grind my gears as much as a paginated list with only 10 products on a page. Put more products on a single page! Scrolling is easy – clicking on a new page is harder. Less clicking, more scrolling. Don’t be afraid to put 100 products on a single page.

Block the different sorted pages in robots.txt. In most cases, a different way to sort a list of products is expressed through a parameter in the URL, like ‘?order=price‘ or something like that. Prevent Google from ever crawling these URLs by blocking the unique parameter in robots.txt. A simple disallow rule will prevent millions of potential pages from ever being crawled:

User-agent: *
Disallow: /*order=price*

This way you can block all the unique parameters associated with specific ways to sort a list, thereby massively reducing the number of potentially crawlable pages in one fell swoop. Just be careful you don’t inadvertently block the wrong pages from being crawled – use Google’s robots.txt tester in the Search Console to double-check that the regular category pages, as well as your product pages, are not blocked from being crawled.
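
Before deploying a rule like this, you can also sanity-check it locally. Google's robots.txt tester remains the authoritative check (Python's built-in robots.txt parser doesn't handle wildcard rules), but a rough approximation of the wildcard matching logic only takes a few lines:

from fnmatch import fnmatch
from urllib.parse import urlsplit

DISALLOW_PATTERNS = ["/*order=price*"]  # the rule proposed above

def blocked(url):
    """Rough approximation of wildcard Disallow matching (prefix plus '*' wildcards)."""
    parts = urlsplit(url)
    target = parts.path + ("?" + parts.query if parts.query else "")
    return any(fnmatch(target, pattern + "*") for pattern in DISALLOW_PATTERNS)

print(blocked("http://www.example.com/category/?order=price"))  # True: sorted list is blocked
print(blocked("http://www.example.com/category/"))              # False: category page stays crawlable
print(blocked("http://www.example.com/category/product/"))      # False: product page stays crawlable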

Excessive Canonicals

Ever since the advent of the rel=canonical meta tag, SEOs have used it enthusiastically to ensure that the right pages are included in Google’s index. Personally I like the canonical tag too, as it can solve many different issues and prevent other problems from arising.

But there’s a downside to using rel=canonicals: they’re almost too easy. Because implementing a rel=canonical tag can be such a blunt instrument, many SEOs use it without realising the true repercussions. It’s like using a hammer for all your DIY, when sometimes you should actually use a screwdriver instead.

Take the pagination issue I described above. Many SEOs would not consider using a robots.txt block or increasing the number of items per page. Instead they’d just canonicalise these paginated listings back to a main category page and consider the problem solved.

And from a pure index issue, it is solved; the canonical tag will ensure these millions of paginated pages will not appear in Google’s index. But the underlying issue – massive waste of crawl budget – is entirely unaffected. Google still has to crawl these millions of pages, only then to be told by the rel=canonical tag that no, actually, you don’t need to index this page at all, see the canonical page instead, thanks for visiting, kthxbye.

DeepCrawl non-indexable pages

Before implementing a rel=canonical tag, you have to ask yourself whether it actually addresses the underlying issue, or whether it’s a slap-dash fix that serves as a mere cosmetic cover-up for the problem. Canonical tags only work if Google indexes the page and sees the rel=canonical there, and that means it’ll never address crawl optimisation issues. Canonical tags are for index issues, not crawl issues.

In my Three Pillars of SEO approach, the first Technology pillar aligns with Google’s crawl process, and the second Relevance pillar aligns with the search engine’s indexer process. For me, canonical tags help solve relevance issues, by ensuring identical content on different URLs does not compete with itself. Canonical tags are never a solution for crawl issues.

Crawl issues are addressed by ensuring Google has less work to do, whereas canonicals generate more work for Google; to properly react to a canonical tag, Google has to crawl the duplicate URL as well as the original URL.

The same goes for the noindex meta tag – Google has to see it, i.e. crawl it, before it can act on it. It is therefore never a fix for crawl efficiency issues.

In my view, crawl issues are only truly solved by ensuring Google requires less effort to crawl your website. This is accomplished by an optimised site architecture, effective robots.txt blocking, and minimal wastage from additional crawl sources like XML sitemaps.

Just The Start

The three issues above are by no means the only technical SEO elements that impact on crawl efficiency – there are many more relevant aspects, such as load speed, page weight, server responses, etc. If you’re attending Pubcon this year, be sure to catch the SEO Tech Masters session on Tuesday where I’ll be speaking about crawl optimisation alongside Dave Rohrer and Michael Gray.

I hope that was useful, and if you’ve any comments or questions about crawl optimisation, please do leave a comment or catch me on Twitter: @badams.

Post from Barry Adams

40 DeepCrawl tweaks to make a website soar in Google Search


How to understand and optimize your site’s search signals

Optimizing a website for your target audience and search engines requires gathering the right data and understanding its significance. That can be a challenge for large websites, but there is an enterprise-level solution: DeepCrawl. With DeepCrawl, sites can be crawled in a similar way to search engine bots. More importantly, sites can be seen and understood from a search engine's point of view. Lastly, little effort is required to optimize search signals and make a site visible in Google's organic search. Here are 40 small steps to get there!

1. Find Duplicate Content


Duplicate content is an issue for search engines and users alike. Users don’t appreciate repeated content that adds no additional value, while search engines can be confused by duplicate content, and fail to rank the original source as intended. You can help your trust perception on both fronts by finding duplicate titles, descriptions, body content, and URLs with DeepCrawl. Click through the post-audit content tab in your crawl report to see all web pages with doppelganger-style content.

DeepCrawl scans your project domains for duplicate content as part of its default practices during web and universal crawls. Adjust sensitivity to duplicated content in the report settings tab of your project’s main dashboard before you launch your next crawl. Easy!

2. Identify Duplicate Pages

Duplicate pages

Duplication is a very common issue, and one that can be the decisive factor when it comes to authority. Using DeepCrawl, you can view a full list of duplicate pages to get an accurate picture of the problem. This feature helps you consolidate your pages so your site funnels visitors to the right information, giving your domain a better chance at conversion and reducing the number of pages competing for the same search rankings.

How To Do It:

  1. Click “Add Project” from your main dashboard.
  2. Pick the “web crawl” type to tell DeepCrawl to scan only your website at all its levels.
  3. Review your site’s duplicate pages from the “issues” tab located in the left panel of your main reporting dashboard once the crawl finishes.

3. Optimize Meta Descriptions

Meta descriptions

Meta descriptions can greatly influence click-through rates on your site, leading to more traffic and conversions. Duplicate and inconsistent descriptions can negatively impact user experience, which is why you want to prioritize fixes in this area. Through the content report tab, DeepCrawl gives you an accurate count of duplicate, missing and short meta descriptions. This lets you identify problem areas, and turn them into positive search signals that work for, rather than against, your site.      

4. Optimize Image Tags

Image alt tags

Google Image Search is a huge opportunity to claim SERP real estate. Missing image alt tags are organic traffic opportunities lost. Using DeepCrawl’s custom extraction tool, you and your team can set crawls to target images and audit their corresponding alt tags. You can access this feature through the Advanced Settings tab in your project’s main dashboard.  

How To Do It:

  1. Create custom extraction rules using Regular Expressions syntax.
  2. Hint: Try /(<img(?!.*?alt=(['"]).*?\2)[^>]*)(>)/ to catch images that have alt tag errors or don't have alt tags altogether (a local spot-check sketch follows this list).
  3. Paste your code into “Extraction Regex” from the Advanced Settings tab in your projects dashboard.
  4. Check your reports from your projects dashboard when the crawl completes. DeepCrawl gives two reports when using this setting: URLs that followed at least one rule from your entered syntax and URLs that returned no rule matches.
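
The spot-check referenced in the hint above could look like the following sketch; it uses a slightly tightened variant of the pattern, with the lookahead limited to the tag's own attributes:

import re

# Flags <img> tags whose attributes contain no alt="..." value.
IMG_WITHOUT_ALT = re.compile(r'<img(?![^>]*\balt=([\'"]).*?\1)[^>]*>', re.IGNORECASE)

sample_html = """
<img src="/logo.png" alt="Company logo">
<img src="/banner.jpg">
<img src='/hero.jpg' alt='Hero image'>
"""

for match in IMG_WITHOUT_ALT.finditer(sample_html):
    print("Missing alt:", match.group(0))  # flags only the banner image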

5. Crawl Your Site like Googlebot

Crawl urls

Crawl your website just like search engine bots do. Getting a comprehensive report of every URL on your website is a mandatory component of regular maintenance. With DeepCrawl, you can dig into your site without slowing its performance. Because the crawler is cloud-based, there’s minimal impact on your website while the crawl is in progress. Choose “universal crawl” to also crawl your most important conversion pages and XML sitemaps with a single click.

6. Discover Potentially Non-indexable Pages

Non-indexable Pages

Improve user experience on your site by identifying non-indexable pages that are either wrongly canonicalized – and therefore potentially missed opportunities – or wasting precious crawl budget. In DeepCrawl, from the content tab of your completed crawl, you can review every non-indexed page on your entire website by the type of no-indexing (e.g. nofollowed, canonicalized and noindex).

7. Compare Previous Crawls

Track changes

As part of its native feature set, DeepCrawl compares page level, indexation, http status and non-index changes from your previous crawls to create historical data for your organisation. This view helps you to identify areas that need immediate attention, including server response errors and page indexation issues, as well as parts that show steady improvement.     

In addition, get a visual display of how your site’s web properties are improving from crawl to crawl with this project-level view. You can get info for up to 20 concurrent projects right from the monitoring tab in your main project dashboard. DeepCrawl caches your site data permanently, meaning you can compare how your website progressed from crawl to crawl.

Compare websites

How To Do It:

  1. Download crawl data from a finished report in .xls or .pdf format.
  2. Add client logo or business information to the report.
  3. Serve data to your client that’s formatted to look as though it came directly from your shop.

8. Run Multiple Crawls at Once

Multiple Crawls

Resource-draining site audits are a thing of the past. Thanks to cloud-based servers, you can run up to 20 active crawls spanning millions of URLs at once and still use your computer to do whatever other tasks you need it for… And best of all, all the data is backed up in the cloud.      

9. Avoid Panda, Manage Thin Content

Manage Thin Content

Search engines have entire algorithm updates focused on identifying thin content pages. Track down thin content on your website using DeepCrawl's content audit tab. Clicking the "min content size" section gives you every URL whose page content is below three kilobytes. This filter gives your team a list of URLs as a starting point for further investigation. Excluding lean content pages from being indexed, or enriching their content, can help improve the website from both a user experience and a search engine optimization point of view.

How To Do It:   

  1. From the report settings tab, adjust the minimum content size by kilobytes.
  2. If you do not change the setting, DeepCrawl will scan your URLs for the default min content size of three kilobytes.
  3. Fine-tune as necessary (a quick manual spot-check is sketched below).
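
That manual spot-check could be as simple as fetching one URL, stripping the markup, and measuring how much visible text remains. A rough sketch using the requests library (it skips script and style blocks but is not a full text extractor):

from html.parser import HTMLParser
import requests

class VisibleText(HTMLParser):
    """Crude visible-text extractor: collects text outside script/style blocks."""
    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)

url = "http://www.example.com/some-page/"  # hypothetical page to check
extractor = VisibleText()
extractor.feed(requests.get(url, timeout=30).text)
text = " ".join(" ".join(extractor.chunks).split())
print(f"{len(text.encode('utf-8')) / 1024:.1f} KB of visible text")  # compare against the 3 KB default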

10. Crawl Massive Sites

Crawl Massive Sites

You may have to crawl a website that spans over 1 million URLs. The good news is that DeepCrawl can run audits for up to 3 million URLs per crawl, giving you the powerful tool your site size demands.

How To Do It:

  1. From the crawl settings tab in your projects dashboard, adjust the crawl limit to suit your target domain’s total URLs.
  2. Crawl up to 1 million URLs using prefabricated settings in DeepCrawl’s dropdown “crawl limits” menu.
  3. For a custom crawl, select “custom” from the dropdown menu and adjust max URLs and crawl depth to suit your reporting needs.
  4. Crawl Limit: 3 million URLs per crawl.    

11. Test Domain Migration

Test Domain Migration

Newly migrated websites can experience server-side issues and unexpected URL complications that can lead to page display errors and downtime. Check status codes post-migration from your project’s reporting dashboard. There you can see the total number of non-200 status codes, including 5xx and 4xx errors, DeepCrawl detected during the platform’s most recent audit. Use this information to see if URLs that your team redirected are working as intended.     

12. Track Migration Changes

Track Migration Changes

Website migration is often a painstaking effort involving multiple development teams, technical SEOs, and decision makers. Use DeepCrawl to compare your site’s post-migration structure to a previous crawl dating back before the move. You can choose to compare the newest crawl to any previously-run data set to see if teams missed their assignments or if old problems managed to make their way into the site’s new web iteration.

13. Find 404 Errors

Find 404 Errors

Correcting 404 errors helps search crawlers navigate your pages with less difficulty and reduces the chance that searchers land on content that serves an "oops!" message instead of products. Find 404s using DeepCrawl from your report's main audit dashboard. One click gives you a full list of all 4xx errors on the site at the time of the audit, including each page's title, URL and source code, and the page on which the link to the 404 was found.

14. Check Redirects

Check Redirects

Check the status of temporary and permanent redirects on site by clicking through the “non-200 status codes” section on your project dashboard. Download 301 and 302 redirects into .CSV files for easy sorting, or share a project link with team members to start the revision process.  

15. Monitor Trends Between Crawls

Monitor Trends Between Crawls

Tracking changes between crawls gives you powerful data to gauge site trends, emerging issues, and potential opportunities. You can manually set DeepCrawl to compare a new crawl to its previous audit of the same site from the “crawls” tab in the project management section. Once finished, the new crawl shows your improved stats in green and potential trouble areas in red. The overview includes link data, HTML ratio and max load times among others.     

16. Check Pagination

Check Pagination

Pagination is crucial for large websites with many items (such as e-commerce websites), helping to make sure the right pages display for relevant categories. Find paginated URLs in the "crawled URLs" section of your audit's overview dashboard. From this section, your team can check rel=next and rel=prev tags for accuracy while vetting individual URLs to make sure they're the intended targets.

17. Find Failed URLs

Find Failed URLs

When a URL fails on your website, DeepCrawl’s spider can’t reach it to render the page, which means users and search engines probably can’t get there either. Investigate connection errors and other issues that can cause failed URLs through your project’s main reports dashboard. Finding the source of these issues can help you improve your site’s user experience by improving server response times and reducing time-out connection errors. You can find more information about Connection errors in DeepCrawl’s guide.

18. Crawl Sitemaps for Errors

Crawl Sitemaps for Errors

Problems in your XML sitemap can lead to delays in search engines identifying and indexing your important pages. If you’re dealing with a large domain with a million URLs, finding that one bad URL can be a challenge. Put DeepCrawl to work with the platform’s universal crawl feature. From this tab in your project reporting dashboard, review URLs missing from your sitemap, all pages included in your sitemaps, broken sitemaps, and sitemap URLs that redirect.  

19. Improve Your Crawls with Google Analytics Data

Improve Your Crawls with Google Analytics Data

Validating your Google Analytics account when you create a new project gives DeepCrawl the ability to overlay organic traffic and total visits on your individual URLs and paths. Seeing this data helps with prioritizing changes coming out of your site audit. For example, if you notice low organic traffic to product pages that have thin or duplicate content, that may serve as a signal to speed up revamping those URLs.   

How To Do It:

  1. Select a universal crawl when setting up your new project.
  2. Click save and click over to the “analytics settings” that now appears in the top menu.
  3. Click “add new analytics account” located in the top left of the dashboard.
  4. Enter your Google Analytics name and password to sync your data permissions within the DeepCrawl platform.
  5. Hit save and DeepCrawl will pull analytics data for all domains you have permission to view on current and future crawls.   

20. Verify Social Tags

Verify Social Tags

To increase share rates from your blog posts on Facebook and Twitter and thereby enhance your community outreach activities, avoid any errors in your social tag markup. View Twitter Cards and Open Graph titles, images, and URLs to see what needs fixing.  

21. Test Individual URLs

Test Individual URLs

Granular reporting over thousands of landing pages is difficult to grasp, but DeepCrawl makes the process digestible with an elegant statistical breakdown. View unique pages by clicking the link of the same name (unique pages) in your dashboard overview. From here, you can see a detailed breakdown of page health, including external links, word count, HTML size, and nofollow tag use.    

22. Verify Canonical Tags

Verify Canonical Tags

Incorrect canonical tags lead crawlers to ignore canonicals altogether, leaving your site in danger of duplication and search engines having trouble correctly identifying your content. View canonicalized pages in your DeepCrawl audit by clicking through from the “non-indexable pages” section of the project’s dashboard. The platform gives you the canonical’s location, HTML, title tag, and URLs to help with verification.    

23. Clean Up Page Headers

Clean Up Page Headers

Cluttered page headers can impair the click through rate if users’ expectations are not being managed well. CTRs can vary by wide margins, which makes it difficult to chart the most effective path to conversion. When you run a universal crawl, make sure to integrate your Google Analytics account. This will help you gain a deeper insight, by combining crawl data with powerful analytics data including bounce rate, time on page, and load times.   

24. Make Landing Pages Awesome

 

Make Landing Pages Awesome

Page-level elements, including H1 tags, compelling content, and proper pagination, are essential parts of your site's marketing funnel that help turn visitors into leads. Use DeepCrawl's metrics to help improve engagement and conversion. Find content that is missing key parts – including H1s, H2s, sitemap inclusion, and social markup – through the "unique pages" tab, to help pages engage visitors faster, deliver your site's message more clearly, and increase the chances of conversion and of exceeding user expectations.

25. Prioritize Site Issues

Prioritize Site Issues

Knowing what to address first in the aftermath of a sizable audit can be a challenging task for any site owner. Using the project management tab, you can assign tasks emerging from the audit to your team members and give each task a priority rating. This system helps you track actions from your site audit by priority and assigned team member through the “all issues” tab, accessible from any page. You can view the age of each task, leave notes for team members, and mark projects as fixed, all from the same screen in the DeepCrawl projects platform. For assignments with several moving parts, including 404 cleanups and page title correction, projects count down remaining URLs until they reach zero.

Prioritize Site Issues

26. Check for Thin Content

Check for Thin Content

Clean, efficient code leads to fast-loading sites – a big advantage for search engines and users alike. Search engines tend to avoid serving pages with thin content and extensive HTML in organic listings. Investigate pages that don't meet DeepCrawl's minimum content/HTML ratio by clicking through the tab of the same name in your project's main reporting dashboard. View pages by title and URL, and export the list as a .CSV or shared link.

27. Crawl as Googlebot or Your Own User Agent

Crawl as Googlebot or Your Own User Agent

If your site auditor can’t crawl your pages as Googlebot, then you have no prayer of seeing your domain through the search giant’s eyes. DeepCrawl can mimic spiders from other search engines, social networks, and browsers, including Firefox, Bing, and Facebook. Select your user agent in the advanced settings tab after you choose the website you want to audit.

28. Discover Disallowed URLs

Discover Disallowed URLs

Never-crawled (disallowed) URLs may contain broken links, corrupted files, and poorly coded HTML that can impact site performance. View disallowed pages from the overview section of your site's crawl report. From this section, you can see all disallowed pages and their corresponding URLs, and get a picture of which URLs may not be crawled by search engines.

29. Validate Page Indexation

Validate Page Indexation

DeepCrawl gives you the versatility to get high-level and granular views of your indexed and non-indexed pages across your entire domain. Check if search engines can see your site’s most important pages through this indexation tab, which sits in the main navigation of your reporting dashboard. Investigate no-indexed pages to make sure you’re only blocking search engines from URLs when it’s absolutely necessary.  

30. Sniff Out Troublesome Body Content

Sniff Out Troublesome Body Content

Troublesome content diminishes user confidence and causes them to generate negative behavioral signals that are recognized by search engines. Review your page-level body content after a web crawl by checking out missing H1 tags, spotting pages with threadbare word count, and digging into duplication. DeepCrawl gives you a scalpel’s precision in pinpointing the problems right down to individual landing pages, which enables you to direct your team precisely to the source of the problem.

How To Do It:

  1. From the report settings tab in your project dashboard, set your “duplicate precision” setting between 1.0 (most stringent) and 5.0 (least stringent).
  2. The default setting for duplicate precision is 2.0, if you decide to run your crawl as normal. If you do not change your settings, the crawl will run at this level of content scrutiny.
  3. Run your web crawl.
  4. Review results for duplicate body content, missing tags, and poor optimization as shown above.

31. Set Max Content Size

 Set Max Content Size

Pages that have a high word count might serve user intent and drive conversion, but they can also cause user confusion. Use DeepCrawl’s report settings tab to change max content size to reflect your ideal word count. After your crawl finishes, head to the “max content size” section from the audit dashboard to find pages that exceed your established limit.

How To Do It:

  1. In the report settings tab of your project dashboard, adjust maximum content size by KB (kilobytes).
  2. Hint: one kilobyte equals about 512 words.
  3. Check content that exceeded your KB limit from the finished crawl report.

32. Fixing Broken Links

Fixing Broken Links

Having too many broken links to external resources on your website can lead to a bad user experience, as well as give the impression your website is out of date. Use the validation tab from your web crawl dashboard to find all external links identified by DeepCrawl’s spider. Target followed links, sorting by http status, to see URLs that no longer return 200 “OK” statuses. These are your broken links. Go fix them.   

33. Schedule Crawls

 

Schedule Crawls

Use DeepCrawl's report scheduling feature to auto-run future crawls, setting their frequency, start date, and even time of day. This feature can also be used to avoid sensitive and busy server times, or to have monthly reports emailed directly to you or your team.

How To Do It:

  1. From the projects dashboard, click on the website you’d like to set a reporting schedule for.
  2. Click the scheduling tab.
  3. Set the frequency of your automated crawl, the start date, and the hour you’d like the crawl to start.
  4. Check back at the appointed time for your completed crawl.

34. Give Your Dev Site a Checkup

Give Your Dev Site a Checkup

Your development or staging site needs a checkup just like your production website. Limit your crawl to your development URL and use the universal or web crawl settings to dig into status codes and crawl depth, locating failed URLs, non-indexable pages and non-200 codes by level.

How To Do It:

  1. Run a universal crawl to capture all URLs in your target domain, including sitemaps.
  2. When the crawl finishes, check your non-200 status codes and web crawl depth from the reports dashboard.
  3. Examine the crawl depth chart to see where errors in your development site’s architecture occur most often.   

35. Control Crawl Speed

Control Crawl Speed

At crawl speeds of up to twenty URLs per second, DeepCrawl boasts one of the most nimble audit spiders available for online marketers working with enterprise-level domains. Sometimes, however, speed isn’t what is most important; accuracy is what matters. Alter the crawl speed by adjusting how many URLs are crawled per second, or by switching from dynamic IP addresses to a static, stealth-crawl, or location-based IP, from the advanced settings tab during initial audit setup.

36. Custom Extraction

Custom Extraction

Add custom rules to your website audit with DeepCrawl’s Custom Extraction tool. You can tell the crawler to perform a wide array of tasks, including paying more attention to social tags, finding URLs that match certain criteria, verifying App Indexing deeplinks, or targeting an analytics tracking code to validate product information across category pages. For more information about Custom Extraction syntax and coding, check out this tutorial published by the DeepCrawl team (and see the sketch after the steps below for a way to prototype a rule locally).

How To Do It:

  1. Enter your Regular Expressions syntax into the Extraction Regex box from the advanced settings tab on your DeepCrawl projects dashboard.
  2. Click the box underneath the Extraction Regex box if you’d like DeepCrawl to exclude HTML tags from your crawl results.
  3. View your results by checking the Custom Extraction tab in your project crawl dashboard.   
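
Custom Extraction rules are essentially regular expressions run against the page source, so you can prototype a rule locally before adding it to a crawl. The pattern and URL below are purely illustrative assumptions:

  import re
  import requests

  # Hypothetical rule: capture an analytics tracking ID such as UA-1234567-1.
  pattern = re.compile(r"UA-\d{4,10}-\d{1,4}")

  html = requests.get("https://www.example.com/", timeout=10).text
  matches = pattern.findall(html)
  print(matches or "no match")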

37. Restrict Crawls

Restrict Crawls

Restrict crawls for any site using DeepCrawl’s max URL setting or the list crawl feature. The max setting places a cap on the total number of pages crawled, while the list feature restricts the crawl to a set of URLs you upload before the audit launches.

How to Do It (Include URLs):

  1. Go to the “Include Only URLs” section in the Advanced Settings tab when setting up your new project’s crawl.
  2. Add the URL paths you want to include in the box provided, one per line. When you begin your crawl, DeepCrawl will only include URLs containing the paths you enter.
  3. Ex: /practice-areas/ or /category/

How to Do It (Exclude URLs):

  1. Navigate to the “Excluded URLs” section in the advanced settings tab of your project setup dashboard.
  2. Add the URL paths you want to exclude from your crawl, one per line in the box provided, using the same method as outlined above.
  3. Important Note: Exclude rules override any include rules you set for your crawl.
  4. Bonus Exclude: Stop DeepCrawl from crawling script-centric URLs by adding “*.php” and “*.cgi” into the exclude URLs field.

38. Check Implementation of Structured Data

Check Implementation of Structured Data

Access Google’s own Structured Data Testing Tool to validate Schema.org markup by adding a line or two of code to your audit through DeepCrawl’s Custom Extraction. This tool helps you see how your rich snippets may appear in search results, where errors in your markup prevent it from being displayed, and whether or not Google interprets your code, including rel=publisher and product reviews, correctly.  

How To Do It:

  1. Choose the “list crawl” option during your project setup.
  2. Enter a list of URLs you want to validate into the box provided. If you have over 2,000 URLs in your list, you’ll need to upload them as a .txt file.  
  3. Add the Custom Extraction code found here to get DeepCrawl to recognize Schema markup tags and add the particular line of code you want for your crawl: ratings, reviews, person, breadcrumbs, etc.
  4. Run Crawl. You can find the info DeepCrawl gleaned on your site’s structured markup by checking the Extraction tab from the reporting dashboard.

39. Verify HREFLANG Tags

Verify HREFLANG Tags

If your website is available in multiple languages, you need to validate the site’s HREFLANG tags. You can test HREFLANG tags through the validation tab in your universal crawl dashboard.

If you have HREFLANG tags in your sitemaps, be sure to use the universal crawl, as this includes crawling your XML sitemaps (a quick single-page spot check is sketched after the steps below).

How To Do It:

  1. Run a universal crawl on your targeted website from the projects dashboard.
  2. Once the crawl finishes, head to the reports section under “validation.”
  3. From there you can view pages with HREFLANG tags and pages without them.
  4. View URLs flagged as inconsistent and alternative URLs for all country variations on the domain.   
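
Outside of DeepCrawl you can spot-check a single page's HREFLANG annotations with a few lines of Python. This is a simplified sketch: it only reads link elements in the HTML source (ignoring sitemap and HTTP-header annotations), assumes a fixed attribute order, and uses a placeholder URL.

  import re
  import requests

  url = "https://www.example.com/en/"  # hypothetical English-language page
  html = requests.get(url, timeout=10).text

  # Capture (hreflang, href) pairs from <link rel="alternate" ...> elements.
  pairs = re.findall(
      r'<link[^>]+rel=["\']alternate["\'][^>]+hreflang=["\']([^"\']+)["\'][^>]+href=["\']([^"\']+)["\']',
      html, re.IGNORECASE)

  for lang, href in pairs:
      print(lang, href)
  # A fuller check would fetch each alternate URL and confirm it links back (reciprocity).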

40. Share Read-Only Reports

Share Read-Only Reports

This is one of my favorite options: Sharing reports with C-levels and other decision makers without giving them access to other sensitive projects is easily doable with DeepCrawl. Generate a “read-only” URL to give them a high-level view of the site crawl as a whole or to kick out a link that focuses on a granular section, including content, indexation and validation.  

Last but not least

This top 40 listing is by no means complete. In fact, there are many more possibilities to utilize DeepCrawl for enhancing site performance and the tool undergoes constant improvements. This list is a starting point to understanding your website as Google does and making improvements for users and search engines alike. DeepCrawl is a very powerful tool and conclusions drawn using the data it provides must be based on experience and expertise. If applied to its full effect DeepCrawl can bring an online business to the next level and significantly contribute to user expectations management, brand building and most importantly driving conversions.

What are your favourite DeepCrawl features? Your opinion matters, share it in the comments section below.

Post from Fili Wiese

46 Updated DeepCrawl Tweaks to Make Your Website Soar in Google Search Results


Optimizing a website for your target audience can be tricky. Optimizing a large website for your target audience can be even trickier. Since the last article, DeepCrawl has launched a significant update built around a brand new crawler that is a lot faster, and there are now a bunch of new reports available.

Below are 46 new and updated tips and tricks to optimise the search signals for your websites in Google’s organic search.

Spend more time making recommendations and changes

(and less time trawling through data)

1. Crawl MASSIVE Sites

If you have a really large website with millions of pages you can scan unlimited amounts with the custom setting – so long as you have enough credits in your account!

How To Do It:

  • From the crawl limits in step 3 of crawl set up, adjust the crawl limit to suit your target domain’s total URLs
  • Crawl up to 10 million using pre-fabricated options from the dropdown menu
  • For a custom crawl, select “custom” from the dropdown menu and adjust max URLs and crawl depth to suit your reporting needs

01-crawl-massive-sites

2. Compare Previous Crawls

Built into DeepCrawl is the ability to compare your current crawl with the most recent one. This is useful for tracking changes as they are implemented, and for providing rich data for your organization to show the (hopefully positive) impact of your SEO changes on the site. You’ll also be able to see all of your previous crawls.

How To Do It:

  • On step 4 of your crawl set up, you can select to compare your new crawl to a previous one

02-compare-previous-crawls

3. Monitor Trends Between Crawls

Tracking changes between crawls gives you powerful data to gauge site trends, get ahead of any emerging issues, and spot potential opportunities. DeepCrawl highlights these for you through the Added, Removed, and Missing reports. These are populated once a project has been re-crawled, and appear in every metric reported upon.

Once a follow-up crawl is finished, the new crawl shows your improved stats in green and potential trouble areas in red.

In addition to calculating the URLs which are relevant to a report, DeepCrawl also calculates the changes in URLs between crawls. If a URL appears in a report where it did not appear in the previous crawl, it will be included in the ‘Added report’. If the URL was included in the previous crawl, and is present in the current crawl, but not in this specific report, then it is reported within the ‘Removed report’. If the URL was in the previous crawl, but was not included in any report in the current crawl, it is included in the ‘Missing report’ (e.g. the URL may have been unlinked since last crawled).

03-monitor-trends-between-crawls
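
The Added/Removed/Missing logic described above is essentially set arithmetic over the URL lists of two crawls. A minimal sketch of the idea, using made-up URL sets rather than real DeepCrawl exports:

  # URLs that appeared in a given report (e.g. Broken Pages) in each crawl,
  # plus the full set of URLs found anywhere in the new crawl.
  previous_report = {"/a", "/b", "/c"}
  current_report  = {"/b", "/d"}
  current_crawl   = {"/b", "/c", "/d", "/e"}

  added   = current_report - previous_report                     # new to this report
  removed = (previous_report & current_crawl) - current_report   # still crawled, no longer in this report
  missing = previous_report - current_crawl                      # not found in the new crawl at all

  print(added, removed, missing)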

4. Filters, filters, filters!

DeepCrawl reports are highly customisable. With well over 120 filtering options, you can really drill down into the data and isolate exactly what you are looking for in relation to your specific SEO strategy and needs, whether that’s highlighting pages with high load times or GA bounce rates, missing social tags, broken JS/CSS, or a low content-to-HTML ratio. You can even save your filters by creating tasks within reports, so the SEO issues that matter most to you will be flagged every time you recrawl, making both the thorns in your side and your progress easy to monitor!

04-filters-filters-filters

5. Assign Tasks, Ease Workflows

After you’ve gone through a big audit, assigning and communicating all your to-do’s can be challenging for any site owner. By using the built-in task manager, a button on the top right of each area of your crawl, you can assign tasks as you go along to your team members, and give each task a priority. This system helps you track actions from your site audit in the Issues area, easing team workflows by showing you what is outstanding, what you’ve achieved so far etc. You can also add deadlines and mark projects as fixed, all from the same screen in the DeepCrawl projects platform. Collaborators receiving tasks do not need a DeepCrawl account themselves, as they’ll have access to the specific crawl report shared as guests.

05-assign-issues-to-team-members

6. Share Read-Only Reports

This is one of my favourite options: sharing reports with C-levels and other decision makers without giving them access to other sensitive projects is easily doable with DeepCrawl. Generate an expiring URL to give them a high-level view of the site crawl as a whole or to kick out a PDF that focuses on a granular section, including content, indexation and validation. This also doubles up for sharing links externally when prospecting clients, especially if you’ve branded your DeepCrawl reports with your name and company logo.

06-share-read-only-reports

7. DeepCrawl is now Responsive

With DeepCrawl’s new responsive design, crawl reports look great across devices. This means you can also set off crawls on the go, which is handy for a quick site audit or pitch. You can even edit a crawl from the palm of your hand while you monitor it in real time, in case you need to alter the speed or levels.

07-deepcrawl-is-now-responsive

8. Brand your Crawl Reports

Are you a freelancer or an agency? In order to brand DeepCrawl reports with your business information/logo (or a client’s logo), and serve data to your team or client that is formatted to look as though it came directly from your shop, you can white-label them.

How To Do It:

  • Go to Account Settings
  • Select from Theme, Header colour, Menu colour, Logo and Custom proxy
  • Make the report yours!

08-brand-your-crawl-reports

Optimise your Crawl Budget

9. Crawl Your Site like Googlebot

Crawl your website just like search engine bots do. Getting a comprehensive report of every URL on your website is a mandatory component for regular maintenance. Crawl and compare your website, sitemap and landing pages to identify orphan pages, optimise your sitemaps and prioritise your workload.

09-crawl-your-site-like-googlebot

10. Optimise your Indexation

DeepCrawl gives you the versatility to get high-level and granular views of indexation across your entire domain. Check if search engines can see your site’s most important pages from the Indexation report, which sits just under the Dashboard on the left hand navigation area. Investigate no-indexed pages to make sure you’re only blocking search engines from URLs when it’s absolutely necessary.

10-optimise-your-indexation

11. Discover Potentially Non-indexable Pages

To stop you wasting crawl budget and/or to identify wrongly canonicalised content, the Indexation reports show you a list of all no-indexed pages on your site and give you details about their meta tags, e.g. nofollowed, rel canonical, noindex etc. Pages with noindex directives in the robots meta tag, robots.txt or the X-Robots-Tag header should be reviewed, as they can’t be indexed by search engines.

11-non-indexable-pages
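
To confirm why a given page shows up as non-indexable, you can check the two page-level signals directly: the robots meta tag and the X-Robots-Tag response header. A minimal sketch with a placeholder URL:

  import re
  import requests

  url = "https://www.example.com/private-offer/"  # hypothetical page
  resp = requests.get(url, timeout=10)

  header_directive = resp.headers.get("X-Robots-Tag", "")
  meta_match = re.search(
      r'<meta[^>]+name=["\']robots["\'][^>]+content=["\']([^"\']+)["\']',
      resp.text, re.IGNORECASE)
  meta_directive = meta_match.group(1) if meta_match else ""

  print("X-Robots-Tag:", header_directive or "none")
  print("meta robots :", meta_directive or "none")
  # A 'noindex' in either place keeps the page out of the index.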

12. Discover Disallowed URLs

The Disallowed URLs report, nested under Uncrawled URLs within Indexation, contains all the URLs that were disallowed in the robots.txt file on the live site, or from a custom robots.txt file in the Advanced Settings. These URLs cannot be crawled by Googlebot, which generally keeps them out of search results, so they should be reviewed to ensure that none of your valuable pages are being disallowed. It’s good to get an idea of which URLs may not be crawled by search engines.

12-check-disallowed-urls-for-errors

13. Check Pagination

Pagination is crucial for large websites with lots of products or content, as it makes sure the right pages display for the relevant categories on the site. You’ll find the First Pages in a series in the pagination menu, and you can also view unlinked paginated pages, which is really useful for hunting down pages that might have rel=next and rel=prev implemented wrongly.

13-hunt-down-pages-that-have-relnext-and-relprev-implemented-wrongly-copy

Understand your Technical Architecture

14. Introducing Unique Internal Links & Unique Broken Links

The Unique Internal Links report shows you the number of instances of each anchor text DeepCrawl finds in your crawl, so you can maximise your internal linking structure and spread your link juice out to rank for more terms! The links in this report can be analysed to understand the anchor text used, as well as the status of the redirect target URL.

DeepCrawl’s Unique Broken Links report shows your site’s links with unique anchor text and target URL where the target URL returns a 4xx or 5xx status. Naturally, these links can result in poor UX and waste crawl budget, so they can be updated to a new target page or removed from the source page. This nifty new report is unique to DeepCrawl!

14-introducing-unique-internal-links-_-unique-broken-links

15. Find Broken Links

Fixing 404 errors reduces the chance of users landing on broken pages and makes it easier on the crawlers, so they can find the most important content on your site more easily. Find 404s in DeepCrawl’s Non-200 Pages report. This gives you a full list of all 4xx errors on the site at the time of the audit, including the page title, URL, source code and the link on the page found to return a 404.

You’ll also find pages with 5xx errors, any unauthorised pages, non-301 redirects, 301 redirects, and uncategorised HTTP response codes, or pages returning a text/html content type and an HTTP response code which isn’t defined by W3C – these pages won’t be indexed by search engines and their body content will be ignored.

15-find-broken-links

16. Fix Broken Links

Having too many broken links to internal and external resources on your website can lead to a bad user experience, as well as give the impression your website is out of date. Go to the Broken Pages report from the left hand menu to find them. Go fix them.

16-fixing-broken-links

17. Check Redirects – Including Redirection Loops

Check the status of temporary and permanent redirects on site via the Non-200 Status report, where your redirects are nested. You can download 301 and 302 redirects or share a project link with team members to start the revision process.

You can also find your Broken Redirects, Max Redirects and Redirect Loops. The Max Redirects report defaults to showing pages that hop more than 4 times; this number can be customised on step 4 of your crawl set up in Report Settings, nested under Report Setup within the Advanced Settings.

The new Redirect Loop report contains URL chains which redirect back to themselves. These chains result in infinite loops, causing errors in web browsers for users and preventing crawling by search engines. In short, fixing them helps bots and users, and prevents the loss of important authority. Once found, you can update your redirect rules to prevent loops!

17-check-redirects-including-redirection-loops
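
Redirect chains and loops are easy to reproduce outside the tool as well: follow each hop manually and stop when you either reach a final status code or revisit a URL. A minimal sketch with a placeholder start URL, capped at ten hops:

  import requests
  from urllib.parse import urljoin

  def trace_redirects(url, max_hops=10):
      seen = []
      while len(seen) < max_hops:
          if url in seen:
              return seen + [url], "loop detected"
          seen.append(url)
          resp = requests.get(url, allow_redirects=False, timeout=10)
          if resp.status_code in (301, 302, 303, 307, 308):
              # Location may be relative, so resolve it against the current URL.
              url = urljoin(url, resp.headers["Location"])
          else:
              return seen, resp.status_code
      return seen, "too many hops"

  print(trace_redirects("https://www.example.com/old-page/"))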

18. Verify Canonical Tags

Incorrect canonical tags can waste crawl budget and cause the bots to ignore parts of your site, putting it at risk, as search engines may have trouble correctly identifying your content. View canonicalised pages in your DeepCrawl audit from the Indexation report. The Non-Indexable report gives you the canonical’s location and page details (like hreflang, links in/out, any duplicate content/size, fetch time etc.).

18-see-all-canonical-tags-and-url-locations

19. Verify HREFLANG Tags

If your website is available in multiple languages, you need to validate the site’s HREFLANG tags. You can test HREFLANG tags through the validation tab in your universal crawl dashboard.

If you have HREFLANG tags in your sitemaps, be sure to include Web Crawl in your crawl set up, as this includes crawling your XML sitemaps. DeepCrawl reports on all HREFLANG combinations, working/broken, and/or unsupported links as well as pages without HREFLANG tags.

How To Do It:

  • The Configuration report gives you an overview of HREFLANG implementation
  • In the lower left menu, the HREFLANG section breaks down all the aspects of HREFLANG implementation into categorised pages

19-verify-hreflang-tags

20. Optimise Image Tags

By using the custom extraction tool you can extract a list of images that don’t have alt tags across the site which can help you gain valuable rankings in Google Image Search.

How To Do It:

  • Create custom extraction rules using Regular Expressions
  • Hint: Try /(<img(?!.*?alt=(['"]).*?\2)[^>]*)(>)/ to catch images that have alt tag errors or don’t have alt tags altogether
  • Paste your code into “Extraction Regex” from the Advanced Settings link on step 4 of your crawl set up
  • Check your reports from your projects dashboard when the crawl completes. DeepCrawl gives two reports when using this setting: URLs that followed at least one rule from your entered syntax and URLs that returned no rule matches

20-optimise-image-tags

21. Is Your Site Mobile-Ready?

Since “Mobilegeddon”, SEOs have all become keenly aware of the constant growth of mobile usage around the world, with 70% of site traffic coming from mobile devices. To optimise for the hands of the users holding those mobile devices, and the search engines connecting them to your site, you have to send mobile users content in the best way possible, and fast!

DeepCrawl’s new Mobile report shows you whether pages have any mobile configuration, and if so whether they are configured responsively, or dynamically. The Mobile report also shows you any desktop mobile configurations, mobile app links, and any discouraged viewport types.

21-is-your-site-mobile-ready_

22. Migrating to HTTPS?

Google has been calling for “HTTPS everywhere” since 2014, and HTTPS has been considered a ranking signal. It goes without saying that sooner or later most sites may have to switch to the secure protocol. By crawling both HTTP and HTTPS, DeepCrawl’s HTTPS report will show you:

  • HTTP resources on HTTPS pages
  • Pages with HSTS
  • HTTP pages with HSTS
  • HTTP pages
  • HTTPS pages

Highlighting any HTTP resources on HTTPS pages enables you to make sure your protocols are set up correctly, avoiding issues when search engines and browsers try to work out whether your site is secure or not. Equally, your users won’t see a red padlock in the address bar instead of a green one, and won’t get a browser warning telling them to proceed with caution because the site is insecure. That is not exactly what people want to see when they are about to make a purchase, because when they see it, they probably won’t. A quick way to spot-check a page for insecure resources is sketched below.

22-migrating-to-https
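
You can also grep a rendered HTTPS page for insecure references directly. The sketch below only looks at src/href attributes in the raw HTML (it will not catch resources injected by JavaScript), and the URL is a placeholder:

  import re
  import requests

  url = "https://www.example.com/checkout/"  # hypothetical HTTPS page
  html = requests.get(url, timeout=10).text

  # Any src/href that still points at plain HTTP is mixed content.
  insecure = re.findall(r'(?:src|href)=["\'](http://[^"\']+)["\']', html, re.IGNORECASE)

  for resource in sorted(set(insecure)):
      print("insecure resource:", resource)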

23. Find Unlinked Pages driving Traffic

DeepCrawl really gives you a holistic picture of your site’s architecture. You can incorporate up to 5 sources into each crawl. By combining a website crawl with analytics you’ll get a detailed gap analysis, and URLs which have generated traffic but aren’t linked – also known as orphans – will be highlighted for you.

Pro tip: this works even better if you add backlinks and URL lists to your crawl too! You can link your Google Analytics account and sync 7 or 30 days of data, or you can manually upload up to 6 months’ worth of GA or any other analytics data (like Omniture) for that matter. This way you can work on your linking structure and optimise pages that would otherwise be missed opportunities.

23-find-unlinkedin-pages-driving-traffic

24. Sitemap Optimisation

You can opt to manually add sitemaps into your crawl and/or let DeepCrawl automatically find them for you. It’s worth noting that if DeepCrawl does not find them, then it’s likely that search engines won’t either! By including sitemaps into your website crawl, DeepCrawl identifies the pages that either aren’t linked internally or are missing from sitemaps.

By including analytics in the crawl you can also see which of these pages generate entry visits, revealing gaps in the sitemaps. It also sheds light on your internal linking structure by highlighting where things don’t match up: you’ll see the specific URLs that are found in the sitemaps but aren’t linked, and likewise those that are linked but are not found in your sitemaps.

Performing a gap analysis to optimise your sitemaps with DeepCrawl enables you to visualise your site’s structure from multiple angles, and find all of your potential areas and opportunities for improvement. TIP: You can also use the tool as an XML sitemaps generator.

24-sitemap-optimisation

25. Control Crawl Speed

You can crawl at rates as fast as your hosting environment can handle, which should be used with caution to avoid accidentally taking down a site. DeepCrawl boasts one of the most nimble audit spiders available for online marketers working with enterprise level domains.

Whilst we appreciate the need for speed, accuracy is what’s most important! That said, you can change crawl speeds by adjusting the number of URLs crawled per second when setting up, or even during a live crawl. Speeds range from 0-50 URLs crawled per second.

25-control-crawl-speed

Is your Content being found?

26. Identifying Duplicate Content

Duplicate content is an ongoing issue for search engines and users alike, but it can be difficult to hunt down, especially on really, really large sites. These troublesome pages are easily identified using DeepCrawl.

Amongst DeepCrawl’s duplicate reporting features lies the Duplicate Body Content report. Unlike the previous version of DeepCrawl, the new DeepCrawl doesn’t require the user to adjust sensitivity. All duplicate content is flagged, helping you avoid repeated content that can confuse search engines, cause original sources to fail to rank, and add no real value for your readership.

Finding Duplicate Titles, Descriptions, Body Content, and URLs with DeepCrawl is an effortless time-saver.

26-identifying-duplicate-content

27. Identify Duplicate Pages, Clusters, Primary Duplicates & Introducing True Uniques

Clicking into Duplicate Pages from the dashboard gives you a list of all the duplicates found in your crawl, which you can easily download or share. DeepCrawl now also gives you a list of Duplicate Clusters, so you can look at groups of duplicates to try to find the cause or pattern behind these authority-draining pages.

There is also a new report entitled True Uniques. These pages have no duplicates coming off them in any way shape or form, are the most likely to be indexed, and naturally are very important pages in your site.

Primary Duplicates have duplicates coming off them – as the name implies – but have the highest internal link weight within each set of duplicated pages. Though signals like traffic and backlinks need to be reviewed to assess the most appropriate primary URL, these pages should be analysed, as they are the most likely to be indexed.

Duplicate Clusters are pages sharing an identical title and near-identical content with another page found in the same crawl. Duplicate pages often dilute authority signals and social shares, affecting potential performance and reducing crawl efficiency on your site. You can optimise clusters of these pages by removing internal links to the duplicate URLs and redirecting them to the primary URL, or by adding canonical tags pointing to it.

How To Do It:

  • Click “Add Project” from your main dashboard
  • Under the crawl depth setting tell DeepCrawl to scan your website at all its levels
  • Once the crawl has finished, review your site’s duplicate pages from the “issues” list on your main dashboard or search for ‘duplicate’ in the left nav search bar

27-identify-duplicate-pages-clusters-primary-duplicates-_-introducing-true-uniques

28. Sniff Out Troublesome Body Content

Troublesome content impacts UX and causes negative behavioral signals like bouncing back to the search results. Review your page-level content after a web crawl by checking out empty or thin pages, and digging into duplication. DeepCrawl gives you a scalpel’s precision in pinpointing the problems right down to individual landing pages, which enables you to direct your team precisely to the source of the problem.

How To Do It:

  • Click on the Body Content report in the left hand menu
  • You can also select individual issues from the main dashboard
  • Assign tasks using the task manager or share with your team

28-sniff-out-troublesome-body-content

29. Check for Thin Content

Clean, efficient code leads to fast loading sites – a big advantage in search engines and for users. Search engines tend to avoid serving pages that have thin content and extensive HTML in organic listings. Investigate these pages easily from the Thin Pages area nested within the Content report.

29-find-pages-with-bad-html_content-ratios

30. Avoid Panda, Manage Thin Content

In a post-Panda world it’s always good to keep an eye on any potentially thin content which can negatively impact your rankings. DeepCrawl has a section dedicated to thin and empty pages in the body content reports.

The Thin Pages report will show all of your pages with less than the minimum content size specified in Advanced Settings > Report Settings (this defaults to 3 KB, and you can also choose to customise it). Empty Pages are all your indexable pages with less content than the Content Size setting specified (default set at 0.5 kilobytes) in Advanced Settings > Report Settings.

How To Do It:

  • Typing “content” into the main menu search will bring up the body content reports
  • Clicking on a report will give you a list of pages you can download or share

30-avoid-a-thin-content-penalty-now

31. Optimise Page Titles & Meta Descriptions

Page titles and meta descriptions are often the first point of contact between users coming from the search results and your site, and well written, unique descriptions can have a big impact on click-through rates and user experience. Through the Content report, DeepCrawl gives you an accurate count of duplicate, missing and short meta descriptions and titles.

31-optimise-page-title-_-meta-description

32. Clean Up Page Headers

Cluttered page headers can impair the click through rate if users’ expectations are not being managed well. CTRs can vary by wide margins, which makes it difficult to chart the most effective path to conversion.

If you suspect your page headers are cluttered, run a crawl with Google Analytics data so you can assess key SEO landing pages and gain deeper insights by combining crawl data with powerful analytics data, including bounce rate, time on page, and load times.

32-clean-up-page-headers

Other nuggets that DeepCrawl gives

33. How Does Your Site Compare to Competitors?

Set up a crawl using the “Stealth Crawl” feature to perform an undercover analysis of a competitor’s site, without them ever noticing. Stealth Crawl randomises IPs and user agents and adds delays between requests, making it virtually indistinguishable from regular traffic. Analyse their site architecture and see how your site stacks up in comparison, discovering areas for improvement as you go.

How To Do It:

  • Go to the Advanced Settings in step 4 of your crawl setup and tick Stealth Mode Crawl, nested under the Spider Settings

33-how-does-your-site-compare-to-competitors_

34. Test Domain Migration

There are always issues with newly migrated websites, which usually generate page display errors or downtime. By checking status codes post-migration in DeepCrawl you can keep an eye out for any unexpected server-side issues as you crawl.

In the Non-200 Pages report you can see the total number of non-200 status codes, including 5xx and 4xx errors that DeepCrawl detected during the platform’s most recent crawl.

34-test-domain-migration

35. Test Individual URLs

Getting a granular view over thousands of pages can be difficult, but DeepCrawl makes the process digestible with an elegant Pages Breakdown pie chart on the dashboard that can be filtered and downloaded for your needs. The pie chart (along with all graphs and reports) can be downloaded in the format of your choice, whether CSV/PNG/PDF etc.

View Primary Pages by clicking the link of the same name in your dashboard overview. From here, you can see a detailed breakdown of each and every unique and indexable URL across up to 200 metrics, including DeepRank (an internal ranking system), clicks in, load time, content-to-HTML ratio, social tags, mobile optimisation (or lack thereof!), pagination and more.

35-test-individual-urls

36. Make Landing Pages Awesome and Improve UX

To help improve conversion and engagement, use DeepCrawl metrics to optimise page-level factors like compelling content and pagination, which are essential parts of your site’s marketing funnel that assist in turning visitors into customers.

You can find content that is missing key parts through the page content reports, helping you engage visitors faster, deliver your site’s message more clearly, increase the chances of conversion and exceed user expectations.

36-make-landing-pages-awesome-and-improve-ux

37. Optimise your Social Tags

To increase shares on Facebook (Open Graph) and Twitter and get the most out of your content and outreach activities, you need to make sure your Twitter Cards and Open Graph tags are present and set up correctly.

Within DeepCrawl’s Social Tagging report you will see pages with and without social tags, whether the tags that do exist are valid, and OG:URL Canonical Mismatches – pages where the Open Graph URL is different from the canonical URL. These should be identical, otherwise shares and likes might not be aggregated for your chosen URL in your Open Graph data but be spread across your URL variations.

37-optimise-your-social-tags
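
The OG:URL versus canonical comparison is simple to reproduce for a single page if you want to verify a fix: pull both values out of the HTML and compare them. A rough sketch with a placeholder URL (a real check should also normalise trailing slashes, protocols and tracking parameters):

  import re
  import requests

  url = "https://www.example.com/article/"  # hypothetical page
  html = requests.get(url, timeout=10).text

  def first(pattern):
      m = re.search(pattern, html, re.IGNORECASE)
      return m.group(1) if m else None

  canonical = first(r'<link[^>]+rel=["\']canonical["\'][^>]+href=["\']([^"\']+)["\']')
  og_url    = first(r'<meta[^>]+property=["\']og:url["\'][^>]+content=["\']([^"\']+)["\']')

  print("canonical:", canonical)
  print("og:url   :", og_url)
  print("match" if canonical == og_url else "mismatch - shares may be split across URL variants")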

Want Customised Crawls?

38. Schedule Crawls

You can schedule crawls using DeepCrawl to automatically run them in the future and adjust their frequency and start time. This feature can also be used to avoid times of heavy server load. Schedules range from every hour to every quarter.

How To Do It:

  • In step 4 of your crawl set up click on Schedule crawl

38-schedule-crawls

39. Run Multiple Crawls at Once

You can crawl multiple sites (20 at any given time) really quickly as DeepCrawl is cloud-based, spanning millions of URLs at once while still being able to use your computer to evaluate other reports. Hence with DeepCrawl you can perform your Pitch, SEO Audit and your Competitor Analysis at the same time.

39-run-multiple-crawls-at-once

40. Improve Your Crawls with (Google) Analytics Data

By authenticating your Google Analytics account with DeepCrawl you can understand the combined SEO and analytics performance of key pages on your site. By overlaying organic traffic and total visits on your individual pages you can prioritise changes based on page performance.

How To Do It:

  • On step 2 of your crawl set up, go to Analytics and click add Google account
  • Enter your Google Analytics username and password to sync your data permissions within DeepCrawl
  • Click the profile you want to share for the DeepCrawl project
  • DeepCrawl will sync your last 7 or 30 days of data (your choice), or you can upload up to 6 months’ worth of data as a CSV file, whether from Google Analytics, Omniture or any other provider

40-improve-your-crawls-with-google-analytics-data

41. Upload Backlinks and URLs

Identify your best-linked pages by uploading backlinks from Google Search Console, or lists of URLs from other sources, to help you track the SEO performance of the most externally linked content on your site.

41-upload-backlinks-and-urls

42. Restrict Crawls

Restrict crawls for any site using DeepCrawl’s max URL setting, the exclude URL list, or the page grouping feature, which lets you restrict pages based on their URL patterns. With page grouping you can choose to crawl, say, 10% of a particular folder – or of each folder on your site – if you’re looking for a quick snapshot. Once you’ve re-crawled (so long as you keep the same page grouping settings), DeepCrawl will recrawl the same 10% so you can monitor changes.

Aside from Include/Exclude Only rules, you can restrict your crawls by Starting URLs and by limiting the depth and/or number of URLs you’d like to crawl on your given site.

How to Do It:

  • In Advanced Settings nested in step 4 of your crawl set up click “Included / Excluded URLs” or “Page Grouping” and/or “Start URLs”

42-restrict-crawls

43. Check Implementation of Structured Data

Access Google’s own Structured Data Testing Tool to validate Schema.org markup by adding a line or two of code to your audit through DeepCrawl’s Custom Extraction. This tool helps you see how your rich snippets may appear in search results, where errors in your markup prevent it from being displayed, and whether or not Google interprets your code, including rel=publisher and product reviews, correctly.

How To Do It:

  • In Advanced Settings of step 4 of your crawl set up
  • Click on “custom extraction”
  • Add the Custom Extraction code found here to get DeepCrawl to recognise Schema markup tags and add the particular line of code you want for your crawl: ratings, reviews, person, breadcrumbs, etc

itemtype="http://schema.org/([^"]*)
itemprop="([^"]*)
(itemprop="breadcrumb")
(itemtype="http://schema.org/Review")
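
Those patterns are plain regular expressions, so you can dry-run them against a page's HTML before committing them to a crawl. A minimal sketch (the URL is a placeholder, and note the patterns above assume double-quoted attributes):

  import re
  import requests

  html = requests.get("https://www.example.com/product/", timeout=10).text  # hypothetical URL

  patterns = {
      "itemtype": r'itemtype="http://schema.org/([^"]*)',
      "itemprop": r'itemprop="([^"]*)',
  }

  for name, pattern in patterns.items():
      print(name, re.findall(pattern, html))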

43-check-implementation-of-structured-data

44. Using DeepCrawl as a Scraper

Add custom rules to your website audit with DeepCrawl’s Custom Extraction tool. You can tell the crawler to perform a wide array of tasks, including paying more attention to social tags, finding URLs that match a certain criteria, verifying App Indexing deeplinks, or targeting an analytics tracking code to validate product information across category pages.

For more information about Custom Extraction syntax and coding, check out this tutorial published by the DeepCrawl team.

How To Do It:

  • Enter your Regular Expressions syntax into the Extraction Regex box in the advanced settings of step 4 of your crawl
  • View your results by checking the Custom Extraction tab in your project’s crawl dashboard or at the bottom of the navigation menu

44-using-deepcrawl-as-a-scraper

45. Track Migration Changes

Migrations happen for a lot of reasons, and they are generally challenging. To help developers, SEOs and decision makers from the business come together and minimise risk, use DeepCrawl to compare staging and production environments in a single crawl, spot any issues before you migrate, make sure no-one has missed their assignments, and ensure the migration goes smoothly.

For site migrations and/or redesigns, testing changes before going live can show you whether your redirects are correctly set up, whether you’re disallowing or no-indexing valuable pages, and so on. Being careful does pay off!

45-track-migration-changes

46. Crawl as Googlebot or Your Own User Agent

If your site auditor can’t crawl your pages as a search engine bot, then you have no chance of seeing the site through the search engine’s eyes. DeepCrawl can also mimic spiders from other search engines, social networks and browsers. Select your user agent in the advanced settings when setting up or editing your crawl.

Your options are:

  • Googlebot (7 different options)
  • Applebot
  • Bingbot
  • Bingbot mobile
  • Chrome
  • Facebook
  • Firefox
  • Generic
  • Internet Explorer 6 & 8
  • iPhone
  • DeepCrawl
  • Custom User Agent

46-crawl-as-googlebot-or-your-own-user-agent

Last but not least

This top 46 list is by no means complete. In fact, there are many more possibilities to utilise DeepCrawl for enhancing site performance and the tool undergoes constant improvements. This list is a starting point to understanding your website as search engines do and making improvements for users and search engines alike.

DeepCrawl is a very powerful tool and conclusions drawn using the data it provides must be based on experience and expertise. If applied to its full effect DeepCrawl can bring an online business to the next level and significantly contribute to user expectations management, brand building and most importantly driving conversions.

What are your favourite DeepCrawl features? Your opinion matters, share it in the comments section below.

Post from Fili Wiese

A Crawl-Centred Approach to Content Auditing


It’s 2018 and content marketing is still on the rise in terms of interest and in the revenue it generates. 75% of companies increased their content marketing budget in 2016 and 88% of B2B sites are now using content marketing in various forms.

Measuring ROI and content effectiveness are seen by B2B companies as more of a challenge than a lack of budget. Generally speaking, the issues now faced with content marketing appear to be less about buy-in and more about showing the value of content marketing initiatives.

Challenges for B2B Content Marketers

Increased Need for Content Audits

The longer companies invest in content marketing initiatives, the larger their sites are likely to become, meaning that the need to review and audit content is more important than ever.

I’m going to put forward a crawl-centred process for reviewing content that will give you a structure for answering these four questions:

  1. What content is/isn’t working?
  2. What should you do with content that isn’t working well?
  3. How can you get the most out of content that is working well?
  4. How can you find insights that will inform your content strategy?

Answering these questions will put you in a position to optimise and maintain site content as well as assisting you in implementing a data-driven approach to ensure content marketing resources are being invested efficiently.

Content Discovery

The first phase of a content audit is the discovery phase. This involves finding all of the content on your site and combining it with a range of relevant data sources that will inform you about the performance of each page.

Starting with a crawl

While there are many ways you can achieve this, the simplest way is by starting with a crawl of the site. Using a crawler, like DeepCrawl, you can crawl sites of any size and you have the ability to easily bring in a whole host of additional data sources to find pages that a simple web crawl might miss.

Integrating Additional Data Sources

The graphic below details some of the data sources you will want to bring in alongside a crawl, but don’t treat this as an exhaustive list. You should look to include any data sources and metrics that are going to be useful in helping you assess the performance of your content, which may also include: social shares, sitemaps, estimated search volume, SERP data etc.

DeepCrawl Search Universe

The beauty of using a crawler like DeepCrawl is that it will save you the effort of having to pull the majority of these data sources together manually…and crashing Excel. Once you’ve run a crawl with other data sources included, you can simply export the full set of data into a spreadsheet with the pages occupying rows and the metrics as columns.

Using Custom Extractions

Another benefit of using a more advanced crawler is that you can use custom extractions to pull out data like: author name, out of stock items, published date, last modified date, meta keywords, breadcrumbs and structured data.

DeepCrawl Custom Extraction

The Refining Phase:

At this point you’ll need to take that bloated spreadsheet you’ve got, full of performance insights, and shrink it into something more manageable. The aim of this phase is to reduce the data down so that you’re in a position to start assessing and reviewing pages.

This involves removing pages (rows) from the spreadsheet that sit outside of the content audit and getting rid of superfluous metrics (columns) which aren’t going to provide you with valuable insights. If you aren’t sure whether a metric or page should be included you can always hide these rather than deleting them.

column deletion

Now that you’ve got a workable dataset, let’s see how you can go about answering those four questions.

What content on the site is/isn’t working?

To understand content performance you will want to decide on a set of criteria that define success and failure. The metrics you choose will be dependent on the goal of your content and what you’re trying to achieve, but will likely cover traffic, social shares, engagement, conversions or a mixture of some or all of these.

Once you’ve made this decision, you can define buckets based on varying levels of those metrics (e.g. outstanding, good, average, poor) and apply them to the pages you’re assessing.

Now you will be able to see how content is performing based on the metrics you care about.
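
If your export is in a spreadsheet or CSV, the bucketing can be done in a couple of lines with pandas. The file name, column name and thresholds below are invented for illustration; use whatever success metrics you settled on:

  import pandas as pd

  df = pd.read_csv("content_audit.csv")  # hypothetical export: one row per URL

  # Bucket pages by monthly organic sessions (made-up thresholds).
  bins = [-1, 10, 100, 1000, float("inf")]
  labels = ["poor", "average", "good", "outstanding"]
  df["performance"] = pd.cut(df["organic_sessions"], bins=bins, labels=labels)

  print(df["performance"].value_counts())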

How can you deal with content that isn’t performing well?

Now that your awesome and poor performing pages are right there in front of you, you’ll need to decide how to deal with them respectively.

I’d start by adding in an ‘Action’ column to your spreadsheet and creating a dropdown list of different decisions. These could include:

  • Keep – Pages that are performing well and will not be changed significantly
  • Cut – Low value pages that don’t deserve a place on your site e.g. outdated content
  • Combine – Pages that include content that doesn’t warrant its own dedicated page but can be used to bolster another existing page
  • Convert – Pages with potential that you want to invest time improving e.g. partially duplicate content

Column actions

For small to medium sized sites you should be able to make these decisions on a page by page basis, but for larger sites it may be easier to make decisions by aggregating pages into groupings, so that it remains a manageable process.

What actions can you take to get the most out of content that is performing well?

Now that you’ve decided what actions you’re going to take for each of your pages, you’ll want to filter down your pages by those that you’re going to keep, and look at ways that you can get the most out of them.

This will take the form of an exercise in content optimisation and involves tuning up the content you want to keep. There are a ton of great resources which cover this subject so I won’t cover this in detail. However, you may want to look at how you can improve:

Optimising titles & meta descriptions

– Bread and butter stuff, but are your titles and descriptions appealing propositions? Do they match the user intent of the queries that they rank for?

Keyword cannibalisation

– Do you have multiple pages targeting or ranking for topically similar queries that could be consolidated to maximise your authority on the subject?

Resolving content duplication issues

– Is content on your site unique? Are there near or true (complete) duplicate versions which could be diluting the authority of the page(s) that is performing well?

Linking

– Are there opportunities to link to related pages internally or externally? Do you have relevant CTAs? What do you want visitors to do once they’ve finished with the page they’re on (and do you help get them there)?

Page speed

– Are there any ways you can further optimise pages to reduce load time e.g. image optimisation or clunky code?

Structured data implementation

– Is there any useful structured data that you could use to mark up pages, and is existing markup implemented correctly?

How can you enhance your content strategy?

Once you’ve made it to this stage you will know how you’re going to deal with all of your existing content, but how can you use your performance data to inform your content strategy going forward?

The resources that you have for content production are finite, so you need to uncover insights that will help you determine what you should be investing in more, what you should do less of, or what you shouldn’t be producing altogether.

You will likely have a good understanding of the relative performance of your content at this point, but it can be helpful to focus on specific metrics and dimensions across the whole dataset to gain deeper insights.

Finding Relationships

You can do this by pivoting variables around important metrics to find relationships that will show you how you can do more of what works and less of what doesn’t.

You’ve already defined your success metrics, but here are some examples of variables that you might view those metrics against to find interesting relationships (a minimal pivot sketch follows the list):

  • Performance and engagement by channel/category/content type – Do some types of content perform better than others? Are they viewed or shared more frequently?
  • Content length and engagement – Is word count positively correlated with engagement or is there a drop off point of diminishing returns?
  • Content length and sharing – Does longer content, which is usually more resource intensive, get shared more than short form content? Do the results of long form content justify the larger investment?
  • Performance and engagement by author – Do some authors receive more pageviews and shares than others? This is particularly useful for media organisations where there is a high level of content production and author performance is more important.
  • Performance fluctuations by publish date and time – Is content better received on specific days of the week, times of the day or months of the year (if you have data going back far enough)? Can you tailor content publication to times that are likely to get more exposure? For news sites this may mean writers publishing articles outside of standard working days and hours to get more views when their readers have more free time.
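
A pivot table is usually enough to surface these relationships. Here is a minimal pandas sketch, again with an invented file name and column names standing in for your own dataset:

  import pandas as pd

  df = pd.read_csv("content_audit.csv")  # hypothetical export with one row per URL

  # Average performance and engagement by content type.
  by_type = pd.pivot_table(
      df,
      index="content_type",                     # e.g. guide, news, product page
      values=["organic_sessions", "shares", "avg_time_on_page"],
      aggfunc="mean",
  ).sort_values("organic_sessions", ascending=False)

  print(by_type)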

From Audits to Continuous Monitoring

Content audits are going to vary depending on site size, type and the data you have to hand. However, the above provides a framework for conducting an audit that streamlines and gets the most out of existing content, as well as improving your content strategy going forward.

Furthermore, you should look to achieve this with crawl data at the core, preferably with a tool that automatically integrates additional data sources to make this process as quick and painless as possible.

Once you’ve got this whole process down, you can take it to the next level by automatically pulling in the data on a regular basis and putting it into dashboards. Doing this will enable regular reviews of your content performance and will allow you to adjust the course of your content strategy accordingly.

Post from Sam Marsden

10 Ways To Save Time and Identify Major Site Errors with DeepCrawl


I first met the DeepCrawl team at Brighton SEO where they were promoting their web tool to conference attendees. DeepCrawl has a relatively new website tool which I have been using. I was lucky to be given a demonstration of the tool. I cannot cover every aspect of DeepCrawl in this post, but I would like to highlight the areas I found particularly useful.

1) Easy to Use

The DeepCrawl tool allows you to easily crawl a site. The first step is to Add Project as the below shows.

Deep Crawl Project

DeepCrawl can crawl sites with a up to 5 million URLs in total. The crawl is also customisable, so users area able to select the speed of the crawl between 1 and 20 URLS per second (this in the advanced settings).

CrawlRate

It is also possible to select exactly when you crawl your own site. Many people choose to crawl overnight or over the weekends as it is faster than during peak hours.

Setting up new project

The tool allows the user to include or exclude certain pages such as adhering to the no follow links.

In my case, I wanted DeepCrawl to crawl through my entire site and also check the Sitemaps and organic landing pages, and therefore ticked “Universal crawl”. I clicked ‘save’ and moved to the next section where it confirmed the analytics. It picked up the UA ID and then I pressed save to begin the crawl of the site.

2) Simplification

The tool simplifies what may seem complicated to many, especially those not familiar with the technical aspects of SEO. The overview report highlights many aspects of the site that need to be addressed from a top level.

Overview of Deep Crawl

The overview report also shows the number of crawls you have processed on your site. It shows the number of unique pages and the depth, which is important especially in large sites.

3) Identify Indexation

The report clearly shows the number of URLs that have been crawled as well as the number of unique pages on a site and duplicate pages. What I like about the tool is that it also shows the canonicalized pages and the no-followed pages. I find it particularly useful that the tool clearly shows the errors of the site.

Indexation

One element that is a clear USP, is the fact that DeepCrawl highlights the changes from one crawl to the next which makes it very easy for people to see what has changed and what has not. From the above example the cells in green are from one crawl and the cells in red are from the second crawl of the site.

After users have run more than one crawl they have the trend the bottom of the dashboard on the Overview tab.

webcrawl depth

4) Identify Content

The Deep Crawl tool clearly shows the meta titles and descriptions on the content tab. It also tells the user if the meta data on the site is over the recommended length. The content tab also shows duplicate body content, as well as if there are missing H1 tags and multiple H1s on the page. The report also identifies if there are valid Twitter cards and open graphs, the latter is something I have not seen before in a crawling tool.

Content Overview

5) Clearly see internal broken links

My site was hacked into twice last year. Since then, I have done a lot of work to try and resolve this. I had to a do a complete reinstallation of the site, which meant many URLS went from sitename/date/post-name to sitename/uncategorozed/post-name.

Internal Broken Links

I knew I had internal broken links and therefore have been going through my site slowly to resolve them. This tool has helped to identify the internal broken links, which I will be addressing when I go through my posts. All internal links, external links as well as redirected links are highlighted in the validation page.

6) Assigning tasks to others

This is the aspect of the tool I really like, and it is crucial for project management. There may be several areas of the site which the tech team identifies as needing to be amended. However, due to limited resources (time and money), rectifying these errors may not always be possible. Therefore it is best to identify the tasks that can be actioned, give them a realistic date and assign them to the dedicated personnel. The issues can then be seen in the projects section. It is possible to export the tasks and discuss them with your clients.

Task Reporting

7) Page Level Detail

I found a few duplicate page titles on my blog; this was mostly due to the pagination issue with the site (e.g. page-2/page-3). With larger commerce sites, the page level detail is a useful aspect of the tool as it is easy to see the errors of a page on a detailed level.

Below is a screengrab of the page level detail. The DeepRank level is out of 10. The DeepRank score is a measure of authority based on number of links in and out of that page as well as number of clicks from the home page. And when you combine that with GA data such as site visits, you get an even better idea of which pages you should prioritise fixing because they have a lot of authority from search engines and are greatly accessed by your users.

A DeepRank score of 10 marks the most authoritative pages, so issues on them are the most serious; this page is a 3 out of 10. The tick marks show the page is indexable and that it is a unique page.

Duplicate Page Titles - Detail

 

8) Schedule reports

The ability to schedule reports is very useful, especially if you have a busy work calendar and may forget without reminders. The report will be emailed to you once it is complete. It is important to have regular reports so you can monitor the progress and changes made to the site. Once you see a decrease in duplicate page titles, or in any other issue flagged at the beginning of the project, you can demonstrate the progress made. This is particularly important if your client is asking to see the ROI of SEO.

9) Integration with analytics

When setting up a project it is possible to integrate the tool with your own analytics at the click of a button. This means no more exporting your own data and trying to match it with the crawler data like broken pages.  This makes our job in SEO that much easier.

How does DeepCrawl do this?

It crawls the website’s architecture level by level from the root domain. Then it compares the pages discovered in the architecture to the URLs you are submitting in your sitemaps. Finally, DeepCrawl compares all of these URLs to the Organic Landing Pages in your Google Analytics account.

This is another great USP of DeepCrawl, as this feature allows users to find some of the gaps in their site, such as the following (a simple set-based sketch of the same comparison appears after the list):

  1. Sitemaps URLs which aren’t linked internally
  2. Linked URLs which aren’t in your Sitemaps
  3. URLs which generate entry visits but aren’t linked, sometimes referred to as orphaned pages or ghost URLs
  4. Linked URLs or URLs in Sitemaps which don’t generate traffic – perhaps they can be disallowed or deleted
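
The four gap checks above boil down to set differences between three URL lists. A minimal sketch with made-up paths standing in for real crawl, sitemap and analytics exports:

  # Hypothetical URL sets pulled from a crawl, the XML sitemaps and analytics landing pages.
  crawled   = {"/home", "/about", "/blog/post-1"}
  sitemap   = {"/home", "/blog/post-1", "/blog/post-2"}
  analytics = {"/home", "/blog/post-2", "/old-campaign"}

  print("in sitemaps but not linked:", sitemap - crawled)
  print("linked but missing from sitemaps:", crawled - sitemap)
  print("getting visits but not linked (orphans):", analytics - crawled)
  print("linked or in sitemaps but no traffic:", (crawled | sitemap) - analytics)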

By integrating DeepCrawl with your analytics, it can give you an indication as to how important the pages are based on site visits, bounce rate, time on site etc. and therefore which you should probably fix first to have maximum impact.

DC Google Analytics

10) Crawls Before the Site is Live

You may have had a client site where you want to run a crawl through the site before it goes live, but cannot as it is behind a secure wall. Fortunately, Deep Crawl is here to help. Deep Crawl allows you to crawl the site behind the secure wall and it will run a report highlighting any errors. This is particularly important because you can compare the test site to the live site to see what the differences are, and whether there is anywhere you may lose traffic if you pushed the site live as it was. This means you can easily identify any errors before the site goes live, saving you hours of time and making you look good in front of the client – another bonus!

Conclusion

Deep Crawl is a great tool. It clearly shows any issues with your site and makes the complicated and “techie” aspects of the site easy for anyone to understand. If you have difficulty explaining the more technical aspects of a site to the rest of your team, this tool will save you time and simplify the issues making it easy for your colleagues to understand. At just £50 a month to crawl 100,000 URLs, this is certainly a good deal.

Post from Jo Juliana Turnbull


New Year Resolution: Crawl Quarterly


In the new year, if you are in charge of clients, are in-house, or just in charge of a site of some kind, I implore you to set a new year’s resolution to crawl your site regularly. Most of the issues that I see with clients from a technical perspective can be solved with a regular crawl and time to fix the small issues that arise.

Call this a new mission, but I am determined to get everyone to crawl their sites regularly. I recently spoke on this topic at the State of Search conference in Dallas, TX. Many of the people attending my session knew everything in my presentation, but there are more people out there who find crawling confusing or daunting, and it's those people I want to reach. Crawling is no longer just for "technical" people – there are so many tools that have been developed to make crawling automated, fun, and actionable. It's just a matter of finding the right tool, making it a regular action item, and focusing on the right things.

This is the shortened version of my talk … the action items if you will …

Finding the Right Crawler

There are a number of crawlers on the market and there is one for you or your client. I created a graphic that shows the right crawling tool for your specific situation. This graphic reflects my professional opinion, but if you want more information, check out my full deck below. I suggest you test all of the tools to find the right one for you.

Crawler Scale by Kate Morris and Outspoken Media

I know I am missing some of the crawlers on the market. The nice people at Site Condor came by after my presentation to say hi and gave me a look at their crawler a few weeks later. This space is moving constantly so don’t get married to one tool. Keep your mind and eyes open – and share any thoughts you have for new features with your favorite tool. I know each company welcomes ideas, so share away!

Making a Regular Schedule

You might be thinking that it’s as simple as a recurring calendar invite, but to reap the full rewards of a regular crawl schedule, there are a few things to keep in mind.

Intervals

If nothing changes on your site, a crawl should be performed quarterly: start in January and then run one at the top of every quarter. But things do change, so it's recommended that you crawl the site before and after each major site change and define action items in the post-crawl review.

Withstanding Staff Changes

If you just add a calendar item to your own calendar, or have someone else do it, it will get lost over time as staff change. Talk to the company's (internal or client) development teams about adding this to their standing processes. Also ensure that the marketing team adds it to their process for any site revisions and marketing reviews.

Resources

The final thing to consider aligns well with changing processes to withstand staff changes: ensuring you have the resources to fix any issues found in the crawl. This means development or marketing time each quarter and after every major site change. Your action items can be anything from changes to robots.txt to getting a copywriter to fill in missing meta descriptions.

Focusing on the Right Metrics

Doing a site crawl can be daunting given the amount of information it returns. We use a crawler every time we do an SEO Audit at Outspoken, but a few facets are the keys to any audit/crawl.

  • Errors – Naturally, server errors (5XX) and Not Found errors (4XX) are the highest priority. They are issues for users and search crawlers alike.
  • Redirects – Internal links should never point at redirects. They will creep in over time, but should be fixed as soon as possible. It's link equity you can control, so try to keep any of it from passing through a redirect.
  • Duplicated Titles – These are the first warning sign of duplicate content and canonical issues. Check your titles for any that are missing, too short or too long, or duplicated.
  • XML Sitemaps – Finally, use a crawler to check your XML sitemap to ensure that you are only giving search crawlers the best URLs to crawl (a minimal sitemap check is sketched after this list).
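
As referenced in the last point, here is a minimal sketch of that sitemap check in Python (requests must be installed). The sitemap URL is a placeholder, and only a small sample of URLs is requested so as not to hammer the server.

# Fetch an XML sitemap and flag URLs that don't answer with a plain 200.
import requests
import xml.etree.ElementTree as ET

resp = requests.get("https://www.example.com/sitemap.xml", timeout=10)
loc_tag = "{http://www.sitemaps.org/schemas/sitemap/0.9}loc"
urls = [loc.text for loc in ET.fromstring(resp.content).iter(loc_tag)]

for url in urls[:50]:  # sample only, to stay polite
    status = requests.head(url, allow_redirects=False, timeout=10).status_code
    if status != 200:
        print(status, url)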

Below is my presentation in full for those that are interested in all the tools reviewed, the metrics to watch, the recommended timeline, and other fun uses of crawlers.

Post from Kate Morris

The First Pillar of SEO: Technology


Technical SEO

In my previous post for State of Digital I wrote about my 'Three Pillars' approach to SEO: Technology, Relevance, and Authority. Together these three pillars create a holistic view of SEO that should take all aspects of a website into account. Additionally, the three pillars map to the three main processes in web search engines: crawling, indexing, and ranking.

I want to elaborate further on each of the three pillars, starting with the first: technology.

The technological aspect of SEO is something many practitioners are, by their own admission, only passingly familiar with. It’s also one of the aspects of SEO that intrudes in the domain of web developers and server administrators, which means that for many marketing-focused SEOs it’s not something they can easily get their hands dirty with.

Yet it can be a tremendously important aspect of good SEO, especially for large-scale complicated websites. Whilst the average WordPress site won’t need a lot of technical SEO fixes applied to it (hopefully), large news publishers and enterprise-level ecommerce platforms are a different story altogether.

Why this is the case is something that becomes evident when you understand the purpose of technical SEO which, in my model, is crawl efficiency. For me the technology pillar of SEO is about making sure search engines can crawl your content as easily as possible, and crawl only the right content.

Crawl Efficiency

When the technological foundations of a website are suboptimal, the most common way this affects the site’s SEO is by causing inefficiencies in crawling. This is why good technical SEO is so fundamental: before a search engine can rank your content, it first needs to have crawled it.

A site's underlying technology impacts, among many other things, the way pages are generated, the HTTP status codes that it serves, and the code it sends across the web to the crawler. These all influence how a web crawler engages with your website. Don't assume that your site does these things correctly out of the box; many web developers know the ins and outs of their trade very well and know exactly what goes into building a great user-focused website, but can be oblivious to how their site is served to web crawlers.

When it comes to technical SEO, the adage “focus on your users and SEO will take care of itself” is proven entirely erroneous. A website can be perfectly optimised for a great user experience, but the technology that powers it can make it impossible for search engines to come to grips with the site.

In my SEO audit checklists, there are over 35 distinct aspects of technical SEO I look for. Below I summarise three of the most important ones, and show how they lead to further investigations on a whole range of related technical issues.

Crawl Errors

When analysing a new website, the first place many SEOs will look (myself included) is the Crawl Errors report in Google Webmaster Tools. It still baffles me how often this report is neglected, as it provides such a wealth of data for SEOs to work with.

When something goes wrong with the crawling of your website, Google will tell you in the Crawl Errors report. This is first-line information straight from the horse’s mouth, so it’s something you’ll want to pay attention to. But the fact this data is automatically generated from Google’s toolset is also the reason we’ll want to analyse it in detail, and not just take it at face value. We need to interpret what it means for the website in question, so we can propose the most workable solution.

Google Webmaster Tools Crawl Errors report

In the screenshot above we see more than 39,000 Not Found errors on a single website. This may look alarming at first glance, but we need to place that in the right context.

You’ll want to know how many pages the website actually has that you want Google to crawl and index. Many SEOs first look at the XML sitemap as a key indicator of how many indexable pages the site has:

Google Webmaster Tools Sitemaps report

It's evident that we're dealing with a pretty substantial website, and the 39k Not Found errors now seem a little less apocalyptic amidst a total of over 300k pages. Still, at over 11% of the site's total pages, the 39,000 Not Found errors represent a significant level of crawl inefficiency: Google will spend too much time crawling URLs that simply don't exist.

But what about URLs that are not in the sitemap and which are discovered through regular web crawls? Never assume the sitemap is an exhaustive list of URLs on a site – I’ve yet to find an automatically generated XML sitemap that is 100% accurate and reliable.

So let’s look further and see how many pages on this site Google has actually indexed:

Google Webmaster Tools Index Status report

The plot thickens. We have 39k Not Found errors emerging from 329k URLs in the XML sitemap and the regular web crawl, which in turn has resulted in over 570k URLs in Google’s index. But this too doesn’t yet paint the entire picture: the back-end CMS that runs this website reports over 800k unique pages for Google to crawl and index.

So by analysing one single issue – crawl errors – we’ve ended up with four crucial data points: 39k Not Found errors, 329k URLs in the XML sitemap, 570k indexed URLs, and 800K unique indexable pages. The latter three will each result in additional issues being discovered, which leads me to the next aspect to investigate: the XML sitemap.

But before we move on, we need to recommend a fix for the Not Found errors. We’ll want to get the full list of crawlable URLs that result in a 404 Not Found error, which in this case Google Webmaster Tools cannot provide; you can only download the first 1000 URLs.

This is where SEO crawlers like Screaming Frog and DeepCrawl come in. Run a crawl on the site with your preferred tool and extract the list of discovered 404 Not Found URLs. For extra bonus points, run that list through a link analysis tool like Majestic to find the 404 errors that have inbound links, and prioritise these for fixing.
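
If you'd rather do that cross-referencing in a script, the sketch below merges a hypothetical 404 export from your crawler with a hypothetical backlink export; the file names and the url/referring_domains columns are assumptions, so rename them to match whatever your tools actually produce.

import csv

# Hypothetical exports: a 404 list from the crawler and a backlink export from a link tool.
with open("crawler_404s.csv") as f:
    not_found = {row["url"] for row in csv.DictReader(f)}

with open("backlinks_export.csv") as f:
    links = {row["url"]: int(row["referring_domains"]) for row in csv.DictReader(f)}

# 404 URLs that still attract external links should be redirected first.
priority = sorted(((links.get(url, 0), url) for url in not_found), reverse=True)
for domains, url in priority[:25]:
    print(domains, url)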

XML Sitemaps

No matter how well a website is structured and how easy it is to navigate to any page, I never assume the site doesn’t need an XML sitemap. Some SEO ranking correlation studies show a positive correlation between the presence of an XML sitemap and higher rankings, but this is likely not a direct causal effect; the presence of an (error-free) XML sitemap is a sign of a website that has been subjected to proper SEO efforts, where the sitemap is just one of many things the optimisers have addressed.

Nonetheless I always recommend having an error-free XML sitemap, because we know search engines use it to seed their crawlers. Including a URL in your XML sitemap doesn’t guarantee it’ll be indexed, but it certainly increases its chances, and it ensures that the bulk of your site’s crawl budget is used on the right pages.

Again, Google Webmaster Tools is the first place to start, specifically the Sitemaps report:

Google Webmaster Tools Sitemap Errors report

Here we see that every single sitemap submitted by this site has one or more errors. As this is an old website that has gone through many different iterations and upgrades, this is not surprising. Still, when we see a sitemap with 288,000 warnings, it's obvious there's a major issue at hand.

Fortunately Google Webmaster Tools provides more details about what errors exactly it finds in each of these sitemaps:

Google Webmaster Tools Sitemap Errors report detail

There are several issues with this sitemap, but the most important one is that it has thousands upon thousands of URLs that are blocked by robots.txt, preventing Google from crawling them.

Now because we have a number of earlier established data points, namely that out of 800k unique pages only 570k are actually in Google’s index, this number of 288k blocked URLs makes sense. It’s obvious that there is a bit of excessive robots.txt blocking going on that prevents Google from crawling and indexing the entire site.

We can then identify which robots.txt rule is the culprit. We take one of the example URLs provided in the sitemap errors report, and put that in the robots.txt tester in Webmaster Tools:

Google Webmaster Tools Robots.txt Tester

Instantly it’s obvious what the problem with the XML sitemap is: it includes URLs that belong to the separate iPad-optimised version of the site, which are not meant for Google’s web crawlers but that instead are intended for the website’s companion iPad app.

And by using the robots.txt tester we're now also aware that the robots.txt file itself has issues: there are 18 errors reported in Webmaster Tools, which we'll need to investigate further to see how that impacts the site's crawling and indexing.

Load Speed

While discussing XML sitemaps above, I referenced 'crawl budget'. This is the concept that Google will only spend a certain amount of time crawling your website before it terminates the process and moves on to a different site.

It’s a perfectly logical idea, which is why I believe that it still applies today. After all, Google doesn’t want to waste endless CPU cycles on crawling infinite URL loops on poorly designed websites, so it makes sense to assign a time period to a web crawl before it expires.

Moreover, beyond the intuitive sensibility of crawl budgets, we see that when we optimise the ease with which a site can be crawled, the performance of that site in search results tends to improve. This all comes back to crawl efficiency; optimising how web crawlers interact with your website to ensure the right content is crawled and no time is wasted on the wrong URLs.

As crawl budget is a time-based metric, that means a site’s load speed is a factor. The faster a page can be loaded, the more pages Google can crawl before the crawl budget expires and the crawler process ends.

And, as we know, load speed is massively important for usability as well, so you tick off multiple boxes by addressing one technical SEO issue.

As before, we'll want to start with Webmaster Tools, specifically the Crawl Stats report:

Google Webmaster Tools Crawl Stats report

At first glance you’d think that these three graphs will give you all you need to know about the crawl budget Google has set aside for your website. You know how many pages Google crawls a day, how long each page takes to load, and how many kilobytes cross the ether for Google to load all these pages. A few back-of-a-napkin calculations will tell you that, using the reported average numbers, the average page is 25 kilobytes in size.
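
For what it's worth, the napkin maths looks like this; the figures below are placeholders chosen to reproduce the 25 KB average rather than the exact numbers in the screenshot.

# Back-of-a-napkin check of the Crawl Stats averages (illustrative numbers only).
pages_per_day = 8_000        # "pages crawled per day"
kilobytes_per_day = 200_000  # "kilobytes downloaded per day"
seconds_per_page = 1.5       # "time spent downloading a page"

print(kilobytes_per_day / pages_per_day, "KB per page on average")       # 25.0
print(pages_per_day * seconds_per_page / 3600, "hours of crawling/day")  # ~3.3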

But 1.5 seconds load time for 25KB seems a bit sluggish, and even a cursory glance at the website will reveal that 25KB is a grossly inaccurate number. So it seems that at last we’ve exhausted the usefulness of Google Webmaster Tools, at least when it comes to load speed issues.

We can turn to Google Analytics next and see what that says about the site’s load speed:

Google Analytics Site Speed report

There we go; a much better view of the average load speed, based on a statistically significant sample size as well (over 44k pages). The average load time is a whopping 24 seconds. That's definitely worth addressing; not just for search engines, but also for users who have to wait nearly half a minute for a page to finish loading. That might have been fine in the old days of 2400 baud modems, but it's simply unacceptable these days.

But this report doesn’t tell us what the actual problems are. We only know that the average load speed is too slow. We need to provide actionable recommendations, so we need to dig a bit deeper.

There are many different load speed measurement tools that all do a decent job, but my favourite is without a doubt WebPageTest.org. It allows you to select a nearby geographic location and a browser version, so that you get an accurate reflection of how your users' browsers will load your website.

What I like most about WebPageTest.org is the visual waterfall view it provides, showing you exactly where the bottlenecks are:

WebPageTest.org Waterfall report

The screenshot above is just the first slice of the total waterfall view, and immediately we can see a number of potential issues. There are a large number of JS and CSS files being downloaded, and some take over a full second to load. The rest of the waterfall view makes for equally grim reading, with load times on dozens of JS and image files well over the one second mark.

Then there are a number of external plugins and advertising platforms being loaded, which further extend the load speed, until the page finally completes after 21 seconds. This is not far off the 24 second average reported in Google Analytics.

It’s blatantly obvious something needs to be done to fix this, and the waterfall view will give you a number of clear recommendations to make, such as minimising JS and CSS and using smaller images. As a secondary data set you can use Google’s PageSpeed Insights for further recommendations:

Google PageSpeed Insights report

When it comes to communicating my load speed recommendations to the client I definitely prefer WebPageTest.org, as the waterfall view is such a great way to visualise the loading process and identify pain points.

Down The Rabbit Hole

At this stage we’ve only ticked off three of the 35 technical SEO aspects on my audit checklist, but already we’ve identified a number of additional issues relating to index levels and robots.txt blocking. As we go on through the audit checklist we’ll find more and more issues to address, each of which will need to be analysed in detail so that we can provide the most effective recommendation that will address the problem.

In the end, technical SEO for me boils down to making sure your site can be crawled as efficiently as possible. Once you’re confident of that, you can move on to the next stage: what search engines do with the pages they’ve crawled.

The indexing process is what allows search engines to make sense of what they find, and that’s where my Relevance pillar comes in, which will be the topic of my next article.

Post from Barry Adams

Find and Fix Common Crawl Optimisation Issues


When I analyse websites for technical SEO issues, the biggest factor for me is always crawl optimisation – i.e. ensuring that when a search engine like Google crawls the site, only the right pages are crawled and Googlebot doesn’t waste much time on crawling pages that won’t end up in the index anyway.

If a site has too many crawlable pages that aren’t being indexed, you are wasting Google’s crawl budget. Crawl budget, for those who don’t know, is the amount of pages Google will crawl on your site – or, as some believe (myself included) the set amount of time Google will spend trying to crawl your site – before it gives up and goes away.

So if your site has a lot of crawl waste, there is a strong likelihood that not all pages on your site will be crawled by Google. And that means that when you change pages or add new pages to your site, Google might not be able to find them any time soon. The negative repercussions for your SEO efforts should be evident.

How do you find out if your site has crawl optimisation issues? Google’s Search Console won’t tell you much, but fortunately there are tools out there that can help. My preferred tool to identify crawl optimisation problems is DeepCrawl. With a DeepCrawl scan of your site, you can very quickly see if there are crawl efficiency issues:

DeepCrawl crawl report

The screenshot above is from DeepCrawl’s main report page for a site crawl. As is evident here, this site has a huge crawl optimisation issue: out of nearly 200k pages crawled, over 150k are not indexable for various reasons. But they’re still crawlable. And that means Google will waste an awful lot of time crawling URLs on this site that will never end up in its index – a total waste of Google’s time, and therefore a dangerous issue to have on your website.

Optimising crawl budget is especially important on larger websites where more intricate technical SEO elements can come into play, such as pagination, sorted lists, URL parameters, etc.

Today I’ll discuss a few common crawl optimisation issues, and show you how to handle them effectively in ways that hopefully won’t cause your web developers a lot of hassle.

Accurate XML Sitemaps

One of the things I like to do when analysing a site is take the site’s XML sitemap and run it through Screaming Frog. While the Search Console report on a sitemap can give you good information, nothing is quite as informative as actually crawling the sitemap with a tool and seeing what happens.

Recently when analysing a website, the Search Console report showed that only a small percentage of the submitted URLs were actually included in Google’s index. I wanted to find out why, so I downloaded the XML sitemap and ran a Screaming Frog crawl. This was the result:

XML Sitemap with 301 redirects

As it turns out, over 90% of the URLs in the XML sitemap resulted in a 301 redirect. With several thousand URLs in the sitemap, this presented quite a waste of crawl budget. Google will take the URLs from the sitemap to seed its crawlers with, which will then have to do double the work – retrieve the original URL, receive a 301-redirect HTTP header, and then retrieve the redirect’s destination URL. This times several thousand URLs, and the waste should be obvious.

Upon looking at the redirected URLs in Screaming Frog, the root issue was clear very quickly: the sitemap contained URLs without a trailing slash, and the website was configured to redirect these to URLs with the trailing slash.

So this URL http://www.example.com/category/product then redirected to this URL: http://www.example.com/category/product/. The fix is simple: ensure that the XML sitemap contains only URLs with the trailing slash.
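
If you want to verify the fix before regenerating the sitemap, a short script can flag the offending entries. This is just a sketch: it assumes the sitemap has been saved locally as sitemap.xml, and it leaves URLs that end in a file extension alone.

# Flag sitemap URLs that are missing the trailing slash.
import xml.etree.ElementTree as ET

loc_tag = "{http://www.sitemaps.org/schemas/sitemap/0.9}loc"
locs = [el.text.strip() for el in ET.parse("sitemap.xml").iter(loc_tag)]

for url in locs:
    last_segment = url.rsplit("/", 1)[-1]
    if not url.endswith("/") and "." not in last_segment:  # skip file-like URLs
        print(url, "->", url + "/")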

The key lesson here is to make sure that your XML sitemaps contain only final destination URLs, and that there’s no waste of crawl budget with redirects or non-indexable pages in your sitemap.

Paginated & Sorted Lists

A common issue on many ecommerce websites, as well as news publishers that have a great amount of content, is paginated listings. As users of the web, this is something we have become almost desensitised to: endless lists of products or articles which, in the end, don’t make finding what you’re looking for any easier; we end up using the site’s internal search function more often than not.

For SEO, paginated listings can cause an array of problems, especially when you combine them with different ways to sort the lists. For example, take an ecommerce website that in one of its main categories has 22 pages worth of products.

22 Pages

Now, this large list of products can be sorted in various different ways: by price, by size, by colour, by material, and by name. That gives us five ways to sort 22 pages of products. Each of these sortings generates a different set of 22 pages of content, each with their own slightly different URL.

Then add in the complication of additive filters – so-called faceted navigation, a very common feature on many ecommerce sites. Each of these will generate anywhere from one to 22 additional URLs, as each filtered list can also be sorted in five different ways. A handful of filters would make the amount of crawlable pages grow exponentially. You see how one product category can easily result in millions of URLs.
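
To put some rough numbers on that growth, here is a tiny illustration; the filter count is made up, and real faceted navigation usually has far more values than this.

# How paginated, sorted and filtered lists multiply (illustrative numbers).
pages = 22      # paginated pages in the category
sortings = 5    # price, size, colour, material, name
filters = 12    # hypothetical number of filter values in the faceted navigation

print(pages * sortings)                                   # 110 URLs for one unfiltered list
print(pages * sortings * filters)                         # 1,320 with a single filter applied
print(pages * sortings * (filters * (filters - 1) // 2))  # 7,260 with every pair of filters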

Obviously we’ll want to minimise the number of pages Google has to crawl to find all the products on the site. There are several approaches to do this effectively:

Increase the default number of products/articles per page. Few things grind my gears as much as a paginated list with only 10 products on a page. Put more products on a single page! Scrolling is easy – clicking on a new page is harder. Less clicking, more scrolling. Don’t be afraid to put 100 products on a single page.

Block the different sorted pages in robots.txt. In most cases, a different way to sort a list of products is expressed through a parameter in the URL, like ‘?order=price‘ or something like that. Prevent Google from ever crawling these URLs by blocking the unique parameter in robots.txt. A simple disallow rule will prevent millions of potential pages from ever being crawled:

User-agent: *
Disallow: /*order=price*

This way you can block all the unique parameters associated with specific ways to sort a list, thereby massively reducing the number of potentially crawlable pages in one fell swoop. Just be careful you don’t inadvertently block the wrong pages from being crawled – use Google’s robots.txt tester in the Search Console to double-check that the regular category pages, as well as your product pages, are not blocked from being crawled.
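
As an extra belt-and-braces check, you can also test the wildcard rule locally before deploying it. Python's built-in robotparser doesn't understand Google-style wildcards, so this sketch translates the rule into a regular expression by hand; the example paths are hypothetical.

# Sanity-check a wildcard Disallow rule against example URL paths.
import re

rule = "/*order=price*"  # the Disallow path from robots.txt
pattern = re.compile(re.escape(rule).replace(r"\*", ".*"))

tests = {
    "/category/product/": False,                       # normal page - must stay crawlable
    "/category/?order=price": True,                    # sorted list - should be blocked
    "/category/page/3/?order=price&colour=red": True,  # sorted deeper page - should be blocked
}
for path, should_block in tests.items():
    blocked = bool(pattern.match(path))
    print(path, "blocked" if blocked else "crawlable", "OK" if blocked == should_block else "CHECK")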

Excessive Canonicals

Ever since the advent of the rel=canonical meta tag, SEOs have used it enthusiastically to ensure that the right pages are included in Google’s index. Personally I like the canonical tag too, as it can solve many different issues and prevent other problems from arising.

But there’s a downside to using rel=canonicals: they’re almost too easy. Because implementing a rel=canonical tag can be such a blunt instrument, many SEOs use it without realising the true repercussions. It’s like using a hammer for all your DIY, when sometimes you should actually use a screwdriver instead.

Take the pagination issue I described above. Many SEOs would not consider using a robots.txt block or increasing the number of items per page. Instead they’d just canonicalise these paginated listings back to a main category page and consider the problem solved.

And from a pure indexing perspective, it is solved; the canonical tag will ensure these millions of paginated pages will not appear in Google's index. But the underlying issue – a massive waste of crawl budget – is entirely unaffected. Google still has to crawl these millions of pages, only then to be told by the rel=canonical tag that no, actually, you don't need to index this page at all, see the canonical page instead, thanks for visiting, kthxbye.

DeepCrawl non-indexable pages

Before implementing a rel=canonical tag, you have to ask yourself whether it actually addresses the underlying issue, or whether it’s a slap-dash fix that serves as a mere cosmetic cover-up for the problem. Canonical tags only work if Google indexes the page and sees the rel=canonical there, and that means it’ll never address crawl optimisation issues. Canonical tags are for index issues, not crawl issues.

In my Three Pillars of SEO approach, the first Technology pillar aligns with Google’s crawl process, and the second Relevance pillar aligns with the search engine’s indexer process. For me, canonical tags help solve relevance issues, by ensuring identical content on different URLs does not compete with itself. Canonical tags are never a solution for crawl issues.

Crawl issues are addressed by ensuring Google has less work to do, whereas canonicals generate more work for Google; to properly react to a canonical tag, Google has to crawl the duplicate URL as well as the original URL.

The same goes for the noindex meta tag – Google has to see it, i.e. crawl it, before it can act on it. It is therefore never a fix for crawl efficiency issues.

In my view, crawl issues are only truly solved by ensuring Google requires less effort to crawl your website. This is accomplished by an optimised site architecture, effective robots.txt blocking, and minimal wastage from additional crawl sources like XML sitemaps.

Just The Start

The three issues above are by no means the only technical SEO elements that impact crawl efficiency – there are many more relevant aspects, such as load speed, page weight, server responses, etc. If you're attending Pubcon this year, be sure to catch the SEO Tech Masters session on Tuesday where I'll be speaking about crawl optimisation alongside Dave Rohrer and Michael Gray.

I hope that was useful, and if you’ve any comments or questions about crawl optimisation, please do leave a comment or catch me on Twitter: @badams.

Post from Barry Adams

40 DeepCrawl tweaks to make a website soar in Google Search


How to understand and optimize your site’s search signals

Optimizing a website for your target audience and search engines requires gathering the right data and understanding its significance. That can be a challenge for large websites, but there is an enterprise-level solution: DeepCrawl. With DeepCrawl, sites can be crawled in a similar way to search engine bots. More importantly, sites can be seen and understood from a search engine's point of view. And it takes little effort to optimize search signals and make a site visible in Google's organic search. Here are 40 small steps to get there!

1. Find Duplicate Content

01

Duplicate content is an issue for search engines and users alike. Users don’t appreciate repeated content that adds no additional value, while search engines can be confused by duplicate content, and fail to rank the original source as intended. You can help your trust perception on both fronts by finding duplicate titles, descriptions, body content, and URLs with DeepCrawl. Click through the post-audit content tab in your crawl report to see all web pages with doppelganger-style content.

DeepCrawl scans your project domains for duplicate content as part of its default practices during web and universal crawls. Adjust sensitivity to duplicated content in the report settings tab of your project’s main dashboard before you launch your next crawl. Easy!

2. Identify Duplicate Pages

Duplicate pages

Duplication is a very common issue, and one that can be the decisive factor when it comes to authority. Using DeepCrawl, you can view a full list of duplicate pages to get an accurate picture of the problem. This helps you consolidate your pages so your site funnels visitors to the right information, giving your domain a better chance at conversion and reducing the number of pages competing for the same search rankings.

How To Do It:

  1. Click “Add Project” from your main dashboard.
  2. Pick the “web crawl” type to tell DeepCrawl to scan only your website at all its levels.
  3. Review your site’s duplicate pages from the “issues” tab located in the left panel of your main reporting dashboard once the crawl finishes.

3. Optimize Meta Descriptions

Meta descriptions

Meta descriptions can greatly influence click-through rates on your site, leading to more traffic and conversions. Duplicate and inconsistent descriptions can negatively impact user experience, which is why you want to prioritize fixes in this area. Through the content report tab, DeepCrawl gives you an accurate count of duplicate, missing and short meta descriptions. This lets you identify problem areas, and turn them into positive search signals that work for, rather than against, your site.      

4. Optimize Image Tags

Image alt tags

Google Image Search is a huge opportunity to claim SERP real estate. Missing image alt tags are organic traffic opportunities lost. Using DeepCrawl’s custom extraction tool, you and your team can set crawls to target images and audit their corresponding alt tags. You can access this feature through the Advanced Settings tab in your project’s main dashboard.  

How To Do It:

  1. Create custom extraction rules using Regular Expressions syntax.
  2. Hint: try /(<img(?!.*?alt=(["']).*?\2)[^>]*)(>)/ to catch images whose alt attribute is missing or malformed (a quick local test of this pattern is sketched after this list).
  3. Paste your code into "Extraction Regex" from the Advanced Settings tab in your projects dashboard.
  4. Check your reports from your projects dashboard when the crawl completes. DeepCrawl gives two reports when using this setting: URLs that matched at least one of your rules and URLs that returned no rule matches.
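
As mentioned in step 2, it can be worth testing the pattern locally before pasting it into DeepCrawl; the sketch below runs it over a few made-up tags. Note that an empty alt="" still counts as a quoted alt and is not flagged by this pattern.

# Quick local test of the alt-attribute extraction pattern.
import re

pattern = re.compile(r"""(<img(?!.*?alt=(["']).*?\2)[^>]*)(>)""", re.IGNORECASE)

samples = [
    '<img src="a.jpg" alt="product photo">',  # quoted alt present - not flagged
    '<img src="b.jpg">',                      # alt missing - flagged
    '<img src="c.jpg" alt=photo>',            # unquoted alt - flagged
]
for tag in samples:
    print("flagged" if pattern.search(tag) else "ok     ", tag)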

5. Crawl Your Site like Googlebot

Crawl urls

Crawl your website just like search engine bots do. Getting a comprehensive report of every URL on your website is a mandatory component of regular maintenance. With DeepCrawl, you can dig into your site without slowing its performance. Because the crawler is cloud-based, there’s minimal impact on your website while the crawl is in progress. Choose “universal crawl” to also crawl your most important conversion pages and XML sitemaps with a single click.

6. Discover Potentially Non-indexable Pages

Non-indexable Pages

Improve user experience on your site by identifying non-indexable pages that are either wrongly canonicalized – and therefore potentially missed opportunities – or wasting precious crawl budget. In DeepCrawl, from the content tab of your completed crawl, you can review every non-indexed page on your website, broken down by the type of non-indexation (e.g. nofollowed, canonicalized or noindexed).

7. Compare Previous Crawls

Track changes

As part of its native feature set, DeepCrawl compares page-level, indexation, HTTP status and non-indexation changes against your previous crawls to build historical data for your organisation. This view helps you identify areas that need immediate attention, including server response errors and page indexation issues, as well as areas that show steady improvement.

In addition, get a visual display of how your site’s web properties are improving from crawl to crawl with this project-level view. You can get info for up to 20 concurrent projects right from the monitoring tab in your main project dashboard. DeepCrawl caches your site data permanently, meaning you can compare how your website progressed from crawl to crawl.

Compare websites

How To Do It:

  1. Download crawl data from a finished report in .xls or .pdf format.
  2. Add client logo or business information to the report.
  3. Serve data to your client that’s formatted to look as though it came directly from your shop.

8. Run Multiple Crawls at Once

Multiple Crawls

Resource-draining site audits are a thing of the past. Thanks to cloud-based servers, you can run up to 20 active crawls spanning millions of URLs at once and still use your computer to do whatever other tasks you need it for… And best of all, all the data is backed up in the cloud.      

9. Avoid Panda, Manage Thin Content

Manage Thin Content

Search engines have entire algorithm updates focused on identifying thin content pages. Track down thin content on your website by using DeepCrawl's content audit tab. Clicking the "min content size" section gives you every URL whose content is below three kilobytes. This filter gives your team a list of URLs which serves as a starting point for further investigation. Excluding lean content pages from being indexed, or enriching their content, can help improve the website from both a user experience and a search engine optimization point of view.

How To Do It:   

  1. From the report settings tab, adjust the minimum content size by kilobytes.
  2. If you do not change the setting, DeepCrawl will scan your URLs for the default min content size of three kilobytes.
  3. Fine tune as necessary.

10. Crawl Massive Sites

Crawl Massive Sites

You may have to crawl a website that spans over 1 million URLs. The good news is that DeepCrawl can run audits for up to 3 million URLs per crawl, giving you the powerful tool your site size demands.

How To Do It:

  1. From the crawl settings tab in your projects dashboard, adjust the crawl limit to suit your target domain’s total URLs.
  2. Crawl up to 1 million URLs using prefabricated settings in DeepCrawl’s dropdown “crawl limits” menu.
  3. For a custom crawl, select “custom” from the dropdown menu and adjust max URLs and crawl depth to suit your reporting needs.
  4. Crawl Limit: 3 million URLs per crawl.    

11. Test Domain Migration

Test Domain Migration

Newly migrated websites can experience server-side issues and unexpected URL complications that can lead to page display errors and downtime. Check status codes post-migration from your project’s reporting dashboard. There you can see the total number of non-200 status codes, including 5xx and 4xx errors, DeepCrawl detected during the platform’s most recent audit. Use this information to see if URLs that your team redirected are working as intended.     

12. Track Migration Changes

Track Migration Changes

Website migration is often a painstaking effort involving multiple development teams, technical SEOs, and decision makers. Use DeepCrawl to compare your site’s post-migration structure to a previous crawl dating back before the move. You can choose to compare the newest crawl to any previously-run data set to see if teams missed their assignments or if old problems managed to make their way into the site’s new web iteration.

13. Find 404 Errors

Find 404 Errors

Correcting 404 errors helps search crawlers navigate your pages with less difficulty and reduces the chance that searchers land on content that serves an "oops!" message rather than products. Find 404s using DeepCrawl from your report's main audit dashboard. One click gives you a full list of all 4xx errors on the site at the time of the audit, including page title, URL, source code and the page where the link to the 404 was found.

14. Check Redirects

Check Redirects

Check the status of temporary and permanent redirects on site by clicking through the “non-200 status codes” section on your project dashboard. Download 301 and 302 redirects into .CSV files for easy sorting, or share a project link with team members to start the revision process.  

15. Monitor Trends Between Crawls

Monitor Trends Between Crawls

Tracking changes between crawls gives you powerful data to gauge site trends, emerging issues, and potential opportunities. You can manually set DeepCrawl to compare a new crawl to its previous audit of the same site from the “crawls” tab in the project management section. Once finished, the new crawl shows your improved stats in green and potential trouble areas in red. The overview includes link data, HTML ratio and max load times among others.     

16. Check Pagination

Check Pagination

Pagination is crucial for large websites with many items (such as e-commerce sites), helping to make sure the right pages display for the relevant categories. Find paginated URLs in the "crawled URLs" section of your audit's overview dashboard. From this section, your team can check rel=next and rel=prev tags for accuracy while vetting individual URLs to make sure they're the intended targets.
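
If you want to spot-check a handful of paginated URLs by hand, a few lines of Python will pull out the rel tags for comparison with what DeepCrawl reports; the URL below is a placeholder and the regex only looks for rel values declared inside <link> tags.

# Print the rel="next" / rel="prev" link tags found on a paginated URL.
import re
import requests

html = requests.get("https://www.example.com/category/?page=3", timeout=10).text

for rel in ("next", "prev"):
    tags = re.findall(r'<link[^>]+rel=["\']%s["\'][^>]*>' % rel, html, re.IGNORECASE)
    print(rel, tags or "not found")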

17. Find Failed URLs

Find Failed URLs

When a URL fails on your website, DeepCrawl’s spider can’t reach it to render the page, which means users and search engines probably can’t get there either. Investigate connection errors and other issues that can cause failed URLs through your project’s main reports dashboard. Finding the source of these issues can help you improve your site’s user experience by improving server response times and reducing time-out connection errors. You can find more information about Connection errors in DeepCrawl’s guide.

18. Crawl Sitemaps for Errors

Crawl Sitemaps for Errors

Problems in your XML sitemap can lead to delays in search engines identifying and indexing your important pages. If you’re dealing with a large domain with a million URLs, finding that one bad URL can be a challenge. Put DeepCrawl to work with the platform’s universal crawl feature. From this tab in your project reporting dashboard, review URLs missing from your sitemap, all pages included in your sitemaps, broken sitemaps, and sitemap URLs that redirect.  

19. Improve Your Crawls with Google Analytics Data

Improve Your Crawls with Google Analytics Data

Validating your Google Analytics account when you create a new project gives DeepCrawl the ability to overlay organic traffic and total visits on your individual URLs and paths. Seeing this data helps with prioritizing changes coming out of your site audit. For example, if you notice low organic traffic to product pages that have thin or duplicate content, that may serve as a signal to speed up revamping those URLs.   

How To Do It:

  1. Select a universal crawl when setting up your new project.
  2. Click save and click over to the “analytics settings” that now appears in the top menu.
  3. Click “add new analytics account” located in the top left of the dashboard.
  4. Enter your Google Analytics name and password to sync your data permissions within the DeepCrawl platform.
  5. Hit save and DeepCrawl will pull analytics data for all domains you have permission to view on current and future crawls.   

20. Verify Social Tags

Verify Social Tags

To increase share rates from your blog posts on Facebook and Twitter and thereby enhance your community outreach activities, avoid any errors in your social tag markup. View Twitter Cards and Open Graph titles, images, and URLs to see what needs fixing.  

21. Test Individual URLs

Test Individual URLs

Granular reporting over thousands of landing pages is difficult to grasp, but DeepCrawl makes the process digestible with an elegant statistical breakdown. View unique pages by clicking the link of the same name (unique pages) in your dashboard overview. From here, you can see a detailed breakdown of page health, including external links, word count, HTML size, and nofollow tag use.    

22. Verify Canonical Tags

Verify Canonical Tags

Incorrect canonical tags lead crawlers to ignore canonicals altogether, leaving your site in danger of duplication and search engines having trouble correctly identifying your content. View canonicalized pages in your DeepCrawl audit by clicking through from the “non-indexable pages” section of the project’s dashboard. The platform gives you the canonical’s location, HTML, title tag, and URLs to help with verification.    

23. Clean Up Page Headers

Clean Up Page Headers

Cluttered page headers can impair the click through rate if users’ expectations are not being managed well. CTRs can vary by wide margins, which makes it difficult to chart the most effective path to conversion. When you run a universal crawl, make sure to integrate your Google Analytics account. This will help you gain a deeper insight, by combining crawl data with powerful analytics data including bounce rate, time on page, and load times.   

24. Make Landing Pages Awesome

 

Make Landing Pages Awesome

Page-level elements, including H1 tags, compelling content, and proper pagination, are essential parts of your site's marketing funnel that help turn visitors into leads. Use DeepCrawl's metrics to help improve engagement and conversion. Find content that is missing key parts – including H1s, H2s, sitemap inclusion, and social markup – through the "unique pages" tab, so you can engage visitors faster, deliver your site's message more clearly, and increase the chances of conversion.

25. Prioritize Site Issues

Prioritize Site Issues

Knowing what to address first in the aftermath of a sizable audit can be a challenging task for any site owner. Using the project management tab, you can assign tasks emerging from the audit to your team members and give each task a priority rating. This system helps you track actions from your site audit by priority and assigned team member through the “all issues” tab, accessible from any page. You can view the age of each task, leave notes for team members, and mark projects as fixed, all from the same screen in the DeepCrawl projects platform. For assignments with several moving parts, including 404 cleanups and page title correction, projects count down remaining URLs until they reach zero.

Prioritize Site Issues

26. Check for Thin Content

Check for Thin Content

Clean, efficient code leads to fast-loading sites – a big advantage for search engines and users alike. Search engines tend to avoid serving pages with thin content and bloated HTML in organic listings. Investigate pages that fall below DeepCrawl's min content/HTML ratio by clicking through the tab of the same name in your project's main reporting dashboard. View pages by title and URL, and export the list as a .CSV or a shared link.

27. Crawl as Googlebot or Your Own User Agent

Crawl as Googlebot or Your Own User Agent

If your site auditor can’t crawl your pages as Googlebot, then you have no prayer of seeing your domain through the search giant’s eyes. DeepCrawl can mimic spiders from other search engines, social networks, and browsers, including Firefox, Bing, and Facebook. Select your user agent in the advanced settings tab after you choose the website you want to audit.

28. Discover Disallowed URLs

Discover Disallowed URLs

Never crawled (or disallowed) URLs may contain broken links, corrupted files, and poorly coded HTML that can impact site performance. View disallowed pages from the overview section of your site’s crawl report. From this section, you can see all disallowed pages and their corresponding URLs and then get the picture on which URLs may not be crawled by search engines.  

29. Validate Page Indexation

Validate Page Indexation

DeepCrawl gives you the versatility to get high-level and granular views of your indexed and non-indexed pages across your entire domain. Check if search engines can see your site’s most important pages through this indexation tab, which sits in the main navigation of your reporting dashboard. Investigate no-indexed pages to make sure you’re only blocking search engines from URLs when it’s absolutely necessary.  

30. Sniff Out Troublesome Body Content

Sniff Out Troublesome Body Content

Troublesome content diminishes user confidence and causes them to generate negative behavioral signals that are recognized by search engines. Review your page-level body content after a web crawl by checking out missing H1 tags, spotting pages with threadbare word count, and digging into duplication. DeepCrawl gives you a scalpel’s precision in pinpointing the problems right down to individual landing pages, which enables you to direct your team precisely to the source of the problem.

How To Do It:

  1. From the report settings tab in your project dashboard, set your “duplicate precision” setting between 1.0 (most stringent) and 5.0 (least stringent).
  2. The default setting for duplicate precision is 2.0, if you decide to run your crawl as normal. If you do not change your settings, the crawl will run at this level of content scrutiny.
  3. Run your web crawl.
  4. Review results for duplicate body content, missing tags, and poor optimization as shown above.

31. Set Max Content Size

 Set Max Content Size

Pages that have a high word count might serve user intent and drive conversion, but they can also cause user confusion. Use DeepCrawl’s report settings tab to change max content size to reflect your ideal word count. After your crawl finishes, head to the “max content size” section from the audit dashboard to find pages that exceed your established limit.

How To Do It:

  1. In the report settings tab of your project dashboard, adjust maximum content size by KB (kilobytes).
  2. Hint: a kilobyte of plain text holds roughly 170 words (see the arithmetic sketch after this list).
  3. Check content that exceeded your KB limit from the finished crawl report.
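
The word-count hint from step 2 is easy to sanity-check; the sample sentence below is arbitrary, so treat the output as a ballpark figure rather than an exact conversion.

# Rough kilobytes-to-words arithmetic for plain English text.
sample = "Average English words run to roughly five or six characters plus a space. "
chars_per_word = len(sample) / len(sample.split())
print(round(1024 / chars_per_word), "words per KB")       # roughly 170-180
print(round(3 * 1024 / chars_per_word), "words in 3 KB")  # the thin-content default, ~500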

32. Fixing Broken Links

Fixing Broken Links

Having too many broken links to external resources on your website can lead to a bad user experience, as well as give the impression your website is out of date. Use the validation tab from your web crawl dashboard to find all external links identified by DeepCrawl’s spider. Target followed links, sorting by http status, to see URLs that no longer return 200 “OK” statuses. These are your broken links. Go fix them.   

33. Schedule Crawls

 

Schedule Crawls

Use DeepCrawl's report scheduling feature to auto-run future crawls, setting their frequency, start date, and even time of day. This feature can also be used to avoid sensitive or busy server times, or to have monthly reports emailed directly to you or your team.

How To Do It:

  1. From the projects dashboard, click on the website you'd like to set a reporting schedule for.
  2. Click the scheduling tab.
  3. Set the frequency of your automated crawl, the start date, and the hour you'd like the crawl to start.
  4. Check back at the appointed time for your completed crawl.

34. Give Your Dev Site a Checkup

Give Your Dev Site a Checkup

Your development or staging site needs a checkup just like your production website. Limit your crawl to your development URL and use the universal or web crawl settings to dig into status codes and crawl depth, locating failed URLs, non-indexable pages and non-200 codes by level.

How To Do It:

  1. Run a universal crawl to capture all URLs in your target domain, including sitemaps.
  2. When the crawl finishes, check your non-200 status codes and web crawl depth from the reports dashboard.
  3. Examine the crawl depth chart to see where errors in your development site’s architecture occur most often.   

35. Control Crawl Speed

Control Crawl Speed

At crawl speeds of up to twenty URLs per second, DeepCrawl boasts one of the most nimble audit spiders available for online marketers working with enterprise-level domains. Sometimes, however, speed isn't what matters most; accuracy is. Alter the crawl speed by setting how many URLs are crawled per second, or switch from dynamic IP addresses to static, stealth crawl, or location-based IPs from the advanced settings tab during initial audit setup.

36. Custom Extraction

Custom Extraction

Add custom rules to your website audit with DeepCrawl’s Custom Extraction tool. You can tell the crawler to perform a wide array of tasks, including paying more attention to social tags, finding URLs that match a certain criteria, verifying AppIndexing deeplinks, or targeting an analytics tracking code to validate product information across category pages. For more information about Custom Extraction syntax and coding, check out this tutorial published by the DeepCrawl team.  

How To Do It:

  1. Enter your Regular Expressions syntax into the Extraction Regex box from the advanced settings tab on your DeepCrawl projects dashboard.
  2. Click the box underneath the Extraction Regex box if you’d like DeepCrawl to exclude HTML tags from your crawl results.
  3. View your results by checking the Custom Extraction tab in your project crawl dashboard.   

37. Restrict Crawls

Restrict Crawls

Restrict crawls for any site using DeepCrawl’s max URL settings or the list crawl feature. The max setting places a cap on the total number of pages crawled, while the list feature restricts access to a set of URLs you upload before the audit launches.

How to Do It (Include URLs):

  1. Go to the "Include Only URLs" section in the Advanced Settings tab when setting up your new project's crawl.
  2. Add the URL paths you want to include in the box provided, one per line. DeepCrawl will include every URL containing the paths you enter when it begins the crawl.
  3. e.g. /practice-areas/ and /category/ on separate lines

How to Do It (Exclude URLs):

  1. Navigate to the “Excluded URLs” section in the advanced settings tab of your project setup dashboard.
  2. Add URL paths you want to exclude from your crawl by writing them on single lines in the box provided using the method as outlined above.
  3. Important note: exclude rules override any include rules you set for your crawl.
  4. Bonus Exclude: Stop DeepCrawl from crawling script-centric URLs by adding “*.php” and “*.cgi” into the exclude URLs field.

38. Check Implementation of Structured Data

Check Implementation of Structured Data

Access Google’s own Structured Data Testing Tool to validate Schema.org markup by adding a line or two of code to your audit through DeepCrawl’s Custom Extraction. This tool helps you see how your rich snippets may appear in search results, where errors in your markup prevent it from being displayed, and whether or not Google interprets your code, including rel=publisher and product reviews, correctly.  

How To Do It:

  1. Choose the “list crawl” option during your project setup.
  2. Enter a list of URLs you want to validate into the box provided. If you have over 2,000 URLs in your list, you’ll need to upload them as a .txt file.  
  3. Add the Custom Extraction code found here to get DeepCrawl to recognize Schema markup tags and add the particular line of code you want for your crawl: ratings, reviews, person, breadcrumbs, etc.
  4. Run Crawl. You can find the info DeepCrawl gleaned on your site’s structured markup by checking the Extraction tab from the reporting dashboard.

39. Verify HREFLANG Tags

Verify HREFLANG Tags

If your website is available in multiple languages, you need to validate the site’s HREFLANG tags. You can test HREFLANG tags through the validation tab in your universal crawl dashboard.

If you have HREFLANG tags in your sitemaps, be sure to use the universal crawl, as this includes crawling your XML sitemaps. (A minimal standalone reciprocity check is sketched after the steps below.)

How To Do It:

  1. Run a universal crawl on your targeted website from the projects dashboard.
  2. Once the crawl finishes, head to the reports section under “validation.”
  3. From there you can view pages with HREFLANG tags and pages without them.
  4. View URLs flagged as inconsistent and alternative URLs for all country variations on the domain.   
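
As referenced above, here is a minimal standalone reciprocity check. The page URL is a placeholder, the regex assumes hreflang appears before href inside each <link> tag, and sites that declare hreflang in sitemaps or HTTP headers would need a different extraction step.

# Check that a page's hreflang alternates resolve and point back at it.
import re
import requests

def hreflangs(url):
    html = requests.get(url, timeout=10).text
    # Assumes hreflang is declared before href within the <link> tag.
    return dict(re.findall(
        r'<link[^>]+hreflang=["\']([^"\']+)["\'][^>]+href=["\']([^"\']+)["\']',
        html, re.IGNORECASE))

page = "https://www.example.com/en/widgets/"
for lang, alt_url in hreflangs(page).items():
    back = hreflangs(alt_url).values()
    print(lang, alt_url, "reciprocal" if page in back else "missing return tag")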

40. Share Read-Only Reports

Share Read-Only Reports

This is one of my favorite options: Sharing reports with C-levels and other decision makers without giving them access to other sensitive projects is easily doable with DeepCrawl. Generate a “read-only” URL to give them a high-level view of the site crawl as a whole or to kick out a link that focuses on a granular section, including content, indexation and validation.  

Last but not least

This top 40 list is by no means complete. In fact, there are many more ways to use DeepCrawl to enhance site performance, and the tool undergoes constant improvement. This list is a starting point to understanding your website as Google does and making improvements for users and search engines alike. DeepCrawl is a very powerful tool, and conclusions drawn from the data it provides must be grounded in experience and expertise. Applied to full effect, DeepCrawl can take an online business to the next level and contribute significantly to managing user expectations, building the brand and, most importantly, driving conversions.

What are your favourite DeepCrawl features? Your opinion matters, share it in the comments section below.

Post from Fili Wiese

46 Updated DeepCrawl Tweaks to Make Your Website Soar in Google Search Results


Optimizing a website for your target audience can be tricky. Optimizing a large website for your target audience can be even trickier. Since the last article, DeepCrawl has launched a significant update built on a brand new crawler that is a lot faster, and there are now a bunch of new reports available.

Below are 46 new and updated tips and tricks to optimise the search signals for your websites in Google’s organic search.

Spend more time making recommendations and changes

(and less time trawling through data)

1. Crawl MASSIVE Sites

If you have a really large website with millions of pages you can scan unlimited amounts with the custom setting – so long as you have enough credits in your account!

How To Do It:

  • From the crawl limits in step 3 of crawl set up, adjust the crawl limit to suit your target domain’s total URLs
  • Crawl up to 10 million using pre-fabricated options from the dropdown menu
  • For a custom crawl, select “custom” from the dropdown menu and adjust max URLs and crawl depth to suit your reporting needs

01-crawl-massive-sites

2. Compare Previous Crawls

Built into DeepCrawl is the ability to compare your current crawl with the most recent crawl. This is useful for tracking changes as they are implemented, and for providing rich data for your organization to show the (hopefully positive) impact of your SEO changes on the site. You'll also be able to see all of your previous crawls.

How To Do It:

  • On step 4 of your crawl set up, you can select to compare your new crawl to a previous one

02-compare-previous-crawls

3. Monitor Trends Between Crawls

Tracking changes between crawls gives you powerful data to gauge site trends, get ahead of any emerging issues, and spot potential opportunities. DeepCrawl highlights these for you through the Added, Removed, and Missing reports. These are populated once a project has been re-crawled, and appear in every metric reported upon.

Once a follow-up crawl is finished, the new crawl shows your improved stats in green and potential trouble areas in red.

In addition to calculating the URLs which are relevant to a report, DeepCrawl also calculates the changes in URLs between crawls. If a URL appears in a report where it did not appear in the previous crawl, it will be included in the ‘Added’ report. If the URL appeared in the report in the previous crawl, and is present in the current crawl but no longer in this specific report, it is reported within the ‘Removed’ report. If the URL was in the previous crawl but was not found at all in the current crawl, it is included in the ‘Missing’ report (e.g. the URL may have been unlinked since it was last crawled).

03-monitor-trends-between-crawls
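
To make the Added / Removed / Missing distinction concrete, here is a minimal sketch of the underlying set logic in Python. It assumes you have exported the relevant URL lists yourself; the URLs are purely illustrative and this is not DeepCrawl’s own code.

  # Illustrative set logic behind Added / Removed / Missing (not DeepCrawl's implementation).
  report_prev = {"/a", "/b", "/c"}        # URLs in this report in the previous crawl
  report_now = {"/b", "/d"}               # URLs in this report in the current crawl
  crawl_now = {"/b", "/c", "/d", "/e"}    # every URL found anywhere in the current crawl

  added = report_now - report_prev                    # newly in the report
  removed = (report_prev - report_now) & crawl_now    # still crawled, but no longer in this report
  missing = report_prev - crawl_now                   # not found anywhere in the current crawl

  print(sorted(added), sorted(removed), sorted(missing))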

4. Filters, filters, filters!

DeepCrawl reports are highly customisable. With well over 120 filtering options, you can really drill down into the data and isolate exactly what you are looking for in relation to your specific SEO strategy and needs, whether that’s pages with high load times or GA bounce rates, missing social tags, broken JS/CSS, or a low content-to-HTML ratio. You can even save your filters by creating tasks within reports, so the SEO issues that matter most to you will be flagged every time you recrawl, making the thorns in your side, as well as your progress, easy to monitor!

04-filters-filters-filters

5. Assign Tasks, Ease Workflows

After you’ve gone through a big audit, assigning and communicating all your to-dos can be challenging for any site owner. By using the built-in task manager, a button at the top right of each area of your crawl, you can assign tasks to your team members as you go along and give each task a priority. This system helps you track actions from your site audit in the Issues area, easing team workflows by showing you what is outstanding and what you’ve achieved so far. You can also add deadlines and mark projects as fixed, all from the same screen in the DeepCrawl projects platform. Collaborators receiving tasks do not need a DeepCrawl account themselves, as they’ll have guest access to the specific crawl report that is shared.

05-assign-issues-to-team-members

6. Share Read-Only Reports

This is one of my favourite options: sharing reports with C-levels and other decision makers without giving them access to other sensitive projects is easily doable with DeepCrawl. Generate an expiring URL to give them a high-level view of the site crawl as a whole or to kick out a PDF that focuses on a granular section, including content, indexation and validation. This also doubles up for sharing links externally when prospecting clients, especially if you’ve branded your DeepCrawl reports with your name and company logo.

06-share-read-only-reports

7. DeepCrawl is now Responsive

With DeepCrawl’s new responsive design, crawl reports look great across devices. This means you can also set off crawls on the go, which is useful for a quick site audit or pitch. While a crawl is running from the palm of your hand, you can edit it and monitor it in real time, in case you need to alter the speed or crawl levels.

07-deepcrawl-is-now-responsive

8. Brand your Crawl Reports

Are you a freelancer or an agency? In order to brand DeepCrawl reports with your business information/logo (or a client’s logo), and serve data to your team or client that is formatted to look as though it came directly from your shop, you can white-label them.

How To Do It:

  • Go to Account Settings
  • Select from Theme, Header colour, Menu colour, Logo and Custom proxy
  • Make the report yours!

08-brand-your-crawl-reports

Optimise your Crawl Budget

9. Crawl Your Site like Googlebot

Crawl your website just like search engine bots do. Getting a comprehensive report of every URL on your website is a mandatory component for regular maintenance. Crawl and compare your website, sitemap and landing pages to identify orphan pages, optimise your sitemaps and prioritise your workload.

09-crawl-your-site-like-googlebot

10. Optimise your Indexation

DeepCrawl gives you the versatility to get high-level and granular views of indexation across your entire domain. Check if search engines can see your site’s most important pages from the Indexation report, which sits just under the Dashboard on the left hand navigation area. Investigate no-indexed pages to make sure you’re only blocking search engines from URLs when it’s absolutely necessary.

10-optimise-your-indexation

11. Discover Potentially Non-indexable Pages

To stop you wasting crawl budget and/or to identify wrongly canonicalised content, the Indexation reports show you a list of all no-indexed pages on your site and give you details about their meta tags, e.g. nofollowed, rel canonical, noindex. Pages with noindex directives in the robots meta tag, robots.txt or the X-Robots-Tag header should be reviewed, as they can’t be indexed by search engines.

11-non-indexable-pages

12. Discover Disallowed URLs

The Disallowed URLs report, nested under Uncrawled URLs within Indexation, contains all the URLs that were disallowed in the robots.txt file on the live site, or by a custom robots.txt file in the Advanced Settings. These URLs cannot be crawled by Googlebot, which prevents them from appearing in search results, so they should be reviewed to ensure that none of your valuable pages are being disallowed. It’s good to get an idea of which URLs may not be crawled by search engines.

12-check-disallowed-urls-for-errors

13. Check Pagination

Pagination is crucial for large websites with lots of products or content, as it makes sure the right pages display for the relevant categories on the site. You’ll find First Pages in a series in the pagination menu, and you can also view unlinked paginated pages, which is really useful for hunting down pages that might have rel=next and rel=prev implemented wrongly.

13-hunt-down-pages-that-have-relnext-and-relprev-implemented-wrongly-copy

Understand your Technical Architecture

14. Introducing Unique Internal Links & Unique Broken Links

The Unique Internal Links report shows you the number of instances of every anchor text DeepCrawl finds in your crawl, so you can maximise your internal linking structure and spread your link juice to rank for more terms! The links in this report can be analysed by anchor text, as well as by the status of the target URL.

DeepCrawl’s Unique Broken Links report shows your site’s links with unique anchor text and target URL where the target URL returns a 4xx or 5xx status. Naturally, these links can result in poor UX and waste crawl budget, so they can be updated to a new target page or removed from the source page. This nifty new report is unique to DeepCrawl!

14-introducing-unique-internal-links-_-unique-broken-links

15. Find Broken Links

Fixing 404 errors reduces the chance of users landing on broken pages and makes it easier on the crawlers, so they can find the most important content on your site more easily. Find 404s in DeepCrawl’s Non-200 Pages report. This gives you a full list of all 4xx errors on the site at the time of the audit, including the page title, URL, source code and the link on the page found to return a 404.

You’ll also find pages with 5xx errors, any unauthorised pages, non-301 redirects, 301 redirects, and uncategorised HTTP response codes, that is, pages returning a text/html content type and an HTTP response code which isn’t defined by the W3C. These pages won’t be indexed by search engines and their body content will be ignored.

15-find-broken-links

16. Fix Broken Links

Having too many broken links to internal and external resources on your website can lead to a bad user experience, as well as give the impression your website is out of date. Go to the Broken Pages report from the left hand menu to find them. Go fix them.

16-fixing-broken-links

17. Check Redirects – Including Redirection Loops

Check the status of temporary and permanent redirects on site by checking the Non-200 Status report, where your redirects are nested. You can download 301 and 302 redirects or share a project link with team members to start the revision process.

You can also find your Broken Redirects, Max Redirects and Redirect Loops. The Max Redirects report defaults to showing pages that hop more than 4 times; this number can be customised on step 4 of your crawl set up in Report Settings, nested under Report Setup within the Advanced Settings.

The new Redirect Loop report contains URL chains which redirect back to themselves. These chains result in infinite loops, causing errors in web browsers for users and preventing crawling by search engines. In short, fixing them helps bots and users, and prevents the loss of important authority. Once found, you can update your redirect rules to prevent loops!

17-check-redirects-including-redirection-loops
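
If you want to sanity-check a handful of redirects outside of DeepCrawl, the sketch below follows a redirect chain by hand with the Python requests library, counting hops and flagging loops. The 4-hop limit mirrors the default mentioned above; the function name and URL are illustrative, not part of any DeepCrawl API.

  # Rough sketch: follow a redirect chain manually, count hops and detect loops.
  import requests

  def trace_redirects(url, max_hops=4):
      chain = []
      while len(chain) < max_hops:
          resp = requests.get(url, allow_redirects=False, timeout=10)
          if resp.status_code not in (301, 302, 303, 307, 308):
              return {"final": url, "hops": len(chain), "loop": False}
          chain.append(url)
          url = requests.compat.urljoin(url, resp.headers["Location"])
          if url in chain:
              return {"final": url, "hops": len(chain), "loop": True}
      return {"final": url, "hops": len(chain), "loop": False, "hit_max_hops": True}

  # Example (illustrative URL): print(trace_redirects("https://example.com/old-page"))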

18. Verify Canonical Tags

Incorrect canonical tags can waste crawl budget and cause the bots to ignore parts of your site, leaving your site in danger, as search engines may have trouble correctly identifying your content. View canonicalised pages in your DeepCrawl audit from the Indexation report. The Non-Indexable report gives you the canonical’s location and page details (like hreflang, links in/out, any duplicate content/size, fetch time etc).

18-see-all-canonical-tags-and-url-locations

19. Verify HREFLANG Tags

If your website is available in multiple languages, you need to validate the site’s HREFLANG tags. You can test HREFLANG tags through the validation tab in your universal crawl dashboard.

If you have HREFLANG tags in your sitemaps, be sure to include Web Crawl in your crawl set up, as this includes crawling your XML sitemaps. DeepCrawl reports on all HREFLANG combinations, working/broken, and/or unsupported links as well as pages without HREFLANG tags.

How To Do It:

  • The Configuration report gives you an overview of HREFLANG implementation
  • In the lower left menu, the HREFLANG section breaks down all the aspects of HREFLANG implementation into categorised pages

19-verify-hreflang-tags

20. Optimise Image Tags

By using the custom extraction tool you can extract a list of images that don’t have alt tags across the site which can help you gain valuable rankings in Google Image Search.

How To Do It:

  • Create custom extraction rules using Regular Expressions
  • Hint: Try /(<img(?!.*?alt=(['"]).*?\2)[^>]*)(>)/ to catch images that have alt tag errors or don’t have alt tags at all (a short local check is sketched after the screenshot below)
  • Paste your code into “Extraction Regex” from the Advanced Settings link on step 4 of your crawl set up
  • Check your reports from your projects dashboard when the crawl completes. DeepCrawl gives two reports when using this setting: URLs that followed at least one rule from your entered syntax and URLs that returned no rule matches

20-optimise-image-tags
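
As a rough local cross-check of the same idea, the sketch below pulls every <img> tag out of a page’s HTML with Python and flags those without a populated alt attribute. The URL is illustrative, and the regex is a simplification rather than DeepCrawl’s extraction engine.

  # Flag <img> tags with a missing or empty alt attribute (simplified check).
  import re
  import requests

  html = requests.get("https://example.com/", timeout=10).text   # illustrative URL
  img_tags = re.findall(r"<img\b[^>]*>", html, flags=re.IGNORECASE)
  flagged = [tag for tag in img_tags
             if not re.search(r'alt\s*=\s*["\'][^"\']+["\']', tag, flags=re.IGNORECASE)]
  for tag in flagged:
      print(tag)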

21. Is Your Site Mobile-Ready?

Since “Mobilegeddon”, SEOs have all become keenly aware of the constant growth of mobile usage around the world, where 70% of site traffic is from a mobile device. To optimise for the hands of the users holding those mobile devices, and for the search engines connecting them to your site, you have to send mobile users content in the best way possible, and fast!

DeepCrawl’s new Mobile report shows you whether pages have any mobile configuration, and if so whether they are configured responsively, or dynamically. The Mobile report also shows you any desktop mobile configurations, mobile app links, and any discouraged viewport types.

21-is-your-site-mobile-ready_

22. Migrating to HTTPS?

Google has been calling for “HTTPS everywhere” since 2014, and HTTPS is considered a ranking signal. It goes without saying that sooner or later most sites will need to switch to the secure protocol. By crawling both HTTP and HTTPS, DeepCrawl’s HTTPS report will show you:

  • HTTP resources on HTTPS pages
  • Pages with HSTS
  • HTTP pages with HSTS
  • HTTP pages
  • HTTPS pages

Highlighting any HTTP resources on HTTPS pages enables you to make sure your protocols are set up correctly, avoiding issues when search engines and browsers are determining whether your site is secure. Equally, your users won’t see a red lock in the address bar instead of a green one, and won’t get a browser warning telling them the site is insecure and to proceed with caution. That is not exactly what people want to see when they are about to make a purchase, because when they see it, they probably won’t…

22-migrating-to-https
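
For a quick spot check of a single page, the sketch below lists plain http:// resources referenced from an https:// page, which is the core of what the mixed-content report surfaces at scale. The URL and regex are illustrative and only catch src/href attributes.

  # List insecure http:// resources referenced by an https:// page (simplified).
  import re
  import requests

  page = "https://example.com/"   # illustrative URL
  html = requests.get(page, timeout=10).text
  insecure = re.findall(r'(?:src|href)\s*=\s*["\'](http://[^"\']+)["\']', html, flags=re.IGNORECASE)
  for resource in sorted(set(insecure)):
      print(resource)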

23. Find Unlinked Pages driving Traffic

DeepCrawl really gives you a holistic picture of your site’s architecture. You can incorporate up to 5 sources into each crawl. By combining a website crawl with analytics you’ll get a detailed gap analysis: URLs which have generated traffic but aren’t linked internally – also known as orphans – will be highlighted for you.

Pro tip: this works even better if you add backlinks and URL lists to your crawl too! You can link your Google Analytics account and sync 7 or 30 days of data, or you can manually upload up to 6 months’ worth of GA or any other analytics data (like Omniture) for that matter. This way you can work on your linking structure and optimise pages that would otherwise be missed opportunities.

23-find-unlinkedin-pages-driving-traffic

24. Sitemap Optimisation

You can opt to manually add sitemaps into your crawl and/or let DeepCrawl automatically find them for you. It’s worth noting that if DeepCrawl does not find them, then it’s likely that search engines won’t either! By including sitemaps into your website crawl, DeepCrawl identifies the pages that either aren’t linked internally or are missing from sitemaps.

By including analytics in the crawl you can also see which of these generate entry visits, revealing gaps in the sitemaps. This also sheds light on your internal linking structure by highlighting where the two don’t match up: you’ll see the specific URLs that are found in the sitemaps but aren’t linked, and likewise those that are linked but are not found in your sitemaps.

Performing a gap analysis to optimise your sitemaps with DeepCrawl enables you to visualise your site’s structure from multiple angles, and find all of your potential areas and opportunities for improvement. TIP: You can also use the tool as an XML sitemaps generator.

24-sitemap-optimisation
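
The gap analysis itself boils down to comparing two URL lists. Assuming you have exported the crawled URLs and the sitemap URLs to two plain text files (the file names below are made up), a minimal version looks like this:

  # Sitemap gap analysis as simple set arithmetic (file names are illustrative).
  crawled = {line.strip() for line in open("crawled_urls.txt") if line.strip()}
  sitemap = {line.strip() for line in open("sitemap_urls.txt") if line.strip()}

  in_sitemap_not_linked = sitemap - crawled   # candidates for better internal linking
  linked_not_in_sitemap = crawled - sitemap   # candidates to add to the sitemaps

  print(len(in_sitemap_not_linked), "sitemap URLs not found by the web crawl")
  print(len(linked_not_in_sitemap), "crawled URLs missing from the sitemaps")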

25. Control Crawl Speed

You can crawl at rates as fast as your hosting environment can handle, although this should be used with caution to avoid accidentally taking down a site. DeepCrawl boasts one of the most nimble audit spiders available for online marketers working with enterprise level domains.

Whilst the need for speed is understandable, accuracy is what matters most! That said, you can change crawl speeds, measured in URLs crawled per second, when setting up or even during a live crawl. Speeds range from 0 to 50 URLs crawled per second.

25-control-crawl-speed

Is your Content being found?

26. Identifying Duplicate Content

Duplicate content is an ongoing issue for search engines and users alike, but it can be difficult to hunt down, especially on really, really large sites. These troublesome pages, however, are easily identified using DeepCrawl.

Amongst DeepCrawl’s duplicate reporting features lies the Duplicate Body Content report. Unlike the previous version of DeepCrawl, the new version doesn’t require the user to adjust sensitivity. All duplicate content is flagged, helping you avoid repeated content that can confuse search engines, cause original sources to fail to rank, and doesn’t really give your readership added value.

Finding Duplicate Titles, Descriptions, Body Content, and URLs with DeepCrawl is an effortless time-saver.

26-identifying-duplicate-content
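
Conceptually, exact duplicate detection is just grouping pages whose normalised body text hashes to the same value. The sketch below shows that idea on made-up data; DeepCrawl’s own matching is more sophisticated and also covers near-duplicates.

  # Group exact duplicates by hashing normalised body text (illustrative data).
  import hashlib
  from collections import defaultdict

  pages = {
      "/page-a": "the same body text here",
      "/page-b": "the same body text here",
      "/page-c": "completely different body text",
  }

  clusters = defaultdict(list)
  for url, body in pages.items():
      digest = hashlib.sha1(" ".join(body.lower().split()).encode("utf-8")).hexdigest()
      clusters[digest].append(url)

  duplicate_clusters = [urls for urls in clusters.values() if len(urls) > 1]
  print(duplicate_clusters)   # [['/page-a', '/page-b']]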

27. Identify Duplicate Pages, Clusters, Primary Duplicates & Introducing True Uniques

Clicking into Duplicate Pages from the dashboard gives you a list of all the duplicates found in your crawl, which you can easily download or share. DeepCrawl now also gives you a list of Duplicate Clusters, so you can look at groups of duplicates to try to find the cause or pattern behind these authority-draining pages.

There is also a new report entitled True Uniques. These pages have no duplicates in any way, shape or form, are the most likely to be indexed, and are naturally very important pages on your site.

Primary Duplicates do have duplicates coming off them – as the name implies – but have the highest internal link weight within each set of duplicated pages. Though signals like traffic and backlinks need to be reviewed to assess the most appropriate primary URL, these pages should be analysed, as they are the most likely to be indexed.

Duplicate Clusters are pages sharing an identical title and near-identical content with another page found in the same crawl. Duplicate pages often dilute authority signals and social shares, affecting potential performance and reducing crawl efficiency on your site. You can optimise clusters of these pages by removing internal links to their URLs and redirecting the duplicate URLs to the primary URL, or by adding canonical tags pointing to the primary version.

How To Do It:

  • Click “Add Project” from your main dashboard
  • Under the crawl depth setting tell DeepCrawl to scan your website at all its levels
  • Once the crawl has finished, review your site’s duplicate pages from the “issues” list on your main dashboard or search for ‘duplicate’ in the left nav search bar

27-identify-duplicate-pages-clusters-primary-duplicates-_-introducing-true-uniques

28. Sniff Out Troublesome Body Content

Troublesome content impacts UX and causes negative behavioural signals, like bouncing back to the search results. Review your page-level content after a web crawl by checking out empty or thin pages and digging into duplication. DeepCrawl gives you a scalpel’s precision in pinpointing problems right down to individual landing pages, which enables you to direct your team precisely to the source of the problem.

How To Do It:

  • Click on the Body Content report in the left hand menu
  • You can also select individual issues from the main dashboard
  • Assign tasks using the task manager or share with your team

28-sniff-out-troublesome-body-content

29. Check for Thin Content

Clean, efficient code leads to fast loading sites – a big advantage in search engines and for users. Search engines tend to avoid serving pages that have thin content and extensive HTML in organic listings. Investigate these pages easily from the Thin Pages area nested within the Content report.

29-find-pages-with-bad-html_content-ratios

30. Avoid Panda, Manage Thin Content

In a post-Panda world it’s always good to keep an eye on any potentially thin content which can negatively impact your rankings. DeepCrawl has a section dedicated to thin and empty pages in the body content reports.

The Thin Pages report will show all of your pages with less than the minimum content size specified in Advanced Settings > Report Settings (this defaults to 3 KB, and you can choose to customise it). Empty Pages are all your indexable pages with less content than the Content Size setting specified (defaulted to 0.5 kilobytes) in Advanced Settings > Report Settings.

How To Do It:

  • Typing content in the main menu will give you the body content report
  • Clicking on the list will give you a list of pages you can download or share

30-avoid-a-thin-content-penalty-now
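
The classification itself is a simple threshold check. The sketch below applies the default cut-offs mentioned above (3 KB for thin, 0.5 KB for empty) to some made-up content sizes; the numbers and URLs are illustrative only.

  # Classify pages as empty / thin / ok by content size in KB (default thresholds).
  THIN_KB = 3.0
  EMPTY_KB = 0.5

  content_sizes = {"/guide": 7.2, "/stub": 1.4, "/placeholder": 0.2}   # illustrative URL -> KB

  for url, kb in content_sizes.items():
      label = "empty" if kb < EMPTY_KB else "thin" if kb < THIN_KB else "ok"
      print(url, label)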

31. Optimise Page Titles & Meta Descriptions

Page titles and meta descriptions are often the first point of contact users have with your site when coming from the search results, and well-written, unique titles and descriptions can have a big impact on click-through rates and user experience. Through the Content report, DeepCrawl gives you an accurate count of duplicate, missing and short meta descriptions and titles.

31-optimise-page-title-_-meta-description
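
If you want to replicate the length checks on an exported list, the sketch below flags missing and overly long titles and descriptions. The cut-offs used (60 and 155 characters) are common rules of thumb rather than DeepCrawl’s exact settings, and the rows are made up.

  # Flag missing or overly long titles and meta descriptions (rule-of-thumb limits).
  MAX_TITLE = 60
  MAX_DESC = 155

  pages = [   # illustrative rows: (url, title, meta description)
      ("/a", "Short and clear title", "A concise, unique description."),
      ("/b", "An extremely long page title that will almost certainly get truncated in the results", ""),
  ]

  for url, title, desc in pages:
      issues = []
      if not title:
          issues.append("missing title")
      elif len(title) > MAX_TITLE:
          issues.append("long title")
      if not desc:
          issues.append("missing description")
      elif len(desc) > MAX_DESC:
          issues.append("long description")
      print(url, issues or "ok")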

32. Clean Up Page Headers

Cluttered page headers can impair the click through rate if users’ expectations are not being managed well. CTRs can vary by wide margins, which makes it difficult to chart the most effective path to conversion.

If you suspect your page headers are cluttered, run a crawl with Google Analytics data: you can then assess key SEO landing pages and gain deeper insights by combining crawl data with powerful analytics data, including bounce rate, time on page and load times.

32-clean-up-page-headers

Other nuggets that DeepCrawl gives

33. How Does Your Site Compare to Competitors?

Set up a crawl using the “Stealth Crawl” feature to perform an undercover analysis of a competitor’s site, without them ever noticing. Stealth Crawl randomises IPs and user agents, with delays between requests, making it virtually indistinguishable from regular traffic. Analyse their site architecture and see how your site stacks up in comparison, discovering areas for improvement as you do so.

How To Do It:

  • Go to the Advanced Settings in step 4 of your crawl setup and tick Stealth Mode Crawl, nested under the Spider Settings

33-how-does-your-site-compare-to-competitors_

34. Test Domain Migration

There are always issues with newly migrated websites, which typically show up as page display errors or the site going down. By checking status codes post-migration in DeepCrawl you can keep an eye on any unexpected server-side issues as you crawl.

In the Non-200 Pages report you can see the total number of non-200 status codes, including 5xx and 4xx errors that DeepCrawl detected during the platform’s most recent crawl.

34-test-domain-migration

35. Test Individual URLs

Getting a granular view over thousands of pages can be difficult, but DeepCrawl makes the process digestible with an elegant Pages Breakdown pie chart on the dashboard that can be filtered to your needs. The pie chart (along with all graphs and reports) can be downloaded in the format of your choice, whether CSV, PNG or PDF.

View Primary Pages by clicking the link of the same name in your dashboard overview. From here, you can see a detailed breakdown of each unique, indexable URL across up to 200 metrics, including DeepRank (DeepCrawl’s internal ranking metric), clicks in, load time, content-to-HTML ratio, social tags, mobile optimisation (or lack thereof!), pagination and more.

35-test-individual-urls

36. Make Landing Pages Awesome and Improve UX

To help improve conversion and engagement, use DeepCrawl metrics to optimise page-level factors like compelling content and pagination, which are essential parts of your site’s marketing funnel that assist in turning visitors into customers.

You can find content that is missing key parts through the page content reports, helping you engage visitors faster, deliver your site’s message more clearly, increase the chances of conversion and exceed user expectations.

36-make-landing-pages-awesome-and-improve-ux

37. Optimise your Social Tags

To increase shares on Facebook (Open Graph) and Twitter and get the most out of your content and outreach activities, you need to make sure your Twitter Cards and Open Graph tags are set up and set up correctly.

Within DeepCrawl’s Social Tagging report you will see pages with and without social tags, whether the tags that are present are valid, and OG:URL Canonical Mismatch, i.e. pages where the Open Graph URL is different from the canonical URL. These should be identical, otherwise shares and likes might not be aggregated against your chosen URL in your Open Graph data, but spread across your URL variations.

37-optimise-your-social-tags
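
To check a single page by hand, the sketch below compares the canonical URL with the og:url tag using requests and BeautifulSoup. The URL is illustrative; this is just a one-page version of what the report does across the whole crawl.

  # Compare the canonical URL with the Open Graph URL for one page.
  import requests
  from bs4 import BeautifulSoup

  html = requests.get("https://example.com/article", timeout=10).text   # illustrative URL
  soup = BeautifulSoup(html, "html.parser")

  canonical_tag = soup.find("link", rel="canonical")
  og_tag = soup.find("meta", property="og:url")

  canonical = canonical_tag.get("href") if canonical_tag else None
  og_url = og_tag.get("content") if og_tag else None

  if canonical and og_url and canonical != og_url:
      print("OG:URL / canonical mismatch:", og_url, "vs", canonical)
  else:
      print("No mismatch detected (or one of the tags is missing).")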

Want Customised Crawls?

38. Schedule Crawls

You can schedule crawls using DeepCrawl to automatically run them in the future and adjust their frequency and start time. This feature can also be used to avoid times of heavy server load. Schedules range from every hour to every quarter.

How To Do It:

  • In step 4 of your crawl set up click on Schedule crawl

38-schedule-crawls

39. Run Multiple Crawls at Once

You can crawl multiple sites (up to 20 at any given time) really quickly as DeepCrawl is cloud-based, spanning millions of URLs at once while you are still able to use your computer to evaluate other reports. Hence with DeepCrawl you can run your pitch, SEO audit and competitor analysis crawls at the same time.

39-run-multiple-crawls-at-once

40. Improve Your Crawls with (Google) Analytics Data

By authenticating your Google Analytics accounts into DeepCrawl you can understand the combined SEO and analytics performance of key pages in your site. By overlaying organic traffic and total visits on your individual pages you can prioritise changes based on page performance.

How To Do It:

  • On step 2 of your crawl set up, go to Analytics, click add Google account
  • Enter your Google Analytics name and password to sync your data permissions within DeepCrawl
  • Click the profile you want to share for the DeepCrawl project
  • DeepCrawl will sync your last 7 or 30 days of data (your choice), or you can upload up to 6 months’ worth of data as a CSV file, whether from Google Analytics, Omniture or any other provider

40-improve-your-crawls-with-google-analytics-data

41. Upload Backlinks and URLs

Identify your best linked sites by uploading backlinks from Google Search Console, or lists of URLs from other sources to help you track the SEO performance of the most externally linked content on your site.

41-upload-backlinks-and-urls

42. Restrict Crawls

Restrict crawls for any site using DeepCrawl’s max URL setting, the exclude URL list, or the page grouping feature, which lets you restrict pages based on their URL patterns. With page grouping you can choose to crawl, say, 10% of a particular folder, or of each folder on your site, if you’re looking for a quick snapshot. Once you’ve re-crawled (so long as you keep the same page grouping settings), DeepCrawl will recrawl the same 10% so you can monitor changes.

Aside from Include/Exclude Only rules, you can restrict your crawls by Starting URLs and by limiting the depth and/or number of URLs you’d like to crawl on your given site.

How to Do It:

  • In Advanced Settings nested in step 4 of your crawl set up click “Included / Excluded URLs” or “Page Grouping” and/or “Start URLs”

42-restrict-crawls

43. Check Implementation of Structured Data

Access Google’s own Structured Data Testing Tool to validate Schema.org markup by adding a line or two of code to your audit through DeepCrawl’s Custom Extraction. This tool helps you see how your rich snippets may appear in search results, where errors in your markup prevent it from being displayed, and whether or not Google interprets your code, including rel=publisher and product reviews, correctly.

How To Do It:

  • In Advanced Settings of step 4 of your crawl set up
  • Click on “custom extraction”
  • Add the Custom Extraction code found here to get DeepCrawl to recognise Schema markup tags and add the particular line of code you want for your crawl: ratings, reviews, person, breadcrumbs, etc

itemtype="http://schema.org/([^"]*)
itemprop="([^"]*)
(itemprop="breadcrumb")
(itemtype="http://schema.org/Review")

43-check-implementation-of-structured-data
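
To get a feel for what those rules pull out, the sketch below runs the first two patterns locally against a page’s HTML with Python’s re module. The URL is illustrative; in DeepCrawl the same expressions simply go into the Custom Extraction fields.

  # Run the schema.org extraction patterns locally against one page (illustrative URL).
  import re
  import requests

  html = requests.get("https://example.com/product", timeout=10).text
  itemtypes = re.findall(r'itemtype="http://schema\.org/([^"]*)', html)
  itemprops = re.findall(r'itemprop="([^"]*)', html)

  print("Schema types found:", sorted(set(itemtypes)))
  print("Item properties found:", sorted(set(itemprops)))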

44. Using DeepCrawl as a Scraper

Add custom rules to your website audit with DeepCrawl’s Custom Extraction tool. You can tell the crawler to perform a wide array of tasks, including paying more attention to social tags, finding URLs that match certain criteria, verifying App Indexing deeplinks, or targeting an analytics tracking code to validate product information across category pages.

For more information about Custom Extraction syntax and coding, check out this tutorial published by the DeepCrawl team.

How To Do It:

  • Enter your Regular Expressions syntax into the Extraction Regex box in the advanced settings of step 4 of your crawl
  • View your results by checking the Custom Extraction tab in your project’s crawl dashboard or at the bottom of the navigation menu

44-using-deepcrawl-as-a-scraper

45. Track Migration Changes

Migrations happen for a lot of reasons, and are generally challenging. To help developers, SEOs and decision makers from the business come together and minimise risks, use DeepCrawl to compare staging and production environments in a single crawl, spot any issues before you migrate, make sure no-one missed their assignments, and ensure the migration goes smoothly.

For site migrations and/or redesigns, testing changes before going live can show you whether your redirects are correctly set up, whether you’re disallowing or no-indexing valuable pages in your robots.txt, and so on. Being careful does pay off!

45-track-migration-changes

46. Crawl as Googlebot or Your Own User Agent

If your site auditor can’t crawl your pages as a search engine bot, then you have no chance of seeing the site through the search engine’s eyes. DeepCrawl can also mimic spiders from other search engines, social networks and browsers. Select your user agent in the advanced settings when setting up or editing your crawl.

Your options are:

  • Googlebot (7 different options)
  • Applebot
  • Bingbot
  • Bingbot mobile
  • Chrome
  • Facebook
  • Firefox
  • Generic
  • Internet Explorer 6 & 8
  • iPhone
  • DeepCrawl
  • Custom User Agent

46-crawl-as-googlebot-or-your-own-user-agent

Last but not least

This top 46 list is by no means complete. In fact, there are many more possibilities to utilise DeepCrawl for enhancing site performance and the tool undergoes constant improvements. This list is a starting point to understanding your website as search engines do and making improvements for users and search engines alike.

DeepCrawl is a very powerful tool, and conclusions drawn using the data it provides must be based on experience and expertise. If applied to its full effect, DeepCrawl can bring an online business to the next level and significantly contribute to managing user expectations, building your brand and, most importantly, driving conversions.

What are your favourite DeepCrawl features? Your opinion matters, share it in the comments section below.

Post from Fili Wiese

A Crawl-Centred Approach to Content Auditing


It’s 2018 and content marketing is still on the rise in terms of interest and in the revenue it generates. 75% of companies increased their content marketing budget in 2016 and 88% of B2B sites are now using content marketing in various forms.

Measuring ROI and content effectiveness are seen by B2B companies as more of a challenge than a lack of budget. Generally speaking, the issues now faced with content marketing appear to be less about buy-in and more about showing the value of content marketing initiatives.

Challenges for B2B Content Marketers

Increased Need for Content Audits

The longer companies invest in content marketing initiatives, the larger their sites are likely to become, meaning that the need to review and audit content is more important than ever.

I’m going to put forward a crawl-centred process for reviewing content that will give you a structure for answering these four questions:

  1. What content is/isn’t working?
  2. What should you do with content that isn’t working well?
  3. How can you get the most out of content that is working well?
  4. How can you find insights that will inform your content strategy?

Answering these questions will put you in a position to optimise and maintain site content as well as assisting you in implementing a data-driven approach to ensure content marketing resources are being invested efficiently.

Content Discovery

The first phase of a content audit is the discovery phase. This involves finding all of the content on your site and combining it with a range of relevant data sources that will inform you about the performance of each page.

Starting with a crawl

While there are many ways you can achieve this, the simplest way is by starting with a crawl of the site. Using a crawler, like DeepCrawl, you can crawl sites of any size and you have the ability to easily bring in a whole host of additional data sources to find pages that a simple web crawl might miss.

Integrating Additional Data Sources

The graphic below details some of the data sources you will want to bring in alongside a crawl, but don’t treat this as an exhaustive list. You should look to include any data sources and metrics that are going to be useful in helping you assess the performance of your content, which may also include social shares, sitemaps, estimated search volume, SERP data and so on.

DeepCrawl Search Universe

The beauty of using a crawler like DeepCrawl is that it will save you the effort of having to pull the majority of these data sources together manually…and crashing Excel. Once you’ve run a crawl with other data sources included, you can simply export the full set of data into a spreadsheet, with the pages occupying rows and the metrics as columns.
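
If you prefer to do the stitching yourself, a minimal pandas version of this step might look like the sketch below. It assumes both exports are CSVs sharing a ‘url’ column; the file and column names are made up.

  # Merge a crawl export with an analytics export on a shared URL column (pandas sketch).
  import pandas as pd

  crawl = pd.read_csv("crawl_export.csv")          # one row per page, crawl metrics as columns
  analytics = pd.read_csv("analytics_export.csv")  # e.g. sessions and bounce rate per landing page

  audit = crawl.merge(analytics, on="url", how="left")
  audit.to_csv("content_audit.csv", index=False)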

Using Custom Extractions

Another benefit of using a more advanced crawler is that you can use custom extractions to pull out data like: author name, out of stock items, published date, last modified date, meta keywords, breadcrumbs and structured data.

DeepCrawl Custom Extraction

The Refining Phase:

At this point you’ll need to take that bloated spreadsheet, full of performance insights, and shrink it into something more manageable. The aim of this phase is to reduce the data down so that you’re in a position to start assessing and reviewing pages.

This involves removing pages (rows) from the spreadsheet that sit outside of the content audit and getting rid of superfluous metrics (columns) which aren’t going to provide you with valuable insights. If you aren’t sure whether a metric or page should be included you can always hide these rather than deleting them.

column deletion

Now that you’ve got a workable dataset, let’s see how you can go about answering those four questions.

What content on the site is/isn’t working?

To understand content performance you will want to decide on a set of criteria that define success and failure. The metrics you choose will be dependent on the goal of your content and what you’re trying to achieve, but will likely cover traffic, social shares, engagement, conversions or a mixture of some or all of these.

Once you’ve made this decision, you can define buckets based on varying levels of those metrics e.g. outstanding, good, average, poor and apply them to the pages you’re assessing.

Now you will be able to see how content is performing based on the metrics you care about.
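
As a sketch of that bucketing step, the snippet below bins pages by a single success metric using pandas. The thresholds, file name and the ‘organic_sessions’ column are all illustrative; swap in whatever metrics define success for your content.

  # Bucket pages into performance bands by one metric (illustrative thresholds and columns).
  import pandas as pd

  audit = pd.read_csv("content_audit.csv")
  bins = [-1, 10, 100, 1000, float("inf")]
  labels = ["poor", "average", "good", "outstanding"]
  audit["performance"] = pd.cut(audit["organic_sessions"], bins=bins, labels=labels)

  print(audit["performance"].value_counts())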

How can you deal with content that isn’t performing well?

Now that your awesome and poorly performing pages are right there in front of you, you’ll need to decide how to deal with them respectively.

I’d start by adding in an ‘Action’ column to your spreadsheet and creating a dropdown list of different decisions. These could include:

  • Keep – Pages that are performing well and will not be changed significantly
  • Cut – Low value pages that don’t deserve a place on your site e.g. outdated content
  • Combine – Pages that include content that doesn’t warrant its own dedicated page but can be used to bolster another existing page
  • Convert – Pages with potential that you want to invest time improving e.g. partially duplicate content

Column actions

For small to medium sized sites you should be able to make these decisions on a page by page basis, but for larger sites it may be easier to make decisions by aggregating pages into groupings, so that it remains a manageable process.

What actions can you take to get the most out of content that is performing well?

Now that you’ve decided what actions you’re going to take for each of your pages, you’ll want to filter down your pages by those that you’re going to keep, and look at ways that you can get the most out of them.

This will take the form of an exercise in content optimisation and involves tuning up the content you want to keep. There are a ton of great resources which cover this subject so I won’t cover this in detail. However, you may want to look at how you can improve:

Optimising titles & meta descriptions

– Bread and butter stuff, but are your titles and descriptions appealing propositions? Do they match the user intent of the queries that they rank for?

Keyword cannibalisation

– Do you have multiple pages targeting or ranking for topically similar queries that could be consolidated to maximise your authority on the subject?

Resolving content duplication issues

– Is content on your site unique? Are there near or true (complete) duplicate versions which could be diluting the authority of the page(s) that is performing well?

Linking

– Are there opportunities to link to related pages internally or externally? Do you have relevant CTAs? What do you want visitors to do once they’ve finished with the page they’re on (and do you help get them there)?

Page speed

– Are there any ways you can further optimise pages to reduce load time e.g. image optimisation or clunky code?

Structured data implementation

– Is there any useful structured data that you could use to mark up pages, and is existing markup implemented correctly?

How can you enhance your content strategy?

Once you’ve made it to this stage you will know how you’re going to deal with all of your existing content, but how can you use your performance data to inform your content strategy going forward?

The resources that you have for content production are finite, so you need to uncover insights that will help you determine what you should be investing in more, what you should do less of, or what you shouldn’t be producing altogether.

You will likely have a good understanding of the relative performance of your content at this point, but it can be helpful to focus on specific metrics and dimensions across the whole dataset to gain deeper insights.

Finding Relationships

You can do this by pivoting variables around important metrics to find relationships that will show you how you can do more of what works and less of what doesn’t.

You’ve already defined your success metrics, but here are some examples of variables that you might view these metrics against to find interesting relationships (a small pivot sketch follows the list):

  • Performance and engagement by channel/category/content type – Do some types of content perform better than others? Are they viewed or shared more frequently?
  • Content length and engagement – Is word count positively correlated with engagement or is there a drop off point of diminishing returns?
  • Content length and sharing – Does longer content, which is usually more resource intensive, get shared more than short form content? Do the results of long form content justify the larger investment?
  • Performance and engagement by author – Do some authors receive more pageviews and shares than others? This is particularly useful for media organisations where there is a high level of content production and author performance is more important.
  • Performance fluctuations by publish date and time – Is content better received on specific days of the week, times of the day or months of the year (if you have data going back far enough)? Can you tailor content publication to times that are likely to get more exposure? For news sites this may mean writers publishing articles outside of standard working days and hours to get more views when their readers have more free time.
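
Here is the promised pivot sketch: a couple of pandas pivot tables comparing average performance by content type and by author. The file name and column names (‘content_type’, ‘author’, ‘organic_sessions’, ‘social_shares’, ‘pageviews’) are illustrative and depend on your own data and custom extractions.

  # Pivot the audit data to compare performance across dimensions (illustrative columns).
  import pandas as pd

  audit = pd.read_csv("content_audit.csv")

  by_type = pd.pivot_table(audit, index="content_type",
                           values=["organic_sessions", "social_shares"], aggfunc="mean")
  by_author = pd.pivot_table(audit, index="author",
                             values=["pageviews", "social_shares"], aggfunc=["mean", "count"])

  print(by_type)
  print(by_author)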

From Audits to Continuous Monitoring

Content audits are going to vary depending on site size, type and the data you have to hand. However, the above provides a framework for conducting an audit to streamline and get the most out of existing content, as well as improving your content strategy going forward.

Furthermore, you should look to achieve this with crawl data at the core, preferably with a tool that automatically integrates additional data sources to make this process as quick and painless as possible.

Once you’ve got this whole process down, you can take it to the next level by automatically pulling in the data on a regular basis and putting it into dashboards. Doing this will enable regular reviews of your content performance and will allow you to adjust the course of your content strategy accordingly.

Post from Sam Marsden
