We have some very exciting new features ready in the latest update, which includes the following –
1) Rendered Crawling (JavaScript):
There were two things we set out to do at the start of the year. Firstly, understand exactly what the search bots are able to crawl and index. This is why we created the Screaming Frog Log File Analyser, as a crawler will only ever be a simulation of search bot behaviour.
Secondly, we wanted to crawl rendered pages and read the DOM. It’s been known for a long time that Googlebot acts more like a modern browser, rendering content and crawling and indexing JavaScript and dynamically generated content rather well. The SEO Spider is now able to render and crawl web pages in a similar way.
You can choose whether to crawl the static HTML, obey the old AJAX crawling scheme or fully render web pages, meaning executing and crawling the JavaScript and dynamic content.
Google deprecated their old AJAX crawling scheme and we have seen JavaScript frameworks such as AngularJS (with links or utilising the HTML5 History API) crawled, indexed and ranking like a typical static HTML site. I highly recommend reading Adam Audette’s Googlebot JavaScript testing from last year if you’re not already familiar.
After much research and testing, we integrated the Chromium project library for our rendering engine to emulate Google as closely as possible. Some of you may remember the excellent ‘Googlebot is Chrome‘ post from Mike King back in 2011 which discusses Googlebot essentially being a headless browser.
The new rendering mode is really powerful, but there are a few things to remember –
- Typically crawling is slower even though it’s still multi-threaded, as the SEO Spider has to wait longer for the content to load and gather all the resources to be able to render a page. Our internal testing suggests Google waits approximately 5 seconds for a page to render, so this is the default AJAX timeout in the SEO Spider. Google may adjust this based upon server response and other signals, so you can configure this to your own requirements if a site is slower to load a page.
- The crawling experience is quite different, as it can take time for anything to appear in the UI to start with, then suddenly lots of URLs appear at once. This is because the SEO Spider waits for all the resources to be fetched to render a page before the data is displayed.
- To be able to render content properly, resources such as JavaScript and CSS should not be blocked from the SEO Spider. You can see URLs blocked by robots.txt (and the corresponding robots.txt disallow line) under ‘Response Codes > Blocked By Robots.txt’. You should also make sure that you crawl JS, CSS and external resources in the SEO Spider configuration. (A quick way to sanity-check whether key resources are disallowed is sketched just after this list.)
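If you’d like to verify this outside the tool, a few lines of Python will do it. This is just a minimal sketch using the standard library – the URLs and user-agent strings below are illustrative, not the SEO Spider’s actual internals –

```python
# Minimal sketch: check whether JS/CSS resources are disallowed by robots.txt.
# URLs and user-agent strings are purely illustrative.
from urllib.robotparser import RobotFileParser

robots = RobotFileParser("https://example.com/robots.txt")
robots.read()  # fetches and parses the live robots.txt

resources = [
    "https://example.com/assets/app.js",
    "https://example.com/assets/styles.css",
]

for user_agent in ("Googlebot", "Screaming Frog SEO Spider"):
    for url in resources:
        allowed = robots.can_fetch(user_agent, url)
        print(f"{user_agent}: {url} -> {'allowed' if allowed else 'BLOCKED'}")
```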
It’s also important to note that the SEO Spider renders content like a browser from your machine, so this can impact analytics and anything else that relies upon JavaScript.
By default the SEO Spider does not execute Google Analytics JavaScript tags within its engine; however, if a site is using other analytics solutions or JavaScript that shouldn’t be executed, remember to use the exclude feature.
2) Configurable Columns & Ordering
You’re now able to configure which columns are displayed in each tab of the SEO Spider (by clicking the ‘+’ in the top window pane).
You can also drag and drop the columns into any order and this will be remembered (even after a restart).
To revert to the default columns and ordering, simply right click on the ‘+’ symbol and click ‘Reset Columns’, or go to ‘Configuration > User Interface > Reset Columns For All Tables’.
3) XML Sitemap & Sitemap Index Crawling
The SEO Spider already allowed crawling of XML sitemaps in list mode by uploading the .xml file (number 8 in the ‘10 features in the SEO Spider you should really know‘ post), which was always a little clunky if the sitemap was already live, as you had to save it first (but handy when it wasn’t uploaded!).
So we’ve now introduced the ability to enter a sitemap URL to crawl it (‘List Mode > Download Sitemap’).
Previously if a site had multiple sitemaps, you’d have to upload and crawl them separately as well.
Now if you have a sitemap index file to manage multiple sitemaps, you can enter the sitemap index file URL and the SEO Spider will download all sitemaps and subsequent URLs within them!
This should help save plenty of time!
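To give a rough idea of what that involves under the hood, here’s a minimal Python sketch of walking a sitemap index and collecting the URLs from each child sitemap. The sitemap URL is illustrative and this isn’t the SEO Spider’s actual code –

```python
# Minimal sketch: walk a sitemap index and collect URLs from each child sitemap.
import urllib.request
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def fetch_xml(url):
    with urllib.request.urlopen(url) as response:
        return ET.fromstring(response.read())

def urls_from_sitemap_index(index_url):
    index = fetch_xml(index_url)
    urls = []
    # Each <sitemap><loc> entry in the index points at a child sitemap.
    for loc in index.findall("sm:sitemap/sm:loc", NS):
        child = fetch_xml(loc.text.strip())
        # Each <url><loc> entry in a child sitemap is a page URL.
        urls.extend(u.text.strip() for u in child.findall("sm:url/sm:loc", NS))
    return urls

print(len(urls_from_sitemap_index("https://example.com/sitemap_index.xml")))
```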
4) Improved Custom Extraction – Multiple Values & Functions
We listened to feedback that users often want to extract multiple values without having to use multiple extractors. For example, previously to collect 10 values, you’d need to use 10 extractors with index selectors ([1], [2] etc.) in XPath.
We’ve changed this behaviour, so by default a single extractor will collect all values found and report them via a single extractor for XPath, CSS Path and Regex. If you have 20 hreflang values, you can use a single extractor to collect them all and the SEO Spider will dynamically add additional columns for however many are required. You’ll still have 9 extractors left to play with as well. So a single XPath will now collect all values discovered.
You can still choose to extract just the first instance by using an index selector as well. For example, if you just wanted to collect the first h3 on a page, you could add an index selector such as [1] to the expression.
Functions can also be used anywhere in XPath, and you can now use a function on its own as well via the ‘function value’ dropdown. So if you wanted to count the number of links on a page, you could use the XPath count() function. All three cases are illustrated in the sketch below.
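As a rough illustration, the snippet below runs some generic XPath expressions with Python and lxml. These are example expressions rather than the exact ones from the original release screenshots –

```python
# Illustrative XPath examples only – not necessarily the exact expressions from
# the original screenshots. Requires the third-party lxml package.
from lxml import html

page = html.fromstring("""
<html><body>
  <h3>First heading</h3><h3>Second heading</h3><h3>Third heading</h3>
  <a href="/one">One</a><a href="/two">Two</a>
</body></html>
""")

# A single XPath that returns every matching value (all h3 headings here).
all_h3 = page.xpath("//h3/text()")         # ['First heading', 'Second heading', 'Third heading']

# An index selector limits the result to the first instance only.
first_h3 = page.xpath("(//h3)[1]/text()")  # ['First heading']

# A function used on its own, e.g. counting the links on the page.
link_count = page.xpath("count(//a)")      # 2.0

print(all_h3, first_h3, link_count)
```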
I’d recommend reading our updated guide to web scraping for more information.
5) rel=“next” and rel=“prev” Elements Now Crawled
The SEO Spider can now crawl rel=“next” and rel=“prev” elements, whereas previously the tool merely reported them. Now, if a URL has not already been discovered, it will be added to the queue and crawled if the configuration is enabled (‘Configuration > Spider > Basic Tab > Crawl Next/Prev’).
rel=“next” and rel=“prev” elements are not counted as ‘Inlinks’ (in the lower window tab), as they are not links in a traditional sense. Hence, if a URL does not have any ‘Inlinks’ in the crawl, it might well have been discovered via a rel=“next”/“prev” element or a canonical. We recommend using the ‘Crawl Path Report‘ to show how the page was discovered, which will show the full path.
There’s also a new ‘respect next/prev’ configuration option (under ‘Configuration > Spider > Advanced tab’) which will hide any URLs with a ‘prev’ element, so they are not considered as duplicates of the first page in the series.
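To make the discovery mechanism concrete, here’s a minimal sketch of picking up pagination URLs from rel=“next”/“prev” link elements. It’s illustrative only – the markup and parser are not the SEO Spider’s internals –

```python
# Minimal sketch: discover pagination URLs from rel="next"/"prev" link elements.
from html.parser import HTMLParser

class NextPrevParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.discovered = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower()
        if tag == "link" and rel in ("next", "prev") and attrs.get("href"):
            # These URLs would be queued for crawling, but not counted as 'Inlinks'.
            self.discovered.append((rel, attrs["href"]))

page = """<head>
  <link rel="prev" href="https://example.com/category?page=1">
  <link rel="next" href="https://example.com/category?page=3">
</head>"""

parser = NextPrevParser()
parser.feed(page)
print(parser.discovered)
```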
6) Updated SERP Snippet Emulator
Earlier this year, in May, Google increased the column width of the organic SERPs from 512px to 600px on desktop, which means titles and description snippets are longer. Google displays and truncates SERP snippets based on the pixel width of characters rather than the number of characters, which can make them challenging to optimise.
Our previous research showed Google used to truncate page titles at around 482px on desktop. With the change, we have updated our research and logic in the SERP snippet emulator to match Google’s new truncation point before an ellipsis (…), which for page titles on desktop is around 570px.
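To show why pixel width (rather than character count) matters, here’s a rough Python sketch of that kind of truncation logic. It assumes Pillow is installed and a local Arial font file is available, and the 18px size and 570px limit are approximations of the desktop SERP rather than Screaming Frog’s exact implementation –

```python
# Rough sketch of pixel-based title truncation. Font path, size and the 570px
# limit are assumptions, not Screaming Frog's exact values.
from PIL import ImageFont

TITLE_FONT = ImageFont.truetype("arial.ttf", 18)  # assumes a local Arial font file
MAX_TITLE_PX = 570
ELLIPSIS = "…"

def truncate_title(title):
    if TITLE_FONT.getlength(title) <= MAX_TITLE_PX:
        return title
    words = title.split()
    # Drop whole words until the title plus an ellipsis fits within the limit.
    while words and TITLE_FONT.getlength(" ".join(words) + " " + ELLIPSIS) > MAX_TITLE_PX:
        words.pop()
    return " ".join(words) + " " + ELLIPSIS

print(truncate_title("A Very Long Example Page Title That Keeps Going | Example Brand"))
```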
Our research shows that while the space for descriptions has also increased, they are still being truncated far earlier, at a similar point to the older 512px-wide SERPs. The SERP snippet emulator will only bold keywords within the snippet description, not in the title, in the same way as the Google SERPs.
Please note – You may occasionally see our SERP snippet emulator be a word out in either direction compared to what you see in the Google SERP. There will always be some pixel differences, which means the pixel boundary might not be in the exact same spot that Google calculates 100% of the time.
We are still seeing Google play to different rules at times as well, where some snippets have a longer pixel cut off point, particularly for descriptions! The SERP snippet emulator is therefore not always exact, but a good rule of thumb.
Other Updates
We have also included some other smaller updates and bug fixes in version 6.0 of the Screaming Frog SEO Spider, which include the following –
- A new ‘Text Ratio’ column has been introduced in the ‘Internal’ tab, which calculates the text to HTML ratio.
- Google updated their Search Analytics API, so the SEO Spider can now retrieve more than 5k rows of data from Search Console.
- There’s a new ‘search query filter’ for Search Console, which allows users to include or exclude keywords (under ‘Configuration > API Access > Google Search Console > Dimension tab’). This should be useful for excluding brand queries for example.
- There’s a new configuration to extract images from the IMG srcset attribute under ‘Configuration > Advanced’.
- The new Googlebot smartphone user-agent has been included.
- Updated our support for relative base tags.
- Removed the blank line at the start of Excel exports.
- Fixed a bug with word count which could make it less accurate.
- Fixed a bug with GSC CTR numbers.
Thanks to everyone for all the feedback and suggestions.