We recently finished a solution audit for a very large Sitecore site. The site had serious performance issues (20-30 second execution times for some functions and frequent w3wp process crashes), and adding more beefy servers wasn't alleviating the situation. We were asked to review the server configuration, solution and code, profile the solution to identify bottlenecks, and recommend improvements and fixes.
The code was massively complex and not easy to understand, so our analysis of it didn't reveal any immediate conclusions, other than very heavy use of a multi-threaded custom cache… hmmm, obviously the items in the cache would have something to do with why this was running slow. The only problem was that the cache was being used for EVERYTHING – even Lucene search results – which would prove important later…
Navigating the content tree immediately raised eyebrows: the solution had more than 40,000 items – which in itself isn't a lot, but they were stored in a very flat, non-hierarchical structure, which made Sitecore run very slowly. You might know that Sitecore does NOT recommend having more than roughly 100-200 items on the same level (under the same parent node). In this case, there were many violations of this common-sense rule.
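The usual remedy for a flat tree like this is to bucket items under intermediate nodes, for example by creation date, so that no single parent ends up with thousands of children. A minimal sketch of the idea (the item names and bucketing scheme here are hypothetical, not from the audited solution):

```python
from collections import defaultdict
from datetime import date

def bucket_path(created: date) -> str:
    # Derive a hierarchical path (year/month/day) from the creation date,
    # so children are spread across many parents instead of one flat level.
    return f"{created.year}/{created.month:02d}/{created.day:02d}"

# Hypothetical articles that would otherwise all sit under one flat parent.
articles = [(f"article-{i}", date(2011, 1 + i % 12, 1 + i % 28))
            for i in range(10000)]

tree = defaultdict(list)
for name, created in articles:
    tree[bucket_path(created)].append(name)

# Each parent now holds a bounded number of children.
print(max(len(children) for children in tree.values()))  # → 120
```

With 10,000 items the largest bucket holds 120 children, comfortably inside the recommended range, instead of all 10,000 sharing one parent.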
Bring out the power tools
We decided to use AQTime to profile the solution. AQTime is a fantastic tool: a performance profiler that you can hook up to many different types of applications and solutions to measure their performance and find the bottlenecks.
Initially, we thought it would be best to run the profiler locally (as that normally proves to be sufficient to detect the bottlenecks), but in this case it gave us unrealistic results, as our use of the solution didn’t match that of the users of the live site. So we decided to run it on one of their live servers (slowing it down a bit maybe, but worth it).
The results were clear
Sitecore was running very fast for what it was used for – but the culprit: Lucene.
Surprising! Isn't Lucene supposed to be used for search, or to speed up retrieval of items via an index? The problem was that Lucene was being used incorrectly, and almost every page made multiple Lucene calls. The top 20 heaviest functions accounted for 88-98% of the CPU time spent:
After a bit of profiling, we decided to build a test harness to isolate the Lucene-related code and performance-test it manually. We quickly found a common pattern across the calls to Lucene, and the two major performance culprits were:
- Sorting of the results (always a potential hazard when done incorrectly)
- Filtering using RangeQuery with non-tokenized fields (publish date in this case) as boundaries (dates stored as strings in the format yyyyMMddHHmmss). As you might have guessed, this forces most items to be evaluated against each other (or close to it, depending on the algorithm Lucene uses for these string comparisons). Even worse, the RangeQuery ran from "0" (the start of time) to DateTime.Now – so it filtered out essentially nothing except articles published in the future. It made no sense…
One example was a simple search call that took 3,600 ms before we removed these filters, and 30 ms after.
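The range semantics are easy to see outside of Lucene. A small Python sketch (illustrative only; the dates and values are made up, and real Lucene evaluates ranges against its term index rather than a list) shows why a lower bound of "0" excludes nothing:

```python
from datetime import datetime

def in_range(value: str, lower: str, upper: str) -> bool:
    # Untokenized yyyyMMddHHmmss strings sort correctly as plain strings,
    # which is the only reason they work as range boundaries at all.
    return lower <= value <= upper

items = ["20090101120000", "20110315083000", "20110601000000"]
now = datetime(2011, 6, 1, 12, 0, 0).strftime("%Y%m%d%H%M%S")

# The range the site actually used: "0" (start of time) to now. "0" sorts
# before any digit string of this format, so the lower bound excludes
# nothing; only future-dated items fall outside the range.
matched = [i for i in items if in_range(i, "0", now)]
print(len(matched))  # → 3: every non-future item matches

# A meaningful lower bound actually narrows the result set.
recent = [i for i in items if in_range(i, "20110101000000", now)]
print(len(recent))  # → 2
```

In other words, the query paid the full cost of a range comparison over the index while doing almost no filtering in return.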
The team that designed and built the system must have known that the Lucene calls were slow (or maybe not), but instead of figuring out why, they decided to cache the results in a multi-threaded cache (with a 2-minute expiration on some results, since the site served up-to-the-minute news) – basically a quick fix to a long-term architectural issue.
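The quick-fix pattern they reached for looks roughly like this (a minimal thread-safe sketch of a TTL cache; the names are hypothetical and this is not their code):

```python
import threading
import time

class TtlCache:
    """Minimal thread-safe cache with per-entry expiration – a sketch of
    the quick-fix pattern, not the site's actual implementation."""

    def __init__(self):
        self._lock = threading.Lock()
        self._store = {}  # key -> (expires_at, value)

    def get_or_add(self, key, factory, ttl_seconds=120):
        with self._lock:
            entry = self._store.get(key)
            if entry and entry[0] > time.monotonic():
                return entry[1]  # still fresh: skip the expensive call
        value = factory()  # the slow Lucene call being papered over
        with self._lock:
            self._store[key] = (time.monotonic() + ttl_seconds, value)
        return value

cache = TtlCache()
result = cache.get_or_add("news-front-page", lambda: "expensive search result")
```

The trap is visible right in the structure: every cache miss (and every expiry, every two minutes, on every key) still pays the full cost of the slow call, so the underlying problem never goes away.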
As a developer, you have a responsibility to your client. Hiding a performance problem behind a cache will not solve it long term; you have a duty to fix the problem permanently.