We have moved to GitHub Issues
Created by Stephan 16 May 2014, 15:57:23 Updated by Sebastiaan Janssen 29 Mar 2017, 08:30:58
Relates to: U4-5355
Relates to: U4-4922
Running 6.2.0 in production... from time to time the application gets slower and slower, then eventually locks up completely. IIS reports plenty of long-running requests that keep piling up, and nothing seems to happen until IIS decides to recycle everything. We see about 4-6 total locks per day.
Finally took a memory dump of the application while it was locked, using ProcDump (http://technet.microsoft.com/en-us/sysinternals/dd996900.aspx) and doing:
procdump -ma <pid> iis.dmp. This produced a 1.6GB dump file that can be zipped to about 200MB and downloaded from the production server.
On a dev machine, open the dump with WinDBG and dump the threads: .loadby sos clr followed by ~*e!clrstack
This creates a big text blob containing all the threads in the IIS w3wp process, with the CLR stack trace of each CLR thread. First look reveals that almost all threads are blocked at something looking like:
System.Threading.WaitHandle.WaitOneNative(System.Runtime.InteropServices.SafeHandle, UInt32, Boolean, Boolean)
System.Threading.WaitHandle.InternalWaitOne(System.Runtime.InteropServices.SafeHandle, Int64, Boolean, Boolean)
System.Threading.ReaderWriterLockSlim.WaitOnEvent(System.Threading.EventWaitHandle, UInt32 ByRef, TimeoutTracker)
System.Threading.ReaderWriterLockSlim.TryEnterUpgradeableReadLockCore(TimeoutTracker)
Which would indicate that the whole application is deadlocked. Next: inspect each of the approx 250 threads manually and figure out what's going on...
Almost every thread is locked waiting for some lock in a resolver, or in HttpRuntimeCacheProvider. And then we have...
System.Threading.WaitHandle.WaitOneNative(System.Runtime.InteropServices.SafeHandle, UInt32, Boolean, Boolean)
System.Threading.WaitHandle.InternalWaitOne(System.Runtime.InteropServices.SafeHandle, Int64, Boolean, Boolean)
System.Threading.ReaderWriterLockSlim.WaitOnEvent(System.Threading.EventWaitHandle, UInt32 ByRef, TimeoutTracker)
System.Threading.ReaderWriterLockSlim.TryEnterReadLockCore(TimeoutTracker)
System.Threading.ReaderWriterLockSlim.TryEnterReadLock(TimeoutTracker)
System.Threading.ReaderWriterLockSlim.TryEnterReadLock(Int32)
Umbraco.Core.ObjectResolution.SingleObjectResolverBase`2[[System.__Canon, mscorlib],[System.__Canon, mscorlib]].get_Value() [c:\Users\Stéphane\Documents\Development\Umbraco\src6\src\Umbraco.Core\ObjectResolution\SingleObjectResolverBase.cs @ 106]
Umbraco.Core.StringExtensions.ToSafeAlias(System.String) [c:\Users\Stéphane\Documents\Development\Umbraco\src6\src\Umbraco.Core\StringExtensions.cs @ 927]
...`1[[System.__Canon, mscorlib]].<GetCacheItem>b__0() [c:\Users\Stéphane\Documents\Development\Umbraco\src6\src\Umbraco.Core\Cache\CacheProviderExtensions.cs @ 22]
Umbraco.Core.Cache.HttpRuntimeCacheProvider.GetCacheItem(System.String, System.Func`1<System.Object>, System.Nullable`1<System.TimeSpan>, Boolean, System.Web.Caching.CacheItemPriority, System.Web.Caching.CacheItemRemovedCallback, System.Web.Caching.CacheDependency) [c:\Users\Stéphane\Documents\Development\Umbraco\src6\src\Umbraco.Core\Cache\HttpRuntimeCacheProvider.cs @ 91]
Umbraco.Core.Cache.HttpRuntimeCacheProvider.GetCacheItem(System.String, System.Func`1<System.Object>, System.Nullable`1<System.TimeSpan>, Boolean, System.Web.Caching.CacheItemPriority, System.Web.Caching.CacheItemRemovedCallback, System.String) [c:\Users\Stéphane\Documents\Development\Umbraco\src6\src\Umbraco.Core\Cache\HttpRuntimeCacheProvider.cs @ 112]
Umbraco.Core.Cache.CacheProviderExtensions.GetCacheItem[[System.__Canon, mscorlib]](Umbraco.Core.Cache.IRuntimeCacheProvider, System.String, System.Func`1<System.__Canon>, System.Nullable`1<System.TimeSpan>, Boolean, System.Web.Caching.CacheItemPriority, System.Web.Caching.CacheItemRemovedCallback, System.String) [c:\Users\Stéphane\Documents\Development\Umbraco\src6\src\Umbraco.Core\Cache\CacheProviderExtensions.cs @ 22]
*** WARNING: Unable to verify checksum for umbraco.MacroEngines.dll
umbraco.MacroEngines.ExamineBackedMedia.GetUmbracoMedia(Int32) [c:\Users\Stéphane\Documents\Development\Umbraco\src6\src\umbraco.MacroEngines\RazorDynamicNode\ExamineBackedMedia.cs @ 54]
umbraco.MacroEngines.DynamicBackingItem..ctor(Int32) [c:\Users\Stéphane\Documents\Development\Umbraco\src6\src\umbraco.MacroEngines\RazorDynamicNode\DynamicBackingItem.cs @ 50]
umbraco.MacroEngines.DynamicNode..ctor(Int32) [c:\Users\Stéphane\Documents\Development\Umbraco\src6\src\umbraco.MacroEngines\RazorDynamicNode\DynamicNode.cs @ 67]
*** WARNING: Unable to verify checksum for w310.Core.dll
*** ERROR: Module load completed but symbols could not be loaded for w310.Core.dll
w310.Core.RazorModels.MultiNodeTreePickerModel.Init(Int32, System.String, System.Object ByRef)
umbraco.MacroEngines.DynamicNode.TryCreateInstanceRazorDataTypeModel(System.Guid, System.Type, System.String, System.Object ByRef) [c:\Users\Stéphane\Documents\Development\Umbraco\src6\src\umbraco.MacroEngines\RazorDynamicNode\DynamicNode.cs @ 659]
=> that thread has a lock on the global HttpRuntimeCacheProvider and is trying to get a lock on a resolver (ShortStringHelperResolver actually).
[HelperMethodFrame_1OBJ: 0000000014849a38]
System.Threading.WaitHandle.WaitOneNative(System.Runtime.InteropServices.SafeHandle, UInt32, Boolean, Boolean)
System.Threading.WaitHandle.InternalWaitOne(System.Runtime.InteropServices.SafeHandle, Int64, Boolean, Boolean)
System.Threading.ReaderWriterLockSlim.WaitOnEvent(System.Threading.EventWaitHandle, UInt32 ByRef, TimeoutTracker)
System.Threading.ReaderWriterLockSlim.TryEnterUpgradeableReadLockCore(TimeoutTracker)
System.Threading.ReaderWriterLockSlim.TryEnterUpgradeableReadLock(TimeoutTracker)
System.Threading.ReaderWriterLockSlim.TryEnterUpgradeableReadLock(Int32)
Umbraco.Core.Cache.HttpRuntimeCacheProvider.GetCacheItem(System.String, System.Func`1<System.Object>, System.Nullable`1<System.TimeSpan>, Boolean, System.Web.Caching.CacheItemPriority, System.Web.Caching.CacheItemRemovedCallback, System.Web.Caching.CacheDependency) [c:\Users\Stéphane\Documents\Development\Umbraco\src6\src\Umbraco.Core\Cache\HttpRuntimeCacheProvider.cs @ 84]
Umbraco.Core.Cache.HttpRuntimeCacheProvider.GetCacheItem(System.String, System.Func`1<System.Object>, System.Nullable`1<System.TimeSpan>, Boolean, System.Web.Caching.CacheItemPriority, System.Web.Caching.CacheItemRemovedCallback, System.String) [c:\Users\Stéphane\Documents\Development\Umbraco\src6\src\Umbraco.Core\Cache\HttpRuntimeCacheProvider.cs @ 112]
Umbraco.Core.Cache.CacheProviderExtensions.GetCacheItem[[System.__Canon, mscorlib]](Umbraco.Core.Cache.IRuntimeCacheProvider, System.String, System.Func`1<System.__Canon>, System.Nullable`1<System.TimeSpan>, Boolean, System.Web.Caching.CacheItemPriority, System.Web.Caching.CacheItemRemovedCallback, System.String) [c:\Users\Stéphane\Documents\Development\Umbraco\src6\src\Umbraco.Core\Cache\CacheProviderExtensions.cs @ 22]
Umbraco.Core.CacheHelper.GetCacheItem[[System.__Canon, mscorlib]](System.String, System.TimeSpan, System.Func`1<System.__Canon>) [c:\Users\Stéphane\Documents\Development\Umbraco\src6\src\Umbraco.Core\CacheHelper.cs @ 304]
Umbraco.Core.PluginManager.TryGetCachedPluginsFromFile[[System.__Canon, mscorlib]](TypeResolutionKind) [c:\Users\Stéphane\Documents\Development\Umbraco\src6\src\Umbraco.Core\PluginManager.cs @ 290]
Umbraco.Core.PluginManager.ResolveTypes[[System.__Canon, mscorlib]](System.Func`1<System.Collections.Generic.IEnumerable`1<System.Type>>, TypeResolutionKind, Boolean) [c:\Users\Stéphane\Documents\Development\Umbraco\src6\src\Umbraco.Core\PluginManager.cs @ 651]
Umbraco.Core.PluginManager.ResolveTypes[[System.__Canon, mscorlib]](Boolean, System.Collections.Generic.IEnumerable`1<System.Reflection.Assembly>) [c:\Users\Stéphane\Documents\Development\Umbraco\src6\src\Umbraco.Core\PluginManager.cs @ 752]
*** WARNING: Unable to verify checksum for cms.DLL
*** ERROR: Module load completed but symbols could not be loaded for cms.DLL
=> that thread is using the infamous Resolution DirtyBackDoor (so it has locked resolution) and wants a lock on the HttpRuntimeCacheProvider.
So we have a deadlock.
Looking at the code... the cause of the deadlock is that we are doing far too many things within locks. In HttpRuntimeCacheProvider we basically lock the whole cache, look for a value, and if the value is not there, ''generate the value while still in the lock''. This means that all cache misses are ''serialized''. If generating the value means hitting the database, etc., there's a big performance hit.
Same with the DirtyBackdoor. We get the lock, then we regenerate many things. And then it's more than a perf hit: if generating content for the HttpRuntimeCacheProvider implies using other values from that cache, then we end up recursively re-entering the cache and deadlocking.
In our case, entering DirtyBackdoor locks resolution, and then in there we try to use the cache, while another thread has locked the cache and is waiting for resolution.
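That two-lock cycle can be sketched in a few lines. This is a hedged stand-in (Java rather than C#, ReentrantLock rather than ReaderWriterLockSlim, and all names are illustrative); tryLock with a timeout is used so the demo terminates and reports the deadlock instead of hanging the way the real site did:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of the reported cycle: one thread holds the "cache" lock and wants the
// "resolution" lock, the other holds "resolution" and wants "cache".
public class DeadlockSketch {
    static final ReentrantLock cacheLock = new ReentrantLock();      // stands in for the cache-wide lock
    static final ReentrantLock resolutionLock = new ReentrantLock(); // stands in for the resolution lock
    static final CountDownLatch bothHeld = new CountDownLatch(2);    // both threads hold their first lock
    static final CountDownLatch attemptsDone = new CountDownLatch(2);
    static final AtomicInteger blockedThreads = new AtomicInteger();

    static Thread worker(ReentrantLock first, ReentrantLock second) {
        return new Thread(() -> {
            first.lock();
            try {
                bothHeld.countDown();
                bothHeld.await();                        // wait until the other thread holds its lock too
                boolean got = second.tryLock(200, TimeUnit.MILLISECONDS);
                if (got) second.unlock();
                else blockedThreads.incrementAndGet();   // would block forever without the timeout
                attemptsDone.countDown();
                attemptsDone.await();                    // hold the first lock until both attempts finished
            } catch (InterruptedException ignored) {
            } finally {
                first.unlock();
            }
        });
    }

    public static void main(String[] args) throws InterruptedException {
        Thread a = worker(cacheLock, resolutionLock);    // like: GetCacheItem -> resolver
        Thread b = worker(resolutionLock, cacheLock);    // like: DirtyBackdoor -> cache
        a.start(); b.start();
        a.join(); b.join();
        System.out.println("deadlocked threads: " + blockedThreads.get()); // prints 2
    }
}
```

Both threads fail to acquire their second lock, which is exactly the state the dump shows: everyone parked inside ReaderWriterLockSlim, no progress until IIS recycles.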
The fixes:
- reduce lock contention on HttpRuntimeCacheProvider by storing Lazy-wrapped values, so values are generated outside the global cache lock;
- use DirtyBackdoor (until we can get rid of it) ''exclusively'' for resolution-related stuff. So, prepare data beforehand, and enter the DirtyBackdoor ''only'' to reconfigure the resolver that needs to be reconfigured.
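The first fix (store lazily-initialized values so the factory runs outside the cache-wide lock) can be sketched as follows. This is a Java stand-in for .NET's Lazy<T> plus a concurrent map, with hypothetical names, not Umbraco's actual code:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// The cache stores lazy wrappers: the only work done under the map's internal
// locking is inserting a cheap wrapper. The (possibly slow) value factory runs
// afterwards, outside any cache-wide lock, and at most once per key.
public class LazyCache {
    private final ConcurrentHashMap<String, Lazy<Object>> cache = new ConcurrentHashMap<>();

    // Minimal equivalent of .NET's Lazy<T>: thread-safe, runs the factory at most once.
    static final class Lazy<T> {
        private final Supplier<T> factory;
        private volatile boolean created;
        private T value;
        Lazy(Supplier<T> factory) { this.factory = factory; }
        T get() {
            if (!created) {
                synchronized (this) {
                    if (!created) { value = factory.get(); created = true; }
                }
            }
            return value;
        }
    }

    public Object getCacheItem(String key, Supplier<Object> factory) {
        // Only the cheap wrapper insertion happens inside the map;
        // the factory itself executes after we have left the map.
        Lazy<Object> lazy = cache.computeIfAbsent(key, k -> new Lazy<>(factory));
        return lazy.get();
    }
}
```

Cache misses for ''different'' keys are no longer serialized against each other, and a factory that re-enters the cache for another key can no longer deadlock against the cache-wide lock (same-key recursion would of course still self-deadlock).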
Currently running a site with those fixes implemented to see how it works.
Pull request on 6.2.1, for review: https://github.com/umbraco/Umbraco-CMS/pull/384
Will this be pushed to a 7.x release (7.1.4 possibly?), as we are having a similar behaviour issue on our production site running 7.1.3.
The IIS process grinds to a halt and then locks up the site entirely, causing the app pool to crash, resulting in a 503 error. It happens approx 1-2 times a day and causes about 20 min of outage.
It's the only real critical error we have on the site.
Let me know, and thanks for the hard work!
Due in 6.2.1 and 7.1.4 yes. Glad (well...) to know others are impacted, we were surprised that the issue had not been reported more widely.
We had this happen again over the weekend (around 3 hours of downtime). Attached is the overview of activity on the server: CPU spikes, requests drop and errors spike. Stephen, how's the site you are currently running with the fixes?
Sites running with the fixes are fine with regards to that issue, i.e. we don't experience lockups anymore. Some unrelated high-volume perf issues, but that is something else.
About fixes: they should be part of 6.2.1 and 7.1.4, but they touch the very core of Umbraco's internal cache, so I'm reviewing them again and again to make sure they don't break anything else. There have already been several iterations of them...
What might be different in your case is... I can see the inbound/outbound traffic going down to zero, but the CPU seems to remain relatively high? I don't know the scale of your graph, but between 6 and 9 the machine seems busy. When all requests are deadlocked, (a) the machine does more or less nothing and (b) IIS (or maybe it is the CLR itself) kills everything, and that takes 15-30 minutes by default, not 3 hrs.
Did you do anything special to restore service? Would you be able to take a process dump when the server is down? I can give you instructions; it's a really simple task, and I'd really like to have a look at that dump to figure out whether your machine is locked (same issue) or it's something different.
Hi Stephen, no problem. We have now enabled the process dump for all requests taking longer than 15 seconds (it's an Azure website) and I'll get these over to you once it happens again (most likely within the next 24 hours).
In terms of resolution, currently the only thing that works is stopping the website for a few seconds and then sparking it up again (restarting it does not help), and sometimes even this does not help. Usually this hangup and crash only lasts approx 20-40 minutes; however, this instance was much longer, as this weekend had the highest expected and actual volume for the year (the site has been live for approx 5 weeks now).
It's a pretty standard Umbraco deployment with no additional transaction logic or code (mostly just content management), so the locking up of the site is an interesting one. The only complication is the high volume of traffic (well, relatively high volume).
Let me know if you got any more insight into the issue. Much appreciated.
Quite interested in seeing some dumps, send some soon as you can.
Since the release of this is coming up in two days, we won't have time to properly add this to the next release. We're definitely interested in your logs @Gordon.Pretorius and remind me to do a custom build with just these changes after 6.2.1 has been released so you can test if you want.
We had it happen again this evening (again down for 2 hours). I have attached another screenshot of the server performance at the time it happened. As you can see, the processor is very busy whilst the rest of the site grinds to a halt. This time round, restarting the application or shutting down the website didn't have any impact.
I have included a link to the raw process dumps as requested. It's 1.4 GB zipped up, so it's up on my Dropbox.
We only dumped requests that took longer than 15 seconds; the creation of these dumps coincides with the site freezing up.
Let me know if there is anything else you require from me and thanks for your help in getting this issue resolved.
Need to give the client some feedback so any insight would be appreciated.
Getting the dump.
The file contains several dumps... does each dump match an app lock?
From what I can see in the dumps, most threads are in:
System.Threading.Thread.SleepInternal(Int32)
System.Threading.ReaderWriterLockSlim.SpinWait(Int32)
System.Threading.ReaderWriterLockSlim.TryEnterUpgradeableReadLockCore(TimeoutTracker)
System.Threading.ReaderWriterLockSlim.TryEnterUpgradeableReadLock(TimeoutTracker)
Examine.LuceneEngine.Providers.LuceneSearcher.ValidateSearcher(Boolean)
Examine.LuceneEngine.Providers.LuceneSearcher.GetSearcher()
Examine.LuceneEngine.Providers.BaseLuceneSearcher.Search(Examine.SearchCriteria.ISearchCriteria, Int32)
Examine.LuceneEngine.Providers.BaseLuceneSearcher.Search(Examine.SearchCriteria.ISearchCriteria)
Umbraco.Web.PublishedCache.XmlPublishedCache.PublishedMediaCache.GetUmbracoMedia(Int32)
Umbraco.Web.PublishedCache.XmlPublishedCache.PublishedMediaCache.GetById(Umbraco.Web.UmbracoContext, Boolean, Int32)
Umbraco.Web.PublishedCache.ContextualPublishedCache`1[[System.__Canon, mscorlib]].GetById(Boolean, Int32)
Umbraco.Web.PublishedContentQuery.DocumentById(Int32, Umbraco.Web.PublishedCache.ContextualPublishedCache, System.Object)
Umbraco.Web.PublishedContentQuery.Media(Int32)
Umbraco.Web.UmbracoHelper.Media(System.String)
ASP._Page_Views_Partials_SocialCarousel_cshtml.Execute()
...
Or some other sort of lock-waiting in LuceneSearcher.ValidateSearcher, while one thread is in:
System.Security.Util.StringExpressionSet.AddExpressions(System.Collections.ArrayList, Boolean)
System.Security.Permissions.FileIOPermission.AddPathList(System.Security.Permissions.FileIOPermissionAccess, System.Security.AccessControl.AccessControlActions, System.String, Boolean, Boolean, Boolean)
System.Security.Permissions.FileIOPermission..ctor(System.Security.Permissions.FileIOPermissionAccess, System.String)
System.IO.FileSystemInfo.get_FullName()
Lucene.Net.Store.SimpleFSDirectory.OpenInput(System.String, Int32)
Lucene.Net.Store.FSDirectory.OpenInput(System.String)
Lucene.Net.Index.SegmentInfos.Read(Lucene.Net.Store.Directory, System.String)
Lucene.Net.Index.SegmentInfos+AnonymousClassFindSegmentsFile.DoBody(System.String)
Lucene.Net.Index.SegmentInfos+FindSegmentsFile.Run(Lucene.Net.Index.IndexCommit)
Lucene.Net.Index.SegmentInfos.Read(Lucene.Net.Store.Directory)
Lucene.Net.Index.SegmentInfos.ReadCurrentVersion(Lucene.Net.Store.Directory)
Lucene.Net.Index.DirectoryReader.IsCurrent()
Examine.LuceneEngine.LuceneExtensions.GetReaderStatus(Lucene.Net.Index.IndexReader)
Examine.LuceneEngine.Providers.LuceneSearcher.ValidateSearcher(Boolean)
...
So that one has entered the lock in LuceneSearcher.ValidateSearcher, and seems to be doing "something" that is blocking all the other threads. Can't tell yet whether it is locked itself or not... At the moment I think your issue is not related to ours (not the same type of lock) but is related to some sort of lock within Examine.
Running out of time, will try to get back to it asap.
Each one of the dumps relates to a lock (we set it up to log a file for each request taking longer than 15 seconds). It was building up quite quickly so we had to stop the logging, but the ones I sent to you were the first ones.
We originally suspected the cause to be Examine (as all other posts in the forum regarding 503 errors relate to Examine in some sense). We are not using Examine anywhere and could remove it; however, I suspect there is some core functionality in Umbraco relying on it? Is this assumption correct? If not, can we remove Examine from this installation? We are not seeing any errors for Examine in the logs...
The situation is getting quite desperate with the client so anything you can recommend to prevent it, or to try, please let me know.
Ah OK so the server dumps whenever a request takes longer than 15 seconds. Got it.
Not a good idea to entirely get rid of Examine. As the stack traces show, sh*t happens when you retrieve a media item (using Umbraco.Media(xxx)), because media items are cached in Examine. If you remove Examine then media will be fetched from the DB and not cached (assuming it works at all) and perf will be bad.
Can I suggest going to the dev section and rebuilding the Examine indexes?
Other than that... I need to look into all the dumps + Examine code. Running out of time unfortunately :-( but will do my best.
We have rebuilt the indexes as well as deleted them and forcing the application to rebuild them from scratch. No luck I'm afraid. But will give it a go tonight if it goes down again.
Can you tell us if there is anything out of the ordinary with your setup? Load balancing, etc... ? Do you have any anti-virus software running? If so can you ensure you disable it from scanning the files/folders of your website.
There was an update to Examine for 6.2 to fix other high concurrency issues - you could try downgrading to the previous version: https://www.nuget.org/packages/Examine/0.1.52.2941
The other 503 errors referencing Examine found on our forum are, I think, quite old and probably not relevant here. There's a ton of sites using Examine that haven't reported a similar issue - but that could also be because the latest version is not used as much on those sites.
I haven't had time to review these dumps but will get on to that on Monday. In the meantime, please let us know if there's anything about your environment that we should be aware of. You can try the Examine downgrade and see how that goes too and I'll have a look at any changes to the Examine code to see if there's anything obvious.
I have a feeling it's a change in the latest version to the locking pattern, namely the use of an upgradeable read lock that is not required.
Please downgrade your Examine version to the one specified above and see if that solves your problem. I will update the code and try to get some tests running to verify the locking.
Initially we ran it as a single instance, however the server was locking up quite regularly (2 times a day). It's running in Windows Azure as a website. There is no anti-virus and the website is a stock standard cloud instance.
We noticed the more instances we run, the less likely the chance of the server going down. I suspect it's because whilst one of the instances locks up, the built-in load balancing offloads the traffic onto other instances. In some cases two, or even all 3, instances would lock up. Currently we are running 5 instances (at a big expense), which has not crashed tonight. This is not ideal, as one large instance should be more than sufficient to run the website.
All instances reference the same file system (this is by design on Azure and it even allows you to install Umbraco as a starter kit for your website); however this should not affect the locks, as it occurs when running a single instance.
Let me know if you need any more info. Thanks for the assistance. I'll try the Examine downgrade and get back to you.
Right, so this is definitely part of your problem. Umbraco does not support load balancing on WAWS out of the box for the precise reason that you state - WAWS uses a single file share for all instances in the same region - thus you will get file locks.
You must setup Umbraco very specifically for use with WAWS when load balancing and even still it will not work (yet) due to how Lucene works. There's been much discussion about load balancing and WAWS, some references can be found here:
You MUST do a few things to make this almost work:
We have worked on getting Umbraco to work seamlessly in LB with WAWS and I've been meaning to blog about it but haven't had the time yet. Until then though, Umbraco load balanced with WAWS will have issues.
My advice would be to downgrade your Examine DLL, move back to a single server, and see if your problem is fixed.
We do not want to load balance the site. The only reason we are currently doing it is to prevent the downtime. When running it in a single instance, the site locks up, crashes and then gives the 503 errors (and this happens regularly, approx twice a day). We originally launched the site as a single instance, however due to the downtime had to increase the instances.
Ok, well it's worth noting that increasing the instances may cause you other problems as mentioned above. Give the Examine DLL downgrade a try; I think the latest Examine changes may be causing your issue.
Will do and will let you know how it goes.
please do, and thanks for your patience!
Just FYI I've updated the Examine source to remove the unnecessary upgradeable read lock when it checks the reader status:
I'm not sure what I was thinking when putting that logic in there, since an upgradeable read lock still does not allow concurrent thread access. Now the logic is more like the previous version, where a lock is only applied when the reader is out of date.
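In Java terms, that read-mostly validation pattern might look like the sketch below (hedged: ReentrantReadWriteLock standing in for .NET's ReaderWriterLockSlim, and the searcher/version fields are hypothetical stand-ins for Examine's internals, not its API). Note that Java's lock does not support upgrading at all, so the read lock must be released before taking the write lock, with a re-check afterwards:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Readers normally take only a shared read lock, so they can run concurrently.
// Only when the "searcher" is stale do we take the exclusive write lock, and we
// re-check staleness after acquiring it (another thread may have rebuilt it).
public class SearcherGuard {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private volatile long version = 0;   // bumped when the underlying index changes
    private long searcherVersion = -1;   // version the current "searcher" was built from

    public void indexChanged() { version++; }   // illustrative; not thread-safe increment

    public long validateSearcher() {
        lock.readLock().lock();
        try {
            if (searcherVersion == version) return searcherVersion; // fast path, fully concurrent
        } finally {
            lock.readLock().unlock();
        }
        // Stale: release the read lock first (no upgrade in Java), then take the write lock.
        lock.writeLock().lock();
        try {
            if (searcherVersion != version) {
                searcherVersion = version;   // "rebuild the searcher" would happen here
            }
            return searcherVersion;
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

The contrast with an upgradeable read lock: only one thread at a time may hold the upgradeable lock, so taking it on every validation serializes all readers even when nothing is stale.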
Let us know how you go with the downgrade, if that works then this was definitely the issue.
We downgraded Examine (https://www.nuget.org/packages/Examine/0.1.52.2941) as suggested (This updated the references to Lucene.Net.dll & ICSharpCode.SharpZipLib.dll which is expected)
We also reduced the instances to one large instance to avoid any locks as suggested above. We uploaded these changes to the production environment (after testing on staging).
This all happened at around 9AM this morning. As you can see from the attached file, we have had 1h 23min of downtime in the last 5h (all in small intervals). We have new process dumps (for all requests taking longer than 15 seconds) if you guys are interested in viewing them.
I guess based on this the fix suggested regarding Examine did not work.
We are back to 5 instances (this was after the dumps were created) to increase the perceived uptime of the site (which is causing caching issues for the content editors, as is to be expected).
P.S. Download link to the dumps to follow shortly.
Sorry guys, just hold that thought for one second, I'm confirming something in Nuget!
OK, quick update. When running the following in NuGet, "PM> Install-Package Examine -Version 0.1.52.2941", it does not actually update the DLLs. Great. So effectively the DLL was never actually downgraded, hence the locks and downtime mentioned above still occurred.
We have now properly downgraded the DLLs and uploaded them to production. At least this gives us a good benchmark to test the update against. Will keep you posted, and thank you for all your help!
Was away for a couple of days; back now. The Examine+LB issue makes sense, as the dumps indicate that all threads are waiting on one thread, which in turn is waiting for a file lock it probably cannot acquire.
If it crashes again with the old Examine, I'm definitely interested in the dumps.
Just wanted to give some feedback. It looks like the suggested fix (downgrading Examine) worked like a charm. We have not had any downtime since the update. I realise this does not relate to the original problem, but I just wanted to say thanks to everyone for their help.
Much appreciated! and high fives!
Let me know if I should create a separate ticket outlining the issue and the suggested fix, and you guys can integrate that into the next version.
Great news (well, sort of :P )
I'll get things updated on our end.
Apologies Shannon, was mid typing and hit the "add comment" button without thinking.
Thanks for the help!
We don't see the application slow down but since upgrading to 6.2.0 the application locks several times per day. The only way to get the website back is to recycle the app pool.
I've tried to downgrade the Examine library but no joy.
How could I best check to see if this is the same issue please?
Is there a way to expedite the fix? Anything I can do to help?
Richard - in order to solve your issue we need details about what you are experiencing. Do you have errors in your umbraco log, etc... ?
Gordon - I've released a new Examine version with the lock fix, all tests are passing including concurrent thread tests (ramped it up to 10000 concurrent ones and no issues) https://www.nuget.org/packages/Examine/0.1.56.2941
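A minimal version of that kind of concurrent-thread test (hedged: a Java sketch of the general idea, not Examine's actual test suite) hammers a read-mostly locked code path from many tasks at once and verifies that everything completes, i.e. no lockup, rather than asserting on timing:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Stress a ReentrantReadWriteLock-guarded resource with 10,000 tasks: mostly
// readers (which may run concurrently) plus an occasional writer. A lockup
// would surface as a timeout in Future.get().
public class LockStressTest {
    static final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    static int shared = 0;

    static int run() throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(32);
        List<Future<?>> futures = new ArrayList<>();
        for (int t = 0; t < 10_000; t++) {
            final int i = t;
            futures.add(pool.submit(() -> {
                if (i % 100 == 0) {               // occasional writer
                    lock.writeLock().lock();
                    try { shared++; } finally { lock.writeLock().unlock(); }
                } else {                          // mostly readers
                    lock.readLock().lock();
                    try { long ignored = shared; } finally { lock.readLock().unlock(); }
                }
            }));
        }
        for (Future<?> f : futures) f.get(30, TimeUnit.SECONDS); // fails on lockup
        pool.shutdown();
        return shared;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("writes applied: " + run()); // prints 100
    }
}
```

With an upgradeable-style lock taken on every read, a test like this would show readers fully serialized; with a plain read lock, the readers proceed concurrently and all 10,000 tasks finish quickly.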
Just updating the Umbraco core 7.1.5 and latest 6.2.x to use this version.
Have closed the existing pull requests (https://github.com/umbraco/Umbraco-CMS/pull/388 and https://github.com/umbraco/Umbraco-CMS/pull/384) and created a new one targeted at 6.2(.2): https://github.com/umbraco/Umbraco-CMS/pull/418.
It has the latest code in it, including a recent fix for an issue there was with Lazy
Much appreciated. We have just started a new Umbraco build for another client and will use this version when ready (7.1.5); for now we are using the previously suggested fix.
Thanks for getting back to me. We don't see any errors in the umbracotrace.txt apart from ones related to an issue with one of our macros. What details would you like?
Would this new examine.dll fix the issue, in which case is it worth me trying this new file?
Your help with this is much appreciated...
The problem has just happened again and this is the output from the log file. Nothing was added into the log file until I recycled the app pool so I'm guessing the app was locked? Hope this helps but please let me know if you need anything else...
2014-06-18 15:56:51,213  INFO Umbraco.Core.PluginManager - [Thread 1] Determining hash of code files on disk
2014-06-18 15:56:51,291  INFO Umbraco.Core.PluginManager - [Thread 1] Hash determined (took 76ms)
2014-06-18 15:56:51,306  INFO Umbraco.Core.PluginManager - [Thread 1] Starting resolution types of umbraco.interfaces.IApplicationStartupHandler
2014-06-18 15:56:51,337  INFO Umbraco.Core.PluginManager - [Thread 1] Completed resolution of types of umbraco.interfaces.IApplicationStartupHandler, found 28 (took 29ms)
2014-06-18 15:56:51,415  INFO Umbraco.Core.PluginManager - [Thread 1] Starting resolution types of Umbraco.Core.PropertyEditors.IPropertyEditorValueConverter
2014-06-18 15:56:51,415  INFO Umbraco.Core.PluginManager - [Thread 1] Completed resolution of types of Umbraco.Core.PropertyEditors.IPropertyEditorValueConverter, found 0 (took 1ms)
2014-06-18 15:56:51,415  INFO Umbraco.Core.PluginManager - [Thread 1] Starting resolution types of Umbraco.Core.PropertyEditors.IPropertyValueConverter
2014-06-18 15:56:51,415  INFO Umbraco.Core.PluginManager - [Thread 1] Completed resolution of types of Umbraco.Core.PropertyEditors.IPropertyValueConverter, found 11 (took 1ms)
2014-06-18 15:56:51,447  INFO Umbraco.Core.PluginManager - [Thread 1] Starting resolution types of Umbraco.Web.Mvc.SurfaceController
2014-06-18 15:56:51,462  INFO Umbraco.Core.PluginManager - [Thread 1] Completed resolution of types of Umbraco.Web.Mvc.SurfaceController, found 4 (took 1ms)
2014-06-18 15:56:51,462  INFO Umbraco.Core.PluginManager - [Thread 1] Starting resolution types of Umbraco.Web.WebApi.UmbracoApiController
2014-06-18 15:56:51,462  INFO Umbraco.Core.PluginManager - [Thread 1] Completed resolution of types of Umbraco.Web.WebApi.UmbracoApiController, found 3 (took 1ms)
2014-06-18 15:56:51,462  INFO Umbraco.Core.PluginManager - [Thread 1] Starting resolution types of Umbraco.Core.Media.IThumbnailProvider
2014-06-18 15:56:51,462  INFO Umbraco.Core.PluginManager - [Thread 1] Completed resolution of types of Umbraco.Core.Media.IThumbnailProvider, found 3 (took 0ms)
2014-06-18 15:56:51,462  INFO Umbraco.Core.PluginManager - [Thread 1] Starting resolution types of Umbraco.Core.Media.IImageUrlProvider
2014-06-18 15:56:51,478  INFO Umbraco.Core.PluginManager - [Thread 1] Completed resolution of types of Umbraco.Core.Media.IImageUrlProvider, found 1 (took 0ms)
2014-06-18 15:56:51,837  INFO Umbraco.Web.Search.ExamineEvents - [Thread 1] Initializing Examine and binding to business logic events
2014-06-18 15:56:54,473  INFO Umbraco.Web.Search.ExamineEvents - [Thread 1] Adding examine event handlers for index providers: 3
2014-06-18 15:56:54,489  INFO Umbraco.Core.PluginManager - [Thread 1] Starting resolution types of umbraco.interfaces.IApplication
2014-06-18 15:56:54,489  INFO Umbraco.Core.PluginManager - [Thread 1] Completed resolution of types of umbraco.interfaces.IApplication, found 7 (took 1ms)
2014-06-18 15:56:54,613  INFO Umbraco.Core.PluginManager - [Thread 1] Starting resolution types of umbraco.interfaces.ITree
2014-06-18 15:56:54,613  INFO Umbraco.Core.PluginManager - [Thread 1] Completed resolution of types of umbraco.interfaces.ITree, found 27 (took 3ms)
2014-06-18 15:56:55,378  INFO umbraco.content - [Thread 18] Loading content from disk cache...
2014-06-18 15:56:55,815  INFO Umbraco.Core.PluginManager - [Thread 11] Starting resolution types of Umbraco.Core.Persistence.Mappers.BaseMapper
2014-06-18 15:56:55,815  INFO Umbraco.Core.PluginManager - [Thread 11] Completed resolution of types of Umbraco.Core.Persistence.Mappers.BaseMapper, found 20 (took 2ms)
2014-06-18 15:56:57,406  INFO Umbraco.Core.PluginManager - [Thread 19] Starting resolution types of umbraco.interfaces.IDataType
Richard, what would really help is a memory dump of IIS when it is locked. There are instructions at the top of this issue on how to get a dump.
Here's a dump of IIS during lock state as requested: http://www.dropbox.com/s/nbkiwmlj32sgmlo/iis.7z.
Please let me know if I can help in any other way.
Is it worth me trying the new examine dll (https://www.nuget.org/packages/Examine/0.1.56.2941)?
You have about 250 threads in the dump. Most of them are waiting for a read-lock on Umbraco's internal cache. But look at thread 54:
OS Thread Id: 0x13e8 (54)
System.Threading.WaitHandle.WaitOneNative(System.Runtime.InteropServices.SafeHandle, UInt32, Boolean, Boolean)
System.Threading.WaitHandle.InternalWaitOne(System.Runtime.InteropServices.SafeHandle, Int64, Boolean, Boolean)
System.Threading.ReaderWriterLockSlim.WaitOnEvent(System.Threading.EventWaitHandle, UInt32 ByRef, TimeoutTracker)
Umbraco.Core.Cache.HttpRuntimeCacheProvider.GetCacheItem(System.String, System.Func<System.Object>, System.Nullable<System.TimeSpan>, Boolean, System.Web.Caching.CacheItemPriority, System.Web.Caching.CacheItemRemovedCallback, System.Web.Caching.CacheDependency)
Umbraco.Core.Cache.HttpRuntimeCacheProvider.GetCacheItem(System.String, System.Func<System.Object>, System.Nullable<System.TimeSpan>, Boolean, System.Web.Caching.CacheItemPriority, System.Web.Caching.CacheItemRemovedCallback, System.String)
Umbraco.Core.Cache.CacheProviderExtensions.GetCacheItem[[System.__Canon, mscorlib]](Umbraco.Core.Cache.IRuntimeCacheProvider, System.String, System.Func<System.__Canon>, System.Nullable<System.TimeSpan>, Boolean, System.Web.Caching.CacheItemPriority, System.Web.Caching.CacheItemRemovedCallback, System.String)
Umbraco.Core.CacheHelper.GetCacheItem[[System.__Canon, mscorlib]](System.String, System.TimeSpan, System.Func<System.__Canon>)
Umbraco.Core.PluginManager.ResolveTypes[[System.__Canon, mscorlib]](System.Func<System.Collections.Generic.IEnumerable<System.Type>>, TypeResolutionKind, Boolean)
Umbraco.Core.PluginManager.ResolveTypes[[System.__Canon, mscorlib]](Boolean, System.Collections.Generic.IEnumerable<System.Reflection.Assembly>)
And then at thread 55:
OS Thread Id: 0x2040 (55)
System.Threading.WaitHandle.WaitOneNative(System.Runtime.InteropServices.SafeHandle, UInt32, Boolean, Boolean)
System.Threading.WaitHandle.InternalWaitOne(System.Runtime.InteropServices.SafeHandle, Int64, Boolean, Boolean)
System.Threading.ReaderWriterLockSlim.WaitOnEvent(System.Threading.EventWaitHandle, UInt32 ByRef, TimeoutTracker)
Umbraco.Core.PluginManager.ResolveTypes[[System.__Canon, mscorlib]](System.Func<System.Collections.Generic.IEnumerable<System.Type>>, TypeResolutionKind, Boolean)
Umbraco.Core.PluginManager.ResolveAttributedTypes[[System.__Canon, mscorlib]](Boolean, System.Collections.Generic.IEnumerable<System.Reflection.Assembly>)
What we see is that thread 55 has entered GetCacheItem() and is executing GetXsltExtensionsImpl() - so it has a lock on the cache, but then it calls PluginManager.ResolveTypes() which wants to lock the plugin manager.
In the meantime, thread 54 has entered ResolveTypes() and is executing TryGetCachedPluginFromFiles() - so it has a lock on the plugin manager, but then it calls GetCacheItem() which wants to lock the cache.
Deadlock. I believe that the root cause is that GetCacheItem() keeps a lock on the entire cache while generating a value (here, with GetXsltExtensionsImpl). We need finer granularity. This should be fixed with PR https://github.com/umbraco/Umbraco-CMS/pull/418 which is still under review.
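For readers following along: this is a classic lock-order inversion. The sketch below is a minimal, hedged reproduction - the class and field names are illustrative stand-ins, not Umbraco's actual internals - showing how two ReaderWriterLockSlim instances acquired in opposite order deadlock each other (TryEnter with a timeout is used so the demo terminates instead of hanging):

```csharp
using System;
using System.Threading;

class LockOrderInversionDemo
{
    // Illustrative stand-ins for the cache lock and the plugin-manager lock.
    static readonly ReaderWriterLockSlim CacheLock = new ReaderWriterLockSlim();
    static readonly ReaderWriterLockSlim PluginLock = new ReaderWriterLockSlim();

    // Thread 55's path: cache lock first, then plugin-manager lock.
    static void CacheThenPlugin()
    {
        CacheLock.EnterUpgradeableReadLock();
        try
        {
            Thread.Sleep(100); // widen the race window
            if (!PluginLock.TryEnterUpgradeableReadLock(1000))
                Console.WriteLine("thread A: would deadlock waiting for plugin lock");
            else
                PluginLock.ExitUpgradeableReadLock();
        }
        finally { CacheLock.ExitUpgradeableReadLock(); }
    }

    // Thread 54's path: plugin-manager lock first, then cache lock.
    static void PluginThenCache()
    {
        PluginLock.EnterUpgradeableReadLock();
        try
        {
            Thread.Sleep(100);
            if (!CacheLock.TryEnterUpgradeableReadLock(1000))
                Console.WriteLine("thread B: would deadlock waiting for cache lock");
            else
                CacheLock.ExitUpgradeableReadLock();
        }
        finally { PluginLock.ExitUpgradeableReadLock(); }
    }

    static void Main()
    {
        var a = new Thread(CacheThenPlugin);
        var b = new Thread(PluginThenCache);
        a.Start(); b.Start();
        a.Join(); b.Join();
        // With Enter... instead of TryEnter..., both threads would block forever:
        // each holds the lock the other one wants.
    }
}
```

The fix direction described above (not holding the cache lock while the value factory runs) removes the nested acquisition entirely, which is the only reliable cure for this pattern.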
@zpqrtbnk The more I think about the repercussions of UpgradeableReadLock, the more I think we should refrain from using it, since I'm pretty sure it doesn't actually solve the problem that I thought it did. I've read this article dozens of times: http://ayende.com/blog/4349/using-readerwriterlockslims-enterupgradeablereadlock
and have only recently realized that it is correct: "Only one thread is able to enter upgradable read lock". That actually makes a lot of sense, and if we use it that way we might as well just use a lock(object) statement, which negates the benefit of using ReaderWriterLockSlim in the first place.
It would be worth looking at all the places where we use this and refactoring them. The original reason I liked it (quite some time ago, back in v5) was that it seemed to avoid having to write double-check locks, when in fact a true C# double-check lock is still far more efficient. Better yet, if we can refactor our code a little to use correct read/write locks, it would be faster and less prone to this sort of issue (I think).
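For reference, the "true C# double-check lock" mentioned above looks like this - a generic sketch, not Umbraco code (in modern code `Lazy<T>` achieves the same thing with less ceremony):

```csharp
using System;

class LazyCache
{
    static readonly object SyncRoot = new object();
    // volatile so the unsynchronized first read observes a fully published value
    static volatile object _value;

    public static object GetOrCreate(Func<object> factory)
    {
        // First check, without taking the lock: the common, uncontended path.
        if (_value == null)
        {
            lock (SyncRoot)
            {
                // Second check: another thread may have initialized the value
                // while we were waiting for the lock.
                if (_value == null)
                    _value = factory();
            }
        }
        return _value;
    }
}
```

Note the factory still runs under the lock here, so if it can call into other locked code the deadlock risk discussed in this thread remains; the efficiency win is only for the read path.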
Would the lock errors above be related to the error I am seeing below? (This only started happening after an upgrade from 6.1.6 to 6.2.1.)
System.Data.SqlClient.SqlException: Transaction (Process ID 68) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction. Generated: Fri, 20 Jun 2014 05:14:54 GMT
System.Data.SqlClient.SqlException (0x80131904): Transaction (Process ID 68) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
at System.Data.SqlClient.SqlDataReader.TryHasMoreRows(Boolean& moreRows)
at System.Data.SqlClient.SqlDataReader.TryReadInternal(Boolean setTimeout, Boolean& more)
at System.Data.SqlClient.SqlDataReader.Read()
at Umbraco.Core.Persistence.Database.<Query>d__1c`1.MoveNext()
at Umbraco.Core.Persistence.Repositories.VersionableRepositoryBase`2.GetPropertyCollection(Int32 id, Guid versionId, IContentTypeComposition contentType, DateTime createDate, DateTime updateDate)
at Umbraco.Core.Persistence.Repositories.RepositoryBase`2.GetByQuery(IQuery`1 query)
at Umbraco.Core.Services.MemberService.GetByUsername(String userName)
at Umbraco.Web.Security.Providers.UmbracoMembershipProvider`2.GetUser(String username, Boolean userIsOnline)
at System.Web.Security.Membership.GetUser(String username, Boolean userIsOnline)
at WG2KProtected.ChildrenProtected(DynamicNode d)
@Shandem Getting an upgradeable read lock prevents any other thread from also getting an ''upgradeable'' read lock, but allows other threads to get a ''normal'' read lock. The idea would be that most of your reads are normal, with the exception of a few of them that would be upgradeable.
There ''are'' functions (such as GetCacheItemsByKeySearch) where we ''do'' just get a normal read lock, so they can run in parallel, including while we have entered the upgradeable read lock and before we actually upgrade it.
All these "read" methods need to be able to run in parallel, so I'm not sure we can just lock().
@Bradley Unrelated, unless I'm totally wrong. What you're seeing is a database-level deadlock. Not saying it's not an issue, and it ''is'' also a deadlock, but it happens in a totally different place and for a different reason. You'd probably better create an entirely new issue.
@stephan yup i have already U4-5069 Deadlock occurring in MemberService.GetByUsername
@zpqrtbnk - I understand the concept of the upgradeable read lock but where we are using it in some places is not necessary (and will be in fact worse performance) because we are not using a read lock + a write lock to manage a resource. Not sure if you read all of Ayende's article but the important parts are:
"Only one thread is able to enter upgradable read lock. That means that in code such as this, where this is the only access that we have, we have in effect turned the reader writer lock into a single write lock. Since only one thread can enter upgradable read, it means that we might have well used a standard lock(syncLock) statement."
"Upgradable Read is only useful if you have multiple ways of accessing the data, some that are purely read and some (rare) that are upgradable reads. If most / all of your calls go through upgradable read code paths, you are better off using read/write and handling the lock release explicitly."
It's the places in our code that we aren't using a read + write locking on a resource and only relying on an upgradeable read lock that needs refactoring.
@Shandem I ''think'' I understand it, and have read Ayende's article. And we ''do'' have parts that are "purely read" though our upgradable reads are probably not as "rare" as they should. But a standard lock(syncLock) would hurt all our pure reads, wouldn't it?
The point is to avoid a normal lock all together - When using upgradeable read lock without other read / write locks it's essentially the same as a normal lock which is not ideal - and in those cases it's better to do normal double check locking... Or better yet refactor to use read + write locks correctly to avoid double check
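The "correct read + write locks" pattern being discussed would look roughly like the following sketch (illustrative, not Umbraco's actual code): pure readers take a read lock and run in parallel; only the miss path takes the write lock, with a re-check inside it; and critically, the value factory runs ''outside'' any lock, which is the finer granularity mentioned earlier in this thread:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

class ReadWriteCache<TKey, TValue>
{
    readonly ReaderWriterLockSlim _lock = new ReaderWriterLockSlim();
    readonly Dictionary<TKey, TValue> _items = new Dictionary<TKey, TValue>();

    public TValue GetOrAdd(TKey key, Func<TValue> factory)
    {
        _lock.EnterReadLock();
        try
        {
            TValue value;
            if (_items.TryGetValue(key, out value))
                return value; // pure reads run concurrently
        }
        finally { _lock.ExitReadLock(); }

        // The factory runs outside any lock, so it cannot deadlock against
        // other locks (e.g. the plugin manager's) - the trade-off is that
        // it may run more than once under contention.
        TValue created = factory();

        _lock.EnterWriteLock();
        try
        {
            TValue value;
            if (_items.TryGetValue(key, out value))
                return value; // re-check: another thread won the race
            _items[key] = created;
            return created;
        }
        finally { _lock.ExitWriteLock(); }
    }
}
```

Compared with a single upgradeable read lock around the whole get-or-create, this keeps all read paths concurrent and holds the exclusive lock only for the dictionary insert.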
Is it possible for this issue to affect a 6.1.6 install? Over the past few months our Umbraco cache has been crashing at random. The site can easily be brought back up by republishing the entire site in the back-office.
Looking at the logs I can see that the crash is always preceded by entries like Richard posted on the 18th. At the time of the crash the log always shows something like the following: the error is always the same, but the actual stack trace varies.
2014-06-23 08:57:05,617  ERROR umbraco.cms.businesslogic.web.DocumentType - [Thread 70] Exception while trying to build DTD for Xml schema; is Umbraco installed correctly and the connection string configured?
System.InvalidOperationException: Resolution is not frozen, it is not yet possible to get values from it.
at Umbraco.Core.Models.Template..ctor(String path, String name, String alias)
at Umbraco.Core.Persistence.Factories.TemplateFactory.BuildEntity(TemplateDto dto)
at Umbraco.Core.Persistence.Repositories.TemplateRepository.PerformGet(Int32 id)
at Umbraco.Core.Persistence.Repositories.RepositoryBase`2.Get(TId id)
at Umbraco.Core.Persistence.Repositories.TemplateRepository.PerformGet(Int32 id)
at Umbraco.Core.Persistence.Repositories.RepositoryBase`2.Get(TId id)
at System.Collections.Generic.List`1..ctor(IEnumerable`1 collection)
at System.Linq.Enumerable.ToList[TSource](IEnumerable`1 source)
at Umbraco.Core.Persistence.Repositories.ContentTypeRepository.PerformGet(Int32 id)
at Umbraco.Core.Persistence.Repositories.RepositoryBase`2.Get(TId id)
at Umbraco.Core.Persistence.Repositories.ContentTypeRepository.<PerformGetAll>d__2.MoveNext()
at Umbraco.Core.Persistence.Repositories.RepositoryBase`2.GetAll(TId ids)
at Umbraco.Core.Services.ContentTypeService.GetAllContentTypes(Int32 ids)
The reason behind all this mess is Examine; they need to optimize it. In a project we are working on there is a large number of nodes and nested nodes, and when we publish, it gets stuck and the CPU level rises. In the end we disabled Examine and things went well, with the payback of having no search on the website at the moment.
I cannot believe they can release a cms with such intensively insane bugs and issues.
Hi Stephen and Shannon, it seems you stated above that the issue will be fixed in 7.1.5 - when is it rolling out? I have not seen this issue marked as closed yet. Our client is deeply suffering from performance issues; even publishing does not work sometimes.
@Ali I'm not sure if you've read through this whole thread, or which version you are using, but if you are on 6.2/7.1.4 then you can probably just downgrade the Examine version - see above: http://issues.umbraco.org/issue/U4-4931#comment=67-14687 - if that is the problem you are having.
Other people on this thread are having different issues. The more information you give us, the more we can help.
7.1.5 and 6.2.2 are not specifically scheduled but I'd imagine they'll be out within a few weeks.
A newer version of Examine will be released with them, but you can already use that version if you want as it's released:
There were three issues at core:
a- Performance issue: the CPU level rises and eventually no publishing or anything else works.
b- Two data entry operators are using the Umbraco backoffice to enter data; they face continual random publishing malfunctions. Sometimes they publish 10 to 15 times and nothing happens on the frontend. Eventually it does, after hours of refreshing the browser or deleting the Examine files in the temp folder (which it regenerates).
c- Things get slow after playing around in the backoffice.
For this we turned Examine off. We thought our search would stop working, but somehow search is still working without Examine - is that normal?
I applied the other cure to add
I added a setting in IIS to kill the worker process if it consumes a lot of CPU. Is that good practice? If I do not do it, the CPU level rises and the site goes down.
And to let you know it is hosted on Azure VM
Ali - are you load balancing? This setting is only for load balancing: umbracoContentXMLUseLocalTemp. This is an Azure VM, not Azure Websites, correct? What version are you using?
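For reference, umbracoContentXMLUseLocalTemp is an appSetting in web.config (only relevant for load-balanced setups, as noted above); a minimal sketch of where it goes:

```xml
<!-- web.config -->
<appSettings>
  <!-- Store the content XML cache in ASP.NET's local temp folder instead of
       App_Data; intended for load-balanced environments, e.g. where App_Data
       sits on a network share. -->
  <add key="umbracoContentXMLUseLocalTemp" value="true" />
</appSettings>
```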
Content/media/member search will not work without Examine; in the back office there are only rudimentary search capabilities for other things like templates.
@zpqrtbnk and @sebastiaan I am definitely going to vote to have this bug fix included in 6.2.2 and 7.1.5, more and more people are seeing this issue and it is a pretty massive issue. Stephen's current PR looks great and he's been running it on a live site that has this issue for quite some time now. My biggest concern is that it has taken us quite a while to get these versions out and if we don't get this done in these versions it's gonna be ages again until this fix is in there.
@Shandem I'm sorry, but I think what we're seeing is a lot of confirmation bias. People experiencing performance problems are throwing all their hopes into this issue without even testing whether any of the proposed solutions work.
The fact that the fix runs great on a single site doesn't prove it'll run great on every site. I want more people to test this before we release it. So: apply this one specific fix to 7.1.4 and have some people who have replied to this issue test if it really is the solution to their problems. If yes, we should probably include it.
Again, this issue affects a very select number of people with a specific usage pattern and I suspect a lot of people who have replied to this issue actually have other issues.
@sebastiaan that is untrue, this issue affects 100% of sites, the thread lock may only appear on sites that might have more than one editor at once. The reality is that site performance will be improved with this fix as the current cache locking is incorrect. I'm not sure how you want people to test this fix? It is purely a PR on GitHub atm and under the 7.1.5 release...
But yes, other people on this issue have other issues, but my concern is that this issue affects every single site whether people realize it yet or not and since there is no release date for 7.1.5 I'll assume that 7.1.6 won't be out for quite some time.
@sebastiaan I'm also thinking along the lines of Courier, since I've had to start looking into that codebase recently. This cache locking will directly affect Courier because it spawns quite a few threads to do its processing; each of them, while doing multiple concurrent updates, will have a massive bottleneck to deal with - which is our current cache locking, which Stephen's PR fixes.
@Shandem Yes, and it has affected 100% of those sites for (apparently) years now. I just don't feel comfortable releasing this in a patch release; patch releases haven't had a great track record as of late, so it needs more testing first.
As I said, I'll make a build of 7.1.4 with ONLY this PR merged in so that people who've replied here and THINK they have this problem can test it and see if there's any unintended side-effects.
WRT Courier and UaaS we can also have some users test this on UaaS and I could even have the custom version be the default on UaaS. It looks like it might actually benefit currently running sites as some of them are in content entry phase.
@sebastiaan Ok cool, as long as we can have a testable release with this fix. And this hasn't been in the core for years! This is 'new', only since version 7.
@Shandem Eh..? And from which v6 version is this new then?
@sebastiaan about 6.1
@Shandem Months! Months I say! ;-)
At the moment the PR is against 6... lemme know if you want me to do one against 7.
@zpqrtbnk I just gave it a shot myself this morning and I screwed it up royally, so yes please! If you could do that in another branch that would be great. :-) Would be easy to keep the branches in sync as well I think. Note that these last comments are only visible to Core, if you hit the reply button on those comments the visibility will be automatically set to that.
@sebastiaan There's now branch 7.1-4931 in my fork which is latest 7.1.5 + fixes. I have cherry-picked them because that was easier than merging 6 into 7. From now on, should I change anything, I'll do it both in the 6 and 7 branches. Is that OK? Now do you want me to do a PR for that branch, or just to push it into the main repo?
@zpqrtbnk Please hold off on pushing it into the main repo as said before, I want to put it out for testing first. I'll do a custom 7.1.4 build soon to post here so people can actually start trying it out.
@sebastiaan Sure, it's https://github.com/umbraco/Umbraco-CMS/pull/423 then.
@zpqrtbnk sweeeeet! :) #h5yr
Alright everybody, I've taken the updates that Stephan has been making and applied them to the 7.1.4 release, the dlls that are updated are in the zip file attached. Please note that there have been NO other changes, this is 7.1.4 with Stephan's fix applied.
Before I feel comfortable releasing these fixes in a new version of Umbraco I need some of you to test this solution. As the changes reach pretty deep into the core, I really want to make sure that we don't have any unintended side-effects.
Please make sure that you're on 7.1.4 and then drop these files in. Note that this includes the Examine.dll that was shipped with 7.1.4, so if you've downgraded Examine, it will now be back at the original version. Make absolutely sure to copy that version in otherwise your test is not a valid one.
Final note: make sure to UNBLOCK the zip file before unzipping it (right-click > Properties > Unblock).
If you're still on v6 of Umbraco I can also make a custom build for 6.2.1. Note that I won't be making custom builds for versions other than the latest version.
As an additional note: these changes have been tested on one live site so far (and it's a success there). Other than that I've only tested these dlls locally to make sure that nothing breaks when updating them.
We feel comfortable that it shouldn't break anything but that is also where your testing comes in. So, needless to say: proceed with care and make backups.
Hi Sebastiaan, I would like to have the 6.2.1 version. Actually I do not encounter the above issues, but my site went into production just today and I like to be prepared. And, what is good for you: of course I will test it thoroughly ;-)
@dseefeld Thanks Dirk, any testing is helpful! See attached.
So far I found no issues for 621-U4-4931!
I updated an existing 6.2.1 version (localhost only, single user), rebuilt my custom dlls, published all nodes with all subpages from the context menu, published a single node via the toolbar, and rebuilt all Examine indexes via the developer Examine Management dashboard.
Probably this will not cover this whole issue as it is described above. As I already said, I do not encounter the slowing down and locking of the application, thus I cannot say whether this issue is solved or not.
Thanks Dirk for at least testing that your site keeps running! :)
Hope other people are trying it out, it's been awfully quiet in here (which might be a good thing).
I would love to help out and test this but we are being very careful about what code we drop into the current live site due to the previous issues (which are sorted by all accounts).
We have a few more Umbraco builds in the pipeline which we can help out on, but these are only looking to roll out in a month or so.
Just wanted to provide an encouraging update. We have a site at the end of development that utilizes several custom Lucene indexes and makes Examine calls that search against them. When we started load testing the site with the base 7.1.4 install, the site started to slow, IIS locked up and the appPool recycled when we reached 100 requests/sec to one of the pages that makes heavy use of Examine. We then loaded up the custom 7.1.4 dlls that Sebastiaan provided above and load-tested the site again. With the updated dlls we were able to push over 250 requests/sec without any errors, and the site remained responsive (a bit slower, but responsive). Over the past 12 hours we have pushed millions of requests at the entire site and haven't had any issues.
Thanks for all of your hard work - in our case, it appears you have fixed this problem.
Have merged the fix PR in 7.1.5 now.
Have merged the fix PR in 6.2.2 now.
Backwards Compatible: True
Affected versions: 6.2.0, 7.1.2, 7.1.3
Due in version: 7.1.5, 6.2.2