U4-10213 - Randomly getting System.UnauthorizedAccessException when app pool recycles

Created by Sebastiaan Janssen 31 Jul 2017, 14:17:43 Updated by Shannon Deminick 06 Oct 2017, 01:13:01

Tags: Unscheduled

Relates to: U4-10301

Relates to: U4-10503

Relates to: U4-9576

Subtask of: U4-9609

From a support case, they regularly get the following error, which eventually resolves itself but takes the site down for minutes:

System.UnauthorizedAccessException: Access to the path ''D:\home\site\wwwroot\App_Data\TEMP\PluginCache\umbraco-plugins..list'' is denied. at System.IO.__Error.WinIOError(Int32 errorCode, String maybeFullPath) at System.IO.File.InternalDelete(String path, Boolean checkHost) at Umbraco.Core.PluginManager..ctor(IServiceProvider serviceProvider, IRuntimeCacheProvider runtimeCache, ProfilingLogger logger, Boolean detectChanges) at Umbraco.Core.CoreBootManager.Initialize() at Umbraco.Web.WebBootManager.Initialize()

2 Attachments

Comments

Stephan 31 Jul 2017, 14:22:19

see also https://our.umbraco.org/forum/ourumb-dev-forum/bugs/39467-cannot-access-umbraco-pluginslist#comment-243553


Stephan 31 Jul 2017, 14:23:04

we have no safety on the plugin manager list, we just try to read and write because reads and writes should be super fast, and hope that there's no collision, and when there's one... it throws an exception


Stephan 31 Jul 2017, 14:25:30

PR https://github.com/umbraco/Umbraco-CMS/pull/2087

review: code review... not sure how to test it as we cannot really repro, only it "happens" on sites


Sebastiaan Janssen 31 Jul 2017, 15:04:34

Tested the build, doesn't break anything. We'll see if the client can test to see if the problem goes away. But merged as the fix seems fine and harmless.


Jeavon Leopold 03 Aug 2017, 18:00:53

I have just starting seeing this fairly frequently with a Azure Web Apps deployment of v7.6.3, is the fix in v7.6.4?


Jeavon Leopold 03 Aug 2017, 18:02:30

I guess not looking at the dates, is it in a v7.6.5 CI build though?


Sebastiaan Janssen 04 Aug 2017, 08:27:56

Sorry, forgot to add the Due in Version date. Yes, it will be 7.6.5!


Jeavon Leopold 12 Aug 2017, 21:53:05

We have been running with v7.6.5-build8511 that includes this fix on a multi instance Azure App site that seems to be affected by this and unfortunately this issue occurred repeatedly today before resolving itself.

I'm going to guess that Umbraco is fighting with the file system sync between the two instances. A simple solution would be to add an option to move this file to the Environment temp in same way as we can no do with the Umbraco XML cache file () as this location isn't ever synced...?


Jeavon Leopold 12 Aug 2017, 22:03:27

Thinking about it if we do go down this route perhaps the new setting should also move the DistCache folder so then there's no chance of anything TEMP wise getting locked or otherwise we could have 2 new settings...

or

or something like that


Jeavon Leopold 15 Aug 2017, 09:31:02

I have been trying to work out why we are seeing this on one particular site all the time when Azure wise it is setup exactly the same as many other sites we have and the main difference seems to be that this site is running v7.6 rather than v7.5. I then noticed that in v7.6 the cache file is no longer XML as it was in v7.5 and also it is roughly 3 times larger than it was previously, I think this is key as previously it was around 60kb rather than the 150kb it is now, I believe the change in size is likely to be meaning that the sync is keeping the lock for long enough to conflict with Umbraco whereas previously this didn't occur.


Shannon Deminick 16 Aug 2017, 04:06:02

@crumpled_jeavon I don't think this will work, the dist cache file is named as the current machine name so putting this on the local fast drive won't change anything. If in fact there are multiple app domains overlapping (which would be bad in general) on startup and fighting for access to this file it won't matter if it's on the network share or the local fast drive.

Any chance you could send me that dist cache file (the 150kb one)? I'd be interested to see why it's more than double the size of before. //cc @zpqrtbnk

I'll have a think about this now and see what I can come up with.


Shannon Deminick 16 Aug 2017, 04:08:59

Ahh, it's bigger because we store everything that is IDiscoverable at the beginning of the file and then add to the file when we filter those out by their requesting type.


Shannon Deminick 16 Aug 2017, 04:14:04

The previous fix was for writing the file, however the latest stack traces are due to deleting the file. The file is deleted anytime a change is detected in the /bin (or a few other places) which means a re-scan needs to take place. The point of this file is that when a re-scan is not needed it just reads the file. We determine a re-scan by checking the hash of the /bin folder (and a few other places) and storing that hash in the hash file. If the hashes match, the file is used. If the hashes don't match, the file is deleted, rescanning occurs and the file is created.


Jeavon Leopold 16 Aug 2017, 09:11:38

@Shandem in this case I'm wondering what is happening to cause the hash to not match, it's certainly not a file change triggered by us. I'm also finding it intriguing that it's occurring at almost exactly that same time across two separate Azure Web Apps (shared DB of course) at almost the same time...??


Shannon Deminick 16 Aug 2017, 23:32:30

This is certainly odd. Here's the code that generates the hash, let me know if you think anything is 'interesting' there. Also note that this hashing code hasn't changed in a very very long time so I don't think that would be the issue unless something has changed in Azure or your web app.

https://github.com/umbraco/Umbraco-CMS/blob/dev-v7/src/Umbraco.Core/PluginManager.cs#L208

I'm awaiting the debug logs from you guys on zendesk which should tell us much more about what is happening.

What I'm wondering is if something odd has changed in Azure or .Net or similar where the GetHashCode method for some objects is no longer being returned as a constant value per AppDomain. This is already the case for several objects but it hasn't been like that for STRING and this is what we've been relying on in the past and it has worked fine. I think it would be more robust to not rely on that and just hash to SHA1 since that is definitely the same between AppDomains. Here's the code for example when generating/combining the full hash, we use a HashCodeCombiner but there's potential that environment specific settings could make this not be reliable especially with all of the changes in Azure and .NET all of the time.

https://github.com/umbraco/Umbraco-CMS/blob/dev-v7/src/Umbraco.Core/PluginManager.cs#L248

I'll update this to use a more reliable Hash and see what happens.


Shannon Deminick 17 Aug 2017, 01:43:27

I've created a new issue with a PR here: http://issues.umbraco.org/issue/U4-10301 which addresses the reliability of the hash produced. However, I don't really know if that is any root cause of the problem, at this point it's only speculation but it certainly has potential to be the prob.


Sebastiaan Janssen 28 Aug 2017, 09:49:03

@crumpled_jeavon I've merged said PR, I would love it if you could test once the build is done, artifacts will appear here: https://ci.appveyor.com/project/Umbraco/umbraco-cms-hs8dx/build/8805/artifacts


Jeavon Leopold 28 Aug 2017, 13:41:56

@sebastiaan absolutely, we have been patiently waiting for the issue to occur with debug logging enabled to get some extra info for @Shandem of course it hasn't yet happened since changing the logging level.


Kraftwerk 08 Sep 2017, 13:31:58

Jeavon Leopold said: I'm going to guess that Umbraco is fighting with the file system sync between the two instances. A simple solution would be to add an option to move this file to the Environment temp in same way as we can no do with the Umbraco XML cache file () as this location isn't ever synced...?

Shannon said: I don't think this will work, the dist cache file is named as the current machine name so putting this on the local fast drive won't change anything. If in fact there are multiple app domains overlapping (which would be bad in general) on startup and fighting for access to this file it won't matter if it's on the network share or the local fast drive.

We think the suggestion from Jeavon is how this problem should be really solved, this implements the same logic as used for examine and umbraco.config to solve the issue with multiple processes and shared file system. Because this way there is only one single process and thus no conflict can happen. Basically umbraco.config and lucene are alredy implemented proof-of-concept for this solution.

Because if the hash calculation decides to delete the file, which is a valid szenario, the same problem will happen again, right?

We looked at the code and the solution could be simple as changing the path for the _tempFolder of the plugincache:

_tempFolder = IOHelper.MapPath("~/App_Data/TEMP/PluginCache"); https://github.com/umbraco/Umbraco-CMS/blob/e7e04de22efaae4ad0504d166bac3f2299b4fb3c/src/Umbraco.Core/PluginManager.cs#L70

to a path which is used for Umbraco.config or examine files are.


Shannon Deminick 03 Oct 2017, 23:40:02

@zpqrtbnk I'm still examining why we get file locks on this file in some Azure instances. The new problem is when the file is being deleted (i.e. the hash values have changed so the file needs to be removed). For some reason during this time in some environments the file is locked (very odd). So I was just looking at the code changes you made for the retry stuff and noticed something, you are using this code to create a file stream:

new FileStream(filePath, FileMode.Create, FileAccess.Write, FileShare.None)

But the equivalent of what we used to use is (and is the default in .net) :

new FileStream(filePath, FileMode.Create, FileAccess.ReadWrite, FileShare.None)

Seems like the safe thing to do would be to use the defaults that are normally configured as there's potential this change may be causing other (strange) issues?


Shannon Deminick 03 Oct 2017, 23:45:34

@kwtech I'm just re-reading your comment (http://issues.umbraco.org/issue/U4-10213#comment=67-41633) and not sure what you mean. I've said that putting this cache file in the EnvironmentTemp folder, the way that the xml cache file and examine indexes work will not solve the issue - but you've just said that it will solve the issue... but it won't. As I mentioned, it doesn't matter if it's on the local fast drive, multiple/concurrent appdomains on the same website still means there's multiple processes acting on the same file.


Stephan 04 Oct 2017, 06:12:33

we could revert to ReadWrite not sure it makes a diff - since we just write, i'm getting Write - this is just declaring what we want to do IIRC

from what I remember pretty sure that when we try to do stuff to the file (delete...) the file is still locked, but not really by Umbraco - kinda like the Azure file system taking time to release the lock on the file ''after'' Umbraco has released it - would need to dive into the issue again for more details


Priority: Normal

Type: Bug

State: Fixed

Assignee:

Difficulty:

Category:

Backwards Compatible: True

Fix Submitted:

Affected versions:

Due in version: 7.6.5

Sprint: Sprint 64

Story Points:

Cycle: 3