We have moved to GitHub Issues
Created by Sebastiaan Janssen 06 Mar 2017, 18:49:08 Updated by Shannon Deminick 03 Apr 2017, 06:24:03
Relates to: UAASSCRUM-822
Subtask of: U4-9609
In some rare cases when updating the umbraco.config xml file it may be that the app domain has previous crashed (hard crashed, no warning) in the middle of xml writing. If that is the case, the xml file may be incomplete and may not have valid xml markup. In this case we try to delete the xml file which is generally ok but in rare cases, that operation may also fail since the file may be locked for another reason. If that were to occur we'd have an unhandled exception so this ensures that the unhandled exception doesn't occur.
Some notes: a quick code diff between 7.5.6 and 7.5.7 does not point to changes in the Xml handling code. Also, the stacktrace mentioned on Slack shows the exception popping in DeleteXmlFile, from LoadXmlFromFile - it means the file could not be loaded (and an exception has been logged) neither could it be deleted (and another exception is logged). And this happens right when an AppDomain starts. Finally, ''in theory'' due to MainDom, the new AppDomain should not ever try to read the file if the previous AppDomain has not released it.
I'd love to see more of the log file (what was happening before and after) and see if we have any chance of repro?
That log contains a number of nasty Lucene exceptions in the Lucene merge thread, ie in a background thread, taking down the whole AppDomain. So we see a number of hard-restarts, ie the entire w3wp process goes down at once. The exception is caused by: System.IO.FileNotFoundException: Could not find file 'C:\inetpub\temp\DWASFiles\Sites\235c6dec-14c6-4aba-9c8b-4acc9e00d907\VirtualDirectory0\site\wwwroot\App_Data\TEMP\ExamineIndexes\External\Index_50.fdt' - which would indicate some nasty corruption of Lucene files (why is it trying to open a non-existing file?).
We see a few failed attempts at reading umbraco.config because the file contains weird chars / is not valid Xml. Either because reading from disk failed, or because the last write failed. We do see occurences where we log that we are writing the file, and the process dies, so the state of the file is unknown (probably corrupt).
Also seeing a few occurences where the entire w3wp.exe process seems to get killed without even reporting an unhandled exception. And some weird AppDomain IDs, ie a new process start with AppDomain 4 or 5 instead of 1.
The first "access denied" is at 11:43:32,613 - before that, the AppDomain failed to read the file because it contained invalid Xml (but it was possible to open/read the file). So the AppDomain reloaded the Xml from database, and now it tries to write it back into umbraco.config, first deleting the file - and this is where it fails. A few seconds later, the process dies hard again, and the next AppDomain fails to even read umbraco.config (at 11:51:25,318).
At that point we continuously see "access denied" errors as obviously the file is locked / owned by some process. Until at some point things magically fix themselves.
At 11:43:12,513 we log "Save Xml to file...", the immediate next entry at 11:43:12,748 is the unhandled Lucene exception, and then the w3wp.exe process is gone. Which quite probably means that it dies right when we are using a file stream for umbraco.config, without disposing anything (hard crash). I ''think'' (but cannot be sure) that w3wp.exe really goes away so the process is not holding any lock. But I ''believe'' that at system level, maybe because we're running on a network fs, the lock / file handle remains for some time.
Also: would love to see more logs - it really is the only way we have to understand what's going on.
In various situations when we cannot read or write the Xml file, we handle the exception but then we try to delete the file, which can fail too, and then we have an unhandled exception. Made some changes so that failing to delete the file will be handled too.
PR https://github.com/umbraco/Umbraco-CMS/pull/1839 - review: code review
There will still be exceptions reported in the log, because the site isn't working properly really. And once the rogue lock exists, there is little we can do. I still ''think'' it has to do with the FS being an SMB and not local disk FS, and keeping rogue locks for some time even after the process that created it has died. But as least the site should work somehow.
Backwards Compatible: True
Due in version: 7.5.12
Sprint: Sprint 56
Story Points: 3