U4-493 - CmsContentXml table is emptied when modifying a DocumentType that is used on a lot of nodes

Created by Sebastiaan Janssen 19 Aug 2012, 14:53:32 Updated by Shannon Deminick 22 Jul 2013, 01:11:49

Relates to: U4-1772

Relates to: U4-2527

Umbraco version 4.7.1

In 3 cases we have experienced that the cmsContentXml table is emptied when we modify a DocumentType that is used on a lot of nodes (5,000-6,000 nodes). We suspect that a timeout somewhere is to blame, but we can't figure out what exactly happens.

On the forum there is at least one other similar report: http://our.umbraco.org/forum/core/general/25944-Reasons-for-the-cmsContentXml-table-to-get-wiped

Let me know if further information is needed to debug this problem.

''Originally created on CodePlex by [emilr|http://www.codeplex.com/site/users/view/emilr]'' on 11/24/2011 1:09:52 PM [Codeplex ID: 30610 - Codeplex Votes: 2]

Imported comments

''Comment by [Stegelmann|http://www.codeplex.com/site/users/view/Stegelmann] on 3/13/2012 3:25:50 PM:'' I have added a patch for version 4.7.1.1, see http://allan-laustsen.blogspot.com/2012/03/umbraco-no-node-exists-cmscontentxml.html for description.

2 Attachments

Download 2012-03-13 - DocumentType alias change, flush xml cache using Tasks.patch

Download 2012-12-20 - DocumentType alias change, flush xml cache using Tasks.patch

Comments

Nico Lubbers 19 Dec 2012, 10:31:09

This issue is also occurring on Umbraco 4.7.2 in our production environment. We changed the name of a document type, and since then the cmsContentXml table has ended up completely empty 3 times, on a site with more than 10,000 nodes. Is the patch that was submitted available for download somewhere?


Sebastiaan Janssen 19 Dec 2012, 10:36:05

The patch related to this codeplex issue is now attached here!


Allan S. Laustsen 20 Dec 2012, 12:25:09

Hi Sebastian

Nico pinged me to ask whether the issue still exists in the releases after 4.7.1.1, and I can see that the code is still the same. I have also received quite a few requests from different people asking whether I could help fix this bug for them. So I decided to pitch in a little by fetching the release sources since 4.7.1.1, applying the patch, and rebuilding a release version for each, so that non-technical people can get their sites up and running again. I have published the files on my blog: http://allan-laustsen.blogspot.com/2012/03/umbraco-no-node-exists-cmscontentxml.html

I think this bug is really critical for large scale installs, and needs to be prioritized VERY high. We have been running our 4.7.1.1 install with the patch since January 2012, and all of our problems with content being out of sync have gone away since the patch was applied.

If you need my advice or help in any way regarding this issue please contact me.

Best regards Allan S. Laustsen


Sebastiaan Janssen 20 Dec 2012, 12:37:06

Hi Allan, I am not sure if this will still be necessary for 6.x and onwards, but I'll ask Morten about it. In the meantime, do you happen to have a patch for 4.11.1 as well so I can have a better look at it? Thanks in advance!


Nico Lubbers 20 Dec 2012, 14:35:43

Today we were able to reproduce the bug in development, and we can confirm that we are not able to reproduce the problem once this patch is applied. That is good news for us!!


Allan S. Laustsen 20 Dec 2012, 16:20:53

Hi Sebastiaan (sorry I misspelled your name the first time). I have attached the patch for 4.11.1 (it also applies to the 4.10.1 version).


Shannon Deminick 24 Apr 2013, 07:44:00

I've done an in-depth review of this issue today and have made amendments to the code base to resolve it. Although the patch supplied might work, I feel that it may have other repercussions in the long term and isn't solving the actual underlying problem. We also need to fix this bug in a way that ports upwards to versions 6.x correctly.

The first thing that I notice is that Document.RePublishAll() is not thread-safe, which will cause other unwanted problems related to this. The notes in the patch list a DB or IIS failure or an IIS app pool recycle as possible causes, but this is not really the cause of this issue. I can reproduce this issue by causing the IIS request to time out in the middle of the Document.RePublishAll() method. IIS does not stop thread execution during a recycle, only when a request times out. If the DB or IIS fails at precisely the moment in the middle of this method then it 'could' cause this issue, but this is a different matter all together. Many business logic calls require more than one database change in one method, and if the DB or IIS fails during any of these methods then there will be problems. Reproducing it is easy: add <httpRuntime executionTimeout="10" /> to your web.config, which makes the IIS timeout quite low, then add a Thread.Sleep(95000); in Document.RePublishAll() just before the while loop, and you'll reproduce this problem.

Let's focus on the main issue here: this method is not thread-safe, and its performance is very poor, so the request will time out if there are a lot of content nodes in the db. This means the thread will be canceled during the db re-population. So to fix this:

  • Instead of using a request thread to perform the save operation, we'll change the EditNodeTypeNew.aspx page to be async.
  • Change the saving logic in the ContentTypeControlNew.ascx.cs to register an async task with the page and run the saving logic in the page's async execution.
  • This will ensure the logic runs on a long-running worker thread, which is not prone to the normal IIS timeouts and thread cancellation, rather than on a request thread.
  • When the async task completes, we'll update the UI.
  • We can manually set the async timeout to be whatever we want in the markup of the EditNodeTypeNew.aspx page, so we'll set it to 5 minutes (It should definitely not take this long but we'll set it that high just in case)
  • Add locking to the Document.RePublishAll() so 2 threads cannot perform this operation at the same time
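
As a rough, language-neutral illustration of the last few points (this is Python, not the Umbraco code, and every name in it is hypothetical), the idea of "run the rebuild off the request thread, give it its own timeout, and never let two rebuilds overlap" could be sketched like this:

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Hypothetical stand-in for the lock added around Document.RePublishAll().
_republish_lock = threading.Lock()


def republish_all(rebuild):
    """Run rebuild() while holding the lock so two rebuilds never overlap."""
    with _republish_lock:
        return rebuild()


def save_document_type(rebuild, timeout_seconds=300):
    """Run the rebuild on a worker thread and wait at most timeout_seconds.

    The caller plays the role of the page: it only waits for the result.
    If the wait times out, the worker is NOT cancelled; it keeps running
    to completion, so the rebuild is never cut off half way.
    """
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(republish_all, rebuild)
    try:
        return ("done", future.result(timeout=timeout_seconds))
    except FutureTimeout:
        # This is the place where a caller could degrade gracefully,
        # for example by telling the user the rebuild is still running.
        return ("timed_out", None)
    finally:
        # Do not block here; the worker thread finishes on its own.
        executor.shutdown(wait=False)
```

The default of 300 seconds mirrors the 5 minute async timeout mentioned above. The design point is that a timeout only ends the waiting, not the work itself.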

Now we'll need to look at performance options so that Document.RePublishAll() doesn't take too long, and definitely not longer than 5 minutes. The other thing we can look at is degrading gracefully if our 5 minute timeout is reached, since async pages let us handle that with a callback. I'm not sure how we'll do that yet, but it might be possible. So far today I've completed the above, which will fix the timeout issue, but I still need to look at the performance of this method. The 2nd part of this issue is also related to U4-1772, but that fix will not be applied until 4.11.8; 4.11.7 will contain the async fix, which will fix the timeout issue.


Shannon Deminick 29 Apr 2013, 00:33:39

I've improved the performance of this operation by about 50%, which is pretty good. This is done in revision b03c1fb77762 and has been merged upwards to 6.0.x. 6.1 treats this differently and is already optimized. With this performance improvement, and with the page methods that perform this operation now running asynchronously, this issue will be fixed.


Allan S. Laustsen 20 Jul 2013, 22:05:40

Hi Shannon

I have now had time to review your changes in the 4.11.10 version of Umbraco, after receiving a distress request from a larger US-based company whose Umbraco installation was down, with the exact same symptoms as described in this bug report and in my blog post about it.

Unfortunately I could not get direct access to their site, so I could not verify 100% that the issue they experienced was the same, but I re-patched the 4.11.10 sources with my "fix" and their site is now running again.

In this process I reviewed your changes, and yes, they do improve the re-publish process, but the changes do not fix the problem 100%. You describe that DB-fail, IIS-crash, Network, etc. issues are a "different matter all together"; I partially agree.

If the number of nodes is relatively high, or the IIS request timeout is not high enough, or the connection timeout on the database connection is not high enough, or the database max packet size is not high enough, or the memory limit for the app pool is too low, etc., then the issue will happen over and over again.

The fundamental problem is the "TRUNCATE TABLE cmscontentxml" followed by a "SELECT nodeid FROM cmsdocument WHERE published=1", where the resulting nodes are then looped over and published one by one. This causes a HUGE number of DB requests (depending on the number of properties on the nodes), and if there is just a single glitch in this loop, the cmscontentxml table becomes invalid.
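
To make the failure mode concrete, here is a minimal sketch in Python with an in-memory SQLite database. It is not the Umbraco code: the two-column tables, the `make_db` helper and the `build_xml` callback are all stand-ins invented for the illustration, and SQLite's `DELETE FROM` plays the role of `TRUNCATE`.

```python
import sqlite3


def make_db(node_count):
    """Create stand-in cmsDocument / cmsContentXml tables (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cmsDocument (nodeId INTEGER PRIMARY KEY, published INTEGER)")
    conn.execute("CREATE TABLE cmsContentXml (nodeId INTEGER PRIMARY KEY, xml TEXT)")
    for node_id in range(1, node_count + 1):
        conn.execute("INSERT INTO cmsDocument VALUES (?, 1)", (node_id,))
        conn.execute("INSERT INTO cmsContentXml VALUES (?, ?)", (node_id, "<old/>"))
    conn.commit()
    return conn


def rebuild_truncate_first(conn, build_xml):
    """Wipe the table, then re-insert published nodes one by one."""
    # SQLite has no TRUNCATE, so DELETE stands in for it here.
    conn.execute("DELETE FROM cmsContentXml")
    conn.commit()  # from this point on the old rows are gone for good

    node_ids = [row[0] for row in conn.execute(
        "SELECT nodeId FROM cmsDocument WHERE published = 1 ORDER BY nodeId")]
    for node_id in node_ids:
        # Any failure inside this loop leaves the table only partly filled.
        conn.execute("INSERT INTO cmsContentXml VALUES (?, ?)",
                     (node_id, build_xml(node_id)))
        conn.commit()


def row_count(conn):
    """Number of rows currently in the stand-in cmsContentXml table."""
    return conn.execute("SELECT COUNT(*) FROM cmsContentXml").fetchone()[0]
```

If `build_xml` raises on, say, the third of five nodes, the table is left with two rows instead of five, and nothing in the database records that the rebuild was incomplete.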

This is not visible on the frontend right away, but on the next update of the memory cache + umbraco.config file, the frontend will be missing content and start to fail. The normal backoffice response would be to re-publish all content (which will only flush the cmscontentxml data to umbraco.config), followed by a re-publish-xml if the first one does not solve the problem. And again, this will probably just recreate the problem if the number of nodes is high enough.

I had the problem for some time, and worked around it with re-publish-xml for about 3-4 months, as I thought it was something we did wrong, until every attempt to re-publish-xml simply failed with various exceptions. That was why I wrote the patch, which ensures that the cmscontentxml table will not become invalid.

I do agree my patch is not 100% perfect, but it has now been in production on our installation for about 2 years, and we have had zero issues since the patch was put in production.

I think we should have a discussion about how to optimize the Umbraco publish/re-publish method so that real-world large scale Umbraco installations are not going to fail at random. Please contact me and let us figure out a good way to solve this once and for all. I have had at least 50 different companies around the world reaching out for my help on this, and I'm afraid that it's not being taken seriously enough at Umbraco HQ (I could be wrong on this). And I feel that it's important that large scale installations (which normally means large scale companies) do not feel insecure about Umbraco.

Best regards Allan


Shannon Deminick 22 Jul 2013, 01:07:22

Hi Allan,

So it looks like we'll have to continue to optimize the way that this is handled. If people want to continue running v4.11.10+ they can of course use your patch. I'll create a new issue on the tracker for this to look at improving performance of the xml cache re-creation. We certainly should not need to remove all data in the table to then rebuild it. Any new fixes, however, will be applied to 6.1.3+, since we will only be releasing patches for any previous version if they are fundamental security issues (upgrading from 4.10+ to 6.1+ is straightforward). The fix should still not require the use of Tasks, since this is really just band-aiding the underlying problem by re-trying until it completes. I'm pretty positive we can get this operation to run in a nicely optimized fashion.
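
One possible shape for "rebuild without removing all data first" is sketched below in Python with SQLite. This is only an assumption about what such a rebuild could look like, not the fix that was eventually shipped. The schema, `make_db` and `build_xml` are hypothetical stand-ins.

```python
import sqlite3


def make_db(node_count):
    """Create stand-in cmsDocument / cmsContentXml tables (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cmsDocument (nodeId INTEGER PRIMARY KEY, published INTEGER)")
    conn.execute("CREATE TABLE cmsContentXml (nodeId INTEGER PRIMARY KEY, xml TEXT)")
    for node_id in range(1, node_count + 1):
        conn.execute("INSERT INTO cmsDocument VALUES (?, 1)", (node_id,))
        conn.execute("INSERT INTO cmsContentXml VALUES (?, ?)", (node_id, "<old/>"))
    conn.commit()
    return conn


def rebuild_in_place(conn, build_xml):
    """Refresh every published node's row without wiping the table first."""
    node_ids = [row[0] for row in conn.execute(
        "SELECT nodeId FROM cmsDocument WHERE published = 1 ORDER BY nodeId")]

    # "with conn" commits if the block succeeds and rolls back if it raises,
    # so a failure part way through leaves the previous rows untouched.
    with conn:
        for node_id in node_ids:
            conn.execute(
                "INSERT OR REPLACE INTO cmsContentXml (nodeId, xml) VALUES (?, ?)",
                (node_id, build_xml(node_id)))
        # Remove rows for nodes that are no longer published.
        conn.execute(
            "DELETE FROM cmsContentXml WHERE nodeId NOT IN "
            "(SELECT nodeId FROM cmsDocument WHERE published = 1)")
```

The trade-off in this sketch is one long transaction, which on a very large site may bring its own locking and timeout concerns; committing in batches would shorten it at the cost of a table that is briefly a mix of old and new rows, though never empty.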


Shannon Deminick 22 Jul 2013, 01:11:49

Here's the new issue so we can track the release state of it: http://issues.umbraco.org/issue/U4-2527


Priority: Normal

Type: Bug

State: Fixed

Assignee: Shannon Deminick

Difficulty: Normal

Category:

Backwards Compatible: True

Fix Submitted: Patch

Affected versions: 4.8.0, 4.9.0, 4.10.0, 4.11.0, 6.0.0, 4.9.1, 4.11.1, 4.11.2, 4.11.3, 4.11.4, 6.0.1, 4.11.5, 6.0.2, 4.11.6, 6.0.3

Due in version: 6.0.6, 4.11.9

Sprint:

Story Points:

Cycle: