U4-2527 - Content xml cache re-creation needs to be optimized - otherwise we can end up with an empty xml cache file (no content)

Created by Shannon Deminick 22 Jul 2013, 01:11:23 Updated by Shannon Deminick 28 Aug 2014, 07:38:15

Relates to: U4-493

Relates to: U4-2556

Relates to: U4-2562

Relates to: U4-2617

We need to overhaul/optimize the way the xml cache is reconstructed. On larger sites there's a chance that a service might time out (a db connection, an IIS request, etc.), which means the xml cache is cleared but not re-built. The timeouts stem from how this currently performs, which is in dire need of optimization, especially on very large sites (due to poor SQL performance and the number of iterations).

1 Attachments

Download Reload.xlsx

Comments

Allan S. Laustsen 22 Jul 2013, 15:32:28

Today we experienced the issue again: the cmscontentxml table was corrupted (missing nodes). This time the cause was a backoffice copy-move action that failed. A node with many sub-nodes was moved to a different location in the backoffice content tree, and with 20,000+ content nodes it has now taken almost 3 hours to re-publish the table (with the patch I supplied for issue http://issues.umbraco.org/issue/U4-493). So the patch is definitely not perfect; it's simply too slow in the re-generation process.


Allan S. Laustsen 22 Jul 2013, 15:47:41

Ideas for a better way to re-publish the entire content tree:

Only re-publish "modified" nodes.

Use separate threads for speed.

Re-publish in "chunks" to not stress DB and/or IIS.

Use a separate umbraco.config file, and overwrite original when the re-publish process is complete.

Re-use the document instance in the process.

Load the document with "optimized" parameter.

Make the process so that it does not fail if a server crash occurs.

If no nodes exist in the content cache or the cmscontentxml table, it should automatically do a re-publish with xml=true.

Do NOT update umbraco.config and the content cache until the entire re-publish process is completed; currently there is high disk IO, so this can speed up the re-publish process.

Add events in as many places as possible in the re-publish call, so that individual steps can be overridden.
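Several of the ideas above (publishing in chunks, writing to a separate file and overwriting the original only when complete, surviving a crash without corrupting the cache) can be sketched together. This is a minimal sketch under stated assumptions; `render_node` is a hypothetical callback and none of these names are Umbraco APIs:

```python
import os
import tempfile

def republish(node_ids, render_node, cache_path, chunk_size=500):
    """Rebuild the XML cache in chunks, writing to a temp file and
    atomically replacing the original only when everything succeeded."""
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(cache_path) or ".")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write("<root>\n")
            # Process in chunks so the DB/IIS is never hit with one huge batch.
            for start in range(0, len(node_ids), chunk_size):
                chunk = node_ids[start:start + chunk_size]
                for node_id in chunk:
                    tmp.write(render_node(node_id) + "\n")
            tmp.write("</root>\n")
        # Atomic swap: readers never see a half-written cache file.
        os.replace(tmp_path, cache_path)
    except Exception:
        os.unlink(tmp_path)  # leave the old cache untouched on failure
        raise
```

Because the swap is a single rename, a crash at any point leaves either the old complete file or the new complete file in place, never an empty one.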


Shannon Deminick 23 Jul 2013, 01:49:02

Hey Allan, can you please provide full details about the "backoffice copy-move action that failed"? The APIs that do this are slightly different from the optimizations we made last time, so I need to know exactly what failed. Was the node that was moved already published, or was it un-published? After you moved it, did you re-publish it? If so, how did you do that: did you just publish the single node, or use the context menu to publish it and all children? etc...

Thanks!


Shannon Deminick 23 Jul 2013, 01:55:36

Also, I'd suggest upgrading to the latest version; since 4.10.11 there have been many performance improvements, which probably also directly affect the underlying processing that happens during this sequence of events. You can see the list of bug fixes for each version here: http://our.umbraco.org/contribute/releases

I know we haven't solved the underlying problem but it would definitely be worth a test to see what performance gains you might get out of the latest version.


Shannon Deminick 24 Jul 2013, 03:16:33

Hey @Allan, if you can get back to me regarding the above comment: http://issues.umbraco.org/issue/U4-2527#comment=67-8616 that would be appreciated.


Shannon Deminick 26 Jul 2013, 01:55:49

@Allan, I'm just reviewing some of the changes already implemented in the latest 6.x versions, and there have been a lot of perf improvements already to the rebuilding of this table. We no longer truncate this table at all; we only repopulate the items that are necessary based on the content type(s) being changed.

Today I'm looking at some other performance improvements we can make for updating this table.


Sean Sachen 26 Jul 2013, 19:04:08

@Shannon, we are running 4.10.11 and are still facing this problem even after applying the patch that Allan provided.

In our case, it appears that there is an automatic rebuild of the cache table in the DB, with a rebuild of the cache file shortly thereafter. The contentXml table takes roughly 45 minutes to rebuild itself and, since the cache file was rebuilt shortly after the DB rebuild started, the file is either empty or has hardly any records in it, leaving our sites dead in the water if they even respond at all.

This event hit us early this morning, again. Previously I'd force the republishing of the entire site via the dialogs url, but today I checked the contentXml table, noticed it was still building, and let it run its course (we have 11,232 nodes). After the process completed, the cache file was built and the sites started working again. Maybe this is triggered by an application reset, but I'm not sure. Regardless, this can't continue.

I've attached our umbracoLog data from this morning. At 6:13 am the site went down, and it started coming back together after 6:30 am. FWIW, we've also seen this on previous versions, back to 4.7.0.

Anything that can be done to help would be greatly appreciated.


Shannon Deminick 29 Jul 2013, 00:02:46

Hi Sean,

I am strongly recommending that everyone who has this issue, or any performance issues, upgrade to the latest version (6.1.3 currently). There have been several improvements since your version regarding performance and the entire pipeline of this processing. The upgrade is straightforward. Any new improvements will also only be made to versions greater than 6.1.3. I am currently working on more improvements (which I started on Friday).


Shannon Deminick 29 Jul 2013, 07:37:41

Have made some good progress on this in this branch: https://github.com/umbraco/Umbraco-CMS/tree/6.1.4-cmsContentXml_Optimizations

Have improved 3 aspects of the database interaction for rebuilding the cmsContentXml table:

  • How data is re-inserted into the cmsContentXml table. Before this change we were clearing the data for content (for any content of the content type changed) and then inserting one record at a time, which of course is horrific for performance when there are large amounts of data. I wrote 4 benchmark tests for this:
    • the way we are currently doing it
    • performing updates on the records that exist and inserts on the records that don't exist
    • clearing the relevant data and performing bulk insert statements
    • clearing the relevant data and performing the bulk insert statements all in one transaction

Each test is performed 10x against a dataset of 10,000 records.

The results:

Starting truncate + normal insert test... Complete (took 41190ms)
Starting update test... Complete (took 38727ms)
Starting truncate + bulk insert test... Complete (took 15508ms)
Starting truncate + bulk insert test in one transaction... Complete (took 14964ms)

So the bulk insert performance is clearly far better, taking only 36% of the time it did before. The change also ensures that all of the data is looked up and ready to be inserted beforehand; then we put the entire delete and bulk insert operation in one transaction, so if anything fails in that process it will be rolled back, which also adds a bit of extra performance.
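The winning strategy (clearing the relevant rows and bulk-inserting the replacements inside a single transaction, so any failure rolls the whole operation back) can be sketched with sqlite3; the table and column names here are illustrative stand-ins, not the actual cmsContentXml schema or Umbraco's data layer:

```python
import sqlite3

def rebuild_content_xml(conn, rows):
    """Clear and repopulate cmsContentXml atomically.

    rows: list of (node_id, xml) tuples, looked up *before* the
    transaction starts so the DB work itself is minimal.
    """
    with conn:  # one transaction: commit on success, rollback on any error
        conn.execute("DELETE FROM cmsContentXml")
        # Bulk insert instead of one INSERT statement per row.
        conn.executemany(
            "INSERT INTO cmsContentXml (nodeId, xml) VALUES (?, ?)", rows)
```

Because the DELETE and the bulk INSERT share one transaction, a failure mid-rebuild leaves the previous rows intact, which is exactly what prevents the empty-table symptom this issue describes.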

Next are 2 benchmarks on how we look up data:

Looking up all published content (if we are rebuilding the whole table) and looking up published content of a certain content type (if we are rebuilding content based on particular content types). These benchmarks run with no cache enabled, so in real life they may be faster. Each is done over 10 iterations with 200 content items at 20 levels.

Get_All_Published_Content:

Getting published content normally... Complete (took 4248ms)
Getting published content optimized... Complete (took 2977ms)

Get_All_Published_Content_Of_Type:

Getting published content of type normally... Complete (took 12314ms)
Getting published content of type optimized... Complete (took 2928ms)

So there are some clear improvements here as well.

These tests can be found in Umbraco.Tests.Services.PerformanceTests
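The shape of these benchmarks (run each strategy a fixed number of iterations, report total elapsed milliseconds) can be sketched generically; this is not the actual Umbraco.Tests code, just an illustration of the timing pattern behind the numbers above:

```python
import time

def benchmark(name, fn, iterations=10):
    """Run fn `iterations` times and report the total elapsed time in ms."""
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    elapsed_ms = int((time.perf_counter() - start) * 1000)
    print(f"Starting {name} test Complete (took {elapsed_ms}ms)")
    return elapsed_ms
```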


Shannon Deminick 29 Jul 2013, 07:40:52

With the above changes, especially the transactional support, there should never be an empty cmsContentXml table, no matter what. There are still some other improvements to make, but this should really solve the underlying issue.


Shannon Deminick 29 Jul 2013, 07:42:46

Next I'll need to figure out what interactions are done in bulk with the cmsContentXml table, like moving/copying, since whatever interaction is happening there is probably still using the old API and executing one item at a time. In some cases we cannot prevent the one-at-a-time execution, because certain events need to fire to maintain compatibility, but there will be ways to optimize for sure.


Allan S. Laustsen 29 Jul 2013, 12:34:33

Hey Shannon, just got back from a week of vacation, so I did not see your message before now.

About the comment http://issues.umbraco.org/issue/U4-2527#comment=67-8616

  • It was a published node
  • It had 48 sub nodes in multiple levels (I can send a dump of the DB)
  • No re-publish was done right after; the user who did the action received a 500 error in the back-office UI.
  • After the 500 error was reported to me, I tried to re-publish just that node and its children, but it did not work.
  • Then I noticed that we suddenly had a corrupted cmscontentxml table, because the frontend was missing nodes and the umbraco.config was not complete.
  • I then did a republish of all nodes with xml=true, but it continued to time out (21,000+ content nodes).

After this I changed the entire logic in the republish methods and extended the patch that uses the task approach, so that it now publishes in "chunks" and re-uses Document instances, and it can now do the re-publish-xml in 3-5 minutes.


Sean Sachen 29 Jul 2013, 17:12:50

Thanks Shannon. I have moved our instance to 6.1.3 and, except for some razor errors, it all went pretty smoothly. Just so you know, the repub process on this setup took almost an hour to complete (11,230 nodes).

We will be keeping our eye out for updates especially regarding the publishing issues.

Thanks.

Sean


Shannon Deminick 30 Jul 2013, 00:12:20

Hi Sean,

What is the actual repub process you are doing?

If you want the new changes I've committed you can test against the nightly build here: http://nightly.umbraco.org/umbraco%206.1.4/UmbracoCms.6.1.4-build.20.zip


Shannon Deminick 30 Jul 2013, 00:14:19

Hi @Allan, hope you had a nice vacation!

Regarding the original 500 error that your user received, I'm assuming this is due to a request timeout since we haven't applied an async task to copy/move yet (will do though). Do you have any other details about "I then tried, to re-publish just that node and it's children, but it did not work" ?


Allan S. Laustsen 01 Aug 2013, 10:19:39

Hi Shannon

Good news about the progress on the entire publish/re-publish functionality in Umbraco.

About the "I then tried to re-publish just that node and its children, but it did not work": after the 500 error, the node had been moved correctly in the backoffice, but I knew that it was probably a publish node error (timeout) that caused the exception. So to fix it I did a simple "right-click > publish node" with all published children, instead of a full re-publish of the entire content tree. This action reported "success", but the frontend was not working: the site loaded, but not all sub-nodes were visible.


Shannon Deminick 01 Aug 2013, 13:03:22

Thanks Allan, gonna work some more on this tomorrow; hopefully @Sean can reply with the specific repub process he's doing that took an hour. There are kind of a few issues here; I just want to pinpoint each one and create an issue for each so we can track it all properly.


Sean Sachen 01 Aug 2013, 13:17:51

Hey guys, The repub process was using the /umbraco/dialogs/republish.aspx?xml=true. It was excruciatingly slow to watch it rebuild.


Shannon Deminick 01 Aug 2013, 13:29:56

Thx @Sean!

I have a large db, but if you don't mind I could test against yours too; it might be quite useful. You can attach it here if you are able to, and just make the post visible to HQ only.


Allan S. Laustsen 01 Aug 2013, 19:19:33

Hi Shannon

Found a related bug when using the constructor (umbraco.cms.businesslogic.web) new Document(optimized, id) in the process of loading the documents for the re-publish: the sortOrder property is NOT loaded and defaults to "0". I have found the bug in 4.7.1.1 (my custom sources) but also in the latest 4.11.10 release, where the code is "new Document(true, dr.GetInt("nodeId"))". Not sure if the issue also exists in the latest 6.x release.

After a re-publish the nodes on the frontend could potentially be in the wrong order, and a save-publish on a single node will mess up the sort order completely, because that node will then get the correct sortOrder from the DB while the other nodes on the same level still have "0".

PS. As a comment on @Sean's slow re-publish: my optimized re-publish can now re-publish all our 20,000+ nodes in about 4 minutes, and it's all done in the same task-style way, so any failures will just be corrected on the next run.


Shannon Deminick 02 Aug 2013, 01:56:48

Hi Allan,

Unfortunately there will be no more fixes to any version less than 6.1.x unless it is a security issue. The entire process regarding the Document object and its optimized ctor is replaced by the new API, so it is (hopefully) no longer an issue.


Shannon Deminick 02 Aug 2013, 06:44:47

I've also created this task: http://issues.umbraco.org/issue/U4-2580


Shannon Deminick 02 Aug 2013, 07:07:41

Just an update on this in 6.1.4, I have 2000 published nodes and performing a /umbraco/dialogs/republish.aspx?xml=true takes 38 seconds.

I'll look at the profiling of this to find out what the main bottleneck is now.


Shannon Deminick 02 Aug 2013, 08:28:00

More GOOD news, just did some profiling and found the other bottleneck with this. From 38 seconds, it now takes 8 seconds !!!!! :)

Just need to get some tests done now.


Allan S. Laustsen 02 Aug 2013, 08:48:37

Good work @Shannon


Shannon Deminick 05 Aug 2013, 03:23:43

I've pushed up the latest changes to the repo. It takes about 8 seconds to re-build all xml structures for about 2000 published nodes... this is clearly leaps and bounds better than before. I've re-run the profiler and there are now 2 other bottlenecks which hopefully we can fix in the coming weeks:

  • The main bottleneck is the calls to ContentRepository.GetPropertyCollection. It could be pretty 'easy' to add an overload to this method to get the property collections for many nodes at once instead of querying for the collection for each node individually. Even if we could only get the property collections in batches (based on the maximum number of sql parameters) or based on the current content level (parentId), this would save another 70% of the processing involved. This will probably also help in a few other areas, I'm sure.
  • The other, smaller bottleneck is that MNTP (or any other data type) fetches its pre-values from the db any time it wants to re-gen XML. I'm sure we can request-cache the output of umbraco.cms.businesslogic.datatype.PreValues.GetPreValues. That should be pretty straightforward and save a bunch of processing and SQL calls.
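The batched-lookup idea above can be sketched as follows; `MAX_SQL_PARAMS`, the `properties` table, and the query shape are illustrative assumptions, not Umbraco's actual repository code:

```python
# Assumed cap; SQL Server allows roughly 2100 parameters per statement.
MAX_SQL_PARAMS = 2000

def get_property_collections(conn, node_ids):
    """Fetch properties for many nodes with a handful of IN-clause queries
    instead of one query per node."""
    properties = {node_id: [] for node_id in node_ids}
    for start in range(0, len(node_ids), MAX_SQL_PARAMS):
        batch = node_ids[start:start + MAX_SQL_PARAMS]
        placeholders = ",".join("?" * len(batch))
        query = ("SELECT nodeId, alias, value FROM properties "
                 "WHERE nodeId IN (%s)" % placeholders)
        for node_id, alias, value in conn.execute(query, batch):
            properties[node_id].append((alias, value))
    return properties
```

With 10,000 nodes this turns 10,000 round trips into 5 queries, which is where the bulk of the claimed savings would come from.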


Shannon Deminick 28 Aug 2014, 07:38:11

I'm closing this issue now. The underlying issue listed is fixed and there have been plenty of optimizations; the remaining bottlenecks are at the service/repo levels, which we're working on.


Priority: Normal

Type: Bug

State: Closed

Assignee: Shannon Deminick

Difficulty: Normal

Category:

Backwards Compatible: True

Fix Submitted:

Affected versions:

Due in version:

Sprint:

Story Points:

Cycle: