We have moved to GitHub Issues
Created by Shannon Deminick 22 Jul 2013, 01:11:23 Updated by Shannon Deminick 28 Aug 2014, 07:38:15
Relates to: U4-493
Relates to: U4-2556
Relates to: U4-2562
Relates to: U4-2617
This relates to this issue: U4-493
We need to overhaul/optimize the way the xml cache is reconstructed. In larger sites there's a chance that a service might timeout, whether that is a db connection, an IIS request, etc.. which will mean that the xml cache is cleared and not re-built. The reason there might be timeouts is due to the performance of how this currently works which is in dire need of optimization especially when running very large sites (due to the poor SQL performance and the number of iterations).
Today we experienced the issue again, cmscontentxml table was corrupted (missing nodes), this time the cause was a backoffice copy-move action that failed. This was a node with many sub-nodes that was moved to a different location in the backoffice content tree. And with +20.000 content nodes it has now taken almost 3 hours to re-publish the table (with the patch I supplied for issue: http://issues.umbraco.org/issue/U4-493) So the patch is deffently not perfect, it's simply to slow in the re-generation process.
Ideas for a better way to re-publish the entire content tree:
Hey Allan, can you please provider full details about the "backoffice copy-move action that failed". The APIs that do this are slightly different from the optimizations we've made last time so need to know exactly what failed. Was the node that was moved already published or was it un-published? After you moved it, did you re-publish it? If so, how did you do that, did you just publish the single node or use the context menu to publish it and all children? etc...
Also I'd suggest upgrading to the latest version, since 4.10.11 there's been many performance improvements which probably also directly affect the underlying processing that is happening during this sequence of events. You can see lists of bug fixes here for each version: http://our.umbraco.org/contribute/releases
I know we haven't solved the underlying problem but it would definitely be worth a test to see what performance gains you might get out of the latest version.
Hey @Allan, if you can get back to me regarding the above comment: http://issues.umbraco.org/issue/U4-2527#comment=67-8616 that would be appreciated.
@Allan, I'm just reviewing some of the changes already implemented in the latest 6.x versions and there's been a lot of perf improvements already with the rebuilding of this table. We no longer actually truncate this table at all, we only repopulate the items that are necessary based on the content type(s) being changed.
Today I'm looking at some other performance improvements we can make for updating this table.
@Shannon, we are running 4.10.11 and are still facing this problem even after applying the patch that Allan provided.
In our case, it appears that there is an automatic rebuild of the cache table in the DB which also does a rebuild of the cache file shortly thereafter. The contentXml table takes roughly 45 minutes to rebuild itself and, since the cache file was rebuilt shortly after the DB started, the file is either empty or has hardly any records to it leaving our sites dead in the water if they even respond at all.
This event hit us early this morning, again. Prior to this, I'd force the republshing of the entire site via the dialogs url, but today I checked the contentXml table and noticed it was still building so I let it run its course - we have 11,232 nodes. After the process completed, the cache file was built and the sites started working again. Maybe this is triggered by an application reset, but I'm not sure. Regardless, this can't continue.
I've attached our umbracoLog data from this morning. At 613 am, the site went down and it started coming back together after 630am. FWIW, we've also seen this on previous versions, back to 4.7.0.
Anything that can be done to help would be greatly appreciated.
I am strongly recommending to everyone that has this issue or any performance issues to upgrade to the latest verion (6.1.3 currently). There has been several improvements since your version regarding performance and the entire pipeline of this processing. The upgrade is straight forward. Any new improvements will also only be made to versions greater than 6.1.3. I am currently working on more improvements (which I started on friday).
Have made some good progress on this in this branch: https://github.com/umbraco/Umbraco-CMS/tree/6.1.4-cmsContentXml_Optimizations
Have improved 3 aspects of the database interaction for rebuilding the cmsContentXml table:
Starting truncate + normal insert test Complete (took 41190ms) Starting update test Complete (took 38727ms) Starting truncate + bulk insert test Complete (took 15508ms) Starting truncate + bulk insert test in one transaction Complete (took 14964ms)
So the bulk insert performance is clearly a ton better which takes only 36% of the time to execute as it did before. The change also ensure that all of the data is looked up and ready to be inserted before doing so, then we put the entire delete and bulk insert operation in one transaction so if anything fails in that process it will be rolled back, plus this adds a bit of extra performance.
Next we're 2 benchmarks on how we are looking up data:
Looking up all published content (if we are rebuilding the whole table) and looking up published content of a certain content type (if we are rebuilding content based on particular content types. These benchmarks are with no cache enabled so in real life they may be faster. It is done over 10 iteration with 200 content items at 20 levels.
Getting published content normally Complete (took 4248ms) Getting published content optimized Complete (took 2977ms)
Getting published content of type normally Complete (took 12314ms) Getting published content of type optimized Complete (took 2928ms)
So there are some clear improvements here as well.
These tests can be found in Umbraco.Tests.Services.PerformanceTests
With the above changes, especially with the transactional support, there should never be an empty cmsContentXml table no matter what. There's still some other improvements to make but this should really solve the underlying issue.
Next I'll need to figure out what interactions are done in bulk with the cmsContentXml table, like moving/copying since whatever interaction is happening there is probably still using the old api and things are probably still executing one at a time. In some cases we cannot prevent the one-at-a-time execution because certain events need to fire to maintain compatibility but there'll be ways to optimize for sure.
Hey Shannon Just got back from a week of vacation, so I did not see your message before now.
About the comment http://issues.umbraco.org/issue/U4-2527#comment=67-8616
After this I have changed the entire logic in the republish methods, and extended the patch that uses the task-approach, so that it now publishes in "chunks" and re-uses Document instances and it can now do the re-publish-xml in 3-5 min.
Thanks Shannon. I have moved our instance to 6.1.3 and, except for some razor errors, it all went pretty smoothly. Just so you know, that the repub process on this setup took almost an hour to complete - 11,230 nodes.
We will be keeping our eye out for updates especially regarding the publishing issues.
What is the actual repub process you are doing?
If you want the new changes I've committed you can test against the nightly build here: http://nightly.umbraco.org/umbraco%206.1.4/UmbracoCms.6.1.4-build.20.zip
Hi @Allan, hope you had a nice vacation!
Regarding the original 500 error that your user received, I'm assuming this is due to a request timeout since we haven't applied an async task to copy/move yet (will do though). Do you have any other details about "I then tried, to re-publish just that node and it's children, but it did not work" ?
Good news about the progress on the entire publish/re-publish functionality in Umbraco.
About the "I then tried, to re-publish just that node and it's children, but it did not work", After the 500 error, the node had been moved correctly in the backoffice, but I knew that it properly was a publish node error (timeout) that caused the exception, so to fix it I did a simple "right-click>publish node" with all published children, instead of doing a full re-publish of the entire content tree. This action reported "success" but the frontend was not working, the site loaded, but not all sub-nodes where visible on the frontend.
Thanks Allan, gonna work some more on this tomorrow, hopefully @Sean can reply with the specific repub process that he's doing that took an hour. There's kind of a few issues here, just want to pinpoint each one and create an issue for each so we can track it all properly.
Hey guys, The repub process was using the /umbraco/dialogs/republish.aspx?xml=true. It was excruciatingly slow to watch it rebuild.
I have a large db but if you don't mind I could test against yours too, might be quite useful. You can attach it here if you are able to and just make the post visible to HQ only.
Found a releated bug, when using the constructor (umbraco.cms.businesslogic.web) new Document(optimized, id) in the process of loading the documents for the re-publish. The sortOrder property is NOT loaded, default set to "0", I have found the bug in 18.104.22.168 (my custom sources) but also in the 4.11.10 latest release, where the code is "new Document(true, dr.GetInt("nodeId"))" not sure if the issue also exists in the latest 6.xx release.
After a re-publish the nodes on the frontend would be potentially wrong order, and a save-publish on a single node will mess the sort order completely, because it will then get the correct sortOrder from the DB, but the other nodes on same level will still have "0".
PS. As a comment for @Sean's slow re-publish, my optimized re-publish can now re-publish all our +20.000 nodes in about 4 minutes, and it's all done in the same task-style way, so that any failures will just be corrected at the next run.
Unfortunately there will be no more fixes to any version less than 6.1.x unless it is a security issue. The entire process regarding using the Document object and it's optimized ctor is replaced with the new api so it is (hopefully) not an issue.
I've also created this task: http://issues.umbraco.org/issue/U4-2580
Just an update on this in 6.1.4, I have 2000 published nodes and performing a /umbraco/dialogs/republish.aspx?xml=true takes 38 seconds.
I'll look at the profiling of this to find out what the main bottleneck is now.
More GOOD news, just did some profiling and found the other bottleneck with this. From 38 seconds, it now takes 8 seconds !!!!! :)
Just need to get some tests done now.
Good work @Shannon
I've pushed up the latest changes to the repo. It takes about 8 seconds to re-build all xml structures for about 2000 published nodes.... this is clearly leaps and bounds better than before. I've re-run the profiler and there's now 2 other bottlenecks which hopefully we can fix in the coming weeks:
I'm closing this issue now, the underlying issue listed is fixed, there's been plenty of optimizations done, now the bottlenecks are the service/repo levels which we're working on.
Assignee: Shannon Deminick
Backwards Compatible: True
Due in version: