U4-9420 - Re-indexing Examine Indexes after Upgrade to 7.5.7 from 7.3.4 takes minutes when it used to take seconds

Created by Jason Prothero 19 Jan 2017, 14:18:25 Updated by Shannon Deminick 08 Apr 2017, 03:55:38

Relates to: U4-9447

I'm managing a site with about 7000 published content items and after the upgrade from version 7.3.4 to 7.5.6 I noticed that re-indexing examine through the Examine Management dashboard in the Developer section of the Back-office takes much longer than it did in the previous version.

The specific index I'm testing is a very small index that limits by Document Type to two types that contains only about 100 items in it. I would expect this to be nearly instant. What I'm seeing is that it takes about 3.5 to 4 minutes to reindex. When I go back to test the old version it takes seconds. This index does only look at published content.

I have no GatheringNodeData events tied to that indexer.

I upgraded to 7.5.7 and the problem persists.

3 Attachments

Comments

Jason Prothero 19 Jan 2017, 14:20:08

@sebastiaan I'll record the details from my debugging of the core in the comments below


Jason Prothero 19 Jan 2017, 14:23:51

When debugging, stepping over this line takes about 3.5 mins:

https://github.com/umbraco/Umbraco-CMS/blob/release-7.5.7/src/UmbracoExamine/UmbracoContentIndexer.cs#L405

Continuing deeper, this is the slow culprit (3.5 mins to step over):

https://github.com/umbraco/Umbraco-CMS/blob/release-7.5.7/src/Umbraco.Core/Persistence/Repositories/VersionableRepositoryBase.cs#L467

When I dig into that, it takes about 30 ms to execute this loop for each item. And for some reason its going through all the published content (about 7000 items). Takes about 3.25 mins to get through all the iterations.

https://github.com/umbraco/Umbraco-CMS/blob/release-7.5.7/src/Umbraco.Core/Persistence/Repositories/ContentRepository.cs#L848

Digging deeper, this statement is where the 30ms time is happening when I step over it:

https://github.com/umbraco/Umbraco-CMS/blob/release-7.5.7/src/Umbraco.Core/Cache/FullDataSetRepositoryCachePolicy.cs#L132

That's where I stopped. Its a little tricker to debug from there. I'm not seeing massive times between steps, but it is looping through about 85 Document Types. Sometimes I see the Select(... DeepClone()) line take 2-3 ms for a loop. Perhaps the number of DocTypes is causing it?


Dan Booth 19 Jan 2017, 17:21:35

I have the same issue with a large-ish site - it has around 8,000 nodes and about 60 doc types. It's a multilingual site, so there are 12 sub-sites. It has four indexes:

InternalMember - only 2 documents (as not used) Internal - 7700 documents WebIndexer - 2435 documents JobBoardIndexer - 2740 documents

Apart from internal indexer, all indexers only look at published content.

It is definitely much slower at indexing in 7.5.7 than it was in 7.3.8 (from which I upgraded it). In fact, when generating all indexes from scratch the server timed out a number of times. And this was running locally on my fairly powerful desktop machine (i7, 16GB RAM, SSD) in IIS Express.

Sorry I can't share the database / site as it contains client-data. If you need any more concrete data I'll try and get it.


Bjarne Fyrstenborg 19 Jan 2017, 23:25:27

I have also noticed this that rebuilding Examine indexes seems to be slower after upgrading from 7.4.1 to 7.5.7. I have a multi-site solution, where the sites have 10K review nodes in listviews, first time after upgrade and no Examine indexes in /App_Data it actually timed out several times before the site was running (although not indexes was built).

I the log files I see errors like:

System.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding. ---> System.ComponentModel.Win32Exception (0x80004005): Ventehandlingen blev midlertidigt afbrudt
   at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at System.Data.SqlClient.SqlDataReader.TryConsumeMetaData()
   at System.Data.SqlClient.SqlDataReader.get_MetaData()
   at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString, Boolean isInternal, Boolean forDescribeParameterEncryption)
   at System.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean async, Int32 timeout, Task& task, Boolean asyncWrite, Boolean inRetry, SqlDataReader ds, Boolean describeParameterEncryptionRequest)
   at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method, TaskCompletionSource`1 completion, Int32 timeout, Task& task, Boolean& usedCache, Boolean asyncWrite, Boolean inRetry)
   at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method)
   at System.Data.SqlClient.SqlCommand.ExecuteReader(CommandBehavior behavior, String method)
   at Umbraco.Core.Persistence.PetaPocoCommandExtensions.<>c__DisplayClass4.<ExecuteReaderWithRetry>b__3()
   at Umbraco.Core.Persistence.FaultHandling.RetryPolicy.ExecuteAction[TResult](Func`1 func)
   at Umbraco.Core.Persistence.Database.<Query>d__7`1.MoveNext()

and

 2017-01-19 08:24:10,963 [P24740/D11/T1539] ERROR Umbraco.Web.WebServices.ExamineManagementApiController - An error occurred rebuilding index
System.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding. ---> System.ComponentModel.Win32Exception (0x80004005): Ventehandlingen blev midlertidigt afbrudt
   at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at System.Data.SqlClient.SqlDataReader.TryConsumeMetaData()
   at System.Data.SqlClient.SqlDataReader.get_MetaData()
   at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString, Boolean isInternal, Boolean forDescribeParameterEncryption)
   at System.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean async, Int32 timeout, Task& task, Boolean asyncWrite, Boolean inRetry, SqlDataReader ds, Boolean describeParameterEncryptionRequest)
   at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method, TaskCompletionSource`1 completion, Int32 timeout, Task& task, Boolean& usedCache, Boolean asyncWrite, Boolean inRetry)
   at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method)
   at System.Data.SqlClient.SqlCommand.ExecuteReader(CommandBehavior behavior, String method)
   at Umbraco.Core.Persistence.PetaPocoCommandExtensions.<>c__DisplayClass4.<ExecuteReaderWithRetry>b__3()
   at Umbraco.Core.Persistence.FaultHandling.RetryPolicy.ExecuteAction[TResult](Func`1 func)
   at Umbraco.Core.Persistence.Database.<Query>d__7`1.MoveNext()
   at System.Collections.Generic.List`1..ctor(IEnumerable`1 collection)
   at System.Linq.Enumerable.ToList[TSource](IEnumerable`1 source)
   at Umbraco.Core.Persistence.Repositories.VersionableRepositoryBase`2.<>c__DisplayClass23.<GetPropertyCollection>b__1a()
   at System.Lazy`1.CreateValue()
   at System.Lazy`1.LazyInitValue()
   at Umbraco.Core.Persistence.Repositories.VersionableRepositoryBase`2.GetPropertyCollection(Sql docSql, IEnumerable`1 documentDefs)
   at Umbraco.Core.Persistence.Repositories.ContentRepository.ProcessQuery(Sql sql, Boolean withCache)
   at Umbraco.Core.Persistence.Repositories.VersionableRepositoryBase`2.GetPagedResultsByQuery[TDto,TContentBase](IQuery`1 query, Int64 pageIndex, Int32 pageSize, Int64& totalRecords, Tuple`2 nodeIdSelect, Func`2 processQuery, String orderBy, Direction orderDirection, Boolean orderBySystemField, Func`1 defaultFilter)
   at Umbraco.Core.Persistence.Repositories.ContentRepository.GetPagedResultsByQuery(IQuery`1 query, Int64 pageIndex, Int32 pageSize, Int64& totalRecords, String orderBy, Direction orderDirection, Boolean orderBySystemField, IQuery`1 filter)
   at Umbraco.Core.Services.ContentService.GetPagedDescendants(Int32 id, Int64 pageIndex, Int32 pageSize, Int64& totalChildren, String orderBy, Direction orderDirection, Boolean orderBySystemField, String filter)
   at UmbracoExamine.UmbracoContentIndexer.PerformIndexAll(String type)
   at UmbracoExamine.BaseUmbracoIndexer.IndexAll(String type)
   at UmbracoExamine.BaseUmbracoIndexer.PerformIndexRebuild()
   at UmbracoExamine.UmbracoContentIndexer.RebuildIndex()
   at Umbraco.Web.WebServices.ExamineManagementApiController.PostRebuildIndex(String indexerName)

When I try to build one of the review indexes it seems to take a while, but I think it timeout before it finish and therefore the index doesn't contain any documents.


Sebastiaan Janssen 20 Jan 2017, 08:46:17

Thanks for the help! I've been able to reproduce now:

7.3.8 site + db backup: https://dl.dropboxusercontent.com/u/3006713/ExaminePerfTest.zip Same site upgraded to 7.5.8 + new db backup: https://dl.dropboxusercontent.com/u/3006713/ExaminePerfTest758.zip

The backoffice credentials for both are: test@test.com / test1234 In 7.3.8 the Internal index takes about 6 seconds to rebuild (this is ~7000 nodes) and in 7.5.8 it takes over a minute to rebuild. I've even had to update the connection string with Connection Timeout=120 in 7.5.8 to make sure the connection doesn't die in the middle of doing querying.

I'm not sure if the fact that this site has multiple root nodes is a clue?


Jason Prothero 20 Jan 2017, 22:41:56

Thanks for digging in and finding a way to repro @sebastiaan!


Shannon Deminick 22 Jan 2017, 13:56:32

I've pushed a PR for review and testing: https://github.com/umbraco/Umbraco-CMS/pull/1701 which contains some notes about the changes.

I'm not exactly sure what exact change(s) caused this based on the history of how UmbracoExamine was looking up data with the Services/Repositories but I suspect it was several changes over a period of time. The main problems are:

  • For each content item looked up we then lookup it's associated content type, because content types are always cached it was assumed we could just lookup the content type with the content type service in a loop - this is actually not good when you have several thousand content items because everytime an item is resolved from cache it has to be deep cloned which isn't especially quick - so I've fixed that
  • For the large content set looked up (in pages of 10,000), we then make a single query to lookup all of those documents' property data which ends up being a huge data set, close to 90,000 rows. For each document we would then filter this data set but even though this is in memory after being retrieved from the DB, performing a filter on 90,000 items 10,000 times is a lot of overhead. So instead we do something slightly more clever and ensure the doc data set is sorted by node id and the property data set is sorted by node id, then we keep an index location stored of the property data as we read forward through it.

I've fixed up some other various performance problems too like not allocating new instances of Content/Media/Property factory for each item - that was a lot of unnecessary allocations. And fixed up a few obsolete warnings.

In my local tests it rebuilds sebastiaans index in about 10 seconds


Shannon Deminick 22 Jan 2017, 13:57:23

@zpqrtbnk would be good for you to have a look through this, I'll assign to you


Sebastiaan Janssen 22 Jan 2017, 16:59:49

Well it went down to about 7 seconds now so that's definitely a huge improvement from it taking over a minute, I also don't need the longer SQL timeout any more!


Stephan 23 Jan 2017, 08:19:40

Review comments:

  • going from instanciated factories to static factories is going to reduce allocs and that's good
  • deep cloning each content type only once is a massive boost I think
  • so basically I'm all happy with it

Only one Q: IContent/MediaService used to GetPagedDescendant sorted by path, meaning that you'd get the parent before the children and could process what the method returned "in document order" (from root of tree to leaves). Now this is changed and although all our tests seem OK I'd consider it a breaking change?


Dan Booth 23 Jan 2017, 08:24:29

Great work, guys, for responding so quickly and providing a fix! #h5yr


Shannon Deminick 23 Jan 2017, 08:54:53

@zpqrtbnk fair call, the issue with sorting by path is that it is not nice for SQL because we don't have an index on it (i.e. we should make one!... in fact I have several db reports on azure saying we should)


Stephan 23 Jan 2017, 13:29:50

then we probably should have an index and not change the sort order? ok to reassign to you?


Shannon Deminick 23 Jan 2017, 13:36:29

@zpqrtbnk I've pushed a new change, just reverted the default sort order but used umbracoNode.id for the UmbracoContentIndexer. We can't add a new index in this version so that will have to wait.

If you're ok with the changes please assign to @sebastiaan


Stephan 23 Jan 2017, 13:44:13

happy


Sebastiaan Janssen 23 Jan 2017, 13:46:31

Cool, I'll have another look, test and will merge.


Sebastiaan Janssen 23 Jan 2017, 15:19:34

Sweet, just under 9 seconds now with the same ~7000 items:

2017-01-23 16:17:37,392 [P4352/D2/T16] INFO UmbracoExamine.DataServices.UmbracoLogService - Rebuilding index, Provider=InternalIndexer, NodeId=-1 2017-01-23 16:17:37,405 [P4352/D2/T16] INFO UmbracoExamine.DataServices.UmbracoLogService - PerformIndexAll - Start data queries - content, Provider=InternalIndexer, NodeId=-1 2017-01-23 16:17:38,595 [P4352/D2/T16] INFO UmbracoExamine.DataServices.UmbracoLogService - PerformIndexAll - End data queries - content, took 1189ms, Provider=InternalIndexer, NodeId=-1 2017-01-23 16:17:38,595 [P4352/D2/T16] INFO UmbracoExamine.DataServices.UmbracoLogService - PerformIndexAll - Start data queries - media, Provider=InternalIndexer, NodeId=-1 2017-01-23 16:17:38,597 [P4352/D2/T16] INFO UmbracoExamine.DataServices.UmbracoLogService - PerformIndexAll - End data queries - media, took 1ms, Provider=InternalIndexer, NodeId=-1 2017-01-23 16:17:45,933 [P4352/D2/T11] INFO Umbraco.Web.WebServices.ExamineManagementApiController - Rebuilding index 'InternalIndexer' done, 14260 items committed (can differ from the number of items in the index)


Jason Prothero 23 Jan 2017, 18:09:10

Is this testable on dev-v7 now? Or should I wait until I get a heads up from you to try it out?

Thanks, Jason


Shannon Deminick 23 Jan 2017, 23:04:19

Hi @Prothero yes, this is all merged in to dev-v7, please test! (if you can test quickly that'd be amazing since we'll be releasing this very soon)


Dan Booth 23 Jan 2017, 23:15:11

@Shandem Don't suppose you have any release DLLs of the build, to save downloading and building it, do you? :) Then I can test too.


Jason Prothero 23 Jan 2017, 23:30:20

I just got the latest dev-v7, built it, copied only the dlls over to my test site, upgraded the DB, and re-indexed examine.

The small index that I tested with above that took 3.5 to 4 minutes to build now takes about 7 seconds. Yay!

Thanks everyone for getting on this so quickly!

#h5yr


Shannon Deminick 23 Jan 2017, 23:34:17

@Prothero Can you do me a favor and verify all of the data that you expect to be in there is in there? Also verify that your list views are working as expected?


Jason Prothero 24 Jan 2017, 00:46:22

I'd have to get some help for a full regression test, but an initial look and everything seems to be working like normal.


Shannon Deminick 24 Jan 2017, 03:42:14

@Bjarne.Fyrstenborg thanks for that, I have verified that with such large data sets that the SQL continues to timeout and during this the SQL server maxes out CPU :/

I've discovered that it's due to the SQL query that we use for content. Normally to retrieve content we perform this:

SELECT * FROM [cmsDocument]
		INNER JOIN [cmsContentVersion]
		ON [cmsDocument].[versionId] = [cmsContentVersion].[VersionId]
		INNER JOIN [cmsContent]
		ON [cmsContentVersion].[ContentId] = [cmsContent].[nodeId]
		INNER JOIN [umbracoNode]
		ON [cmsContent].[nodeId] = [umbracoNode].[id]
		LEFT OUTER JOIN [cmsDocument] [cmsDocument2] ON ([cmsDocument2].[nodeId]=[cmsDocument].[nodeId] AND [cmsDocument2].[published]=1)

When we do paging we take this same query and transform it to just return Ids with the SQL paging notation. The big problem with this is the LEFT OUTER JOIN [cmsDocument] [cmsDocument2] ON ([cmsDocument2].[nodeId]=[cmsDocument].[nodeId] AND [cmsDocument2].[published]=1). When this is included in the joined sub query to do paging this kills SQL because there is not index on cmsDocument.nodeId + cmsDocument.published. The only reason we do this outer join is when retrieving all of the data so that we can get all information about the content including it's latest version and it's published version but when we transform this query to do paging and just return the IDs this Outer join is entirely useless.

Here's the results which are pretty spectacular, with the small adjustment for the paging query and removing the unnecessary outer join when returning Ids, we save 97% performance on SQL, this is enormous. To return property data for 10000 content items on Bjarne's database it took over 4 minutes (of just SQL processing!) now it takes 3 seconds

I'm just benchmarking a few other things, stay tuned.


Bjarne Fyrstenborg 24 Jan 2017, 06:11:20

@Shandem wow, that's a major difference. I am wondering if that is an issue for large member data aswell? I have previously had similar SQL timeouts when indexing ~100K members. However I haven't been working on a site in v7 with that number of members.


Shannon Deminick 24 Jan 2017, 06:24:57

Member indexing logic has changed in 7.5.8 entirely and uses the cmsContentXml table to perform the indexing so it will be WAY faster. The problem with unpublished content is that it doesn't exist as a serialized document anywhere so we need to use the normal data APIs to query the data but because the data is stored in a complex way/normalized way, it's not the most efficient to query with large data sets.


Shannon Deminick 24 Jan 2017, 06:38:38

The changes I've made are here: https://github.com/umbraco/Umbraco-CMS/commit/c7b505fd908d926c194b78907b02e93c7b043c88

I'll explain a bit about them:

*For Content/Media/Members respositories (VersionableRepositoryBase) I've added a new GetBaseQuery method that takes in a BaseQueryType == Full, Ids or Count *A Full query is what was there before which is the main query to return a full content, media or member object(s) *A Count query is what was there before which returns a count of content, media or members *An ID query is a query that only returns the PK/Id for a media, member or content object(s) - for media and members, this is the same as a Full query but for Content it's slightly different (more below) *This GetBaseQuery method is used in the paging logic, before it was always getting the Full query, this is fine for media and members but not for content. The paging logic now requests the ID query when running is Page query since the actual Page query only ever returned IDs. This Page query is also used to construct the properties query that returns all property data for the media, members or content. *The main issue is that the Full Content query contains a LEFT OUTER JOIN on another cmsDocument table join - this is so that we can return the latest and/or published version of a content item in one query in one row. This join is totally unnecessary for the ID query and as it turns out the Full query with this join to return bulk data with a sort is very inefficient. *The VersionableRepositoryBase.GetPropertyCollection has been updated: **We no longer load all property data into memory with Fetch, instead we iterate forward over the SQL data reader which is both faster and much less memory intensive. **We now eagerly load all pre-values for all property types found in the current data set, we used to lazily load these when we encountered a property type that had tag support but we can no longer do this because we cannot execute another reader while we are iterating through a reader. Based on benchmarks, eagerly loading this data while allowing the non-Fetch of property data executes 30% faster, so even though the eagerly loaded data may not even be necessary if none of the property types support tags, it's still much faster


Stephan 24 Jan 2017, 08:04:51

looks good to me


Shannon Deminick 30 Jan 2017, 23:33:33

@Bjarne.Fyrstenborg can you send me the db for that list view issue?


Bjarne Fyrstenborg 30 Jan 2017, 23:40:21

@Shandem do your have the db from last time? The data should be pretty much the same or enough listview data. Otherwise it should be available from same link.. I have also created a new issue for this part :)


Shannon Deminick 31 Jan 2017, 00:14:30

@Bjarne.Fyrstenborg yup I do, i'll test with that. We tested this quite a lot before making this change so it does seem quite odd.


Dan Booth 07 Apr 2017, 11:44:12

@Shandem has there been any regression in 7.5.12? My subjective feeling is that it seems slow again after upgrading, but I've not got any benchmarks to confirm this - but it certainly seems slower. I'm not the only one - see http://issues.umbraco.org/issue/U4-9752


Shannon Deminick 08 Apr 2017, 03:55:38

@DanDiplo My fault :( http://issues.umbraco.org/issue/U4-9752#comment=67-36061 thanks for the heads up


Priority: Normal

Type: Bug

State: Fixed

Assignee:

Difficulty: Normal

Category:

Backwards Compatible: True

Fix Submitted:

Affected versions: 7.5.6, 7.5.7

Due in version: 7.5.8

Sprint: Sprint 51

Story Points:

Cycle: