U4-9598 - Deeply foldered content is not correctly indexed by indexes where supportUnpublished="false"

Created by Mark Bowser 04 Mar 2017, 00:24:39 Updated by Shannon Deminick 28 Mar 2017, 02:33:06

Tags: PR

Relates to: U4-2463

Subtask of: U4-9609

My problem is that when I reindex an indexer with supportUnpublished="false" through the Examine Manager, nodes that are foldered deep enough to be at Level 6 are dropped from the index. When I save and publish those nodes, they are indexed again.

I have a website with the following structure:

CONTENT
--us
--uk
--fr
--de
--Shared Resources
------Blog
----------Authors
----------Articles
------------2013
------------2014
------------2015
------------2016
------------2017
----------------01
----------------02
--------------------Blog Post Inside of Month Folder
----------------Blog Post Outside of Month Folder

When I publish the "Blog Post Inside of Month Folder", I can see that it is indexed in my External Indexer. If I attach an event handler to the external indexer's GatheringNodeData event, I can see that the handler is triggered when I save and publish my node. However, if I go to the Examine Manager and rebuild the external indexer, my GatheringNodeData event handler is never called for that node and my blog post is removed from the external index. This only seems to be the case when my nodes are foldered at level 6. If I take my blog posts and move them outside of their month folders, they index as expected. Also, if I open up the /config/ExamineSettings.config file and set supportUnpublished="true" on the indexer, all of my nodes index as expected regardless of how deeply they are foldered. I have reproduced this with the ExternalIndexer and our custom BlogIndexer. I have also reproduced this with heavily foldered landing pages outside of the blog area.

Strangely enough, some old blog articles don't run into this problem. The deeply foldered blog articles behave correctly from posts created around March 2015 and older. When I save and publish these older nodes, they start to misbehave as well. This whole issue started when we upgraded from Umbraco 7.2.8 to 7.5.8. I've tried upgrading again to 7.5.10, but it didn't change anything.

There aren't any exceptions being thrown or interesting errors in my logs. I'm still able to reproduce this and play around with it, so let me know if anyone needs any details or help reproducing.

This issue seems reminiscent of U4-2463. I linked them. Hope that was the right thing to do.

Comments

Mark Bowser 10 Mar 2017, 09:57:32

I figured it out. The problem was in the UmbracoExamine.UmbracoContentIndexer's ReindexWithXmlEntries method. The indexing is done in pages with a max page size of 10,000 nodes. On each iteration, the do-while loop in ReindexWithXmlEntries() calls its getPagedXmlEntries function to fetch the current page. Among other things, getPagedXmlEntries filters out any results that are children of unpublished parent nodes. Unfortunately, the do-while loop would only move on to the next page if the number of filtered results from getPagedXmlEntries was the same as the page size. As soon as a single entry on a page was filtered out, the count came up short and the loop stopped. For sites with more than 10,000 nodes where some of the nodes are unpublished, the nodes on later pages (in my case the deeply nested ones) are never indexed.
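
To make the stop condition easier to see, here is a small simulation of the paging loop described above. It is only a sketch in Python, not the actual UmbracoExamine code: the names Entry, fetch_page, reindex_buggy and reindex_fixed are made up for illustration, and the "fixed" variant shows one plausible way to correct the condition (comparing the unfiltered count to the page size), which is not necessarily how either pull request does it.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    node_id: int
    parent_published: bool  # False: the row is filtered out of the index


def fetch_page(rows: List[Entry], page_index: int, page_size: int) -> List[Entry]:
    """Return one raw (unfiltered) page; stands in for the database query."""
    start = page_index * page_size
    return rows[start:start + page_size]


def reindex_buggy(rows: List[Entry], page_size: int) -> List[int]:
    """Continue only while the FILTERED count equals the page size."""
    indexed, page_index = [], 0
    while True:
        raw = fetch_page(rows, page_index, page_size)
        kept = [e for e in raw if e.parent_published]
        indexed.extend(e.node_id for e in kept)
        page_index += 1
        if len(kept) != page_size:  # one filtered row ends the loop early
            break
    return indexed


def reindex_fixed(rows: List[Entry], page_size: int) -> List[int]:
    """Continue while the RAW count equals the page size."""
    indexed, page_index = [], 0
    while True:
        raw = fetch_page(rows, page_index, page_size)
        indexed.extend(e.node_id for e in raw if e.parent_published)
        page_index += 1
        if len(raw) < page_size:  # only a short raw page means "no more data"
            break
    return indexed


# Six rows; row 2 sits under an unpublished parent. A tiny page size
# stands in for the real 10,000 so the effect shows up on a small data set.
rows = [Entry(i, i != 2) for i in range(1, 7)]
buggy = reindex_buggy(rows, page_size=2)
fixed = reindex_fixed(rows, page_size=2)
print(buggy)  # [1]
print(fixed)  # [1, 3, 4, 5, 6]
```

In the buggy variant the first page loses one row to the filter, so the loop exits after page one and rows 3 to 6 are never reached, even though they are perfectly indexable.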

I submitted a pull request. Let me know if I need to rework anything or submit my pull request in a different way. I'm not 100% sure I'm following the correct protocol.


Shannon Deminick 10 Mar 2017, 10:05:20

PR: https://github.com/umbraco/Umbraco-CMS/pull/1789


Nicholas Westby 13 Mar 2017, 18:40:30

I wonder if this forum post is related to this issue: https://our.umbraco.org/forum/using-umbraco-and-getting-started/84540-problem-with-examine-indexes-with-15000plus-media-items-missing-items


Stephan 24 Mar 2017, 12:06:34

Proposing a slightly different approach in PR https://github.com/umbraco/Umbraco-CMS/pull/1827 (well, it's achieving the same, just differently).

Review = code review; it can also be tested by temporarily changing the 10000 page size to e.g. 2 and re-indexing.


Stephan 24 Mar 2017, 12:07:11

(if ok with the 2nd PR, don't forget to close the 1st)


Priority: Major

Type: Bug

State: Fixed

Assignee:

Difficulty: Normal

Category:

Backwards Compatible: True

Fix Submitted: Pull request

Affected versions: 7.5.8, 7.5.9, 7.5.10

Due in version: 7.5.12

Sprint: Sprint 55

Story Points: 1

Cycle: