U4-7295 - Examine search with the data in the grid should be easier OOTB

Created by Shannon Deminick 22 Oct 2015, 17:03:07 Updated by Shannon Deminick 19 May 2016, 09:08:09

Relates to: U4-8437

Currently the grid stores a JSON structure and we simply put that whole structure into the Lucene index, which is difficult to search on and requires parsing out the terms into different fields during indexing. OOTB it should be straightforward for developers to search against grid data.

1 Attachments

Comments

Douglas Robar 06 May 2016, 09:06:55

Hi, @Shandem. I've got some ideas about this. Especially around how ezSearch works with ranking the results based on the frequency of hits and the order of properties to be searched. Would enjoy talking to you about that if you're open to speaking to someone who has zero idea how to implement his ideas. :)


Shannon Deminick 10 May 2016, 16:25:25

Here's the gist:

  • We will save a cleaned-up value (HTML, JSON and arbitrary values removed) in the property field. For example, if the grid property type is called "myGrid" then the indexed value for myGrid will contain a cleaned version of the whole JSON value
  • We will save individual field values for each 'area'. For example, if you have 2 x Headlines and 3 x Article areas in your grid, the values will be saved like:

"myGrid.Headline" = "value 1"
"myGrid.Headline" = "value 2"
"myGrid.Article" = "article 1"
"myGrid.Article" = "article 2"
"myGrid.Article" = "article 3"

This is possible because Lucene lets you store multiple values for the same key per document. This means you can search sub-grid data based on the area.
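As a rough illustration of the two bullet points above (written in Python for brevity, and not the code in the PR), the sketch below walks a simplified grid value and produces one cleaned whole-property value plus one key/value pair per control, repeating a key when the same row name occurs more than once. The JSON shape (sections, rows, areas, controls, value), the sample data and the function name grid_index_fields are all assumptions made for the example.

```python
import json
import re

# Hypothetical, simplified grid value; real grid JSON carries many more properties.
SAMPLE_GRID = json.dumps({
    "sections": [{
        "rows": [
            {"name": "Headline", "areas": [{"controls": [{"value": "value 1"}]}]},
            {"name": "Headline", "areas": [{"controls": [{"value": "value 2"}]}]},
            {"name": "Article", "areas": [{"controls": [{"value": "<p>article 1</p>"}]}]},
        ]
    }]
})

TAG_RE = re.compile(r"<[^>]+>")  # crude HTML tag stripper, good enough for a sketch


def grid_index_fields(alias, raw_json):
    """Return (field name, value) pairs; names may repeat, as Lucene allows."""
    grid = json.loads(raw_json)
    pairs = []
    for section in grid.get("sections", []):
        for row in section.get("rows", []):
            for area in row.get("areas", []):
                for control in area.get("controls", []):
                    value = control.get("value")
                    if isinstance(value, str):
                        text = TAG_RE.sub(" ", value).strip()
                        if text:
                            pairs.append((f"{alias}.{row['name']}", text))
    # The whole-property field holds every cleaned value joined together.
    combined = " ".join(text for _, text in pairs)
    return [(alias, combined)] + pairs


fields = grid_index_fields("myGrid", SAMPLE_GRID)
```

Here fields starts with the combined "myGrid" entry, followed by two "myGrid.Headline" entries and one "myGrid.Article" entry.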


Shannon Deminick 10 May 2016, 16:30:38

PR is here: https://github.com/umbraco/Umbraco-CMS/pull/1260

To test:

  • Create a new document with a grid

  • Create a single row for the grid with a "Headline" and add the string "Findme" to it; whether the editor is an RTE, headline or quote doesn't matter

  • Publish

  • Go to Examine Management in developer section

  • Go to internal searcher

  • Use a Lucene query and type in (for example, if your grid property type alias is 'myGrid'): myGrid:Findme

  • You should get a record

  • Use a Lucene query and type in (for example, if your grid property type alias is 'myGrid'): myGrid:findme

  • You won't get a record because the default is the Whitespace analyzer, which is case sensitive - I know some hate this but that is a different issue

  • Go to external searcher

  • Use a Lucene query and type in (for example, if your grid property type alias is 'myGrid'): myGrid:Findme

  • You should get a record

  • Use a Lucene query and type in (for example, if your grid property type alias is 'myGrid'): myGrid:findme

  • You should get a record - standard analyzer is case insensitive

  • Do the same two tests using myGrid.Headline as the property name in the Lucene query (for example, myGrid.Headline:Findme) and you should get the same results
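
The case-sensitivity difference in the steps above can be pictured with a toy model. This is only a sketch in Python, not Lucene or Examine code: the real analyzers do considerably more than this, and the function names are made up for the example. The one behaviour it models is that a whitespace-style analyzer keeps case while a standard-style analyzer lowercases terms, with the query term passing through the same analysis as the indexed text.

```python
def whitespace_terms(text):
    """Split on whitespace only and keep the original case."""
    return text.split()


def lowercased_terms(text):
    """Split on whitespace and lowercase every term."""
    return [term.lower() for term in text.split()]


def matches(index_terms, query_term, analyze):
    # The query term is analyzed the same way as the indexed text.
    return any(q in index_terms for q in analyze(query_term))


indexed_text = "Findme"
internal_index = whitespace_terms(indexed_text)   # models the internal searcher
external_index = lowercased_terms(indexed_text)   # models the external searcher

queries = ("Findme", "findme")
internal_hits = [matches(internal_index, q, whitespace_terms) for q in queries]
external_hits = [matches(external_index, q, lowercased_terms) for q in queries]
```

With this model, internal_hits is [True, False] and external_hits is [True, True], mirroring the expected results in the test steps.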


bob baty-barr 10 May 2016, 19:46:00

will this work for all JSON types? like Archetype??


Matt Brailsford 10 May 2016, 20:17:43

Looking at the PR, it seems the responsibility of how the data gets indexed is being given to the property editor in question, so whilst this fixes this specific issue, it would be more beneficial IMO if this "issue" could be widened to address how all JSON based property/grid editors can be indexed in a more generic (and recursive) way, rather than each PE having to handle this themselves.

If you look at the PR from purely a grid perspective, there is still the potential issue where if someone uses a grid editor which stores JSON, you would still end up with JSON stored in the grid fields in Lucene, so this fix doesn't prevent JSON getting into the Lucene index for the grid.

Whilst I appreciate that this would come down to developer education to know what PE/GE generates JSON, not all devs are going to know this, and for them they will simply wonder why they can't search their data, when it's promoted that the grid supports indexing.


Shannon Deminick 10 May 2016, 20:29:38

I'm open to suggestions on how you could possibly index 'any' JSON with nested of nested of nested, etc. properties and explain to a user how to search that. The other issue is that there are probably a bunch of these values you don't want in the index (like much of the grid).

Also note that JSON will not actually go into the index directly because the analyzer will strip a lot of it out. I was actually thinking of doing a lot of this with an analyzer (which I still may consider) but that implementation will not work for everything either.

There's no magic bullet here folks. Archetype can implement this in a similar way as this PR, same goes with other json stored values but considering that these json values can be literally anything and are not like flattened json documents you'd give to a document db I don't see one solution for all of this.

The only thing I can see that would be helpful to streamline this would be to add a method to the base property value editor (or similar) that asks the property editor for key/values that it wants to index based on the db value it is storing.


Kevin Giszewski 10 May 2016, 20:31:24

I'd be interested in a more automated approach for non-developers, but I don't see how the current method is all that complicated for developers: https://github.com/kgiszewski/LearnUmbraco7/blob/master/Chapter%2009%20-%20Searching%20with%20Examine/01%20-%20Built-in%20Functionality.md#complex-property-values

Due to the dynamic nature of JSON editors, it seems like a difficult needle to thread.


bob baty-barr 10 May 2016, 20:36:00

to follow what @kgiszewski linked above... i COULD ACTUALLY follow that example and just implemented it for a project... but in a perfect world... sure would be nice for it to be auto-indexed and clean ;)


Shannon Deminick 10 May 2016, 20:40:07

This is the problem: how can you have clean data based on any JSON document without knowing what part of the JSON is relevant? Sure, I can flatten the whole thing and index it but it will not be helpful because you won't be able to search it very well. (I'll upload an example of what I'm talking about, just gotta grab my laptop)


Matt Brailsford 10 May 2016, 20:50:53

I think giving responsibility to the PE isn't a bad idea, I think what you mention @Shandem about simplifying it into each PE returning a dictionary is more what I was thinking about, as, like you say, it won't make sense to index everything and the PE is the one who knows best what to index / not index, but this saves each dev from having to write the same code to hook into the index writer, and then format data to stuff into the index, so returning a dictionary would be way simpler.

There is still the question of recursive data though I guess, a big thing for Archetype and Nested Content (and the grid if you take into account things like DocTypeGridEditor, which is probably a slightly trickier problem for grid editors). Would we just want to extend the Lucene key with '.' notation?

myGrid.myDocTypeGridEditor.myNestedContent.Headline?

Matt


Shannon Deminick 10 May 2016, 20:51:08

Here's an example of the flattened JSON you'd get if you flattened the Grid. These keys are not exactly usable. You could remove the array indexes that are added when flattening so you don't end up with indexed keys for parts of JSON collections. Then, to search the grid for documents that have 'findMe' in the Headline area, you'd have to do something like:

+sections.rows.areas.controls.value: "findMe" +sections.rows.name: "Headline"

This is not so nice but this would be the only way you can make it automatically 'work' for all json. You guys can tell me what you'd rather have and which is easier to explain to users

(indexes will also end up being much larger because we're storing long keys and arbitrary data)
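
To make the trade-off concrete, here is a minimal flattening sketch in Python. It is an illustration only, not the approach taken in the PR; the sample grid value is a heavily simplified assumption, and the flatten function and its keep_indexes flag are invented for the example. With array positions kept, every key is unique to one spot in the document; with them dropped, the keys collapse to the sections.rows.* form used in the query above.

```python
def flatten(node, prefix="", keep_indexes=True):
    """Yield (dotted key, value) pairs for every leaf of a parsed JSON value."""
    if isinstance(node, dict):
        for key, child in node.items():
            path = f"{prefix}.{key}" if prefix else key
            yield from flatten(child, path, keep_indexes)
    elif isinstance(node, list):
        for i, child in enumerate(node):
            # Optionally record the array position in the key.
            path = f"{prefix}[{i}]" if keep_indexes else prefix
            yield from flatten(child, path, keep_indexes)
    else:
        yield prefix, node


# Hypothetical, heavily simplified grid value.
grid = {
    "sections": [{
        "rows": [{
            "name": "Headline",
            "areas": [{"controls": [{"value": "findMe"}]}],
        }]
    }]
}

with_indexes = list(flatten(grid))
without_indexes = list(flatten(grid, keep_indexes=False))
```

Note that once the positions are dropped, nothing ties a given value back to the row it came from, which is part of why such keys are hard to search reliably.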


Matt Brailsford 10 May 2016, 20:58:02

I guess it's a question as well of what type of search do people expect to be able to do OOTB?

  1. just want to find something within the entire field defined on the doc type and so the json structure is irrelevant, just extract all meaningful data and store in a single lucene field
  2. want to search within a specific inner property within the field on the doc type in which case the ability to address a specific inner field like you mention would be required.


Shannon Deminick 10 May 2016, 21:08:44

Exactly, but both of these questions have complexity:

  1. the keyword here is 'meaningful'. If I automatically just put all string values from the flattened json in a field based on the grid value, it will contain all sorts of non-meaningful data which will be indexed and so search doesn't work well
  2. this again comes down to defining what key/value pairs are relevant

I think part of the answer here is to somehow abstract away the indexing side of things here and expose a method on a property editor/property value editor/or maybe even have a new type of property value converter for indexing data which simply returns IEnumerable<KeyValuePair<string,string>> based on the current value being saved in the field.

Just thinking out loud here... maybe the idea of extending a PropertyValueConverter is the way to go? These are already plugins, we already scan for them, we could just say if you also implement IIndexValueConverter, we will use that based on the field being indexed. BUT (and there's always a but) I'm actually not sure this would be possible (now) because: The indexer actually doesn't really know anything about the document being indexed, it's just given some data so it doesn't really know what property type, property editor, etc... the data belongs to. Hrm...
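
The shape of that idea can be sketched as follows. This is a Python illustration only: IIndexValueConverter did not exist at the time of this thread, so every name here (IndexValueConverter, TagListConverter, values_to_index, the "Example.TagList" editor alias) is hypothetical. The point is simply that a converter takes the stored value and returns the key/value pairs to index, and that anything without a converter is indexed as-is.

```python
import json
from typing import Iterable, Protocol, Tuple


class IndexValueConverter(Protocol):
    """Hypothetical stand-in for the IIndexValueConverter idea."""

    def is_converter(self, editor_alias: str) -> bool: ...

    def index_values(self, property_alias: str, stored: str) -> Iterable[Tuple[str, str]]: ...


class TagListConverter:
    """Converter for an imaginary editor that stores a JSON array of tags."""

    def is_converter(self, editor_alias: str) -> bool:
        return editor_alias == "Example.TagList"

    def index_values(self, property_alias, stored):
        for tag in json.loads(stored):
            yield property_alias, str(tag)


def values_to_index(converters, editor_alias, property_alias, stored):
    """Use the first matching converter; otherwise index the raw value as-is."""
    for converter in converters:
        if converter.is_converter(editor_alias):
            return list(converter.index_values(property_alias, stored))
    return [(property_alias, stored)]


converters = [TagListConverter()]
converted = values_to_index(converters, "Example.TagList", "tags", '["cms", "search"]')
fallback = values_to_index(converters, "Example.Text", "title", "Hello")
```

The sketch has to be handed the editor alias by its caller, which is exactly the piece of information the indexer is said not to have.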


James South 11 May 2016, 01:12:08

@Shandem I think extending PropertyValueConverter and implementing something like IIndexValueConverter would be a good direction to go in. A fine-grained level of control is imperative. You have the id and alias of the node at least, so you could theoretically grab further details I believe? (Though it would be nice if that was provided to the event handler)

@matt I agree that whatever implementation is settled on should be applicable for all (well, most at least) JSON based PropertyEditors. I'm particularly interested in how we would solve culturally specific search results for something like Vorto. (We don't want Japanese results, for example, returned to an English language user). I had to resort to some awfully complicated regex hackery in order to return something useful to the user in a site I was working on. http://24days.in/umbraco/2015/hacking-around-with-search-and-strong-typed-models/ (Some of that code is simplified and very out of date).


Shannon Deminick 11 May 2016, 05:51:16

I realize we have the ID and then you can go re-look the item back up through the services but this will instantly kill performance so it is not an option. In order to do this I'd have to re-design quite a few things based on Examine's aging codebase. In Examine v2 (which is part of Umbraco 8) we could more easily do this. In any case, I believe this is out of the scope of this task.

If we want to make something like IIndexValueConverter to more easily allow property editors to determine the values they need indexed, that will be a separate task and a separate release version.

For now, I think that what is in this PR will satisfy the majority of users/developers so that at least the Grid is natively indexed. We can explore IIndexValueConverter as a separate issue.


Shannon Deminick 11 May 2016, 06:01:17

I created a separate task for that with some very high level design notes: http://issues.umbraco.org/issue/U4-8437


Matt Brailsford 11 May 2016, 06:51:24

Not an easy one for sure and the landscape of what PE's can do has certainly changed quite dramatically since Lucene/Examine was introduced initially, so waiting till v8 is cool. At least the discussion has begun :)


James South 11 May 2016, 23:35:16

Absolutely agree. This is a very positive step.


Jon Humphrey 18 May 2016, 17:08:25

@everyone, I know this may sound like a simple option but couldn't we add a checkbox on the field creation UI that would highlight it for indexing; or, if that's not enough, then make it a dropdown with 5 options to rank the importance of the field value to the examine index? While I know this won't solve the issue with nested content it would give us a hook that would allow us to gather the data without leaving it up to the index to grab everything and decide its ranking?


Shannon Deminick 18 May 2016, 17:29:19

Hrm, I think you're missing the point of this thread. This task has nothing to do with ranking. You can boost a field easily any way you want with the normal search and Examine APIs. What Doug was talking about with rankings has to do with ezSearch and how that deals with boosting.

This task is about getting the data into the index so you can use it, it doesn't deal with how you search the data.


Jon Humphrey 19 May 2016, 08:54:18

@Shandem, while I understand the point of the thread, my suggestion was to remove the task of having to code the fields into the onGatherNodeData event and resolve the comment you made above:

The only thing I can see that would be helpful to streamline this would be to add a method to the base property value editor (or similar) that asks the property editor for key/values that it wants to index based on the db value it is storing.

I also was trying to move responsibility from the developer/administrator to the new content creator when most of the time the client wants to be able to create content but wouldn't have a clue on how to add items to be indexed.

Hope this clarifies! :-D

PS. I'm also talking about ezSearch and Archetype issues I've run into


Shannon Deminick 19 May 2016, 09:08:09

Hi,

The reality is that a Property Editor developer (i.e. the person creating it) can (and will) store their data in any JSON format that they choose; this could literally be anything. Therefore, it must be up to them to decide how their data gets into the index. My comment was regarding moving the responsibility of getting data into the index to the developer of the Property Editor - so that Umbraco admins, Umbraco developers and Umbraco end users don't have to do anything. The data will just be indexed correctly based on how the Property Editor works and this will all be automatic.


Priority: Task - Pri 1

Type: Bug

State: Fixed

Assignee:

Difficulty: Normal

Category:

Backwards Compatible: True

Fix Submitted:

Affected versions:

Due in version: 7.5.0

Sprint: Sprint 15

Story Points:

Cycle: