U4-8449 - Back office search is strongly tied to Lucene syntax - I'd like to create a PR to resolve this

Created by Darren Ferguson 12 May 2016, 17:40:52 Updated by Shannon Deminick 16 May 2017, 02:29:49

Duplicates: U4-2676

Relates to: U4-9389

Relates to: U4-2676

I'm willing to write a PR for this if it gets the green light in principle.

At the moment when you search the back office it is strongly tied to Lucene query syntax - see Umbraco.Web.Editors.EntityController ExamineSearch method

Presently, the method uses a lot of string concatenation to produce a Lucene query.

Proposed - Create an abstraction that could build the query in a non-Lucene format.

Why? - On some sites with a lot of content, the overhead of building and refreshing Lucene indexes to disc causes slow boot times (especially on Azure sites with slow write speed).

With the abstraction above it'd be possible to call an Examine Search provider that simply queried the database which may be preferable in some circumstances. It'd also be possible to write a searcher/indexer pair for Azure search.

I'm asking for feedback here - as I estimate a couple of days dev to do this nicely.

Suggested approach:

Create a switch in the ExamineSearch that says:

if (ExamineManager.Instance.SearchProviderCollection[searcher] is UmbracoExamineSearcher) {

  // Do exactly what it does now.
  // But probably move it to a private method so that it is more readable.

} else {

  // "searcher" is the provider name, so the provider itself needs a different variable name
  var provider = ExamineManager.Instance.SearchProviderCollection[searcher];

  // Offload the creation of the ISearchCriteria to the provider
  var criteria = provider.CreateSearchCriteria(query);

  // Somehow need to handle entity type here

}

var result = internalSearcher.Search(raw, 200);

// Should just work after here.

Comments

David Brendel 12 May 2016, 20:47:05

@darrenjferguson Hi Darren, I also had some thoughts today about the back office search. Not only is the search tied to Examine, it is also tied to searching content, media and members.

I think it would be good to introduce something like a search controller, instead of using the EntityController, which uses multiple search services that each resolve the search for their own purpose.

With this we could also enable package devs to write search services for custom sections.

So the SearchController would get all services from a registry and perform the search on all available services.

With this we also need to abstract the search result object which is currently also tied to examine.

Also maybe performing the search based on the current section would be an option. So you can only get results for the section you are currently using.
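A minimal sketch of the controller-plus-registry idea, to make the shape of the proposal concrete. All type names here (SearchResultItem, IBackOfficeSearchService, SearchServiceRegistry, BackOfficeSearchController, InMemorySearchService) are hypothetical and do not exist in Umbraco; the result type is deliberately free of any Examine dependency.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical result type with no dependency on Examine.
public sealed class SearchResultItem
{
    public SearchResultItem(string section, string name)
    {
        Section = section;
        Name = name;
    }

    public string Section { get; private set; }
    public string Name { get; private set; }
}

// Hypothetical contract: one service per section (content, media, members, custom).
public interface IBackOfficeSearchService
{
    string Section { get; }
    IEnumerable<SearchResultItem> Search(string searchText);
}

// Hypothetical registry that core and packages could both add services to.
public sealed class SearchServiceRegistry
{
    private readonly List<IBackOfficeSearchService> _services = new List<IBackOfficeSearchService>();

    public void Register(IBackOfficeSearchService service)
    {
        _services.Add(service);
    }

    public IEnumerable<IBackOfficeSearchService> All
    {
        get { return _services; }
    }
}

// Stand-in for the proposed controller: fans the query out to every service,
// or only to the services of the current section when one is given.
public sealed class BackOfficeSearchController
{
    private readonly SearchServiceRegistry _registry;

    public BackOfficeSearchController(SearchServiceRegistry registry)
    {
        _registry = registry;
    }

    public IList<SearchResultItem> Search(string searchText, string currentSection = null)
    {
        return _registry.All
            .Where(s => currentSection == null ||
                        string.Equals(s.Section, currentSection, StringComparison.OrdinalIgnoreCase))
            .SelectMany(s => s.Search(searchText))
            .ToList();
    }
}

// Toy service backed by an in-memory list, e.g. for a custom section.
public sealed class InMemorySearchService : IBackOfficeSearchService
{
    private readonly string[] _names;

    public InMemorySearchService(string section, params string[] names)
    {
        Section = section;
        _names = names;
    }

    public string Section { get; private set; }

    public IEnumerable<SearchResultItem> Search(string searchText)
    {
        return _names
            .Where(n => n.IndexOf(searchText, StringComparison.OrdinalIgnoreCase) >= 0)
            .Select(n => new SearchResultItem(Section, n));
    }
}
```

Registering a "content" and a "media" service and calling Search("home") would return hits from both; passing "media" as the current section restricts the results to that section.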

Would be willing to also help with this as it is something I would like to have.


Shannon Deminick 13 May 2016, 12:04:44

Hi all,

This is a great topic and something I've been thinking about for a very long time. Let me just start by saying that abstracting out the back office search isn't going to solve this problem:

the overhead of building and refreshing Lucene indexes to disc causes slow boot times (especially on Azure sites with slow write speed).

mainly because Umbraco Media is cached with Examine/Lucene. This came about by necessity a very long time ago, but it's the way it is and it's not exactly easily replaceable right now. In v8 with 'nucache', the media cache will be the same as content, and it will not use Lucene (well, unless you really want to).

Here are some important things to note, things that have been discussed in other places or just exist in my head - so here they all are in one place:

=ISearchableTree=

http://issues.umbraco.org/issue/U4-2676

This is the ability for any tree to become involved in the back office search - which has no references to Examine. This interface exists today but hasn't been used - this task has been lying around for a very long time. Basically if you create a custom tree, you can implement this which tells umbraco how to search for data in your tree when a user uses the contextual search in the back office.

With this wired up correctly this means that each tree's data search is abstracted. This would be used for all core trees too... instead of the thing Darren mentions above: EntityController.Search.

Though, even if we used ISearchableTree for the content + media + members sections, there wouldn't be a way to 'replace' these implementations. An option could be to create a plugin/resolver type thing to set custom ISearchableTree implementations for a given tree. This is a potential option for user-invoked back office search.

We'd also have to modify the current ISearchableTree to ensure that all required parameters are passed to it and any other info it needs because we need to take into account things like security.
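To illustrate the resolver idea, here is a rough sketch under stated assumptions. ITreeSearcher is a hypothetical stand-in for ISearchableTree (whose actual signature may differ), and TreeSearchRequest, TreeSearcherResolver and FixedListTreeSearcher are invented for this example. The request object shows one possible way of passing security information along with the search text.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Everything a tree searcher needs, including what the user may see (hypothetical shape).
public sealed class TreeSearchRequest
{
    public TreeSearchRequest(string searchText, IEnumerable<string> allowedTreeAliases)
    {
        SearchText = searchText;
        AllowedTreeAliases = new HashSet<string>(allowedTreeAliases, StringComparer.OrdinalIgnoreCase);
    }

    public string SearchText { get; private set; }
    public ISet<string> AllowedTreeAliases { get; private set; }
}

// Hypothetical stand-in for ISearchableTree; the real interface may differ.
public interface ITreeSearcher
{
    IEnumerable<string> Search(TreeSearchRequest request);
}

// Resolver: one searcher per tree alias, replaceable by site or package code.
public sealed class TreeSearcherResolver
{
    private readonly Dictionary<string, ITreeSearcher> _searchers =
        new Dictionary<string, ITreeSearcher>(StringComparer.OrdinalIgnoreCase);

    // Registers the searcher for a tree, replacing any previous one.
    public void SetSearcher(string treeAlias, ITreeSearcher searcher)
    {
        _searchers[treeAlias] = searcher;
    }

    public IEnumerable<string> Search(string treeAlias, TreeSearchRequest request)
    {
        // Security check first: no access to the tree means no results.
        if (!request.AllowedTreeAliases.Contains(treeAlias))
        {
            return Enumerable.Empty<string>();
        }

        ITreeSearcher searcher;
        return _searchers.TryGetValue(treeAlias, out searcher)
            ? searcher.Search(request)
            : Enumerable.Empty<string>();
    }
}

// Toy searcher answering from a fixed list; a real one might query a database.
public sealed class FixedListTreeSearcher : ITreeSearcher
{
    private readonly string[] _names;

    public FixedListTreeSearcher(params string[] names)
    {
        _names = names;
    }

    public IEnumerable<string> Search(TreeSearchRequest request)
    {
        return _names.Where(n =>
            n.IndexOf(request.SearchText, StringComparison.OrdinalIgnoreCase) >= 0);
    }
}
```

Calling SetSearcher twice for the same alias models 'replacing' a core implementation: only the last registered searcher answers for that tree.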

=Query abstraction=

Well, this is basically the intention of Examine's fluent syntax. Examine was created (a very long time ago) to be mostly agnostic of the underlying search platform (i.e. Lucene), but over the years it became more tied to it because that's the only thing that was ever used. That said, the fluent syntax is still not directly tied to Lucene, but it doesn't cater for the more complex Lucene queries that we are using in the back office.

Some thoughts on this:

  • Abstracting complex search is hard
  • Examine's fluent API syntax is almost there but not quite
  • Examine v2's fluent API syntax is much closer to what is needed - it could actually work for creating the back office queries that we use (we'd need to check)
  • Since Examine v2 won't be released until v8 of Umbraco due to a Lucene upgrade, it's certainly possible to backport any changes that we might want if this option is pursued
  • If you wanted to create a non-Examine approach to this I think that you'll find that it is quite difficult and you'll end up implementing much of what Examine already has for its fluent API. The other option is to make it far less flexible and very much tied to the queries that Umbraco executes - but this would be encouraging future breaking changes since requirements will always be changing.
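As a toy illustration of what a provider-neutral query could look like, the sketch below describes a query once and renders it either as Lucene query syntax or as a parameterised SQL WHERE clause. This is not Examine's fluent API; every type here (FieldMatch, SearchQuery, IQueryTranslator, LuceneQueryTranslator, SqlQueryTranslator) is hypothetical, it only supports OR-ed field matches, and it omits Lucene escaping, which hints at why abstracting the more complex back office queries is hard.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Provider-neutral description of "field equals term" or "field starts with term".
public sealed class FieldMatch
{
    public FieldMatch(string field, string term, bool prefix)
    {
        Field = field;
        Term = term;
        Prefix = prefix;
    }

    public string Field { get; private set; }
    public string Term { get; private set; }
    public bool Prefix { get; private set; }
}

// In this sketch a query is just a list of alternatives (logical OR).
public sealed class SearchQuery
{
    private readonly List<FieldMatch> _anyOf = new List<FieldMatch>();

    public IList<FieldMatch> AnyOf
    {
        get { return _anyOf; }
    }

    public SearchQuery Or(string field, string term, bool prefix = false)
    {
        _anyOf.Add(new FieldMatch(field, term, prefix));
        return this; // allows fluent chaining
    }
}

public interface IQueryTranslator<TOutput>
{
    TOutput Translate(SearchQuery query);
}

// Renders Lucene query syntax. Escaping of special characters is omitted here.
public sealed class LuceneQueryTranslator : IQueryTranslator<string>
{
    public string Translate(SearchQuery query)
    {
        return string.Join(" OR ", query.AnyOf.Select(
            m => m.Field + ":" + m.Term + (m.Prefix ? "*" : "")));
    }
}

// Renders a parameterised WHERE clause (Key) and its argument values (Value).
// Field names are assumed to come from a trusted whitelist, not from user input.
public sealed class SqlQueryTranslator : IQueryTranslator<KeyValuePair<string, string[]>>
{
    public KeyValuePair<string, string[]> Translate(SearchQuery query)
    {
        var clauses = query.AnyOf.Select(
            (m, i) => m.Field + (m.Prefix ? " LIKE @p" : " = @p") + i);
        var args = query.AnyOf.Select(m => m.Prefix ? m.Term + "%" : m.Term);

        return new KeyValuePair<string, string[]>(
            string.Join(" OR ", clauses), args.ToArray());
    }
}
```

For example, new SearchQuery().Or("nodeName", "home", true).Or("email", "home") describes the same search for both translators; only the rendering differs.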

=Indexing=

This is really the main issue. I am fully aware of the implications of load balancing and auto-scaling. Currently this is achieved with Flexible Load Balancing, which has some caveats:

  • Any new site instance will need to build its indexes from scratch - there is no 'main' index to sync from on startup
  • To achieve auto-scaling you need to have a token in your Examine path. Azure Web Apps likes to move your site between workers whenever it likes, which means the machine name changes and thus your index will rebuild itself

Neither of the above is great, and there are several options to note about this (I'm not going to re-discuss them all at length here): https://our.umbraco.org/forum/developers/extending-umbraco/74731-examine-corruption-issues#comment-244293

... where nobody replied to my comment "I would certainly enjoy some help with any of this since there is quite a lot of options, work, etc..." So thank you!! for chatting about solutions here :)

=Blob Storage=

Based on one of the above options listed on the OUR thread, I've already backported AzureDirectory to work with our version of Lucene 2.9.4.1: https://github.com/Shazwazza/AzureDirectory/tree/Lucene-2.9.4.1

The next phase is updating Examine so that the Lucene Directory instance used for its indexes can easily be replaced. I should be done with that today or in the next couple of days. This would enable you to add this to your Examine indexer:

directoryFactory="Examine.Directory.Azure.AzureBlobDirectory, Examine.Directory.Azure"

and then you will have a 'master' index which syncs to all slaves without any need to rebuild when you are scaling or if Azure decides to move your site to a new worker.

... This essentially can 'solve' the problem with Azure web apps

=Custom Examine implementation=

This is an option I know @darrenjferguson has already tried to explore. Examine is an abstraction so in theory (never tested) you could create an Examine provider for another search library like Elastic Search or Azure Search, etc...

If this were done, it would mean that your indexes are stored in an entirely different server/process and REST is used to fetch/index the data.

... This would also 'solve' the problem with Azure web apps

BUT, I can understand this might not be easy and might not even be perfectly possible with the abstraction the way it is and the support that these other libraries have. I can say that creating a provider for Elastic Search should be achievable because they are built on Lucene and support all Lucene concepts and query syntax. So even though we use raw Lucene query syntax in Umbraco, if we didn't have time to create an abstracted query builder/fluent syntax to support all of the lucene searches Umbraco does, these raw lucene searches would still work with Elastic Search.

Azure Search is slightly different ... I'm not sure if it's built on Lucene (but probably) and they support raw lucene queries too (if you enable it): https://azure.microsoft.com/en-us/blog/lucene-query-language-in-azure-search/

=Replace-able Examine?=

Here's where things become difficult and very time consuming ... this is mostly to do with the Media cache thing mentioned before though. In v8, once that is out of the way it would be much easier to abstract Search in umbraco. We would still need to have a query builder thing - and that 'could' still be Examine (we could split Examine into a few nuget packages like Examine.Query for this) or could be something else. In any case, we'd still ship with Examine by default for the search engine. The catch here is that developers would then be responsible for their own indexing - I can tell you from experience, this isn't an easy thing to get right. We could create an abstracted indexing solution to make this easier to make sure all of the correct umbraco events and data are sent to the indexer so developers don't have to worry about that (since that is the tricky part)

In an ideal world we'd have a couple interfaces like: IUmbracoSearcher and IUmbracoIndexer - by default Examine would sit behind these but you could replace this implementation. It's also worth noting that we want to give more influence on indexing operations to Property Editors, see: http://issues.umbraco.org/issue/U4-8437 . This would also make testing much nicer of course.

If it weren't for the media cache, this would be much easier to do ... and is something I'd like to try to achieve for v8
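A rough sketch of what such a pair of interfaces could look like. The names IUmbracoSearcher and IUmbracoIndexer come from the paragraph above, but their members are assumptions made up for this example, as are InMemorySearchEngine and ContentEventForwarder. The in-memory engine shows why this would make testing nicer, and the forwarder stands in for the abstracted indexing layer that passes the right events to whichever indexer is configured.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Interface names come from the discussion; the members are assumptions.
public interface IUmbracoIndexer
{
    void IndexItem(int id, string name);
    void DeleteItem(int id);
}

public interface IUmbracoSearcher
{
    IEnumerable<int> Search(string searchText);
}

// In-memory implementation of both, the kind of fake that keeps unit tests simple.
public sealed class InMemorySearchEngine : IUmbracoIndexer, IUmbracoSearcher
{
    private readonly Dictionary<int, string> _items = new Dictionary<int, string>();

    public void IndexItem(int id, string name)
    {
        _items[id] = name; // adds or overwrites
    }

    public void DeleteItem(int id)
    {
        _items.Remove(id);
    }

    // Returns the ids of matching items in ascending order.
    public IEnumerable<int> Search(string searchText)
    {
        return _items
            .Where(kv => kv.Value.IndexOf(searchText, StringComparison.OrdinalIgnoreCase) >= 0)
            .Select(kv => kv.Key)
            .OrderBy(id => id)
            .ToList();
    }
}

// Stand-in for the abstracted indexing layer: it turns content events into indexer
// calls, so a replacement engine never has to subscribe to events itself.
public sealed class ContentEventForwarder
{
    private readonly IUmbracoIndexer _indexer;

    public ContentEventForwarder(IUmbracoIndexer indexer)
    {
        _indexer = indexer;
    }

    public void OnPublished(int id, string name)
    {
        _indexer.IndexItem(id, name);
    }

    public void OnUnpublished(int id)
    {
        _indexer.DeleteItem(id);
    }
}
```

By default an Examine-backed class would implement both interfaces; a site could swap in another implementation without touching the code that raises the events.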

=Conclusion=

So those are the options as I see them currently. IMO the easiest way to solve this primary issue:

the overhead of building and refreshing Lucene indexes to disc causes slow boot times (especially on Azure sites with slow write speed).

is to complete the AzureDirectory + custom directory changes in Examine mentioned above - which are about 80% complete now and could be ready for testing this week. The other options like ISearchableTree are on the roadmap and are required anyway, but for different reasons.

Would love to hear your feedback, ideas, suggestions, etc... on all of the above.

Thanks for reading!


Darren Ferguson 14 May 2016, 08:18:56

Hey @Shandem

Thanks for wading in here.

Although I agree that this is a big task overall - journey of a thousand miles begins with a single step! At Moriyama we focus on delivering the smallest piece of value as quickly as possible, which is the inspiration for my PR!

The media being cached in lucene is a big deal, but in the first instance we'd have to just let Umbraco fall back to the database. I know it'd be slow - but the out of the box configuration would still be using examine and you'd have to know the implications if you switch it out. I guess one could look at Macro and Donut caching where media is in heavy use, but long term - a cached replacement is obviously ideal.

In my PR I just propose to make any non core Examine searcher use an overload that takes the searchText rather than an ISearchCriteria, then the implementer can decide how to construct their query free of constraint.

By making the simple modification to the back office search we can allow people to possibly disable Umbraco for front end search - or use something totally different.

In my mind, the change will allow people to be creative and experiment - whereas right now they are pretty locked in.

As you rightly point out, Examine is getting a bit old - and abstraction of fluent queries is hard, so it'd be nice to open it up to a collection of creative minds to play with how it could work in the future.

A couple of other points on what you've mentioned:

  • Azure search is capable of indexing a database, so if you can write an Indexer query it will index the Umbraco content independently of the web app - I have a prototype of this - and it means no indexing at startup or writing to disc.

  • Azure Directory for Lucene blob storage is good - but still slow, it doesn't solve the issue of a first instance boot!

  • In my experience of writing Examine providers - it is possible, but difficult and not for the novice developer! Indexers are pretty easy - but search providers are complex. There are quite a few Search overloads that aren't in use by the back office that have to be implemented - I tend to just inherit from the UmbracoExamineSearcher (which is wrong) and override what I need.

  • Azure search - it is actually built on elastic search, but doesn't expose all of the API. Although it says that raw Lucene query is supported, I've yet to get that to work.

Anyway - to summarise, I'm going to submit a PR so that one can choose to provide an alternative to Lucene query syntax for the backoffice - and I hope it can be accepted for the next release.

I'm happy to help out and offer any support on this moving forward, boot time in Azure is often an issue for us and I'd really like to see a configuration where one can set it for "optimal boot time".


Darren Ferguson 14 May 2016, 09:05:40

@Shandem to add some meat to this, the PR is here: https://github.com/umbraco/Umbraco-CMS/pull/1268


Shannon Deminick 17 May 2016, 12:03:12

Hi @darrenjferguson

Thanks for the info and PR! It sounds to me like implementing the ISearchableTree would solve your immediate issue since that would allow passing in the search string from the user to an implementation of your choosing. I know that your PR essentially performs this same operation but there's a couple of issues that make it something we probably cannot accept:

  • It is a temporary 'fix' for this particular thing. If we put this in the core, it's instant technical debt that we'll have to deprecate almost as soon as it's part of the core, and then we'll need to mark its removal as a breaking change for the release in which we implement ISearchableTree
  • In some cases this change will already be a breaking change for some users because you are going direct to any searcher that is not specifically UmbracoExamineSearcher. Some developers may currently already sub-class this and with this change the back office search will not work as it currently does (or at all)

I think the first step here is to implement ISearchableTree the way we need it. This should be relatively simple with some guidance. I will update this task http://issues.umbraco.org/issue/U4-2857 with all required details as soon as I have time. With this implemented, you'll be able to replace any part of the tree search in the back office for any item type without having any ties to Examine.

Some additional notes based on your points above:

Azure Directory for Lucene blob storage is good - but still slow, it doesn't solve the issue of a first instance boot!

The way AzureDirectory works is by storing the master index in blob storage and the site instance's index in local persisted storage. So yes, when writing to the index it will be a bit slower since it needs to write to two locations, but when reading from the index it will be just as fast as a normal Lucene index that is operating locally. The way that the master -> slave file sync works is iterative, meaning that when a new server node comes online it does not need to read every Lucene file from blob storage to persist locally; it lazily fetches the requested files it doesn't yet have persisted locally when they are asked for. So it does solve the current issue of a first instance boot since there is no index rebuilding necessary. I realize there will be a small lag to fetch the index files but it certainly beats an index rebuild.
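The lazy fetch described above can be sketched roughly as follows. This is not AzureDirectory's actual code: IMasterStore, LazySyncCache and CountingMasterStore are hypothetical, a dictionary stands in for local persisted storage, and the sketch ignores writes and the detection of files that have changed in the master.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical abstraction over the blob storage that holds the master index files.
public interface IMasterStore
{
    byte[] Download(string fileName);
}

// Lazily mirrors master index files into a local cache: a file is downloaded the
// first time it is requested and served locally afterwards. Nothing is rebuilt.
public sealed class LazySyncCache
{
    private readonly IMasterStore _master;

    // Stand-in for local persisted storage on the site instance.
    private readonly Dictionary<string, byte[]> _local = new Dictionary<string, byte[]>();

    public LazySyncCache(IMasterStore master)
    {
        _master = master;
    }

    public byte[] Open(string fileName)
    {
        byte[] data;
        if (!_local.TryGetValue(fileName, out data))
        {
            // Not persisted locally yet: fetch just this one file from the master.
            data = _master.Download(fileName);
            _local[fileName] = data;
        }
        return data;
    }
}

// Toy master store that counts downloads so the laziness can be observed.
public sealed class CountingMasterStore : IMasterStore
{
    public int Downloads { get; private set; }

    public byte[] Download(string fileName)
    {
        Downloads++;
        return new byte[] { 1, 2, 3 };
    }
}
```

A freshly started node begins with an empty local cache and pays the download cost only for the files a search actually touches; repeated reads of the same file never reach the master again.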

I will have this ready this week if you would like to test.

Regarding custom Examine providers for Azure Search and ElasticSearch - I'd still like to attempt this when I find time. According to the docs they both support raw Lucene searches, so the fluent search code to support it shouldn't require too many changes ... in theory of course.


Darren Ferguson 17 May 2016, 12:44:47

@Shandem I'm up for implementing ISearchableTree if you can provide directions... I'm really keen to do so.

Re the AzureDirectory stuff - my (brutally) honest feedback is that it isn't of that much interest to me. Sorry if it seems abrupt - but my interest is in taking the responsibility of indexing and maintaining an index outside of the app pool altogether for super quick startup! I know what I am doing is an edge case, and the defaults are fine most of the time.

At the weekend I hacked this together - https://github.com/darrenferguson/uFlat it generates a flat table view of Umbraco content. This allows you to write an Azure Search indexer that doesn't require any code in the Umbraco app pool - e.g. http://devslice.net/2015/03/azure-search-indexers-index-data-without-writing-code/ because you'd be able to query all Umbraco content with a simple SQL query.

In this scenario you could have a dummy Examine indexer, and just a searcher.

I'll be interested to hear about your attempts to get Lucene and Azure search working :) I've found it frustrating to date - I do believe though that long term, the ability to abstract fluent search outside of Lucene is important - but it will be complex!

Please point me in the direction of getting started on ISearchableTree!


Shannon Deminick 17 May 2016, 13:08:43

Hey @darrenjferguson yup I totally understand your requirements - these are all things I've been thinking about for a very long time but of course time is the main issue there :)

I'm hoping that for many people the AzureDirectory approach will solve the initial headaches they are having with auto-scale. Long term it would certainly be much nicer to have an index stored in a centralized app like you mention.

I'll try to write up ISearchableTree today/tomorrow and will ping you


Darren Ferguson 17 May 2016, 13:51:05

@Shandem thanks - look forward to it.


Priority: Normal

Type: Feature (planned)

State: Duplicate

Assignee:

Difficulty: Easy

Category: Extensibility

Backwards Compatible: True
