U4-8361 - 301 Url Tracking

Created by Shannon Deminick 21 Apr 2016, 13:15:42 Updated by Shannon Deminick 04 Aug 2016, 12:52:06

Tags: Up For Grabs Needs Docs Unscheduled

Relates to: U4-8802

Depends on: U4-8808

NOTE: This spec is a WIP

URL Change Tracking

The core of Umbraco needs to support URL change tracking in order to support 301 redirects. URLs change when content items are renamed or moved, when an IUrlProvider is changed, or when the umbracoUrlName property is changed (there could be other scenarios). Some scenarios cannot be tracked automatically, such as changes made to an IUrlProvider, since that is outside the control of Umbraco and 100% in the control of a developer.

Implementation

The simplest way to achieve this is to include a custom IContentFinder. This content finder can be appended as the last content finder in the queue and can make a query to the database to see if the currently requested URL matches a tracked changed URL and then redirect to the correct URL.
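The lookup described above can be sketched without any Umbraco dependencies. This is an illustration only: `RedirectStore`, `Track` and `TryGetRedirect` are hypothetical names, an in-memory dictionary stands in for the database query, and a real content finder would turn a match into a 301 response instead of printing it.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory stand-in for the table of tracked URL changes.
public class RedirectStore
{
    // Old URL -> current URL of the content that used to live there.
    private readonly Dictionary<string, string> _redirects =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

    // Record that oldUrl now lives at newUrl (called when a URL change is detected).
    public void Track(string oldUrl, string newUrl)
    {
        _redirects[Normalize(oldUrl)] = newUrl;
    }

    // Last-resort lookup: only consulted when no other finder matched the request.
    public bool TryGetRedirect(string requestedUrl, out string targetUrl)
    {
        return _redirects.TryGetValue(Normalize(requestedUrl), out targetUrl);
    }

    // Treat "/about-us" and "/about-us/" as the same URL.
    private static string Normalize(string url)
    {
        var trimmed = url.TrimEnd('/');
        return trimmed.Length == 0 ? "/" : trimmed;
    }
}

public static class Program
{
    public static void Main()
    {
        var store = new RedirectStore();
        store.Track("/about-us/", "/company/about/");

        string target;
        if (store.TryGetRedirect("/About-Us", out target))
        {
            Console.WriteLine("301 -> " + target);
        }
    }
}
```

Running the finder last keeps the extra lookup off the hot path: it is only paid for requests that would otherwise end in a 404.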

The change tracking can occur with events: publishing content, moving content

To track the URL changes we have some options:

  • Custom database table
  • Potentially use Relations API

API

Parts of this change tracking should include a public API, potentially with some events so that the process of URL tracking can be enhanced by developers or used by developers for their own applications.

TODO

  • What other requirements do we need to list?
      • Keep in mind that we are not solving every developer's problems here; we are creating the basic functionality that should work for the majority of cases. Custom/specific requirements can be handled with extension points (MVP = minimum viable product).
  • What type of storage should we use? Custom db table, relations, others?
  • What extensibility points do we need?
  • Do we need to support more than 301 redirects? other status codes or rewrites? - if so how?

Discussion

We can discuss this below and update this spec as we see fit.

Let's get started!!

So how do we start this whole thing?

  • Once spec is complete and/or if someone wants to just jump right in then we'll create a custom branch off of dev-v7 for people to submit PRs to
  • If required I can get some base code written for this so that it's easier to get started from a community standpoint.

Comments

Tim Payne 21 Apr 2016, 13:20:39

It would also be good to support deleted files as well, as these are typically 301 redirected for SEO.


Chriztian Steinmeier 21 Apr 2016, 18:58:38

@AttackMonkey Curious - Where exactly do the "SEO guidelines" tell us to 301 redirect a deleted file to? Isn't there a "30x Gone" (or similar) code to use for that?


Tim Payne 21 Apr 2016, 22:08:16

Normally you'd 301 the URL to somewhere that makes sense, so for example, if you had a staff profile under a staff list page, you'd 301 the deleted profile page back to the list page. You can also return a 410 Gone status, which tells the search engine to remove the entry completely. That said, most of the time you're OK to just return a 404 or 301, depending on your needs, as the search engines are smart enough to figure out what to do.


James South 26 Apr 2016, 02:15:42

I think we need to ask "What functionality is already available to Umbraco developers" to determine the basis for an MVP.

We have an established product in 301 Url Tracker that as far as I'm aware offers the following functionality:

  • Node tracking
  • 301 and 302 status codes
  • Regex support.

Whatever we build should at least offer that.

I think any other status codes, 410 Gone etc should be handled by additional {{IContentFinder}} implementations or by some sort of event handling mechanism.

Storing the data is an interesting question.

A separate table is probably the easiest in terms of backwards compatibility but how do we display that data to the content editor in a simple to manage manner? Also how does caching of the data work to ensure we get a snappy response.

Personally I want to be able to go to the specific node I am redirecting to and add/edit urls there. A large list can grow cumbersome to use and lose data within.

With my own experiments this is the approach I chose. To do that I applied a custom property editor.

https://github.com/JimBobSquarePants/Bloodhound

This approach has its benefits: it's super easy to implement and is very flexible. There are definite caveats though, most notably:

'''You can't enforce the alias of the property you are assigning the editor to.'''

This makes looking up matching nodes an absolute chore. You have to loop through the properties and check if they are implementing the property editor before you can even parse the value to determine a match. That has to be done for each node on each level which can potentially be very slow on a large website even with result caching.

Having the changes made in the core though and enforcing that alias would make it much easier.

This might be waaaay off bat but here's how I would go about it.

We create a new interface {{ITrackablePublishedContent}} that inherits {{IPublishedContent}} containing a single property. I would do this purely to make it easier for you to add methods, tests, extensibility etc. You could add overloads to {{UmbracoHelper}} if need be.

using System.Collections.Generic;
using Umbraco.Core.Models;

public interface ITrackablePublishedContent : IPublishedContent
{
    /// <summary>
    /// Gets or sets the collection of rewrite urls
    /// </summary>
    IEnumerable<UrlRewrite> UrlRewrites { get; set; }
}

This property would map to an enumerable collection of the following class. This class ''should'' be enough for your data needs.

using System;
using Newtonsoft.Json;

/// <summary>
/// Represents a url rewrite bound to a content node.
/// </summary>
public class UrlRewrite : IEquatable<UrlRewrite>
{
    /// <summary>
    /// Gets or sets the rewrite url
    /// </summary>
    [JsonProperty("rewriteUrl")]
    public string RewriteUrl { get; set; }

    /// <summary>
    /// Gets or sets a value indicating whether this rewrite is a regex
    /// </summary>
    [JsonProperty("isRegex")]
    public bool IsRegex { get; set; }

    /// <summary>
    /// Gets or sets the status code
    /// </summary>
    [JsonProperty("statusCode")]
    public int StatusCode { get; set; } = 301;

    /// <summary>
    /// Gets or sets the creation date in utc format
    /// </summary>
    [JsonProperty("createdDateUtc")]
    public DateTime CreatedDateUtc { get; set; } = DateTime.UtcNow;

    /// <inheritdoc/>
    public bool Equals(UrlRewrite other)
    {
        // Guard against null on both the argument and the url before comparing.
        return other != null
            && string.Equals(this.RewriteUrl, other.RewriteUrl)
            && this.IsRegex.Equals(other.IsRegex)
            && this.StatusCode.Equals(other.StatusCode)
            && this.CreatedDateUtc.Equals(other.CreatedDateUtc);
    }
}

That ''should'' allow you to use the node as normal with {{UmbracoHelper}} etc so nothing gets broken.

When creating a document type there would be a checkbox option to make it trackable. That would add the additional property with the specific {{UrlRewrites}} alias (solving the naming issue), and the rest would be wired up automatically.

Now that you have a strict property alias, to match the url, you could use a specific SQL query, Examine, or the published content cache to return the {{IPublishedContent}}/status code combination you need in your {{IContentFinder}} implementation in a much more performant way than I could.

Is this somewhere in the right direction?

Questions...

  • Which events do we need to track? Publish and UnPublish are obvious, anything else?
  • What other extensibility points would you require in a public API


Tim Payne 26 Apr 2016, 08:24:40

You're also going to want to track move, as you'll probably want to redirect the old URL to the new one, and possibly deletes as well, so you have the option of redirecting the deleted page's URL if required. The only issue with redirecting deleted stuff is that you need to make sure that if new content is added on a previously deleted URL, that takes precedence over the redirect.
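The precedence rule for deleted content can be sketched as follows. This is only an illustration under stated assumptions: `RequestResolver` and its inputs are hypothetical, and plain collections stand in for the published content cache and the redirect storage.

```csharp
using System;
using System.Collections.Generic;

public static class RequestResolver
{
    // Returns a description of how a request would be handled.
    // liveUrls: URLs currently served by published content (hypothetical input).
    // redirects: tracked old URL -> new URL pairs (hypothetical input).
    public static string Resolve(
        string url,
        ISet<string> liveUrls,
        IDictionary<string, string> redirects)
    {
        // Published content always wins, even if an old redirect exists for the same URL.
        if (liveUrls.Contains(url))
        {
            return "200 " + url;
        }

        string target;
        if (redirects.TryGetValue(url, out target))
        {
            return "301 " + target;
        }

        return "404";
    }
}

public static class Program
{
    public static void Main()
    {
        var redirects = new Dictionary<string, string> { { "/staff/jane/", "/staff/" } };

        // While the profile is deleted, the old URL redirects to the list page.
        var live = new HashSet<string> { "/staff/" };
        Console.WriteLine(RequestResolver.Resolve("/staff/jane/", live, redirects));

        // Once new content is published on the old URL, it takes precedence.
        live.Add("/staff/jane/");
        Console.WriteLine(RequestResolver.Resolve("/staff/jane/", live, redirects));
    }
}
```

Checking live content first is also what falls out naturally if the redirect lookup runs as the last content finder in the queue.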


James South 26 Apr 2016, 12:04:10

@AttackMonkey I think move triggers publish. Delete would be a tricky one which I'm not convinced we should attempt to automate.


Stephan 13 Jun 2016, 18:13:29

PR https://github.com/umbraco/Umbraco-CMS/pull/1325

This is Core Retreat Material, thanks to @peteduncanson @JimbobSquarePants @marcemarc. We put together a "minimum viable product" that creates redirects in a new database table and uses a content finder to handle these rules and issue corresponding redirects. That's all and that's already a lot.

What's missing is a dashboard to... what exactly? What would you want a dashboard to do?

Go test and enjoy!


Asbjørn Riis-Knudsen 15 Jun 2016, 09:36:31

@zpqrtbnk The main use of a dashboard in my view is when you're migrating from an old site and remapping urls. You can do the most common ones ahead of time (enter old URL, pick new Umbraco page in content picker) and/or have the dashboard list 404s (logged somewhere) with an option to create a 301 redirect. Essentially what the 301 tracker package does now.


Simon Dingley 15 Jun 2016, 09:43:29

Logging 404's in the way that the current 301 URL Tracker package does at present can cause a massive performance hit on your site. I have a site that currently has a table of over 4 million records and growing because it creates a new record for every 404 occurrence. When an editor currently loads the dashboard for the tracker it severely impacts the performance of Umbraco. Something to keep in mind!


Asbjørn Riis-Knudsen 15 Jun 2016, 10:06:53

@ProNotion Yes, definitely, just blindly logging 404s is not a great idea. You should definitely be able to turn it on or off easily, so that perhaps you only run it for the first few weeks of a new site to get the major 404s. Or perhaps you could add some intelligence so that only 404s with a certain number of hits within, say, a week are kept in the log, and others are purged. That does sound a bit complicated though. An on/off switch would go a long way.


Simon Dingley 15 Jun 2016, 10:12:04

I've not actually had a chance to investigate it, but it would have been better to increment a counter on each occurrence rather than create a new record, though that may also come with its own problems. I am sure there was some reasoning behind it. In any case it would be nice to at least consolidate the records as part of some sort of maintenance task.
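A counter-based log with a purge step could look roughly like this. It is a sketch, not how the 301 Url Tracker package works: `NotFoundLog` and its members are hypothetical, and a dictionary stands in for a table with one row per URL.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical 404 log that keeps one counter per URL instead of one row per hit.
public class NotFoundLog
{
    private readonly Dictionary<string, int> _hits = new Dictionary<string, int>();

    // Increment the counter for this URL, creating it on first sight.
    public void Record(string url)
    {
        int count;
        _hits.TryGetValue(url, out count); // count stays 0 for unseen URLs
        _hits[url] = count + 1;
    }

    public int HitsFor(string url)
    {
        int count;
        return _hits.TryGetValue(url, out count) ? count : 0;
    }

    // Maintenance task: drop URLs seen fewer than minHits times.
    // Returns the number of entries removed.
    public int Purge(int minHits)
    {
        var stale = _hits.Where(p => p.Value < minHits).Select(p => p.Key).ToList();
        foreach (var url in stale)
        {
            _hits.Remove(url);
        }
        return stale.Count;
    }
}

public static class Program
{
    public static void Main()
    {
        var log = new NotFoundLog();
        log.Record("/old-page");
        log.Record("/old-page");
        log.Record("/typo");

        Console.WriteLine(log.HitsFor("/old-page")); // 2
        Console.WriteLine(log.Purge(2));             // 1 ("/typo" is removed)
    }
}
```

With this shape the storage grows with the number of distinct URLs rather than the number of requests, which is the difference that matters for a site with millions of 404 hits.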

I'm really looking forward to trying out the core implementation of this and will most likely look at a way of migrating the existing data in this large site that has problems with the 301 Url Tracker package at present, it might make a useful package:)


Søren Kottal 01 Jul 2016, 19:38:49

Is this going to be compatible with UaaS? I see the database stores ids for content.


Shannon Deminick 04 Jul 2016, 11:14:24

@zpqrtbnk Can you review the above question - is storing INT IDs the best approach for this table?

We need to think about how Courier handles this. If I rename something on my Dev environment and then deploy it to live, we'd have to expect that Courier will deal with this too so that the 301 redirect also gets transferred. I'm re-opening this issue until we decide on how to deal with this. Seems to me like this table should store GUID ids though.


Stephan 04 Jul 2016, 12:04:44

@ProNotion could you elaborate on what you mean by having a counter on each occurrence? Currently we are creating a record for each (url, content) pair. We ''could'' update the record corresponding to a url with a new content id, but we don't do it now in order to preserve history. Thoughts?
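To make the "one record per (url, content) pair" idea concrete, here is a small sketch. `RedirectRecord`, `RedirectHistory` and the "most recent record wins" lookup are assumptions for illustration; the thread does not spell out how the core resolves several records for the same URL.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical row: one entry per (url, content) pair, never overwritten.
public class RedirectRecord
{
    public string Url { get; set; }
    public Guid ContentKey { get; set; }
    public DateTime CreatedUtc { get; set; }
}

public class RedirectHistory
{
    private readonly List<RedirectRecord> _records = new List<RedirectRecord>();

    // Always append, so earlier owners of the URL stay in the history.
    public void Add(string url, Guid contentKey, DateTime createdUtc)
    {
        _records.Add(new RedirectRecord
        {
            Url = url,
            ContentKey = contentKey,
            CreatedUtc = createdUtc
        });
    }

    // Assumption: when several records share a URL, the newest one is used.
    public Guid? FindMostRecent(string url)
    {
        var match = _records
            .Where(r => r.Url == url)
            .OrderByDescending(r => r.CreatedUtc)
            .FirstOrDefault();
        return match == null ? (Guid?)null : match.ContentKey;
    }
}

public static class Program
{
    public static void Main()
    {
        var history = new RedirectHistory();
        var first = Guid.NewGuid();
        var second = Guid.NewGuid();

        history.Add("/about-us", first, new DateTime(2016, 1, 1));
        history.Add("/about-us", second, new DateTime(2016, 6, 1));

        Console.WriteLine(history.FindMostRecent("/about-us") == second); // True
    }
}
```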


Stephan 04 Jul 2016, 12:07:41

@Shandem storing IDs was... faster. But yes it means that Courier would need to map ids to guids to ids. Probably as easy to store guids directly. Looking into it.

Also... do we need a dashboard for a 7.5 release?


Shannon Deminick 04 Jul 2016, 12:09:31

Dashboard could wait till 7.5 final, not for beta2


Søren Kottal 04 Jul 2016, 12:14:01

Thanks for updating. My thought was to create some custom functionality for handling redirects, like x.com/my-short-url redirecting to some node. Is that in the plans?


Shannon Deminick 04 Jul 2016, 12:18:56

Custom URL mapping is not going to be part of the core, this can be addressed with IIS rewrite rules.


Simon Dingley 04 Jul 2016, 12:30:05

@zpqrtbnk By counter I mean a column that contains the number of occurrences of a 404 at a given url/content pair.


Stephan 04 Jul 2016, 13:49:01

@ProNotion thanks - so that's not directly relevant here as the built-in 301 tracker is not meant to track 404s.


Emil Rasmussen 04 Jul 2016, 14:14:54

@Shandem Custom URL mapping can be addressed with IIS rewrite rules - but not from a content editor perspective: 1) content editors don't have access to web.config and 2) it will reset the app when rules are reloaded.

I guess it's fair to not include 404 tracking and custom url mappings as part of the core. You can make the argument that they are distinct features, different from handling 301 redirects due to renaming of nodes.

But on a technical level they are related (tracking and mapping of URLs). I guess 404 tracking and custom url mapping should be done via the API then?


Simon Dingley 04 Jul 2016, 14:48:09

I understand the idea behind not adding custom url mapping to the core but it unfortunately means still having to have 2 solutions in place for handling 301 redirects. One for redirecting nodes old to new when they are moved or removed and another for handling any other redirects of which there could be many especially when migrating an existing site to Umbraco. Handling large volumes of redirects in the web.config or urlrewriting.config file is not really manageable in my opinion. If it's not going to be in the core hopefully we can extend it ourselves through the API?


Stephan 04 Jul 2016, 14:55:06

The API could allow you to insert static rewriting rules, eg "/path/to/whatever" => content 1234. But nothing fancy eg expressions, matches, etc. Would that be ok?


Simon Dingley 04 Jul 2016, 15:01:50

Expressions would also be good :) ...but I guess that's just from my own experience of having to manage large volumes of redirects via 301 Url Tracker.


Sebastiaan Janssen 04 Jul 2016, 15:33:43

@ProNotion:

"Handling large volumes of redirects in the web.config or urlrewriting.config file is not really manageable in my opinion"

Please don't use urlrewriting.config, it performs very poorly.

Also, handling large volumes of redirects through a database table with expressions performs more poorly than the native IIS rewriter, so I'd recommend you use that after all. Your specific case is migrating existing sites to Umbraco, that's a one time operation and you could script out the redirect rules for that (or hopefully use a pattern).

@emilr I think it would be awesome to add a 404 handler that can be configured by editors at a later time. This is specifically meant to track changed URLs for now though. The added complexity of having to track .html, .asp, .php, etc extensions would have delayed this feature too much. For now, use third party tools or APIs. A word of warning: when using third party packages, Umbraco's native tracking will be disabled. Again, this is due to the fact that it would've delayed this whole feature if we had to figure out how to make the native tracking fire first somehow (we're not convinced yet that it can be done in a reliable way) and to prevent this feature from firing off infinite loops, for example.


Stephan 04 Jul 2016, 16:23:50

New PR migrating from IDs to GUIDs in the table: https://github.com/umbraco/Umbraco-CMS/pull/1370

(and fixing an issue with the recent Xml cache optimizations)


Stephan 04 Jul 2016, 16:25:11

Thoughts about rules, expressions, etc: anything we can do will reproduce the IIS native rewriter and probably never be as robust/efficient. The only bad thing about the IIS rewriter is that it has no clean and nice API to configure the rules. I'd rather spend time creating an interface on top of that rewriter than trying to clone it.


Shannon Deminick 04 Jul 2016, 16:33:56

The Native IIS rewrites can handle huge volumes of rules efficiently. You can use IIS rewrite maps with external files to perform static rewrite operations, there can be tons of these. I'm sure there's examples of IIS rewrite providers that use a database to perform these operations too and it's fairly simple to write an IIS rewrite provider. If it comes down to it, this would mean creating a custom provider that can have an editable list within Umbraco, but this would still be a separate management tool than the 301 redirect tracking that the Core is doing.


Emil Rasmussen 04 Jul 2016, 18:00:35

@zpqrtbnk Inserting static rewrite rules will be sufficient. Totally agree with the priority to get this feature shipped :-)


Emil Rasmussen 04 Jul 2016, 18:16:04

@Shandem I did not know about DB providers for IIS rewrite. Awesome thing, thanks :)

Also agree that using the IIS rewrites is very efficient, and everything that we can leverage from that is good.

I have found the examples (https://www.microsoft.com/en-us/download/confirmation.aspx?id=43353). I haven't studied them carefully enough to figure out, how to map the old url to an Umbraco node id/guid yet, though.

And just to be clear, I support getting this feature shipped and not chase every possible use case. I'm just trying to get my head around what this core feature will actually solve in our scenarios. And hopefully provide you with some valuable feedback from our world. Opening a new issue might be the way forward for this discussion?


Shannon Deminick 05 Jul 2016, 07:37:41

Yeah sure, if a managed rewrite rule list is what people want then please create a new feature request for it, to do that would be leveraging IIS rewrites in one way or another.


Søren Kottal 07 Jul 2016, 11:58:04

I don't think id's will be a problem for UaaS. I guess that the rows in this table are not going to be deployed between environments, as they are created whenever content changes names. And they do that on deploy, so the rows will be created there anyway.


Sebastiaan Janssen 07 Jul 2016, 12:37:03

They will still need to be deployed with the content, Courier doesn't trigger the .Published event. We'll update Courier before 7.5 final is out!


Emil Rasmussen 07 Jul 2016, 16:13:41

Issue for custom url mapping created U4-8711 - feel free to comment and improve.


Allan Kirk 02 Aug 2016, 12:00:25

Could we please not make it 301 by default? I strongly feel that you should never use 301 redirects at all: http://getluky.net/2010/12/14/301-redirects-cannot-be-undon/

Basically a 301 can never be removed from a browser's cache. If I once create a page called "about us", and then rename it, a 301 record is recorded and served to the browser. If I later want to create an "about us" page, that browser can never access that page, and there is no way to tell the browser to clear that cache.

When is the last time you were 100% sure you would never again have a page called "about us"?


Tim Payne 02 Aug 2016, 12:13:35

From an SEO point of view, a 301 is preferable to any other redirect type, as it actually passes on the link juice from the old URL to the new URL. Any other redirect means that you lose any score attributed to the old URL. Redirects can be set to only cache for a short while using cache control headers; as long as you use those, you won't run into the issue you describe.


Sebastiaan Janssen 02 Aug 2016, 12:19:02

I'm looking into adding a no-cache header or a short cache (1 hour?).

It seems like no-cache should actually be fine, this also ensures that an accidental rename (and a revert of the rename) doesn't confuse editors for a whole hour.
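As a rough sketch of the idea, a 301 response that browsers should not cache could carry headers like the ones below. The header values are an assumption based on commonly used no-cache directives, not a copy of what the core ends up sending, and `RedirectHeaders.Build` is a hypothetical helper.

```csharp
using System;
using System.Collections.Generic;

public static class RedirectHeaders
{
    // Builds the headers for a permanent redirect that should be re-checked
    // on every request instead of being served from the browser cache.
    public static IDictionary<string, string> Build(string targetUrl)
    {
        return new Dictionary<string, string>
        {
            { "Location", targetUrl },
            // Assumed directive set; adjust to whatever the site actually needs.
            { "Cache-Control", "no-store, no-cache, must-revalidate, max-age=0" },
            { "Pragma", "no-cache" },
            { "Expires", "0" }
        };
    }
}

public static class Program
{
    public static void Main()
    {
        Console.WriteLine("HTTP/1.1 301 Moved Permanently");
        foreach (var header in RedirectHeaders.Build("/company/about/"))
        {
            Console.WriteLine(header.Key + ": " + header.Value);
        }
    }
}
```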


Sebastiaan Janssen 02 Aug 2016, 14:54:02

Was reminded today that we might need an option to disable this feature. The following code (drop it in App_Code or in a class library) will stop the redirects from happening. URL changes will still be recorded, so you can remove the code later and start using the feature:

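using Umbraco.Core;
using Umbraco.Web.Routing;

namespace MyNameSpace
{
    public class StartupHandler : ApplicationEventHandler
    {
        protected override void ApplicationStarting(UmbracoApplicationBase umbracoApplication, ApplicationContext applicationContext)
        {
            // Remove the redirect content finder so tracked redirects are no longer served.
            // The type argument is assumed to be the content finder added by this feature.
            ContentFinderResolver.Current.RemoveType<ContentFinderByRedirectUrl>();
        }
    }
}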


Sebastiaan Janssen 02 Aug 2016, 15:03:28

Created a new PR to add the no-cache directives required as listed in this SO post - http://stackoverflow.com/a/22468386/5018

PR: https://github.com/umbraco/Umbraco-CMS/pull/1408


Matthew 02 Aug 2016, 15:52:31

I'm glad this is becoming a part of the core. Are there going to be instructions on how to migrate from Kipusoep's awesome 301 URL Tracker package? https://our.umbraco.org/projects/developer-tools/301-url-tracker/


Sebastiaan Janssen 03 Aug 2016, 14:43:39

@Matthew - we will not have all the functionality that package provides, as we only track internal link changes, for which you'd need to find the uniqueId. So to migrate you could: Look up the RedirectNodeId from the icUrlTracker table and insert it into the new umbracoRedirectUrl table. Something like this should make the first selection:

SELECT '/' + OldUrl, uniqueID
FROM icUrlTracker
LEFT JOIN umbracoNode ON icUrlTracker.RedirectNodeId = umbracoNode.Id
WHERE uniqueID IS NOT NULL

Then, to be able to insert it, you need to hash the URL you get as well, the hashing is done like so in our code:

private static string HashUrl(string url)
{
    var crypto = new MD5CryptoServiceProvider();
    var inputBytes = Encoding.UTF8.GetBytes(url);
    var hashedBytes = crypto.ComputeHash(inputBytes);
    return Encoding.UTF8.GetString(hashedBytes);
}

Note: we detect the presence of existing URL tracking packages in your site, so if InfoCaster.Umbraco.UrlTracker.dll is found in your bin folder, the built in URL tracker will not work because we don't want to conflict with it.


Matthew 03 Aug 2016, 15:05:15

Fantastic! Thanks.


Priority: Up for grabs

Type: Feature (request)

State: Fixed

Assignee:

Difficulty: Normal

Category:

Backwards Compatible: True

Fix Submitted:

Affected versions:

Due in version: 7.5.0

Sprint: Sprint 39

Story Points:

Cycle: