U4-750 - Add single quotes to the list of url characters to remove

Created by Pete Duncanson 05 Sep 2012, 13:15:36 Updated by Sebastiaan Janssen 14 Oct 2016, 09:46:37

Is duplicated by: U4-292

Is duplicated by: U4-860

Is duplicated by: U4-1329

Is duplicated by: U4-6390

Is duplicated by: U4-6487

Is duplicated by: U4-8342

Is duplicated by: U4-9069

Relates to: U4-751

Relates to: U4-3157

Relates to: U4-6717

Relates to: U4-1952

The number of times a client reports a broken link that turns out to be caused by a document with a single quote in its name, e.g. "Pete's Pizzas", is getting silly. Can we just add this to the list in web.config with all new releases?

Comments

Drew 05 Sep 2012, 13:56:40

A general update to the list of URL-replacing chars in umbracoSettings.config would be good - it's always easier to remove items than it is to add them!

The standard list I've used, in addition to the standard set, is: - -


- s - -

Anyone else got any common chars that cause problems?


Matthew Bliss 05 Sep 2012, 14:12:05

While the URL replacing rules in the config are very flexible, they require updating each time we find that a user has put something into the name that we as developers were not expecting. It is in effect a blacklist approach for those cases where the replacement is an empty string.

Could there be an additional entry in the config which provides the opposite approach - a whitelist - and allows us to specify a RegEx of all valid characters for use in the address and a replacement string for anything that falls outside of this set?

Once the new 'whitelist' replacement has been applied to the address the existing 'blacklist' replacements could then be applied too.
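
A minimal sketch of the two-pass idea above: a whitelist pass strips everything outside an allowed set, then the existing org/replacement rules run over what is left. The class and method names, the pattern, and the rule list are all hypothetical and only illustrate the proposal; nothing here is Umbraco's actual API.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WhitelistUrlSketch
{
    // invalidPattern: a regex matching every character that is NOT allowed.
    // replacement:    what each disallowed character is replaced with.
    // blacklist:      ordered (org, replacement) pairs, applied after the whitelist pass.
    public static string Format(
        string name,
        string invalidPattern,
        string replacement,
        IEnumerable<KeyValuePair<string, string>> blacklist)
    {
        // Pass 1: whitelist. Anything outside the allowed set is replaced.
        string result = Regex.Replace(name, invalidPattern, replacement);

        // Pass 2: blacklist. Plain string replacements, in list order.
        foreach (var rule in blacklist)
        {
            // string.Replace throws on an empty search string, so skip such rules.
            if (!string.IsNullOrEmpty(rule.Key))
                result = result.Replace(rule.Key, rule.Value);
        }

        return result;
    }
}
```

With the pattern `[^A-Za-z0-9 &-]`, an empty replacement, and the rules " " to "-" followed by "&" to "and", the name "Pete's Pizzas & Pasta" comes out as "Petes-Pizzas-and-Pasta". Any character that a blacklist rule targets has to be inside the whitelist, otherwise the first pass removes it before the rule can run.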


Sebastiaan Janssen 05 Sep 2012, 14:21:25

I agree for the ones that actually cause errors (YSODs) it's good to add them to the list now. But if people start upgrading and take these new ones in as well then after the first subsequent publish, the URL will change, leading to 404s. So we don't want to do that. I think most (or all) of the ones Drew lists actually do work as a URL.

So I would probably add them but have them commented out so upgrades can easily decide if they want to run the risks of 404'ing some pages.

For what it's worth though, single quotes and " have been in there for a while.. Did you forget to merge your config files while upgrading? ;-)


Pete Duncanson 05 Sep 2012, 14:22:38

Loving the whitelist idea, must admit trying to keep on top of them all is getting a pain. Whitelist might be the way to go. What to include by default though? A-Za-z0-9 and -,_,& then replace "--" with "-" etc. and "&" with " and " as per Matt's grumble (which I'm 100% behind) in http://issues.umbraco.org/issue/U4-751. That would be uber flexible and a good base layer.


Pete Duncanson 05 Sep 2012, 14:25:00

@Sebastiaan, been working on a lot of second hand sites so yeah probably running old configs. They soon get so out of whack that you don't want to go merging/changing them. Another good reason to pull as much of that stuff as possible out of web.config to separate umbraco settings from site settings I guess. Something to ponder.


Sebastiaan Janssen 05 Sep 2012, 14:38:59

@Pete they're in the umbracoSettings.config already.

I just added \ and | as they cause real problems. The other ones are perfectly valid (unicode) characters in URLs, right?


Douglas Robar 05 Sep 2012, 15:02:44

A whitelist may not be feasible when you consider that unicode/non-ascii characters are supported in urls and domains these days (that is, they work and have done for ages and domain registrars and the RFCs are slowly moving in that direction). See http://www.w3.org/International/articles/idn-and-iri/ I've never found a good solution for unicode characters and regex but perhaps there's a solution I'm unaware of... in which case maybe a whitelist is an option.

My recommendation would be to comment the replacements into sections such as Legacy, International (for umlauts, etc), Not allowed by the spec, and Recommended. The 'Recommended' would be things like handling commas, single quotes, dashes, double-dashes, etc.

For people who upgrade, they will want to be careful about the potential to break their search engine links if they alter the url replacements. So a comment at the top of the replacements in the config file would be helpful to remind upgraders of the effect of changing their existing replacements.
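
As an illustration of that layout, the replacement list might be organised along the following lines. This is a sketch only: the `char`/`org` shape matches what the FormatUrl code in this thread reads, but the `urlReplacing` wrapper with its `removeDoubleDashes` attribute, the section names, the chosen characters and their replacement values are assumptions, not the contents of any shipped umbracoSettings.config.

```xml
<urlReplacing removeDoubleDashes="true">
  <!-- WARNING for upgraders: changing this list changes the URL of a page
       the next time it is published, which can break existing links. -->

  <!-- Legacy -->
  <char org=" ">-</char>

  <!-- Not allowed by the spec (these break requests outright) -->
  <char org="\"></char>
  <char org="|"></char>
  <char org="&lt;"></char>
  <char org="&gt;"></char>

  <!-- International -->
  <char org="ß">ss</char>

  <!-- Recommended -->
  <char org="'"></char>
  <char org="&quot;"></char>

  <!-- Recommended additions that could change existing URLs.
       Commented out by default; uncomment on new sites only. -->
  <!-- <char org=","></char> -->
</urlReplacing>
```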


Sebastiaan Janssen 05 Sep 2012, 15:09:56

FYI just added < and > as well as they don't work at all. I like Doug's suggestion of having some sections to divide types of replacements in. But yeah, comment the things that could break existing URLs out by default, as people tend to read pretty badly. :-)


Drew 05 Sep 2012, 16:30:50

Our list contains technically valid URL characters that we remove so that the URL is formatted 'nicer', there's also some SEO consideration in that too - although this only applies to sites with the core language as English.

Although yes, there's a reduced power in "friendly" URLs now, quite a lot of people expect URLs to contain only A-Za-z0-9 & and a dash (underscore is also valid, but again, there are readability and SEO arguments against its usage, however minor).

Blacklist with breaking additions commented out is great, although I'm now wondering if there's a package in allowing a regex override.


Matthew Bliss 06 Sep 2012, 14:53:58

Sebastiaan, that's a valid point about Urls changing on an upgrade and causing 404's. Likewise Doug, valid point that there are many RFC valid characters for international sites that are not easily handled using Regex. So I can see strong arguments for not adopting my earlier suggestion as part of the standard installation.

However, Pete's comments and my own experience made me think that the approach could be of interest to some who would welcome a more restrictive approach to URL naming, and Drew's comment about adding the functionality in a package caught my interest.

After a quick look at the Umbraco source though I cannot see a way of easily doing this using the current event model. It looks to me like the relevant code is not easily accessible (Though I might have missed something obvious).

The following code formats the URL based on the rules defined in the config. Could this be made accessible using the event model to allow easy overrides/extensions to be developed if desired?

{cut umbraco\cms\helper\url.cs}

    using System.Text.RegularExpressions;
    using System.Xml;

    namespace umbraco.cms.helpers
    {
        /// <summary>
        /// Summary description for url.
        /// </summary>
        public class url
        {
            public url()
            {
                //
                // TODO: Add constructor logic here
                //
            }

            public static string FormatUrl(string url)
            {
                string _newUrl = url;

                // apply each <char org="...">replacement</char> rule from the config, in order
                XmlNode replaceChars = UmbracoSettings.UrlReplaceCharacters;
                foreach (XmlNode n in replaceChars.SelectNodes("char"))
                {
                    if (n.Attributes.GetNamedItem("org") != null && n.Attributes.GetNamedItem("org").Value != "")
                        _newUrl = _newUrl.Replace(n.Attributes.GetNamedItem("org").Value, xmlHelper.GetNodeValue(n));
                }

                // check for double dashes
                if (UmbracoSettings.RemoveDoubleDashesFromUrlReplacing)
                {
                    _newUrl = Regex.Replace(_newUrl, @"[-]{2,}", "-");
                }

                return _newUrl;
            }
        }
    }

This way the standard install could then remain unchanged, but compiling and including a dll in a project - and it would only need a few lines of code - would allow us to define our own Url rules that get hooked into the event model.

This could also then be used for [U4-751 Replace & in urls with 'and'|http://issues.umbraco.org/issue/U4-751] and if combined with some code to access the dictionary we could write code to translate & to 'and', 'und', 'og' etc. dependent on language.

What are your collective thoughts?


Drew 06 Sep 2012, 18:45:55

+1 on the ability to implement our own URL handler via an override or similar. I was thinking about a straightforward regex, but this causes problems as you'd still want to replace some characters with others (rather than just strip them out). So the ability to plug in our own implementation would be handy.


Matt Brailsford 09 Nov 2012, 09:45:12

It's a shame we don't still have the v5 issues around as I'm sure we came up with a very flexible way of encoding URLs taking into account non-English languages. Might be worth trying to dig it out.


Lee Kelleher 13 Nov 2012, 11:18:06

Just for reference Tim Gaunt has a post for his ["ultimate" urlReplacing character list|http://blogs.thesitedoctor.co.uk/tim/2012/11/09/The+Ultimate+UrlReplacing+Character+List+For+Umbraco.aspx].


esunxray 13 Nov 2012, 11:44:48

Can I add all Chinese to the list? How is the performance? For example: 中 replaced with zhong


Matthew Bliss 13 Nov 2012, 12:21:34

But Chinese has a great many characters.


esunxray 13 Nov 2012, 13:09:02

@Matthew Bliss, are you Chinese or a foreigner? Haha. I don't know what the performance would be after adding all Chinese characters, so I asked here.


Matthew Bliss 13 Nov 2012, 14:27:01

I'm British. I don't know either, but even if you add only the top 1000 characters that is still a lot of rules to create. I believe though that the site will only perform this replacement process when a page is published, so even if it causes a short delay on page publish it should not affect the site performance for visitors. (Matt, Seb or Lee can you confirm that this is correct?)


Alberto Soares 20 Aug 2013, 08:58:43

I submitted a pull request today with some chars to replace, and it was refused, for the (now) obvious reasons. But I got an idea: can we add the new chars commented out? For people who upgrade, or who just don't care about it, it still works without the worry of losing urls, and people on new development know that they just have to go there and uncomment the lines, and the list would be centralized in the right place ;)


Stephan 31 Oct 2013, 08:06:01

Was not aware of that issue. One thing to know: the whole process of processing the url characters has been revamped under the hood (should come in 6.2.0 and 7.0.0). It still honors the urlReplacing characters list in umbraco.settings but by default it will turn every url segment into a compliant ASCII string.

So the urlReplacing characters list should be used to turn, say, '*' into 'star' -- but 'ô' will become 'o' automatically and even 'ß' will become 'ss'. So running with an empty list should be pretty safe.

In my tests, a page named "1234 page 8 - Have&a#'nice$$url yëklôô £$" has url "/1234-page-8-have-a-nice-url-yekloo.aspx" - automatically.
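
The behaviour described above can be approximated with standard .NET calls. This is a sketch only, not the code that shipped in 6.2.0/7.0.0: the class and method names are hypothetical, and the explicit 'ß' to 'ss' mapping is an assumption added because Unicode decomposition does not split that character.

```csharp
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

public static class AsciiSegmentSketch
{
    public static string ToAsciiSegment(string name)
    {
        // Lower-case first, and handle 'ß' by hand (it has no canonical decomposition).
        string text = name.ToLowerInvariant().Replace("ß", "ss");

        // Decompose accented letters into base letter + combining mark,
        // then drop the combining marks: 'ë' becomes 'e', 'ô' becomes 'o'.
        string decomposed = text.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c);
        }

        // Every run of characters outside a-z and 0-9 collapses to a single dash.
        string collapsed = Regex.Replace(sb.ToString(), "[^a-z0-9]+", "-");

        // No leading or trailing dashes.
        return collapsed.Trim('-');
    }
}
```

Under these assumptions, `ToAsciiSegment("Große Seite")` gives "grosse-seite", and the page name from the test above gives "1234-page-8-have-a-nice-url-yekloo".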

Reading what Doug writes, I understand we might want to run with a more relaxed version that would still accept some utf-8 characters eg with accents. I need to look into it, which probably means figuring out which utf-8 char is valid in a url.

I also realize that it means that upgrading an install from 6.1.x to 6.2.0 and re-publishing a document, might change that document's url. Well, the "old" code is still there and you just need one line of code to put it back in place -- but I guess I must document all this as "sort-of" breaking change -- or revert.

Thoughts and comments?


Damiaan Peeters 31 Oct 2013, 08:27:26

Stephan, what if you add a config option to revert back to the "legacy"? Then it's up to the developer to break or not to break. For v6.2 the config could be by default "legacy" and for v7 by default "new".


Douglas Robar 31 Oct 2013, 09:27:27

I've continued to ponder this issue and the situation is somewhere between murky and impenetrable. The fundamental question is, "what should be done?" not "how to do it?"

There are different specs for domains and for URIs. That is, the rules for the 'example.com' part and the '/some-folder/some-page.aspx' part. To make things more confusing, browsers have, for years, done on-the-fly translation into requests that conform to the specs but no-one realizes that when using the web. We create a big German page, or type www.google.com/große-seite.aspx into a browser's url bar and it will work just fine (well, it would work if the page existed). It sure looks like non-ascii is supported, doesn't it?

[** this is my understanding of the situation, but I may be mistaken in some details... I'm no expert here **] But what is really happening is that the browser is translating the 'human readable' thing typed into a specification-conforming request. Take a look with Chrome Inspector and you'll see that the request is actually:

GET /gro%C3%9Fe-seite.aspx

That's the url-encoded version of the request, to make it 'safe' and specification-friendly.

The only real problem with leaving url's in their 'native' format with non-ascii characters is that even though browsers handle things properly, letting us use any characters we like in urls, not all requests go through browsers. Some go through command-line tools or code, and those would need to be properly encoded before using them.
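
For requests made from code rather than a browser, the same encoding can be done with `Uri.EscapeDataString`, which percent-encodes the UTF-8 bytes of anything outside the unreserved set. The helper below is a hypothetical sketch (it is not an Umbraco API) that encodes each path segment and keeps the slashes.

```csharp
using System;

public static class EncodedUrlSketch
{
    // Percent-encodes each path segment as UTF-8 while keeping the '/' separators.
    public static string EncodePath(string path)
    {
        string[] segments = path.Split('/');
        for (int i = 0; i < segments.Length; i++)
        {
            segments[i] = Uri.EscapeDataString(segments[i]);
        }
        return string.Join("/", segments);
    }
}
```

`EncodePath("/große-seite.aspx")` returns "/gro%C3%9Fe-seite.aspx", the same path the browser sends in the GET request shown above.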

Back to the question... what should be done?

I guess this really boils down to asking who are urls for?

#If for humans and users then we don't need a lot of url-replacing rules, just the obvious ones. That is, a subset of the existing rules in v6.1 (no need for the international ones). If a human can type and understand the international characters it's reasonable that they'd like to use them in urls. If a user can type 'große-seite.aspx' they ought to be able to do so. The reason for converting it in Umbraco in years past is that nobody can type url-encoded characters into the url bar properly and browsers of generations gone by didn't auto-encode requests. That's why the convention was to convert the url from 'große-seite.aspx' to 'grosse-seite.aspx', which at least was easy to type if not exactly obvious or friendly. The penalty for using 'international' urls is that sometimes a developer might need to manually encode Umbraco's urls when performing http requests directly in code. For the sake of clarity/friendliness to most Umbraco users, one wonders if this is something that can reasonably be left for devs to do when needed?

#If we want to maximize convenience for devs then Umbraco should simplify and character-replace all urls all the time. Then devs have nothing special to do. The penalty here is that the rules for url replacing become quite extreme, especially when you consider non-latin-based languages (Russian, Hebrew, Arabic, Japanese, Chinese, etc etc).

My opinion (and it's an opinion) is that Umbraco should be for users first and foremost. Urls should be simplified (spaces become dashes and that sort of thing) but no international characters should be replaced. Given that such a small set of international characters has ever been included in Umbraco it seems a bit late to try to add the full set (as though that were possible). The proof that this isn't needed is the lack of problems related to this issue on the forums. For those that want url-replacing, it can still be used for upgrades and by anyone who wants to see 'ß' replaced with 'ss' in urls. As for devs, they can either remember to encode urls in their code or an additional method can be made in the Umbraco API for providing encoded urls (perhaps .EncodedUrl or an optional param to the existing .Url(bEncode)). This would seem to be the best for users world-wide, easiest to implement, and still allowing full control and legacy support.

EDIT: A good read, especially the comments by Matthias. Includes links to the relevant specs. http://perishablepress.com/stop-using-unsafe-characters-in-urls/


Matt Bliss 31 Oct 2013, 09:59:31

Hi Stephan,

In view of the fact that the URL naming process has been revamped and also taking the above comments into account can I suggest the following as a target:

  1. Keep the new process and flag it as a breaking change for upgrades.
  2. If it is not already present provide a config option to revert to the 'legacy' encoding process to allow updates of sites without breaking existing URLs.
  3. Provide an easy override that developers can hook into. This would allow us to handle any edge cases in whatever way needed.

As regards how to handle this technically, perhaps this could be handled in a similar way to the reverse process when a URL is requested. Umbraco works its way through the 404handlers.config and the first handler to return a node is used to resolve the URL into an Umbraco content item. Could we have a URL creation config which by default would have the revamped (new) process at the top, followed by the existing (legacy) process second, and then if needed a developer could add their own process at the relevant point in the sequence? (Also they could just swap the order if they wanted to achieve point 2 above)

Hopefully this should be a relatively simple but very flexible approach that would cover all situations and would be consistent with the reverse process already well established in Umbraco.
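
The "first one to return a result wins" sequence can be sketched in a few lines. Everything here is hypothetical (the class name, the use of plain delegates, and null meaning "not handled"); it mirrors the idea in the comment above rather than any existing Umbraco configuration.

```csharp
using System;
using System.Collections.Generic;

public static class UrlCreationChainSketch
{
    // Runs the steps in order and returns the first non-null result.
    // Returns null if no step handles the name.
    public static string FirstResult(IEnumerable<Func<string, string>> steps, string name)
    {
        foreach (var step in steps)
        {
            string segment = step(name);
            if (segment != null)
                return segment;
        }
        return null;
    }
}
```

A developer's own step would sit ahead of the defaults and return null for names it does not care about, so the next process in the sequence takes over. Swapping the order of two steps is all it takes to change which process is the default.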


Stephan 31 Oct 2013, 12:28:43

The good news is that the old process and the new process are both in the core, it's (almost) just a matter of config, and the new process is rather flexible and can itself be configured. Oh and Matt: we already have a way to override the whole process. Just write your own IUrlSegmentProvider and you can do ''exactly'' what you want.

That being said. I listen to the Wise Doug and realize that the new process might be a bit too extreme, by default. So I need to figure out a more relaxed default, bearing in mind that you can always enable the stricter solution.

What's not easy is that... as soon as you allow international chars such as chars with accents or the whole unicode world of chinese chars... there's no way really to tell what's right and wrong in the url. Why would we replace a space with a dash but accept a completely exotic char? What about non-breaking spaces? Or my favorite unicode char, the non-breaking zero-length space?


Pete Duncanson 22 Nov 2013, 14:46:47

Could "exotic" characters be allowed by a config flag? Out of the box, just allow "normal" characters and render out clean urls? Makes life a little easier for the common cases and if you venture into needing special characters the flag lets you know that it might not do as you wanted.


Funka! 22 Nov 2013, 22:26:11

Pete, check out my related issue U4-3157 regarding one way of allowing this that I'd personally love to see... (As this issue is seemingly just about the single-quote/apostrophe character.)

I think the pattern approach I'd be hoping for would be a bit more flexible than just a yes/no flag, since you'd be able to have different whitelist patterns for different cultures at least. Then there is no single universal question about what is "exotic" or not since each culture could define (or override) its own list.

The default of course would need to respect backward compatibility by NOT using this behavior unless/until explicitly configured to do so.
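
A rough sketch of what per-culture whitelist patterns could look like. The culture names, the patterns, and the class itself are illustrative assumptions and do not reproduce the design proposed in U4-3157. A culture with no configured pattern is left untouched, which keeps the opt-in behaviour described above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CultureWhitelistSketch
{
    // Each pattern matches the characters that are NOT allowed for that culture.
    private static readonly Dictionary<string, string> Patterns =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "en-GB", "[^A-Za-z0-9-]" },
            { "de-DE", "[^A-Za-zÄÖÜäöüß0-9-]" },
        };

    public static string Apply(string segment, string cultureName)
    {
        string pattern;

        // Not configured for this culture: change nothing (backward compatible).
        if (cultureName == null || !Patterns.TryGetValue(cultureName, out pattern))
            return segment;

        // Strip everything outside this culture's allowed set.
        return Regex.Replace(segment, pattern, "");
    }
}
```

With these sample patterns, "große-seite!" stays "große-seite" for de-DE, becomes "groe-seite" for en-GB, and is returned unchanged for a culture with no entry such as fr-FR.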


Damiaan Peeters 25 Apr 2016, 08:57:35

This issue can be closed. 7.4 removes single quotes from the url nodes.


Stephan 25 Apr 2016, 08:59:46

Cool thanks.


Priority: Normal

Type: Bug

State: Closed

Assignee:

Difficulty: Normal

Category:

Backwards Compatible: True

Fix Submitted:

Affected versions:

Due in version:

Sprint:

Story Points:

Cycle: