U4-4056 - What happened to urlReplacing and how to set unicode url support?

Created by Douglas Robar 15 Jan 2014, 12:34:44 Updated by Can Koluman 26 Sep 2016, 10:35:51

Using (nightly builds of) 7.0.2, I noticed that the urlReplacing section of the umbracoSettings.config file has been removed.

I also noticed that Unicode characters in page names are allowed through into the generated URL for that page, but not always.

For instance, a page named this (please forgive my copy-paste of foreign languages): ÆØÅ and æøå and 中文测试 and אודות האתר and größer page

generates this url: æøå-and-aeoeaa-and-中文测试-and-אודות-האתר-and-groesser-page

Notice that some characters are passed through unchanged (though lower-cased) while others are replaced with ASCII equivalents à la the old urlReplacing strategy. This is quite confusing to me, even though I have followed the discussion for a very long time (see U4-3732 and U4-750 among others).

Things I find confusing about the current situation:

1. æ becomes ae, while Æ becomes æ in a generated url

2. öß becomes oess, while Chinese and Hebrew are unchanged

3. why the urlReplacing section has been removed from umbracoSettings.config, which would (I think) allow the behaviour to be changed by users, but which at the moment performs magic transforms the user can't discern without looking at the source code?

4. how to enable/disable unicode urls entirely (aka, backwards compatibility vs 'the new way')

5. why wouldn't all unicode chars be passed through unaltered if unicode urls are enabled? (this would mean #1 and #2 wouldn't have happened if unicode urls are turned on, making it 'all or nothing' for any legal character)

Thanks for any insight.

cheers,
doug.

Comments

Stephan 15 Jan 2014, 12:40:00

Confusing, sure. For instance, I don't understand points 1 and 5 myself, which probably means I have screwed something up. Oh my. Will work on it later today.

In ''theory'' no Unicode should be replaced, by default, and there's an easy way (though via code) to tell Umbraco to please replace all unicode chars by ASCII. Will document it here once it works. As for why the urlReplacing section is gone... I have no idea as that's still supposed to work.

Conclusion: it's a mess where I thought it was all clean. Give me a bit of time!


Douglas Robar 15 Jan 2014, 12:41:41

Thank you, my friend!


Douglas Robar 15 Jan 2014, 12:46:04

I think the reason for #1 and #2 is that they are handled by the (invisible) list of urlReplacing chars, which specifically includes those characters but not the capitalized versions.

My suggestion (at least for a 'quick fix') would be:

reinstate the urlReplacing section in umbracosettings.config so users can see/modify the list

add option for do/don't convert unicode > ascii. If set to allow unicode, then all entries in urlReplacing with an original character that is not ascii are ignored and the unicode chars are passed through unaltered.

Would that get 95% of the way in 5% of the effort?


Marc Stöcker 15 Jan 2014, 15:26:49

Doesn't this new "magic auto" behaviour, which replaces the former config approach, prevent transliteration? For instance, with Russian sites we use that all the time: http://our.umbraco.org/forum/developers/extending-umbraco/22910-Transliterating-cyrillic-URLs-with-umbracoSettingsconfig


Stephan 15 Jan 2014, 16:07:46

If the output is unicode, since we're still using urlReplacing, nothing should change. If the output is ascii, because we do urlReplacing first, nothing should change. But, to do unicode -> ascii we use the transliteration library that is used, amongst others, by Lucene.Net, and takes care of a ''lot'' of chars. I will check whether it takes care of the characters listed in that Our post, and report here.

At the moment, working on fixing the issues Doug pointed out.


Douglas Robar 15 Jan 2014, 16:10:51

Thanks, Stephan!

FWIW, I tried inserting the section from a 6.1.5 installation, but changes to the config file settings made no difference to the url generated. It seems the magic url replacements can't be overridden without C# coding on the site builder's part?


Douglas Robar 15 Jan 2014, 16:17:59

My mistake... manually adding the section does indeed override the built-in replacements. whew.


Douglas Robar 15 Jan 2014, 16:47:05

As noted in http://our.umbraco.org/documentation/Using-Umbraco/Config-files/umbracoSettings/#RequestHandler, the urlReplacing section has been removed from the default config file but the content of the built-in settings is shown for reference.

I guess that's the definitive answer on my request to reinstate the urlReplacing section of the config file. Since the link to the documentation is included at the top of the umbracosettings.config file there's no need for further action in that regard, and I think that handles Marc's concern as well.

If I didn't want the unicode chars (such as æøå) that are specifically mentioned in the default urlReplacing I could add my own urlReplacing section in my local umbracoSettings.config file:

æ ø å ä ö ü ß Ä Ö

Note that an application pool recycle is required for the changes to take effect.
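For reference, a sketch of how such a section could be read into a replacement table. The XML shape used here (char elements with an org attribute holding the original character, and the element text holding the replacement) is an assumption recalled from Umbraco's default configuration, not something shown in this thread, and the three sample entries are purely illustrative. Check the linked documentation for the real syntax.

```python
import xml.etree.ElementTree as ET

# Hypothetical urlReplacing fragment; element and attribute names are assumed.
SAMPLE = """
<urlReplacing>
  <char org=" ">-</char>
  <char org="æ">ae</char>
  <char org="ß">ss</char>
</urlReplacing>
"""

def load_replacements(xml_text):
    """Return a dict mapping each original character to its replacement."""
    root = ET.fromstring(xml_text.strip())
    # An empty element (no text) means "remove this character".
    return {c.get("org"): (c.text or "") for c in root.iter("char")}

print(load_replacements(SAMPLE))
```

Mapping a character to itself in such a table would be one way to let it through unchanged.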

@Stephan -- I think this goes a long way to alleviating my concerns. What do you think?


Stephan 15 Jan 2014, 16:55:20

@Doug: I still don't understand how you've been able to obtain that first url you posted, with some chars being lowercased and some being transliterated. Can't reproduce on 6.2, now about to test on 7.0.2 (although it should be the same code).

Now what I take from this discussion is that a) yes, it's only a matter of documentation but b) we now have built-in ''very'' powerful transliterating code, and ppl would need to enable it through ''code'', and maybe it should instead be a config option?

That code does not handle the Cyrillic chars yet, but it would take me 10 minutes to add them to the list.

So I will a) try to figure out how you produced your crazy url, and b) see if there's a way we can make it very easy to enable the builtin unicode-to-ascii conversion (no code).

Sounds good?


Douglas Robar 15 Jan 2014, 18:21:20

@Stephan - I downloaded 7.0.2 nightly #238 and ran it in webmatrix/sqlce. I then made a single doctype with no properties and created a content node. Pasted in the above text for the name and published it. All my tests were done with that build, if that helps (didn't try it on 6.2.0 nightly).

As I'm understanding the situation and how things work better, I think this is not such a major issue. The easy replacing can be done by adding one's own urlReplacing section to the config file, mentioned here and with references to the docs already in the config file for users. More complex replacing can be done with code as you've mentioned, which is mega-powerful! That's a good mix of config and code I think.

An option to enable/disable unicode-to-ascii conversion in the umbracosettings.config file would be nice indeed.


Marc Stöcker 15 Jan 2014, 21:12:31

I don't know about the quality of the transliteration that Lucene.Net provides, so I will have to look into that. Generally I like the idea of having a quality transliteration library in use instead of any homebrew thing.


Stephan 16 Jan 2014, 08:39:22

@Marc: was about to add Cyrillic transliteration to our code, but it looks like ppl can't agree on how to properly transliterate Cyrillic ;-) Found dozens of variants, eg should Й become Y or J, etc. If you can provide me with an ascii equivalent of the following string, with reasonable confidence that most ppl will be happy with it, it might go into Core:

"А Б В Г Д Е Ё Ж З И Й К Л М Н О П Р С Т У Ф Х Ц Ч Ш Щ Ъ Ы Ь Э Ю Я"

what about: "A B V G D E E Zh Z I I K L M N O P Q R S T U F Kh F Ch Sh Shch " Y ' E Yu Ya"

though... this will transliterate Анастасия as Anastasiya which if I understand correctly is ugly and should be Anastasia... but that means that some chars transliterate differently depending on their adjacent chars? Is this true? Is there a list of such subtleties?
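A minimal sketch of the context-free, one-letter-at-a-time table approach discussed here. The table contains only the letters needed for the Анастасия example, with the values from the proposal above; the function name and the lower-casing rule for lower-case input are assumptions for illustration, not the code that went into Core.

```python
# Partial table: only the letters the example needs (upper case).
PROPOSED = {
    "А": "A", "Н": "N", "С": "S", "Т": "T", "И": "I", "Я": "Ya",
}

def transliterate(text, table=PROPOSED):
    """Replace each Cyrillic letter independently of its neighbours."""
    out = []
    for ch in text:
        if ch in table:
            out.append(table[ch])
        elif ch.upper() in table:
            # Lower-case letter: reuse the upper-case entry, lower-cased.
            out.append(table[ch.upper()].lower())
        else:
            out.append(ch)  # unknown characters pass through unchanged
    return "".join(out)

print(transliterate("Анастасия"))
```

Because each letter is mapped on its own, я always becomes ya, so this approach yields Anastasiya and cannot produce Anastasia without some extra context-dependent rule.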


Stephan 16 Jan 2014, 08:46:09

@Doug: 7.0.2 is currently set up to ''not'' auto-transliterate urls. So if some chars were replaced it would be because of some urlReplacing chars in the settings. If some chars were replaced while others were not, that would be because some were defined in urlReplacing and some were not. So... no bug in our code, not a major issue, you're right!

Just looking at 1/ how we could enable unicode-to-ascii without code... 2/ Cyrillic transliteration before considering this issue fixed.
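The mixed result in the original report can be reproduced with a small sketch. It assumes the replacement table holds only the characters mentioned earlier in the thread (æ ø å ä ö ü ß Ä Ö, plus a space-to-dash rule) and that the segment is lower-cased after replacement. The table values for Ä and Ö, the function name, and the order of the two steps are guesses for illustration, not Umbraco's actual code.

```python
# Hypothetical replacement table, inferred from the URL in the report.
# Note that upper-case Æ, Ø and Å have no entry.
URL_REPLACING = {
    " ": "-",
    "æ": "ae", "ø": "oe", "å": "aa",
    "ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
    "Ä": "ae", "Ö": "oe",  # target values assumed
}

def to_url_segment(name, table=URL_REPLACING):
    """Apply per-character replacements, then lower-case the result."""
    replaced = "".join(table.get(ch, ch) for ch in name)
    return replaced.lower()

name = "ÆØÅ and æøå and 中文测试 and אודות האתר and größer page"
print(to_url_segment(name))
```

Under these assumptions ÆØÅ is merely lower-cased because it is missing from the table, æøå and öß are replaced because they are in it, and the Chinese and Hebrew text passes through untouched.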


Stephan 16 Jan 2014, 16:36:59

Pushed 326309e to 6.2.0. Adds some Cyrillic support to the built-in transliterating library (although @Marc, if you have a minute to reply to me that'd be great), and adds a new setting:

Setting toAscii="true" will trigger the transliterating library, which will convert urls to ascii and should already take care of most accented chars. You can still specify your own replacements, which will run before the built-in stuff.
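The ordering described here (custom replacements first, built-in ASCII conversion second) can be sketched as follows. The ASCII step is a deliberately crude stand-in that uses Unicode decomposition from Python's standard library, not the transliteration library Umbraco uses, and all function and parameter names are hypothetical.

```python
import unicodedata

def to_ascii_fallback(text):
    """Stand-in for the built-in step: strip accents, drop other non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

def build_segment(name, custom=None, to_ascii=False):
    """Run custom replacements first, then the optional ASCII conversion."""
    custom = custom or {}
    text = "".join(custom.get(ch, ch) for ch in name)
    if to_ascii:
        text = to_ascii_fallback(text)
    return text.lower().replace(" ", "-")

# Custom rules win for ö and ß; the fallback then handles é.
print(build_segment("Größer Café", {"ö": "oe", "ß": "ss"}, to_ascii=True))
# With the conversion off, Unicode characters are kept.
print(build_segment("Größer Café"))
```

Running the custom table first is what lets a site keep its preferred spelling (oe rather than o for ö) even when the ASCII conversion is switched on.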

Sounds good?

Still need to merge to 7.0.2 -- later tonight.


Marc Stöcker 16 Jan 2014, 16:46:13

From our projects we have the experience that there is not "one Cyrillic" but several of them, at least regarding transliteration. E.g. there is a Russian alphabet and a Georgian one, et al. I'm no expert in that field, but will try to find out further details and post them. Could take a few days, though.


Douglas Robar 16 Jan 2014, 17:12:00

Nice stuff, Stephan! When merged, please note the toAscii="true" attribute in http://our.umbraco.org/documentation/Using-Umbraco/Config-files/umbracoSettings/

Otherwise we'll never remember it! :)


Stephan 16 Jan 2014, 19:05:23

Merged into 7.0.2. Have pushed updates to the doc too. Closing the issue.

@Marc: that's what I've come to understand... there are many ways to transliterate Cyrillic because it depends on the language. At the moment I have implemented a "default" Russian transliteration. Will think about how to make that modular so ppl can set it up by themselves.

The general idea being that the built-in transliteration is more efficient, performance-wise, than the urlReplacing config. I'll try to take the best of both worlds, ie make the built-in transliteration configurable...


Marc Stöcker 16 Jan 2014, 19:37:24

Got the feedback from my co-worker much earlier than expected. He seconds it: there is no "default Cyrillic", so we need to, e.g., define Russian as the standard, and if that doesn't suit the project, one needs to provide one's own custom scheme.

Thanks for the prompt fix - awesome work! :)


Stephan 17 Jan 2014, 07:28:08

Damn. Seeing some issues in 7 that are not in 6, though we're running the same code?


Stephan 17 Jan 2014, 09:28:22

Pushed e6226d3 that should fix it.


Priority: Major

Type: Bug

State: Fixed

Difficulty: Normal

Backwards Compatible: True

Affected versions: 6.1.6, 7.0.1

Due in version: 6.2.0, 7.0.2
