Created by Douglas Robar 15 Jan 2014, 12:34:44 Updated by Can Koluman 26 Sep 2016, 10:35:51
Using (nightly builds of) 7.0.2, I noticed that the urlReplacing section of the umbracosettings.config file has been removed.
I also notice that unicode characters in page names are allowed through into the generated URL for that page. But not always.
For instance, a page named this (please forgive my copy-paste of foreign languages): ÆØÅ and æøå and 中文测试 and אודות האתר and größer page
generates this url: æøå-and-aeoeaa-and-中文测试-and-אודות-האתר-and-groesser-page
Notice that some characters are passed through unchanged (though lower-cased) and others are replaced with ASCII equivalents à la the old urlReplacing strategy. This is quite confusing to me, even though I have followed the discussion for a very long time (see U4-3732 and U4-750 among others).
Things I find confusing about the current situation:
Thanks for any insight.
Confusing, sure. For instance, I don't understand points 1 and 5 myself, which probably means I have screwed something else up. Oh my. Will work on it later today.
In ''theory'' no Unicode should be replaced, by default, and there's an easy way (though via code) to tell Umbraco to please replace all unicode chars by ASCII. Will document it here once it works. As for why the urlReplacing section is gone... I have no idea as that's still supposed to work.
Conclusion: it's a mess where I thought it was all clean. Give me a bit of time!
Thank you, my friend!
I think the reason for #1 and #2 is that they are handled by the (invisible) list of urlReplacing chars, which specifically includes those characters but not the capitalized versions.
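To make that hypothesis concrete, here is a minimal Python sketch (not Umbraco's actual code). It assumes a replacement table that only lists lower-case characters and that replacement runs before lower-casing. The table contents and the `make_segment` name are hypothetical, chosen only to mirror the URL quoted earlier in the thread.

```python
# Hypothetical replacement table with lower-case entries only.
# This is an assumption for illustration, not the real built-in list.
URL_REPLACING = {"æ": "ae", "ø": "oe", "å": "aa", "ö": "oe", "ß": "ss", " ": "-"}


def make_segment(name, table=URL_REPLACING):
    """Apply the replacement table first, then lower-case the result."""
    for src, dst in table.items():
        name = name.replace(src, dst)
    # Upper-case Æ, Ø, Å were not in the table, so they survive the loop
    # and only get lower-cased here.
    return name.lower()


print(make_segment("ÆØÅ and æøå and größer page"))
# prints: æøå-and-aeoeaa-and-groesser-page
```

Under those assumptions the upper-case letters slip past the table and come out as unicode, while their lower-case twins are replaced, which matches the mixed result reported above.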
My suggestion (at least for a 'quick fix') would be:
Would that get 95% of the way in 5% of the effort?
Doesn't this new "magic auto" approach, which replaces the former config way, prevent transliteration? For instance, with Russian sites we use that all the time: http://our.umbraco.org/forum/developers/extending-umbraco/22910-Transliterating-cyrillic-URLs-with-umbracoSettingsconfig
If the output is unicode, since we're still using urlReplacing, nothing should change. If the output is ascii, because we do urlReplacing first, nothing should change. But, to do unicode -> ascii we use the transliteration library that is used, among others, by Lucene.Net, and takes care of a ''lot'' of chars. I will check whether it takes care of the characters listed in that Our post, and report here.
At the moment, working on fixing the issues Doug pointed out.
FWIW, I tried inserting the
My mistake... manually adding the
As noted in http://our.umbraco.org/documentation/Using-Umbraco/Config-files/umbracoSettings/#RequestHandler, the urlReplacing section has been removed from the default config file but the content of the built-in settings is shown for reference.
I guess that's the definitive answer on my request to reinstate the urlReplacing section of the config file. Since the link to the documentation is included at the top of the umbracosettings.config file there's no need for further action in that regard, and I think that handles Marc's concern as well.
If I didn't want the unicode chars (such as æøå) that are specifically mentioned in the default urlReplacing I could add my own urlReplacing section in my local umbracoSettings.config file:
Note that an application pool recycle is required for the changes to take effect.
@Stephan -- I think this goes a long way to alleviating my concerns. What do you think?
@Doug: I still don't understand how you've been able to obtain that first url you posted, with some chars being lowercased and some being transliterated. Can't reproduce on 6.2, now about to test on 7.0.2 (although it should be the same code).
Now what I take from this discussion is that a) yes, it's only a matter of documentation but b) we now have built-in ''very'' powerful transliterating code, and ppl would need to enable it through ''code'', and maybe it should instead be a config option?
That code does not handle the Cyrillic chars yet, but it would take me 10 minutes to add them to the list.
So I will a) try to figure out how you produced your crazy url, and b) see if there's a way we can make it very easy to enable the builtin unicode-to-ascii conversion (no code).
@Stephan - I downloaded 7.0.2 nightly #238 and ran it in webmatrix/sqlce. I then made a single doctype with no properties and created a content node. Pasted in the above text for the name and published it. All my tests were done with that build, if that helps (didn't try it on 6.2.0 nightly).
Now that I understand the situation and how things work better, I think this is not such a major issue. The easy replacing can be done by adding one's own urlReplacing section to the config file, as mentioned here and with references to the docs already in the config file for users. More complex replacing can be done with code as you've mentioned, which is mega-powerful! That's a good mix of config and code I think.
An option to enable/disable unicode-to-ascii conversion in the umbracosettings.config file would be nice indeed.
I don't know about the quality of the transliteration that Lucene.Net provides, so I will have to look into that. Generally I like the idea of having a quality transliteration library in use instead of any homebrew thing.
@Mac: was about to add Cyrillic transliteration to our code, but it looks like ppl can't agree on how to properly transliterate Cyrillic ;-) Found dozens of variants, eg should Й become Y or J, etc. If you can provide me with an ascii equivalent of the following string, with reasonable confidence that most ppl will be happy with it, it might go into Core:
"А Б В Г Д Е Ё Ж З И Й К Л М Н О П Р С Т У Ф Х Ц Ч Ш Щ Ъ Ы Ь Э Ю Я"
what about: "A B V G D E E Zh Z I I K L M N O P Q R S T U F Kh F Ch Sh Shch " Y ' E Yu Ya"
though... this will transliterate Анастасия as Anastasiya which if I understand correctly is ugly and should be Anastasia... but that means that some chars transliterate differently depending on their adjacent chars? Is this true? Is there a list of such subtleties?
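A per-character table is the simplest way to picture this. The Python sketch below is only an illustration, not the code that went into Core. The Latin values loosely follow the proposal above, with three assumptions of mine: the stray "Q" is left out, Ц maps to "ts", and the hard and soft signs are dropped because quote characters are awkward in URLs. The `transliterate` name is hypothetical.

```python
CYRILLIC = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"
# One Latin value per Cyrillic letter, in the same order (33 entries).
LATIN = [
    "a", "b", "v", "g", "d", "e", "e", "zh", "z", "i", "i", "k", "l", "m",
    "n", "o", "p", "r", "s", "t", "u", "f", "kh", "ts", "ch", "sh", "shch",
    "", "y", "", "e", "yu", "ya",
]
TABLE = dict(zip(CYRILLIC, LATIN))


def transliterate(text):
    """Map each Cyrillic letter on its own, ignoring its neighbours."""
    out = []
    for ch in text:
        latin = TABLE.get(ch.lower())
        if latin is None:
            out.append(ch)  # not a Russian letter: pass through unchanged
        elif ch.isupper():
            out.append(latin.capitalize())
        else:
            out.append(latin)
    return "".join(out)


print(transliterate("Анастасия"))
# prints: Anastasiya
```

Because each letter is looked up in isolation, a table like this can never produce "Anastasia"; that would need rules that look at adjacent characters, which is exactly the subtlety asked about above.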
@Doug: 7.0.2 is currently set up to ''not'' auto-transliterate urls. So if some chars were replaced it would be because of some urlReplacing chars in the settings. If some chars were replaced while others were not, that would be because some were defined in urlReplacing and some were not. So... no bug in our code, no major issue, you're right!
Just looking at 1/ how we could enable unicode-to-ascii without code... 2/ Cyrillic transliteration before considering this issue fixed.
Pushed 326309e to 6.2.0. Adds some Cyrillic support to the built-in transliterating library (although @Mac if you have a minute to reply to me that'd be great), and adds a new setting:
That toAscii="true" will trigger the transliterating library, which will convert urls to ascii and should already take care of most accented chars. You can still specify your own replacements, though -- they will run before the built-in stuff.
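The ordering described here (custom replacements first, built-in unicode-to-ascii conversion second) can be sketched in Python as follows. This is not the Umbraco implementation: the stand-in for the built-in transliteration is plain NFKD decomposition from the standard library, which is far less capable than a real transliteration library, and the `to_ascii_slug` name and its `custom` parameter are hypothetical.

```python
import unicodedata


def to_ascii_slug(name, custom=None):
    """Run custom replacements first, then a generic ASCII fallback."""
    # Step 1: user-defined replacements (the urlReplacing equivalent).
    for src, dst in (custom or {}).items():
        name = name.replace(src, dst)
    # Step 2: stand-in for the built-in conversion. NFKD splits accented
    # letters into base letter + combining mark; anything non-ASCII that
    # remains is then dropped.
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = "".join(ch for ch in decomposed if ord(ch) < 128)
    return "-".join(ascii_only.lower().split())


print(to_ascii_slug("größer page", {"ö": "oe", "ß": "ss"}))
# prints: groesser-page
print(to_ascii_slug("Café Ünïted"))
# prints: cafe-united
```

Running the custom table first is what lets a site choose "oe" for ö instead of whatever the generic step would have produced.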
Still need to merge to 7.0.2 -- later tonight.
From our projects we have the experience that there is not the "one Cyrillic", but several of them - at least regarding transliteration. For example, there is a Russian alphabet and a Georgian one, et al. I'm no expert in that field, but will try to find out further details and post them. Could take a few days, though.
Nice stuff, Stephan! When merged, please note the toAscii="true" attribute in http://our.umbraco.org/documentation/Using-Umbraco/Config-files/umbracoSettings/
Otherwise we'll never remember it! :)
Merged into 7.0.2. Have pushed updates to the doc too. Closing the issue.
@Mac: that's what I've come to understand... there are many ways to transliterate Cyrillic because it depends on the language. At the moment I have implemented a "default" Russian transliteration. Will think about how to make that modular so ppl can set it up by themselves.
The general idea being that the built-in transliteration is more efficient, perf-wise, than the urlReplacing config. I'll try to take the best of both worlds, ie make the built-in transliteration configurable...
Got the feedback from my co-worker much earlier than expected. He seconds it: there is no "default Cyrillic", so we need to define e.g. Russian as the standard, and if that doesn't suit the project, one needs to provide one's own custom scheme.
Thanks for the prompt fix - awesome work! :)
Damn. Seeing some issues in 7 that are not in 6, though we're running the same code?
Pushed e6226d3 that should fix it.
Backwards Compatible: True
Affected versions: 6.1.6, 7.0.1
Due in version: 6.2.0, 7.0.2