Bug #644
ASKLIBREOFFICE: Japanese site transliterates question URLs incorrectly (closed)
Description
http://ask.libreoffice.org/ja is transliterating question titles into URLs as though from Chinese rather than Japanese.
Assuming that the point of this is to make a URL that can be read rather than only relying on a unique identifier, the result is jarring and unpleasant to read.
For instance, the question:
カスタムインストールで機能を選べません
becomes:
http://ask.libreoffice.org/ja/question/39107/kasutamuinsutorudeji-neng-woxuan-bemasen/
Within the URL, the three kanji (written Japanese partly uses Chinese characters, but they are not pronounced anything like Chinese) are given the readings "ji", "neng" and "xuan", which are Chinese readings, not Japanese ones.
If this had been tokenised into words and transliterated correctly, it would read something like:
...kasutamu-insutoru-de-kino-wo-erabemasen/
Assuming this is something missing in AskBot itself rather than in its configuration, fixing the problem while retaining the transliteration would mean adding a Japanese morphological analyser. There are obvious candidates for this in C (ChaSen, MeCab) and Java, but unfortunately for AskBot, the only Python solutions I can see appear to involve interfacing with one or other of the C packages, which doesn't seem ideal in a web server environment.
I don't know a great deal about Python, but feel free to ask me if you need any more help with the Japanese text analysis side, where I have some prior experience.
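To make the "tokenise, then transliterate" idea concrete, here is a minimal sketch in plain Python. It is only an illustration: the `LEXICON` table and the `slugify_ja` function are hypothetical names invented for this example, the lexicon is hand-written to cover just the one title above, and greedy longest-match stands in for what a real morphological analyser does statistically. It is not AskBot code.

```python
# Toy illustration only: a hand-written lexicon covering just this one title.
# A real morphological analyser ships a full dictionary and resolves
# ambiguity statistically; nothing here is AskBot or unidecode code.
LEXICON = {
    "カスタム": "kasutamu",
    "インストール": "insutoru",
    "で": "de",
    "機能": "kino",
    "を": "wo",
    "選べません": "erabemasen",
}


def slugify_ja(text, lexicon=LEXICON):
    """Greedy longest-match tokenisation, then join the readings with hyphens."""
    readings = []
    pos = 0
    max_len = max(len(word) for word in lexicon)
    while pos < len(text):
        # Try the longest candidate first, shrinking until a lexicon entry matches.
        for length in range(min(max_len, len(text) - pos), 0, -1):
            word = text[pos:pos + length]
            if word in lexicon:
                readings.append(lexicon[word])
                pos += length
                break
        else:
            raise ValueError(f"no lexicon entry at position {pos}: {text[pos]!r}")
    return "-".join(readings)


print(slugify_ja("カスタムインストールで機能を選べません"))
# prints: kasutamu-insutoru-de-kino-wo-erabemasen
```

The point of the sketch is that the reading is attached to the word (機能 as "kino"), not to each character in isolation, which is exactly what a per-character table cannot do.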
Related issues
Updated by Florian Effenberger almost 10 years ago
- Assignee set to Evgeny Fadeev
Not sure if this is a bug or a feature request. Adding Evgeny and poking him via e-mail.
Updated by Florian Effenberger almost 10 years ago
- Status changed from New to Feedback
The library that causes this is third party, so there might be little that we can do. If there is a chance to fix it on our side, to quote: "Firstly I would need to obtain the list of Japanese names of the Kanji characters."
Updated by Matthew Francis almost 10 years ago
Unfortunately it's not that easy. In Chinese, one character usually has only one pronunciation, so a simple transliteration list will be right most of the time. In Japanese, the correct pronunciation is highly context dependent and can change fundamentally based on the surrounding text.
As above, the correct solution involves calling out to an existing Japanese morphological analyser, which will do magic with the Viterbi algorithm and other such mathematical whizzery to figure out the most statistically probable answer.
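For anyone curious what that looks like in practice, below is a rough sketch of a Viterbi search over a word lattice, using only the Python standard library. Everything in it is an assumption made for illustration: the `LEXICON` entries, the word classes, the `CONNECTION` table and all the costs are invented, and `viterbi_readings` is a hypothetical name. Real analysers use large dictionaries with costs learned from corpora.

```python
# Toy lexicon: surface form -> list of (reading, word_class, word_cost).
# All costs are invented for illustration only.
LEXICON = {
    "東": [("higashi", "noun", 5)],
    "東京": [("tokyo", "noun", 2)],
    "京都": [("kyoto", "noun", 2)],
    "都": [("to", "suffix", 2), ("miyako", "noun", 4)],
}

# Cost of placing one word class directly after another (also invented).
# "BOS" marks the beginning of the string.
CONNECTION = {
    ("BOS", "noun"): 0,
    ("BOS", "suffix"): 5,
    ("noun", "noun"): 2,
    ("noun", "suffix"): 0,
}
DEFAULT_CONNECTION = 3


def viterbi_readings(text):
    """Return the readings along the cheapest path through the word lattice."""
    # best[i] maps word_class -> (total_cost, readings) for the cheapest
    # path that ends at character offset i with a word of that class.
    best = [dict() for _ in range(len(text) + 1)]
    best[0]["BOS"] = (0, [])
    for start in range(len(text)):
        if not best[start]:
            continue  # no path reaches this offset
        for end in range(start + 1, len(text) + 1):
            for reading, cls, word_cost in LEXICON.get(text[start:end], []):
                for prev_cls, (prev_cost, prev_readings) in best[start].items():
                    step = CONNECTION.get((prev_cls, cls), DEFAULT_CONNECTION)
                    cost = prev_cost + step + word_cost
                    if cls not in best[end] or cost < best[end][cls][0]:
                        best[end][cls] = (cost, prev_readings + [reading])
    if not best[len(text)]:
        raise ValueError("no segmentation found")
    _, readings = min(best[len(text)].values(), key=lambda item: item[0])
    return readings


print(viterbi_readings("東京都"))  # ['tokyo', 'to']
print(viterbi_readings("都"))      # ['miyako']
```

With these made-up costs the same character 都 comes out as "to" after 東京 but as "miyako" on its own, which is the kind of context dependence a flat per-character list cannot express.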
Updated by Florian Effenberger almost 10 years ago
Thanks for the fast reply! Let's see if we find a solution, but it looks like the problem is not in AskBot itself but in a third-party library that we cannot fix in the near future.
Updated by Matthew Francis almost 10 years ago
Could you possibly point me to where in the code this existing library is linked in?
If I can get an actual AskBot instance running locally I may be able to offer a more specific suggestion on how to proceed, or even do something about it myself...
Updated by Evgeny Fadeev almost 10 years ago
Matthew Francis wrote:
Could you possibly point me to where in the code this existing library is linked in?
If I can get an actual AskBot instance running locally I may be able to offer a more specific suggestion on how to proceed, or even do something about it myself...
The library is unidecode (pip install unidecode); you can just try it there.
I've seen the author state somewhere that the library does not distinguish between Japanese and Chinese.
Updated by Alexander Werner over 8 years ago
- Status changed from Feedback to Rejected
Unidecode: this library does not even attempt to distinguish Japanese readings from Chinese ones.
The amount of work required does not justify the outcome, so unfortunately we have to reject the issue.
Updated by Alexander Werner over 8 years ago
- Related to Task #414: AskBot improvements added