Bug #644

ASKLIBREOFFICE: Japanese site transliterates question URLs incorrectly

Added by Matthew Francis over 3 years ago. Updated almost 2 years ago.

Status: Rejected
Priority: Normal
Assignee:
Category: Askbot
Target version: -
Start date:
Due date:
% Done: 0%
Estimated time:
Tags:
URL:

Description

http://ask.libreoffice.org/ja is transliterating question titles into URLs as though from Chinese rather than Japanese.
Assuming that the point of this is to make a URL that can be read rather than only relying on a unique identifier, the result is jarring and unpleasant to read.

For instance, the question:

カスタムインストールで機能を選べません
becomes:
http://ask.libreoffice.org/ja/question/39107/kasutamuinsutorudeji-neng-woxuan-bemasen/

Within the URL, the three Chinese characters are given the readings "ji", "neng" and "xuan", which are Chinese and not Japanese. (Written Japanese partly uses Chinese characters, but they aren't pronounced anything like Chinese.)
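The effect can be reproduced with a toy per-character lookup. This is only a sketch: the table below is hand-written for this example and is not taken from the data of any real library, and the names CHAR_TABLE, transliterate_per_char and slugify are hypothetical. It shows how a table with one fixed reading per character, applied without looking at neighbouring characters, yields the tail of the URL above.

```python
# Toy illustration only: a context-free, per-character lookup table.
# The entries are hand-written for this example and are NOT taken from
# the data files of any real transliteration library.
CHAR_TABLE = {
    "で": "de", "を": "wo", "べ": "be", "ま": "ma", "せ": "se", "ん": "n",
    # Each kanji carries a single fixed reading (Mandarin pinyin here),
    # followed by a space, so it becomes its own hyphen-separated piece.
    "機": "ji ", "能": "neng ", "選": "xuan ",
}


def transliterate_per_char(text):
    """Map each character independently, ignoring its neighbours."""
    return "".join(CHAR_TABLE.get(ch, "?") for ch in text)


def slugify(text):
    """Lowercase the text and join whitespace-separated pieces with hyphens."""
    return "-".join(text.lower().split())


fragment = "で機能を選べません"
slug = slugify(transliterate_per_char(fragment))
print(slug)  # deji-neng-woxuan-bemasen
```

Because the lookup never sees more than one character at a time, it has no way to tell that 機能 is a single Japanese word with its own reading.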

If this had been tokenised into words and transliterated correctly, it would read something like:
...kasutamu-insutoru-de-kino-wo-erabemasen/

Assuming this is something missing in AskBot itself rather than in its configuration, fixing this problem while retaining the transliteration would require a Japanese morphological analyser. There are obvious candidates for this in C (chasen, mecab) and Java. Unfortunately for AskBot, the only Python solutions I can see appear to involve interfacing with one or other of the C packages, which doesn't seem ideal in a web server environment.
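As a rough sketch of what the slug step could look like once an analyser is available: the (surface, reading) pairs below are hard-coded stand-ins for analyser output, not the result of actually calling chasen or mecab, and slug_from_tokens is a hypothetical helper name.

```python
# Hypothetical analyser output: (surface form, romanised reading) pairs.
# In a real system these would come from a morphological analyser; here
# they are hard-coded purely for illustration.
ANALYSED = [
    ("カスタム", "kasutamu"),
    ("インストール", "insutoru"),
    ("で", "de"),
    ("機能", "kino"),
    ("を", "wo"),
    ("選べません", "erabemasen"),
]


def slug_from_tokens(tokens):
    """Join the per-word readings with hyphens to form a URL slug."""
    return "-".join(reading for _surface, reading in tokens)


# Sanity check: the surface forms reassemble the original title.
title = "".join(surface for surface, _reading in ANALYSED)
assert title == "カスタムインストールで機能を選べません"

print(slug_from_tokens(ANALYSED))  # kasutamu-insutoru-de-kino-wo-erabemasen
```

The slug construction itself is trivial; all of the difficulty lies in producing the word boundaries and readings.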

I don't know a great deal about Python, but feel free to ask me if you need any more help with analysing Japanese text, where I have some prior experience.


Related issues

Related to Infrastructure - Task #414: AskBot improvements (Closed)

History

#1 Updated by Florian Effenberger about 3 years ago

  • Assignee set to Evgeny Fadeev

Not sure if this is a bug or a feature request, adding Evgeny and poking him via e-mail

#2 Updated by Florian Effenberger about 3 years ago

  • Status changed from New to Feedback

The library that causes this is third party, so there might be little that we can do. If there is a chance to fix it on our side, to quote: "Firstly I would need to obtain the list of Japanese names of the Kanji characters."

#3 Updated by Matthew Francis about 3 years ago

Unfortunately it's not that easy. In Chinese, one character most usually has only one pronunciation, so a simple transliteration list will be right most of the time. In Japanese, the correct pronunciation is always highly context dependent: it changes fundamentally based on the surrounding text.

As above, the correct solution involves calling out to some existing Japanese morphological analyser, which uses the Viterbi algorithm and similar statistical techniques to figure out the most probable answer.
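To make the idea concrete, here is a minimal sketch of a Viterbi search over a word lattice. Everything in it is an assumption made for illustration: the LEXICON entries, the categories, and all of the costs are invented, and real analysers use large trained dictionaries rather than a hand-written table. The search picks the cheapest path, where each path pays a word cost plus a connection cost between adjacent categories.

```python
# Toy lexicon: surface -> list of (reading, category, word_cost).
# All entries and costs are invented for illustration.
LEXICON = {
    "で": [("de", "particle", 1)],
    "を": [("wo", "particle", 1)],
    "機能": [("kino", "noun", 2)],
    "機": [("hata", "noun", 4), ("ki", "noun", 4)],
    "能": [("no", "noun", 4)],
    "選べません": [("erabemasen", "verb", 3)],
    "選": [("sen", "noun", 5)],
    "べません": [("bemasen", "verb", 9)],
}
# Connection cost between adjacent categories; unlisted pairs cost 2.
CONNECT = {("noun", "particle"): 0, ("particle", "noun"): 0,
           ("particle", "verb"): 0}
DEFAULT_CONNECT = 2


def viterbi(text):
    """Return the readings along the minimum-cost path through the lattice."""
    # best[i][category] = (cost, backpointer) for the cheapest path that
    # ends at position i with a word of that category.
    best = [dict() for _ in range(len(text) + 1)]
    best[0]["BOS"] = (0, None)
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            for reading, cat, word_cost in LEXICON.get(text[start:end], []):
                for prev_cat, (prev_cost, _) in best[start].items():
                    cost = (prev_cost + word_cost
                            + CONNECT.get((prev_cat, cat), DEFAULT_CONNECT))
                    if cat not in best[end] or cost < best[end][cat][0]:
                        best[end][cat] = (cost, (start, prev_cat, reading))
    if not best[-1]:
        raise ValueError("no segmentation found")
    # Follow the backpointers from the cheapest final state.
    cat = min(best[-1], key=lambda c: best[-1][c][0])
    pos, readings = len(text), []
    while pos > 0:
        _cost, (start, prev_cat, reading) = best[pos][cat]
        readings.append(reading)
        pos, cat = start, prev_cat
    return readings[::-1]


print("-".join(viterbi("で機能を選べません")))  # de-kino-wo-erabemasen
```

With these toy costs the compound 機能 beats the character-by-character split 機 + 能, and 選べません beats 選 + べません, so the readings come out as whole words rather than isolated characters.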

#4 Updated by Florian Effenberger about 3 years ago

Thanks for the fast reply! Let's see if we can find a solution, but it looks like the problem lies outside AskBot, in a third-party library that we cannot fix in the near future.

#5 Updated by Matthew Francis about 3 years ago

Could you possibly point me to where in the code this existing library is linked in?

If I can get an actual AskBot instance running locally I may be able to offer a more specific suggestion on how to proceed, or even do something about it myself...

#6 Updated by Evgeny Fadeev about 3 years ago

Matthew Francis wrote:

Could you possibly point me to where in the code this existing library is linked in?

If I can get an actual AskBot instance running locally I may be able to offer a more specific suggestion on how to proceed, or even do something about it myself...

The library is unidecode (pip install unidecode); you can try it out directly.
I've seen the author state somewhere that the library does not distinguish between Japanese and Chinese.
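A minimal way to try the library on its own, assuming the unidecode package is installed. The romanise wrapper is a hypothetical helper added only so that the snippet still runs when the package is missing; no output is shown here because it depends on the installed version's tables.

```python
# Assumes the third-party package is installed: pip install unidecode
try:
    from unidecode import unidecode
except ImportError:  # keep the sketch runnable without the package
    unidecode = None


def romanise(text):
    """Return unidecode's ASCII rendering, or None if it is unavailable."""
    return unidecode(text) if unidecode is not None else None


# Print what the library produces for the kanji from the example title.
for sample in ("機能", "選"):
    print(sample, "->", repr(romanise(sample)))
```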

#7 Updated by Alexander Werner almost 2 years ago

  • Status changed from Feedback to Rejected

Unidecode does not even attempt to address this.
The amount of work required does not justify the outcome, so unfortunately we have to reject the issue.

#8 Updated by Alexander Werner almost 2 years ago

  • Related to Task #414: AskBot improvements added
