Bug #644: ASKLIBREOFFICE: Japanese site transliterates question URLs incorrectly - Infrastructure - The Document Foundation Redmine

Bug #644

closed

ASKLIBREOFFICE: Japanese site transliterates question URLs incorrectly

Added by Matthew Francis over 11 years ago. Updated about 10 years ago.

Status:

Rejected

Priority:

Normal

Assignee:

Evgeny Fadeev

Category:

AskLibO

Description

http://ask.libreoffice.org/ja is transliterating question titles into URLs as though from Chinese rather than Japanese.
Assuming that the point of this is to make a URL that can be read rather than only relying on a unique identifier, the result is jarring and unpleasant to read.

For instance, the question:

カスタムインストールで機能を選べません
becomes:
http://ask.libreoffice.org/ja/question/39107/kasutamuinsutorudeji-neng-woxuan-bemasen/

Within the URL, the three kanji are given the readings "ji", "neng" and "xuan", which are Chinese and not Japanese. (Written Japanese partly uses Chinese characters, but they aren't pronounced anything like Chinese.)
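To illustrate how a context-free, per-character lookup arrives at this result, here is a minimal Python sketch. The table `CHAR_TABLE` and the function `naive_slug` are hypothetical and hand-written to match the URL quoted above; they are not the data or the API of the library AskBot actually uses, and the trailing space after each kanji reading is an assumption chosen to reproduce the hyphen placement.

```python
# Hypothetical per-character table, written by hand for this example only.
# Kana map to romaji; kanji map to a Mandarin-style reading plus a trailing
# space (an assumption that reproduces the hyphens seen in the reported URL).
CHAR_TABLE = {
    "カ": "ka", "ス": "su", "タ": "ta", "ム": "mu", "イ": "i", "ン": "n",
    "ト": "to", "ー": "", "ル": "ru", "で": "de", "を": "wo", "べ": "be",
    "ま": "ma", "せ": "se", "ん": "n",
    "機": "ji ", "能": "neng ", "選": "xuan ",
}


def naive_slug(title):
    """Transliterate one character at a time, with no knowledge of words."""
    romanised = "".join(CHAR_TABLE.get(ch, "") for ch in title)
    # Collapse whitespace into hyphens, as a typical slug function would.
    return "-".join(romanised.lower().split())


print(naive_slug("カスタムインストールで機能を選べません"))
# kasutamuinsutorudeji-neng-woxuan-bemasen
```

Because the lookup never sees more than one character, it cannot tell that 機能 is a single word, nor that 選 begins a verb.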

If this had been tokenised into words and transliterated correctly, it would read something like:
...kasutamu-insutoru-de-kino-wo-erabemasen/

Assuming this is something missing in AskBot itself rather than its configuration, fixing this problem while retaining the transliteration would involve adding a Japanese morphological analyser. There are obvious candidates for this in C/C++ (ChaSen, MeCab) and Java, but unfortunately for AskBot, the only Python solutions I can see appear to involve interfacing with one or other of those native packages, which doesn't seem ideal in a web server environment.

I don't know a great deal about Python, but feel free to ask me if you need any more help with analysing Japanese text, where I have some prior experience.

Related issues 1 (0 open — 1 closed)

Updated by Florian Effenberger over 11 years ago

Assignee set to Evgeny Fadeev

Not sure if this is a bug or a feature request. Adding Evgeny and poking him via e-mail.

Updated by Florian Effenberger over 11 years ago

Status changed from New to Feedback

The library that causes this is third party, so there might be little that we can do. If there is a chance to fix it on our side, to quote: "Firstly I would need to obtain the list of Japanese names of the Kanji characters."

Updated by Matthew Francis over 11 years ago

Unfortunately it's not that easy. In Chinese, one character usually has only one pronunciation, so a simple transliteration list will be right most of the time. In Japanese, the correct pronunciation is highly context dependent and changes fundamentally based on the surrounding text.
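As a small illustration of that context dependence, the sketch below collects the readings one kanji takes in two common words. The word list `WORD_READINGS` and the helper `readings_of` are made up for this example; the readings themselves (選択 "sentaku", 選ぶ "erabu") are standard.

```python
# Hand-written sample: each word is paired with one reading per character.
WORD_READINGS = {
    "選択": ["sen", "taku"],  # 選 takes its Sino-Japanese reading here
    "選ぶ": ["era", "bu"],    # 選 takes its native Japanese reading here
}


def readings_of(char):
    """Collect every reading the character takes across the sample words."""
    found = set()
    for word, parts in WORD_READINGS.items():
        for ch, part in zip(word, parts):
            if ch == char:
                found.add(part)
    return found


print(sorted(readings_of("選")))
# ['era', 'sen']
```

A one-entry-per-character table has to pick a single value for 選, so it is wrong for at least one of these two words whichever value it picks.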

As above, the correct solution involves calling out to some existing Japanese morphological analyser, which will do magic with the Viterbi algorithm and other such mathematical whizzery to figure out the most statistically probable segmentation and readings.
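For readers unfamiliar with the idea, here is a heavily simplified, Viterbi-style sketch in Python. It finds the cheapest way to cover the title with dictionary words and joins their readings with hyphens. The lexicon, its costs, and the function names are all invented for this example; a real analyser uses a large dictionary and also scores the transitions between adjacent words, which this sketch omits.

```python
import math

# Toy lexicon: surface form -> (romanised reading, cost). Lower cost means
# more likely. Every entry and cost here is invented for illustration.
LEXICON = {
    "カスタム": ("kasutamu", 1.0),
    "インストール": ("insutoru", 1.0),
    "で": ("de", 0.5),
    "機能": ("kino", 1.0),
    "機": ("hata", 3.0),   # single-character alternative, deliberately costly
    "能": ("no", 3.0),     # single-character alternative, deliberately costly
    "を": ("wo", 0.5),
    "選べません": ("erabemasen", 1.5),
}


def best_segmentation(text, lexicon):
    """Return the lowest-cost split of text into lexicon words."""
    n = len(text)
    best = [math.inf] * (n + 1)   # best[i] = cheapest cost to cover text[:i]
    back = [None] * (n + 1)       # back[i] = start index of the last word
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            word = text[start:end]
            if word in lexicon:
                cost = best[start] + lexicon[word][1]
                if cost < best[end]:
                    best[end] = cost
                    back[end] = start
    if math.isinf(best[n]):
        raise ValueError("no segmentation covers the whole text")
    # Walk the back-pointers from the end to recover the chosen words.
    words = []
    pos = n
    while pos > 0:
        start = back[pos]
        words.append(text[start:pos])
        pos = start
    words.reverse()
    return words


def slug(text, lexicon):
    """Join the readings of the best segmentation with hyphens."""
    return "-".join(lexicon[w][0] for w in best_segmentation(text, lexicon))


print(slug("カスタムインストールで機能を選べません", LEXICON))
# kasutamu-insutoru-de-kino-wo-erabemasen
```

The compound 機能 wins over the two single-character entries only because its cost is lower; in a real analyser those costs come from statistics over a corpus.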

Updated by Florian Effenberger over 11 years ago

Thanks for the fast reply! Let's see if we can find a solution, but it looks like the problem lies not in AskBot itself but in a third-party library that we cannot fix in the near future.

Updated by Matthew Francis over 11 years ago

Could you possibly point me to where in the code this existing library is linked in?

If I can get an actual AskBot instance running locally I may be able to offer a more specific suggestion on how to proceed, or even do something about it myself...

Updated by Evgeny Fadeev over 11 years ago

Matthew Francis wrote:

Could you possibly point me to where in the code this existing library is linked in?

If I can get an actual AskBot instance running locally I may be able to offer a more specific suggestion on how to proceed, or even do something about it myself...

The library is unidecode (pip install unidecode); you can just try it out there.
I've seen the author state somewhere that the library does not distinguish between Japanese and Chinese.

Updated by Alexander Werner about 10 years ago

Status changed from Feedback to Rejected

Unidecode does not even attempt to address this.
The amount of work required does not justify the outcome, so unfortunately we have to reject the issue.

Updated by Alexander Werner about 10 years ago

Related to Task #414: AskBot improvements added
