NAME
    Lingua::JA::NormalizeText - Text Normalizer

SYNOPSIS
      use Lingua::JA::NormalizeText;
      use utf8;

      my @options = ( qw/nfkc decode_entities/, \&dearinsu_to_desu );
      my $normalizer = Lingua::JA::NormalizeText->new(@options);

      print $normalizer->normalize('鳥が㌧㌦でありんす&hearts;');
      # -> 鳥がトンドルです♥

      sub dearinsu_to_desu
      {
          my $text = shift;
          $text =~ s/でありんす/です/g;

          return $text;
      }

      # or

      use Lingua::JA::NormalizeText qw/old2new_kanji/;
      use utf8;

      print old2new_kanji('惡の華');
      # -> 悪の華

DESCRIPTION
    Lingua::JA::NormalizeText normalizes text.

METHODS
  new(@options)
    Creates a new Lingua::JA::NormalizeText instance.

    The following options are available:

      OPTION                 SAMPLE INPUT                  OUTPUT FOR SAMPLE INPUT
      ---------------------  ----------------------------  -----------------------
      lc                     DdD                           ddd
      uc                     DdD                           DDD
      nfkc                   ㌦                            ドル (length: 2)
      nfkd                   ㌦                            ドル (length: 3)
      nfc
      nfd
      decode_entities        &hearts;                      ♥
      strip_html             <em>あ</em>                   あ
      alnum_z2h              ＡＢＣ１２３                  ABC123
      alnum_h2z              ABC123                        ＡＢＣ１２３
      space_z2h
      space_h2z
      katakana_z2h           ハァハァ                      ﾊｧﾊｧ
      katakana_h2z           ｽｰﾊｰｽｰﾊｰ                      スーハースーハー
      katakana2hiragana      パンツ                        ぱんつ
      hiragana2katakana      ぱんつ                        パンツ
      wave2tilde             〜, 〰                        ～
      tilde2wave             ～                            〜
      wavetilde2long         〜, 〰, ～                    ー
      wave2long              〜, 〰                        ー
      tilde2long             ～                            ー
      fullminus2long         －                            ー
      dashes2long            —                             ー
      drawing_lines2long     ─                             ー
      unify_long_repeats     ヴァーーー                    ヴァー
      nl2space               (LF)(CR)(CRLF)                (space)(space)(space)
      unify_nl               (LF)(CR)(CRLF)                \n\n\n
      unify_long_spaces      あ(space)(space)あ            あ(space)あ
      unify_whitespaces      \x{00A0}                      (space)
      trim                   (space)あ(space)あ(space)     あ(space)あ
      ltrim                  (space)あ(space)              あ(space)
      rtrim                  ああ(space)(space)            ああ
      old2new_kana           ゐヰゑヱヸヹ                  いイえエイ\x{3099}エ\x{3099}
      old2new_kanji          亞\x{FA67}鬭                  亜逸闘
      tab2space              (tab)(tab)                    (space)(space)
      remove_controls        あ\x{0000}あ                  ああ
      remove_spaces          (space)あ(space)あ(space)     ああ
      dakuon_normalize       さ\x{3099}                    ざ
      handakuon_normalize    は\x{309A}                    ぱ
      all_dakuon_normalize   さ\x{3099}は\x{309A}          ざぱ

    Options are applied in the order in which they appear in @options.
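    The effect of option order can be illustrated without the module itself. The following is a minimal sketch that uses only core Perl and Unicode::Normalize; apply_in_order and abc_to_x are hypothetical helpers written for this illustration, not part of Lingua::JA::NormalizeText, and the sketch only mimics the "first to last" rule described here.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKC);

# Hypothetical helper (not part of Lingua::JA::NormalizeText):
# applies each code reference to the text, first to last.
sub apply_in_order {
    my ($text, @steps) = @_;
    $text = $_->($text) for @steps;
    return $text;
}

# Hypothetical external function in the style of dearinsu_to_desu.
# It only matches ASCII "ABC".
sub abc_to_x {
    my $text = shift;
    $text =~ s/ABC/x/g;
    return $text;
}

# Wrapper so that NFKC can be passed around as a code reference.
my $nfkc = sub { my $text = shift; return NFKC($text); };

# Full-width "ABC" (U+FF21 U+FF22 U+FF23), written with escapes.
my $input = "\x{FF21}\x{FF22}\x{FF23}";

# NFKC first: the text becomes ASCII "ABC", so abc_to_x matches.
print apply_in_order($input, $nfkc, \&abc_to_x), "\n";    # -> x

# abc_to_x first: nothing matches yet, NFKC then yields "ABC".
print apply_in_order($input, \&abc_to_x, $nfkc), "\n";    # -> ABC
```

    The same input gives different results depending on the order of the steps, which is why the order of the elements of @options matters.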
    The first element of @options is applied first, and the last element is applied last.

    External functions can also be added. (See the dearinsu_to_desu function in the SYNOPSIS section.)

  normalize($text)
    Normalizes $text.

OPTIONS
  dashes2long
    Note that this option does not convert hyphens into the prolonged sound mark (U+30FC).

  drawing_lines2long
    This option converts drawing lines that resemble the prolonged sound mark (U+30FC) in appearance.

  unify_long_spaces
    Note that this option unifies only SPACE (U+0020) and IDEOGRAPHIC SPACE (U+3000).

  remove_controls
    Note that this option does not remove the following characters:

      CHARACTER TABULATION
      LINE FEED
      CARRIAGE RETURN

  remove_spaces
    Note that this option removes only SPACE (U+0020) and IDEOGRAPHIC SPACE (U+3000).

  unify_whitespaces
    This option converts the following characters into SPACE (U+0020):

      LINE TABULATION
      FORM FEED
      NEXT LINE
      NO-BREAK SPACE
      OGHAM SPACE MARK
      MONGOLIAN VOWEL SEPARATOR
      EN QUAD
      EM QUAD
      EN SPACE
      EM SPACE
      THREE-PER-EM SPACE
      FOUR-PER-EM SPACE
      SIX-PER-EM SPACE
      FIGURE SPACE
      PUNCTUATION SPACE
      THIN SPACE
      HAIR SPACE
      LINE SEPARATOR
      PARAGRAPH SEPARATOR
      NARROW NO-BREAK SPACE
      MEDIUM MATHEMATICAL SPACE

    Note that this option does not convert the following characters:

      CHARACTER TABULATION
      LINE FEED
      CARRIAGE RETURN
      IDEOGRAPHIC SPACE

AUTHOR
    pawa <pawapawa@cpan.org>

SEE ALSO
    新旧字体表 (table of new and old kanji forms):
    <http://www.asahi-net.or.jp/~ax2s-kmtn/ref/old_chara.html>

    Lingua::JA::Regular::Unicode

    Lingua::JA::Dakuon

    Lingua::JA::Moji

    Unicode::Normalize

    HTML::Entities

    HTML::Scrubber

LICENSE
    This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.