python识别中文、日文、韩文

import imp

imp.reload(sys)

s=”””

en: Regular expression is a powerful tool for manipulating text.

zh: Chinese is the world’s most beautiful language, the regular
expression is a useful tool

jp: Existing regular expression ni ni wa very tsu stand zu su Suites
Hikaru Te ki wo su ru ko と operation desu.

jp-char: あアいイうウえエおオ

kr:정규 표현식은 매우 유용한 도구 텍스트를 조작하는 것입니다.

“””

print ( “original utf8 character”)

#utf8

print (“——–” )

print( repr(s) )

print( “——–n” )

Original utf8 character

‘ N en: Regular expression is a powerful tool for manipulating text
n zh:. Chinese is the world’s most beautiful language, regular
expressions are a useful tool in the n jp: regular expression wa very
ni Battle ni Li-tsu tsu Hikaru Te ki su Suites wo su ru ko と operation
desu. N jp-char: thou アイいえ Ester お Dow Bio u n kr: 정규
표현식 은 매우 유용한 도구 텍스트 를 조작 하는 것 입니다 n ‘.

Non-ansi

re_words=re.compile(r”[x80-xff]+”)

#m = re_words.search(s,0)

m1=re.findall(re_words, s)

print ( “non-ansi characters”)

print (“——–” )

print (m1 )

#print (m.group() )

print (“——–n” )

Non-ansi characters

[]

Chinese

re_words = re.compile(u”[u4e00-u9fa5]+”)

#m = re_words.search(s)

m1=re.findall(re_words, s)

#print(‘’.join(m1))

print ( “unicode Chinese”)

print(m1)

print( “——–” )

unicode Chinese

[ ‘Chinese is the world’s most beautiful language,’ ‘Regular
expressions are a very useful tool’, ‘regular expression’, ‘very’,
‘labor’, ‘stand’, ‘Operation’]

Korean

#unicode korean

re_words=re.compile(u”[uac00-ud7ff]+”)

#m = re_words.search(s,0)

m1=re.findall(re_words, s)

print ( “unicode Korean”)

print(m1)

print( “——–n” )

unicode Korean

[‘정규’, ‘표현식은’, ‘매우’, ‘유용한’, ‘도구’, ‘텍스트를’,
‘조작하는’, ‘것입니다’]

Japanese Katakana

#unicode japanese katakana

re_words=re.compile(u”[u30a0-u30ff]+”)

#m = re_words.search(s,0)

m1=re.findall(re_words, s)

print ( “unicode Japanese Katakana”)

print (“——–” )

print(m1)

print( “——–n” )

unicode Japanese Katakana

[‘ツールテキスト’, ‘ア’, ‘イ’, ‘ウ’, ‘エ’, ‘オ’]

Japanese Hiragana

#unicode japanese hiragana

re_words=re.compile(u”[u3040-u309f]+”)

#m = re_words.search(s,0)

m1=re.findall(re_words, s)

print ( “unicode Japanese hiragana”)

print (“——–” )

print(m1)

print( “——–n” )

unicode Japanese Hiragana

[‘は’, ‘に’, ‘に’, ‘つ’, ‘を’, ‘することです’, ‘あ’,
‘い’, ‘う’, ‘え’, ‘お’]

完整的日文Unicode列表：

http://www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml

Punctuation

#unicode cjk Punctuation

re_words=re.compile(u”[u3000-u303fufb00-ufffd]+”)

#m = re_words.search(s,0)

m1=re.findall(re_words, s)

print ( “unicode punctuation”)

print (“——–” )

print(m1)

print( “——–n” )

unicode punctuation

[‘，’, ‘。’]

Python

本博客所有文章除特别声明外，均采用 CC BY-SA 4.0 协议，转载请注明出处！

python中lxml模块生成xml文件上一篇

re.sub借助group替换字符串示例下一篇