python识别中文、日文、韩文

import imp

imp.reload(sys)

s=”””

en: Regular expression is a powerful tool for manipulating text.

zh: Chinese is the world’s most beautiful language, the regular
expression is a useful tool

jp: Existing regular expression ni ni wa very tsu stand zu su Suites
Hikaru Te ki wo su ru ko と operation desu.

jp-char: あアいイうウえエおオ

kr:정규 표현식은 매우 유용한 도구 텍스트를 조작하는 것입니다.

“””

print ( “original utf8 character”)

#utf8

print (“——–” )

print( repr(s) )

print( “——–n” )

  • 1

  • 2

  • 3

  • 4

  • 5

  • 6

  • 7

  • 8

  • 9

  • 10

  • 11

  • 12

  • 13

  • 14

  • 15

Original utf8 character


‘ N en: Regular expression is a powerful tool for manipulating text
n zh:. Chinese is the world’s most beautiful language, regular
expressions are a useful tool in the n jp: regular expression wa very
ni Battle ni Li-tsu tsu Hikaru Te ki su Suites wo su ru ko と operation
desu. N jp-char: thou ア イ い え Ester お Dow Bio u n kr: 정규
표현식 은 매우 유용한 도구 텍스트 를 조작 하는 것 입니다 n ‘.


  • 1

  • 2

  • 3

  • 4

Non-ansi

Non-ansi

re_words=re.compile(r”[x80-xff]+”)

#m = re_words.search(s,0)

m1=re.findall(re_words, s)

print ( “non-ansi characters”)

print (“——–” )

print (m1 )

#print (m.group() )

print (“——–n” )

  • 1

  • 2

  • 3

  • 4

  • 5

  • 6

  • 7

  • 8

  • 9

  • 10

  • 11

Non-ansi characters


[]


  • 1

  • 2

  • 3

  • 4

Chinese

re_words = re.compile(u”[u4e00-u9fa5]+”)

#m = re_words.search(s)

m1=re.findall(re_words, s)

#print(‘’.join(m1))

print ( “unicode Chinese”)

print(m1)

print( “——–” )

  • 1

  • 2

  • 3

  • 4

  • 5

  • 6

  • 7

unicode Chinese

[ ‘Chinese is the world’s most beautiful language,’ ‘Regular
expressions are a very useful tool’, ‘regular expression’, ‘very’,
‘labor’, ‘stand’, ‘Operation’]


  • 1

  • 2

  • 3

Korean

#unicode korean

re_words=re.compile(u”[uac00-ud7ff]+”)

#m = re_words.search(s,0)

m1=re.findall(re_words, s)

print ( “unicode Korean”)

print(m1)

print( “——–n” )

  • 1

  • 2

  • 3

  • 4

  • 5

  • 6

  • 7

  • 8

unicode Korean

[‘정규’, ‘표현식은’, ‘매우’, ‘유용한’, ‘도구’, ‘텍스트를’,
‘조작하는’, ‘것입니다’]


  • 1

  • 2

  • 3

Japanese Katakana

#unicode japanese katakana

re_words=re.compile(u”[u30a0-u30ff]+”)

#m = re_words.search(s,0)

m1=re.findall(re_words, s)

print ( “unicode Japanese Katakana”)

print (“——–” )

print(m1)

print( “——–n” )

  • 1

  • 2

  • 3

  • 4

  • 5

  • 6

  • 7

  • 8

  • 9

unicode Japanese Katakana


[‘ツールテキスト’, ‘ア’, ‘イ’, ‘ウ’, ‘エ’, ‘オ’]


  • 1

  • 2

  • 3

  • 4

Japanese Hiragana

#unicode japanese hiragana

re_words=re.compile(u”[u3040-u309f]+”)

#m = re_words.search(s,0)

m1=re.findall(re_words, s)

print ( “unicode Japanese hiragana”)

print (“——–” )

print(m1)

print( “——–n” )

  • 1

  • 2

  • 3

  • 4

  • 5

  • 6

  • 7

  • 8

  • 9

unicode Japanese Hiragana


[‘は’, ‘に’, ‘に’, ‘つ’, ‘を’, ‘することです’, ‘あ’,
‘い’, ‘う’, ‘え’, ‘お’]


完整的日文Unicode列表:

http://www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml

  • 1

  • 2

  • 3

  • 4

Punctuation

#unicode cjk Punctuation

re_words=re.compile(u”[u3000-u303fufb00-ufffd]+”)

#m = re_words.search(s,0)

m1=re.findall(re_words, s)

print ( “unicode punctuation”)

print (“——–” )

print(m1)

print( “——–n” )

  • 1

  • 2

  • 3

  • 4

  • 5

  • 6

  • 7

  • 8

  • 9

  • 10

unicode punctuation


[‘,’, ‘。’]



本博客所有文章除特别声明外,均采用 CC BY-SA 4.0 协议 ,转载请注明出处!