python识别中文、日文、韩文
import imp
imp.reload(sys)
s=”””
en: Regular expression is a powerful tool for manipulating text.
zh: Chinese is the world’s most beautiful language, the regular
expression is a useful tool
jp: Existing regular expression ni ni wa very tsu stand zu su Suites
Hikaru Te ki wo su ru ko と operation desu.
jp-char: あアいイうウえエおオ
kr:정규 표현식은 매우 유용한 도구 텍스트를 조작하는 것입니다.
“””
print ( “original utf8 character”)
#utf8
print (“——–” )
print( repr(s) )
print( “——–n” )
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
Original utf8 character
‘ N en: Regular expression is a powerful tool for manipulating text
n zh:. Chinese is the world’s most beautiful language, regular
expressions are a useful tool in the n jp: regular expression wa very
ni Battle ni Li-tsu tsu Hikaru Te ki su Suites wo su ru ko と operation
desu. N jp-char: thou ア イ い え Ester お Dow Bio u n kr: 정규
표현식 은 매우 유용한 도구 텍스트 를 조작 하는 것 입니다 n ‘.
1
2
3
4
Non-ansi
Non-ansi
re_words=re.compile(r”[x80-xff]+”)
#m = re_words.search(s,0)
m1=re.findall(re_words, s)
print ( “non-ansi characters”)
print (“——–” )
print (m1 )
#print (m.group() )
print (“——–n” )
1
2
3
4
5
6
7
8
9
10
11
Non-ansi characters
[]
1
2
3
4
Chinese
re_words = re.compile(u”[u4e00-u9fa5]+”)
#m = re_words.search(s)
m1=re.findall(re_words, s)
#print(‘’.join(m1))
print ( “unicode Chinese”)
print(m1)
print( “——–” )
1
2
3
4
5
6
7
unicode Chinese
[ ‘Chinese is the world’s most beautiful language,’ ‘Regular
expressions are a very useful tool’, ‘regular expression’, ‘very’,
‘labor’, ‘stand’, ‘Operation’]
1
2
3
Korean
#unicode korean
re_words=re.compile(u”[uac00-ud7ff]+”)
#m = re_words.search(s,0)
m1=re.findall(re_words, s)
print ( “unicode Korean”)
print(m1)
print( “——–n” )
1
2
3
4
5
6
7
8
unicode Korean
[‘정규’, ‘표현식은’, ‘매우’, ‘유용한’, ‘도구’, ‘텍스트를’,
‘조작하는’, ‘것입니다’]
1
2
3
Japanese Katakana
#unicode japanese katakana
re_words=re.compile(u”[u30a0-u30ff]+”)
#m = re_words.search(s,0)
m1=re.findall(re_words, s)
print ( “unicode Japanese Katakana”)
print (“——–” )
print(m1)
print( “——–n” )
1
2
3
4
5
6
7
8
9
unicode Japanese Katakana
[‘ツールテキスト’, ‘ア’, ‘イ’, ‘ウ’, ‘エ’, ‘オ’]
1
2
3
4
Japanese Hiragana
#unicode japanese hiragana
re_words=re.compile(u”[u3040-u309f]+”)
#m = re_words.search(s,0)
m1=re.findall(re_words, s)
print ( “unicode Japanese hiragana”)
print (“——–” )
print(m1)
print( “——–n” )
1
2
3
4
5
6
7
8
9
unicode Japanese Hiragana
[‘は’, ‘に’, ‘に’, ‘つ’, ‘を’, ‘することです’, ‘あ’,
‘い’, ‘う’, ‘え’, ‘お’]
完整的日文Unicode列表:
http://www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml
1
2
3
4
Punctuation
#unicode cjk Punctuation
re_words=re.compile(u”[u3000-u303fufb00-ufffd]+”)
#m = re_words.search(s,0)
m1=re.findall(re_words, s)
print ( “unicode punctuation”)
print (“——–” )
print(m1)
print( “——–n” )
1
2
3
4
5
6
7
8
9
10
unicode punctuation
[‘,’, ‘。’]
本博客所有文章除特别声明外,均采用 CC BY-SA 4.0 协议 ,转载请注明出处!