[python] unicode string, check digit and alphabet

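For a UTF-8 encoded string (a byte str), isdigit() and isalpha() can be used on each element of the string (accessed by index) to decide whether it is a digit or a letter. This works exactly as it does in C. However, once the string has been converted to unicode with string.decode('utf-8'), isdigit() and isalpha() should not be used to test individual elements. (It is easy to assume that they work, but on a unicode string they also return True for non-ASCII digits and letters, for example Hangul syllables.)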
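As a quick illustration of why these methods are misleading on unicode strings, here is a minimal sketch. The two sample characters are chosen for illustration and are not from the original note. The snippet uses escaped unicode literals so that it should run on both Python 2.7 and Python 3.

```python
# Sketch: on unicode text, isalpha()/isdigit() accept far more than ASCII.
hangul_ga = u'\uac00'     # HANGUL SYLLABLE GA
arabic_three = u'\u0663'  # ARABIC-INDIC DIGIT THREE

print(hangul_ga.isalpha())     # True, although it is not in A-Z or a-z
print(arabic_three.isdigit())  # True, although it is not in 0-9

# The same characters are not members of explicit ASCII sets.
print(hangul_ga in set(u'abcdefghijklmnopqrstuvwxyz'))  # False
print(arabic_three in set(u'0123456789'))               # False
```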

One alternative is to use re, but it is slow.
A simple approach is the following.

# Python 2.7.x
# Python 2.6.x does not support set comprehensions

DIGITS = {entry.decode('utf-8') for entry in map(chr, range(48, 58))}  # '0'-'9'
ALPHABETS = {entry.decode('utf-8') for entry in map(chr, range(65, 91)) + map(chr, range(97, 123))}  # 'A'-'Z', 'a'-'z'

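string = string.decode('utf-8')  # UTF-8 byte str -> unicode
for ch in string:
    if ch in DIGITS:
        do_something()  # placeholder: handle an ASCII digit
    if ch in ALPHABETS:
        do_something()  # placeholder: handle an ASCII letter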
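
The set-membership approach can also be tried end to end with the sketch below. The helper name count_ascii and the sample text are hypothetical, and the sets are built from unicode literals instead of chr() so that the snippet should run on both Python 2.7 and Python 3.

```python
# ASCII-only lookup sets, built from unicode literals.
DIGITS = set(u'0123456789')
ALPHABETS = set(u'abcdefghijklmnopqrstuvwxyz'
                u'ABCDEFGHIJKLMNOPQRSTUVWXYZ')

def count_ascii(text):
    """Return (digit_count, letter_count) for a unicode string (hypothetical helper)."""
    n_digits = sum(1 for ch in text if ch in DIGITS)
    n_letters = sum(1 for ch in text if ch in ALPHABETS)
    return n_digits, n_letters

# Start from UTF-8 bytes, as in the note, then decode before iterating.
raw = u'Python2.7 \ud55c\uae00'.encode('utf-8')
text = raw.decode('utf-8')
print(count_ascii(text))  # (2, 6): the Hangul, the dot and the space are ignored
```

Set lookup is a hash test per character, which is likely why it is cheaper than running a regular expression for this kind of check.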

<Note>
When matching regular expressions with re, if the target string is a unicode string, the pattern passed to compile must also be converted to unicode. In addition, pass re.UNICODE as a flag when compiling.
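
A minimal sketch of that note, assuming a simple \w+ pattern that is not from the original. Both the pattern and the target are unicode, and re.UNICODE is passed explicitly. On Python 3 this flag is already the default for str patterns, so it is harmless there.

```python
import re

# Unicode pattern compiled with re.UNICODE, so \w also matches non-ASCII letters.
word_re = re.compile(u'\\w+', re.UNICODE)

text = u'abc \ud55c\uae00 123'  # target is a unicode string as well
words = word_re.findall(text)
print(len(words))  # 3: 'abc', the Hangul word, and '123'
```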

Checking for English letters, digits, and symbols

import unicodedata

def is_alnumsymbol(chr_u):
    if chr_u in DIGITS: return True
    if chr_u in ALPHABETS: return True
    # see http://www.unicode.org/reports/tr44/#GC_Values_Table
    # general categories starting with 'S' are symbols, 'P' are punctuation
    if unicodedata.category(chr_u)[0] in ('S', 'P'): return True
    return False
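
To see which general categories the check above relies on, the following sketch prints the category and name of a few characters. The helper name describe and the sample characters are hypothetical.

```python
import unicodedata

def describe(ch):
    """Return (general category, character name) for one unicode character."""
    return unicodedata.category(ch), unicodedata.name(ch, u'<unnamed>')

for ch in [u'A', u'7', u'+', u'$', u'!', u'\uac00']:
    print(describe(ch))
# Categories in order: Lu, Nd, Sm, Sc, Po, Lo
```

Only the first letter of the category is tested in is_alnumsymbol, so Sm (math symbol), Sc (currency symbol) and Po (other punctuation) all pass.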

Checking for Korean (Hangul) characters

import unicodedata

def is_korean_chr(chr_u):
    '''
    Check whether chr_u is a Korean (Hangul) character.
    see http://www.unicode.org/reports/tr44/#GC_Values_Table
    '''
    category = unicodedata.category
    if category(chr_u)[0:2] == 'Lo':  # 'Lo' = Letter, other
        # name() gets a default so unnamed code points do not raise ValueError
        if 'HANGUL' in unicodedata.name(chr_u, ''): return True
    return False
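
A usage sketch for the same test, applied to a whole string. The names is_hangul and extract_hangul and the sample text are hypothetical. The logic restates the check above in compact form so that the snippet is self-contained.

```python
import unicodedata

def is_hangul(ch):
    # Same test as the note: category 'Lo' plus 'HANGUL' in the character name.
    return (unicodedata.category(ch) == 'Lo'
            and 'HANGUL' in unicodedata.name(ch, ''))

def extract_hangul(text):
    """Keep only the Hangul characters of a unicode string (hypothetical helper)."""
    return u''.join(ch for ch in text if is_hangul(ch))

sample = u'abc \ud55c\uae00 \u4e2d\u6587 123'
hangul_only = extract_hangul(sample)
print(len(hangul_only))  # 2: the CJK ideographs are also 'Lo' but are not named HANGUL
```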

encode : unicode string -> UTF-8 byte sequence
decode : UTF-8 byte sequence -> unicode string
ref : https://stackoverflow.com/questions/6224052/what-is-the-difference-between-a-string-and-a-byte-string#31322359
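
A small round-trip sketch of the two directions, using a sample Hangul string that is not from the original note:

```python
text = u'\ud55c\uae00'           # unicode string: 2 characters
data = text.encode('utf-8')      # UTF-8 byte sequence: 6 bytes (3 per syllable)
restored = data.decode('utf-8')  # back to the original unicode string

print(len(text))         # 2
print(len(data))         # 6
print(restored == text)  # True
```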
