7.1 텍스트 문자열 · 초보 웹 프로그래머

문자열 텍스트 데이터에 사용되는 유니코드 문자의 시퀀스
바이트와 바이트 배열 이진 데이터에 사용되는 8비트 정수의 시퀀스
7.1.1 유니코드

유니코드 문자열
```
def unicode_test(value):
import unicodedata
name = unicodedata.name(value)
value2 = unicodedata.lookup(name)
print('values"%s", name="%s", valeu2="%s"' % (value, name, value2))
```
```
unicode_test('A')
```
values”A”, name=”LATIN CAPITAL LETTER A”, valeu2=”A”
```
unicode_test('$')
```
values”$”, name=”DOLLAR SIGN”, valeu2=”$”
```
unicode_test('\u00a2')
```
values”¢”, name=”CENT SIGN”, valeu2=”¢”
```
unicode_test('\u20ac')
```
values”€”, name=”EURO SIGN”, valeu2=”€”
```
unicode_test('\u2603')
```
values”☃”, name=”SNOWMAN”, valeu2=”☃”
```
place = 'caf\u00e9'
```
```
place
```
‘café’
```
place2 = 'caf\N{LATIN SMALL LETTER E WITH ACUTE}'
```
```
place2
```
‘café’
```
u_umlaut = '\N{LATIN SMALL LETTER U WITH DIAERESIS}'
```
```
u_umlaut
```
‘ü’
```
len('$')
```
1
문자열 len함수는 유니코드의 바이트가 아닌 문자수를 센다.
```
len('\U0001f47b')
```
1

UTF-8 인코딩과 디코딩
문자열을 바이트로 인코딩
바이트를 문자열로 디코딩
utf-8은 파이썬, 리눅스, HTML의 표준 텍스트 인코딩이다. UTF-8은 빠르고 완전하고 잘 동작함.
인코딩
문자열을 바이트로 인코딩 해보자. 문자열 encode() 함수의 첫번째 인자는 인코딩 이름이다.
```
snowman = '\u2603'
```
```
len(snowman)
```
1
```
ds = snowman.encode('utf-8')
```
```
ds
```
b’\xe2\x98\x83’
```
len(ds)
```
3

아스키 인코딩을 사용할 때, 유니코드가 문자가 유효한 아스키 문자가 아닌 경우 실패한다.

ds = snowman.encode('ascii')

  UnicodeEncodeError Traceback (most recent call last)
  <ipython-input-22-5f5dc414d940> in <module>()
  ----> 1 ds = snowman.encode('ascii')

  UnicodeEncodeError: 'ascii' codec can't encode character '\u2603' in position 0: ordinal not in range(128)

snowman.encode('ascii', 'ignore')

b’’

snowman.encode('ascii', 'replace')

b’?’

snowman.encode('ascii', 'backslashreplace')

b’\u2603’

유니코드 이스케이프 처럼 파이썬 유니코드 문자의 문자열을 만든다.
snowman.encode('ascii', 'xmlcharrefreplace')
b’☃’ 유니코드 이스케이프 시퀀스를 출력 할 수 있는 문자열로 만듬

디코딩
외부 소스에서 텍스트를 얻을 때 마다 바이트 문자열로 인코딩 되어 있다. 실제로 사용된 인코딩을 알기위해, 인코딩 과정을 거꾸로 하여 유니코드 문자열을 얻을 수 있따.
place = 'caf\u00e9'
place
‘café’
type(place)
str
place_bytes = place.encode('utf-8')
place_bytes
b’caf\xc3\xa9’
type(place_bytes)
bytes
len(place_bytes)
5
place2 = place_bytes.decode('utf-8')
place2
‘café’
place3 = place_bytes.decode('ascii')

UnicodeDecodeError Traceback (most recent call last)
<ipython-input-34-b2f66e476fde> in <module>()
----> 1 place3 = place_bytes.decode('ascii')

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)

place4 = place_bytes.decode('latin-1')
place4

‘cafÃ©’

place5 = place_bytes.decode('windows-1252')
place5

‘cafÃ©’

가능하면 UTF-8을 써라
7.1.2 포맷
문자열 출력 관련

고전 스타일의 %

'%s' % 42

‘42’

'%s' % 7.03

‘7.03’

'%d%%' % 100

‘100%’

wow = 'World'
'Hello %s' % wow

‘Hello World’

신상 스타일의 {}와 format

새로운 스타일의 포매팅을 사용 하는 것을 추천

'{} {} {}'.format('이름', '심볼', 'temp')

‘이름 심볼 temp’

'{2} {0} {1}'.format('이름', '심볼', 'temp')

‘temp 이름 심볼’

'{name} {symbol} {temp}'.format(name='이름', symbol='심볼', temp='temp')

‘이름 심볼 temp’

d = {'n': 42, 'f': 7.03, 's': 'string cheese'}

'{0[n]} {0[f]} {0[s]} {1}'.format(d, 'other')

‘42 7.03 string cheese other’

n = 42
f = 7.03
s = 'cheese'
'{0:d} {1:f} {2:s}'.format(n, f, s)

‘42 7.030000 cheese’

'{n:d} {f:f} {s:s}'.format(n=42, f=7.03, s='cheese')

‘42 7.030000 cheese’

'{0:10d} {1:10f} {2:10s}'.format(n, f, s)

’ 42 7.030000 cheese ‘

'{0:>10d} {1:>10f} {2:>10s}'.format(n, f, s)

‘42 7.030000 cheese ‘

'{0:<10d} {1:<10f} {2:<10s}'.format(n, f, s)

‘42 7.030000 cheese ‘

'{0:^10d} {1:^10f} {2:^10s}'.format(n, f, s)

’ 42 7.030000 cheese ‘

'{0:^10.4d} {1:^10.4f} {2:^10.4s}'.format(n, f, s)

  ValueError Traceback (most recent call last)
  <ipython-input-64-fd7489b930c8> in <module>()
  ----> 1 '{0:^10.4d} {1:^10.4f} {2:^10.4s}'.format(n, f, s)

  ValueError: Precision not allowed in integer format specifier

'{0:^10d} {1:^10.4f} {2:^10.4s}'.format(n, f, s)

’ 42 7.0300 chee ‘

'{0:!^20s}'.format('BIG SALE')

’!!!!!!BIG SALE!!!!!!’

IntroducingPython

7.1.1 유니코드

유니코드 문자열

UTF-8 인코딩과 디코딩

인코딩

디코딩

7.1.2 포맷

Comments

Other Posts