Create README.md for Charset (#246) #258

77 changes: 77 additions & 0 deletions euphony/src/main/cpp/core/charset/README.md
@@ -0,0 +1,77 @@
# Charset

## Table of Contents
* [Charset](#Charset)
* [ASCIICharset](#ASCIICharset)
* [DefaultCharset](#DefaultCharset)
* [UTFCharset](#UTFCharset)
  * [UTF8Charset](#UTF8Charset)
  * [UTF16Charset](#UTF16Charset)
  * [UTF32Charset](#UTF32Charset)
* [CharsetAutoSelector](#CharsetAutoSelector)
* [Reference](#Reference)

## Charset
* In Euphony, every character set follows the Charset interface.
* Currently, the Euphony library supports five charsets: ASCIICharset, DefaultCharset, UTF8Charset, UTF16Charset, and UTF32Charset.
* DefaultCharset uses the hex value itself.
* ASCIICharset can express English letters and some symbols.
* In addition, Euphony supports the UTF charsets.


## ASCIICharset
Most readers are already familiar with ASCII codes. Anyone who learned C in their first year of college has seen that each letter corresponds to a number: 'a' is stored as 0x61 (97 in decimal), 'b' as 0x62 (98), and 'c' as 0x63 (99).
ASCII defines 128 characters, numbered 0 to 127, so 7 bits are enough to express all of them. A char is 1 byte, so every ASCII character fits neatly in a single char. That is all you need to know about ASCII here.
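
The mapping between characters and numbers described above can be checked with a few lines of C++. This is only an illustrative sketch; `asciiCode` and `isAscii` are hypothetical helper names and are not part of Euphony's ASCIICharset.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: returns the numeric code of a character,
// e.g. 'a' -> 97 (0x61).
int asciiCode(char c) {
    return static_cast<unsigned char>(c);
}

// Hypothetical helper: returns true when every byte of `text`
// fits in 7 bits (0 to 127), i.e. the text is pure ASCII.
bool isAscii(const std::string& text) {
    for (unsigned char byte : text) {
        if (byte > 0x7F) {
            return false;
        }
    }
    return true;
}
```

With these helpers, `asciiCode('a')` is 97 and `isAscii("abc")` is true, while any byte with its top bit set makes `isAscii` return false.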

## DefaultCharset
DefaultCharset uses the hex value itself, as-is.


## UTFCharset
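UTF is a family of character encodings used for electronic communication and defined by the Unicode Standard. The name is derived from Unicode (or Universal Coded Character Set) Transformation Format. Depending on the form, the encoding is variable-width (such as UTF-8) or fixed-width (such as UTF-32).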

#### Unicode
An industry standard designed to consistently represent and handle all of the world's characters on a computer.

#### Configuration
Unicode consists of the ISO 10646 character set, character encodings, a character information database, algorithms for handling characters, and more.

#### Code Point
The number assigned to a character in the Unicode code table, written as U+ followed by hexadecimal digits (for example, U+0041).

#### Encoding
Rules for mapping code points to binary data.
Unicode mapping methods include UTF (Unicode Transformation Format) and UCS (Universal Coded Character Set).

#### Example
* Text: "A"
  * Code point: U+0041 (00000000 01000001)
  * ASCII encoding: 01000001
  * UTF-8 encoding: 01000001

* Text: "가"
  * Code point: U+AC00 (10101100 00000000)
  * ASCII encoding: none (outside the ASCII range)
  * UTF-8 encoding: 11101010 10110000 10000000
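
The UTF-8 bytes in the example can be reproduced with a small encoder. The following is a minimal sketch of standard UTF-8 encoding for a single code point; `encodeUtf8` is a hypothetical name and does not reflect how Euphony's UTF8Charset is implemented.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-8 bytes.
// Returns an empty string for values that are not valid scalar values
// (surrogates U+D800..U+DFFF or anything above U+10FFFF).
std::string encodeUtf8(std::uint32_t cp) {
    std::string out;
    if (cp <= 0x7F) {
        // 1 byte: 0xxxxxxx (identical to ASCII)
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return out;  // surrogates are not encodable
        }
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Here `encodeUtf8(0x41)` gives the single byte 0x41, and `encodeUtf8(0xAC00)` gives the three bytes EA B0 80, which is the bit pattern 11101010 10110000 10000000 listed for "가".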


### UTF8Charset
* Variable length encoding (1-4 bytes)
* Most widely used
* High compatibility (superset of ASCII)

### UTF16Charset
* Variable length encoding (2 bytes for code points up to U+FFFF, 4 bytes for the rest)
* Easy to implement for 2-byte characters (same as the bit representation of the code point)
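
For code points up to U+FFFF, a UTF-16 code unit has the same bits as the code point itself; code points above that range are split into two 16-bit units (a surrogate pair). A minimal sketch of this rule follows. `encodeUtf16` is a hypothetical name, and the sketch assumes the input is a valid scalar value; it is not taken from Euphony's UTF16Charset.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: encodes one Unicode code point as UTF-16 code units.
// Assumes `cp` is a valid scalar value (not a surrogate, at most U+10FFFF).
std::vector<std::uint16_t> encodeUtf16(std::uint32_t cp) {
    if (cp <= 0xFFFF) {
        // Single unit: same bit pattern as the code point.
        return {static_cast<std::uint16_t>(cp)};
    }
    // Subtract 0x10000 to leave 20 bits, then split them 10 + 10.
    const std::uint32_t v = cp - 0x10000;
    return {
        static_cast<std::uint16_t>(0xD800 | (v >> 10)),    // high surrogate
        static_cast<std::uint16_t>(0xDC00 | (v & 0x3FF)),  // low surrogate
    };
}
```

For instance, U+AC00 becomes the single unit 0xAC00, while U+1F600 becomes the pair 0xD83D 0xDE00.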

### UTF32Charset
* Fixed length encoding (4 bytes)
* Easy to compare with Unicode (each code point is stored directly in a 4-byte unit)
* Simplifies string processing (the size of one character is fixed at 4 bytes)
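
Because every character occupies exactly 4 bytes, encoding is a direct copy of the code point. The sketch below writes the bytes in big-endian order (UTF-32BE); the byte order and the name `encodeUtf32BE` are assumptions made for illustration, not details confirmed by Euphony's UTF32Charset.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper: stores one code point as 4 bytes, most
// significant byte first (big-endian order is an assumption here).
std::array<std::uint8_t, 4> encodeUtf32BE(std::uint32_t cp) {
    return {{
        static_cast<std::uint8_t>((cp >> 24) & 0xFF),
        static_cast<std::uint8_t>((cp >> 16) & 0xFF),
        static_cast<std::uint8_t>((cp >> 8) & 0xFF),
        static_cast<std::uint8_t>(cp & 0xFF),
    }};
}
```

With a fixed width, the n-th character of an encoded string always starts at byte offset 4 * n, which is what makes indexing and length calculation simple.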

## CharsetAutoSelector
A class that returns the charset that produces the shortest encoding result for a given input string.
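
The idea of picking the shortest encoding can be sketched as follows. Everything here is hypothetical: the `Candidate` enum, `encodedSize`, and `selectShortest` are illustrative names, only the three UTF forms are compared, and the input is taken as a list of code points. Euphony's actual CharsetAutoSelector may use a different interface and consider other charsets.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical candidate list; not Euphony's actual types.
enum class Candidate { Utf8, Utf16, Utf32 };

// Encoded size in bytes of `text` (a list of code points) under one scheme.
std::size_t encodedSize(Candidate c, const std::vector<std::uint32_t>& text) {
    std::size_t total = 0;
    for (std::uint32_t cp : text) {
        switch (c) {
            case Candidate::Utf8:
                total += cp <= 0x7F ? 1 : cp <= 0x7FF ? 2 : cp <= 0xFFFF ? 3 : 4;
                break;
            case Candidate::Utf16:
                total += cp <= 0xFFFF ? 2 : 4;
                break;
            case Candidate::Utf32:
                total += 4;
                break;
        }
    }
    return total;
}

// Returns the candidate with the smallest encoded size.
// On a tie, the candidate listed first wins.
Candidate selectShortest(const std::vector<std::uint32_t>& text) {
    const Candidate all[] = {Candidate::Utf8, Candidate::Utf16, Candidate::Utf32};
    Candidate best = all[0];
    for (Candidate c : all) {
        if (encodedSize(c, text) < encodedSize(best, text)) {
            best = c;
        }
    }
    return best;
}
```

Under this sketch, the text "A" (U+0041) selects UTF-8 because it needs 1 byte instead of 2 or 4, while "가" (U+AC00) selects UTF-16 because it needs 2 bytes instead of 3 in UTF-8.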

## Reference
* https://github.com/orgs/euphony-io/discussions/44
* https://github.com/orgs/euphony-io/discussions/58