Golang UTF 8 Encoding

Does Golang use UTF-8 encoding?

Yes, Go (Golang) uses UTF-8 encoding for its strings. Here are some details about how Go handles UTF-8:

Key Points

  1. String Representation:
    • In Go, a string is an immutable sequence of bytes. Go source code is defined to be UTF-8, so string literals are stored as UTF-8 encoded bytes, which means strings can hold any valid Unicode character with the encoding handled natively.
  2. String Literals:
    • When you use string literals in Go (like "hello", "こんにちは"), these are stored as UTF-8 encoded bytes. Go’s string type does not store character information directly but rather a sequence of bytes. The interpretation of these bytes as Unicode characters is done based on the UTF-8 encoding.
  3. String Functions:
    • Go’s standard library provides functions that work with UTF-8 encoded strings. For example, the unicode package contains functions for working with Unicode properties, and many other packages handle UTF-8 encoded strings efficiently.
  4. Encoding and Decoding:
    • The unicode/utf8 package in the Go standard library provides functions to encode and decode UTF-8 data, check whether a byte slice is valid UTF-8, and work with code points and their byte representations.
  5. Character Access:
    • You can access individual characters in a string using indexing (e.g., s[0]), but this will give you a byte value. To work with Unicode code points (runes), you can use the range loop, which iterates over the string and provides the Unicode code points:
for _, r := range s {
    fmt.Printf("%c ", r)
}

Example Code-1

Here’s a simple example showing how Go handles UTF-8 encoding:

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    s := "⌘ こんにちは"

    fmt.Println("String:", s)

    fmt.Println("Bytes:")
    for i := 0; i < len(s); i++ {
        fmt.Printf("%x ", s[i])
    }
    fmt.Println()

    fmt.Println("Runes:")
    for _, r := range s {
        fmt.Printf("%c ", r)
    }
    fmt.Println()

    fmt.Println("Rune count:", utf8.RuneCountInString(s))
}

Output

String: ⌘ こんにちは
Bytes:
e2 8c 98 20 e3 81 93 e3 82 93 e3 81 ab e3 81 a1 e3 81 af 
Runes:
⌘  こ ん に ち は 
Rune count: 7

This example shows how UTF-8 encoded bytes are printed, how Unicode runes are iterated over, and how many runes are in the string.

Example Code-2

package main

import "fmt"

func main() {
    const placeOfInterest = `⌘`

    fmt.Printf("plain string: ")
    fmt.Printf("%s", placeOfInterest)
    fmt.Printf("\n")

    // Unicode code point \u2318
    fmt.Printf("quoted string: ")
    fmt.Printf("%+q", placeOfInterest)
    fmt.Printf("\n")

    fmt.Printf("hex bytes: ")
    // e2 8c 98
    for i := 0; i < len(placeOfInterest); i++ {
        fmt.Printf("%x ", placeOfInterest[i])
    }
    fmt.Printf("\n")
}

******************************
Output
******************************
plain string: ⌘
quoted string: "\u2318"
hex bytes: e2 8c 98 

The UTF-8 encoding of the ⌘ symbol (Unicode code point U+2318) works as follows:

  • The Unicode code point of ⌘ is U+2318, i.e. 2318 in hexadecimal.
  • When encoded in UTF-8, this becomes a sequence of 3 bytes: E2 8C 98.

Breakdown of the UTF-8 encoding process

  1. The code point U+2318 falls into the range U+0800 to U+FFFF, which requires three bytes in UTF-8.
  2. The three-byte pattern in UTF-8 is:
    • 1110xxxx 10xxxxxx 10xxxxxx
    • where the x positions are filled with the bits of the Unicode code point.
  3. The binary value of 0x2318 is 0010 0011 0001 1000.
  4. Distributing these 16 bits into the pattern (4 + 6 + 6) gives: 11100010 10001100 10011000.
  5. These binary values correspond to the hexadecimal bytes E2 8C 98.
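The steps above can be sketched as a hand-rolled encoder. `encode3` is a hypothetical helper covering only the three-byte range, cross-checked here against the standard library:

```go
package main

import (
    "fmt"
    "unicode/utf8"
)

// encode3 hand-encodes a code point in the three-byte range
// (U+0800..U+FFFF) using the 1110xxxx 10xxxxxx 10xxxxxx pattern.
func encode3(r rune) [3]byte {
    return [3]byte{
        0xE0 | byte(r>>12),     // 1110xxxx: top 4 bits
        0x80 | byte(r>>6)&0x3F, // 10xxxxxx: middle 6 bits
        0x80 | byte(r)&0x3F,    // 10xxxxxx: low 6 bits
    }
}

func main() {
    b := encode3('⌘') // U+2318
    fmt.Printf("manual: % X\n", b[:]) // E2 8C 98

    // Cross-check against the standard library.
    buf := make([]byte, 3)
    utf8.EncodeRune(buf, '⌘')
    fmt.Printf("stdlib: % X\n", buf) // E2 8C 98
}
```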

Thus, the UTF-8 encoded bytes for the ⌘ symbol are E2 8C 98, which you can include in source text using escape sequences like:

const placeOfInterest = "\xe2\x8c\x98"

This is equivalent to the literal string "⌘" in UTF-8 encoding.
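A minimal check of this equivalence (variable names are illustrative):

```go
package main

import "fmt"

func main() {
    // Three ways to spell the same one-rune string:
    byEscape := "\xe2\x8c\x98" // raw UTF-8 bytes
    byCodePoint := "\u2318"    // Unicode escape
    byLiteral := "⌘"           // the character itself

    fmt.Println(byEscape == byCodePoint, byCodePoint == byLiteral) // true true
}
```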

Low Level Details

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	s := "こんにちは"

	fmt.Println("Character | Unicode (Hex) | Unicode (Binary) | UTF-8 (Hex) | UTF-8 (Binary)")
	fmt.Println("--------------------------------------------------------------------------")

	// Iterate through each rune (character)
	for i, r := range s {
		// Get the Unicode code point in hexadecimal and binary
		unicodeHex := fmt.Sprintf("U+%04X", r)
		unicodeBinary := fmt.Sprintf("%016b", r)

		// Get the UTF-8 encoding in hex and binary
		utf8Bytes := make([]byte, utf8.RuneLen(r))
		utf8.EncodeRune(utf8Bytes, r)

		utf8Hex := ""
		utf8Binary := ""
		for _, b := range utf8Bytes {
			utf8Hex += fmt.Sprintf("%02X ", b)
			utf8Binary += fmt.Sprintf("%08b ", b)
		}

		// Print the Unicode and UTF-8 representations
		fmt.Printf("%d: %c  | %s | %s | %s| %s\n", i, r, unicodeHex, unicodeBinary, utf8Hex, utf8Binary)
	}
}

******************************
Output
******************************
// a) Three-byte pattern: 1110xxxx 10xxxxxx 10xxxxxx

// b) The binary value of 0x3053 is 0011 0000 0101 0011.

// The bits from (b) fill the x positions in (a) to give the UTF-8 binary below.

// Character  | Unicode (Hex) | Unicode (Binary)    | UTF-8 (Hex)  | UTF-8 (Binary)
// --------------------------------------------------------------------------------------------
// 0: こ      | U+3053 	       | 0011 0000 0101 0011 | E3 81 93 	| 11100011 10000001 10010011
// 3: ん      | U+3093 	       | 0011 0000 1001 0011 | E3 82 93 	| 11100011 10000010 10010011
// 6: に      | U+306B 	       | 0011 0000 0110 1011 | E3 81 AB 	| 11100011 10000001 10101011
// 9: ち      | U+3061 	       | 0011 0000 0110 0001 | E3 81 A1 	| 11100011 10000001 10100001
// 12: は     | U+306F 	       | 0011 0000 0110 1111 | E3 81 AF 	| 11100011 10000001 10101111

Unicode and UTF-8 Breakdown:

  1. こ (U+3053)
    • Unicode: U+3053 → binary: 0011 0000 0101 0011 (16-bit)
    • UTF-8 encoding:
      • Binary: 11100011 10000001 10010011 (three bytes)
      • Hex: E3 81 93

Mindblowing! But why did they make it so complex?

The complexity of UTF-8 encoding comes from the need to balance several goals in a global, diverse computing environment. Here’s why it was designed this way:

1. Backward Compatibility with ASCII

  • Problem: In the early days, most computers used 7-bit or 8-bit ASCII (0–127) for text encoding. This was simple but limited to the English alphabet and a few control characters.
  • UTF-8 Solution: UTF-8 was designed to be backward-compatible with ASCII. All characters in the ASCII range (0x00 to 0x7F) are represented in UTF-8 by a single byte with the same value, so existing ASCII-based systems can still read and process UTF-8 without modification. Other encodings, like UTF-16, lack this compatibility.
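A small sketch of this compatibility: the UTF-8 bytes of an all-ASCII string are exactly its ASCII byte values.

```go
package main

import "fmt"

func main() {
    s := "Hello"
    // Every ASCII character encodes to the identical single
    // byte in UTF-8, so the byte values match the ASCII table.
    for i := 0; i < len(s); i++ {
        fmt.Printf("%c = %d\n", s[i], s[i])
    }
}
```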

2. Variable Length Encoding

  • Problem: There are over 1.1 million possible Unicode code points, covering characters from every language and script along with countless symbols. A fixed-length encoding would be either inefficient or unable to represent all of them in a reasonable number of bytes.
  • UTF-8 Solution: UTF-8 uses variable-length encoding, meaning common characters (like those in ASCII) use just 1 byte, while more complex characters require 2, 3, or 4 bytes. This makes it both efficient for common text (like English) and flexible enough to represent all of Unicode.

3. Self-Synchronization

  • Problem: In text streams (such as files or network transmissions), it’s easy to lose track of where one character starts and another ends if each character takes a different number of bytes.
  • UTF-8 Solution: UTF-8 encodings are self-synchronizing. The first byte of a character tells you whether it’s a single-byte or multi-byte character, and how many additional bytes follow. For example:
    • 0xxxxxxx for 1-byte characters (ASCII).
    • 110xxxxx for the start of a 2-byte sequence.
    • 1110xxxx for the start of a 3-byte sequence (like ⌘).
    • Continuation bytes are always 10xxxxxx, allowing the decoder to detect if it’s in the middle of a multi-byte character. This structure makes it easy to recover from errors or byte loss by quickly finding the start of the next valid character.
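This resynchronization can be sketched directly: skip continuation bytes (10xxxxxx) until a start byte appears.

```go
package main

import "fmt"

func main() {
    b := []byte("こんにちは")

    // Pretend we landed at an arbitrary offset mid-stream.
    i := 1 // inside the 3-byte encoding of こ

    // Continuation bytes match 10xxxxxx, i.e. b&0xC0 == 0x80.
    // Skipping them finds the start of the next character.
    for i < len(b) && b[i]&0xC0 == 0x80 {
        i++
    }
    fmt.Printf("resynchronized at byte %d: % X\n", i, b[i:i+3]) // ん = E3 82 93
}
```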

4. Global Language Support

  • Problem: The world needs to represent not just the Latin alphabet, but scripts from every language, including Chinese, Arabic, Devanagari, and many more.
  • UTF-8 Solution: UTF-8 can encode all characters from Unicode, allowing for multilingual text and international applications to be handled by the same system. It was designed to efficiently encode these different scripts while minimizing the storage cost for common characters.

Use in real Life

The UTF-8 encoding (11100011 10000001 10010011, for example) is used for storing and transferring data, not the raw Unicode code points.

Here’s a more detailed explanation of how and why UTF-8 encoding is used in storing and transferring data:

1. UTF-8 for Storage

  • Storage: When you store text in a file, database, or any storage medium, the text is typically stored as UTF-8 encoded bytes, not as Unicode code points. For example, the character こ (U+3053 in Unicode) will be stored as the UTF-8 byte sequence E3 81 93 (or 11100011 10000001 10010011 in binary).
  • This is more space-efficient because UTF-8 is a variable-length encoding. Characters from the ASCII range (0–127) use only one byte, while more complex characters like こ use multiple bytes (3 in this case), making it flexible for a wide range of text.

2. UTF-8 for Data Transmission

  • Transfer over Networks: When text is sent over the internet or transferred between systems, it’s transmitted as a sequence of bytes. UTF-8 is ideal for this because it is byte-oriented and designed to be compact for common characters, while still allowing for the full range of Unicode characters.
  • For example, the character ⌘ (U+2318) is transmitted as E2 8C 98 (or 11100010 10001100 10011000 in binary) over the network.

3. Why UTF-8?

  • Efficiency: UTF-8 is more space-efficient for languages that primarily use ASCII characters (like English), where characters are represented by just 1 byte. For other languages with complex scripts (like Japanese, Arabic, etc.), it uses 2, 3, or 4 bytes per character.
  • Compatibility: Since UTF-8 is backward-compatible with ASCII, any existing system that can handle ASCII text can also handle UTF-8 text, making it a universal solution.
  • Self-Synchronizing: UTF-8’s encoding makes it easy to detect where a character starts and ends in a byte stream, which is critical when transferring data over networks where transmission errors may occur.

Real-world Example

When you store the string "こんにちは" in a file:

  • The string is encoded in UTF-8 as a sequence of bytes:
    • こ: E3 81 93
    • ん: E3 82 93
    • に: E3 81 AB
    • ち: E3 81 A1
    • は: E3 81 AF

These bytes are what gets written to the file. Similarly, when this string is sent over the network, it’s transmitted as this same sequence of bytes.
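A minimal sketch of the file round-trip (using a temporary file; names are illustrative):

```go
package main

import (
    "fmt"
    "os"
)

func main() {
    // Write the string to a temporary file; the bytes on disk
    // are exactly its UTF-8 encoding.
    f, err := os.CreateTemp("", "utf8-demo-*.txt")
    if err != nil {
        panic(err)
    }
    defer os.Remove(f.Name())

    if _, err := f.WriteString("こんにちは"); err != nil {
        panic(err)
    }
    f.Close()

    data, err := os.ReadFile(f.Name())
    if err != nil {
        panic(err)
    }
    fmt.Printf("% X\n", data) // E3 81 93 E3 82 93 E3 81 AB E3 81 A1 E3 81 AF
    fmt.Println(len(data), "bytes for 5 characters") // 15 bytes for 5 characters
}
```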

Summary

While UTF-8 may seem complex, this complexity enables:

  • Compatibility with legacy systems (ASCII).
  • Efficient use of space (common characters are 1 byte).
  • Support for all languages and symbols in the world.
  • Robust error handling and easy recovery from data corruption.
  • UTF-8 is the actual encoding used for storing, transmitting, and handling text data in most modern systems.
  • Unicode code points (like U+3053 for こ) are abstract representations of characters, while UTF-8 bytes (like E3 81 93) are how these characters are encoded into a format suitable for storage and transmission.

It’s a smart balance of efficiency, compatibility, and universality, making it the most widely-used encoding standard today.
