[RFC] LTF-B4-17: A 34-bit Unicode Transformation Format with 17.18B codepoints — UTC Proposal + Python package

khoa181101

I've been working on **LTF-B4-17** (LOGOS Transformation Format, Base-4, N=17), a
new fixed-width Unicode encoding I've submitted as a formal UTC proposal.

---

## What is LTF-B4-17?

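A fixed-width 34-bit encoding that extends the 32-bit value range of UTF-32. Every value below 2^32 keeps the same numeric value, so existing UTF-32 code points carry over unchanged: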

| Encoding | Bits/char | Capacity | Efficiency |
|---|---|---|---|
| UTF-8 | 8–32 | 1,114,112 | Variable |
| UTF-16 | 16–32 | 1,114,112 | Variable |
| UTF-32 | 32 | 1,114,112 defined by Unicode (4,294,967,296 raw 32-bit values) | 100% |
| **LTF-B4-17 ★** | **34** | **17,179,869,184** | **100% (zero waste)** |

---

## Key properties

- **4× the 32-bit code unit space of UTF-32**: 17,179,869,184 values (4^17 = 2^34 = 4 × 2^32)
- **Identity encoding**: encode(c) == c, no lookup table needed
- **100% bit efficiency**: Base-4 = 2², each digit maps to exactly 2 bits
- **Full UTF-32 backward compatibility**: lower 32 bits identical to UTF-32
- **+12.88B extension slots**: codepoints 4,294,967,296–17,179,869,183
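
The capacity figures in the list above can be checked with plain integer arithmetic. This is only a minimal sketch; the constant names are illustrative assumptions and are not taken from the `logos_b4n17` package.

```python
# Illustrative constants (names are assumptions, not the package's API).
LTF_CAPACITY = 4 ** 17              # 17 base-4 digits
UTF32_UNIT_SPACE = 2 ** 32          # values a raw 32-bit code unit can hold
UNICODE_CODE_SPACE = 0x10FFFF + 1   # code points in the Unicode code space

assert LTF_CAPACITY == 2 ** 34 == 17_179_869_184
assert LTF_CAPACITY == 4 * UTF32_UNIT_SPACE

# Values above the 32-bit range: 4,294,967,296 through 17,179,869,183.
EXTENSION_SLOTS = LTF_CAPACITY - UTF32_UNIT_SPACE
print(f"{EXTENSION_SLOTS:,}")       # 12,884,901,888
print(f"{UNICODE_CODE_SPACE:,}")    # 1,114,112
```

The 4× ratio is measured against the raw 32-bit unit space. The Unicode code space itself ends at U+10FFFF.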

---

## Formula (verifiable in 3 lines)

```python
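# quaternary(c, 17): divide c by 4 seventeen times (example: c = 0x4E2D)
digits = [(0x4E2D >> (2 * i)) & 3 for i in reversed(range(17))]
# digit map: 0→00 1→01 2→10 3→11
bits = "".join(f"{d:02b}" for d in digits)
# total bits: 17 × 2 = 34
assert len(bits) == 34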
```
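
To make the formula and the identity property concrete, here is a hedged sketch of both directions. The helper names `to_bits` and `from_bits` are hypothetical and are not the package's API; the sketch only assumes the digit map given above, with the most significant digit first.

```python
N_DIGITS = 17            # base-4 digits per code point
N_BITS = 2 * N_DIGITS    # 34 bits in total

def to_bits(c: int) -> str:
    """Return the 34-character bit string for code point c."""
    if not 0 <= c < 4 ** N_DIGITS:
        raise ValueError(f"{c} is outside the 34-bit range")
    # Extract base-4 digits, most significant first, then map each to 2 bits.
    digits = [(c >> (2 * i)) & 3 for i in reversed(range(N_DIGITS))]
    return "".join(f"{d:02b}" for d in digits)

def from_bits(bits: str) -> int:
    """Invert to_bits by reading one base-4 digit (two bits) at a time."""
    if len(bits) != N_BITS or set(bits) - {"0", "1"}:
        raise ValueError("expected a string of exactly 34 binary digits")
    value = 0
    for i in range(0, N_BITS, 2):
        value = value * 4 + int(bits[i:i + 2], 2)
    return value

# The digit-by-digit result equals ordinary zero-padded binary,
# which is why the encoding is an identity on the integer value.
assert to_bits(65) == format(65, "034b")
assert from_bits(to_bits(0x4E2D)) == 0x4E2D
```

Because base 4 is a power of 2, the quaternary expansion and the plain binary representation describe the same 34 bits.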

---

## Python package (pip-installable)

```bash
pip install logos_b4n17-2.1.0-py3-none-any.whl
```

```python
from logos_b4n17 import encode, decode, encode_text, decode_text, zone_of

# Identity encoding: encode(65) == 65
print(encode(65)) # 65
print(decode(65)) # 65

# Text round-trip
blob = encode_text("Hello LTF-B4-17!")
print(decode_text(blob)) # Hello LTF-B4-17!

# Zone info
z = zone_of(0x4E2D) # '中'
print(z) # UNICODE-BMP

# Capacity
from logos_b4n17 import CAPACITY
print(CAPACITY) # 17179869184
```

---

## Performance (x86-64, gcc -O3)

| Operation | LTF-B4-17 | UTF-32 |
|---|---|---|
| Encode single | 67,114 M/s | ~67,000 M/s |
| Decode single | 22,600 M/s | ~22,000 M/s |
| Stream encode | 36,400 M/s | N/A |
| Stream decode | 144,600 M/s | N/A |

Stream overhead vs UTF-32: **+6.25%** (2 extra bits/char, 34/32 = 1.0625), assuming characters are bit-packed with no padding between them.
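
The +6.25% figure only holds if consecutive 34-bit values are packed without gaps. The post does not describe the stream layout, so the following is a sketch under stated assumptions: big-endian bit order, zero padding in the final byte, and hypothetical helper names `pack34` and `unpack34`.

```python
BITS = 34
MASK = (1 << BITS) - 1

def pack34(code_points):
    """Pack 34-bit values back to back, padding the last byte with zeros."""
    acc = 0
    for c in code_points:
        if not 0 <= c <= MASK:
            raise ValueError(f"{c} does not fit in 34 bits")
        acc = (acc << BITS) | c
    nbits = BITS * len(code_points)
    pad = -nbits % 8                      # zero bits needed to reach a byte boundary
    return (acc << pad).to_bytes((nbits + pad) // 8, "big")

def unpack34(blob):
    """Recover the values; padding is under 8 bits, so the count is unambiguous."""
    nbits = len(blob) * 8
    n = nbits // BITS
    acc = int.from_bytes(blob, "big") >> (nbits - BITS * n)
    return [(acc >> (BITS * (n - 1 - i))) & MASK for i in range(n)]

text = "Hello LTF-B4-17!"                 # 16 characters
cps = [ord(ch) for ch in text]
blob = pack34(cps)
assert unpack34(blob) == cps
print(len(blob), len(text.encode("utf-32-le")))   # 68 64
```

For this 16-character string the packed form takes 68 bytes against 64 for UTF-32, which is exactly the 1.0625 ratio. Random access into such a stream needs bit arithmetic, since characters no longer start on byte boundaries.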

---

## Zone map (excerpt; 12 zones in total)

| Zone | Range | Size |
|---|---|---|
| ASCII | 0–127 | 128 |
| UNICODE-BMP | 243–65,535 | 65,293 |
| UNICODE-SMP | 65,536–131,071 | 65,536 |
| UTF-32 full | 0–4,294,967,295 | 4.29B |
| **LTF-B4-17-EXT** | **4,294,967,296–17,179,869,183** | **12.88B new** |
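
A range lookup over the rows listed above could look like the sketch below. It is not the package's `zone_of` implementation, which is not shown in this thread; `zone_of_sketch` is a hypothetical name. Only the rows from the table are included, and the overlapping "UTF-32 full" row is treated as a fallback for values that match no narrower zone.

```python
# Rows copied from the table above; remaining zones are not listed in the post.
ZONES = [
    ("ASCII", 0, 127),
    ("UNICODE-BMP", 243, 65_535),
    ("UNICODE-SMP", 65_536, 131_071),
    ("LTF-B4-17-EXT", 2 ** 32, 2 ** 34 - 1),
]
FALLBACK_ZONE = "UTF-32 full"   # covers 0 through 4,294,967,295

def zone_of_sketch(c: int) -> str:
    """Return the first listed zone containing c, else the UTF-32 fallback."""
    if not 0 <= c < 2 ** 34:
        raise ValueError(f"{c} is outside the 34-bit range")
    for name, low, high in ZONES:
        if low <= c <= high:
            return name
    return FALLBACK_ZONE

print(zone_of_sketch(0x4E2D))    # UNICODE-BMP
print(zone_of_sketch(2 ** 32))   # LTF-B4-17-EXT
```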

---

UTC_Proposal_LTF_B4_17: https://docs.google.com/document/d/...ouid=109480604438012146765&rtpof=true&sd=true

logos_b4n17-2.1.0-py3-none-any: https://drive.google.com/file/d/1Jxgnp8IO9_xLoTJWeH61uZ3GGXxKRks2/view?usp=drive_link

---

## UTC Submission

A formal proposal has been submitted to the Unicode Technical Committee
(UTC) requesting evaluation of LTF-B4-17 as a new Unicode Transformation Format.

- **Document**: UTC Proposal LTF-B4-17, submitted 2026-05-24
- **Author**: HUA VAN ANH KHOA (TAO HUA)
- **Copyright**: © 2026 AXIOM CODE 010. All Rights Reserved

Feedback welcome — especially on stream format design and the backward-compatibility guarantee.
 
Your proposal increases the storage requirement for each character by 2 bits, from 32 to 34.

Since many computer architectures use 32- or 64-bit "words", this would seem to result in a lot of "wasted" storage: for practical reasons, storing 34 bits might actually require a full 64-bit word.

What are the advantages of this system that justify the wasted storage space?
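
To put numbers on the storage concern, here is a small sketch comparing how many bytes a million characters would take under different slot sizes. The slot choices (bit-packed 34, byte-aligned 40, word-aligned 64) are illustrative assumptions, not something the proposal specifies.

```python
def storage_bytes(n_chars: int, slot_bits: int) -> int:
    """Bytes needed when every character occupies slot_bits bits."""
    return (n_chars * slot_bits + 7) // 8   # round up to whole bytes

N = 1_000_000
layouts = {
    "UTF-32 (32-bit slots)": 32,
    "34-bit, bit-packed": 34,
    "34-bit in 5-byte slots": 40,
    "34-bit in 64-bit words": 64,
}
for name, slot_bits in layouts.items():
    print(f"{name:26} {storage_bytes(N, slot_bits):>10,} bytes")
```

Only the bit-packed layout stays near UTF-32 in size (4,250,000 bytes against 4,000,000). Aligning each character to a 64-bit word doubles the footprint to 8,000,000 bytes.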
 
khoa181101

A plain-language abstract would help some readers. It's a good on-ramp even for people like me, whose IT training is a bit rusty.

UTF-32 wastes space, but its fixed width lets processors index any character in constant time, and I'm not sure how 34 bits improves on that. Unicode needs at most 21 bits to encode every character, so UTF-32 already carries a lot of zero padding. Variable-width encodings like UTF-8, which uses one to four bytes per character to save space, are preferred for everyday file storage and web applications. Or has that changed?
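
The size trade-off between these encodings is easy to see with Python's built-in codecs. This is a minimal sketch on an arbitrary sample string; the last line computes what a bit-packed 34-bit layout would need, which is an assumption about how such a stream might be stored.

```python
sample = "Hello, 世界"   # 7 ASCII characters plus 2 CJK characters

# The -le codec variants are used so that no byte order mark is added.
sizes = {name: len(sample.encode(name))
         for name in ("utf-8", "utf-16-le", "utf-32-le")}
print(sizes)   # {'utf-8': 13, 'utf-16-le': 18, 'utf-32-le': 36}

# The highest Unicode code point, U+10FFFF, fits in 21 bits.
print((0x10FFFF).bit_length())   # 21

# Hypothetical bit-packed 34-bit storage for the same 9 characters.
packed_34 = (len(sample) * 34 + 7) // 8
print(packed_34)   # 39
```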
 