feat: dict based Hyphenation (#305)

## Summary

* Adds (optional) Hyphenation for English, French, German, Russian
languages

## Additional Context

* Included hyphenation dictionaries add approximately 280kb to the flash
usage (German alone takes 200kb)
* Trie encoded dictionaries are adopted from hypher project
(https://github.com/typst/hypher)
* Soft hyphens (and other explicit hyphens) take precedence over
dict-based hyphenation. Overall, the hyphenation rules are quite
aggressive, as I believe it makes more sense on our smaller screen.

---------

Co-authored-by: Dave Allie <dave@daveallie.com>
This commit is contained in:
Arthur Tazhitdinov
2026-01-19 17:56:26 +05:00
committed by GitHub
parent 5fef99c641
commit 8824c87490
40 changed files with 36465 additions and 52 deletions

View File

@@ -0,0 +1,66 @@
# Hypher Binary Tries
CrossPoint embeds the exact binary automata produced by
[Typst's `hypher`](https://github.com/typst/hypher).
## File layout
Each `.bin` blob is a single self-contained automaton:
```
uint32_t root_addr_be; // big-endian offset of the root node
uint8_t levels[]; // shared "levels" tape (dist/score pairs)
uint8_t nodes[]; // node records packed back-to-back
```
The size of the `levels` tape is implicit. Individual nodes reference slices
inside that tape via 12-bit offsets, so no additional pointers are required.
### Node encoding
Every node starts with a single control byte:
- Bit 7 set when the node stores scores (`levels`).
- Bits 5-6 stride of the target deltas (1, 2, or 3 bytes, big-endian).
- Bits 0-4 transition count (values ≥ 31 spill into an extra byte).
If the `levels` flag is set, two more bytes follow. Together they encode a
12-bit offset into the global `levels` tape and a 4-bit length. Each byte in the
levels tape packs a distance/score pair as `dist * 10 + score`, where `dist`
counts how many UTF-8 bytes we advanced since the previous digit.
After the optional levels header come the transition labels (one byte per edge)
followed by the signed target deltas. Targets are stored as relative offsets
from the current node address. Deltas up to ±128 fit in a single byte, larger
distances grow to 2 or 3 bytes. The runtime walks the transitions with a simple
linear scan and materializes the absolute address by adding the decoded delta
to the current nodes base.
## Embedding blobs into the firmware
The helper script `scripts/generate_hyphenation_trie.py` acts as a thin
wrapper: it reads the hypher-generated `.bin` files, formats them as `constexpr`
byte arrays, and emits headers under
`lib/Epub/Epub/hyphenation/generated/`. Each header defines the raw data plus a
`SerializedHyphenationPatterns` descriptor so the reader can keep the automaton
in flash.
To refresh the firmware assets after updating the `.bin` files, run:
```
./scripts/generate_hyphenation_trie.py \
--input lib/Epub/Epub/hyphenation/tries/en.bin \
--output lib/Epub/Epub/hyphenation/generated/hyph-en.trie.h
./scripts/generate_hyphenation_trie.py \
--input lib/Epub/Epub/hyphenation/tries/fr.bin \
--output lib/Epub/Epub/hyphenation/generated/hyph-fr.trie.h
./scripts/generate_hyphenation_trie.py \
--input lib/Epub/Epub/hyphenation/tries/de.bin \
--output lib/Epub/Epub/hyphenation/generated/hyph-de.trie.h
./scripts/generate_hyphenation_trie.py \
--input lib/Epub/Epub/hyphenation/tries/ru.bin \
--output lib/Epub/Epub/hyphenation/generated/hyph-ru.trie.h
```