Language Packs & Tokenization
TileTangle treats language behaviour as data. The same engine can validate English crosswords, Hebrew crosswords, emoji anagrams, or half-width Japanese tiles—no code changes required.
Tip: Build dictionaries with the Rust API when you need advanced normalization or custom tokenizers, then reuse the compiled artifacts (FST, DAWG, GADDAG) from Python, WASM, Unity, or Godot.
Normalization & case-folding
DictionaryOptions controls how input words are normalized before being stored. NFC is the default,
but you can opt into NFKC (useful for ligatures and half-width kana) or disable case folding entirely.
use tiletangle_engine::{DictionaryOptions, FstDictionary, NormalizationMode, TokenizerRef};
let opts = DictionaryOptions {
norm: NormalizationMode::NFKC,
case_fold: false,
tokenizer: TokenizerRef::default(),
min_len: None,
max_len: None,
};
let dict = FstDictionary::from_words_opts(words, opts);
With this configuration a tile placed as "パ" matches the dictionary entry "パ".
Reading direction
Set CrosswordRules::reading_dir to ReadingDirection::RTL for scripts like Hebrew or Arabic. The
validation and scoring pipeline now reads horizontal words right-to-left while continuing to render
on a standard grid. WASM/Unity/Godot expose set_reading_direction(rtl) for the same behaviour.
let mut rules = CrosswordRules { free_word_mode: false, ..Default::default() };
rules.reading_dir = ReadingDirection::RTL;
Our regression tests cover Hebrew (שלום) and Arabic (سلام) placements to ensure glyph ordering
remains correct.
Tokenization hooks
Not every language splits neatly into Unicode scalars. Emoji sequences, digraph tiles (QU), or
stacked accents often need custom segmentation. Implement the Tokenizer trait and hand it to the
dictionary via TokenizerRef:
use std::sync::Arc;
use unicode_segmentation::UnicodeSegmentation;
use tiletangle_engine::{Tokenizer, TokenizerRef, DictionaryOptions, DawgDictionary};
struct QuTokenizer;
impl Tokenizer for QuTokenizer {
fn segment<'a>(&self, text: &'a str) -> Vec<String> {
let mut out = Vec::new();
let mut chars = text.chars().peekable();
while let Some(ch) = chars.next() {
if ch == 'q' && chars.peek() == Some(&'u') {
chars.next();
out.push("qu".into());
} else {
out.push(ch.to_string());
}
}
out
}
}
let tokenizer = TokenizerRef::new(Arc::new(QuTokenizer));
let opts = DictionaryOptions {
tokenizer,
..DictionaryOptions::default()
};
let dawgd = DawgDictionary::from_words_opts(vec!["squid".into()], opts);
assert!(dawgd.has_prefix("squ"));
WASM bindings automatically use the default grapheme tokenizer. To reuse a custom tokenizer there,
construct the dictionary in Rust (as above) and load the serialized bytes via
`set_dictionary_from_fst_bytes`.
The same tokenizer automatically flows into `GaddagDictionary` and the move generator, ensuring
prefix walks and cross-checks stay in sync.
## Multi-grapheme & stacked tiles
Tiles may already carry multi-codepoint symbols (`"👨👩👧👦"`, hangul syllables, etc.). Because
words are now reconstructed through the tokenizer, each stack entry is treated as one logical
token—regardless of how many scalars it contains.