ADR: Glossary Search and Sort Implementation¶
Status¶
Accepted: Fully Implemented (Date: 2026-03-31)
Pull Requests:
Manual updates: https://github.com/omegat-org/omegat/pull/1849
UI label updates: https://github.com/omegat-org/omegat/pull/1838
Full Stemming: https://github.com/omegat-org/omegat/pull/1660
Refactored sorting: https://github.com/omegat-org/omegat/pull/1424
CJK Hash Conflict Workaround: https://github.com/omegat-org/omegat/pull/1342
Sorting by source length: https://github.com/omegat-org/omegat/pull/1154
Context¶
Glossaries in OmegaT provide terminology assistance to translators by matching terms in the source segment with entries in glossary files. These files are typically stored in the project’s glossary folder (or subfolders) in CSV, TSV, or TBX formats.
The implementation needs to handle:
Efficiently searching for terms in source segments.
Supporting various languages, including those that are not space-delimited (e.g., CJK).
Sorting results according to user preferences and linguistic rules.
Deduplicating and merging results for display.
Handling technical issues like tokenization and potential hash collisions in specific languages.
For a general user-oriented description, see https://omegat.sourceforge.io/manual-snapshot/en/chapter.panes.html#panes.glossary.
Decision¶
The core logic for glossary searching and result processing is encapsulated in org.omegat.gui.glossary.GlossarySearcher.
1. Search Logic¶
Searching is performed based on user preferences and the nature of the source language.
searchSourceMatches: The primary method for finding glossary hits for a given source segment.
Token-based matching: Uses ITokenizer to break both the segment and glossary terms into tokens. Matching is performed using DefaultTokenizer.searchAll.
CJK matching: For languages that are not space-delimited (detected via Language.isSpaceDelimited()), a substring search (String.contains) is used as a fallback or supplement if the term is CJK.
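The two matching paths can be illustrated with a minimal, self-contained sketch. It does not use OmegaT's classes: the tokenize helper is a hypothetical stand-in for an ITokenizer, the requirement that term tokens appear as a contiguous run is an assumption made for the example, and isCjk is a simplified script check. The spaceDelimited flag plays the role of the Language.isSpaceDelimited() result.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class GlossaryMatchSketch {

    // Hypothetical stand-in for an ITokenizer: lower-case the text, then
    // split on anything that is not a letter or a digit.
    static List<String> tokenize(String text, Locale locale) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(locale).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Token-based match. Assumption for this sketch: the term's tokens must
    // appear as a contiguous run inside the segment's tokens.
    static boolean tokenMatch(String segment, String term, Locale locale) {
        List<String> needle = tokenize(term, locale);
        return !needle.isEmpty()
                && Collections.indexOfSubList(tokenize(segment, locale), needle) >= 0;
    }

    // Simplified CJK check: true if any character is Han, Hiragana,
    // Katakana or Hangul.
    static boolean isCjk(String term) {
        return term.codePoints().anyMatch(cp -> {
            Character.UnicodeScript s = Character.UnicodeScript.of(cp);
            return s == Character.UnicodeScript.HAN
                    || s == Character.UnicodeScript.HIRAGANA
                    || s == Character.UnicodeScript.KATAKANA
                    || s == Character.UnicodeScript.HANGUL;
        });
    }

    // Token match first; plain substring search only as a fallback for CJK
    // terms in languages that are not space-delimited.
    static boolean matches(String segment, String term, Locale locale, boolean spaceDelimited) {
        if (tokenMatch(segment, term, locale)) {
            return true;
        }
        return !spaceDelimited && isCjk(term) && segment.contains(term);
    }
}
```

In this sketch an unsegmented Japanese sentence becomes a single token, so a shorter term never matches on tokens alone; the substring fallback is what finds it. For a space-delimited language the fallback is skipped, so a partial word such as "gloss" does not match "glossary".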
2. Sort Order¶
Results are sorted in sortGlossaryEntries using compareGlossaryEntries with the following precedence:
Priority: Entries from the “writable” glossary or priority glossaries come first. (*4)
Source Term Length: If enabled (GLOSSARY_SORT_BY_SRC_LENGTH), longer source terms are prioritized when terms are similar (i.e., one starts with the other). (*1)
Source Term Alphabetical: Language-dependent sorting using Collator (Primary, Secondary, then Tertiary strength). (*2)
Target Term Length: If enabled (GLOSSARY_SORT_BY_LENGTH), longer target terms are prioritized. (*3)
Target Term Alphabetical: Language-dependent sorting of the target terms.
Notes¶
Note (1): Source term length sorting behavior
In OmegaT 6.0.0 and earlier, sorting by source term length was always performed, regardless of similarity.
In OmegaT 6.0.1 and later, sorting by source term length is performed only for terms that share the same starting characters (based on String#startsWith).
Note (2): Collator usage
The use of a collator was introduced in OmegaT 6.1.0.
In earlier versions, Java’s natural sort order was used.
Note (3): Term length option
The term length option was added in OmegaT 6.0.1.
Note (4): Glossary priority implementation
The glossary priority was implemented in OmegaT 3.0.5. As of 6.1, priority is only implemented for the writable glossary.
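The precedence described in this section can be sketched as a single comparator. This is an illustration under stated assumptions, not the code of compareGlossaryEntries: the Entry class, the comparator and sortedSources methods, and the boolean flags standing in for the GLOSSARY_SORT_BY_SRC_LENGTH and GLOSSARY_SORT_BY_LENGTH preferences are all hypothetical.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

class GlossarySortSketch {

    // Hypothetical, simplified entry; not OmegaT's GlossaryEntry.
    static final class Entry {
        final String source;
        final String target;
        final boolean priority;

        Entry(String source, String target, boolean priority) {
            this.source = source;
            this.target = target;
            this.priority = priority;
        }
    }

    // One collator per strength, tried in order: primary, secondary, tertiary.
    static Collator[] collators(Locale locale) {
        int[] strengths = {Collator.PRIMARY, Collator.SECONDARY, Collator.TERTIARY};
        Collator[] result = new Collator[strengths.length];
        for (int i = 0; i < strengths.length; i++) {
            result[i] = Collator.getInstance(locale);
            result[i].setStrength(strengths[i]);
        }
        return result;
    }

    static int compareText(Collator[] byStrength, String a, String b) {
        for (Collator c : byStrength) {
            int r = c.compare(a, b);
            if (r != 0) {
                return r;
            }
        }
        return 0;
    }

    static Comparator<Entry> comparator(Locale srcLocale, Locale tgtLocale,
            boolean bySrcLength, boolean byTgtLength) {
        Collator[] src = collators(srcLocale);
        Collator[] tgt = collators(tgtLocale);
        return (a, b) -> {
            // 1. Priority entries come first.
            if (a.priority != b.priority) {
                return a.priority ? -1 : 1;
            }
            // 2. Longer source term first, but only when one term starts with the other.
            if (bySrcLength && a.source.length() != b.source.length()
                    && (a.source.startsWith(b.source) || b.source.startsWith(a.source))) {
                return Integer.compare(b.source.length(), a.source.length());
            }
            // 3. Language-aware order of the source terms.
            int r = compareText(src, a.source, b.source);
            if (r != 0) {
                return r;
            }
            // 4. Longer target term first, if enabled.
            if (byTgtLength && a.target.length() != b.target.length()) {
                return Integer.compare(b.target.length(), a.target.length());
            }
            // 5. Language-aware order of the target terms.
            return compareText(tgt, a.target, b.target);
        };
    }

    // Returns the source terms in sorted order (English locales assumed for the demo).
    static List<String> sortedSources(List<Entry> entries, boolean bySrcLength) {
        List<Entry> copy = new ArrayList<>(entries);
        copy.sort(comparator(Locale.ENGLISH, Locale.ENGLISH, bySrcLength, true));
        List<String> sources = new ArrayList<>();
        for (Entry e : copy) {
            sources.add(e.source);
        }
        return sources;
    }
}
```

With source-length sorting enabled, "terminal" is placed before "term" because one starts with the other, while unrelated terms such as "apple" are still ordered by the collator. With the option disabled, the collator alone decides and "term" precedes "terminal".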
3. Filtering and Deduplication¶
The filterGlossary method processes the sorted list:
Exact duplicates: Entries with identical source, target, and comment are removed.
Merging: If mergeAltDefinitions is enabled, multiple entries for the same source term are combined into a single display entry. Target terms, comments, and origins are aggregated.
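The two filtering steps can be sketched as follows. The Entry class and both method names are hypothetical, and the merge step aggregates only target terms to keep the example short; comments and origins would be collected in the same way.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Objects;

class GlossaryFilterSketch {

    // Hypothetical, simplified entry; equality covers source, target and comment.
    static final class Entry {
        final String source;
        final String target;
        final String comment;

        Entry(String source, String target, String comment) {
            this.source = source;
            this.target = target;
            this.comment = comment;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Entry)) {
                return false;
            }
            Entry e = (Entry) o;
            return source.equals(e.source) && target.equals(e.target)
                    && comment.equals(e.comment);
        }

        @Override
        public int hashCode() {
            return Objects.hash(source, target, comment);
        }
    }

    // Drops exact duplicates while keeping the order of first occurrence.
    static List<Entry> removeExactDuplicates(List<Entry> sorted) {
        return new ArrayList<>(new LinkedHashSet<>(sorted));
    }

    // Groups target terms by source term, preserving the sorted order.
    static Map<String, List<String>> mergeTargets(List<Entry> entries) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (Entry e : entries) {
            merged.computeIfAbsent(e.source, k -> new ArrayList<>()).add(e.target);
        }
        return merged;
    }
}
```

Using insertion-ordered collections means the order produced by the sorting step is carried through to the merged result.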
4. Tokenization and Case Sensitivity¶
Normalization: Unicode normalization is applied to all terms.
Case Sensitivity: Matching is generally case-insensitive (using toLowerCase with the source language locale). However, keepMatch can enforce “similar case” requirements (e.g., if a glossary term is ALL CAPS, only match if the source is also ALL CAPS).
Stemming: Optional light stemming is supported during tokenization if configured in preferences. Some languages (e.g., Italian) optionally support full stemming for better matching.
Tag Exclusion: Tokens falling within OmegaT tags are excluded from matching.
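The normalization and case rules can be illustrated with a small sketch. It is not the keepMatch implementation: the class, the method names, and the requireSimilarCase flag are hypothetical, and the choice of NFC as the normalization form is an assumption made for the example.

```java
import java.text.Normalizer;
import java.util.Locale;

class CaseMatchSketch {

    // Assumption: NFC is used here; the form OmegaT applies may differ.
    static String normalize(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // True if the text has cased characters and all of them are upper case.
    static boolean isAllCaps(String s, Locale locale) {
        return s.equals(s.toUpperCase(locale)) && !s.equals(s.toLowerCase(locale));
    }

    // Case-insensitive comparison in the source locale. When similar case is
    // required, an ALL CAPS glossary term only matches ALL CAPS source text.
    static boolean matches(String termToken, String sourceToken, Locale locale,
            boolean requireSimilarCase) {
        String t = normalize(termToken);
        String s = normalize(sourceToken);
        if (!t.toLowerCase(locale).equals(s.toLowerCase(locale))) {
            return false;
        }
        if (requireSimilarCase && isAllCaps(t, locale)) {
            return isAllCaps(s, locale);
        }
        return true;
    }
}
```

Passing the source locale to toLowerCase matters for languages with special casing rules, such as Turkish, where the lower-case form of "I" is not "i". Normalizing first lets a precomposed "é" match the same letter written as "e" plus a combining accent.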
5. CJK Hash Conflict Detection/Workaround¶
A specific workaround is implemented in getMatchingTokens to address the high hash collision rates reported for short Japanese strings (Bug #1034).
When a match is found for a CJK term via the tokenizer, an additional “raw match” check (rawMatch) is performed.
This check verifies that at least one of the matched tokens is actually a substring of the glossary term.
This ensures that the match is linguistically relevant and not just a result of internal tokenizer collisions or overly aggressive stemming/tokenization of short CJK sequences.
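The idea behind the raw match check can be sketched as follows. The Span class and the method signature are hypothetical; the sketch only shows the rule stated above, namely that a tokenizer match is kept when the surface text of at least one matched token is a substring of the glossary term.

```java
class RawMatchSketch {

    // Hypothetical token position inside the source segment.
    static final class Span {
        final int offset;
        final int length;

        Span(int offset, int length) {
            this.offset = offset;
            this.length = length;
        }
    }

    // Returns true if the text covered by at least one matched token
    // literally occurs inside the glossary term.
    static boolean rawMatch(String segment, Span[] matchedTokens, String term) {
        for (Span span : matchedTokens) {
            String surface = segment.substring(span.offset, span.offset + span.length);
            if (term.contains(surface)) {
                return true;
            }
        }
        return false;
    }
}
```

For example, if the tokenizer reports the first two characters of a segment as the match for a term whose text does not contain those two characters, rawMatch returns false and the hit is discarded as a spurious tokenizer match.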
Consequences¶
The use of Collator ensures that sorting respects language-specific rules (e.g., accents, casing).
The CJK-specific logic allows OmegaT to remain effective for languages where traditional word-boundary tokenization is difficult.
Merging and filtering logic reduces clutter in the Glossary pane, improving user experience.
The “hash conflict” workaround maintains accuracy for Japanese/CJK users while still benefiting from token-based matching for other languages.
References¶
Priority glossary
Feature request: #853 “A priority glossary”
User list discussion thread: “A priority glossary - potentially useful new feature”
Implemented in commit f1f9caf1 by Alex Buloichik <alex***@***com> on 9/23/13 at 12:05 PM