ADR: Glossary Search and Sort Implementation

Status

Accepted: Fully Implemented (Date: 2026-03-31)

Context

Glossaries in OmegaT provide terminology assistance to translators by matching terms in the source segment with entries in glossary files. These files are typically stored in the project’s glossary folder (or subfolders) in CSV, TSV, or TBX formats.

The implementation needs to handle:

  • Efficiently searching for terms in source segments.

  • Supporting various languages, including those that are not space-delimited (e.g., CJK).

  • Sorting results according to user preferences and linguistic rules.

  • Deduplicating and merging results for display.

  • Handling technical issues like tokenization and potential hash collisions in specific languages.

For a general user-oriented description, see https://omegat.sourceforge.io/manual-snapshot/en/chapter.panes.html#panes.glossary.

Decision

The core logic for glossary searching and result processing is encapsulated in org.omegat.gui.glossary.GlossarySearcher.

1. Search Logic

Searching is performed based on user preferences and the nature of the source language.

  • searchSourceMatches: The primary method for finding glossary hits for a given source segment.

  • Token-based matching: Uses ITokenizer to break both the segment and glossary terms into tokens. Matching is performed using DefaultTokenizer.searchAll.

  • CJK matching: When the source language is not space-delimited (detected via Language.isSpaceDelimited()) and the glossary term is CJK, a substring search (String.contains) is used as a fallback or supplement to token-based matching.
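
The two matching paths above can be sketched as follows. This is a minimal illustration, not the GlossarySearcher code: the class GlossaryMatchSketch and its methods are hypothetical, a regex split stands in for ITokenizer, a boolean parameter stands in for Language.isSpaceDelimited(), and requiring the term tokens to appear as a contiguous run is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class GlossaryMatchSketch {

    // Stand-in for ITokenizer: lower-case, then split on anything
    // that is not a letter or a digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Token-based path: the term's tokens must occur as a contiguous run
    // in the segment's tokens (an assumption of this sketch).
    static boolean tokenMatch(String segment, String term) {
        List<String> termTokens = tokenize(term);
        return !termTokens.isEmpty()
                && Collections.indexOfSubList(tokenize(segment), termTokens) >= 0;
    }

    // Combined decision: try tokens first, then fall back to a plain
    // substring search when the language is not space-delimited.
    static boolean matches(String segment, String term, boolean spaceDelimited) {
        if (tokenMatch(segment, term)) {
            return true;
        }
        return !spaceDelimited && !term.isEmpty() && segment.contains(term);
    }

    public static void main(String[] args) {
        // Japanese term "hon'yaku memori" (translation memory), written with
        // Unicode escapes so the source file stays ASCII-only.
        String term = "\u7FFB\u8A33\u30E1\u30E2\u30EA";
        String segment = "\u3053\u308C\u306F" + term + "\u3067\u3059";
        System.out.println(matches("Open the translation memory.", "Translation Memory", true));
        System.out.println(matches(segment, term, false));
    }
}
```

In this sketch the Japanese segment becomes a single token because it contains no separators, so only the substring fallback finds the term. That is the situation the CJK path is meant to cover.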

2. Sort Order

Results are sorted in sortGlossaryEntries using compareGlossaryEntries with the following precedence:

  1. Priority: Entries from the “writable” glossary or priority glossaries come first. (*4)

  2. Source Term Length: If enabled (GLOSSARY_SORT_BY_SRC_LENGTH), longer source terms are prioritized when terms are similar (i.e., one starts with the other). (*1)

  3. Source Term Alphabetical: Language-dependent sorting using Collator (Primary, Secondary, then Tertiary strength). (*2)

  4. Target Term Length: If enabled (GLOSSARY_SORT_BY_LENGTH), longer target terms are prioritized. (*3)

  5. Target Term Alphabetical: Language-dependent sorting of the target terms.
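
The precedence above can be expressed as a single comparator. The following is a simplified sketch rather than the actual compareGlossaryEntries: the Entry class, the boolean flags (stand-ins for the GLOSSARY_SORT_BY_SRC_LENGTH and GLOSSARY_SORT_BY_LENGTH preferences), and the helper names are all hypothetical.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

final class GlossarySortSketch {

    /** Hypothetical minimal entry; the real entry type carries more fields. */
    static final class Entry {
        final String source;
        final String target;
        final boolean priority;

        Entry(String source, String target, boolean priority) {
            this.source = source;
            this.target = target;
            this.priority = priority;
        }
    }

    // Compare at PRIMARY, then SECONDARY, then TERTIARY strength;
    // the first non-zero result wins. Collator is not thread-safe.
    static int collate(Collator collator, String a, String b) {
        for (int strength : new int[] {Collator.PRIMARY, Collator.SECONDARY, Collator.TERTIARY}) {
            collator.setStrength(strength);
            int c = collator.compare(a, b);
            if (c != 0) {
                return c;
            }
        }
        return 0;
    }

    static Comparator<Entry> comparator(Locale srcLocale, Locale tgtLocale,
            boolean bySrcLength, boolean byTgtLength) {
        Collator srcCollator = Collator.getInstance(srcLocale);
        Collator tgtCollator = Collator.getInstance(tgtLocale);
        return (a, b) -> {
            // 1. Priority entries first.
            if (a.priority != b.priority) {
                return a.priority ? -1 : 1;
            }
            // 2. Longer source first, but only when one term starts with the other.
            if (bySrcLength && a.source.length() != b.source.length()
                    && (a.source.startsWith(b.source) || b.source.startsWith(a.source))) {
                return Integer.compare(b.source.length(), a.source.length());
            }
            // 3. Source term, language-dependent order.
            int c = collate(srcCollator, a.source, b.source);
            if (c != 0) {
                return c;
            }
            // 4. Longer target first.
            if (byTgtLength && a.target.length() != b.target.length()) {
                return Integer.compare(b.target.length(), a.target.length());
            }
            // 5. Target term, language-dependent order.
            return collate(tgtCollator, a.target, b.target);
        };
    }

    // Convenience for experiments: returns "source=target" strings in sorted order.
    static List<String> sorted(List<Entry> entries, boolean bySrcLength, boolean byTgtLength) {
        List<Entry> copy = new ArrayList<>(entries);
        copy.sort(comparator(Locale.ENGLISH, Locale.ENGLISH, bySrcLength, byTgtLength));
        List<String> out = new ArrayList<>();
        for (Entry e : copy) {
            out.add(e.source + "=" + e.target);
        }
        return out;
    }
}
```

With source-length sorting enabled, "translation memory" sorts before "translation" because one term is a prefix of the other. With it disabled, the collator places the shorter term first.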

Notes

  • Note (1): Source term length sorting behavior

    • In OmegaT 6.0.0 and earlier, sorting by source term length was always performed, regardless of similarity.

    • In OmegaT 6.0.1 and later, sorting by source term length is performed only for terms that share the same starting characters (based on String#startsWith).

  • Note (2): Collator usage

    • The use of a collator was introduced in OmegaT 6.1.0.

    • In earlier versions, Java’s natural sort order was used.

  • Note (3): Term length option

    • The term length option was added in OmegaT 6.0.1.

  • Note (4): Glossary priority implementation

    • Glossary priority was implemented in OmegaT 3.0.5. As of 6.1, priority is implemented only for the writable glossary.
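
To illustrate the difference described in Note (2), the sketch below sorts the same terms with Java's natural String order and with a Collator. The class name CollatorOrderSketch and the sample terms are illustrative only. The accented term is written with a Unicode escape ("\u00e9clair").

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class CollatorOrderSketch {

    // Natural order: String.compareTo, which compares UTF-16 code units.
    static List<String> naturalOrder(List<String> terms) {
        List<String> copy = new ArrayList<>(terms);
        copy.sort(null);
        return copy;
    }

    // Language-dependent order using the locale's collation rules.
    static List<String> collatedOrder(List<String> terms, Locale locale) {
        List<String> copy = new ArrayList<>(terms);
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.TERTIARY);
        copy.sort(collator);
        return copy;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("zebra", "\u00e9clair", "apple");
        System.out.println(naturalOrder(terms));                 // accented term last
        System.out.println(collatedOrder(terms, Locale.FRENCH)); // accented term between a and z
    }
}
```

Natural order places the accented term after "zebra" because U+00E9 has a higher code unit value than "z". The collator treats it as a variant of "e".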

3. Filtering and Deduplication

The filterGlossary method processes the sorted list:

  • Exact duplicates: Entries with identical source, target, and comment are removed.

  • Merging: If mergeAltDefinitions is enabled, multiple entries for the same source term are combined into a single display entry. Target terms, comments, and origins are aggregated.
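
A minimal sketch of the two steps, assuming a hypothetical Entry with only source, target, and comment (origins are omitted for brevity). It does not reproduce filterGlossary. In particular, how the real code orders or deduplicates aggregated targets is not specified here, so this sketch simply appends them in input order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class GlossaryFilterSketch {

    /** Hypothetical minimal entry; fields are assumed non-null. */
    static final class Entry {
        final String source;
        final String target;
        final String comment;

        Entry(String source, String target, String comment) {
            this.source = source;
            this.target = target;
            this.comment = comment;
        }
    }

    /** Hypothetical display entry that aggregates several definitions. */
    static final class Merged {
        final String source;
        final List<String> targets = new ArrayList<>();
        final List<String> comments = new ArrayList<>();

        Merged(String source) {
            this.source = source;
        }
    }

    // Drop entries whose source, target, and comment are all identical
    // to an earlier entry; input order is preserved.
    static List<Entry> removeExactDuplicates(List<Entry> sorted) {
        Set<List<String>> seen = new HashSet<>();
        List<Entry> out = new ArrayList<>();
        for (Entry e : sorted) {
            if (seen.add(Arrays.asList(e.source, e.target, e.comment))) {
                out.add(e);
            }
        }
        return out;
    }

    // Combine entries sharing a source term into one display entry,
    // keeping the order in which source terms first appear.
    static List<Merged> mergeBySource(List<Entry> entries) {
        Map<String, Merged> bySource = new LinkedHashMap<>();
        for (Entry e : entries) {
            Merged m = bySource.computeIfAbsent(e.source, Merged::new);
            m.targets.add(e.target);
            if (!e.comment.isEmpty()) {
                m.comments.add(e.comment);
            }
        }
        return new ArrayList<>(bySource.values());
    }
}
```

Because both steps preserve input order, the sort order established earlier carries through to the merged display entries.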

4. Tokenization and Case Sensitivity

  • Normalization: Unicode normalization is applied to all terms.

  • Case Sensitivity: Matching is generally case-insensitive (using toLowerCase with the source language locale). However, keepMatch can enforce “similar case” requirements (e.g., if a glossary term is ALL CAPS, only match if the source is also ALL CAPS).

  • Stemming: Optional light stemming is supported during tokenization if configured in preferences. Some languages (e.g., Italian) can optionally use full stemming for better matches.

  • Tag Exclusion: Tokens falling within OmegaT tags are excluded from matching.
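
The normalization and case rules can be illustrated with standard JDK calls. This is a sketch under assumptions: the text above does not name the normalization form, so NFC is assumed here, and the class CaseMatchSketch and the method similarCaseMatch are hypothetical simplifications of the keepMatch behavior described above.

```java
import java.text.Normalizer;
import java.util.Locale;

final class CaseMatchSketch {

    // Assumption: NFC is used; the actual normalization form may differ.
    static String normalize(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Case-insensitive comparison using the source language locale.
    static boolean equalsIgnoringCase(String a, String b, Locale locale) {
        return normalize(a).toLowerCase(locale).equals(normalize(b).toLowerCase(locale));
    }

    // True if the text contains cased letters and all of them are upper case.
    static boolean isAllCaps(String s, Locale locale) {
        return !s.equals(s.toLowerCase(locale)) && s.equals(s.toUpperCase(locale));
    }

    // "Similar case" rule: an ALL CAPS glossary term only matches
    // source text that is also ALL CAPS.
    static boolean similarCaseMatch(String sourceText, String term, Locale locale) {
        if (!equalsIgnoringCase(sourceText, term, locale)) {
            return false;
        }
        return !isAllCaps(term, locale) || isAllCaps(sourceText, locale);
    }

    public static void main(String[] args) {
        System.out.println(similarCaseMatch("NASA", "NASA", Locale.ENGLISH));
        System.out.println(similarCaseMatch("nasa", "NASA", Locale.ENGLISH));
        // Locale matters: in Turkish, upper-case I lower-cases to a dotless i.
        System.out.println(equalsIgnoringCase("TITLE", "title", Locale.forLanguageTag("tr")));
    }
}
```

The Turkish example shows why lower-casing uses the source language locale instead of a fixed one: the same pair of strings compares equal under English rules and unequal under Turkish rules.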

5. CJK Hash Conflict Detection/Workaround

A specific workaround is implemented in getMatchingTokens to address the high hash collision rates reported for short Japanese strings (Bug #1034).

When a match is found for a CJK term via the tokenizer, an additional “raw match” check (rawMatch) is performed. This check verifies that at least one of the matched tokens is actually a substring of the glossary term. This ensures that the match is linguistically relevant and not just a result of internal tokenizer collisions or overly aggressive stemming/tokenization of short CJK sequences.
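
The idea behind the raw check can be sketched as follows. This is an illustration under assumptions, not the getMatchingTokens code: it assumes tokens are compared by hash code alone, the class RawMatchSketch and its methods are hypothetical, and it uses the well-known String.hashCode collision between "Aa" and "BB" in place of a real Japanese example.

```java
import java.util.ArrayList;
import java.util.List;

final class RawMatchSketch {

    // Assumed for illustration: tokens are considered equal when their hashes are.
    static boolean hashEquals(String a, String b) {
        return a.hashCode() == b.hashCode();
    }

    // Segment tokens whose hash equals the hash of any term token.
    static List<String> hashMatchedTokens(List<String> segmentTokens, List<String> termTokens) {
        List<String> matched = new ArrayList<>();
        for (String s : segmentTokens) {
            for (String t : termTokens) {
                if (hashEquals(s, t)) {
                    matched.add(s);
                    break;
                }
            }
        }
        return matched;
    }

    // Raw check: at least one matched token must literally occur in the term.
    static boolean rawMatch(List<String> matchedTokens, String term) {
        for (String token : matchedTokens) {
            if (!token.isEmpty() && term.contains(token)) {
                return true;
            }
        }
        return false;
    }

    // A hash-based match is accepted only if it also passes the raw check.
    static boolean matches(List<String> segmentTokens, List<String> termTokens, String term) {
        List<String> matched = hashMatchedTokens(segmentTokens, termTokens);
        return !matched.isEmpty() && rawMatch(matched, term);
    }
}
```

In this sketch a segment token "BB" is hash-matched against the term "Aa" because both strings hash to 2112, and the raw check then rejects it because "BB" does not occur in "Aa".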

Consequences

  • The use of Collator ensures that sorting respects language-specific rules (e.g., accents, casing).

  • The CJK-specific logic allows OmegaT to remain effective for languages where traditional word-boundary tokenization is difficult.

  • Merging and filtering logic reduces clutter in the Glossary pane, improving user experience.

  • The “hash conflict” workaround maintains accuracy for Japanese/CJK users while still benefiting from token-based matching for other languages.

References