ADR: Format Compliance and Purpose of project_save.tmx and TMX Validation Strategy

Status

Implementation proposed (Date: 2025-07-19)

  • Pull Requests:

    • https://github.com/omegat-org/omegat/pull/1577

  • Related Threads:

    • “[OmTdev] The internal TMX has DOCTYPE and DTD” (2025-07-13 to 14)

    • “[OmTdev] Some inconsistencies in TMX test cases and format handling” (2025-07-18 to 19)

    • “[OmTdev] Clarifying XML Processing in OmegaT and Use of XSD over DTD” (2025-07-15 to 16)

    • Mailing List: SourceForge OmegaT-devel July 2025

Context

OmegaT stores its internal translation memory in a file called project_save.tmx. Historically, this file has included a DOCTYPE declaration for TMX 1.1 (e.g., <!DOCTYPE tmx SYSTEM "tmx11.dtd">). While this implies TMX conformance, the file is:

  • Meant for internal use,

Separately, recent reviews of our test infrastructure revealed that:

  • Many functional test TMX files mix TMX 1.1 and 1.4 constructs,

  • Files that should be invalid are accepted without error,

  • Our TMXReader2 uses StAX, which does not support DTD validation.

These two concerns are interrelated and raise key questions:

  • What TMX version should OmegaT use internally?

  • How should we validate internal and external TMX files?

  • How strict should we be in parsing vs exporting?

Extended Discussion from Development List

Recent discussions have clarified several important technical and architectural points:

Hiroshi’s Technical Clarification (2025-07-15):

  • OmegaT no longer relies on DTDs when parsing and validating TMX files

  • XSD offers more precise validation, better namespace handling, and avoids DTD security issues (e.g., XXE attacks)

  • The <!DOCTYPE> declaration in project_save.tmx can mislead users who open the file in a validating editor such as OxygenXML without a local copy of the DTD

  • OmegaT supports TMX export with DTDs for external tool compatibility but doesn’t use DTD-based validation internally
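The XSD-over-DTD point above can be illustrated with the standard javax.xml.validation API. This is a minimal sketch, not OmegaT's actual code: the class name XsdCheckSketch and the TOY_XSD schema are hypothetical stand-ins (the real TMX schemas are much larger), and the two ACCESS_EXTERNAL_* properties are the JAXP settings commonly used to block external DTD and schema fetching, which is the usual XXE vector.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

final class XsdCheckSketch {
    // Toy schema for illustration only; it is NOT the real TMX schema.
    // It requires a <tmx> root with a "version" attribute and one <body> child.
    static final String TOY_XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='tmx'><xs:complexType>"
        + "<xs:sequence><xs:element name='body' type='xs:string'/></xs:sequence>"
        + "<xs:attribute name='version' type='xs:string' use='required'/>"
        + "</xs:complexType></xs:element></xs:schema>";

    /** Returns true when xml validates against xsd; false on any error. */
    static boolean isValid(String xml, String xsd) {
        try {
            SchemaFactory factory =
                    SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            // Forbid fetching external DTDs and schemas while compiling the XSD.
            factory.setProperty(XMLConstants.ACCESS_EXTERNAL_DTD, "");
            factory.setProperty(XMLConstants.ACCESS_EXTERNAL_SCHEMA, "");
            Schema schema = factory.newSchema(new StreamSource(new StringReader(xsd)));

            Validator validator = schema.newValidator();
            // Apply the same restrictions while validating the document itself.
            validator.setProperty(XMLConstants.ACCESS_EXTERNAL_DTD, "");
            validator.setProperty(XMLConstants.ACCESS_EXTERNAL_SCHEMA, "");
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            // SAXException (invalid or malformed input) or IOException.
            return false;
        }
    }
}
```

Calling XsdCheckSketch.isValid(document, XsdCheckSketch.TOY_XSD) needs no DOCTYPE in the document: the schema is supplied by the caller rather than referenced from the file.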

Philippe’s User Experience Perspective (2025-07-15):

  • Internal handling methods shouldn’t dictate output file format decisions

  • Removing DOCTYPE while keeping TMX format would confuse users examining file content

  • Users won’t interpret missing DOCTYPE as “internal use only” - they expect compliant TMX files

  • Short-term focus should be ensuring schema compliance rather than format changes

Performance and Scalability Concerns (khagaroth, 2025-07-16):

  • Questioned whether XML is appropriate for working data storage, especially for frequently changing content

  • Performance issues with larger projects highlight potential need for alternative storage formats

  • Current XML format doesn’t scale well for bigger projects without splitting

Thomas’s Storage vs. Exchange Perspective (2025-07-16):

  • Emphasized that project_save.tmx is primarily used for storage, not for exchange

  • Need to balance current storage structure with features that might be required in the future

Options

All the options need to consider backward compatibility. We do not know which version of OmegaT a given user is running, so a reasonable number of older versions should still be able to read projects saved by newer versions, and vice versa.

Option A: Treat project_save.tmx as Internal and Drop DOCTYPE

Remove the <!DOCTYPE> declaration from project_save.tmx and clarify in the documentation that the file is for OmegaT’s internal use only and is not necessarily TMX-compliant.

Pros

  • Prevents confusion in validating editors

  • Frees internal format from TMX limitations

  • Easier to evolve for OmegaT-specific needs

  • Aligns with XSD-based processing architecture

Cons

  • Reduces compatibility with external tools

  • Undermines the assumption that project_save.tmx is a reliable backup/exchange file

  • May confuse users who expect TMX compliance from TMX-named files


Option B: Make project_save.tmx Strictly TMX 1.4b Compliant

Update project_save.tmx to fully conform to TMX 1.4b Level 2, using compliant tagging, encoding, and schema validation. PR #1577 implements this option by extending ProjectTMX#save to write TMX 1.4b for individual projects. For team projects, it preserves the existing TMX version.

Pros

  • Promotes interoperability and correctness

  • Simplifies reuse by other tools

  • Aligns with exported -level2.tmx files

  • Meets user expectations for TMX file compliance

Cons

  • Adds validation overhead and migration costs

  • Risks breaking compatibility for existing users/projects

  • TMX 1.4b is not extensible (no <custom> tags, etc.)


Option C: Replace with an Alternative Internal Format (e.g., SQLite)

Use a non-TMX format (e.g., SQLite or custom binary XML) for internal storage. Retain TMX exports for interoperability only.

Pros

  • Supports rich internal metadata and faster saves

  • Clean separation between internal state and exchange formats

  • Enables backward-compatible export options

  • Addresses performance and scalability concerns raised in discussions

Cons

  • Loss of human-readable internal format

  • Tooling and migration burden

  • May require file format version negotiation

  • Significant architectural change requiring careful user communication


Option D: Dual Strategy for TMX Validation

Improve TMX handling by adopting strict schema-based validation for external files and lenient parsing for trusted internal data. Use converted XSD schemas instead of DTDs for validation.

Pros

  • Provides reliable TMX file validation across 1.1 and 1.4b versions

  • Enhances test coverage and correctness (PR #1559)

  • Maintains performance by keeping StAX as the primary parser

  • Aligns with XSD-based architecture

Cons

  • Requires maintenance of XSD schema files

  • Some legacy test files need to be rewritten (this fix has already been made)

  • Developers must be aware of the different validation modes for internal and external files, although the parser method already has an “omegaTMX” flag
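The dual strategy of Option D could be sketched roughly as follows. Everything here is illustrative: DualModeTmxCheck, Mode, and Result are hypothetical names, not OmegaT classes, and the sketch validates from a string rather than wiring into the StAX reader. The idea shown is only that the same schema errors reject an external file in strict mode but are merely collected for a trusted internal file in lenient mode. Input that is not well-formed is rejected in both modes.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

final class DualModeTmxCheck {
    /** Hypothetical modes: strict for external files, lenient for internal data. */
    enum Mode { STRICT_EXTERNAL, LENIENT_INTERNAL }

    static final class Result {
        final boolean accepted;
        final List<String> problems;
        Result(boolean accepted, List<String> problems) {
            this.accepted = accepted;
            this.problems = problems;
        }
    }

    /** Compiles XSD text into a reusable Schema object. */
    static Schema compile(String xsd) {
        try {
            SchemaFactory factory =
                    SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            return factory.newSchema(new StreamSource(new StringReader(xsd)));
        } catch (SAXException e) {
            throw new IllegalArgumentException("schema does not compile", e);
        }
    }

    static Result check(String xml, Schema schema, Mode mode) {
        final List<String> problems = new ArrayList<>();
        Validator validator = schema.newValidator();
        validator.setErrorHandler(new ErrorHandler() {
            @Override public void warning(SAXParseException e) {
                // Warnings never affect acceptance in this sketch.
            }
            @Override public void error(SAXParseException e) {
                // Schema violations are recorded; validation continues.
                problems.add(e.getMessage());
            }
            @Override public void fatalError(SAXParseException e) throws SAXException {
                throw e; // Not well-formed: abort in every mode.
            }
        });
        try {
            validator.validate(new StreamSource(new StringReader(xml)));
        } catch (SAXException | IOException e) {
            problems.add("fatal: " + e.getMessage());
            return new Result(false, problems);
        }
        // Lenient mode tolerates schema violations; strict mode does not.
        boolean accepted = mode == Mode.LENIENT_INTERNAL || problems.isEmpty();
        return new Result(accepted, problems);
    }
}
```

In lenient mode the collected problems remain available to the caller, so they could be logged as warnings without blocking a project from loading.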


Option E: Add XSD Validation for Both Imported TMX Data and Internal Data

Validate project_save.tmx against the TMX 1.1 XSD when reading it.

Pros

  • Leverages existing XSD validation infrastructure (PR #1559)

  • Provides a path for future modernization

  • Reduces editor confusion while maintaining compliance

Cons

  • Questions remain about XML schema references in the <tmx> element


Implementation Notes

  • PR #1559 implements schema-based validation for TMX files using XSDs and demonstrates reading valid TMX without a DOCTYPE

  • Tests like AutoTmxTest and TMXReaderTest currently contain files that:

    • Mix versions (e.g., TMX 1.1 with xml:lang from 1.4)

    • Omit required attributes (e.g., pos on <it>)

    • Pass incorrectly due to the absence of validation

  • OmegaT’s TMXReader2 uses StAX, which cannot validate against a DTD; schema validation is therefore the only reliable approach

  • Development discussions reveal tension between storage efficiency and exchange format expectations
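To make the "mixed versions" problem in the notes above concrete, here is a small sketch of a check that a StAX-based reader could run without any DTD. It is an assumption-laden illustration, not code from OmegaT: the class TmxVersionMixCheck and its method are hypothetical, and it detects only the single case named above (a file declaring version 1.1 whose <tuv> elements carry xml:lang). DTD support is switched off on the factory, so a DOCTYPE in the input is expected to be skipped rather than resolved.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class TmxVersionMixCheck {
    /** True when the root declares version 1.1 but some <tuv> uses xml:lang. */
    static boolean declares11ButUsesXmlLang(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // No DTD processing and no external entities: the DOCTYPE is not resolved.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                String version = null;
                while (reader.hasNext()) {
                    if (reader.next() != XMLStreamConstants.START_ELEMENT) {
                        continue;
                    }
                    String name = reader.getLocalName();
                    if ("tmx".equals(name)) {
                        // A null namespace means "match on the local name only".
                        version = reader.getAttributeValue(null, "version");
                    } else if ("tuv".equals(name) && "1.1".equals(version)
                            && hasXmlLang(reader)) {
                        return true;
                    }
                }
                return false;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    /** Checks the current start element for an attribute named xml:lang. */
    private static boolean hasXmlLang(XMLStreamReader reader) {
        for (int i = 0; i < reader.getAttributeCount(); i++) {
            boolean inXmlNamespace =
                    XMLConstants.XML_NS_URI.equals(reader.getAttributeNamespace(i))
                    || "xml".equals(reader.getAttributePrefix(i));
            if (inXmlNamespace && "lang".equals(reader.getAttributeLocalName(i))) {
                return true;
            }
        }
        return false;
    }
}
```

A check like this is only a heuristic for test data triage; full conformance still requires validating against the schema for the declared version.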


Recommendation Path

Based on extended discussions, we propose a refined staged approach:

  1. Immediate (Technical Debt Reduction)

    • Adopt Option D: Implement XSD-based validation for external TMX files, as proposed in PR #1570

    • Review and correct non-conforming test data (fixed in PR #1569)

    • Document current project_save.tmx behavior and limitations

  2. Short-Term (Consistency and User Experience)

    • Evaluate Option E: validate project_save.tmx with TMX 1.1 XSD when reading

    • Consider Philippe’s concern about user expectations vs. Hiroshi’s technical clarity

    • Gather broader community feedback on format change communication

  3. Medium-Term (Architecture Decision)

    • Decide between continued TMX evolution (Option B) and internal format separation (Option A)

    • Address performance concerns raised by khagaroth through optimization or format change

    • Clarify the storage vs. exchange role of project_save.tmx per Thomas’s observations

  4. Long-Term (Scalability and Performance)

    • Evaluate Option C if storage performance becomes critical

    • Consider hybrid approaches that maintain TMX exports while using efficient internal storage


Community Perspectives Summary

  • Technical Architecture (Hiroshi): Favor XSD over DTD, remove misleading DOCTYPE, align with modern XML processing

  • User Experience (Philippe): Maintain TMX compliance expectations, avoid confusing format changes

  • Performance (khagaroth): Question XML suitability for working storage, consider scalability

  • Purpose Clarity (Thomas): Distinguish storage from exchange usage, acknowledge evolved user expectations


Open Questions

  • Should project_save.tmx always be readable by older OmegaT versions?

  • What level of strictness should be enforced on tm/auto/*.tmx and user-imported memories?

  • Do we want to support TMX 1.2 or 1.3 in practice?

  • Can schema validation be extended to all TMX processing (including project merge, team sync)?

  • How should we communicate format changes to users who rely on project_save.tmx for external processing?

  • Should we include XML schema references in the <tmx xmlns="http://www.lisa.org/tmx14"> element if removing DOCTYPE?


Contributors

  • Hiroshi Miura

  • Jean-Christophe Helary

  • Philippe B.

  • Thomas Cordonnier

  • khagaroth



This ADR consolidates architectural concerns about internal format design, TMX standard conformance, and test validation robustness. It incorporates community perspectives on technical architecture, user experience, performance, and purpose clarity. The document is open for review and intended as a base for community agreement before implementation.