ADR: Format Compliance and Purpose of project_save.tmx and TMX Validation Strategy¶
Status¶
Implementation proposed (Date: 2025-07-19)
Pull Requests: https://github.com/omegat-org/omegat/pull/1577
Related Threads:
“[OmTdev] The internal TMX has DOCTYPE and DTD” (2025-07-13 to 14)
“[OmTdev] Some inconsistencies in TMX test cases and format handling” (2025-07-18 to 19)
“[OmTdev] Clarifying XML Processing in OmegaT and Use of XSD over DTD” (2025-07-15 to 16)
Mailing List: SourceForge OmegaT-devel July 2025
Context¶
OmegaT stores its internal translation memory in a file called project_save.tmx.
Historically, this file has included a DOCTYPE declaration for TMX 1.1 (e.g., <!DOCTYPE tmx SYSTEM "tmx11.dtd">).
While this implies TMX conformance, the file is:
Meant for internal use,
Separately, recent reviews of our test infrastructure revealed that:
Many functional test TMX files mix TMX 1.1 and 1.4 constructs,
Files that should be invalid are accepted without error,
Our TMXReader2 uses StAX, which does not support DTD validation.
These two concerns are interrelated and raise key questions:
What TMX version should OmegaT use internally?
How should we validate internal and external TMX files?
How strict should we be in parsing vs exporting?
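To make the StAX point above concrete, here is a minimal sketch of reading the declared TMX version from a file that still carries a DOCTYPE, without ever loading the DTD. The class TmxVersionSniffer and its method are hypothetical names used only for illustration; they are not OmegaT code. The sketch assumes the StAX implementation bundled with the JDK, where turning off SUPPORT_DTD is expected to make the parser skip the DOCTYPE instead of resolving tmx11.dtd.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper (not part of OmegaT): reads the version attribute of the tmx root. */
final class TmxVersionSniffer {

    /** Returns the version attribute of the root element, or null if the root is not tmx. */
    static String readVersion(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Assumption: with DTD support off, the parser does not try to load tmx11.dtd.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        // The first start element is the document root.
                        return "tmx".equals(reader.getLocalName())
                                ? reader.getAttributeValue(null, "version")
                                : null;
                    }
                }
                return null;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?>\n"
                + "<!DOCTYPE tmx SYSTEM \"tmx11.dtd\">\n"
                + "<tmx version=\"1.1\"><header srclang=\"en\"/><body/></tmx>";
        System.out.println(readVersion(xml));
    }
}
```

Note that this only checks well-formedness: nothing here verifies that the content actually conforms to the version the header declares, which is the gap the questions above are about.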
Extended Discussion from Development List¶
Recent discussions have clarified several important technical and architectural points:
Hiroshi’s Technical Clarification (2025-07-15):
OmegaT no longer relies on DTDs when parsing and validating TMX files
XSD offers more precise validation, better namespace handling, and avoids DTD security issues (e.g., XXE attacks)
The <!DOCTYPE> declaration in project_save.tmx can sometimes be misleading for users opening files in validating editors like OxygenXML if they don’t have a copy of the DTD.
OmegaT supports TMX export with DTDs for external tool compatibility but doesn’t use DTD-based validation internally
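As an illustration of the XSD-based approach described in these points, the following sketch validates a document with the standard javax.xml.validation API while blocking external DTD and schema access, which is the usual guard against XXE. The embedded schema is a toy written for this example and is NOT the real TMX schema; the class name XsdValidationSketch is likewise hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;

/** Hypothetical sketch: schema validation with external access disabled. */
final class XsdValidationSketch {

    // Toy schema for illustration only; it is NOT the real TMX schema.
    static final String TOY_XSD =
        "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
      + "<xs:element name=\"tmx\"><xs:complexType><xs:sequence>"
      + "<xs:element name=\"header\"><xs:complexType>"
      + "<xs:attribute name=\"srclang\" type=\"xs:string\" use=\"required\"/>"
      + "</xs:complexType></xs:element>"
      + "<xs:element name=\"body\"><xs:complexType/></xs:element>"
      + "</xs:sequence>"
      + "<xs:attribute name=\"version\" type=\"xs:string\" use=\"required\"/>"
      + "</xs:complexType></xs:element></xs:schema>";

    /** Compiles the toy schema with external DTD and schema lookups forbidden. */
    static Schema toySchema() {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            factory.setProperty(XMLConstants.ACCESS_EXTERNAL_DTD, "");
            factory.setProperty(XMLConstants.ACCESS_EXTERNAL_SCHEMA, "");
            return factory.newSchema(new StreamSource(new StringReader(TOY_XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("Toy schema could not be compiled", e);
        }
    }

    /** Returns true if the document is well-formed and conforms to the toy schema. */
    static boolean isValid(String xml) {
        try {
            Validator validator = toySchema().newValidator();
            validator.setProperty(XMLConstants.ACCESS_EXTERNAL_DTD, "");
            validator.setProperty(XMLConstants.ACCESS_EXTERNAL_SCHEMA, "");
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            throw new IllegalStateException("Validator does not support the access properties", e);
        } catch (SAXException e) {
            return false; // malformed XML or a schema violation
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<tmx version=\"1.4\"><header srclang=\"en\"/><body/></tmx>"));
        System.out.println(isValid("<tmx version=\"1.4\"><header/><body/></tmx>"));
    }
}
```

A real setup would load the converted TMX 1.1 and 1.4b schemas from resources instead of an inline string; how OmegaT resolves those files is not shown here.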
Philippe’s User Experience Perspective (2025-07-15):
Internal handling methods shouldn’t dictate output file format decisions
Removing DOCTYPE while keeping TMX format would confuse users examining file content
Users won’t interpret missing DOCTYPE as “internal use only” - they expect compliant TMX files
Short-term focus should be ensuring schema compliance rather than format changes
Performance and Scalability Concerns (khagaroth, 2025-07-16):
Questioned whether XML is appropriate for working data storage, especially for frequently changing content
Performance issues with larger projects highlight potential need for alternative storage formats
Current XML format doesn’t scale well for bigger projects without splitting
Thomas’s Storage vs. Exchange Perspective (2025-07-16):
Emphasized that project_save.tmx is primarily used for storage, not for exchange
Need to balance current storage structure with features that might be required in the future
Options¶
Every option must account for backward compatibility. We do not know which version of OmegaT a given user is running, so a reasonable number of older versions must still be able to open new projects, and newer versions must still be able to open old ones.
Option A: Treat project_save.tmx as Internal and Drop DOCTYPE¶
Remove the <!DOCTYPE> declaration from project_save.tmx and clarify in the documentation that the file is for OmegaT’s internal use only and is not necessarily TMX-compliant.
Pros¶
Prevents confusion in validating editors
Frees internal format from TMX limitations
Easier to evolve for OmegaT-specific needs
Aligns with XSD-based processing architecture
Cons¶
Reduces compatibility with external tools
Undermines the assumption that project_save.tmx is a reliable backup/exchange file
May confuse users who expect TMX compliance from TMX-named files
Option B: Make project_save.tmx Strictly TMX 1.4b Compliant¶
Update project_save.tmx to fully conform to TMX 1.4b Level 2, using compliant tagging, encoding, and schema validation.
PR #1577 implements this option by extending ProjectTMX#save to write TMX 1.4b for individual projects. For team projects, it preserves the existing TMX version.
Pros¶
Promotes interoperability and correctness
Simplifies reuse by other tools
Aligns with exported -level2.tmx files
Meets user expectations for TMX file compliance
Cons¶
Adds validation overhead and migration costs
Risks breaking compatibility for existing users/projects
TMX 1.4b is not extensible (no <custom> tags, etc.)
Option C: Replace with an Alternative Internal Format (e.g., SQLite)¶
Use a non-TMX format (e.g., SQLite or custom binary XML) for internal storage. Retain TMX exports for interoperability only.
Pros¶
Supports rich internal metadata and faster saves
Clean separation between internal state and exchange formats
Enables backward-compatible export options
Addresses performance and scalability concerns raised in discussions
Cons¶
Loss of human-readable internal format
Tooling and migration burden
May require file format version negotiation
Significant architectural change requiring careful user communication
Option D: Dual Strategy for TMX Validation¶
Improve TMX handling by adopting strict schema-based validation for external files and lenient parsing for trusted internal data. Use converted XSD schemas instead of DTDs for validation.
Pros¶
Provides reliable TMX file validation across 1.1 and 1.4b versions
Enhances test coverage and correctness (PR #1559)
Maintains performance by keeping StAX as the primary parser
Aligns with XSD-based architecture
Cons¶
Requires maintenance of XSD schema files
Some legacy test files needed to be rewritten (this fix is done)
Developers must be aware of the different validation modes for internal and external files; the parser method already has an “omegaTMX” flag.
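The difference between the two modes can be sketched with the language attribute on tuv elements, which is written as lang in TMX 1.1 and as xml:lang in TMX 1.4. In the sketch below, the class TuvLanguageReader and its boolean strict parameter are hypothetical and are not the existing parser API: lenient mode accepts either spelling, while strict mode rejects a spelling that contradicts the declared version.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical sketch of strict vs. lenient handling of the tuv language attribute. */
final class TuvLanguageReader {

    /** Collects the language of every tuv element in document order. */
    static List<String> readLanguages(String xml, boolean strict) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        List<String> languages = new ArrayList<>();
        String version = null;
        try {
            XMLStreamReader r = factory.createXMLStreamReader(new StringReader(xml));
            while (r.hasNext()) {
                if (r.next() != XMLStreamConstants.START_ELEMENT) {
                    continue;
                }
                if ("tmx".equals(r.getLocalName())) {
                    version = r.getAttributeValue(null, "version");
                } else if ("tuv".equals(r.getLocalName())) {
                    languages.add(readLang(r, version, strict));
                }
            }
            r.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
        return languages;
    }

    private static String readLang(XMLStreamReader r, String version, boolean strict) {
        for (int i = 0; i < r.getAttributeCount(); i++) {
            if (!"lang".equals(r.getAttributeLocalName(i))) {
                continue;
            }
            // xml:lang lives in the XML namespace; plain lang has no namespace.
            boolean isXmlLang = XMLConstants.XML_NS_URI.equals(r.getAttributeNamespace(i));
            boolean mismatch = ("1.1".equals(version) && isXmlLang)
                    || (version != null && version.startsWith("1.4") && !isXmlLang);
            if (strict && mismatch) {
                throw new IllegalArgumentException(
                        "Language attribute does not match TMX version " + version);
            }
            return r.getAttributeValue(i);
        }
        if (strict) {
            throw new IllegalArgumentException("tuv element without a language attribute");
        }
        return null; // lenient mode: record the gap and keep going
    }

    public static void main(String[] args) {
        String mixed = "<tmx version=\"1.1\"><header srclang=\"en\"/><body><tu>"
                + "<tuv xml:lang=\"en\"><seg>Hello</seg></tuv>"
                + "<tuv xml:lang=\"fr\"><seg>Bonjour</seg></tuv>"
                + "</tu></body></tmx>";
        System.out.println(readLanguages(mixed, false));
    }
}
```

Versions other than 1.1 and 1.4 are deliberately left unchecked in this sketch, since the document leaves open whether TMX 1.2 and 1.3 should be supported.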
Option E: Add XSD Validation for both import of TMX data and internal data¶
Validate project_save.tmx with TMX 1.1 XSD when reading
Pros¶
Leverages existing XSD validation infrastructure (PR #1559)
Provides a path for future modernization
Reduces editor confusion while maintaining compliance
Cons¶
Questions remain about XML schema references in <tmx> element
Implementation Notes¶
PR #1559 implements schema-based validation for TMX files using XSDs and demonstrates reading valid TMX without a DOCTYPE
Tests like AutoTmxTest and TMXReaderTest currently contain files that:
Mix versions (e.g., TMX 1.1 with xml:lang from 1.4)
Omit required attributes (e.g., pos on <it>)
Pass incorrectly due to the absence of validation
OmegaT’s TMXReader uses StAX, which cannot validate against DTD; schema validation is the only reliable approach
Development discussions reveal tension between storage efficiency and exchange format expectations
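The "pass incorrectly" case in the notes above can be reproduced with a few lines of StAX: a parser that only checks well-formedness reads an it element without pos without complaint. The sketch below counts such elements, as one possible way to audit test data; ItPosAudit is a hypothetical name and not an existing OmegaT class.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical audit helper: finds it elements that lack the pos attribute. */
final class ItPosAudit {

    /** Returns how many it elements in the document have no pos attribute. */
    static int countItWithoutPos(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        int missing = 0;
        try {
            XMLStreamReader r = factory.createXMLStreamReader(new StringReader(xml));
            while (r.hasNext()) {
                // Parsing succeeds either way; the missing attribute is only visible if we look for it.
                if (r.next() == XMLStreamConstants.START_ELEMENT
                        && "it".equals(r.getLocalName())
                        && r.getAttributeValue(null, "pos") == null) {
                    missing++;
                }
            }
            r.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
        return missing;
    }

    public static void main(String[] args) {
        String xml = "<tmx version=\"1.4\"><header srclang=\"en\"/><body><tu>"
                + "<tuv xml:lang=\"en\"><seg>a<it pos=\"begin\">&lt;b&gt;</it>b"
                + "<it>&lt;i&gt;</it></seg></tuv></tu></body></tmx>";
        System.out.println(countItWithoutPos(xml));
    }
}
```

A hand-written check like this covers one rule only; schema validation catches the same problem along with every other required attribute.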
Recommendation Path¶
Based on extended discussions, we propose a refined staged approach:
Immediate (Technical Debt Reduction)
Adopt Option D: implement XSD-based validation for external TMX files (proposed in PR #1570)
Review and correct non-conforming test data (fixed in PR #1569)
Document current project_save.tmx behavior and limitations
Short-Term (Consistency and User Experience)
Evaluate Option E: validate project_save.tmx with TMX 1.1 XSD when reading
Consider Philippe’s concern about user expectations vs. Hiroshi’s technical clarity
Gather broader community feedback on format change communication
Medium-Term (Architecture Decision)
Decide between continued TMX evolution (Option B) vs. internal format separation (Option A)
Address performance concerns raised by khagaroth through optimization or format change
Clarify the storage vs. exchange role of project_save.tmx, per Thomas’s observations
Long-Term (Scalability and Performance)
Evaluate Option C if storage performance becomes critical
Consider hybrid approaches that maintain TMX exports while using efficient internal storage
Community Perspectives Summary¶
Technical Architecture (Hiroshi): Favor XSD over DTD, remove misleading DOCTYPE, align with modern XML processing
User Experience (Philippe): Maintain TMX compliance expectations, avoid confusing format changes
Performance (khagaroth): Question XML suitability for working storage, consider scalability
Purpose Clarity (Thomas): Distinguish storage from exchange usage, acknowledge evolved user expectations
Open Questions¶
Should project_save.tmx always be readable by older OmegaT versions?
What level of strictness should be enforced on tm/auto/*.tmx and user-imported memories?
Do we want to support TMX 1.2 or 1.3 in practice?
Can schema validation be extended to all TMX processing (including project merge, team sync)?
How should we communicate format changes to users who rely on project_save.tmx for external processing?
Should we include XML schema references in the <tmx xmlns="http://www.lisa.org/tmx14"> element if removing DOCTYPE?
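For the last question, this is roughly what a DOCTYPE-free file with a namespace declaration on the root would look like when written with StAX. It is only a sketch to support the discussion: NamespacedTmxWriter is a hypothetical class, the namespace URI is the one quoted in the question, and no xsi:schemaLocation hint is written because where the schema would live is still undecided.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

/** Hypothetical sketch: writes an empty TMX skeleton with a default namespace and no DOCTYPE. */
final class NamespacedTmxWriter {

    // Namespace URI as quoted in the open question above.
    static final String TMX14_NS = "http://www.lisa.org/tmx14";

    static String writeEmptyTmx(String srcLang) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            w.writeStartDocument();
            // No writeDTD() call: the output carries no DOCTYPE declaration.
            w.writeStartElement("", "tmx", TMX14_NS);
            w.writeDefaultNamespace(TMX14_NS);
            w.writeAttribute("version", "1.4");
            w.writeEmptyElement("", "header", TMX14_NS);
            w.writeAttribute("srclang", srcLang);
            w.writeEmptyElement("", "body", TMX14_NS);
            w.writeEndElement();
            w.writeEndDocument();
            w.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not write TMX skeleton", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(writeEmptyTmx("en"));
    }
}
```

Whether older OmegaT versions and third-party tools accept a namespaced root is exactly the compatibility question that would need testing before adopting this form.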
Contributors¶
Hiroshi Miura
Jean-Christophe Helary
Philippe B.
Thomas Cordonnier
khagaroth
References¶
OmegaT PR#1569 - fix(test): make TMX test data to be compliant with TMX1.1 and 1.4
OmegaT PR#1570 - refactor(core): add TMXLSResourceResolver and improve TMX validation in TMXReader2
Mailing List: SourceForge OmegaT-devel July 2025
This ADR consolidates architectural concerns about internal format design, TMX standard conformance, and test validation robustness. It incorporates community perspectives on technical architecture, user experience, performance, and purpose clarity. The document is open for review and intended as a base for community agreement before implementation.