The Problem with Translating DITA

Today, I was reading the 188th Tool Kit – Basic Edition newsletter by Jost Zetzsche. I love reading Jost’s blog because he isn’t afraid to say what he thinks, even if it is controversial. He is also very well read and very well connected.

In his most recent newsletter, Jost discusses the demise of LISA (Localization Industry Standards Association), the various groups that have stepped up to assume LISA’s charter (interoperability standards), and some of the problems that he sees facing the translation industry with the current proposed replacements. This is an interesting discussion and Jost correctly describes the issues facing the smaller LSP and the individual translator.

I was fascinated by this comment in his post:

“…there are some standards that work very well for translation buyers but are certainly against our [individual translators/small LSP] interests. In my opinion, one of those standards is DITA, an XML-based standard that provides the ability to segment the source text into small chunks that can be used in a variety of ways and allow for a great reuse of data; however, this works much to the detriment of the translator who often lacks the necessary context.”

Wow! In all of my years of dealing with XML and DITA, and in my years of watching and being part of the translation community, I have never considered the impact of DITA on translation. Of course, it makes perfect sense. In the DITA world, content is broken into small chunks so that they can be reused and repurposed in a variety of ways. When a chunk of content is sent to translation, it is not necessarily combined with any of the other chunks to which it relates. The chunk is completely out of context. Kind of like my children eavesdropping on my phone calls, but only catching errant sentences, for which they later want a full explanation.
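
To make this concrete, here is a minimal, made-up DITA sketch. The file names and IDs are invented, but the pattern is the standard one: a small topic is authored once and pulled into more than one map.

    <!-- reused-note.dita: a tiny chunk, authored once (hypothetical) -->
    <topic id="reused-note">
      <title>Before You Begin</title>
      <body>
        <p>Make sure that these are disconnected before you open the cover.</p>
      </body>
    </topic>

    <!-- One of several maps that pull in the same chunk -->
    <map>
      <topicref href="printer-maintenance.dita">
        <topicref href="reused-note.dita"/>
      </topicref>
    </map>

If only reused-note.dita is sent out for translation, nothing tells the translator what “these” refers to, and in a gendered language that one pronoun can force a guess.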

How is a translator supposed to work with a small chunk of information, if there is no context? This seems like a formidable task to me.

Creating a small chunk of content that will live in multiple places, be surrounded by a variety of other pieces of content, and be consumed via a variety of media is a difficult task unto itself. However, the writer creating this small chunk usually has access to the CMS that houses the related information.

The translator is often an individual, contracted by an LSP, who may or may not be able to access the related pieces of content. I repeat: Wow! This seems like an incredibly daunting task, particularly if the content was not created to be global-ready. By this I mean that the writer was not following the basic rules of writing text that will be translated. I’ve written lots of posts on these basic rules. For example, making sure that you use a noun with the words “this, that, these, and those,” so the translator knows if the noun is masculine or feminine. Or making sure that your sentences do not contain idiomatic phrases that have no meaning in another language. And, my favorite, keeping your sentences as short and simple as possible.

I have to wonder how helpful translation memory is if the context of the segment is unavailable. Sure, perhaps that segment has been translated before. But how would the translator know whether that earlier translation is appropriate for this rendition of the segment?
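
Here is a rough sketch of how that plays out in a translation memory. This TMX fragment is invented, but the trap is real: the stored German translation assumed a particular referent for “it.”

    <tmx version="1.4">
      <header creationtool="example" creationtoolversion="1.0"
              segtype="sentence" o-tmf="example" adminlang="en"
              srclang="en" datatype="plaintext"/>
      <body>
        <tu>
          <!-- Stored when "it" meant "the window" (das Fenster, neuter) -->
          <tuv xml:lang="en"><seg>Close it before you continue.</seg></tuv>
          <tuv xml:lang="de"><seg>Schließen Sie es, bevor Sie fortfahren.</seg></tuv>
        </tu>
      </body>
    </tmx>

A new chunk in which “it” means “the file” (die Datei, feminine) needs “sie” instead of “es,” yet the tool reports a 100% match either way, and nothing in the chunk itself tells the translator which is right.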

As I’ve mentioned countless times, I am not an expert on translation. I am a student, though. And I am very curious to understand more about this topic. If you have information to add to this discussion, or if I am missing a point, or if I am just incorrect, please let me know.

Wow!

 

Val Swisher

Val Swisher is the CEO of Content Rules. She is a well-known expert in global content strategy, content development, and terminology management. Using her 20 years of experience, Val helps companies solve complex content problems by analyzing their content and how it is created.

When not blogging, Val can be found sitting behind her sewing machine working on her latest quilt. She also makes a mean hummus.
  • http://www.sdicorp.com/Resources/Blog/articleType/AuthorView/authorID/24/lkunz.aspx Larry Kunz

    The same problem is faced by editors who are editing the content in the native language. The editing community is developing ways to cope, but the best way is one that you mentioned: train the writers to follow the basic rules of writing text that’s to be translated — consistent vocabulary, straightforward sentence structure, and so forth.

    Another way to mitigate the problem is to avoid creating topics that are very short. Just because DITA permits an extreme level of modularity doesn’t mean that it’s always a good idea. Chunk your content into topics that are small enough for the level of reuse that you require — but no smaller.

    • http://www.contentrules.com Val Swisher

      It makes perfect sense that editors of the source language are faced with the same issue. And training the writers (and providing them with tools, like acrolinx IQ and ContentRules IQ) really helps.

      Knowing the optimum chunk size is also important. Do you know of any resources that discuss the best practices around this topic?

  • http://www.contentrules.com Val Swisher

    Mike – I completely agree. Anything that provides some context for the translator would be extremely helpful. The only danger I could see would be if the chunk is repurposed in a DIFFERENT context. Then, I think we risk the possibility that the translation is appropriate for one context, but not another. I have seen chunks of content that were created for one context, but as soon as the chunk was pulled into a different context, it didn’t fit.

  • Julia Pond

    It might not be as big a problem as you imagine. True, there will be small reused chunks (sentence or phrase). But each topic is supposed to be written as a self-contained, standalone unit, so it should not rely on context to convey meaning.

    • http://www.contentrules.com Val Swisher

      Perhaps not. I think that if the writers focus on creating self-contained, standalone units, it should definitely make the job easier. I think that providing the writers with tools that ensure they create the content in a way that is easily understandable and translatable (on its own) is an important step.

  • http://twitter.com/MonicaMOliveira Monica Oliveira

    I disagree that breaking source text into chunks works well for the translation buyers. This is only true if the buyers have good control of their content (i.e., phrases are authored to mean the same thing whatever the context), and if they do, it would not affect the translators because the lack of context would not impact the final translation as much. The second point is: the money/time saved by reusing out-of-context translations will be spent (if not more) on editing to fix any context-related errors. The segmentation of the source text creates a problem that needs to be addressed at some point, either at the translation step or at the editing step. If the cost of making sense of out-of-context translation comes out of the translator’s word rate, buyers have a problem other than cost: retention of their resources. A translator will only take a project like this once. A segmented view of the standard might make buyers lose sight of the business picture.

    • http://www.contentrules.com Val Swisher

      Thanks for your comments, Monica. I think your points are quite valid. And, as Jost mentions, the widespread use of DITA chunks (out of context) has some unintended consequences. Given how silo’d content creation and localization remain, I am not surprised that authors create chunks of content without context. I really think that the writing community (of which I have been a part for 20+ years) needs to recognize the needs of the localization community. For that, they need training and tools, not to mention management recognition and support.

  • Kirsty Taylor

    I’m also seeing this context issue in some of our development work, in how we are treating the UI literals. Changes have been made with new versions of a product, and literals now have no context; they are not linked to the screen they are on or to the other fields around them. We did a language demo a few months ago, and got in-country comments that suggested different translations for the same literal on two different tabs or screens. This was probably a totally valid suggestion, given the context, but because of the changes developers have made to the way the UI translations are read, and because UI literals are now just strings without context, we may have some re-work down the track.
    With content, we have reuse across products, and I’m now finding that I need to make sure that our tiny segments that we reuse (e.g. “Click OK”) are the same across products. In most cases they should be, but there is also the potential that we’ve got different translations in different products, and reuse in the content will mean that the content does not match the UI.

    • http://www.contentrules.com Val Swisher

      Kirsty – I agree. Clearly, out-of-context content presents big problems for many organizations. And those problems manifest in different ways, depending on the use and location of the content.

      It seems to me that having a tool that enforces consistency of words, segments, and sentences across all content, regardless of the context, would add significant value. Using your example, there are many ways to say “Click OK.”

      Click OK
      Press OK
      Hit OK
      Enter OK
      Click on the OK button
      Click on the Okay button
      Click the button

      You get the idea. Enforcing reuse during the authoring process could go a long way towards helping this situation. Clearly, it is a problem that will continue to grow as more companies repurpose more content in more ways for more audiences.
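
      For instance, DITA already gives authors a way to store a phrase like this once and pull it in everywhere by reference. Here is a minimal sketch, with invented file names and IDs:

        <!-- shared-phrases.dita: the wording lives in exactly one place -->
        <topic id="shared-phrases">
          <title>Shared phrases</title>
          <body>
            <p id="click-ok">Click <uicontrol>OK</uicontrol>.</p>
          </body>
        </topic>

        <!-- Any topic that needs the phrase pulls in the same element -->
        <p conref="shared-phrases.dita#shared-phrases/click-ok"/>

      Every topic then renders identical wording, and a single translation of the shared element covers them all.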

  • http://www.facebook.com/donrday Don Day

    Just to back up the point made by Larry and Julia, properly written DITA topics comply with the notion of being “short enough to be specific to a single subject or answer a single question, but long enough to make sense on its own and be authored as a unit” (http://docs.oasis-open.org/dita/v1.1/OS/archspec/topics.html).

    The problem of contextual correctness of reused, translated strings is of course not unique to DITA. Prior to DITA’s development, I worked with graphics API messages and OS/2 system messages that were translated separately. I converted Java resource bundles into XML markup that had to be translated separately from the code due to separate internal billing policies. I converted Java source Javadoc comments into markup for documentation, many times translated separately later on. I suspect that much string usage both inside and outside of the DITA architecture is still subject to this kind of translation indeterminacy.

    There is a body of experience-based best practice materials for DITA that writers and documentation managers should certainly be aware of:

    * IBM had been translating DITA content for at least 3 years prior to the start of the OASIS DITA Technical Committee in 2004, publishing its own lessons learned (“Globalize your On Demand Business”–http://www-01.ibm.com/software/globalization/topics/dita/index.jsp and “Implement a DITA publishing solution without abandoning your current publishing system investments”–http://www.ibm.com/developerworks/xml/library/x-dita11/ ).

    * The DITA TC early on established the OASIS DITA Translation Subcommittee, comprised of many members of the translation services community, to produce a number of best practices listed here: “Optimizing DITA for Translations”–http://dita.xml.org/wiki/optimizing-dita-for-translations. Some of the unlinked materials are now published–I’ll check on getting those items updated with current links.

    * Other companies are now publishing use cases that affirm the potential of well-written DITA to reduce translation costs while maintaining high satisfaction and quality worldwide. These articles capture lessons learned and are useful as valuable ROI calculation truth points for business case justification. The list of references is growing, so I suggest just searching for “DITA ROI translation” to surface quite a bit of available training and citations on the subject.

    • http://www.contentrules.com Val Swisher

      Don – Thank you so much for this list of resources. I know that it will be valuable to many in the writing and translation communities. If you have additional sources, please let me know. I will include them in a follow-up post.

  • Martin Spethman

    Actually, Translation Memory (TM) will prove to be helpful when translating DITA “chunks” because, in most cases, the TM will be based on repetitious and relevant, related content. DITA chunks are nearly always managed with a Content Management System (CMS) that would also provide the linguist with a rendering (typically PDF) of a complete topic (e.g. “chapter”) that uses the chunk. While it would certainly be theoretically possible for a customer to manually manage their DITA chunks and attempt to send individual chunks to a translation vendor without context, this is not the usual case.

    On the surface, it might look like the translator is being given more work than necessary when translating DITA chunks: the linguist needs to search for the text section in available reference materials (usually PDF files) in order to see the text in context. Then the linguist must make a decision: is the text suggested by the TM the best choice? Or is an alternate translation better?

    Most translators today have many tools available to help them make the best translation possible for any “piece” of text. Tools range from online databases and digital multilingual dictionaries to professional forums. Computer Aided Translation (CAT) tool developers try to incorporate functions into their tools which make the translation process less “painful”. Many tools have “Autosuggest” and Google Translate incorporated; this not only increases the number of reference materials which can be used but also speeds up the translation process, making it more productive.

    In summary, a good translator is always a good researcher, and research is part of the job.

    • http://www.contentrules.com Val Swisher

      Martin – Thanks so much for your perspective. I’ve gotten differing opinions and related experience from people who have responded either to this post or to me directly. Some, like you, suggest that this really isn’t a problem, because it is rare that chunks are delivered out of context or without access to the CMS.

      Others state that this is a huge problem and one that has persisted for years. Ultan, who goes by @localization on Twitter, wrote this blog post addressing the issue quite thoroughly: http://www.multilingualblog.com/?p=102. Apparently, this problem still exists today, at least in Ultan’s world.

      As I mentioned, I am certainly not an expert. I appreciate knowing more about your experience, and how various tools and TMs have just about eliminated the issue of DITA chunks arriving out of context to translators – at least in your world!

    • David

      Martin

      When you write, “In summary, a good translator is always a good researcher, and research is part of the job,” you might be right. Basically. But how many clients are willing to pay for the time necessary for research… not many, I can say from my experience in the language industry.

      Best regards,
      David

  • Douglas McCarthy

    In my opinion, the problem raised by Jost Zetzsche is a false one. First of all, the information on the type of entity that a word, phrase or sentence embodies is only marginally useful to the translator. All he/she really needs is the most usual output format, whether it’s CHM, PDF or HTML. This will be enough of a guide.
    Second, if the translator really wants to study the various entities and their structure in the document, a text editor like Notepad or even better (because of indentation etc.) Notepad++ is freely available. This kind of tool allows you to get into the heart of the code with no difficulty.
    Third, this supposed constraint is similar to that involved in translating classic FrameMaker or InDesign: you work with the MIF or the INX/IDML respectively, and check the PDF output of the source document as you go along.
    As far as the size of chunks goes, this is rarely, if ever, going to be a problem. Even if you had a command name identified as such by tags, or a string of code identified in the same way, it would be easy to find the whole sentence context in your translation environment editor. The segmentation rules would probably find the sentence or phrase units easily enough, and use in-line tags inside the sentence. This is exactly what happens when you translate HTML or even DOCX with words in bold or italic format, for example. The only difference is that most translation editors today can provide you with a preview of your HTML or DOCX file. However, doing without a preview is not a big handicap.
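
    To picture what the translator actually sees, here is a small XLIFF sketch (the file name and IDs are invented): the whole sentence arrives as one unit, with the inline formatting reduced to paired placeholder tags.

      <xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
        <file original="chunk.dita" source-language="en"
              target-language="de" datatype="xml">
          <body>
            <trans-unit id="1">
              <!-- the <g> pair stands in for the original inline markup -->
              <source>Click <g id="g1">OK</g> to save your changes.</source>
              <target>Klicken Sie auf <g id="g1">OK</g>, um Ihre Änderungen zu speichern.</target>
            </trans-unit>
          </body>
        </file>
      </xliff>

    The tags travel with the sentence, so the formatting survives even though the translator never opens the source file.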

    • http://www.contentrules.com Val Swisher

      Thanks very much for your perspective. I learn more and more every day. Hearing different views on this topic has been very interesting and I appreciate your interest!

  • Anna van Raaphorst

    I recently struggled with the “translating out of context” problem Val mentions, when my business partner and I were playing around with automated machine translation of some of the sample DITA files we distribute as part of our DITAinformationcenter. Dick wrote a Python program to send the DITA files, one by one, to the Microsoft and Google translation tools to translate into both German and Spanish. We then compared the results of the two tools, and compared the individual language files to the original English. One goal was to rewrite the original English files using guidelines of Simplified English to see if that helped to produce better translations.

    The results were a mixed bag — enough said. One interesting (but frustrating!) result was the differences within the individual translated files — structures that should have been parallel but were not. One example is the tasks in the garage sample that in English were titled “To wash the car,” “To spray-paint,” “To take out the garbage,” etc. All parallel structures starting with “to.” The German translation has three versions of that structure!

    I tried everything to produce consistent results (for example, changing the “to” structures to gerunds) — no luck.

    To see the results, go to the following Download page and page down to the “garage” and “grocery shopping” files. (P.S. This site is still a work in progress.)

    http://www.xmldocs.info/?page_id=50

    -Anna
