gettext: Normalizing

 
 8.3.4 Normalizing Strings in Entries
 ------------------------------------
 
    There are many different ways of encoding a particular string into a
 PO file entry, because there are so many different ways to split and
 quote multi-line strings, and even to represent special characters by
 backslash escape sequences.  Some features of PO mode rely on the
 ability of PO mode to scan an already existing PO file for a particular
 string encoded into the ‘msgid’ field of some entry.  Even though PO
 mode has all the built-in machinery needed for implementing this
 recognition easily, doing it fast is technically difficult.  To
 facilitate a solution to this efficiency problem, we decided on a
 canonical representation for strings.
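 
    To illustrate why lookup is harder without a canonical form, here is
 a minimal Python sketch, which is not part of PO mode.  It decodes two
 different spellings of the same ‘msgid’ and checks that they denote the
 same string.  The helper name ‘decode_field’ is hypothetical, and only
 the escapes for newline, tab, double quote and backslash are handled.

```python
import re

# One double-quoted segment; a backslash always protects the next character.
_SEGMENT = re.compile(r'"((?:[^"\\]|\\.)*)"')
# Assumption: only these simple escapes are supported in this sketch.
_ESCAPES = {"n": "\n", "t": "\t", '"': '"', "\\": "\\"}


def decode_field(lines):
    """Join the quoted segments of one field and undo the simple escapes."""
    raw = "".join(seg for line in lines for seg in _SEGMENT.findall(line))
    return re.sub(r"\\(.)", lambda m: _ESCAPES.get(m.group(1), m.group(1)), raw)


one_line = ['msgid "Hello,\\nworld!\\n"']
split = ['msgid ""', '"Hello,\\n"', '"world!\\n"']
# Both spellings decode to the same string, so a plain text search for
# one spelling would miss the other unless the file is normalized.
assert decode_field(one_line) == decode_field(split)
```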
 
    A conventional representation of strings in a PO file is currently
 under discussion, and PO mode experiments with a canonical
 representation.  Having both ‘xgettext’ and PO mode converging towards a
 uniform way of representing equivalent strings would be useful, as the
 internal normalization needed by PO mode could be automatically
 satisfied when using ‘xgettext’ from GNU ‘gettext’.  An explicit PO mode
 normalization should then be necessary only for PO files imported from
 elsewhere, or when the convention itself evolves.
 
    To normalize at least those strings of a given PO file that need a
 canonical representation, the following PO mode command is available:
 
 ‘M-x po-normalize’
      Tidy the whole PO file by making entries more uniform.
 
    The special command ‘M-x po-normalize’, which has no associated keys,
 revises all entries, ensuring that strings of both original and
 translated entries use uniform internal quoting in the PO file.  It also
 removes any crumb after the last entry.  This command may be useful for
 PO files freshly imported from elsewhere, or if we ever improve on the
 canonical quoting format we use.  This canonical format is not only
 meant for getting cleaner PO files, but also for greatly speeding up
 ‘msgid’ string lookup for some other PO mode commands.
 
    ‘M-x po-normalize’ presently makes three passes over the entries.
 The first implements heuristics for converting PO files for GNU
 ‘gettext’ 0.6 and earlier, in which ‘msgid’ and ‘msgstr’ fields were
 using K&R style C string syntax for multi-line strings.  These
 heuristics may fail for comments not related to obsolete entries and
 ending with a backslash; they also depend on subsequent passes for
 finalizing the proper commenting of continued lines for obsolete
 entries.  This first pass might disappear once all old PO files have
 been adjusted.  The second and third passes normalize all ‘msgid’ and
 ‘msgstr’ strings respectively.  They also clean out those trailing
 backslashes used by XView’s ‘msgfmt’ for continued lines.
 
    Having such an explicit normalizing command allows for importing PO
 files from other sources, and also eases the evolution of the current
 convention, which for now is driven mostly by aesthetic concerns.  It
 is easy to make suggested adjustments at a later time, as the
 normalizing command and, eventually, other GNU ‘gettext’ tools should
 greatly automate conformance.  A description of the canonical string
 format is given below, for the particular benefit of those who do not
 have Emacs handy but nevertheless want to handcraft their PO files in
 nice ways.
 
    Right now, in PO mode, strings are single line or multi-line.  A
 string goes multi-line if and only if it has _embedded_ newlines, that
 is, if it matches ‘[^\n]\n+[^\n]’.  So, we would have:
 
      msgstr "\n\nHello, world!\n\n\n"
 
    but, replacing the space by a newline, this becomes:
 
      msgstr ""
      "\n"
      "\n"
      "Hello,\n"
      "world!\n"
      "\n"
      "\n"
 
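    The rule above can be sketched in Python as follows.  This is an
 illustration under stated assumptions, not the Emacs Lisp code PO mode
 actually uses: the names ‘escape’ and ‘format_field’ are hypothetical,
 and only backslash, double quote, newline and tab are escaped.

```python
import re

# An "embedded" newline: a run of newlines with other text on both sides.
EMBEDDED_NEWLINE = re.compile(r"[^\n]\n+[^\n]")


def escape(text):
    """Escape backslash, double quote, newline and tab (a simplification)."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\t", "\\t"))


def format_field(keyword, text):
    """Return the lines of one PO field, single line or multi-line."""
    if not EMBEDDED_NEWLINE.search(text):
        # No embedded newline: everything stays on the keyword line.
        return ['%s "%s"' % (keyword, escape(text))]
    # Multi-line: empty string first, then one quoted string per line.
    lines = ['%s ""' % keyword]
    for piece in re.findall(r"[^\n]*\n|[^\n]+", text):
        lines.append('"%s"' % escape(piece))
    return lines


print("\n".join(format_field("msgstr", "\n\nHello, world!\n\n\n")))
print("\n".join(format_field("msgstr", "\n\nHello,\nworld!\n\n\n")))
```

    Under these assumptions, the two calls reproduce the single line and
 multi-line examples shown above.
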
    We are deliberately using a caricatural example here, to make the
 point clearer.  Usually, multi-line strings do not look that bad.  It
 is probable that we will implement the following suggestion.  We might
 lump together all initial newlines into the empty string, and also all
 newlines introducing empty lines (that is, in a run of N > 1 newlines,
 the last N-1 newlines would go together on a separate string), making
 the previous example appear as:
 
      msgstr "\n\n"
      "Hello,\n"
      "world!\n"
      "\n\n"
 
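    One possible reading of this suggestion is sketched below in Python.
 It is only an interpretation of the text above, not an implemented PO
 mode feature; the names ‘escape’ and ‘format_field_lumped’ are
 hypothetical, and only backslash, double quote, newline and tab are
 escaped.

```python
import re


def escape(text):
    """Escape backslash, double quote, newline and tab (a simplification)."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\t", "\\t"))


def format_field_lumped(keyword, text):
    """Format one PO field, lumping runs of newlines together."""
    if not re.search(r"[^\n]\n+[^\n]", text):
        # No embedded newline: the string stays on a single line.
        return ['%s "%s"' % (keyword, escape(text))]
    # Initial newlines take the place of the leading empty string.
    leading = re.match(r"\n*", text).group()
    lines = ['%s "%s"' % (keyword, escape(leading))]
    # Each text line keeps its own newline; any further newlines in the
    # same run are grouped into one separate string.
    for piece in re.findall(r"[^\n]+\n?|\n+", text[len(leading):]):
        lines.append('"%s"' % escape(piece))
    return lines


print("\n".join(format_field_lumped("msgstr", "\n\nHello,\nworld!\n\n\n")))
```
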
    A few minor points about string normalization are still undecided;
 they will be documented in this manual once these questions are
 settled.