The script was written in Perl, or more specifically MacPerl, since the translations would take place on the Macs at the department. All of the following code fragments are in Perl, and are taken more or less directly from the script itself.
The first task was to settle on a working translation algorithm. The translation would need to convert not only single ASCII code points but also multiple-character sequences, in order to deal with some accenting problems. To start with, the problem was simplified: the algorithm would visit each character in turn and look up its translation. This takes care of one-to-one as well as one-to-many translations. Once this simple algorithm was working, a second pass could be added to find any multiple-character sequences.
The translation maps would be read in from a file, in order to make the translation process general. To make sure that the map-file format would not have to be revised, it was designed around the most general parameters that could be conceived. This meant that the format would have to handle character sequences of differing lengths. In addition, it was decided that the map-files should be fairly legible in their raw format. Finally, the font names should appear in the file itself rather than in the file name (font names can be quite long). The format decided upon was as follows:
Lektorek Russian --> CyrillicII
3b --> fc
3c --> c7
8e --> 65 b1
3e --> c8
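Reading a file in this format can be sketched as follows. This is only an illustration, not the script's own reader: the helper name parse_map is hypothetical, and it assumes that the first line holds the two font names and that every other line holds space-separated hex byte codes on each side of the arrow.

```perl
use strict;
use warnings;

# Hypothetical helper: parse map-file text into (source font, target font, map).
# Assumes line 1 is "Source --> Target" and each later line is
# "hex [hex ...] --> hex [hex ...]", as in the sample above.
sub parse_map {
    my ($text) = @_;
    my @lines = grep { /\S/ } split /\n/, $text;

    # The header line carries the font names.
    my ($from_font, $to_font) = split /\s*-->\s*/, shift @lines;
    s/^\s+|\s+$//g for $from_font, $to_font;

    my %map;
    for my $line (@lines) {
        my ($src, $dst) = split /\s*-->\s*/, $line;
        # Turn each run of hex byte codes into the characters they name.
        my $key   = join '', map { chr(hex($_)) } split ' ', $src;
        my $value = join '', map { chr(hex($_)) } split ' ', $dst;
        $map{$key} = $value;
    }
    return ($from_font, $to_font, \%map);
}

my $sample = join "\n",
    'Lektorek Russian --> CyrillicII',
    '3b --> fc',
    '3c --> c7',
    '8e --> 65 b1',
    '3e --> c8';

my ($from, $to, $map) = parse_map($sample);
printf "%s -> %s, %d entries\n", $from, $to, scalar keys %$map;
# prints "Lektorek Russian -> CyrillicII, 4 entries"
```

Because both sides are stored as plain strings, a one-to-many entry such as 8e --> 65 b1 needs no special handling: its value is simply a two-character string.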
The simple translation algorithm follows fairly quickly, once this map is loaded into a hash:
s/(.)/$map{$1}/g
The second-pass algorithm would need to be a little smarter than that. It needs to search for several different strings, so alternation would appear to be needed. Also, since the strings appear literally in the regex, they need to be quoted. So when, in reading the map-file, we encounter a many-to-one translation, we add it to a different hash, called map_tto. When the map-file has been read completely, the keys of map_tto are quoted individually and joined on '|', the alternation character. This is the search string in the following code fragment:
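A minimal sketch of such a second pass is given here under stated assumptions; it is not the script's own fragment. The entries in %map_tto are invented for illustration, and sorting the keys longest-first is an added precaution, not something the description above requires.

```perl
use strict;
use warnings;

# Hypothetical many-to-one entries: multi-character source sequences.
# The byte values are illustrative only and do not come from a real map.
my %map_tto = (
    "a\x60" => "\x88",
    "e\x27" => "\x8e",
);

# Quote each key so regex metacharacters match literally, then join on '|'.
# Longer keys go first so they win over any shorter key that is a prefix.
my $search = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } keys %map_tto;

my $text = "caf e\x27 a\x60 la carte";

# Skip the pass entirely if there are no many-to-one entries; an empty
# pattern would otherwise match at every position.
$text =~ s/($search)/$map_tto{$1}/g if length $search;
```

After the substitution, $text holds "caf \x8e \x88 la carte": each two-character sequence has been replaced by its single-character translation.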
The documents were to be read in RTF format, which would allow the preservation of formatting, but still allow us to work on ASCII text. The RTF reader had to be able to understand the basics of RTF, ignoring most control words, recognizing font changes, special characters, hex-codes, etc. The RTF specification was obtained from Microsoft's WWW site.
The reader was designed recursively. Each level would handle one block; the function was called parse_block(). The text would be searched for an RTF meta character (a forgivingly small set of three: \, {, and }). An open curly would call another level of interpretation, a closed curly would finish off a level, and a backslash would be parsed for control-word content. Each instance of the reader would store its own font number, which it inherited from the previous level. Any intervening text would be translated if necessary, or copied out directly, by another subroutine, write().
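The recursive structure described above can be sketched roughly as follows. This is a simplified illustration, not the script's parse_block(): write_text is a hypothetical stand-in for write() that merely upper-cases text in an assumed source font tag "f1", and control words other than the font change are copied through untouched.

```perl
use strict;
use warnings;

my $out = '';

# Stand-in for the script's write(): upper-cases text set in the assumed
# source font tag "f1" and copies all other text unchanged.
sub write_text {
    my ($text, $font) = @_;
    $out .= ($font eq 'f1') ? uc($text) : $text;
}

# Parse one block of the RTF held in $$rtf, starting at its current pos().
# $font is inherited from the enclosing block; changes stay local to a level.
sub parse_block {
    my ($rtf, $font) = @_;
    while ($$rtf =~ /\G([^\\{}]*)([\\{}])/gc) {
        my ($text, $meta) = ($1, $2);
        write_text($text, $font) if length $text;
        if ($meta eq '{') {
            $out .= '{';
            parse_block($rtf, $font);        # nested block inherits the font
        }
        elsif ($meta eq '}') {
            $out .= '}';
            return;                          # this level is finished
        }
        elsif ($$rtf =~ /\G([a-z]+)(-?\d*)( ?)/gc) {    # control word
            my ($word, $num, $sp) = ($1, $2, $3);
            $font = "f$num" if $word eq 'f' && length $num;
            $out .= "\\$word$num$sp";
        }
        elsif ($$rtf =~ /\G('[0-9a-fA-F]{2}|.)/gcs) {   # hex code or symbol
            $out .= "\\$1";
        }
    }
    # Any text left after the last meta character.
    write_text($1, $font) if $$rtf =~ /\G(.+)/gcs;
}

my $doc = '{\f0 abc{\f1 def}ghi}';
parse_block(\$doc, 'f0');
print "$out\n";    # prints {\f0 abc{\f1 DEF}ghi}
```

The font change made inside the inner block does not leak out: once the closing curly returns control to the outer level, "ghi" is written with the outer font again.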
The RTF header also had to be read; the header function was called read_header(). The important tasks of this function were to find the appropriate (source) font tags and to substitute in the new font names. The font tags are later used by write() to determine whether a passage needs translating.
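The font-tag lookup and renaming might be sketched like this. The helper name retag_font and the header fragment are assumptions made for illustration; the sketch also assumes font-table entries of the simple form {\f4\fnil Font Name;} and does not attempt to handle every variation the RTF specification allows.

```perl
use strict;
use warnings;

# Hypothetical helper: in an RTF header, find the font-table entry named
# $from, rename it to $to, and return its font tag (e.g. "f4") or undef.
sub retag_font {
    my ($header, $from, $to) = @_;
    my $tag;
    # Match "{\fN", any further control words, then the quoted font name.
    if ($$header =~ s/(\{\\(f\d+)[^{};]*?)\Q$from\E;/$1$to;/) {
        $tag = $2;
    }
    return $tag;
}

# Illustrative header fragment only; a real header contains much more.
my $header = '{\rtf1\mac{\fonttbl{\f0\fswiss Helvetica;}{\f4\fnil Lektorek Russian;}}';
my $tag = retag_font(\$header, 'Lektorek Russian', 'CyrillicII');
```

Here $tag comes back as "f4", and the font-table entry in $header now names CyrillicII. The tag is what a later pass would compare against the current font to decide whether a run of text needs translating.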
Keeps hash of references to hash-maps (keyed on font tag).
Now makes only one copy of document in memory. RAM usage is now about 3000K.
Keeps file in memory as an array and searches only through the necessary lines of text, using new search subroutines. This cuts down on long string copies ($`, $&, and $' are much, much shorter).
The Perl script reads in file names and sends literal-text AppleScripts to Microsoft Word for each file. This approach is used to avoid a nasty bug in Word 6.0.1 AppleScript support.
The whole script is encapsulated in one short MacPerl runtime, which 'requires' the other scripts, saved in plain text. This makes minor bug fixes possible without the actual MacPerl Editor.
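The hash of references to hash-maps, keyed on font tag, that is mentioned in the notes above could look something like the following. The tags and map contents are invented for illustration, and the choice to keep unmapped characters unchanged is an assumption of this sketch.

```perl
use strict;
use warnings;

# Hypothetical data: one translation map per source font tag.
my %maps_by_tag = (
    f4 => { "\x3b" => "\xfc", "\x3c" => "\xc7" },
    f7 => { "\x3e" => "\xc8" },
);

# Translate $text if its font tag has a map; otherwise copy it unchanged.
sub translate {
    my ($text, $tag) = @_;
    my $map = $maps_by_tag{$tag} or return $text;
    # Characters without a map entry are kept as they are.
    $text =~ s/(.)/exists $map->{$1} ? $map->{$1} : $1/ges;
    return $text;
}

my $translated = translate("a;b<", 'f4');   # ';' and '<' have entries in f4
my $untouched  = translate("a;b<", 'f0');   # no map is registered for f0
```

Keying on the font tag means a document that mixes several source fonts can be handled in a single pass, with each run of text looked up in the map that belongs to its own font.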