Download interlinear.pl.

Currently, this script expects your computerized lexicon to be in the same format as mine. If you would like to use this script, email me and I'll try to add support for your format (assuming it's a consistent format that a computer program can easily parse; if you keep your language information in prose format, I can't help ya ;)).

Documentation

Usage

At the command line, run:

perl ./interlinear.pl < source-file.txt > interlinearized-file.html

Syntax

See the interlinearized texts in the Writing section of Arthaey's website:

   http://arthaey.mine.nu:8080/~arthaey/conlang/writing/

Configuration

You must begin each interlinear source file with a configuration section, which defines the names of the languages used and specifies where the lexicon file and the dictionary HTML page are located. For example:

   <config>
      L0 = LanguageToBeInterlinearized
      L1 = SmoothTranslationLang
      L2 = OtherSmoothTranslationLang
      dictionary = ../www/dictionary.html
      lexicon = saved-lexicon
   </config>

The language codes must be L0..L9, and L0 must be the language whose lines are to be interlinearized. You must define dictionary to be the relative path to the HTML version of your dictionary (morphemes will be linked to $dictionary#$morpheme). You must also define lexicon to be the relative path to the FreezeThaw-saved version of your lexicon.
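As a rough illustration of these rules, the sketch below reads a configuration section into a hash and reports any required settings that are missing. The subroutine names parse_config and missing_settings are made up for this example; the script itself parses the section inline in its main loop.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of interlinear.pl): collect the
# "key = value" pairs found between <config> and </config>.
sub parse_config {
   my @source_lines = @_;
   my %config;
   my $in_config = 0;

   foreach my $line (@source_lines) {
      if ($line =~ m{<config>})  { $in_config = 1; next; }
      if ($line =~ m{</config>}) { last; }
      next unless $in_config;

      # keep only well-formed "key = value" lines, trimming whitespace
      if ($line =~ /^\s*(\S+)\s*=\s*(.*?)\s*$/) {
         $config{$1} = $2;
      }
   }

   return %config;
}

# Hypothetical helper: list the required settings that are absent.
sub missing_settings {
   my %config = @_;
   return grep { !defined $config{$_} } qw(L0 dictionary lexicon);
}

my %config = parse_config(
   '<config>',
   '   L0 = LanguageToBeInterlinearized',
   '   L1 = SmoothTranslationLang',
   '   dictionary = ../www/dictionary.html',
   '</config>',
);

my @missing = missing_settings(%config);
print "missing: @missing\n";   # lexicon was left out on purpose
```

Checking for the three required settings up front gives a clearer error than a failed file read later on.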

You may optionally include extra words in a "temporary lexicon" section, before the interlinear text itself. Words defined here will override words in the lexicon defined in the config section (although only for this one text). Use the same format as for your main lexicon (which currently must be SIL Shoebox's format). Proper names are the most likely thing to be defined here. For example:

   <lexicon>
      \lx Arthei
      \ph 'Ar\Te
      \ps prop
      \ge Arthaey
   </lexicon>
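To show how such a record might end up as a hash entry, here is a minimal sketch. The real parsing is done by parse_lexicon in the Lexicon module, which is not shown on this page, so the helper name parse_shoebox and the mapping from field markers to the hash keys cxs, pos and gloss are assumptions made for illustration.

```perl
use strict;
use warnings;

# Assumed mapping from Shoebox field markers to hash keys; the real mapping
# lives in the Lexicon module and may differ.
my %field_for = (ph => 'cxs', ps => 'pos', ge => 'gloss');

# Hypothetical helper: turn Shoebox-style lines into a hash of entries.
sub parse_shoebox {
   my @record_lines = @_;
   my (%lexicon, $current);

   foreach my $line (@record_lines) {
      # each field is a backslash marker followed by its value
      next unless $line =~ /^\s*\\(\w+)\s+(.*?)\s*$/;
      my ($marker, $value) = ($1, $2);

      if ($marker eq 'lx') {          # \lx starts a new entry
         $current = $value;
         $lexicon{$current} = {};
      }
      elsif (defined $current and exists $field_for{$marker}) {
         $lexicon{$current}{ $field_for{$marker} } = $value;
      }
   }

   return %lexicon;
}

my %temp = parse_shoebox(
   '\lx Arthei',
   q{\ph 'Ar\Te},
   '\ps prop',
   '\ge Arthaey',
);

# a later hash wins when two hashes are merged, so temporary entries
# override entries of the same name from the main lexicon
my %main   = (Arthei => { gloss => 'placeholder' });
my %merged = (%main, %temp);

print "$merged{Arthei}{gloss} ($merged{Arthei}{pos})\n";
```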

Interlinear Markup

After the <config> ... </config> section comes the interlinear text. These lines begin with one of the Ln language codes defined in the configuration section, followed by a colon and whitespace, and then the text itself. For the L0 line, you will further mark the text up so that it can be properly broken down into morphemes and automatically glossed.
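The line format can be sketched as a single pattern match. The helper name parse_text_line and the sample sentence are invented for this example; lines that do not fit the format (such as // comments, which the script skips) yield an empty list.

```perl
use strict;
use warnings;

# Hypothetical helper: split "Ln: text" into its language code and text.
# Returns an empty list for lines that do not follow the format.
sub parse_text_line {
   my $line = shift;
   return unless $line =~ /^(L\d):\s+(.*?)\s*$/;
   return ($1, $2);
}

my ($code, $text) = parse_text_line("L1: An invented smooth translation.\n");
print "$code => $text\n";
```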

Place | at the end of each morpheme. To select a morpheme's sense that isn't the first one, append the sense's number directly after the pipe. Thus, bat and bat|1 will gloss to the first meaning of the word bat, and bat|2 will gloss to the second meaning of the word bat. The order of words' senses is determined by order of entry in the lexicon.
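The sense-selection rule can be sketched as follows. The helpers split_sense and pick_gloss and the two glosses for bat are invented for this example; only the '??' fallback is taken from the script.

```perl
use strict;
use warnings;

# Hypothetical helper: separate a marked-up morpheme from its sense number.
# A missing number (or 0) means the first sense.
sub split_sense {
   my $token = shift;
   my ($morpheme, $digits) = $token =~ /^([^|]+)\|?(\d*)$/
      or return;
   my $sense = $digits eq '' ? 0 : $digits + 0;
   return ($morpheme, $sense || 1);
}

# Hypothetical helper: pick a gloss from a list kept in lexicon order,
# falling back to '??' as the script does for unknown glosses.
sub pick_gloss {
   my ($glosses, $sense) = @_;
   my $gloss = $glosses->[$sense - 1];
   return defined $gloss ? $gloss : '??';
}

# invented lexicon entry, for illustration only
my @bat_glosses = ('flying mammal', 'wooden club');

foreach my $token ('bat|', 'bat|1', 'bat|2', 'bat|3') {
   my ($morpheme, $sense) = split_sense($token);
   print "$token => ", pick_gloss(\@bat_glosses, $sense), "\n";
}
```

Sense numbers start at 1 in the markup, so the sketch subtracts one before indexing the list.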

Surround with { and } characters that belong in the final orthographic version but that aren't part of the dictionary form of the morpheme. These characters will be displayed in the final version, but will not be used to look up the glosses of morphemes. (Punctuation marks will need to be included in curly braces, for example.)

Add parts of morphemes that have been left out of the final orthographic version with [ and ]. These characters will not be displayed in the final version, but they will be used to look up the glosses of morphemes.
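A small sketch may make the two kinds of brackets easier to picture. The helper split_forms and both sample words are invented; it ignores the pipe and the other markup, and it always lowercases the lookup form.

```perl
use strict;
use warnings;

# Hypothetical helper: derive the displayed (orthographic) form and the
# dictionary lookup form from one marked-up word.
sub split_forms {
   my $marked = shift;
   my ($ortho, $morph) = ('', '');
   my $mode = 'both';

   foreach my $char (split //, $marked) {
      if    ($char eq '{')                 { $mode = 'ortho'; }
      elsif ($char eq '[')                 { $mode = 'morph'; }
      elsif ($char eq '}' or $char eq ']') { $mode = 'both'; }
      else {
         $ortho .= $char unless $mode eq 'morph';   # [...] is never shown
         $morph .= $char unless $mode eq 'ortho';   # {...} is never looked up
      }
   }

   return ($ortho, lc $morph);
}

foreach my $marked ('Hello{,}', 'gon[e]') {
   my ($ortho, $morph) = split_forms($marked);
   print "$marked: shown as '$ortho', looked up as '$morph'\n";
}
```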

A # will become a newline (HTML <br/>), and two ## together will become a new paragraph tag (HTML <p/>) in the big orthographic version.
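As a sketch of the effect, the helper below (a made-up name) converts the marks directly. The script itself works in two steps, turning each # into a newline first and the newlines into tags later, but the visible result should be the same for simple cases like this one.

```perl
use strict;
use warnings;

# Hypothetical helper: turn break marks into HTML tags. Pairs must be
# handled first so that ## is not read as two single breaks.
sub breaks_to_html {
   my $text = shift;
   $text =~ s/##/<p\/>/g;
   $text =~ s/#/<br\/>/g;
   return $text;
}

print breaks_to_html('first line#second line##new paragraph'), "\n";
```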

To preserve the case of a particular word, prefix it with ^. This is most useful for proper names.
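In other words, lookup forms are lowercased unless the caret says otherwise. A minimal sketch, with dictionary_form as a made-up helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: compute the form used for dictionary lookup.
sub dictionary_form {
   my $word = shift;
   return $word if $word =~ s/^\^//;   # caret found: strip it, keep case
   return lc $word;                    # otherwise look up in lowercase
}

print dictionary_form('^Arthei'), "\n";
print dictionary_form('Hello'), "\n";
```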

Any HTML (or anything, really) between < and > will be passed verbatim to the big orthographic version of the text, although not to the line-by-line orthographic version.

Links to each line's line number are automatically placed at the very beginning of each line. Normally, this is what you want. Sometimes, however, you will want more explicit control over the link's placement: for example, HTML headings will otherwise cause a line break between the link and the line itself. Anywhere a @ appears in a line, it will be replaced by the link to the line number.
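The placement rule can be sketched like this. The helper name place_line_link is invented; the link markup is copied from the script's output.

```perl
use strict;
use warnings;

# Hypothetical helper: put the line-number link where the @ is, or at the
# very beginning of the line when no @ is present.
sub place_line_link {
   my ($text, $line_num) = @_;
   my $link = '<span class="ref-num">'
            . "[<a href=\"#line$line_num\">$line_num</a>]</span> ";

   return $text if $text =~ s/\@/$link/;   # explicit placement
   return $link . $text;                   # default placement
}

print place_line_link('<h3>@A heading</h3>', 1), "\n";
print place_line_link('An ordinary line', 2), "\n";
```

With the @ inside the heading, the link ends up within the h3 element instead of on a line of its own above it.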

Source

#!/usr/bin/perl
use strict;
use warnings;

###############################################################################
# INTERLINEARIZER
###############################################################################
# By Arthaey Angosii <arthaey@yahoo.com>, 27 December 2004.
###############################################################################

use encoding "utf8";
use HTML::Entities;
use Encode;
use CXS;                   # Henrik Theiling's CXS <> IPA module
use FreezeThaw qw/thaw/;   # to read in the saved lexicon-as-Perl-hash
use Data::Dumper::Simple;  # for debugging, during development
use lib '/home/arthaey/www/conlang/lexicon'; use Lexicon;


###############################################################################
# VARIABLES
###############################################################################

use enum qw(NONE ORTHO_ONLY MORPH_ONLY POSSIBLE_BREAK VERBATIM ESCAPE);

my @lines;                  # contains info for each line of the text
my $line_num = 0;           # keeps track of the current line number
my $parsing_config = 1;     # flags whether the config section is being parsed
my $parsing_lexemes = 0;    # flags whether the local lexicon is being parsed
my %config;                 # options in the <config> ... </config> section
my @langs;                  # mapping of language names and L0..Ln
my %used_pos;               # parts of speech used in the text
my %lexicon;                # all morphemes that can be automatically glossed
my $temp_lex = '.tmp.lex';  # lexemes defined only for this text

my %pos = (              # parts of speech names, for the PoS legend
   adj  => 'adjective',
   adv  => 'adverb',
   asp  => 'aspect',
   conj => 'conjunction',
   cop  => 'copula',
   mi   => 'modifier',
   n    => 'noun',
   opt  => 'optative',
   part => 'particle',
   phr  => 'phrase',
   pl   => 'plural',
   pos  => 'possessive',
   pro  => 'pro-form',
   prop => 'proper name',
   prsn => 'person',
   tns  => 'tense',
   v    => 'verb',
);


###############################################################################
# SUBROUTINES
###############################################################################

# Usage:
#    $updated = update_flag(\$flag, $character);
#  
# Updates the value of the flag that determines whether parsed text belongs in
# the orthographic section, the morphemic break-down section, or both. It then
# returns whether the flag was updated. Note that it will report an update even
# if the "change" was to make the flag the same value as it already was.
#
sub update_flag($$) {
   my $flag = shift;
   my $char = shift;
   my $updated = 1;

   if ($char eq '[') {
      $$flag = MORPH_ONLY;
   }
   elsif ($char eq ']') {
      $$flag = NONE;
   }
   elsif ($char eq '{') {
      $$flag = ORTHO_ONLY;
   }
   elsif ($char eq '}') {
      $$flag = NONE;
   }
   else {
      $updated = 0;
   }

   return $updated;
}

# Usage:
#    $normalized_word = normalize_lexeme($word);
#
# Decodes the $word from UTF-8 and encodes any non-ASCII or HTML-unsafe
# characters as HTML entities.
#
sub normalize_lexeme($) {
   my $word = shift;

   utf8::decode($word);
   $word = encode_entities($word);

   return $word;
}

# Usage:
#    load_lexicon();
# 
# Reads the dictionary information into the %lexicon hash. It will only do this
# if the lexicon is not already loaded. Thus, it is safe to call this function
# multiple times; it will only do the loading work the first time, and do
# nothing thereafter.
#
sub load_lexicon() {
   if (not %lexicon) {
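      my $lexicon_file = $config{'lexicon'};

      # read the saved lexicon directly instead of shelling out to cat
      open my $fh, '<', $lexicon_file
         or die "Can't read $lexicon_file: $!\n";
      my $data = do { local $/; <$fh> };
      close $fh;
      %lexicon = thaw $data;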

      # add temporary lexemes, only valid for this text
      parse_lexicon($temp_lex, \%lexicon);
   }
}

# Usage:
#    $gloss = gloss($word);
#    $gloss = gloss($word, $sense);
#
# Looks up the gloss of the (normalized) $word in question, loading the lexicon
# if necessary. If no sense is given, it will return the first sense (or the
# only sense, if the word does not have multiple senses). If it cannot find the
# word in the lexicon, or it does not have a gloss defined for the sense
# requested, it will return '??'.
#
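sub gloss($;$) {
   my $word = normalize_lexeme(shift);
   my $sense = shift;
   my $return;

   # load first: reading %lexicon before this would autovivify an entry and
   # make load_lexicon() think the lexicon was already loaded
   load_lexicon();
   my $val = $lexicon{$word}{'gloss'};

   if (ref $val eq 'ARRAY') {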
      $return = $$val[$sense ? $sense - 1 : 0];
   }
   else {
      $return = $val;
   }

   # if no value found, check subentries :(
   #foreach my $main (sort keys %lexicon) {
      # NEED TO USE THE XML LEXICON TO GET SUBENTRIES TO WORK
   #}

   return ($return ? $return : '??');
}

# Usage:
#    $ipa = ipa($word);
#
# Looks up the CXS pronunciation of the (normalized) $word and returns the
# Unicode IPA version of it.
#
sub ipa($) {
   my $word = normalize_lexeme(shift);
   my $return;
   load_lexicon();

   if (not ($return = $lexicon{$word}{'cxs'})) {
      $return = '';
   }

   return cxs2ipa($return);
}

# Usage:
#    $PoS = part_of_speech($word);
#    $PoS = part_of_speech($word, $sense);
# 
# Looks up the part of speech of the (normalized) $word in question, loading
# the lexicon if necessary. If no sense is given, it will return the first
# sense (or the only sense, if the word does not have multiple senses). If it
# cannot find the word in the lexicon, or it does not have a part of speech
# defined for the sense requested, it will return the empty string.
#
sub part_of_speech($;$) {
   my $word = normalize_lexeme(shift);
   my $sense = shift;
   my $return;
   load_lexicon();

   # if a sense number was given, use that
   if ($sense and ref $lexicon{$word}{'pos'} eq 'ARRAY') {
      $return = $lexicon{$word}{'pos'}[$sense-1];
      
      # if the senses share part of speech, use the shared value
      if (not $return) {
         $return = $lexicon{$word}{'pos'};
      }
   }
   # otherwise, use the first sense
   else {
      my $val = $lexicon{$word}{'pos'};
      if (ref $val eq 'ARRAY') {
         $return = $$val[0];
      }
      else {
         $return = $val;
      }
   }

   return ($return ? $return : '');
}

# Usage:
#    print_lexicon_lookup(@morphemes, $CSS_class, \&lookup_subroutine);
#    print_lexicon_lookup(@morphemes, $CSS_class, \&lookup_subroutine, @senses);
#
# Prints to STDOUT a table row with class=$CSS_class and each element in
# @morphemes as its own table cell. The value in the cell will be the return
# value of lookup_subroutine($morpheme) or lookup_subroutine($morpheme, $sense)
# (depending on whether sense numbers were passed in).
#
sub print_lexicon_lookup($$$;$) {
   my $words = shift;
   my $class = shift;
   my $sub = shift;
   my $sense = shift;
   my $word_num = 0;

   print "<tr class=\"$class\">\n";
   foreach my $word (@$words) {
      print "\t<td>",
            ($sense ? &$sub($word, $$sense[$word_num++]) : &$sub($word)),
            "</td>\n";
   }
   print "</tr>\n";
}

# Usage:
#    $clean_text = remove_html($dirty_text);
#
# Strips anything between '<' and '>'. Very dumb function, but does a simple
# job simply.
#
sub remove_html($) {
   my $html = shift;
   my $return = '';
   my $flag = VERBATIM;

   foreach my $char (split '', $html) {
      if ($char eq '<') {
         $flag = NONE;
      }
      elsif ($char eq '>') {
         $flag = VERBATIM;
      }
      elsif ($flag == VERBATIM) {
         $return .= $char;
      }
   }

   return $return;
}


# Usage:
#    $needed_custom = custom_line_num($ortho_array_ref, $line_link);
#
# Replaces the literal string '@@LINE@@' with a link to the current line
# number at that location in the line.
sub custom_line_num($$) {
   my $ortho = shift;
   my $line_link = shift;
   my $return = 0;

   foreach my $word (@$$ortho) {
      if ($word =~ s/@\@LINE@@/$line_link/) {
         $return = 1;
         last;
      }
   }

   return $return;
}


###############################################################################
# MAIN PROGRAM
###############################################################################

open TEMP_LEX, '>', "$temp_lex" or die "Can't create $temp_lex: $!\n";

# PARSE EACH LINE OF THE SOURCE FILE
while (<>) {
   # skip blank lines and comments
   if (/^\/\/|^\s*$/) {
      # do nothing
   }
   # parse configuration options
   elsif ($parsing_config) {
      # record any "key = value" line, such as "Ln = Lang-name"; lines that
      # do not match (the <config> tags themselves) are ignored
      if (/^\s*(\S+)\s*=\s*(.*?)\s*$/) {
         $config{$1} = $2;
      }
      $parsing_config = 0 if /<\/config>/;
   }
   elsif (/<lexicon>/) {
      $parsing_lexemes = 1;
   }
   elsif (/<\/lexicon>/) {
      $parsing_lexemes = 0;
   }
   elsif ($parsing_lexemes) {
      print TEMP_LEX $_;
   }
   # minimal parsing on all lines except the L0 lines, which need extra parsing
   elsif ($_ !~ /^L0:\s+/) {
      /^(\S+):\s+(.*)\s*/; # find any "Ln: Smooth translation" lines
      my ($lang, $line) = ($1, $2);
      $line =~ s/#/\n/g;
      $line =~ s'@'@@LINE@@';
      ${$lines[$line_num]}{$lang} = [$line];
   }
   # parse the interlinear lines
   else {
      my ($ortho, $morph, $gloss);
      my $colspan = 1;
      my $keep_caps = 0;
      my $flag = NONE;
      my $prev_flag = NONE;

      push @lines, {ortho => [], morph => [], gloss => [], colspan => []};

      /^L0:\s+(.*)\s+/;
      my @chars = split '', $1;
      my $char;

      $line_num++;

      foreach $char (@chars) {
         if ($char eq "\\") {
            $prev_flag = $flag;
            $flag = ESCAPE;
         }
         elsif ($flag == ESCAPE) {
            $flag = $prev_flag;
            $ortho .= $char unless $flag == MORPH_ONLY;
            $morph .= $char unless $flag == ORTHO_ONLY;
         }
         elsif ($char eq '>') {
            $flag = $prev_flag;
            $ortho .= $char;
         }
         elsif ($flag == VERBATIM) {
            $ortho .= $char;
         }
         elsif ($char eq '<') {
            $prev_flag = $flag;
            $flag = VERBATIM;
            $ortho .= $char;
         }
         elsif ($char eq '#') {
            $ortho .= "\n";
         }
         elsif ($char eq '@') {
            $ortho .= '@@LINE@@';
         }
         elsif ($char eq '|') {
            $flag = POSSIBLE_BREAK;

            $morph =~ s/^\s*//;
            $morph = lc $morph unless $keep_caps;
            push @{$lines[$line_num]{'morph'}}, $morph;
            $morph = '';
         }
         elsif ($flag == POSSIBLE_BREAK) {
            if ($char =~ /\s/) {
               $ortho =~ s/^\s*//;
               push @{$lines[$line_num]{'ortho'}}, $ortho;
               push @{$lines[$line_num]{'colspan'}}, $colspan;

               # reset values
               $ortho = '';
               $colspan = 1;
               $keep_caps = 0;
               $flag = NONE;
            }
            elsif ($char =~ /\d/) {
               # shouldn't reset the flag here
            }
            elsif (++$colspan and not update_flag(\$flag, $char)) {
               if ($char eq '^') {
                  $keep_caps = 1;
               }
               else {
                  $ortho .= $char unless $flag == MORPH_ONLY;
                  $char = lc $char unless $keep_caps;
                  $morph .= $char unless $flag == ORTHO_ONLY;
               }
               $flag = NONE;
            }

            push @{$lines[$line_num]{'sense'}}, ($char =~ /\d/ ? 0+$char : 0);
         }
         elsif ($char eq '^') {
            $keep_caps = 1;
         }
         elsif (not update_flag(\$flag, $char)) {
            if ($char eq '^') {
               $keep_caps = 1;
            }
            else {
               $ortho .= $char unless $flag == MORPH_ONLY;
               $char = lc $char unless $keep_caps;
               $morph .= $char unless $flag == ORTHO_ONLY;
            }
         }
      } # end each character

      # do final processing of the last morpheme
      if ($flag == POSSIBLE_BREAK) {
         $ortho =~ s/^\s*//;
         $morph =~ s/^\s*//;
         $morph = lc $morph unless $keep_caps;

         push @{$lines[$line_num]{'ortho'}}, $ortho;
         push @{$lines[$line_num]{'morph'}}, $morph unless $morph =~ /^\s*$/;
         push @{$lines[$line_num]{'colspan'}}, $colspan;
         push @{$lines[$line_num]{'sense'}},
            ($char and $char =~ /\d/ ? 0+$char : 0);
      }
   }
} # end each line

close TEMP_LEX;


# MAKE AN ARRAY OF ALL LANGUAGES USED IN THIS FILE
foreach my $key (sort keys %config) {
   if ($key ne 'L0' and $key =~ /L\d+/) {
      push @langs, $key;
   }
}


# MAKE AN ARRAY OF ALL THE PARTS OF SPEECH USED
foreach my $line (@lines) {
   foreach my $word (@{$$line{'morph'}}) {
      my $pos = part_of_speech($word, $$line{'sense'});
      $used_pos{$pos} = 1 unless not $pos or ref $pos eq 'ARRAY';
   }
}

print "<div class=\"clearer\"></div>\n\n";
print "<h2>Orthographic</h2>\n";


# DISPLAY THE ORIGINAL ORTHOGRAPHIC VERSION
print "<div class=\"L0\">\n";
print "<h3>", $config{'L0'}, "</h3>\n";
$line_num = 0;
foreach my $line (@lines) {
   next unless $$line{'ortho'} and @{$$line{'ortho'}};

   $line_num++;
   my $line_link = '<span class="ref-num">'.
                   "[<a href=\"#line$line_num\">$line_num</a>]</span> ";

   if (not custom_line_num(\$$line{'ortho'}, $line_link)) {
      print $line_link;
   }

   foreach my $word (@{$$line{'ortho'}}) {
      my $copy = $word; # otherwise, extra breaks show up in the interlinear
      $copy =~ s/\n\n/<p\/>/g;
      $copy =~ s/\n/<br\/>/g;
      print "$copy ";
   }
}
print "</div>\n";


# DISPLAY EACH FULL SMOOTH TRANSLATION
foreach my $lang (@langs) {
   print "<div class=\"$lang\">\n";
   print "<h3>", $config{$lang}, "</h3>\n";
   $line_num = 0;
   foreach my $line (@lines) {
      next unless $$line{'ortho'} and @{$$line{'ortho'}};

      $line_num++;
      my $line_link = '<span class="ref-num">'.
                      "[<a href=\"#line$line_num\">$line_num</a>]</span> ";

      if (not custom_line_num(\$$line{$lang}, $line_link)) {
         print $line_link;
      }

      foreach my $word (@{$$line{$lang}}) {
         my $copy = $word; # otherwise, extra breaks show up in the interlinear
         $copy =~ s/\n\n/<p\/>/g;
         $copy =~ s/\n/<br\/>/g;
         print "$copy ";
      }
   }
   print "</div>\n";
}


# DISPLAY PARTS OF SPEECH LEGEND
print "<div class=\"clearer\"></div>\n",
      "<div class=\"pos_legend\">\n",
      "<h2>Parts of Speech Legend</h2>\n\n<dl>\n";
foreach my $pos (sort keys %used_pos) {
   my $full_pos = $pos{$pos};
   $full_pos = '&nbsp;' unless $full_pos;
   print "<dt>$pos</dt>\n<dd>$full_pos</dd>\n";
}
print "</dl>\n</div>\n";

print <<HTML_END;
<h2>Interlinear</h2>

<table class="legend">
   <tr class="ortho"><td>orthographic version</td></tr>
   <tr class="morph"><td>morphemic breakdown</td></tr>
   <tr class="ipa"><td>IPA pronunciation</td></tr>
   <tr class="pos"><td>part of speech</td></tr>
HTML_END

foreach my $lang (@langs) {
   print "\t<tr class=\"$lang\"><td><span class=\"$lang\">",
         $config{$lang},
         " translation</span></td></tr>\n";
}
print "</table>\n";


# DISPLAY EACH INTERLINEARIZED LINE OF THE TEXT
$line_num = 0;
foreach my $line (@lines) {
   next unless $$line{'ortho'} and @{$$line{'ortho'}};
   $line_num++;

   print "<a name=\"line$line_num\"> </a>";
   print "<table class=\"interlinear\">\n";

   # table row for orthographic representation
   my $col_num = 0;
   print "<tr class=\"ortho\">\n";
   foreach my $word (@{$$line{'ortho'}}) {
      print "\t<td",
            " colspan='", @{$$line{'colspan'}}[$col_num++], "'>",
            remove_html($word), "</td>\n";
   }
   print "</tr>\n";

   # table row for morphemic breakdown
   print "<tr class=\"morph\">\n";
   foreach my $word (@{$$line{'morph'}}) {
      print "\t<td><a href=\"",
            $config{'dictionary'},
            "#$word\">$word</a></td>\n";
   }
   print "</tr>\n";

   print_lexicon_lookup($$line{'morph'}, 'ipa', \&ipa);
   print_lexicon_lookup($$line{'morph'}, 'pos', \&part_of_speech, $$line{'sense'});
   print_lexicon_lookup($$line{'morph'}, 'gloss', \&gloss, $$line{'sense'});

   # table row for smooth translation
   foreach my $lang (@langs) {
      print "<tr class=\"$lang\">\n";
      foreach my $trans (@{$$line{$lang}}) {
         print "\t<td colspan=\"",
               scalar @{$$line{'morph'}},
               "\"><span class=\"$lang\">",
               remove_html($trans),
               "</span></td>\n";
      }
      print "</tr>\n";
   }

   print "</table>\n";
}

# DELETE THE TEMPORARY FILES
unlink $temp_lex;


###############################################################################
# END INTERLINEARIZER
###############################################################################

__END__

=head1 Interlinearizer

By Arthaey Angosii <arthaey@yahoo.com>

=head2 Usage

At the command line, run:

C<perl ./interlinear.pl E<lt> source-file.txt E<gt> interlinearized-file.html>

=head2 Syntax

See the interlinearized texts in the Writing section of Arthaey's website:

   http://arthaey.mine.nu:8080/~arthaey/conlang/writing/

=head3 Configuration

You must begin each interlinear source file with a configuration section, which
defines the names of the languages used and specifies where the lexicon file
and the dictionary HTML page are located. For example:

   <config>
      L0 = LanguageToBeInterlinearized
      L1 = SmoothTranslationLang
      L2 = OtherSmoothTranslationLang
      dictionary = ../www/dictionary.html
      lexicon = saved-lexicon
   </config>

The language codes must be L0..L9, and L0 must be the language whose lines are
to be interlinearized. You must define C<dictionary> to be the relative path to
the HTML version of your dictionary (morphemes will be linked to
C<$dictionary#$morpheme>). You must also define C<lexicon> to be the relative path
to the FreezeThaw-saved version of your lexicon.

You may optionally include extra words in a "temporary lexicon" section, before
the interlinear text itself. Words defined here will override words in the
lexicon defined in the C<config> section (although only for this one text).
Use the same format as for your main lexicon (which currently must be SIL
Shoebox's format). Proper names are the most likely thing to be defined here.
For example:

   <lexicon>
      \lx Arthei
      \ph 'Ar\Te
      \ps prop
      \ge Arthaey
   </lexicon>

=head3 Interlinear Markup

After the E<lt>configE<gt> ... E<lt>/configE<gt> section comes the interlinear
text. These lines begin with one of the Ln language codes defined in the
configuration section, followed by a colon and whitespace, and then the text
itself. For the L0 line, you will further mark the text up so that it can be
properly broken down into morphemes and automatically glossed.

Place C<|> at the end of each morpheme. To select a morpheme's sense that
isn't the first one, append the sense's number directly after the pipe. Thus,
C<bat> and C<bat|1> will gloss to the first meaning of the word I<bat>, and
C<bat|2> will gloss to the second meaning of the word I<bat>. The order of
words' senses is determined by order of entry in the lexicon.

Surround with C<{> and C<}> characters that belong in the final orthographic
version but that aren't part of the dictionary form of the morpheme. These
characters will be displayed in the final version, but will not be used to
look up the glosses of morphemes. (Punctuation marks will need to be included
in curly braces, for example.)

Add parts of morphemes that have been left out of the final orthographic
version with C<[> and C<]>. These characters will not be displayed in the final
version, but they will be used to look up the glosses of morphemes.

A C<#> will become a newline (HTML C<< <br/> >>), and two C<##> together will
become a new paragraph tag (HTML C<< <p/> >>) in the big orthographic version.

To preserve the case of a particular word, prefix it with C<^>. This is most
useful for proper names.

Any HTML (or anything, really) between C<E<lt>> and C<E<gt>> will be passed
verbatim to the big orthographic version of the text, although not to the
line-by-line orthographic version.

Links to each line's line number are automatically placed at the very beginning
of each line. Normally, this is what you want. Sometimes, however, you will
want more explicit control over the link's placement: for example, HTML
headings will otherwise cause a line break between the link and the line
itself. Anywhere a C<@> appears in a line, it will be replaced by the link to
the line number.

=cut