Download interlinear.pl.
Currently, this script expects your computerized lexicon to be in the same format as mine. If you would like to use this script with a different format, email me and I'll try to add support for it (assuming it's a consistent format that a computer program can easily parse; if you keep your language information in prose format, I can't help ya ;)).
Documentation
Usage
At the command line, run:
perl ./interlinear.pl < source-file.txt > interlinearized-file.html
Syntax
See the interlinearized texts in the Writing section of Arthaey's website:
http://arthaey.mine.nu:8080/~arthaey/conlang/writing/
Configuration
You must begin each interlinear source file with a configuration section, which defines the names of the languages used and specifies where the lexicon file and the dictionary HTML page are located. For example:
<config>
L0 = LanguageToBeInterlinearized
L1 = SmoothTranslationLang
L2 = OtherSmoothTranslationLang
dictionary = ../www/dictionary.html
lexicon = saved-lexicon
</config>
The language codes must be L0..L9, and L0 must be the language whose lines are
to be interlinearized. You must define dictionary to be the relative path to
the HTML version of your dictionary (morphemes will be linked to
$dictionary#$morpheme). You must also define lexicon to be the relative path
to the FreezeThaw-saved version of your lexicon.
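As a rough illustration of the configuration rules above, here is a minimal sketch that collects the key = value pairs between the config tags and checks the three required keys. The helper name parse_config and the sample morpheme arthei are made up for this example; they are not part of interlinear.pl.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of interlinear.pl): collect the
# "key = value" pairs found between <config> and </config>.
sub parse_config {
    my @source_lines = @_;
    my (%config, $in_config);
    foreach my $line (@source_lines) {
        if ($line =~ m{<config>})  { $in_config = 1; next; }
        if ($line =~ m{</config>}) { last; }
        next unless $in_config;
        if ($line =~ /^\s*(\S+)\s*=\s*(.*?)\s*$/) {
            $config{$1} = $2;
        }
    }
    return %config;
}

my %config = parse_config(
    '<config>',
    'L0 = LanguageToBeInterlinearized',
    'L1 = SmoothTranslationLang',
    'dictionary = ../www/dictionary.html',
    'lexicon = saved-lexicon',
    '</config>',
);

# L0, dictionary, and lexicon are the keys the documentation requires.
foreach my $key (qw(L0 dictionary lexicon)) {
    warn "missing required config key: $key\n" unless defined $config{$key};
}

# Morphemes are linked to $dictionary#$morpheme; "arthei" is a made-up morpheme.
my $morpheme_link = $config{'dictionary'} . '#' . 'arthei';
print "$morpheme_link\n";
```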
You may optionally include extra words in a "temporary lexicon" section, before
the interlinear text itself. Words defined here will override words in the
lexicon defined in the config section (although only for this one text).
Use the same format as for your main lexicon (which currently must be SIL
Shoebox's format). Proper names are the most likely thing to be defined here.
For example:
<lexicon>
\lx Arthei
\ph 'Ar\Te
\ps prop
\ge Arthaey
</lexicon>
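To make the override behaviour concrete, here is a small sketch that reads Shoebox-style fields into a hash and merges them over a main lexicon. The real parsing is done by the author's Lexicon module, which is not shown here, so the function name parse_shoebox_entries, the marker-to-key mapping, and the sample main lexicon are all assumptions.

```perl
use strict;
use warnings;

# Assumed mapping from Shoebox field markers to hash keys; the author's
# Lexicon module is not shown and may store things differently.
my %key_for = (ph => 'cxs', ps => 'pos', ge => 'gloss');

# Hypothetical parser: each field is a backslash marker, whitespace, value.
sub parse_shoebox_entries {
    my @entry_lines = @_;
    my (%entries, $headword);
    foreach my $line (@entry_lines) {
        next unless $line =~ /^\\(\w+)\s+(.*?)\s*$/;
        my ($marker, $value) = ($1, $2);
        if ($marker eq 'lx') {
            # \lx starts a new entry
            $headword = $value;
            $entries{$headword} = {};
        }
        elsif (defined $headword and exists $key_for{$marker}) {
            $entries{$headword}{ $key_for{$marker} } = $value;
        }
    }
    return %entries;
}

my %temporary = parse_shoebox_entries(
    '\lx Arthei',
    q{\ph 'Ar\Te},
    '\ps prop',
    '\ge Arthaey',
);

# Made-up main lexicon entry, to show the temporary entry taking precedence.
my %main = (Arthei => { gloss => 'placeholder', pos => 'n' });
my %merged = (%main, %temporary);   # later pairs win, so temporary overrides

print "$merged{'Arthei'}{'gloss'} ($merged{'Arthei'}{'pos'})\n";
```

Note that the merge replaces a whole entry rather than individual fields; whether the script itself merges per entry or per field is not something this sketch can confirm.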
Interlinear Markup
After the <config> ... </config> section comes the interlinear text. These lines begin with one of the Ln language codes defined in the configuration section, followed by a colon and whitespace, and then the text itself. For the L0 line, you will further mark the text up so that it can be properly broken down into morphemes and automatically glossed.
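The line format described above (language code, colon, whitespace, text) can be picked apart with one regular expression. This is only a sketch; the helper name split_source_line is hypothetical, and the script's own matching is a little looser.

```perl
use strict;
use warnings;

# Hypothetical helper: split "Ln: text" into its language code and text.
# Returns an empty list for lines that do not follow that shape.
sub split_source_line {
    my ($line) = @_;
    return unless $line =~ /^(L\d):\s+(.*?)\s*$/;
    return ($1, $2);
}

foreach my $line ('L0: bat| ', 'L1: A bat.', 'not an interlinear line') {
    my ($lang, $text) = split_source_line($line);
    if (defined $lang) {
        print "$lang => $text\n";
    }
    else {
        print "skipped: $line\n";
    }
}
```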
Place | at the end of each morpheme. To select a morpheme's sense that
isn't the first one, append the sense's number directly after the pipe. Thus,
bat and bat|1 will gloss to the first meaning of the word bat, and
bat|2 will gloss to the second meaning of the word bat. The order of
words' senses is determined by order of entry in the lexicon.
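The sense-number rule can be sketched as follows. The lexicon contents and the helper name gloss_for are invented for illustration; the '??' fallback mirrors what the script returns for a missing gloss.

```perl
use strict;
use warnings;

# Made-up lexicon: "bat" has two senses, in order of entry.
my %sample_lexicon = (
    bat => { gloss => ['flying mammal', 'wooden club'] },
    cat => { gloss => 'feline' },
);

# Hypothetical helper: take "word", "word|" or "word|N" and return the
# gloss for sense N (default: the first sense), or '??' if none is found.
sub gloss_for {
    my ($markup) = @_;
    return '??' unless $markup =~ /^([^|]+)(?:\|(\d*))?$/;
    my ($word, $sense) = ($1, $2);
    return '??' unless exists $sample_lexicon{$word};

    my $index   = $sense ? $sense - 1 : 0;   # senses are numbered from 1
    my $glosses = $sample_lexicon{$word}{'gloss'};
    my $gloss   = ref $glosses eq 'ARRAY' ? $glosses->[$index] : $glosses;
    return defined $gloss ? $gloss : '??';
}

print "$_ => ", gloss_for($_), "\n" foreach ('bat|', 'bat|1', 'bat|2', 'cat|');
```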
Surround with { and } any characters that belong in the final orthographic
version but that aren't part of the dictionary form of the morpheme. These
characters will be displayed in the final version, but will not be used to
look up the glosses of morphemes. (Punctuation marks will need to be included
in curly braces, for example.)
Add parts of morphemes that have been left out of the final orthographic
version with [ and ]. These characters will not be displayed in the final
version, but they will be used to look up the glosses of morphemes.
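A simplified sketch of how the two kinds of brackets divide a marked-up word into its displayed (orthographic) form and its lookup (morphemic) form. The function name split_markup is hypothetical, and the sketch ignores escapes, case handling, and the other markup characters that the script deals with.

```perl
use strict;
use warnings;

# Hypothetical helper: return (orthographic form, morphemic form).
#   {...} text is displayed but not looked up
#   [...] text is looked up but not displayed
sub split_markup {
    my ($text) = @_;
    my ($ortho, $morph) = ('', '');
    my $mode = 'both';
    foreach my $char (split //, $text) {
        if    ($char eq '{')                 { $mode = 'ortho'; }
        elsif ($char eq '[')                 { $mode = 'morph'; }
        elsif ($char eq '}' or $char eq ']') { $mode = 'both';  }
        else {
            $ortho .= $char unless $mode eq 'morph';
            $morph .= $char unless $mode eq 'ortho';
        }
    }
    return ($ortho, $morph);
}

foreach my $word ('bat{,}', 'ba[t]') {
    my ($ortho, $morph) = split_markup($word);
    print "$word: shown as '$ortho', looked up as '$morph'\n";
}
```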
A # will become a newline (HTML <br/>), and two ## together will
become a new paragraph tag (HTML <p/>) in the big orthographic version.
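The conversion can be sketched in three substitutions: each # becomes a newline first, then doubled newlines become paragraph tags and the remaining single newlines become line breaks. The helper name breaks_to_html is made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: turn # and ## markers into HTML breaks.
sub breaks_to_html {
    my ($text) = @_;
    $text =~ s/#/\n/g;          # every # is a newline internally
    $text =~ s/\n\n/<p\/>/g;    # ## (two newlines) starts a new paragraph
    $text =~ s/\n/<br\/>/g;     # a lone # is a line break
    return $text;
}

print breaks_to_html('first line#second line##new paragraph'), "\n";
```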
To preserve the case of a particular word, prefix it with ^. This is most
useful for proper names.
Any HTML (or anything, really) between < and > will be passed
verbatim to the big orthographic version of the text, although not to the
line-by-line orthographic version.
Links to each line's line number are automatically placed at the very beginning
of each line. Normally, this is what you want. Sometimes, however, you will
want more explicit control over the link's placement: for example, HTML
headings will otherwise cause a line break between the link and the line
itself. Anywhere a @ appears in a line, it will be replaced by the link to
the line number.
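Here is a minimal sketch of the placement rule: if the line contains an @, the link goes there; otherwise it goes in front. The helper name place_line_link is hypothetical, and the sketch works on a whole line at once, whereas the script works word by word.

```perl
use strict;
use warnings;

# Hypothetical helper: insert the line-number link at the first @,
# or at the very beginning if the line has no @.
sub place_line_link {
    my ($line, $line_num) = @_;
    my $link = qq{<span class="ref-num">[<a href="#line$line_num">$line_num</a>]</span> };
    return $line if $line =~ s/\@/$link/;
    return $link . $line;
}

print place_line_link('<h3>@A Heading</h3>', 1), "\n";
print place_line_link('An ordinary line.', 2), "\n";
```

In the heading case the link ends up inside the h3 element, so the browser no longer puts a line break between the link and the heading text.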
Source
#!/usr/bin/perl

use strict;
use warnings;

###############################################################################
# INTERLINEARIZER
###############################################################################
# By Arthaey Angosii <arthaey@yahoo.com>, 27 December 2004.
###############################################################################

use encoding "utf8";
use HTML::Entities;
use Encode;
use CXS;                     # Henrik Theiling's CXS <> IPA module
use FreezeThaw qw/thaw/;     # to read in the saved lexicon-as-Perl-hash
use Data::Dumper::Simple;    # for debugging, during development

use lib '/home/arthaey/www/conlang/lexicon';
use Lexicon;

###############################################################################
# VARIABLES
###############################################################################

use enum qw(NONE ORTHO_ONLY MORPH_ONLY POSSIBLE_BREAK VERBATIM ESCAPE);

my @lines;                   # contains info for each line of the text
my $line_num = 0;            # keeps track of the current line number
my $parsing_config = 1;      # flags whether the config section is being parsed
my $parsing_lexemes = 0;     # flags whether the local lexicon is being parsed
my %config;                  # options in the <config> ... </config> section
my @langs;                   # mapping of language names and L0..Ln
my %used_pos;                # parts of speech used in the text
my %lexicon;                 # all morphemes that can be automatically glossed
my $temp_lex = '.tmp.lex';   # lexemes defined only for this text

my %pos = (                  # parts of speech names, for the PoS legend
    adj  => 'adjective',
    adv  => 'adverb',
    asp  => 'aspect',
    conj => 'conjunction',
    cop  => 'copula',
    mi   => 'modifier',
    n    => 'noun',
    opt  => 'optative',
    part => 'particle',
    phr  => 'phrase',
    pl   => 'plural',
    pos  => 'possessive',
    pro  => 'pro-form',
    prop => 'proper name',
    prsn => 'person',
    tns  => 'tense',
    v    => 'verb',
);

###############################################################################
# SUBROUTINES
###############################################################################

# Usage:
#    $updated = update_flag(\$flag, $character);
#
# Updates the value of the flag that determines whether parsed text belongs in
# the orthographic section, the morphemic break-down section, or both. It then
# returns whether the flag was updated. Note that it will report an update even
# if the "change" was to make the flag the same value as it already was.
#
sub update_flag($$) {
    my $flag = shift;
    my $char = shift;
    my $updated = 1;

    if ($char eq '[') {
        $$flag = MORPH_ONLY;
    }
    elsif ($char eq ']') {
        $$flag = NONE;
    }
    elsif ($char eq '{') {
        $$flag = ORTHO_ONLY;
    }
    elsif ($char eq '}') {
        $$flag = NONE;
    }
    else {
        $updated = 0;
    }

    return $updated;
}

# Usage:
#    $normalized_word = normalize_lexeme($word);
#
# Decodes the $word from UTF-8 and encodes any non-ASCII or HTML-unsafe
# characters as HTML entities.
#
sub normalize_lexeme($) {
    my $word = shift;

    utf8::decode($word);
    $word = encode_entities($word);

    return $word;
}

# Usage:
#    load_lexicon();
#
# Reads the dictionary information into the %lexicon hash. It will only do this
# if the lexicon is not already loaded. Thus, it is safe to call this function
# multiple times; it will only do the loading work the first time, and do
# nothing thereafter.
#
sub load_lexicon() {
    if (not %lexicon) {
        my $lexicon_file = $config{'lexicon'};
        my $data = `cat $lexicon_file`;
        %lexicon = thaw $data;

        # add temporary lexemes, only valid for this text
        parse_lexicon($temp_lex, \%lexicon);
    }
}

# Usage:
#    $gloss = gloss($word);
#    $gloss = gloss($word, $sense);
#
# Looks up the gloss of the (normalized) $word in question, loading the lexicon
# if necessary. If no sense is given, it will return the first sense (or the
# only sense, if the word does not have multiple senses). If it cannot find the
# word in the lexicon, or it does not have a gloss defined for the sense
# requested, it will return '??'.
#
sub gloss($;$) {
    my $word = normalize_lexeme(shift);
    my $sense = shift;
    my $return;

    # load before the lookup: looking up first would autovivify an entry in an
    # empty %lexicon, and load_lexicon() would then skip the loading
    load_lexicon();
    my $val = $lexicon{$word}{'gloss'};

    if (ref $val eq 'ARRAY') {
        $return = $$val[$sense ? $sense - 1 : 0];
    }
    else {
        $return = $val;
    }

    # if no value found, check subentries :(
    #foreach my $main (sort keys %lexicon) {
    #    NEED TO USE THE XML LEXICON TO GET SUBENTRIES TO WORK
    #}

    return ($return ? $return : '??');
}

# Usage:
#    $ipa = ipa($word);
#
# Looks up the CXS pronunciation of the (normalized) $word and returns the
# Unicode IPA version of it.
#
sub ipa($) {
    my $word = normalize_lexeme(shift);
    my $return;

    load_lexicon();

    if (not ($return = $lexicon{$word}{'cxs'})) {
        $return = '';
    }

    return cxs2ipa($return);
}

# Usage:
#    $PoS = part_of_speech($word);
#    $PoS = part_of_speech($word, $sense);
#
# Looks up the part of speech of the (normalized) $word in question, loading
# the lexicon if necessary. If no sense is given, it will return the first
# sense (or the only sense, if the word does not have multiple senses). If it
# cannot find the word in the lexicon, or it does not have a part of speech
# defined for the sense requested, it will return the empty string.
#
sub part_of_speech($;$) {
    my $word = normalize_lexeme(shift);
    my $sense = shift;
    my $return;

    load_lexicon();

    # if a sense number was given, use that
    if ($sense and ref $lexicon{$word}{'pos'} eq 'ARRAY') {
        $return = $lexicon{$word}{'pos'}[$sense-1];

        # if the senses share part of speech, use the shared value
        if (not $return) {
            $return = $lexicon{$word}{'pos'};
        }
    }
    # otherwise, use the first sense
    else {
        my $val = $lexicon{$word}{'pos'};
        if (ref $val eq 'ARRAY') {
            $return = $$val[0];
        }
        else {
            $return = $val;
        }
    }

    return ($return ? $return : '');
}

# Usage:
#    print_lexicon_lookup(\@morphemes, $CSS_class, \&lookup_subroutine);
#    print_lexicon_lookup(\@morphemes, $CSS_class, \&lookup_subroutine, \@senses);
#
# Prints to STDOUT a table row with class=$CSS_class and each element in
# @morphemes as its own table cell. The value in the cell will be the return
# value of lookup_subroutine($morpheme) or lookup_subroutine($morpheme, $sense)
# (depending on whether sense numbers were passed in).
#
sub print_lexicon_lookup($$$;$) {
    my $words = shift;
    my $class = shift;
    my $sub   = shift;
    my $sense = shift;
    my $word_num = 0;

    print "<tr class=\"$class\">\n";
    foreach my $word (@$words) {
        print "\t<td>",
              ($sense ? &$sub($word, $$sense[$word_num++]) : &$sub($word)),
              "</td>\n";
    }
    print "</tr>\n";
}

# Usage:
#    $clean_text = remove_html($dirty_text);
#
# Strips anything between '<' and '>'. Very dumb function, but does a simple
# job simply.
#
sub remove_html($) {
    my $html = shift;
    my $return = '';
    my $flag = VERBATIM;

    foreach my $char (split '', $html) {
        if ($char eq '<') {
            $flag = NONE;
        }
        elsif ($char eq '>') {
            $flag = VERBATIM;
        }
        elsif ($flag == VERBATIM) {
            $return .= $char;
        }
    }

    return $return;
}

# Usage:
#    $needed_custom = custom_line_num($ortho_array_ref, $line_link);
#
# Replaces the literal string '@@LINE@@' with a link to the current line
# number at that location in the line.
sub custom_line_num($$) {
    my $ortho = shift;
    my $line_link = shift;
    my $return = 0;

    foreach my $word (@$$ortho) {
        if ($word =~ s/@\@LINE@@/$line_link/) {
            $return = 1;
            last;
        }
    }

    return $return;
}

###############################################################################
# MAIN PROGRAM
###############################################################################

open TEMP_LEX, '>', "$temp_lex" or die "Can't create $temp_lex: $!\n";

# PARSE EACH LINE OF THE SOURCE FILE
while (<>) {
    # skip blank lines and comments
    if (/^\/\/|^\s*$/) {
        # do nothing
    }
    # parse configuration options
    elsif ($parsing_config) {
        /\s*(\S+)\s*=\s*(.*)\s*/;    # find any "Ln = Lang-name" lines
        no warnings; # "uninitialized" hash element, b/c it's first created here
        $config{$1} = $2;
        use warnings;
        $parsing_config = 0 if /<\/config>/;
    }
    elsif (/<lexicon>/) {
        $parsing_lexemes = 1;
    }
    elsif (/<\/lexicon>/) {
        $parsing_lexemes = 0;
    }
    elsif ($parsing_lexemes) {
        print TEMP_LEX $_;
    }
    # minimal parsing on all lines except the L0 lines, which need extra parsing
    elsif ($_ !~ /^L0:\s+/) {
        /^(\S+):\s+(.*)\s*/;    # find any "Ln: Smooth translation" lines
        my ($lang, $line) = ($1, $2);
        $line =~ s/#/\n/g;
        $line =~ s'@'@@LINE@@';
        ${$lines[$line_num]}{$lang} = [$line];
    }
    # parse the interlinear lines
    else {
        my ($ortho, $morph, $gloss);
        my $colspan = 1;
        my $keep_caps = 0;
        my $flag = NONE;
        my $prev_flag = NONE;

        push @lines, {ortho => [], morph => [], gloss => [], colspan => []};

        /^L0:\s+(.*)\s+/;
        my @chars = split '', $1;
        my $char;
        $line_num++;

        foreach $char (@chars) {
            if ($char eq "\\") {
                $prev_flag = $flag;
                $flag = ESCAPE;
            }
            elsif ($flag == ESCAPE) {
                $flag = $prev_flag;
                $ortho .= $char unless $flag == MORPH_ONLY;
                $morph .= $char unless $flag == ORTHO_ONLY;
            }
            elsif ($char eq '>') {
                $flag = $prev_flag;
                $ortho .= $char;
            }
            elsif ($flag == VERBATIM) {
                $ortho .= $char;
            }
            elsif ($char eq '<') {
                $prev_flag = $flag;
                $flag = VERBATIM;
                $ortho .= $char;
            }
            elsif ($char eq '#') {
                $ortho .= "\n";
            }
            elsif ($char eq '@') {
                $ortho .= '@@LINE@@';
            }
            elsif ($char eq '|') {
                $flag = POSSIBLE_BREAK;
                $morph =~ s/^\s*//;
                $morph = lc $morph unless $keep_caps;
                push @{$lines[$line_num]{'morph'}}, $morph;
                $morph = '';
            }
            elsif ($flag == POSSIBLE_BREAK) {
                if ($char =~ /\s/) {
                    $ortho =~ s/^\s*//;
                    push @{$lines[$line_num]{'ortho'}}, $ortho;
                    push @{$lines[$line_num]{'colspan'}}, $colspan;

                    # reset values
                    $ortho = '';
                    $colspan = 1;
                    $keep_caps = 0;
                    $flag = NONE;
                }
                elsif ($char =~ /\d/) {
                    # shouldn't reset the flag here
                }
                elsif (++$colspan and not update_flag(\$flag, $char)) {
                    if ($char eq '^') {
                        $keep_caps = 1;
                    }
                    else {
                        $ortho .= $char unless $flag == MORPH_ONLY;
                        $char = lc $char unless $keep_caps;
                        $morph .= $char unless $flag == ORTHO_ONLY;
                    }
                    $flag = NONE;
                }

                push @{$lines[$line_num]{'sense'}},
                     ($char =~ /\d/ ? 0+$char : 0);
            }
            elsif ($char eq '^') {
                $keep_caps = 1;
            }
            elsif (not update_flag(\$flag, $char)) {
                if ($char eq '^') {
                    $keep_caps = 1;
                }
                else {
                    $ortho .= $char unless $flag == MORPH_ONLY;
                    $char = lc $char unless $keep_caps;
                    $morph .= $char unless $flag == ORTHO_ONLY;
                }
            }
        } # end each character

        # do final processing of the last morpheme
        if ($flag == POSSIBLE_BREAK) {
            $ortho =~ s/^\s*//;
            $morph =~ s/^\s*//;
            $morph = lc $morph unless $keep_caps;
            push @{$lines[$line_num]{'ortho'}}, $ortho;
            push @{$lines[$line_num]{'morph'}}, $morph unless $morph =~ /^\s*$/;
            push @{$lines[$line_num]{'colspan'}}, $colspan;
            push @{$lines[$line_num]{'sense'}},
                 ($char and $char =~ /\d/ ? 0+$char : 0);
        }
    }
} # end each line

close TEMP_LEX;

# MAKE AN ARRAY OF ALL LANGUAGES USED IN THIS FILE
foreach my $key (sort keys %config) {
    if ($key ne 'L0' and $key =~ /L\d+/) {
        push @langs, $key;
    }
}

# MAKE AN ARRAY OF ALL THE PARTS OF SPEECH USED
foreach my $line (@lines) {
    foreach my $word (@{$$line{'morph'}}) {
        my $pos = part_of_speech($word, $$line{'sense'});
        $used_pos{$pos} = 1 unless not $pos or ref $pos eq 'ARRAY';
    }
}

print "<div class=\"clearer\"></div>\n\n";
print "<h2>Orthographic</h2>\n";

# DISPLAY THE ORIGINAL ORTHOGRAPHIC VERSION
print "<div class=\"L0\">\n";
print "<h3>", $config{'L0'}, "</h3>\n";
$line_num = 0;
foreach my $line (@lines) {
    next unless $$line{'ortho'} and @{$$line{'ortho'}};
    $line_num++;
    my $line_link = '<span class="ref-num">'.
                    "[<a href=\"#line$line_num\">$line_num</a>]</span> ";

    if (not custom_line_num(\$$line{'ortho'}, $line_link)) {
        print $line_link;
    }

    foreach my $word (@{$$line{'ortho'}}) {
        my $copy = $word; # otherwise, extra breaks show up in the interlinear
        $copy =~ s/\n\n/<p\/>/g;
        $copy =~ s/\n/<br\/>/g;
        print "$copy ";
    }
}
print "</div>\n";

# DISPLAY EACH FULL SMOOTH TRANSLATION
foreach my $lang (@langs) {
    print "<div class=\"$lang\">\n";
    print "<h3>", $config{$lang}, "</h3>\n";
    $line_num = 0;
    foreach my $line (@lines) {
        next unless $$line{'ortho'} and @{$$line{'ortho'}};
        $line_num++;
        my $line_link = '<span class="ref-num">'.
                        "[<a href=\"#line$line_num\">$line_num</a>]</span> ";

        if (not custom_line_num(\$$line{$lang}, $line_link)) {
            print $line_link;
        }

        foreach my $word (@{$$line{$lang}}) {
            my $copy = $word; # otherwise, extra breaks show up in the interlinear
            $copy =~ s/\n\n/<p\/>/g;
            $copy =~ s/\n/<br\/>/g;
            print "$copy ";
        }
    }
    print "</div>\n";
}

# DISPLAY PARTS OF SPEECH LEGEND
print "<div class=\"clearer\"></div>\n",
      "<div class=\"pos_legend\">\n",
      "<h2>Parts of Speech Legend</h2>\n\n<dl>\n";
foreach my $pos (sort keys %used_pos) {
    my $full_pos = $pos{$pos};
    $full_pos = ' ' unless $full_pos;
    print "<dt>$pos</dt>\n<dd>$full_pos</dd>\n";
}
print "</dl>\n</div>\n";

print <<HTML_END;
<h2>Interlinear</h2>

<table class="legend">
<tr class="ortho"><td>orthographic version</td></tr>
<tr class="morph"><td>morphemic breakdown</td></tr>
<tr class="ipa"><td>IPA pronunciation</td></tr>
<tr class="pos"><td>part of speech</td></tr>
HTML_END

foreach my $lang (@langs) {
    print "\t<tr class=\"$lang\"><td><span class=\"$lang\">",
          $config{$lang}, " translation</span></td></tr>\n";
}
print "</table>\n";

# DISPLAY EACH INTERLINEARIZED LINE OF THE TEXT
$line_num = 0;
foreach my $line (@lines) {
    next unless $$line{'ortho'} and @{$$line{'ortho'}};
    $line_num++;

    print "<a name=\"line$line_num\"> </a>";
    print "<table class=\"interlinear\">\n";

    # table row for orthographic representation
    my $col_num = 0;
    print "<tr class=\"ortho\">\n";
    foreach my $word (@{$$line{'ortho'}}) {
        print "\t<td", " colspan='", @{$$line{'colspan'}}[$col_num++], "'>",
              remove_html($word), "</td>\n";
    }
    print "</tr>\n";

    # table row for morphemic breakdown
    print "<tr class=\"morph\">\n";
    foreach my $word (@{$$line{'morph'}}) {
        print "\t<td><a href=\"", $config{'dictionary'},
              "#$word\">$word</a></td>\n";
    }
    print "</tr>\n";

    print_lexicon_lookup($$line{'morph'}, 'ipa', \&ipa);
    print_lexicon_lookup($$line{'morph'}, 'pos', \&part_of_speech,
                         $$line{'sense'});
    print_lexicon_lookup($$line{'morph'}, 'gloss', \&gloss, $$line{'sense'});

    # table row for smooth translation
    foreach my $lang (@langs) {
        print "<tr class=\"$lang\">\n";
        foreach my $trans (@{$$line{$lang}}) {
            print "\t<td colspan=\"", scalar @{$$line{'morph'}},
                  "\"><span class=\"$lang\">", remove_html($trans),
                  "</span></td>\n";
        }
        print "</tr>\n";
    }

    print "</table>\n";
}

# DELETE THE TEMPORARY FILES
unlink $temp_lex;

###############################################################################
# END INTERLINEARIZER
###############################################################################

__END__

=head1 Interlinearizer

By Arthaey Angosii <arthaey@yahoo.com>

=head2 Usage

At the command line, run:

C<perl ./interlinear.pl E<lt> source-file.txt E<gt> interlinearized-file.html>

=head2 Syntax

See the interlinearized texts in the Writing section of Arthaey's website:
http://arthaey.mine.nu:8080/~arthaey/conlang/writing/

=head3 Configuration

You must begin each interlinear source file with a configuration section, which
defines the names of the languages used and specifies where the lexicon file
and the dictionary HTML page are located. For example:

    <config>
    L0 = LanguageToBeInterlinearized
    L1 = SmoothTranslationLang
    L2 = OtherSmoothTranslationLang
    dictionary = ../www/dictionary.html
    lexicon = saved-lexicon
    </config>

The language codes must be L0..L9, and L0 must be the language whose lines are
to be interlinearized. You must define C<dictionary> to be the relative path to
the HTML version of your dictionary (morphemes will be linked to
C<$dictionary#$morpheme>). You must also define C<lexicon> to be the relative
path to the FreezeThaw-saved version of your lexicon.

You may optionally include extra words in a "temporary lexicon" section, before
the interlinear text itself. Words defined here will override words in the
lexicon defined in the C<config> section (although only for this one text).
Use the same format as for your main lexicon (which currently must be SIL
Shoebox's format). Proper names are the most likely thing to be defined here.
For example:

    <lexicon>
    \lx Arthei
    \ph 'Ar\Te
    \ps prop
    \ge Arthaey
    </lexicon>

=head3 Interlinear Markup

After the E<lt>configE<gt> ... E<lt>/configE<gt> section comes the interlinear
text. These lines begin with one of the Ln language codes defined in the
configuration section, followed by a colon and whitespace, and then the text
itself. For the L0 line, you will further mark the text up so that it can be
properly broken down into morphemes and automatically glossed.

Place C<|> at the end of each morpheme. To select a morpheme's sense that
isn't the first one, append the sense's number directly after the pipe. Thus,
C<bat> and C<bat|1> will gloss to the first meaning of the word I<bat>, and
C<bat|2> will gloss to the second meaning of the word I<bat>. The order of
words' senses is determined by order of entry in the lexicon.

Surround with C<{> and C<}> characters that belong in the final orthographic
version but that aren't part of the dictionary form of the morpheme. These
characters will be displayed in the final version, but will not be used to
look up the glosses of morphemes. (Punctuation marks will need to be included
in curly braces, for example.)

Add parts of morphemes that have been left out of the final orthographic
version with C<[> and C<]>. These characters will not be displayed in the final
version, but they will be used to look up the glosses of morphemes.

A C<#> will become a newline (HTML C<< <br/> >>), and two C<##> together will
become a new paragraph tag (HTML C<< <p/> >>) in the big orthographic version.

To preserve the case of a particular word, prefix it with C<^>. This is most
useful for proper names.

Any HTML (or anything, really) between C<E<lt>> and C<E<gt>> will be passed
verbatim to the big orthographic version of the text, although not to the
line-by-line orthographic version.

Links to each line's line number are automatically placed at the very beginning
of each line. Normally, this is what you want. Sometimes, however, you will
want more explicit control over the link's placement: for example, HTML
headings will otherwise cause a line break between the link and the line
itself. Anywhere a C<@> appears in a line, it will be replaced by the link to
the line number.

=cut