Translation Graphs
by Tom Veatch, 2003, 2008, 2017
Introduction and Motivation
Suppose you are competent in one language (call that language L1) and
you are interested in a document in another language which you
don't know (call that language L2). Wouldn't it be nice if someone
had done some work on that document to make it accessible to you in
that other language? Of course a normal paragraph by paragraph
translation of the whole, which you may find in a published
translation of that authored work, would do something like this, but
you wouldn't be getting it in the other language at all, but in your
own language. You'd learn the translated content, but you wouldn't
learn the language.
I have in mind, instead, something that opens the language itself to
you, so that as you, the learner, go through the document as analysed
and worked up by a linguist/translator, you gradually become able to
recognize and understand the elements of the other language and
ultimately to understand the original itself. This worked-up form of
the document would have to provide a variety of more-accessible forms
of the parts of the document: translations that are word-by-word, not
just paragraph by paragraph, links to audio playback of pronunciations
of the symbols and words, perhaps of larger sections. Working through
it should teach you and enable you to get a lot of its content; the
second or fifth time you encounter a word in that language, you might
not have to look at its dictionary entry to begin to understand it
yourself. Ultimately, with enough such documents, and enough time
devoted to working through them, you would become a competent reader
and even listener in that other language.
To enable this vision of language learning through a mediated,
supported, but direct encounter with the original L2 document, I have
had to envision a whole architecture of language data
representation, markup, storage, lookup tools, editing systems,
display systems and the like, which would be needed to take that
original L2 document and add the L1 resources needed and then to make
them accessible to you as you read through the document. This is my
draft description of that system.
The key idea, of course, is multilinearity. Consider the original
document as a line of text. Maybe a very very long line, but in the
abstract, just a sequence of symbols on a single line. Then any
additional representation that supports or makes accessible to you any
piece of the original document can be considered as a translation of
a piece of that first line, written onto some additional
second (or Nth) line in a way that lets you tell which part of the
first line it is a translation of. Such a representation is
multilinear.
You might have a large number of lines, lines that do pronunciation,
others that do vocabulary, lines with big gaps in them, lines that
refer to data outside the workup in a dictionary or an audio library
or on the internet, lines that call your attention to syntax or
dialect features, lines that link to clear pronunciations from an
audio dictionary and lines that link to live, vernacular or fast
speech recordings of whole sentences or turns, so that you can learn
how fast speech sounds in that language, not just careful
pronunciations from a dictionary.
And the system that ultimately presents it to you could have an
intelligent model of what you know as a learner of this new language
and it could display for you, at a suitable pace and with the right
amount of repetition and testing, the easiest next elements for you to
learn, as you gradually acquire competence with all the many things in
that document. Well, such a trainer is beyond the scope of this
current discussion, but it is something that, after we achieve the
technical requirements discussed here, we could then aim to build. It
would be a vast improvement on Teachionary (www.sprex.com ->
teachionary).
So now, then, to the technical details.
Translation Graphs
I here propose a logic and associated document format for multilinear
text representations, comprising facilities to arbitrarily segment,
annotate, and translate text documents, and providing the supporting
data and structure required for (audio/visually) displaying and
editing multilinear text representations.
A multilinear text might incorporate lines representing, for example:
* L2 orthography,
* L1 translation of the unit (word by word, sentence by sentence)
* reference to corresponding audio bits (filename and time endpts)
* reference to video bits (file, times)
* reference to dictionary entries
* translator's comments about a unit
* references to a graphical form, such as (coordinates
defining a polygon or rectangle within) a page or scan or
photograph or image containing data related to a span of
content.
* etc.
The lines can be loosely thought of as "translations" of one another,
though the translation may or may not be between human languages; for
example, different lines may represent the same text at different
levels of linguistic representation (paragraphs vs. words vs. audio,
etc.).
The concept here is similar to but different from Bird/Liberman
"Annotation Graphs" (AGs). AGs are:
* directed acyclic graphs comprising sets of nodes connected by arcs.
* Arc labels contain the actual text (and text types and other
classification info)
* Nodes anchor ends of text units, optionally encode times in
audio files.
Translation Graphs (TGs) are similar but slightly different:
* same: directed acyclic graphs comprising sets of nodes
connected by arcs.
* same: Arc labels contain the actual text (and text types and
other classification info)
* same: Nodes anchor ends of text units
* diff: Nodes cannot be labelled with times.
* diff: Instead, arcs may refer to audio segments
(filename/start-time/end-time)
TGs are a generalization of AGs, since AGs refer to a single audio signal
or implicit time baseline, whereas TGs can refer to more than one
audio representation for (segments of) a single text. Formally, then:
Logico-mathematical definition of TGs:
A Translation Graph is a 5-tuple {T,C,L,N,A}, with T a set of types,
C a set of classes, L a set of labels, N a set of nodes, and A
a set of arcs, each arc linking one node in N to another and associated
with one type in T, one class in C, and one label in L.
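For concreteness, here is a minimal Python sketch of that 5-tuple as a
data structure; the class and field names are illustrative only, not
part of any format.

from dataclasses import dataclass, field

@dataclass(frozen=True)
class Arc:
    src: str    # node the arc leaves
    dst: str    # node the arc enters
    type: str   # element of T, e.g. "L1", "audio", "implicit"
    cls: str    # element of C, e.g. the document name, or "unspecified"
    label: str  # element of L: the contents, or a reference to them

@dataclass
class TranslationGraph:
    nodes: set = field(default_factory=set)   # N, a set of node ids
    arcs: set = field(default_factory=set)    # A, a set of Arc records

    def add_arc(self, src, dst, type, cls, label):
        self.nodes.update((src, dst))
        self.arcs.add(Arc(src, dst, type, cls, label))

    # T, C, and L are implied by the arcs; useful when writing a tidy header:
    def types(self):   return {a.type for a in self.arcs}
    def classes(self): return {a.cls for a in self.arcs}
    def labels(self):  return {a.label for a in self.arcs}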
Format of (Text representation for) TGs:
A TG is represented using tags and tagged text. A tag is a string
enclosed in <>'s; the string begins with a tag name and is followed, as
appropriate, by further text. Tagged text is surrounded by a tag pair
in which the name of the second is the name of the first preceded by a
slash, as in <tag> .. </tag>.
1) An optional header within <head> .. </head> tags containing
a) one or more types within <type> .. </type> tags including
i) the type's name and
ii) optionally an encoding for the labels of the given type, within
nested tags
b) one or more classes within <class> .. </class> tags including
i) the class name. This might be the name of the underlying document,
which may have its layers stored separately in a variety of files.
Class identity across files enables separate storage if
desired while also providing for merging or, you might think,
zipping files together into a multi-layer translation graph
structure, subject to compatible node labelling and
sequencing. Class could encode a classification richer even
than mere hierarchy for document versioning,
e.g., Torah Old_Testament King_James
Torah Old_Testament UnitarianVersion_21.8
Note: types and classes absent in the header but found in the arcs
within the TG are acceptable; a program that writes tidy TGs
should add all arc types & classes to the header for completeness
but programs that read TGs should not expect that all will be there,
since TGs may be written by hand, and new types/classes added
by the writer/translator/editor.
Note: The type for implicit arcs is "implicit" (an 8-character
string). Implicit and explicit arcs must have different type(name)s.
Note: The class for implicit arcs is "unspecified" (an 11-character
string), unless specified in the header. To specify a class for
implicit arcs in the header, let the class name be prefixed
with "implicit:" (a 9-character string) and when interpreted,
the prefix removed before interpretation.
Note: The encoding for implicit arcs is the name of the charset
in which the document itself is encoded.
2) A body comprising
a) optional <body> .. </body> tags at the start and end of the body.
b) zero or more explicit nodes, each comprising a single <node $id> tag.
The document is ill-formed if multiple nodes have the same $id.
If a program encounters multiple nodes with the same $id,
and all arcs between the first and last instances of that node
are explicit, the program is authorized to simply delete all but
one, and continue processing; nothing incoherent is implied.
However, if there is any "implicit arc" text between the identical
nodes, behavior is undefined; programs ought to fail with a warning.
c) zero or more explicit arcs, each comprising
<arc $from $to $type $class>$contents</arc> or alternatively
<arc $from $to $type $class $contents>. For example, an arc from node 1 to node
2 of type L1 and no specified class may be represented as
<arc 1 2 L1>contents</arc> or as <arc 1 2 L1 contents>.
d) zero or more bits/bytes/characters of "implicit arc" contents.
Note: if "implicit arc" contents occur before the first and/or
after the last tag, then a <node begin> and/or a <node end>
tag are implicitly considered to be present. A tidying program
should make explicit such implicit begin/end nodes.
Note: if there are no implicit arcs, then the sequential order of the
occurrence of node and arc tags in the body is immaterial, since
the ordering of nodes and arcs along any path through the graph can
be reconstructed as the implied sequence of node id's from the
arcs. Similarly, the arcs might be in separate files, yet still be
joinable via the nodes shared between the files. Human
readability will be enhanced if each file has its own type of data
in it; a display program can then zip them together onto a shared
spine of nodes, and show graphically, and even audibly or perhaps
in video, some selected components. Annotations by a given
editor can be in their own separate file. A TG-capable editing program
could be made to store layers to their separate respective files
while ensuring cross-compatibility of order and labelling of nodes.
Note: if stored separately, each layer may more simply be written
out using implicit arcs (since then linear ordering and arc
endpointing need not be written explicitly but can be derived from
the text between the node tags, thus reducing the explicitness of
the tagging to just node tags and improving the file's
readability). The key is node consistency between layers. This needs
to be checked upon loading and merging multiple TG layers of a
single class or document name, but can easily be ensured when
writing to files. Where documents have different node labellings,
a UI might be provided to control and supervise a zip-together
operation, identifying nodes that correspond in the merging files.
e) Contents is a text string encoded according to the encoding
(charset) for arcs of a given type. Contents may be one of:
i) a direct text representation of a linguistic unit.
The type's encoding refers to the encoding of the directly
included text data in this case.
ii) a reference to another form of the linguistic unit in an
external resource or file. Such a reference must provide
enough information to extract and interpret the data within
the context of the given arc. A global resource could be
specified in the header in a future TG format version; for
example, the filename of the corresponding live audio file.
Local resources generally, and all resources in Version 1.0
compliant TGs, can be specified by a URL (and if the URL
specifies no method, consider it a filename path). In
addition to the regular URL methods, methods are included for database
lookup and audio file subsegment extraction. The type's
encoding refers to the encoding of the external data in this
case.
Note: formalize the db lookup method later, as it is used.
Note: formalize the audio reference method later, as used.
But "filename start_time end_time" sounds good,
with the encoding for the arc's type being (e.g.)
"audio:raw 16KHzPCM", this is interpretable by
playback code.
Note: formalize the audio/video reference method later, as
used. But "filename start_time end_time" sounds good,
with the encoding for the arc's type being (e.g.)
"video:mpeg", this should be interpretable by playback
code.
Note: formalize dictionary lookup method later, as used.
Note: formalize alternate-text reference method later, as
used. But "filename start_char end_char" is a good
start, assuming the referred-to file is unchanging.
Or "TG_filename start_node end_node type" would also
work assuming the node IDs don't change and the type
uniquely identifies a single arc.
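As a rough illustration of how a reading program might dispatch on such
reference contents once the formats above are settled, here is a hedged
Python sketch; the "filename start end" forms are the provisional ones
suggested in the notes above, not a fixed spec.

def parse_reference(contents, type_encoding):
    """Interpret arc contents that refer to external data, using the
    provisional forms above: "filename start_time end_time" for audio or
    video, "filename start_char end_char" for alternate text, otherwise a
    URL or plain filename path."""
    parts = contents.split()
    if len(parts) == 3 and type_encoding.startswith(("audio:", "video:")):
        fname, start, end = parts
        return {"kind": "media", "file": fname,
                "start": float(start), "end": float(end)}
    if len(parts) == 3 and parts[1].isdigit() and parts[2].isdigit():
        fname, start, end = parts
        return {"kind": "text_span", "file": fname,
                "start": int(start), "end": int(end)}
    return {"kind": "url_or_path", "target": contents}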
Discussion:
A TG can usefully be thought of as a text thingy under iterative
editing, translation, and refinement. We will here consider TGs as
texts in a variety of stages of processing.
At the first stage is the raw text of the document or other text
unit. We have defined the TG file format so that a raw document is a
valid TG file with implicit begin/end nodes and a single implicit arc
with its contents being the entire document. This implicit arc's type
and class are "implicit" and "unspecified", respectively and the
implicit encoding for the arc is the charset of the document.
At a second stage of processing, we might do any segmentation desired
by inserting <node> tags to anchor segment ends between linguistic units
such as paragraphs, sentences, lines, words, morphemes, etc.
A node tag must include a unique node identifier (e.g., a number).
Implicitly, the text between nodes is the unique label for that
segment. A unit of tag-external text may be referred to as an
"implicit arc".
At a third stage, for example, we can make the arcs explicit,
by replacing implicit arcs with explicit arc tags labelled with the
same text.
To make an arc explicit, tag it so, with <arc from to type class contents>
or <arc from to type class>text</arc>.
At a fourth stage, we can translate arcs from one representation to
another. To add another representation for a segment that is an
explicit arc linking two nodes a and b, add another arc from a to b,
with the added representation's type (e.g., "L2" if it is a
translation into L2), class, text encoding (or put that in the
header), etc., and the translated contents.
To illustrate, here are some different TGs derived from an original
text document containing just the string "a b c".
raw: a b c [order matters]
segmented: <node 1> a <node 2> b <node 3> c <node 4> [order matters]
explicit: <node 1><arc 1 2 L1>a</arc><node 2><arc 2 3 L1>b</arc><node 3><arc 3 4 L1>c</arc><node 4> [order doesn't matter]
words translated: <arc 1 2 L2 (C)>A</arc> <arc 2 3 L2 (C)>B</arc> <arc 3 4 L2 (C)>C</arc>
[add anywhere]
S's, words: [add anywhere]
word wav's: ...
S wav's: ...
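A small Python sketch of reading such a segmented stage, assuming node
tags written like <node 2> and treating everything between consecutive
node tags as implicit arcs; the tag syntax is an assumption here, and
plain tuples are used (rather than the dataclass sketch above) to keep
the example self-contained.

import re

NODE_TAG = re.compile(r"<node\s+(\w+)\s*>")

def parse_segmented(text, type="implicit", cls="unspecified"):
    """Return a list of (from_node, to_node, type, class, label) arcs,
    with implicit begin/end nodes supplied at the edges."""
    arcs, prev, pos = [], "begin", 0
    for m in NODE_TAG.finditer(text):
        label = text[pos:m.start()].strip()
        if label:
            arcs.append((prev, m.group(1), type, cls, label))
        prev, pos = m.group(1), m.end()
    tail = text[pos:].strip()
    if tail:
        arcs.append((prev, "end", type, cls, tail))
    return arcs

# parse_segmented("<node 1> a <node 2> b <node 3> c <node 4>")
# -> [('1', '2', 'implicit', 'unspecified', 'a'),
#     ('2', '3', 'implicit', 'unspecified', 'b'),
#     ('3', '4', 'implicit', 'unspecified', 'c')]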
Note that more than one audio resource may be referenced by different
arcs including isolated pronunciations of dictionary entries,
full-text readings by one or various readers, real-live-recorded
originals (if naturally recorded), etc. This is why these are not
annotation graphs, which are designed for annotating single audio
files and which can have times within the relevant audio file
specified in the node. Rather, this type of graph annotates relationships
among linguistic data. A text segment may have some translation
relationships to multiple audio file segments, for example, audio
recordings of different actors reading the same line of Shakespeare.
A TG handles this by referring to the audio file segment as an arc
like any arc, with its appropriate (audio) type and contents specified
by reference to the audio file and time endpoints.
Note that text outside of a tag is implicitly considered as a label
for an arc between the preceding and following nodes.
(In a draft of this spec, I allowed for "0" and "-1" as node names
equivalent to "begin" and "end".)
Assume that <node begin> and <node end> are implicitly inferrable if
not explicitly present in a document (which may have no nodes at all).
Thus a raw text document has two implicit nodes anchoring the ends of
the whole document and is a well-formed multilinear text document.
Operations and Apps
Automatic and manual methods, either or both, can reasonably be
used to operate on texts of this sort.
Reasonable operations would include segmenting, arc-labelling,
translating, linking to segment pronunciations' audio files, etc.
An initial formatter might for example:
* add begin/end nodes.
* optionally make the implicit arc explicit:
a single arc with that whole document's text as its label.
A segmenter might:
* select an arc of a specified type with a complex label
* split the label into sequential or simultaneous components
of a second specified type
* add nodes between sequential components and add arcs
labelled with the respective components.
* this segmentation process could be iterated over all arcs of
a given type in the file.
* either manual or automatic methods could be used (a sketch of
the automatic case follows this list).
* for example an automatic segmentation could be done
in a script-teaching application, where each letter
in the script gets its own arc of type "letter" and
is parallel to ("simultaneous" with) an arc
referring to a teaching resource for that letter
(e.g., IPA or Roman equivalent, audio form)
* or a document could be manually segmented with aid
of emacs macros &c.
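Such an automatic segmenter might look roughly like the following in
Python, again using (from, to, type, class, label) tuples for arcs and
inventing intermediate node ids; this is purely a sketch, and id
generation would follow whatever convention the document uses.

def segment_arc(arcs, arc, new_type, splitter=str.split):
    """Split one arc's label into sequential components of a second type,
    adding fresh nodes between the components and an arc for each one."""
    src, dst, _old_type, cls, label = arc
    parts = splitter(label)
    prev = src
    for i, part in enumerate(parts):
        nxt = dst if i == len(parts) - 1 else f"{src}_{dst}_{i}"
        arcs.append((prev, nxt, new_type, cls, part))
        prev = nxt

def segment_all(arcs, old_type, new_type):
    # iterate the segmentation over all arcs of a given type
    for arc in [a for a in arcs if a[2] == old_type]:
        segment_arc(arcs, arc, new_type)

# arcs = parse_segmented("a b c")          # one implicit arc "a b c"
# segment_all(arcs, "implicit", "word")    # adds three word arcs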
A dictionary-connector might:
* add arcs of a "dictionary-index" type parallel to and
between the same node endpoints for arcs whose labels are
found in a dictionary lookup. (e.g., if a hash or other
index could save repetitious dictionary lookup of additional
word instances in a document).
A word-by-word, phrase-by-phrase, or sentence-by-sentence translator might:
* add a new type and charset to the header.
* add arcs of that type parallel to each word, phrase, or
sentence arc, each with contents being the
word/phrase/sentence's translation.
* here "translation" may mean a mapping into any other
linguistic level, for example, translate orthographic
Sanskrit (where graphemes are derived from both words at a
word boundary) to sequences of (separated, "underlying")
morphemes.
A parameterizeable display system might:
* read and parse one or more TG files, constructing a
(probably not very human-readable) TG data structure
internally as a set of nodes with labels and arcs of various
types between the segment-anchoring nodes. (Note that
overlapping chunks as for example in non-agglutinative
languages may require multiple nodes at a finer level of
representation, with arcs covering more than a single
node-to-node segment: one node preceding the first
influence of a later form, and another following the last
influence of the previous form.)
* display the types available in a configuration UI for selection.
* display the selected types in a linguistics-style,
multilinear, tabular display, comprising
* a line in the table for each selected type
* links on selected types leading to a selected alternate type
* e.g., click on a word in the word line to hear the
audio of the word not shown in the text display
but referenced in the TG as an audio arc
corresponding to that word arc.
* or e.g., click to pop up and choose from a
menu of alternative data types available
* Implementation could be via PHP or JavaScript mapping TG
files to HTML with the intended UI functionalities.
Dictionary lookup might be to a cloud-located, globally
shared, perhaps many-language resource. Video access might
be to YouTube or another (universally) accessible video
document store.
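In the same spirit, here is a toy Python renderer of the multilinear
table; the real thing might be PHP or JavaScript as noted above, and
links, fonts, and audio hooks are omitted. It reuses the tuple-arc
convention from the sketches above.

def render_table(arcs, selected_types, node_path):
    """Build an HTML table with one row per selected type and one column per
    node-to-node segment along node_path; cells are empty where a type has
    no arc spanning that exact segment."""
    rows = []
    for t in selected_types:
        cells = []
        for src, dst in zip(node_path, node_path[1:]):
            label = next((label for (a, b, typ, cls, label) in arcs
                          if a == src and b == dst and typ == t), "")
            cells.append(f"<td>{label}</td>")
        rows.append(f"<tr><th>{t}</th>{''.join(cells)}</tr>")
    return "<table>\n" + "\n".join(rows) + "\n</table>"

# print(render_table(arcs, ["implicit", "word"],
#                    ["begin", "begin_end_0", "begin_end_1", "end"]))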
An editing system might:
* display a selected subset of the TG's types in multilinear
tabular form
* provide for creation of a new type, with its header
<type> tag, with arcs optionally exhausting the document
(comments don't, translations do) and with a derived-from
type for automatically generating a first-draft set of arcs
of the new type.
* provide a line in the multilinear tables for entering
translations (data for that new type) parallel to a selected
other type. It should have a charset, input method, and
display font.
* provide means of inserting nodes, e.g., for a click on a
character to be interpreted as inserting a new boundary node
before it and inserting arcs of the current type on its left
and right.
* provide for automatic pre-filling of arc contents via some
perl-ish substitution s/// mapping from another type (e.g.,
orthography to phonology by some rule system)
* provide for editing boundary locations (deleting and adding)
(e.g., if automatically inserted in the wrong place)
A language-teaching system might:
* select a parameterization for the display system,
* drive the user's reading through the system via highlighting
displayed text bits (e.g., bouncing-ball) simultaneous with
playback.
* do a read-aloud game. Have a layer of arcs for L2 ASR
grammar resources, highlight the next after the previous
succeeds (or doesn't).
* Ask the user what s/he wants to learn.
* to learn an alphabet, provide:
* links within text
* bouncing-ball read-aloud one letter at a
time
* to read content with translations shown only of some
selection of new words/morphemes (e.g. randomly
selected at a certain percentage or frequency in
the text, or selected by a teaching algorithm based
on a model of the user's knowledge level which
could be maintained by the system at a fine or
gross level or configurable by the user also at a
fine or gross level)
* bouncing-ball read-aloud one word at a time
* needs sub-sententially-aligned audio arcs
* enable an isolated-word pronunciation mode
via dictionary pronunciation audio arcs or
via reference to a carefully pronounced
rendition of the text.
* to learn isolated-word vs vernacular-conversational
pronunciation
A Context of Application
An extended example may help show the utility of this quite
abstract presentation. Consider a context of historical
document preservation such as being carried out by the Muktabodha
Indological Research Institute which is saving disintegrating,
family-stored document archives from oblivion. They have discovered
ancient palm leaves covered by hand-copied historical texts being
misused and often in bad condition, and they are committed to
preserving these resources.
For MIRI, the first step (after fundraising, hiring, training,
advertising and networking, locating, persuading, travelling,
unpacking, and setting up the equipment) is the
scanning of the found materials. From this a primitive TG could be
produced as simply a sequence of scan filenames in a text file. After
slight processing, it could be reformatted into a proper TG file with
head, type, class, body, node, and arc tags in which the relevant TG
layer type might be "original_scan_jpeg", and a sample node id might
be "Hejamadi_Sanjeeva_Kunder_box_3_scan_3209" and an arc immediately
after that node a filename reference to the particular scan. (If
order of the pages scanned relative to one another is not known, then
both start and end nodes could be given for a floater, and empty arcs
specified as entering from 0 or leaving to -1.) An upcontrasted image
set could be integrated into the TG formatting by adding another TG
layer with type "contrast+120_scan_jpeg" with arcs referring to
separate, corresponding image files. In this way, workflow can be
carried out, tracked, and reintegrated in TG layer files.
Although readable in their direct image form to specialists, these
scans then need to be processed into something useful for the rest of
us. Does this situation suggest Translation Graphs? I hope so.
So, for example, passes made by improving, purpose-trained OCR systems
over the images might produce many segmentations: top
down into line_areas, string_areas, char_areas, feature- or
glyph-stroke areas with their extracted parameters, and then bottom up
into probability- or confidence-weighted character/word/morph
hypotheses (perhaps multiple "simultaneous" hypotheses for a given
single area, or multiple overlapping areas).
A human-edited OCR transcription might be derived from the above,
copied and reviewed, and approved after editing by a competent editor,
with hypotheses confirmed/deleted/modified and content
added/subtracted/changed as well as segmentation endpoints moved or
removed or added or multiplied where the OCR produced bad
segmentation. Obviously more work produces better results, and many
drafts each provide their separate TG layer of translation of the now
multiplying forms or glimpses of a theoretical, implied, underlying,
intended document that the author of these ancient, perhaps
disintegrating palm leaves bequeathed to us in that form.
Such an edited transcription might then be built onto as added TG
layers on the same document:
* transliterated from its perhaps obscure script into a more
accessible script such as devanagari or Roman
* translated word by word into dictionary references or
* referenced to a growing concordance
* translated at a line/word/paragraph level to some L1 (type L1,
charset ..., class doc_name author...)
* rendered into audio by a reader, thence recorded into a digital
file, made accessible to the system, and linked to by assigning
segments of audio to start/end node spans.
In short, scan them, then build up what you have into parsed,
understandable TG documents readable by all. With the constellation
of tools and operations described here, it is imaginable that
ultimately any interested human could access and penetrate, could with
minimal, if large, efforts, learn to read in the original, these
preserved archives. And the same systems could be used to provide
teaching access to learners of a target language through movies in
that language, suitably supported by transcriptions and dictionaries
and translations, all displayed and prompted into the viewer's
attention so that learning and understanding can be made as effortless
as possible.
----------------------------------------------------------------------
Sample data:
Roman_SRoman_W
suuta uvaacha
suuta
uvaacha
kailaasashikhare ramye bhaktisa~dhaananaayakam
praNamya paarvatii bhaktyaa sha~kara~ paryaprchchhata
shrii devyuvaacha
aum namo devadevesha paraatpara jagadguro
sadaashiva mahaadeva gurudiikshaa~ pradehi me
-e ??>
ramye ??>
sa~dhaana ??>
naayaka ??>
-m acc.sg>
-~ acc.sg>
s/N[.*i]/yaa/ with>
N[(.*)(a)/$1esha of-pl.Ns>
Adj[(.*)(a)]/$1aat the-most-Adj>
kailaasashikhare ramye bhaktisa~dhaananaayakam
praNamya paarvatii bhaktyaa sha~kara~ paryaprchchhata
aum namo devadevesha paraatpara jagadguro |
sadaashiva mahaadeva gurudiikshaa~ pradehi me || 2 ||
After Discussions with Dave Graff:
Let a Document be defined as a consistent segmentation, and a document name.
Let a Segmentation be defined as an ordered set of named nodes/boundaries
(convenient if sortable into order on the names)
Let a Tier include a subset of a Document's segmentation (including
its edges, the first and last nodes thereof), and further let it
specify content or material in some form within each of the tier's
segments.
Thus:
Content may change from tier to tier but the segmentation remains consistent.
A SubDocument is a Document when its segmentation is consistent across all its Tiers.
Examples:
A movie represented in the following form:
file:movie.mp4
is a Document comprising a single tier. Retaining and elaborating that
segmentation, further segmenting it into scenes, one may add to the
above Document a second tier, perhaps in a second file, for example:
t=0.0,120.5
t=120.5,240
t=240,1440
t=1440,4000
t=4000,4200
Observe how the nodes from the first tier carry over into and remain
consistent in the second tier. '#x' and 'x#' are convenient notations
intended to refer to a node or boundary on one side or another of
content labelled x. Further, the times in these arcs are interpreted
within a given context, here, an mp4 file specified on another tier,
which gives the time-pairs meaning as segments containing specific
content.
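For example, once the scene arc "t=120.5,240" is paired with the mp4
named on the coarser tier, playback or extraction becomes possible; a
hypothetical helper using the standard ffmpeg command-line tool might be:

import subprocess

def extract_segment(mp4_path, start, end, out_path):
    """Cut the [start, end] span (in seconds) out of the movie file named on
    the coarser tier, so that a scene arc can be played on its own."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", mp4_path,
         "-ss", str(start), "-to", str(end), "-c", "copy", out_path],
        check=True)

# extract_segment("movie.mp4", 120.5, 240, "scene_2.mp4")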
Separately, consider a set of image files enumerated by file name in a
listing file, separated by named <node>s. That would be another
definition for a Document, including some content by reference.
img_001.jpg
img_002.jpg
..
Associated with the ImageArchive Document, one might enumerate as a
more fine-grained Tier a set of bounding boxes, in a certain order, to
be understood as located within the segment's associated image, each
box bounding all the pixels associated with a character glyph. That
enumeration would be a tier of an elaboration of the previous
ImageArchive Document.
An added tier, drafted by image processing algorithm, corrected by a
human, might be a sequence of character-sized arcs between
character-bounding nodes, this tier representing the transcription of
the imaged page. Stripped of annotations, it is equivalent or
identical to the digitized text of the image.
----
Node names can be used as part of referring expressions to identify
content substrings; each character has its offset in its tier's
string, so that sequences can be referenced for translation within
another tier as, tiername[i,j]=..., etc. In this way character
offsets might be used as a method of cross-tier indexing for
translation, indexation, interpretation, etc. However the approach
here uses such references substantially less than the Annotation Graph
approach; in this Translation Graph approach the explicit structure
provides a concrete anchoring that establishes reference across tiers,
since a bit of content in a segment of this tier must necessarily be
within the corresponding segment in every tier of the same document,
by the requirement of consistency of segmentation. If the position of
some bit of content within a Document can be seen directly in its
presented ordering between <node>'s, then it is unnecessary to
cross-refer with numerical indexes, for example to say that characters
12-16 of the other tier are a certain morpheme. Instead, a morpheme
tier has segmentation boundaries between morphemes, which are
consistent with the boundaries between phonemes, say, at another tier,
and as such the alignment is directly visible.
An important task here has to do with database structure and
populating and utilizing that structure, as in for example a method of
transduction of database content from one user to another depending on
user's needs. Some of the translation tiers can be pulled out of the database.
Sometimes a differentially carefully spoken rendition might be a tier.
The more renditions, the better, indeed, since translations are so
variable.
According to Dave, back in the 2000's, a corpus of news reporting in
Chinese or Arabic, translated into English, 10x, produced results that
were always different! Only a rare short sentence came out the same
across translators. Word choice, word order, pronominalization, all
different, all reactions by natives were different. Usually not very
significant but frequent and subtle. Everyone has their own take.
Now, the purpose here is the data and interfaces to support a
language-teaching, or language-learning-supportive, browser.
Of course machine learning in multiple iterations has its role to
play. Based on initial work by a linguist, the machine learning
algorithms will improve their transductions to preliminarily populate
added tiers. Then linguists will improve the machine-generated
drafts. Then the machines will continue learning. Presented with a
gray box for a word proposed by algorithm, a human decider could click
it to see alternatives or type (or push a button and speak aloud) to
enter a new one, and select a correct form, which the algorithm will
learn from and use to continue to improve its hypotheses in that tier.
Machine learning can help partially automate segmentation, lookup,
translation, also vocabulary sorting, based on frequency, to help
decide what learners should learn first, etc.
The resulting picture here is a workflow encompassing an ongoing
process of sustained translation into another language. One step
might be called "transcribe": Convert jpegs to an ordered sequence of
bounding boxes, then to characters by OCR, then correct those
classifications by human, feeding that back into the OCR. Another
step does morpheme translation, simultaneously with dictionary
construction. A dictionary process that feeds labels learned so far,
forward, would start labelling unlabelled words. The dictionary is
not fixed but is a process, a growing and living dictionary. As Dave
says, 80% of words may have no ambiguity but the other 20% will be 80%
of the work. Multiply ambiguous, highly context dependent, related to
Zipf's law. A long tail of infrequent words which are relatively
clear, and very rare words which may be quite unknown. In this
workflow toolset, human users will label and work away until achieving
some kind of critical mass making it useful to others.
Dave: Building the browser is going into a different direction from
archiving. Archiving with the raw source material and
analysis/translation is just the bare facts. Then mediating to a
learner/reader is more: a tool that serves as an instructor, carrying
on an ongoing dialog with the individual to know what they got out of
it, how comfortable are you with this?
Tom: Have an apple tv remote control that you can click when you don't
understand something in the recent history of the current media
playback; if it's media annotated with this kind of data, and the
remote understands the meaning of the click as "Explain that to me",
then the video can pause, an IGT be displayed, and the user can browse
until they learn what they want, and click onward to continue.
Dave: Useful in teaching learners is an intelligent use of
concordances. Just consider the vocabulary building issue: the
biggest issue in language learning is vocabulary access. Once you
have a database with occurrences of each word, then for each word, show
them all in context with all its conjugated forms; maybe you
could expand with a rulebook and a grammar and go further, but the
actual contextualized found forms tell so much. The concordance is
crucial.
Consider learner support for an utterance-sized bit of linguistic data
in the form of a two-way IGT from LS (language of source) to LT
(target language, reader's language) back to LS, each direction
comprising several tiers including morphemic analysis, word-level
translations, and full translations. Enable tagging/editing by
permitted contributors such as (a) a linguist, (b) the author, or
even (c) an interested person, to mark errors or questionables, or to
introduce corrections at suitable layers.
Provide concordances for one or more items in the structure and enable
concordance of others by a menu operation on the item.
Dave: Apply this to multi-lingual Twitter feeds. People who want to
understand the L2 twitter data (and authors who want to be understood)
might contribute a lot of data checking and editing to such a system.
Getting the community usefulness going, after some critical mass is
achieved, with live dictionaries, live algorithms, and humans
involved, it could become quite useful to all.
Tom: I want anyone to go into an L2 situation and be maximally
supported to learn and understand what they don't know. This is far
more ambitious than the Star Trek universal translator. It applies
equally to multi-lingual Twitter, to foreign watchers of previously
unsubtitled English movies, to audio concordances for learning dialect
features, to archiving and study of ancient religious texts, to any
form of language, whether textual, audio, or video, that is of interest to
the point it is worth doing the work on it to make it accessible to
another language. Put an app into your iPad and watch the TV with it,
when it recognizes a place in the film where someone has made a
tutorial out of it, the app provides for the user to click a button
and see and go through an IGT to learn -- on the iPad, if the TV isn't
smart enough to show it on the TV. Or have it be knowledgeable about
you enough to pause and give you a translation of something it thinks
will be helpful to you, once in a while. And you can click ? here or
there as a question about what did that mean, and it can help. Even
partly understanding native speakers can use similar controls over
presentation of the Translation Graphs to turn the subtitles on and
off.
12/20/2017
After a request for a translation of a paper I wrote into French:
I imagine providing a web UI for crowdsourcing translation tasks,
exposing, to begin with, some tiers of the original document as
document, sections, paragraphs, sentences, words. Then I imagine
populating some added, French tiers with Google translate data. I
guess one could only see a sentence or two at a time within the UI;
that's fine to begin with. The underlying data form would be: tiers
in different files. Emacs would do as an editor to convert higher
level segments to finer grained segments. Then some background
processes would populate the French tier by doing Google Translate
robo-requests. Another might cut the TGs in the files into bits and
pump them into the MySQL database, so that various forms thereof could
be accessed using various SQL queries, like SELECT...JOIN... Next a
web UI in HTML probably enhanced with JavaScript or other DataTable
system, some kind of editable, displayable, automatically populatable
tables, to show and provide for editing/correction/entry of various
tiers. The editing process, when a change is made in some box of the
table, would trigger code to send changes not to a text file formatted
as a Translation Graph, but to the MySQL database storing
correspondences suggested by the user. A table saying,
French_sentence_by_Google maps to French_corrected_sentence with a
column in the table for the contributing user's ID. More tables for
dictionary entries. Etc. Maybe the UI allows a click to expand a
part where the translation seems queer to the reader, offering that as
a filter on the automatic translations, so readers can pick out queer
bits and just fix them, and meanwhile read on.
Now, why didn't I notice that the segmentation of words in French is
not consistent in ordering with the segmentation of words in English,
when the word order is different? I suppose that's okay. Or is it?
IGTs use a base language, the language of the linguist, for the
morphemic translations, but given in the observed sequence of the L2
morphemes. Then the base language morphemes are scrambled from that
ordering to a base language phrase or sentence translation. Then if
you build it the other way around, the scramblings won't match up. But
perhaps the phrases/sentences will, at a higher level. Some aspect of
ordering will remain, and that's what the TG node structure will
expose. And the correspondences will work, but by matching longer
segments together between two shared boundaries, rather than by
directly corresponding in order at the smaller-segment level. Perhaps
some language of permutation could be encoded in the graph so that
word correspondences could be directly read out of it. Meanwhile,
not.
Dave, this is possibly still abstract noise but I do feel things are
moving in the right direction, toward concrete implementability.
--------------------------------------
NLTK Functions:
s = stem.{Porter,etc.}; s.stem(token)
list(tokenize.whitespace(text))
t = tag.{Default,Regexp(patterns),Unigram(backoff=t)}; t.tag(tokens); t.train(corpus);
g = cfg.parse_grammar(gmr); char.ChartParse(g,METHOD);
TGTK:
tier = TG.readTier(fn1); // on a plain text file makes a tier on whitespace
tier.write(fn2); // output the tier into a file, saving it.
doc = readDoc(fn3); // initial tier
for (fnI;;) doc.AddTier(fnI); // returns T if consistent & well-formed.
for (seg = doc.FirstSegment(tiername); seg && seg.hasNext(); seg = seg.nextSegment(tiername)) {
// segment is an object that selects a segment or arc between nodes on a tier,
// thus it defines predecessor and successor nodes at that tier
// as well as the including arcs on larger-segment tiers
// as well as the included arc sequences on shorter-segment tiers
// handle the segment
}
Each tier has a name and access method and for each arc between adjacent nodes within the tier it has data.
The underlying data is provided to a caller by calling the access method given the data.
So it might be,
Yes
Father
Yes Father
Then calling code could call
doc.FirstSegment("WordsTier").access("WordsTier");
to retrieve the ASCII string "Yes" which we thereby take as the actual first word,
or it could call
doc.FirstSegment("WordsTier").NextSegment("WordsTier").access("WordsTier")
to retrieve the ASCII "Father", the actual second word.
As another example:
IMG1.jpg
IMG2.jpg
Then calling code could call doc.FirstSegment("ImagesTier").access()
to retrieve the filename "IMG1.jpg"
To automatically process a tier of data, generating a second tier of data to be incorporated into the same Document,
code something like this might do:
stemstier = doc.AddTier(CopyNodesFromTier="Words", "StemsTier");
for (s = doc.FirstSegment("Words"); s; s = s.NextSegment("Words"))
    stemstier.replace_with(s, stemmer(s.access("WordsTier")));
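Re-expressed as a small runnable Python sketch of the same idea; the
names follow the TGTK pseudocode above, and consistency checking is
reduced here to matching segment counts, a placeholder for real
shared-node checking.

class Tier:
    def __init__(self, name, segments):
        self.name = name
        self.segments = list(segments)      # ordered contents of each arc

    @classmethod
    def read(cls, filename, name="WordsTier"):
        # a plain text file becomes a tier segmented on whitespace
        with open(filename, encoding="utf-8") as f:
            return cls(name, f.read().split())

    def write(self, filename):
        with open(filename, "w", encoding="utf-8") as f:
            f.write(" ".join(self.segments))

class Document:
    def __init__(self, first_tier):
        self.tiers = {first_tier.name: first_tier}

    def add_tier(self, tier):
        """Return True (and keep the tier) if it is consistent with the
        document; here, crudely, if it has the same number of segments."""
        ok = all(len(t.segments) == len(tier.segments)
                 for t in self.tiers.values())
        if ok:
            self.tiers[tier.name] = tier
        return ok

    def segments(self, tiername):
        yield from self.tiers[tiername].segments

# doc = Document(Tier.read("text.txt"))
# first_word = next(doc.segments("WordsTier"))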
----------------------------------------------------------------
Let a tier be represented by default as a computer file having a file
name base and with filename extension .tier, including at the front
and separate from its body a globally-unique document title, a tier
name, and a hash of its content. Other tiers of the same document
should have the same title and a different tier name. For example:
<tier /* the header and tags are in plain ASCII;
* CONTENT is encoded in the named charset */
charset="..." /* charset in which CONTENT is encoded */
URL="..." /* URL/URI for this file */
title="..." /* globally unique, could be an ISBN or UPN;
* shared identically across tiers of the same document */
tiername="..." /* unique among representations of the titled document */
parentURL="" /* URL to file containing reference tier. Default: "" */
boundary="..." /* Regex of boundary marker: "#", " ", "\s+", "" */
hash="..." /* Auto-generate from CONTENT, or use to compare. */
>CONTENT
Here the hash string is the hash of its CONTENT, ensuring a consistent
baseline of data and thereby enabling a consistent treatment of its
segmentation. Since changing any bit in the content modifies the hash,
potentially leading to inconsistent segmentation, autogenerate it after
safe changes, and check it when doing any operation that depends on it.
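A sketch of generating and checking that hash (SHA-256 is used here
purely as an illustrative choice of hash function):

import hashlib

def content_hash(content):
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

def check_tier(header_hash, content):
    # verify the baseline before any operation that depends on the segmentation
    return header_hash == content_hash(content)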
Before the document is locked, each tier's segments should have
explicitly named node tags separating each content arc. Then segments
can continually be split or combined or modified without ruining
some externally-counted unique global ordering.
Upon locking the document, immutability can be assumed and the system
can remove all the explicit node tags with their unique-within-tier
names, and rely on a generic boundary tag or symbol, combined with a
naming or numbering convention.
Thus a lock and unlock pair of operations on a document would
translate between two forms of the document: in the unlocked form,
each tier's segments are separated by uniquely named nodes and
cross-tier reference is certain through using the correct node names.
In the locked form, each tier's segments are separated by a generic
boundary identifier, and other tiers can refer across nodes to a
particular node in the locked tier identified by its
convention-derived name or number.
A naming/numbering might be built up by the editor using a binary tree
system: starting with the whole document as one segment, inserting
the first node boundary automatically numbers the preceding segment as 0 and
the following as 1, and each further split numbers the precedent as ...0 and the
successor as ...1, thus giving a unique and ordered name to any
segment, with the numbering system reflecting the perhaps meaningless
and forgettable time sequence of the editor's divisions of the
document.
Given a succession of segments, a renaming could be done automatically
via a decimal counting sequence as in: #0, #1, ... #n-1, n-1# or via a
hashed naming #4s8ulkj #98ulaki #kjh92n3 .. where there is no sort
ordering on the names themselves, indeed sorting is not needed since
the content data (text) contains its own ordering.
Lock(separator,document) would go through the document and remove all
the nodes not referred to in multiple tiers, replacing them with
separator, the generic boundary marker. The nodes referred to in
multiple tiers should be retained with a name so that each tier knows
how to cross-reference a segment to segments in other tiers.
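A sketch of Lock along those lines, assuming node names appear in a
tier's body in the bracketed form used in the examples below (e.g.
[#w3]), and that the caller supplies the set of node names other tiers
refer to:

import re

NODE = re.compile(r"\[#?\w+#?\]")    # bracketed node names such as [#w3] or [w5#]

def lock(separator, tier_text, shared_names):
    """Replace nodes not referred to from other tiers with the generic
    separator; keep, by name, those that other tiers cross-reference."""
    def keep_or_replace(m):
        return m.group(0) if m.group(0) in shared_names else separator
    return NODE.sub(keep_or_replace, tier_text)

# lock("#", "[#w1]Om[#w2]puurnam[w2#]", {"[#w1]"})
#   -> "[#w1]Om#puurnam#"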
A hierarchy of immutable tiers can be represented in this way. A
mostly implicit naming system for segments in tiers might be a dotted
numbering system with parallel hierarchies. A top tier might be no
more than the whole document in one segment. A tier hierarchy
specified, for example, as TopTier.NextTier1.ThisTier could be used to
interpret a dotted numerical reference for a segment at ThisTier such
as #23.#300.#8291, meaning the current segment, #8291 within ThisTier,
is within segment #300 in NextTier1, which is within
segment #23 in TopTier. Naming segments proceeds automatically from
beginning to end within each tier from the beginning of the document
starting from #0 and up to #n-1,n-1#.
A naming convention with # representing the boundary as traditional in
formal linguistic morphology, and consistently with a numbering of
content segments as numbered arcs between nodes or boundaries,
provides that nodes can be named with # and a number, whereby the
number identifies the (zero-based) arc number and # before or after
represents the boundary preceding or following the selected arc. In
this convention, then, #A is the name of the boundary preceding A
while A# is the boundary immediately after A. Dual naming
of boundaries follows: #B = (B-1)# and B# = #(B+1).
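A tiny helper can make the dual-naming identity concrete (arc numbers
are zero-based, as above):

def boundary_names(arc_number):
    """For zero-based arc number B, the boundary before it can be written
    either #B or (B-1)#, and the boundary after it either B# or #(B+1)."""
    before = {f"#{arc_number}", f"{arc_number - 1}#"}
    after  = {f"{arc_number}#", f"#{arc_number + 1}"}
    return before, after

# boundary_names(3)  ->  ({'#3', '2#'}, {'3#', '#4'})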
Where agglutinative sequencing in morphology or higher levels applies,
the above is true. However, where morphemes overlap, the first
segmental indication of a following morpheme might precede rather than
follow the final segmental indication of a preceding morpheme. Then an
abstract sequence #A#B# might appear at a lower level of overlappingly
influenced subsegments as [#A]aaa[#B]baabbbaaa[A#]bbbbb[B#]. Here the
subsegments influenced by #A# are bounded by #A and A#, ditto #B#.
For example, sandhi:
[#1]iti[1#][#2]ahur[2#] agglutinatively, abstract morphemes in order
[#1]it[#2]y[1#]ahur[2#] after sandhi, where /y/ is influenced by both.
ok.
-------------------------------------------------------------------
Consider some workflow:
* Create document
* Block out as 5 lines or paragraphs, so far without content
[#p1][#p2][#p3][#p4][#p5][p5#]
* (Perhaps apply image processing or OCR to create some intermediate
forms to focus and support manual transcription.)
* Fill in the paragraphs (here as Roman character glyphs encoded as
ASCII but use your own charset & editor/text entry method):
[#p1]Om puurnam adah puurnam idam
[#p2]puurnaat puurnam udachyate
[#p3]purnasya puurnam aadaayaa
[#p4]purnam eva vashishyate
[#p5]Om shaanti shaanti shaanti
[p5#]
* Segment into "words"
[#p1][#w1]Om[#w2]puurnam[#w3]adah[#w4]puurnam[#w5]idam[w5#]
[#p2][#w6]puurnaat[#w7]puurnam[#w8]udachyate[w8#]
[#p3][#w9]purnasya[#w10]puurnam[#w11]aadaayaa[w11#]
[#p4][#w12]purnam[#w13]eva[#w14]vashishyate[w14#]
[#p5][#w15]Om[#w16]shaanti[#w17]shaanti[#w18]shaanti[w18#]
[p5#]
* Segment inflectional morphemes
[#p1][#w1]Om[#w2]puurn[#m1]am[#w3]adah[#w4]puurn[#m2]am[#w5]idam[w5#]
[#p2][#w6]puurn[#m3]aat[#w7]puurn[#m4]am[#w8]udachya[#m5]te[w8#]
[#p3][#w9]purn[#m6]asya[#w10]puurn[#m7]am[#w11]aat[#m8]aayaa[w11#]
[#p4][#w12]purn[#m9]am[#w13]eva[#w14]vashishya[#m10]te[w14#]
[#p5][#w15]Om[#w16]shaanti[#w17]shaanti[#w18]shaanti[w18#]
[p5#]
* Enter dictionary entries:
L1: Sanskrit. L2: English
om -> om
puurn# -> whole, complete, perfect
#am -> nom.sg.
#aat -> ablative
#asya -> genitive
#aayaa -> subjunctive
adah -> that
idam -> this
eva -> only
vashishi -> remain
#ate -> present
udachi -> arise
shaanti -> peace
* Populate morpheme translation tier automatically from dictionary
[#p1][#w1]Om[#w2]whole,complete,perfect[#m1]nom.sg.
[#w3]that[#w4]whole,complete,perfect[#m2]nom.sg.
[#w5]this[w5#]
[#p2][#w6]whole,complete,perfect[#m3]ablative
[#w7]whole,complete,perfect[#m4]nom.sg.
[#w8]arise[#m5]pres.[w8#]
[#p3][#w9]whole,complete,perfect[#m6]genitive
[#w10]whole,complete,perfect[#m7]nom.sg.[#w11]abl.[#m8]subj.[w11#]
[#p4][#w12]whole,complete,perfect[#m9]nom.sg.[#w13]only
[#w14]remain[#m10]pres.[w14#]
[#p5][#w15]Om[#w16]peace[#w17]peace[#w18]peace[w18#]
[p5#]
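A sketch of that automatic population step in Python, walking the
bracketed node notation and substituting each content segment from a
simplified, illustrative morpheme-to-gloss dictionary (the real
dictionary entries above carry boundary marks like "puurn#" and "#am"):

import re

TOKEN = re.compile(r"(\[[^\]]+\])|([^\[\]]+)")   # node tags vs. content runs

def populate_glosses(segmented_tier, glosses):
    """Copy the node structure of a segmented tier, replacing each content
    segment with its dictionary gloss (unknown segments are left as-is)."""
    out = []
    for tag, content in TOKEN.findall(segmented_tier):
        if tag:
            out.append(tag)
        else:
            out.append(glosses.get(content.strip(), content))
    return "".join(out)

# populate_glosses("[#w2]puurn[#m1]am[#w3]adah",
#                  {"puurn": "whole,complete,perfect", "am": "nom.sg.",
#                   "adah": "that"})
#   -> "[#w2]whole,complete,perfect[#m1]nom.sg.[#w3]that"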
* Manually select dictionary entries from the list given in a context.
The display should show words with multiple entries in a highlighted
form with a menu representing the options, and making the
transcriber's job easier to select the preferred option.
[#p1][#w1]Om[#w2]perfect[#m1]nom.sg.
[#w3]that[#w4]perfect[#m2]nom.sg.
[#w5]this[w5#]
[#p2][#w6]perfect[#m3]ablative
[#w7]perfect[#m4]nom.sg.
[#w8]arise[#m5]pres.[w8#]
[#p3][#w9]perfect[#m6]genitive
[#w10]perfect[#m7]nom.sg.[#w11]abl.[#m8]subj.[w11#]
[#p4][#w12]perfect[#m9]nom.sg.[#w13]only
[#w14]remain[#m10]pres.[w14#]
[#p5][#w15]Om[#w16]peace[#w17]peace[#w18]peace[w18#]
[p5#]
* Manually translate from translated morphemes to English phrasing:
[#p1][#w1]Om.[#w2]That is perfect.
[#w4]This is perfect[w5#]
[#p2][#w6]From the perfect
[#w7]The perfect arises[w8#]
[#p3][#w9]From the perfect
[#w10]If the perfect is taken[w11#]
[#p4][#w12]The perfect, only, remains[w14#]
[#p5][#w15]Om!
[#w16]Peace!
[#w17]Peace!
[#w18]Peace![w18#]
[p5#]
* Automatic editing procedures (such as emacs macros or eLisp functions)
should be made available and easily invoked to:
* construct or add to dictionary the words not presently found therein.
* carry out segmentation of words, inflectional morphemes, etc. using
some expanding/trainable ruleset, into a new tier.
* copy a tier to be a new tier (pick from an inventory, enter new tier name)
* substitute within a tier per dictionary mappings
* enable text editing: click to select a segment, control-+ to
expand to include the next segment, type to replace the selection
with new text.
* Multi-tier editorial display should be provided, to see other tiers while
editing a tier.
* Presentation for learning may be computer controlled based on a
model of the reader/learner's knowledge, or manually parameterized.
Anyway we now have data to support the learner.
A map to the meanings of the grammatical encodings like "abl"
(ablative, 'away from'), "subj" (subjunctive, 'possibly') should be
a click away.
A map to a concordance for any morpheme should be a click away.
A map to an IPA reference and to a pronunciation guide and script
description should be a click away.
An audio format where the text is performed in a recording with a
bouncing ball display should be a click away.
All this may be hidden and the image/video/audio media
(dis-)played, with the display of all tiers or a parameterized,
selected subset, a click away during playback when the audience is
puzzled and wants to understand the part they just heard but didn't
understand.