Translation Graphs

Tom Veatch, 2003, 2008, 2017

Introduction and Motivation

Suppose you are competent in one language (call that language L1) and you are interested in a document in another language which you don't know (call that language L2). Wouldn't it be nice if someone had done some work on that document to make it accessible to you in that other language? Of course a normal paragraph-by-paragraph translation of the whole, such as you may find in a published translation of that authored work, would do something like this, but then you wouldn't be getting it in the other language at all, but in your own language. You'd learn the translated content, but you wouldn't learn the language.

I have in mind, instead, something that opens the language itself to you, so that as you, the learner, go through the document as analysed and worked up by a linguist/translator, you gradually become able to recognize and understand the elements of the other language, and ultimately to understand the original itself. This worked-up form of the document would have to provide a variety of more-accessible forms of the parts of the document: translations that are word-by-word, not just paragraph-by-paragraph, and links to audio playback of pronunciations of the symbols and words, perhaps of larger sections. Working through it should teach you, and enable you to get a lot of its content; the second or fifth time you encounter a word in that language, you might not have to look at its dictionary entry to begin to understand it yourself. Ultimately, with enough such documents, and enough time devoted to working through them, you would become a competent reader and even listener in that other language.

To enable this vision of language learning through a mediated, supported, but direct encounter with the original L2 document, I have had to envision a whole architecture of language data representation, markup, storage, lookup tools, editing systems, display systems, and the like, which would be needed to take that original L2 document, add the L1 resources needed, and then make them accessible to you as you read through the document. This is my draft description of that system.

The key idea, of course, is multilinearity. Consider the original document as a line of text. Maybe a very, very long line, but in the abstract, just a sequence of symbols on a single line. Then any additional representation that supports or makes accessible to you any piece of the original document can be considered as a translation of a piece of that first line, written onto some additional second (or Nth) line, in a way where you can tell which part of the first line it is a translation of. Such a representation is multilinear. You might have a large number of lines: lines that do pronunciation, others that do vocabulary, lines with big gaps in them, lines that refer to data outside the workup in a dictionary or an audio library or on the internet, lines that call your attention to syntax or dialect features, lines that link to clear pronunciations from an audio dictionary, and lines that link to live, vernacular or fast speech recordings of whole sentences or turns, so that you can learn how fast speech sounds in that language, not just careful pronunciations from a dictionary.
And the system that ultimately presents it to you could have an intelligent model of what you know as a learner of this new language, and it could display for you, at a suitable pace and with the right amount of repetition and testing, the easiest next elements for you to learn, as you gradually acquire competence with all the many things in that document. Such a trainer is beyond the scope of this current discussion, but it is something that, after we achieve the technical requirements discussed here, we could then aim to build. It would be a vast improvement on Teachionary (www.sprex.com -> teachionary).

So now, then, to the technical details.

Translation Graphs

I here propose a logic and associated document format for multilinear text representations, comprising facilities to arbitrarily segment, annotate, and translate text documents, and providing the supporting data and structure required for (audio/visually) displaying and editing multilinear text representations. A multilinear text might incorporate lines representing, for example:

* L2 orthography
* L1 translation of each unit (word by word, sentence by sentence)
* references to corresponding audio bits (filename and time endpoints)
* references to video bits (file, times)
* references to dictionary entries
* translator's comments about a unit
* references to a graphical form, such as (coordinates defining a polygon or rectangle within) a page or scan or photograph or image containing data related to a span of content
* etc.

The lines can be loosely thought of as "translations" of one another, though the translation may or may not be between human languages; for example, different translations may be representations of the same text at different levels of linguistic representation: paragraphs vs. words vs. audio, etc.

The concept here is similar to but different from Bird/Liberman "Annotation Graphs" (AGs). AGs are:

* directed acyclic graphs comprising sets of nodes connected by arcs;
* arc labels contain the actual text (and text types and other classification info);
* nodes anchor the ends of text units, and optionally encode times in audio files.

Translation Graphs (TGs) are similar but slightly different:

* same: directed acyclic graphs comprising sets of nodes connected by arcs.
* same: arc labels contain the actual text (and text types and other classification info).
* same: nodes anchor the ends of text units.
* different: nodes cannot be labelled with times.
* different: instead, arcs may refer to audio segments (filename/start-time/end-time).

TGs are a generalization of AGs, since an AG refers to a single audio signal or implicit time baseline, whereas a TG can refer to more than one audio representation for (segments of) a single text.

Formally, then:

Logico-mathematical definition of TGs: A Translation Graph is a 5-tuple {T,C,L,N,A}, with T, a set of types; C, a set of classes; L, a set of labels; N, a set of nodes; and A, a set of arcs, each linking from one node in N to another, and each associated with one type in T, one class in C, and one label in L.
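This formal definition maps directly onto a small data structure. Here is a minimal sketch in Python (the names Arc and TranslationGraph are mine, for illustration only; L is left implicit as the set of labels carried on arcs):

    from dataclasses import dataclass, field

    @dataclass(frozen=True)
    class Arc:
        src: str     # id of the node this arc leaves
        dst: str     # id of the node this arc enters
        type: str    # a member of T, e.g. "L1", "word", "audio"
        cls: str     # a member of C, e.g. the underlying document's name
        label: str   # a member of L: the arc's contents

    @dataclass
    class TranslationGraph:
        types: set = field(default_factory=set)      # T
        classes: set = field(default_factory=set)    # C
        nodes: list = field(default_factory=list)    # N, kept in document order
        arcs: list = field(default_factory=list)     # A

        def add_arc(self, arc):
            # Types and classes seen on arcs count even when not declared
            # in any header (see the notes on headers below).
            self.types.add(arc.type)
            self.classes.add(arc.cls)
            self.arcs.append(arc)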
Format of (Text representation for) TGs

A TG is represented using tags and tagged text. A tag is a string enclosed in <>'s; the string begins with a tag name and is followed, as appropriate, by further text. Tagged text is surrounded by a tag pair, where the name of the second is the name of the first preceded by a slash, as in <x> .. </x>.

1) An optional header within <head> .. </head> tags, containing:

a) one or more types within <type> .. </type> tags, including i) the type's name and ii) optionally an encoding for the labels of the given type.

b) one or more classes within <class> .. </class> tags, including i) the class name. This might be the name of the underlying document, which may have its layers stored separately in a variety of files. Class identity across files enables separate storage if desired, while also providing for merging, or you might think zipping, files together into a multi-layer translation graph structure, subject to compatible node labelling and sequencing. Class could encode a classification richer even than mere hierarchy, for document versioning, e.g.:

Torah Old_Testament King_James
Torah Old_Testament UnitarianVersion_21.8

Note: types and classes absent in the header but found in the arcs within the TG are acceptable; a program that writes tidy TGs should add all arc types and classes to the header for completeness, but programs that read TGs should not expect that all will be there, since TGs may be written by hand, and new types/classes added by the writer/translator/editor.

Note: the type for implicit arcs is "implicit" (an 8-character string). Implicit and explicit arcs must have different type(name)s.

Note: the class for implicit arcs is "unspecified" (an 11-character string), unless specified in the header. To specify a class for implicit arcs in the header, let the class name be prefixed with "implicit:" (a 9-character string); when interpreted, the prefix is removed before interpretation.

Note: the encoding for implicit arcs is the name of the charset in which the document itself is encoded.

2) A body comprising:

a) optional <body> .. </body> tags at the start and end of the body.

b) zero or more explicit nodes, each comprising a single tag <node id=$id>. The document is ill-formed if multiple nodes have the same $id. If a program encounters multiple nodes with the same $id, and all arcs between the first and last instances of that node are explicit, the program is authorized to simply delete all but one and continue processing; nothing incoherent is implied. However, if there is any "implicit arc" text between the identical nodes, behavior is undefined; programs ought to fail with a warning.

c) zero or more explicit arcs, each comprising <arc $from $to $type $class>$contents</arc> or alternatively <arc $from $to $type $class $contents>. For example, an arc from node 1 to node 2 of type L1 and no specified class may be represented as <arc 1 2 L1>contents</arc> or as <arc 1 2 L1 contents>.

d) zero or more bits/bytes/characters of "implicit arc" contents.

Note: if "implicit arc" contents occur before the first and/or after the last <node> tag, then a <node id=begin> and/or a <node id=end> tag is implicitly considered to be present. A tidying program should make such implicit begin/end nodes explicit.

Note: if there are no implicit arcs, then the sequential order of occurrence of node and arc tags in the body is immaterial, since the ordering of nodes and arcs along any path through the graph can be reconstructed as the implied sequence of node ids from the arcs. Similarly, the arcs might be in separate files, but still joinably reorientable to the nodes shared between the files. Human readability will be enhanced if each file has its own type of data in it; a display program can then zip them together onto a shared spine of nodes, and show graphically, and even audibly, and perhaps even in video, some selected components. Annotations by a given editor can be in their own separate file. A TG-capable editing program could be made to store layers to their separate respective files while ensuring cross-compatibility of the order and labelling of nodes.
Note: if stored separately, each layer may more simply be written out using implicit arcs, since then linear ordering and arc endpoints need not be written explicitly but can be derived from the text between the node tags, thus reducing the explicitness of the tagging to just node tags and improving the file's readability. The key is node consistency between layers. This needs to be checked upon loading and merging multiple TG layers of a single class or document name, but can easily be ensured when writing to files. Where documents have different node labellings, a UI might be provided to control and supervise a zip-together operation, identifying the nodes that correspond in the merging files.

e) Contents is a text string encoded according to the encoding (charset) for arcs of the given type. Contents may be one of:

i) a direct text representation of a linguistic unit. The type's encoding refers to the encoding of the directly included text data in this case.

ii) a reference to another form of the linguistic unit in an external resource or file. Such a reference must provide enough information to extract and interpret the data within the context of the given arc. A global resource could be specified in the header in a future TG format version; for example, the filename of the corresponding live audio file. Local resources generally, and all resources in Version 1.0 compliant TGs, can be specified by a URL (and if the URL specifies no method, consider it a filename path). In addition to the regular URL methods, this includes methods for database lookup and audio-file subsegment extraction. The type's encoding refers to the encoding of the external data in this case.

Note: formalize the database lookup method later, as it is used.

Note: formalize the audio reference method later, as used. But "filename start_time end_time" sounds good; with the encoding for the arc's type being (e.g.) "audio:raw 16KHzPCM", this is interpretable by playback code.

Note: formalize the video reference method later, as used. But "filename start_time end_time" sounds good; with the encoding for the arc's type being (e.g.) "video:mpeg", this should be interpretable by playback code.

Note: formalize the dictionary lookup method later, as used.

Note: formalize the alternate-text reference method later, as used. But "filename start_char end_char" is a good start, assuming the referred-to file is unchanging. Or "TG_filename start_node end_node type" would also work, assuming the node IDs don't change and the type uniquely identifies a single arc.

Discussion

A TG can usefully be thought of as a text under iterative editing, translation, and refinement. We will here consider TGs as texts in a variety of stages of processing.

At the first stage is the raw text of the document or other text unit. We have defined the TG file format so that a raw document is a valid TG file with implicit begin/end nodes and a single implicit arc whose contents are the entire document. This implicit arc's type and class are "implicit" and "unspecified", respectively, and the implicit encoding for the arc is the charset of the document.

At a second stage of processing, we might do any segmentation desired, by inserting nodes to anchor segment ends between linguistic units such as paragraphs, sentences, lines, words, morphemes, etc. A node tag <node id=$id> must include a unique node identifier (e.g., a number). Implicitly, the text between nodes is the label for that segment. A unit of tag-external text may be referred to as an "implicit arc".
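To make the implicit-arc convention concrete, here is a hedged sketch (my own code, not part of the format definition) of how a reading program might recover the nodes and implicit arcs of a body string at this second stage:

    import re

    NODE_TAG = re.compile(r'<node id=([^>]+)>')

    def parse_body(body):
        # Text outside any tag becomes an implicit arc (type "implicit",
        # class "unspecified") between the surrounding nodes; begin/end
        # nodes are supplied when contents precede the first node tag or
        # follow the last one.
        nodes, arcs = [], []
        prev, pos = None, 0
        for m in NODE_TAG.finditer(body):
            text = body[pos:m.start()]
            if text:
                if prev is None:
                    prev = "begin"          # implicit begin node
                    nodes.append(prev)
                arcs.append((prev, m.group(1), "implicit", "unspecified", text))
            prev = m.group(1)
            nodes.append(prev)
            pos = m.end()
        tail = body[pos:]
        if tail:
            if prev is None:
                prev = "begin"              # a raw document: no tags at all
                nodes.append(prev)
            nodes.append("end")             # implicit end node
            arcs.append((prev, "end", "implicit", "unspecified", tail))
        return nodes, arcs

    # parse_body('a<node id=2>b<node id=3>c') yields nodes
    # ['begin', '2', '3', 'end'] and implicit arcs labelled "a", "b", "c".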
At a third stage, for example, we can make the arcs explicit, by replacing implicit arcs with explicit arc tags labelled with the same text. To make an arc explicit, tag it so, with <arc $from $to $type>text</arc> or <arc $from $to $type text>.

At a fourth stage, we can translate arcs from one representation to another. To add another representation for a segment that is an explicit arc linking two nodes a and b, add another arc from a to b, with the added representation's type (e.g., "L2" if it is a translation into L2), class, text encoding (or put that in the header), etc., and the translated contents.

To illustrate, here are some different TGs derived from an original text document containing just the string "a b c":

raw:              a b c                                               [order matters]
segmented:        <node id=1>a<node id=2>b<node id=3>c<node id=4>     [order matters]
explicit:         <arc 1 2 word a> <arc 2 3 word b> <arc 3 4 word c>  [order doesn't matter]
words translated: <arc 3 4 L2 (C)>                                    [add anywhere]
S's, words:       <arc 1 4 S>a b c</arc>                              [add anywhere]
word wav's:       ...
S wav's:          ...

Note that more than one audio resource may be referenced by different arcs, including isolated pronunciations of dictionary entries, full-text readings by one or various readers, real-live-recorded originals (if naturally recorded), etc. This is why these are not annotation graphs, which are designed for annotating single audio files and which can have times within the relevant audio file specified at the node. Rather, this type of graph annotates relationships among linguistic data. A text segment may have translation relationships to multiple audio file segments, for example, audio recordings of different actors reading the same line of Shakespeare. A TG handles this by referring to the audio file segment with an arc like any other arc, with its appropriate (audio) type, and with its contents specified by reference to the audio file and time endpoints.

Note that text outside of a tag is implicitly considered as a label for an arc between the preceding and following nodes. (In a draft of this spec, I allowed for "0" and "-1" as node names equivalent to "begin" and "end".) Assume that <node id=begin> and <node id=end> are implicitly inferrable if not explicitly present in a document (which may have no node tags at all). Thus a raw text document has two implicit nodes anchoring the ends of the whole document, and is a well-formed multilinear text document.

Operations and Apps

Automatic and manual methods, either or both, can reasonably be used to operate on texts of this sort. Reasonable operations would include segmenting, arc-labelling, translating, linking to audio files of segment pronunciations, etc.

An initial formatter might, for example:

* add begin/end nodes.
* optionally make the implicit arc explicit: a single arc with the whole document's text as its label.

A segmenter might (a sketch follows this list):

* select an arc of a specified type with a complex label.
* split the label into sequential or simultaneous components of a second specified type.
* add nodes between sequential components, and add arcs labelled with the respective components.
* iterate this segmentation process over all arcs of a given type in the file.
* use manual or automatic methods. For example, an automatic segmentation could be done in a script-teaching application, where each letter in the script gets its own arc of type "letter", parallel to ("simultaneous" with) an arc referring to a teaching resource for that letter (e.g., its IPA or Roman equivalent, or audio form).
* or a document could be manually segmented with the aid of emacs macros &c.
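Against the TranslationGraph sketch given earlier, the core of such a segmenter might look like the following (again my own sketch; the whitespace splitter merely stands in for whatever segmentation rule applies, and any unique node-naming scheme would do):

    def segment_arc(tg, arc, unit_type, splitter=str.split):
        # Split one explicit arc's label into a chain of finer-grained
        # arcs.  New nodes go between the pieces; the original arc is
        # kept, so coarse and fine tiers coexist over shared endpoints.
        pieces = splitter(arc.label)
        if len(pieces) < 2:
            return
        fresh = ["%s.%d" % (arc.src, i) for i in range(1, len(pieces))]
        tg.nodes.extend(fresh)   # list order is immaterial once arcs are explicit
        chain = [arc.src] + fresh + [arc.dst]
        for a, b, piece in zip(chain, chain[1:], pieces):
            tg.add_arc(Arc(a, b, unit_type, arc.cls, piece))

    def segment_all(tg, coarse_type, unit_type):
        # iterate the process over all arcs of a given type in the file
        for arc in [a for a in tg.arcs if a.type == coarse_type]:
            segment_arc(tg, arc, unit_type)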
A dictionary-connector might:

* add arcs of a "dictionary-index" type, parallel to (and between the same node endpoints as) arcs whose labels are found in a dictionary lookup (e.g., where a hash or other index could save repetitious dictionary lookup of additional instances of a word in a document).

A word-by-word, phrase-by-phrase, or sentence-by-sentence translator might:

* add a new type and charset to the header.
* add arcs of that type parallel to each word, phrase, or sentence arc, each with contents being the word/phrase/sentence's translation.
* here "translation" may mean a mapping into any other linguistic level; for example, translate orthographic Sanskrit (where graphemes are derived from both words at a word boundary) to sequences of (separated, "underlying") morphemes.

A parameterizable display system might (a crude sketch follows this list):

* read and parse one or more TG files, constructing a (probably not very human-readable) TG data structure internally, as a set of nodes with labels, and arcs of various types between the segment-anchoring nodes. (Note that overlapping chunks, as for example in non-agglutinative languages, may require multiple nodes at a finer level of representation, with arcs covering more than a single node-to-node segment: one node preceding the first influence of a later form, and another following the last influence of the previous form.)
* display the types available in a configuration UI for selection.
* display the selected types in a linguistics-style, multilinear, tabular display, comprising a line in the table for each selected type, and links on selected types leading to a selected alternate type. E.g., click on a word in the word line to hear the audio of the word, not shown in the text display but referenced in the TG as an audio arc corresponding to that word arc. Or, e.g., click to pop up and choose from a menu of the alternative data types available.
* Implementation could be via PHP or JavaScript mapping TG files to HTML with the intended UI functionalities. Dictionary lookup might be to a cloud-located, globally shared, perhaps many-language resource. Video access might be to YouTube or another (universally) accessible video document store.
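As a crude, text-only stand-in for that tabular display (no links, fonts, or input methods; again assuming the TranslationGraph sketch from earlier, with tg.nodes in document order), one might write:

    def render_tiers(tg, selected_types, width=14):
        # One table row per selected type, one column per node-to-node
        # segment.  A real display needs colspan-style logic for arcs
        # spanning several segments; here a wide arc's label simply
        # appears at its starting column.
        col = {n: i for i, n in enumerate(tg.nodes)}
        ncols = max(1, len(tg.nodes) - 1)
        for t in selected_types:
            cells = [""] * ncols
            for arc in tg.arcs:
                if arc.type == t:
                    cells[col[arc.src]] = arc.label
            print("%12s | %s" % (t, " | ".join(c.ljust(width) for c in cells)))

Clicking through to audio or to alternate data types would hang off the same columns: each cell knows its arc, and parallel arcs of audio or dictionary-index type share its endpoints.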
An editing system might:

* display a selected subset of the TG's types in multilinear tabular form.
* provide for creation of a new type, with its header <type> tag, with arcs optionally exhausting the document (comments don't, translations do), and with a derived-from type for automatically generating a first-draft set of arcs of the new type.
* provide a line in the multilinear tables for entering translations (data for that new type) parallel to a selected other type. It should have a charset, input method, and display font.
* provide means of inserting nodes, e.g., for a click on a character to be interpreted as inserting a new boundary node before it and inserting arcs of the current type on its left and right.
* provide for automatic pre-filling of arc contents via some perl-ish substitution s/// mapping from another type (e.g., orthography to phonology by some rule system).
* provide for editing boundary locations (deleting and adding), e.g., if a boundary was automatically inserted in the wrong place.

A language-teaching system might:

* select a parameterization for the display system.
* drive the user's reading through the system via highlighting displayed text bits (e.g., bouncing-ball) simultaneous with playback.
* do a read-aloud game: have a layer of arcs for L2 ASR grammar resources, and highlight the next after the previous succeeds (or doesn't).
* ask the user what s/he wants to learn:
* to learn an alphabet, provide links within text, and bouncing-ball read-aloud one letter at a time.
* to read content with translations shown only for some selection of new words/morphemes (e.g., randomly selected at a certain percentage or frequency in the text, or selected by a teaching algorithm based on a model of the user's knowledge level, which could be maintained by the system at a fine or gross level, or configured by the user, also at a fine or gross level).
* bouncing-ball read-aloud one word at a time; this needs sub-sententially aligned audio arcs.
* enable an isolated-word pronunciation mode, via dictionary pronunciation audio arcs or via reference to a carefully pronounced rendition of the text.
* to learn isolated-word vs. vernacular-conversational pronunciation.

A Context of Application

An extended example might be helpful to see the utility in this quite abstract system presentation. Consider a context of historical document preservation such as that being carried out by the Muktabodha Indological Research Institute, which is saving disintegrating, family-stored document archives from oblivion. They have discovered ancient palm leaves covered with hand-copied historical texts, being misused and often in bad condition, and they are committed to preserving these resources.

For MIRI, the first step (after fundraising, hiring, training, advertising and networking, locating, persuading, travelling, unpacking, and setting up the equipment) is the scanning of the found materials. From this a primitive TG could be produced as simply a sequence of scan filenames in a text file. After slight processing, it could be reformatted into a proper TG file with head, type, class, body, node, and arc tags, in which the relevant TG layer type might be "original_scan_jpeg", a sample node id might be "Hejamadi_Sanjeeva_Kunder_box_3_scan_3209", and the arc immediately after that node a filename reference to the particular scan. (If the order of the scanned pages relative to one another is not known, then both start and end nodes could be given for a floater, and empty arcs specified as entering from 0 or leaving to -1.) An upcontrasted image set could be integrated into the TG formatting by adding another TG layer, with type "contrast+120_scan_jpeg", whose arcs refer to separate, corresponding image files. In this way, workflow can be carried out, tracked, and reintegrated in TG layer files.

Although readable in their direct image form to specialists, these scans then need to be processed into something useful for the rest of us. Does this situation suggest Translation Graphs? I hope so. So, for example, passes made by improving, purpose-trained OCR systems over the images might produce a lot of segmentations: top down into line_areas, string_areas, char_areas, and feature- or glyph-stroke areas with their extracted parameters, and then bottom up into probability- or confidence-weighted character/word/morph hypotheses (perhaps multiple "simultaneous" hypotheses for a given single area, or multiple overlapping areas). A human-edited OCR transcription might be derived from the above, copied and reviewed, and approved after editing by a competent editor, with hypotheses confirmed/deleted/modified and content added/subtracted/changed, as well as segmentation endpoints moved, removed, added, or multiplied where the OCR produced bad segmentation.
Obviously more work produces better results, and many drafts each provide their separate TG layer of translation of the now-multiplying forms, or glimpses, of a theoretical, implied, underlying, intended document that the author of these ancient, perhaps disintegrating palm leaves bequeathed to us in that form. Such an edited transcription might then be built onto, as added TG layers on the same document:

* transliterated from its perhaps obscure script into a more accessible script such as devanagari or Roman;
* translated word by word into dictionary references, or referenced to a growing concordance;
* translated at a line/word/paragraph level to some L1 (type L1, charset ..., class doc_name author ...);
* rendered into audio by a reader, thence recorded into a digital file, made accessible to the system, and linked to by assigning segments of audio to start/end node spans.

In short, scan them, then build up what you have into parsed, understandable TG documents readable by all. With the constellation of tools and operations described here, it is imaginable that ultimately any interested human could access and penetrate, could with minimal, if large, efforts learn to read in the original, these preserved archives. And the same systems could be used to provide teaching access to learners of a target language through movies in that language, suitably supported by transcriptions and dictionaries and translations, all displayed and prompted into the viewer's attention so that learning and understanding can be made as effortless as possible.

----------------------------------------------------------------------

Sample data (two Roman-transliteration tiers, Roman_S and Roman_W):

suuta uvaacha
kailaasashikhare ramye bhaktisa~dhaananaayakam
praNamya paarvatii bhaktyaa sha~kara~ paryaprchchhata
shrii devyuvaacha
aum namo devadevesha paraatpara jagadguro
sadaashiva mahaadeva gurudiikshaa~ pradehi me

<dictionary>
<noun devi Goddess>
<exclamation aum Om>
<verb namo I-bow>
<noun deva God>
<? N[(.*)(a)]/$1esha of-pl.Ns>
<adj para greater>
<? Adj[(.*)(a)]/$1aat the-most-Adj>
<noun jagat world>
<noun guru teacher>
<adj sadaa auspicious>
<personal_name shiva Shiva>
<adj mahaa great>
<noun diikshaa initiation>
<verb pradehi give>
<pron me to_me>
</dictionary>

<aside> suuta uvaacha </aside>
<verse 1>
<line> kailaasashikhare ramye bhaktisa~dhaananaayakam </line>
<line> praNamya paarvatii bhaktyaa sha~kara~ paryaprchchhata </line>
</verse>
<aside> shrii devyuvaacha </aside>
<verse 2>
<line> aum namo devadevesha paraatpara jagadguro | </line>
<line> sadaashiva mahaadeva gurudiikshaa~ pradehi me || 2 || </line>
</verse>

After Discussions with Dave Graff:

Let a Document be defined as a consistent segmentation, plus a document name.

Let a Segmentation be defined as an ordered set of named nodes/boundaries (convenient if they sort into order on their names).

Let a Tier include a subset of a Document's segmentation (including its edges, the first and last nodes thereof), and further let it specify content or material in some form within each of the tier's segments.

Thus: content may change from tier to tier, but the segmentation remains consistent. A SubDocument is a Document when its segmentation is consistent across all its Tiers.

Examples: A movie represented in the following form:

<node id="MovieStart">file:movie.mp4<node id="MovieEnd">

is a Document comprising a single tier.
Retaining and elaborating that segmentation, further segmenting it into scenes, one may add to the above Document a second tier, perhaps in a second file, for example:

<node id="MovieStart">
<node id="#credits"> t=0.0,120.5
<node id="#opening"> t=120.5,240
<node id="#development"> t=240,1440
<node id="#denouement"> t=1440,4000
<node id="#closing_credits"> t=4000,4200
<node id="MovieEnd">

Observe how the nodes from the first tier carry over into, and remain consistent in, the second tier. '#x' and 'x#' are convenient notations intended to refer to a node or boundary on one side or the other of content labelled x. Further, the times in these arcs are interpreted within a given context, here, an mp4 file specified on another tier, which gives the time-pairs meaning as segments containing specific content.

Separately, consider a set of image files enumerated by file name in a listing file, separated by named <node>s. That would be another definition for a Document, including some content by reference:

<node id="#ImageArchive">
<node id="#1">img_001.jpg
<node id="#2">img_002.jpg
..
<node id="ImageArchive#">

Associated with the ImageArchive Document, one might enumerate, as a more fine-grained Tier, a set of bounding boxes, in a certain order, to be understood as located within the segment's associated image, each box bounding all the pixels associated with a character glyph. That enumeration would be a tier of an elaboration of the previous ImageArchive Document. An added tier, drafted by an image processing algorithm and corrected by a human, might be a sequence of character-sized arcs between character-bounding nodes, this tier representing the transcription of the imaged page. Stripped of <node> annotations, it is equivalent or identical to the digitized text of the image.

----

Node names can be used as parts of referring expressions to identify content substrings; each character has its offset in its tier's string, so that sequences can be referenced for translation within another tier as tiername[i,j]=..., etc. In this way character offsets might be used as a method of cross-tier indexing for translation, indexation, interpretation, etc. However, the approach here uses such references substantially less than the Annotation Graph approach does; in this Translation Graph approach, the explicit structure provides a concrete anchoring that establishes reference across tiers, since a bit of content in a segment of this tier must necessarily be within the corresponding segment in every tier of the same document, by the requirement of consistency of segmentation. If the position of some bit of content within a Document can be seen directly in its presented ordering between <node>s, then it is unnecessary to cross-refer with numerical indexes, for example to say that characters 12-16 of the other tier are a certain morpheme. Instead, a morpheme tier has segmentation boundaries between morphemes, which are consistent with the boundaries between, say, phonemes at another tier, and as such the alignment is directly visible.

An important task here has to do with database structure, and with populating and utilizing that structure, as in, for example, a method of transduction of database content from one user to another depending on the user's needs. Some of the translation tiers can be pulled out of the database. Sometimes a differently (say, more carefully) spoken rendition might be a tier. The more renditions the better, indeed, since translations are so variable.
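The consistency-of-segmentation requirement is the load-bearing piece here, so it is worth stating as code. A minimal check, with tiers represented simply as ordered lists of node names (my sketch, not a fixed API):

    def is_consistent(doc_nodes, tier_nodes):
        # A tier is consistent with its Document when the tier's
        # boundaries are a subsequence of the Document's segmentation
        # and share its first and last nodes.
        if not tier_nodes:
            return False
        if tier_nodes[0] != doc_nodes[0] or tier_nodes[-1] != doc_nodes[-1]:
            return False
        remaining = iter(doc_nodes)
        return all(name in remaining for name in tier_nodes)

    # The movie example: the one-tier Document against its scene tier.
    # is_consistent(["MovieStart", "#credits", "#opening", "#development",
    #                "#denouement", "#closing_credits", "MovieEnd"],
    #               ["MovieStart", "MovieEnd"])   # -> True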
According to Dave, back in the 2000's, a corpus of news reporting in Chinese or Arabic, translated into English ten times over, produced results that were always different! Only a rare short sentence came out the same across translators. Word choice, word order, pronominalization: all different; all the reactions by natives were different. Usually not very significantly, but frequently and subtly. Everyone has their own take.

Now, the purpose here is the data and interfaces to support a language-teaching, or language-learning-supportive, browser. Of course machine learning, in multiple iterations, has its role to play. Based on initial work by a linguist, the machine learning algorithms will improve their transductions to preliminarily populate added tiers. Then linguists will improve the machine-generated drafts. Then the machines will continue learning. Presented with a gray box for a word proposed by algorithm, a human decider could click it to see alternatives, or type (or push a button and speak aloud) to enter a new one, and select a correct form, which the algorithm will learn from and use to continue to improve its hypotheses in that tier. Machine learning can help partially automate segmentation, lookup, and translation, and also vocabulary sorting, based on frequency, to help decide what learners should learn first, etc.

The resulting picture here is a workflow encompassing an ongoing process of sustained translation into another language. One step might be called "transcribe": convert jpegs to an ordered sequence of bounding boxes, then to characters by OCR, then correct those classifications by human, feeding the corrections back into the OCR. Another step does morpheme translation, simultaneously with dictionary construction. A dictionary process that feeds the labels learned so far forward would start labelling unlabelled words. The dictionary is not fixed but is a process: a growing and living dictionary. As Dave says, 80% of words may have no ambiguity, but the other 20% will be 80% of the work: multiply ambiguous, highly context dependent, related to Zipf's law. A long tail of infrequent words which are relatively clear, and very rare words which may be quite unknown. In this workflow toolset, human users will label and work away until achieving some kind of critical mass, making it useful to others.

Dave: Building the browser is going in a different direction from archiving. Archiving, with the raw source material and analysis/translation, is just the bare facts. Then mediating to a learner/reader is more: a tool that serves as an instructor, carrying on an ongoing dialog with the individual to know what they got out of it; how comfortable are you with this?

Tom: Have an Apple TV remote control that you can click when you don't understand something in the recent history of the current media playback; if it's media annotated with this kind of data, and the remote understands the meaning of the click as "Explain that to me", then the video can pause, an IGT be displayed, and the user can browse until they learn what they want, and click onward to continue.

Dave: Useful in teaching learners is an intelligent use of concordances. Just consider the vocabulary building issue; the biggest issue in language learning is vocabulary access. Once you have a database with the occurrences of each word, then for each word, show them all in context, with all the conjugated forms of it. Maybe you could expand with a rulebook and a grammar and go further, but the actual contextualized found forms tell so much. The concordance is crucial.
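Given a word tier, the core of a concordance is only a few lines. This sketch (mine; a real system would also fold conjugated forms together via the living dictionary just discussed) indexes every occurrence of every form with its context:

    from collections import defaultdict

    def build_concordance(words, window=3):
        # words: a word tier as a list of labels in document order.
        conc = defaultdict(list)
        for i, w in enumerate(words):
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            conc[w.lower()].append("%s [%s] %s" % (left, w, right))
        return conc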
Consider learner support for an utterance-sized bit of linguistic data in the form of a two-way IGT from LS (language of source) to LT (target language, the reader's language) and back to LS, each direction comprising several tiers, including morphemic analysis, word-level translations, and full translations. Enable tagging/editing by permitted contributors, such as (a) a linguist, (b) the author, or even (c) an interested person, to mark errors and questionables, or to introduce corrections at suitable layers. Provide concordances for one or more items in the structure, and enable concordance of others by a menu operation on the item.

Dave: Apply this to multi-lingual Twitter feeds. People who want to understand the L2 Twitter data (and authors who want to be understood) might contribute a lot of data checking and editing to such a system. Getting the community usefulness going, after some critical mass is achieved, with live dictionaries, live algorithms, and humans involved, it could become quite useful to all.

Tom: I want anyone to be able to go into an L2 situation and be maximally supported to learn and understand what they don't know. This is far more ambitious than the Star Trek universal translator. It applies equally to multi-lingual Twitter, to foreign watchers of previously unsubtitled English movies, to audio concordances for learning dialect features, to archiving and study of ancient religious texts, to any form of language, whether textual, audio, or video, that is of interest to the point that it is worth doing the work on it to make it accessible to another language. Put an app into your iPad and watch the TV with it; when it recognizes a place in the film where someone has made a tutorial out of it, the app provides for the user to click a button and see and go through an IGT to learn -- on the iPad, if the TV isn't smart enough to show it on the TV. Or have it be knowledgeable enough about you to pause and give you a translation of something it thinks will be helpful to you, once in a while. And you can click "?" here or there as a question about what did that mean, and it can help. Even partly-understanding native speakers can use similar controls over the presentation of the Translation Graphs, to turn the subtitles on and off.

12/20/2017

After a request for a translation of a paper I wrote into French: I imagine providing a web UI for crowdsourcing translation tasks, exposing, to begin with, some tiers of the original document: as document, sections, paragraphs, sentences, words. Then I imagine populating some added, French tiers with Google Translate data. I guess one could only see a sentence or two at a time within the UI; that's fine to begin with. The underlying data form would be: tiers in different files. Emacs would do as an editor to convert higher-level segments to finer-grained segments. Then some background processes would populate the French tier by doing Google Translate robo-requests. Another might cut the TGs in the files into bits and pump them into the MySQL database, so that various forms thereof could be accessed using various SQL queries, like SELECT...JOIN.... Next, a web UI in HTML, probably enhanced with JavaScript or another DataTable system: some kind of editable, displayable, automatically populatable tables, to show and provide for editing/correction/entry of the various tiers.
The editing process, when a change is made in some box of the table, would trigger code to send the changes not to a text file formatted as a Translation Graph, but to the MySQL database storing correspondences suggested by the user: a table saying that French_sentence_by_Google maps to French_corrected_sentence, with a column in the table for the contributing user's ID. More tables for dictionary entries. Etc. Maybe the UI allows a click to expand a part where the translation seems queer to the reader, offering that as a filter on the automatic translations, so readers can pick out the queer bits and just fix them, and meanwhile read on.

Now, why didn't I notice that the segmentation of words in French is not consistent in ordering with the segmentation of words in English, when the word order is different? I suppose that's okay. Or is it? IGTs use a base language, the language of the linguist, for the morphemic translations, but given in the observed sequence of the L2 morphemes. Then the base language morphemes are scrambled from that ordering into a base-language phrase or sentence translation. Then if you build it the other way around, the scramblings won't match up. But perhaps the phrases/sentences will, at a higher level. Some aspect of ordering will remain, and that's what the TG node structure will expose. And the correspondences will work, but by matching longer segments together between two shared boundaries, rather than by directly corresponding in order at the smaller-segment level. Perhaps some language of permutation could be encoded in the graph so that word correspondences could be directly read out of it. Meanwhile, not.

Dave, this is possibly still abstract noise, but I do feel things are moving in the right direction, toward concrete implementability.

--------------------------------------

NLTK Functions:

s = stem.{Porter,etc.}; s.stem(token)
list(tokenize.whitespace(text))
t = tag.{Default,Regexp(patterns),Unigram(backoff=t)}; t.tag(tokens); t.train(corpus)
g = cfg.parse_grammar(gmr); chart.ChartParse(g, METHOD)

TGTK:

tier = TG.readTier(fn1);  // on a plain text file, makes a tier segmented on whitespace
tier.write(fn2);          // output the tier into a file, saving it
doc = readDoc(fn3);       // initial tier
for (fnI; ...; ...) doc.AddTier(fnI);   // returns true if consistent and well-formed
for (seg = doc.FirstSegment(tiername); seg; seg = seg.nextSegment(tiername)) {
    // seg is an object that selects a segment or arc between nodes on a tier;
    // thus it defines predecessor and successor nodes at that tier,
    // as well as the including arcs on larger-segment tiers,
    // and the included arc sequences on shorter-segment tiers.
    // handle the segment
}

Each tier has a name and an access method, and for each arc between adjacent nodes within the tier it has data. The underlying data is provided to a caller by calling the access method on the data. So it might be:

<tier name="WordsTier", access="return arc data in native charset as the word itself">
<node id=#1>Yes <node id=#2>Father <node id=2#>
</tier>

<tier name="SentencesTier", access="as_self">
<node id=#1>Yes Father <node id=2#>
</tier>

Then calling code could call

doc.FirstSegment("WordsTier").access("WordsTier");

to retrieve the ASCII string "Yes", which we thereby take as the actual first word, or it could call

doc.FirstSegment("WordsTier").NextSegment("WordsTier").access("WordsTier")

to retrieve the ASCII "Father", the actual second word.
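For concreteness, here is a minimal Python rendering of this tier-and-segment access pattern (a sketch under my own naming assumptions, not a settled TGTK design; the consistency check is stubbed out):

    class Tier:
        def __init__(self, name, boundaries, labels, access=lambda data: data):
            self.name = name
            self.boundaries = boundaries   # node names; len(labels) + 1 of them
            self.labels = labels           # arc data, one item per segment
            self.access = access           # maps stored arc data to usable content

    class Document:
        def __init__(self):
            self.tiers = {}

        def add_tier(self, tier):
            # a real implementation would verify segmentation consistency here
            self.tiers[tier.name] = tier
            return True

        def segments(self, tiername):
            t = self.tiers[tiername]
            for i, label in enumerate(t.labels):
                # yield (preceding node, following node, accessed content)
                yield t.boundaries[i], t.boundaries[i + 1], t.access(label)

    doc = Document()
    doc.add_tier(Tier("WordsTier", ["#1", "#2", "2#"], ["Yes", "Father"]))
    doc.add_tier(Tier("SentencesTier", ["#1", "2#"], ["Yes Father"]))
    for pre, post, word in doc.segments("WordsTier"):
        print(pre, word, post)    # "#1 Yes #2", then "#2 Father 2#"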
As another example:

<tier name="ImagesTier", access="return arc data as a file name">
<node id=#1>IMG1.jpg <node id=#2>IMG2.jpg <node id=2#>
</tier>

Then calling code could call

doc.FirstSegment("ImagesTier").access()

to retrieve the filename "IMG1.jpg".

To automatically process a tier of data, generating a second tier of data to be incorporated into the same Document, code something like this might do:

stemstier = doc.AddTier("StemsTier", CopyNodesFromTier="WordsTier");
for (s = doc.FirstSegment("WordsTier"); s; s = s.NextSegment("WordsTier")) {
    stemstier.replace_with(s, stemmer(s.access("WordsTier")));
}

----------------------------------------------------------------

Let a tier be represented by default as a computer file having a file name base and the filename extension .tier, including at the front, and separate from its body, a globally-unique document title, a tier name, and a hash of its content. Other tiers of the same document should have the same title and a different tier name.

<tier
  /* comment text allowed within the tier tag */
  charset="..."   /* The file's <tier> and </tier> tags are in plain ASCII; *
                   * CONTENT is encoded in the named charset. */
  URL="..."       /* URL/URI for this file */
  title="..."     /* globally unique, could be an ISBN or UPN; *
                   * shared identically across tiers of the same document */
  tiername="..."  /* unique among representations of the titled document */
  parentURL=""    /* URL to the file containing the reference tier. Default: "" */
  boundary="..."  /* Regex of the boundary marker: "#", " ", "\s+", "<node*>" */
  hash="..."      /* Auto-generated from CONTENT, or used to compare. */
>CONTENT</tier>

Here the hash string is the hash of the tier's CONTENT, ensuring a consistent baseline of data and therefore enabling a consistent treatment of its segmentation. Since changing any bit in the content modifies the hash, potentially indicating inconsistent segmentation, autogenerate it after safe changes, and check it when doing any operation that depends on it.

Before the document is locked, each tier's segments should have explicitly named node tags separating each content arc. Then segments can continually be split, combined, or modified without ruining some externally-counted unique global ordering. Upon locking the document, immutability can be assumed, and the system can remove all the explicit node tags with their unique-within-tier names and rely on a generic boundary tag or symbol, combined with a naming or numbering convention. Thus a lock and unlock pair of operations on a document would translate between two forms of the document: in the unlocked form, each tier's segments are separated by uniquely named nodes, and cross-tier reference is certain through using the correct node names; in the locked form, each tier's segments are separated by a generic boundary identifier, and other tiers can refer across nodes to a particular node in the locked tier, identified by its convention-derived name or number.
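The hash field is mechanical to produce and to check; a small sketch (my choice of algorithm and truncation, not part of the format):

    import hashlib

    def content_hash(content, charset="utf-8"):
        # Hash the tier's CONTENT exactly as stored, so that any edit
        # to the baseline is detectable before cross-tier references
        # are trusted.
        return hashlib.sha256(content.encode(charset)).hexdigest()[:16]

    def check_tier(tier_attrs, content):
        # Refuse to merge tiers of the same title whose baselines differ.
        charset = tier_attrs.get("charset", "utf-8")
        return tier_attrs["hash"] == content_hash(content, charset)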
A naming/numbering might be built up by the editor using a binary tree system: starting with the whole document as one segment, inserting the first node boundary automatically numbers the preceding segment as 0 and the following as 1, and each further split numbers the precedent as ...0 and the successor as ...1, thus giving a unique and ordered name to any segment, with the numbering system reflecting the perhaps meaningless and forgettable time sequence of the editor's divisions of the document. Given a succession of segments, a renaming could be done automatically via a decimal counting sequence, as in #0, #1, ... #n-1, n-1#, or via a hashed naming:

#4s8ulkj #98ulaki #kjh92n3 ..

where there is no sort ordering on the names themselves; indeed, sorting is not needed, since the content data (text) contains its own ordering.

Lock(separator, document) would go through the document and remove all the nodes not referred to in multiple tiers, replacing them with separator, the generic boundary marker. The nodes referred to in multiple tiers should be retained, with a name, so that each tier knows how to cross-reference a segment to segments in other tiers. A hierarchy of immutable tiers can be represented in this way.

A mostly-implicit naming system for segments in tiers might be a dotted numbering system with parallel hierarchies. A top tier might be no more than the whole document in one segment. A tier hierarchy specified, for example, as TopTier.NextTier1.ThisTier could be used to interpret a dotted numerical reference for a segment at ThisTier such as #23.#300.#8291, meaning: the current segment #8291 within ThisTier, which is within segment #300 in NextTier1, which is within segment #23 in TopTier. Naming segments proceeds automatically from beginning to end within each tier, from the beginning of the document, starting from #0 and going up to #n-1, n-1#.

A naming convention with # representing the boundary, as traditional in formal linguistic morphology, and consistent with a numbering of content segments as numbered arcs between nodes or boundaries, provides that nodes can be named with # and a number, whereby the number identifies the (zero-based) arc number, and # before or after the number represents the boundary preceding or following the selected arc. In this convention, then, #A is the name of the boundary preceding A, while A# is the boundary immediately after A. Dual naming of boundaries follows: #B = (B-1)# and B# = #(B+1).

Where agglutinative sequencing, in morphology or at higher levels, applies, the above is true. However, where morphemes overlap, the first segmental indication of a following morpheme might precede, rather than follow, the final segmental indication of a preceding morpheme. Then an abstract sequence #A#B# might appear at a lower level of overlappingly influenced subsegments as [#A]aaa[#B]baabbbaaa[A#]bbbbb[B#]. Here the subsegments influenced by A are bounded by #A and A#, and likewise for B. For example, sandhi: [#1]iti[1#][#2]ahur[2#] agglutinatively, abstract morphemes in order; [#1]it[#2]y[1#]ahur[2#] after sandhi, where /y/ is influenced by both. OK.
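The dual naming is mechanical enough to state in code; a trivial sketch (names mine) for the agglutinative case:

    def before(a):
        return "#%d" % a        # "#A": the boundary preceding arc number A

    def after(a):
        return "%d#" % a        # "A#": the boundary following arc number A

    def canonical(name):
        # Resolve the dual naming, B# == #(B+1): e.g. "3#" -> "#4".
        if name.endswith("#"):
            return "#%d" % (int(name[:-1]) + 1)
        return name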
----------------------------------------------------------------------

Consider some workflow:

* Create the document:

<tier><IMG file="IsopanisadImage.jpg"></tier>

* Block it out as 5 lines or paragraphs, so far without content:

<tier>[#p1][#p2][#p3][#p4][#p5][p5#]</tier>

* (Perhaps apply image processing or OCR to create some intermediate forms to focus and support manual transcription.)

* Fill in the paragraphs (here as Roman character glyphs encoded as ASCII, but use your own charset and editor/text-entry method):

<tier>
[#p1]Om puurnam adah puurnam idam
[#p2]puurnaat puurnam udachyate
[#p3]purnasya puurnam aadaayaa
[#p4]purnam eva vashishyate
[#p5]Om shaanti shaanti shaanti
[p5#]
</tier>

* Segment into "words":

<tier>
[#p1][#w1]Om[#w2]puurnam[#w3]adah[#w4]puurnam[#w5]idam[w5#]
[#p2][#w6]puurnaat[#w7]puurnam[#w8]udachyate[w8#]
[#p3][#w9]purnasya[#w10]puurnam[#w11]aadaayaa[w11#]
[#p4][#w12]purnam[#w13]eva[#w14]vashishyate[w14#]
[#p5][#w15]Om[#w16]shaanti[#w17]shaanti[#w18]shaanti[w18#]
[p5#]
</tier>

* Segment inflectional morphemes:

<tier>
[#p1][#w1]Om[#w2]puurn[#m1]am[#w3]adah[#w4]puurn[#m2]am[#w5]idam[w5#]
[#p2][#w6]puurn[#m3]aat[#w7]puurn[#m4]am[#w8]udachya[#m5]te[w8#]
[#p3][#w9]purn[#m6]asya[#w10]puurn[#m7]am[#w11]aat[#m8]aayaa[w11#]
[#p4][#w12]purn[#m9]am[#w13]eva[#w14]vashishya[#m10]te[w14#]
[#p5][#w15]Om[#w16]shaanti[#w17]shaanti[#w18]shaanti[w18#]
[p5#]
</tier>

* Enter dictionary entries (from Sanskrit, the language of the source, to English):

om -> om
puurn# -> whole, complete, perfect
#am -> nom.sg.
#aat -> ablative
#asya -> genitive
#aayaa -> subjunctive
adah -> that
idam -> this
eva -> only
vashishi -> remain
#ate -> present
udachi -> arise
shaanti -> peace

* Populate the morpheme translation tier automatically from the dictionary (a sketch of this step follows the list):

<tier>
[#p1][#w1]Om[#w2]whole,complete,perfect[#m1]nom.sg. [#w3]that[#w4]whole,complete,perfect[#m2]nom.sg. [#w5]this[w5#]
[#p2][#w6]whole,complete,perfect[#m3]ablative [#w7]whole,complete,perfect[#m4]nom.sg. [#w8]arise[#m5]pres.[w8#]
[#p3][#w9]whole,complete,perfect[#m6]genitive [#w10]whole,complete,perfect[#m7]nom.sg.[#w11]abl.[#m8]subj.[w11#]
[#p4][#w12]whole,complete,perfect[#m9]nom.sg.[#w13]only [#w14]remain[#m10]pres.[w14#]
[#p5][#w15]Om[#w16]peace[#w17]peace[#w18]peace[w18#]
[p5#]
</tier>

* Manually select dictionary entries from the list given in a context. The display should show words with multiple entries in a highlighted form, with a menu representing the options, making it easier for the transcriber to select the preferred option:

<tier>
[#p1][#w1]Om[#w2]perfect[#m1]nom.sg. [#w3]that[#w4]perfect[#m2]nom.sg. [#w5]this[w5#]
[#p2][#w6]perfect[#m3]ablative [#w7]perfect[#m4]nom.sg. [#w8]arise[#m5]pres.[w8#]
[#p3][#w9]perfect[#m6]genitive [#w10]perfect[#m7]nom.sg.[#w11]abl.[#m8]subj.[w11#]
[#p4][#w12]perfect[#m9]nom.sg.[#w13]only [#w14]remain[#m10]pres.[w14#]
[#p5][#w15]Om[#w16]peace[#w17]peace[#w18]peace[w18#]
[p5#]
</tier>

* Manually translate from the translated morphemes to English phrasing:

<tier>
[#p1][#w1]Om.[#w2]That is perfect. [#w4]This is perfect[w5#]
[#p2][#w6]From the perfect [#w7]The perfect arises[w8#]
[#p3][#w9]From the perfect [#w10]If the perfect is taken[w11#]
[#p4][#w12]The perfect, only, remains[w14#]
[#p5][#w15]Om! [#w16]Peace! [#w17]Peace! [#w18]Peace![w18#]
[p5#]
</tier>

* Automatic editing procedures (such as emacs macros or eLisp functions) should be made available and easily invoked to:
* construct, or add to, the dictionary with the words not presently found therein.
* carry out segmentation of words, inflectional morphemes, etc., using some expanding/trainable ruleset, into a new tier.
* copy a tier to be a new tier (pick from an inventory, enter a new tier name).
* substitute within a tier per dictionary mappings.
* enable text editing: click to select a segment, control-+ to expand the selection to include the next segment, type to replace the selection with new text.

* Multi-tier editorial display should be provided, to see other tiers while editing a tier.

* Presentation for learning may be computer controlled, based on a model of the reader/learner's knowledge, or manually parameterized.
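The automatic dictionary-population step above is, at bottom, a per-segment substitution. A toy sketch (my shapes for the dictionary and the tier; affix-position marking is dropped for simplicity):

    glosses = {   # a few entries from the dictionary step above
        "om": "om", "puurn": "whole,complete,perfect", "am": "nom.sg.",
        "aat": "ablative", "asya": "genitive", "adah": "that",
        "idam": "this", "eva": "only", "shaanti": "peace",
    }

    def populate_translation_tier(segments, glosses):
        # segments: (boundary_name, form) pairs from the morpheme tier.
        # Unknown forms are flagged rather than silently dropped (the
        # highlighted-menu case described above).
        return [(b, glosses.get(form.lower(), "??%s??" % form))
                for b, form in segments]

    # populate_translation_tier([("#w2", "puurn"), ("#m1", "am")], glosses)
    # -> [("#w2", "whole,complete,perfect"), ("#m1", "nom.sg.")]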
Anyway, we now have data to support the learner. A map to the meanings of the grammatical encodings, like "abl" (ablative, 'away from') and "subj" (subjunctive, 'possibly'), should be a click away. A map to a concordance for any morpheme should be a click away. A map to an IPA reference, and to a pronunciation guide and script description, should be a click away. An audio format where the text is performed in a recording, with a bouncing-ball display, should be a click away. All this may be hidden and the image/video/audio media (dis-)played, with the display of all tiers, or of a parameterized, selected subset, a click away during playback, whenever the audience is puzzled and wants to understand the part they just heard but didn't understand.