----------------------------------------------------------------------
--------- TG Workflow, Document Segmentation and Markup --------------
----------------------------------------------------------------------

Consider, as a TG usage example, the workflow for an original document comprising, say, chapters, paragraphs, sentences, and words, where the goal is to create a TG that makes that higher-level structure explicit.

An original utf8 (and therefore also ASCII) text document is already a well-formed TGML file. TGML defaults apply when a utf8 (therefore also an ASCII) text file is read in: it becomes a single tier with a single arc linking the default start node to the default end node, with data type defaulting to ref:auto,charset:utf8, with inferred classes taking the file name as title and the file owner as author, and with the entire document content as the label of that arc. A system that reads and updates TGs can read it in directly.

The user would then copy that tier to, say, a words tier, and populate the words tier by splitting the arc on word boundaries (in English, usually space or punctuation), either manually, or by calling some function, or by first applying the function and then giving its output a manual correction review.

Next, another tier could be created for sentences, by copying and modifying the words tier, by adding sentence-boundary node names to a subset of the word-boundary nodes, or both. In either case the user would first apply a merge-words-into-sentences algorithm (e.g.: start at the beginning, tag the current point as the start of a sentence, move along until you find a period or question mark followed by whitespace followed by a capital letter, tag the end of the sentence at the punctuation, and repeat). In the copy/modify approach, each sentence's endpoints are the co-located existing nodes, but in the new sentence tier they are connected by a referring arc, which refers to the word tier's sequence of nodes and arcs between those same two endpoint nodes.
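To make the workflow concrete, here is a minimal sketch in Python. The data model is an assumption for illustration, not quoted from any TGML implementation: a tier is a list of arcs, an arc is a (start node, end node, label) triple, and node names are character offsets into the original document, so that co-located nodes on different tiers automatically share a name. The two splitting functions follow the boundary heuristics described above.

```python
import re

# Hypothetical minimal model (an assumption, not TGML itself): a tier is
# a list of arcs, an arc is a (start_node, end_node, label) triple, and
# node names are character offsets into the original document.

def document_tier(text):
    """A fresh utf8 document: one tier, one arc, whole text as its label."""
    return [(0, len(text), text)]

def split_words(tier):
    """Populate a words tier by splitting arcs on word boundaries
    (here: runs of word characters, or single punctuation marks)."""
    words = []
    for start, _end, label in tier:
        for m in re.finditer(r"\w+|[^\w\s]", label):
            words.append((start + m.start(), start + m.end(), m.group()))
    return words

def split_sentences(text):
    """Merge-words-into-sentences sketch: scan until a period or question
    mark followed by whitespace and a capital letter, end the sentence at
    the punctuation, skip the whitespace, and repeat."""
    arcs, start = [], 0
    for m in re.finditer(r"[.?](?=\s+[A-Z])", text):
        arcs.append((start, m.end(), text[start:m.end()]))
        rest = text[m.end():]
        start = m.end() + (len(rest) - len(rest.lstrip()))
    if start < len(text):
        arcs.append((start, len(text), text[start:]))
    return arcs
```

On "The cat sat. It purred." this yields a seven-arc word tier and a two-arc sentence tier whose endpoint nodes (0, 12, 13, 23) all coincide with word-boundary nodes, which is what lets a referring arc in a sentence tier span the word tier's nodes and arcs between those same two endpoints.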
In the node-labelling approach, the word-boundary nodes which are also sentence endpoints are additionally labelled as such (e.g.: with $N counting the sentence number, append ",#s$N" to the predecessor node's name and ",s$N#" to the successor node's name). Which approach is used might depend on whether you want to economize on node labels, and browse directly across the sentence tier to a single successor node to find the end of a sentence, or economize on tiers, and have to pick out which nodes represent sentence boundaries. The added-tier approach does not rule out also labelling nodes as sentence-boundary nodes at the same time, for the benefit of less search in end-of-sentence lookup, at the cost of more elaborate data in the TG structure. The same approach can mark up a TG into paragraphs, chapters, etc., by labelling nodes, adding tiers, or both.

An opposite direction of work is to split rather than merge arcs. Splitting an arc A between nodes #A and A# means replacing A by a sequence of two arcs separated by a new node B, together spanning from #A to A#, with the content of each new arc taken from a splitting of the content of A. Often there is a derivational morphological process to undo in going from surface word to underlying morpheme sequence, so split may also incorporate morphological analysis and transformation (de-derivation, or dictionary lookup), and may indeed be driven by that analysis, since some words on a tier, once analysed morphologically, will not be split.

Deprecated: a referring-arc approach could even be used here, by referring to the start of the corresponding arc on the referenced tier and giving counts for the first and last characters of the referenced arc to define the content of the referring arc.
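Setting the referring-arc variant aside for a moment, the plain split operation itself can be sketched in the same toy Python model (arcs as (start node, end node, label) triples; the "-ment" rule below is a deliberately crude stand-in for real morphological analysis):

```python
# Continuing the toy model: split_arc replaces arc A between #A and A#
# by two arcs joined at a new intermediate node.  Node names are
# character offsets, so the new node's name is derived from the split
# offset -- a simplification that an underlying-form split would relax.

def split_arc(tier, index, offset, parts=None):
    """Replace tier[index] = (a, b, label) by (a, m, left) and (m, b, right).
    `parts` may supply modified contents, e.g. a more underlying form."""
    a, b, label = tier[index]
    m = a + offset
    left, right = parts if parts else (label[:offset], label[offset:])
    tier[index:index + 1] = [(a, m, left), (m, b, right)]
    return tier

def de_derive(tier):
    """A deliberately crude analysis-driven splitter: words ending in
    "-ment" are split into stem + suffix; other arcs are left alone."""
    out = list(tier)
    i = 0
    while i < len(out):
        label = out[i][2]
        if label.endswith("ment") and len(label) > 4:
            split_arc(out, i, len(label) - 4)
            i += 2
        else:
            i += 1
    return out
```

Note how the analysis drives the splitting: "treatment" gains an internal node, while "treat" passes through unsplit.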
Deprecated because a tier of split forms should show which content lies on which side of the new split points, and sometimes there are modifications to the character sequence; having this information given directly in the arcs saves potentially complex, bug-prone, and inscrutable reference spaghetti, so I prefer copy/split/modify over concoct-by-reference.

Up and down the chunking hierarchy via merge and split having been covered, consider now also going out into other dimensions, toward media types other than direct text, such as audio or video. Simply enough: an annotator or segmenter associates *arcs* with time spans in the audio or video media. (Note: one could imagine putting the times into the nodes, but times are associated with a particular tier, not with the time-unspecified abstraction of a tier-shareable segmentation point. For example, in a different audio recording, say of a different actor playing the same character with the same words, the same node will separate the same words in both recordings, but will correspond to different times in each. Node names are not the place for such bookkeeping; arc contents are.)

Different languages are also examples of such other dimensions: they maintain the same node structure, or at least some part of it, while changing the arc contents and, of course, the tier type which explains what those contents are. Go in one direction from L2 to L1, or in another direction to L3, etc., by translating arcs on a new tier. Since TGs can reside in multiple files (so far we are requiring at least one tier per file), which can include other TGs by reference, it follows that an L3 TG tier file that includes the L1 markup by reference can be read into a data structure from which all the L1 content can be read off directly, corresponding arc by corresponding arc, for translation comparison, for concordance extraction, for learners to examine, etc.
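The point about times belonging in arcs rather than nodes can be illustrated in the same toy model (the file names and times below are invented for illustration): two audio tiers for two recordings share the tier-neutral node names, while each arc's content carries that tier's own media reference and time span.

```python
# Two recordings of the same words.  The shared node names (character
# offsets 0, 3, 7 for "The cat") separate the same words in both
# recordings; the times live in each tier's arc contents, not in the
# tier-shareable nodes.
audio_a = [(0, 3, {"media": "actorA.wav", "span": (0.00, 0.31)}),
           (3, 7, {"media": "actorA.wav", "span": (0.31, 0.78)})]
audio_b = [(0, 3, {"media": "actorB.wav", "span": (0.00, 0.42)}),
           (3, 7, {"media": "actorB.wav", "span": (0.42, 1.05)})]

def times_at(tier, start_node, end_node):
    """Time span covered by the arc between two shared nodes, if any."""
    for a, b, content in tier:
        if (a, b) == (start_node, end_node):
            return content["span"]
    return None
```

The same pair of nodes (3, 7) answers with a different time span on each tier, which is exactly why the bookkeeping cannot live in the node names.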
Since each language tier can separately include by reference the original L1 markup, or any other compatible tier of the document (not even requiring that one of them be the original reference document), any pattern of mutual inclusion imaginable can be implemented simply by loading some starting file, then the files it refers to, recursively until done. Consistency is the user's job, hopefully aided by software checks.

Thus the linguist may copy a word tier to a subword tier, processing the word arcs into sub-word arcs by adding appropriate nodes and/or node names, splitting, and perhaps modifying toward a more underlying form, the word content to produce the sub-word arc contents. It is a little more work to assert that arc contents may be sequences of identifiers pointing into a dictionary of words, morphemes, lemmas, word types, etc. Each would be a new tier with a suitable tier type to specify what kind of data it is.

A segment tier may also be found useful, where individual phonemes or other hopefully theoretically well-defined units might be specified as the content of the arcs of the tier. While phones may be enumerated within arc contents as a sequence of identifiers (say, numeric ids, or bit patterns referencing a phonological feature set), each to be looked up in some reference such as a phonological inventory for a language, we have also drilled down now to the point where the character set, utf8, itself contains the entire IPA as well as every other phonological and orthographic symbol and diacritic known, so that the character representing the data is, in a way, itself the thing represented, rather than a reference to something in a table somewhere, modulo the usual linguistic type/token distinction. Yes, a copy of the type is indeed a token, and certainly the types deserve a separate place, perhaps a table with some discussion of local phonetics, etc.,
where teaching and study of those types as a category set can go on, rather than the types being found only sprayed throughout a document in limitless tokens. Because linguists like to generalize.

With tier copying and the ability to split, merge, look up, and transform arcs, whether starting from an arbitrary original text document or from any pre-existing TG document, a linguist can mark it up into any number of levels of chunking or segmentation, such as book, part, chapter, paragraph, sentence, word, morpheme, phoneme, or whatever categories are desired and useful. I was trained that words are a surface form, that inflections like -s and -ed are peeled off at a deeper layer, and that derivational morphemes like re- or -ment are peeled off at a yet deeper layer, leaving the true underlying forms. Another hierarchy uses word tokens, word types, and word lemmas.

Transliterating between orthographies, or moving from orthographic to phonological form, is simply another tier transformation or lookup process, applying the transformation arc by arc along the input tier to produce an output tier. In any case, each tier having its own tier type enables TG-empowered software to interpret, count, summarize, and present the data appropriately.
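An arc-by-arc tier transformation reads naturally as a map over arcs in the toy model used above. The digraph table here is an invented stand-in for a real orthographic-to-phonemic mapping, not any standard transliteration scheme; note that the utf8 labels can hold IPA symbols directly.

```python
# Arc-by-arc tier transformation: the node structure is kept, only the
# contents change, along with the (implied) tier type.  The digraph
# table is a toy stand-in, not a real transliteration standard.
DIGRAPHS = {"sh": "ʃ", "ch": "tʃ", "th": "θ"}

def to_phonemic(word):
    """Toy orthographic-to-phonemic rewrite of one arc's content."""
    for digraph, symbol in DIGRAPHS.items():
        word = word.replace(digraph, symbol)
    return word

def transform_tier(tier, fn):
    """Derive an output tier by applying fn along the input tier."""
    return [(a, b, fn(label)) for a, b, label in tier]
```

Any per-arc function slots in the same way, so translation, lemma lookup, and transliteration are all instances of the one tier-transformation pattern.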