----------------------------------------------------------------------
--------- TG Workflow, Document Segmentation and Markup --------------
----------------------------------------------------------------------

Consider, as a TG usage example, the workflow for an original document comprising, say, chapters, paragraphs, sentences, and words, where the goal is to create a TG that makes that higher-level structure explicit.

An original utf8 (and therefore also ASCII) text document is already a well-formed TGML file. TGML defaults apply when a utf8 (therefore also an ASCII) text file is read in: it becomes a single tier with a single arc linking the default start node to the default end node, with data type defaulting to ref:auto,charset:utf8, with inferred classes taking the file name as title and the file owner as author, and with the entire document content as the label of that arc. A system that reads and updates TGs can read it in directly.

The user would then copy that tier to, say, a words tier, and populate the words tier by splitting the arc on word boundaries (in English, usually space or punctuation), either manually, or by calling some function, or by first applying the function and then giving its output a manual correction review.

Next, another tier could be created for sentences, by copying and modifying the words tier, by adding sentence-boundary node names to a subset of the word-boundary nodes, or both. In either case the user would first apply a merge-words-into-sentences algorithm (e.g.: start at the beginning, tag the current point as the start of a sentence, move along until you find a period or question mark followed by whitespace followed by a capital letter, tag the end of the sentence at the punctuation, and repeat). In the copy/modify approach, each sentence's endpoints are the co-located existing nodes, but in the new sentence tier they are connected by a referring arc, which refers to the word tier's sequence of nodes and arcs between those same two endpoint nodes.
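To make the workflow concrete, here is a minimal sketch in Python. The data model is an assumption for illustration, not quoted from any TGML implementation: a tier is a list of arcs, an arc is a (start node, end node, label) triple, and node names are character offsets into the original document, so that co-located nodes on different tiers automatically share a name. The two splitting functions follow the boundary heuristics described above.

```python
import re

# Hypothetical minimal model (an assumption, not TGML itself): a tier is
# a list of arcs, an arc is a (start_node, end_node, label) triple, and
# node names are character offsets into the original document.

def document_tier(text):
    """A fresh utf8 document: one tier, one arc, whole text as its label."""
    return [(0, len(text), text)]

def split_words(tier):
    """Populate a words tier by splitting arcs on word boundaries
    (here: runs of word characters, or single punctuation marks)."""
    words = []
    for start, _end, label in tier:
        for m in re.finditer(r"\w+|[^\w\s]", label):
            words.append((start + m.start(), start + m.end(), m.group()))
    return words

def split_sentences(text):
    """Merge-words-into-sentences sketch: scan until a period or question
    mark followed by whitespace and a capital letter, end the sentence at
    the punctuation, skip the whitespace, and repeat."""
    arcs, start = [], 0
    for m in re.finditer(r"[.?](?=\s+[A-Z])", text):
        arcs.append((start, m.end(), text[start:m.end()]))
        rest = text[m.end():]
        start = m.end() + (len(rest) - len(rest.lstrip()))
    if start < len(text):
        arcs.append((start, len(text), text[start:]))
    return arcs
```

On "The cat sat. It purred." this yields a seven-arc word tier and a two-arc sentence tier whose endpoint nodes (0, 12, 13, 23) all coincide with word-boundary nodes, which is what lets a referring arc in a sentence tier span the word tier's nodes and arcs between those same two endpoints.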
In the node-labelling approach, the word-boundary nodes which are also sentence endpoints are additionally labelled as such (e.g.: with $N counting the sentence number, append ",#s$N" to the predecessor node's name and ",s$N#" to the successor node's name). Which approach is used might depend on whether you want to economize on node labels, and browse directly across the sentence tier to a single successor node to find the end of a sentence, or economize on tiers, and have to pick out which nodes represent sentence boundaries. The added-tier approach does not rule out also labelling nodes as sentence-boundary nodes at the same time, for the benefit of less search in end-of-sentence lookup, at the cost of more elaborate data in the TG structure. The same approach can mark up a TG into paragraphs, chapters, etc., by labelling nodes, adding tiers, or both.

An opposite direction of work is to split rather than merge arcs. Splitting an arc A between nodes #A and A# means replacing A by a sequence of two arcs separated by a new node B, together spanning from #A to A#, with the content of each new arc taken from a splitting of the content of A. Often there is a derivational morphological process to undo in going from surface word to underlying morpheme sequence, so split may also incorporate morphological analysis and transformation (de-derivation, or dictionary lookup), and may indeed be driven by that analysis, since some words on a tier, once analysed morphologically, will not be split.

Deprecated: a referring-arc approach could even be used here, by referring to the start of the corresponding arc on the referenced tier and giving counts for the first and last characters of the referenced arc to define the content of the referring arc.
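Setting the referring-arc variant aside for a moment, the plain split operation itself can be sketched in the same toy Python model (arcs as (start node, end node, label) triples; the "-ment" rule below is a deliberately crude stand-in for real morphological analysis):

```python
# Continuing the toy model: split_arc replaces arc A between #A and A#
# by two arcs joined at a new intermediate node.  Node names are
# character offsets, so the new node's name is derived from the split
# offset -- a simplification that an underlying-form split would relax.

def split_arc(tier, index, offset, parts=None):
    """Replace tier[index] = (a, b, label) by (a, m, left) and (m, b, right).
    `parts` may supply modified contents, e.g. a more underlying form."""
    a, b, label = tier[index]
    m = a + offset
    left, right = parts if parts else (label[:offset], label[offset:])
    tier[index:index + 1] = [(a, m, left), (m, b, right)]
    return tier

def de_derive(tier):
    """A deliberately crude analysis-driven splitter: words ending in
    "-ment" are split into stem + suffix; other arcs are left alone."""
    out = list(tier)
    i = 0
    while i < len(out):
        label = out[i][2]
        if label.endswith("ment") and len(label) > 4:
            split_arc(out, i, len(label) - 4)
            i += 2
        else:
            i += 1
    return out
```

Note how the analysis drives the splitting: "treatment" gains an internal node, while "treat" passes through unsplit.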
Deprecated because a tier of split forms should show which content lies on which side of the new split points, and sometimes there are modifications to the character sequence; having this information given directly in the arcs saves potentially complex, bug-prone, and inscrutable reference spaghetti, so I prefer copy/split/modify over concoct-by-reference.

Up and down the chunking hierarchy via merge and split having been covered, consider now also going out into other dimensions, toward media types other than direct text, such as audio or video. Simply enough: an annotator or segmenter associates *arcs* with time spans in the audio or video media. (Note: one could imagine putting the times into the nodes, but times are associated with a particular tier, not with the time-unspecified abstraction of a tier-shareable segmentation point. For example, in a different audio recording, say of a different actor playing the same character with the same words, the same node will separate the same words in both recordings, but will correspond to different times in each. Node names are not the place for such bookkeeping; arc contents are.)

Different languages are also examples of such other dimensions: they maintain the same node structure, or at least some part of it, while changing the arc contents and, of course, the tier type which explains what those contents are. Go in one direction from L2 to L1, or in another direction to L3, etc., by translating arcs on a new tier. Since TGs can reside in multiple files (so far we are requiring at least one tier per file), which can include other TGs by reference, it follows that an L3 TG tier file that includes the L1 markup by reference can be read into a data structure from which all the L1 content can be read off directly, corresponding arc by corresponding arc, for translation comparison, for concordance extraction, for learners to examine, etc.
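The point about times belonging in arcs rather than nodes can be illustrated in the same toy model (the file names and times below are invented for illustration): two audio tiers for two recordings share the tier-neutral node names, while each arc's content carries that tier's own media reference and time span.

```python
# Two recordings of the same words.  The shared node names (character
# offsets 0, 3, 7 for "The cat") separate the same words in both
# recordings; the times live in each tier's arc contents, not in the
# tier-shareable nodes.
audio_a = [(0, 3, {"media": "actorA.wav", "span": (0.00, 0.31)}),
           (3, 7, {"media": "actorA.wav", "span": (0.31, 0.78)})]
audio_b = [(0, 3, {"media": "actorB.wav", "span": (0.00, 0.42)}),
           (3, 7, {"media": "actorB.wav", "span": (0.42, 1.05)})]

def times_at(tier, start_node, end_node):
    """Time span covered by the arc between two shared nodes, if any."""
    for a, b, content in tier:
        if (a, b) == (start_node, end_node):
            return content["span"]
    return None
```

The same pair of nodes (3, 7) answers with a different time span on each tier, which is exactly why the bookkeeping cannot live in the node names.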
Since each language tier can separately include by reference the original L1 markup, or any other compatible tier of the document (not even requiring that one of them be the original reference document), any pattern of mutual inclusion imaginable can be implemented simply by loading some starting file, then the files it refers to, recursively until done. Consistency is the user's job, hopefully aided by software checks.

Thus the linguist may copy a word tier to a subword tier, processing the word arcs into sub-word arcs by adding appropriate nodes and/or node names, splitting, and perhaps modifying toward a more underlying form, the word content to produce the sub-word arc contents. It is a little more work to assert that arc contents may be sequences of identifiers pointing into a dictionary of words, morphemes, lemmas, word types, etc. Each would be a new tier with a suitable tier type to specify what kind of data it is.

A segment tier may also be found useful, where individual phonemes or other hopefully theoretically well-defined units might be specified as the content of the arcs of the tier. While phones may be enumerated within arc contents as a sequence of identifiers (say, numeric ids, or bit patterns referencing a phonological feature set), each to be looked up in some reference such as a phonological inventory for a language, we have also drilled down now to the point where the character set, utf8, itself contains the entire IPA as well as every other phonological and orthographic symbol and diacritic known, so that the character representing the data is, in a way, itself the thing represented, rather than a reference to something in a table somewhere, modulo the usual linguistic type/token distinction. Yes, a copy of the type is indeed a token, and certainly the types deserve a separate place, perhaps a table with some discussion of local phonetics, etc.,
where teaching and study of those types as a category set can go on, rather than the types being found only sprayed throughout a document in limitless tokens. Because linguists like to generalize.

With tier copying and the ability to split, merge, look up, and transform arcs, whether starting from an arbitrary original text document or from any pre-existing TG document, a linguist can mark it up into any number of levels of chunking or segmentation, such as book, part, chapter, paragraph, sentence, word, morpheme, phoneme, or whatever categories are desired and useful. I was trained that words are a surface form, that inflections like -s and -ed are peeled off at a deeper layer, and that derivational morphemes like re- or -ment are peeled off at a yet deeper layer, leaving the true underlying forms. Another hierarchy uses word tokens, word types, and word lemmas.

Transliterating between orthographies, or moving from orthographic to phonological form, is simply another tier transformation or lookup process, applying the transformation arc by arc along the input tier to produce an output tier. In any case, each tier having its own tier type enables TG-empowered software to interpret, count, summarize, and present the data appropriately.
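An arc-by-arc tier transformation reads naturally as a map over arcs in the toy model used above. The digraph table here is an invented stand-in for a real orthographic-to-phonemic mapping, not any standard transliteration scheme; note that the utf8 labels can hold IPA symbols directly.

```python
# Arc-by-arc tier transformation: the node structure is kept, only the
# contents change, along with the (implied) tier type.  The digraph
# table is a toy stand-in, not a real transliteration standard.
DIGRAPHS = {"sh": "ʃ", "ch": "tʃ", "th": "θ"}

def to_phonemic(word):
    """Toy orthographic-to-phonemic rewrite of one arc's content."""
    for digraph, symbol in DIGRAPHS.items():
        word = word.replace(digraph, symbol)
    return word

def transform_tier(tier, fn):
    """Derive an output tier by applying fn along the input tier."""
    return [(a, b, fn(label)) for a, b, label in tier]
```

Any per-arc function slots in the same way, so translation, lemma lookup, and transliteration are all instances of the one tier-transformation pattern.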