Definitions for TGs: Draft 0.1

Let a Document be defined as a consistent segmentation and a document name.

Let a Segmentation be defined as an ordered set of named nodes/boundaries (convenient if sortable into order on the names).

Let a Tier include a subset of a Document's segmentation (including its edges, the first and last nodes thereof), and further let it specify content or material in some form within each of the tier's segments.

Thus: Content may change from tier to tier but the segmentation remains consistent. A SubDocument is a Document when its segmentation is consistent across all its Tiers.

Formally, a Translation Graph, "TG", or "document" comprises a header plus a set of one or more tiers. Each tier is an ordered sequence of contentless unique nodes separated by optionally contentful arcs. All tiers in a TG share a common start node and a common end node. A single node may occur in multiple tiers only if the node ordering in all tiers is consistent (that is, if node N precedes node M in one tier and both occur in another tier, then N cannot follow M in the second tier). A single TG may be defined in multiple files, each with a copy of the same header, and each file typically contains at least one entire tier.

This definition is made more explicit via BNF ::= rewrites and non-terminals as follows:

Format of (Text representation for) TGs:

A TG is represented using tags and tagged text. A tag is a string enclosed in <>'s; the string begins with a tag name and is followed as appropriate by further text. Tagged text is surrounded by tag pairs where the name of the second is the name of the first preceded by a slash as in .. .

1) An optional header within .. tags containing
   a) one or more types within .. tags including
      i) the type's name and
      ii) optionally an encoding for the labels of the given type within tags
   b) one or more classes within .. tags including
      i) the class name.
         This might be the name of the underlying document, which may have its layers stored separately in a variety of files. Class identity across files enables separate storage if desired while also providing for merging (or, you might think, zipping) files together into a multi-layer translation graph structure, subject to compatible node labelling and sequencing. Class could encode a classification richer even than mere hierarchy for document versioning, e.g.,
            Torah Old_Testament King_James
            Torah Old_Testament UnitarianVersion_21.8

   Note: types and classes absent in the header but found in the arcs within the TG are acceptable. A program that writes tidy TGs should add all arc types and classes to the header for completeness, but programs that read TGs should not expect that all will be there, since TGs may be written by hand, and new types/classes may be added by the writer/translator/editor.

   Note: The type for implicit arcs is "implicit" (an 8-character string). Implicit and explicit arcs must have different type names.

   Note: The class for unspecified arcs is "unspecified" (an 11-character string), unless specified in the header. To specify a class for implicit arcs in the header, prefix the class name with "implicit:" (a 9-character string); the prefix is removed before the class name is interpreted.

   Note: The encoding for implicit arcs is the name of the charset in which the document itself is encoded.

2) A body comprising
   a) optional tags .. at start and end of the body.
   b) zero or more explicit nodes each comprising a single tag
      The document is ill-formed if multiple nodes have the same $id. If a program encounters multiple nodes with the same $id, and all arcs between the first and last instances of that node are explicit, the program is authorized to simply delete all but one and continue processing; nothing incoherent is implied.
      However, if there is any "implicit arc" text between the identical nodes, behavior is undefined; programs ought to fail with a warning.
   c) zero or more explicit arcs each comprising or alternatively $contents. For example, an arc from node 1 to node 2 of type L1 and no specified class may be represented as contents or as .
   d) zero or more bits/bytes/characters of "implicit arc" contents.

   Note: if "implicit arc" contents occur before the first and/or after the last tag, then a begin node tag and/or an end node tag are implicitly considered to be present. A tidying program should make explicit such implicit begin/end nodes.

   Note: if there are no implicit arcs, then the sequential order in which node and arc tags occur in the body is immaterial, since the ordering of nodes and arcs along any path through the graph can be reconstructed as the implied sequence of node names from the arcs. Similarly, the arcs might be in separate files but can still be joined on the nodes shared between the files. Human readability will be enhanced if each file has its own type of data in it; a display program can then zip the files together onto a shared spine of nodes and show selected components graphically, audibly, and perhaps even in video. Annotations by a given editor can be in their own separate file. A TG-capable editing program could be made to store layers in their separate respective files while ensuring cross-compatibility of order and labelling of nodes.

   Note: if stored separately, each layer may more simply be written out using implicit arcs (since then linear ordering and arc endpoints need not be written explicitly but can be derived from the text between the node tags), thus reducing the tagging to just node tags and improving the file's readability. The key is node consistency between layers. This needs to be checked upon loading and merging multiple TG layers of a single class or document name, but can easily be ensured when writing to files.
   Where documents have different node labellings, a UI might be provided to control and supervise a zip-together operation, identifying nodes that correspond in the merging files.
   e) Contents is a text string encoded according to the encoding (charset) for arcs of a given type. Contents may be one of:
      i) a direct text representation of a linguistic unit. The type's encoding refers to the encoding of the directly included text data in this case.
      ii) a reference to another form of the linguistic unit in an external resource or file. Such a reference must provide enough information to extract and interpret the data within the context of the given arc. A global resource could be specified in the header in a future TG format version; for example, the filename of the corresponding live audio file. Local resources generally, and all resources in Version 1.0 compliant TGs, can be specified by a URL (if the URL specifies no method, consider it a filename path). In addition to the regular URL methods, include methods for database lookup and audio file subsegment extraction. The type's encoding refers to the encoding of the external data in this case.

      Note: formalize the db lookup method later, as it is used.

      Note: formalize the audio reference method later, as used. But "filename start_time end_time" sounds good; with the encoding for the arc's type being (e.g.) "audio:raw 16KHzPCM", this is interpretable by playback code.

      Note: formalize the audio/video reference method later, as used. But "filename start_time end_time" sounds good; with the encoding for the arc's type being (e.g.) "video:mpeg", this should be interpretable by playback code.

      Note: formalize dictionary lookup method later, as used.

      Note: formalize alternate-text reference method later, as used. But "filename start_char end_char" is a good start, assuming the referred-to file is unchanging.
      Or "TG_filename start_node end_node type" would also work, assuming the node names don't change and the type uniquely identifies a single arc.

----------------------------------------------------------------------
Discussion:
----------------------------------------------------------------------

A TG can usefully be thought of as a text thingy under iterative editing, translation, and refinement. We will here consider TGs as texts in a variety of stages of processing.

As discussed in tg.workflow.txt, at the first stage is the raw text of the document or other text unit. We have defined the TG file format so that a raw document is a valid TG file with implicit begin/end nodes and a single implicit arc whose contents are the entire document. This implicit arc's type and class are "implicit" and "unspecified", respectively, and the implicit encoding for the arc is the charset of the document.

At a second stage of processing, we might do any segmentation desired by inserting nodes to anchor segment ends between linguistic units such as paragraphs, sentences, lines, words, morphemes, etc. A node must include a unique node name or identifier (e.g., a numeric string). Implicitly the text between nodes is the unique label for that segment. A unit of tag-external text may be referred to as an "implicit arc".

At a third stage, for example, we can make the arcs explicit by replacing implicit arcs with explicit arc tags labelled with the same text. To make an arc explicit, tag it so with or text

At a fourth stage, we can translate arcs from one representation to another. To add another representation for a segment that is an explicit arc linking two nodes a and b, add another arc from a to b, with the added representation's type (e.g., "L2" if it is a translation into L2), class, text encoding (or put that in the header), etc., and the translated contents.

To illustrate, here are some different TGs derived from an original text document containing just the string "a b c".
   raw:               a b c                 [order matters]
   segmented:         a b c                 [order matters]
   explicit:                                [order doesn't matter]
   words translated:  < arc 3 4 L2 (C)>     [add anywhere]
   S's, words:                              [add anywhere]
   word wav's:        ...
   S wav's:           ...

Note that more than one audio resource may be referenced by different arcs, including isolated pronunciations of dictionary entries, full-text readings by one or various readers, real-live-recorded originals (if naturally recorded), etc. This is why these are not annotation graphs, which are designed for annotating single audio files and which can have times within the relevant audio file specified at the node. Rather, this type of graph annotates relationships among linguistic data. A text segment may have translation relationships to multiple audio file segments, for example, audio recordings of different actors reading the same line of Shakespeare. A TG handles this by referring to the audio file segment as an arc like any arc, with its appropriate (audio) type and with its contents specified by reference to the audio file and time endpoints.

Note that text outside of a tag is implicitly considered as a label for an arc between the preceding and following nodes. (In a draft of this spec, I allowed for "0" and "-1" as node names equivalent to "begin" and "end".) Assume that the begin and end nodes are implicitly inferrable if not explicitly present in a document (which may have no nodes at all). Thus a raw text document has two implicit nodes anchoring the ends of the whole document and is a well-formed multilinear text document.

---------------- TG Modes of Reference: ---------

Node names can be used as part of referring expressions to identify content substrings; each character has its offset in its tier's string, so that sequences can be referenced for translation within another tier as, tiername[i,j]=..., etc. In this way character offsets might be used as a method of cross-tier indexing for translation, indexation, interpretation, etc.
However the approach here uses such references substantially less than the Annotation Graph approach. In this Translation Graph approach the explicit structure provides a concrete anchoring that establishes reference across tiers, since a bit of content in a segment of this tier must necessarily be within the corresponding segment in every tier of the same document, by the requirement of consistency of segmentation. If the position of some bit of content within a Document can be seen directly in its presented ordering between node tags, then it is unnecessary to cross-refer with numerical indexes, for example to say that characters 12-16 of the other tier are a certain morpheme. Instead, a morpheme tier has segmentation boundaries between morphemes, which are consistent with the boundaries between phonemes, say, at another tier, and as such the alignment is directly visible.

An important task here has to do with database structure, and with populating and utilizing that structure, as in, for example, a method of transduction of database content from one user to another depending on the user's needs. Some of the translation tiers can be pulled out of the database.

----------------------------------------------------------------

Let a tier be represented by default as a computer file having a file name base and the filename extension .tier, including at the front and separate from its body a globally-unique document title, a tier name, and a hash of its content. Other tiers of the same document should have the same title, a different tier name and
                     tags are in plain ASCII;
                      *
                      * CONTENT is encoded in the named charset */
    URL="..."        /* URL/URI for this file */
    title="..."      /* globally unique, could be an ISBN or UPN;
                      * shared identically across tiers of the same document */
    tiername="..."   /* unique among representations of the titled document */
    parentURL=""     /* URL to file containing reference tier. Default: "" */
    boundary="..."   /* Regex of boundary marker: "#", " ", "\s+", "" */
    hash="..."
                     /* Auto-generate from CONTENT, or use to compare. */
    >CONTENT

Here the hash string is the hash of its CONTENT, ensuring a consistent baseline of data and therefore enabling a consistent treatment of its segmentation. Since changing any bit in the content modifies the hash, potentially leading to inconsistent segmentation, autogenerate it after safe changes, and check it when doing any operation that depends on it.

Before the document is locked, each tier's segments should have explicitly named node tags separating each content arc. Then segments can continually be split or combined or modified without ruining some externally-counted unique global ordering. Upon locking the document, immutability can be assumed, and the system can remove all the explicit node tags with their unique-within-tier names and rely on a generic boundary tag or symbol, combined with a naming or numbering convention.

Thus a lock and unlock pair of operations on a document would translate between two forms of the document. In the unlocked form, each tier's segments are separated by uniquely named nodes, and cross-tier reference is certain through using the correct node names. In the locked form, each tier's segments are separated by a generic boundary identifier, and other tiers can refer across tiers to a particular node in the locked tier identified by its convention-derived name or number.

A naming/numbering might be built up by the editor using a binary tree system: starting with the whole document as one segment, inserting the first node boundary automatically numbers the preceding segment 0 and the following segment 1, and each later split numbers the preceding part ...0 and the following part ...1. This gives a unique and ordered name to any segment, with the numbering reflecting the perhaps meaningless and forgettable time sequence of the editor's divisions of the document.

Given a succession of segments, a renaming could be done automatically via a decimal counting sequence as in: #0, #1, ...
#n-1, n-1# or via a hashed naming #4s8ulkj #98ulaki #kjh92n3 .. where there is no sort ordering on the names themselves; indeed sorting is not needed, since the content data (text) contains its own ordering.

Lock(separator,document) would go through the document and remove all the nodes not referred to in multiple tiers, replacing them with separator, the generic boundary marker. The nodes referred to in multiple tiers should be retained with a name so that each tier knows how to cross-reference a segment to segments in other tiers.

A hierarchy of immutable tiers can be represented in this way. A mostly implicit naming system for segments in tiers might be a dotted numbering system with parallel hierarchies. A top tier might be no more than the whole document in one segment. A tier hierarchy specified, for example, as TopTier.NextTier1.ThisTier could be used to interpret a dotted numerical reference for a segment at ThisTier such as #23.#300.#8291, meaning that the current segment #8291 within ThisTier is within segment #300 in NextTier1, which is within segment #23 in TopTier. Naming segments proceeds automatically within each tier from the beginning of the document to the end, starting from #0 and running up to #n-1,n-1#.

In this naming convention # represents the boundary, as is traditional in formal linguistic morphology, and content segments are numbered as arcs between nodes or boundaries. Nodes can then be named with # and a number, where the number identifies the (zero-based) arc and the # before or after it represents the boundary preceding or following that arc. In this convention, then, #A is the name of the boundary preceding A while A# is the boundary immediately after A. Dual naming of boundaries follows: #B = (B-1)# and B# = #(B+1).

Where agglutinative sequencing in morphology or higher levels applies, the above is true.
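The dual naming of boundaries can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names `boundary_index`, `same_boundary`, and `boundary_names` are hypothetical and not part of the TG format, and arc numbers are assumed to be plain decimal integers.

```python
import re

def boundary_index(name):
    """Map a boundary name to a 0-based boundary position (hypothetical helper).

    "#k" is the boundary before arc k (position k);
    "k#" is the boundary after arc k (position k + 1).
    """
    m = re.fullmatch(r"#(\d+)", name)
    if m:
        return int(m.group(1))
    m = re.fullmatch(r"(\d+)#", name)
    if m:
        return int(m.group(1)) + 1
    raise ValueError("not a boundary name: %r" % name)

def same_boundary(a, b):
    """True if two names denote the same boundary, e.g. "#3" and "2#"."""
    return boundary_index(a) == boundary_index(b)

def boundary_names(n_arcs):
    """List the default names #0, #1, ..., #n-1, n-1# for a tier of n arcs."""
    if n_arcs < 1:
        raise ValueError("a tier needs at least one arc")
    return ["#%d" % k for k in range(n_arcs)] + ["%d#" % (n_arcs - 1)]
```

Under these assumptions, `same_boundary("#3", "2#")` holds because both names resolve to boundary position 3, which is the identity #B = (B-1)# with B = 3.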
However, where morphemes overlap, the first segmental indication of a following morpheme might precede rather than follow the final segmental indication of a preceding morpheme. Then an abstract sequence #A#B# might appear, at a lower level of overlappingly influenced subsegments, as [#A]aaa[#B]baabbbaaa[A#]bbbbb[B#]. Here the subsegments influenced by #A# are bounded by #A and A#, and likewise those influenced by #B# are bounded by #B and B#.

For example, sandhi:

    [#1]iti[1#][#2]ahur[2#]     agglutinatively, abstract morphemes in order
    [#1]it[#2]y[1#]ahur[2#]     after sandhi, where /y/ is influenced by both.
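The bracketed overlap notation can be read mechanically. The following Python sketch is an assumption-laden illustration, not a defined part of the TG format: the names `MARK`, `morpheme_spans`, and `overlap` are hypothetical, and morpheme names are assumed to consist of letters or digits. It strips the boundary markers, records for each morpheme the character span it influences in the surface string, and reports where two spans overlap.

```python
import re

# Matches an opening marker like [#A] or a closing marker like [A#].
MARK = re.compile(r"\[(#\w+|\w+#)\]")

def morpheme_spans(marked):
    """Return (surface, spans) for a string in the bracketed notation.

    spans maps each morpheme name to (start, end) character offsets
    in the surface string, end exclusive.
    """
    pieces = []
    length = 0      # length of surface text seen so far
    opened = {}     # morpheme name -> offset of its opening boundary
    spans = {}
    pos = 0
    for m in MARK.finditer(marked):
        text = marked[pos:m.start()]
        pieces.append(text)
        length += len(text)
        tag = m.group(1)
        if tag.startswith("#"):
            opened[tag[1:]] = length
        else:
            name = tag[:-1]
            # Raises KeyError if a morpheme is closed without being opened.
            spans[name] = (opened.pop(name), length)
        pos = m.end()
    pieces.append(marked[pos:])
    return "".join(pieces), spans

def overlap(spans, a, b):
    """Return the (start, end) span shared by morphemes a and b, or None."""
    start = max(spans[a][0], spans[b][0])
    end = min(spans[a][1], spans[b][1])
    return (start, end) if start < end else None
```

Applied to the sandhi form `[#1]it[#2]y[1#]ahur[2#]`, this yields the surface string `ityahur` with morpheme 1 spanning offsets 0 to 3 and morpheme 2 spanning 2 to 7, so the shared span covers exactly the /y/. For the agglutinative form the two spans merely touch and `overlap` returns `None`.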