Definitions for TGs:
Draft 0.1
Let a Document be defined as a consistent segmentation, and a document
name.
Let a Segmentation be defined as an ordered set of named
nodes/boundaries (convenient if sortable into order on the names).
Let a Tier include a subset of a Document's segmentation (including
its edges, the first and last nodes thereof), and further let it
specify content or material in some form within each of the tier's
segments.
Thus:
Content may change from tier to tier but the segmentation remains
consistent.
A SubDocument is a Document when its segmentation is consistent
across all its Tiers.
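
As a minimal sketch of these definitions, assume a segmentation is
modelled as a Python list of node names in document order. The function
name tier_is_consistent is hypothetical and not part of this draft. It
checks that a tier keeps the Document's edges and uses a subset of its
nodes in the same order:

```python
def tier_is_consistent(document_nodes, tier_nodes):
    """Return True if tier_nodes is an order-preserving subset of
    document_nodes that keeps the document's first and last nodes."""
    if not document_nodes or not tier_nodes:
        return False
    # A tier must include the edges of the Document's segmentation.
    if tier_nodes[0] != document_nodes[0] or tier_nodes[-1] != document_nodes[-1]:
        return False
    # Node names must be unique within a tier.
    if len(set(tier_nodes)) != len(tier_nodes):
        return False
    # Every tier node must occur in the document, in the same relative order.
    remaining = iter(document_nodes)
    return all(node in remaining for node in tier_nodes)
```

For example, a tier ["n0", "n2", "n3"] is consistent with a Document
segmented as ["n0", "n1", "n2", "n3"], while ["n0", "n2", "n1", "n3"]
is not.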
Formally, a Translation Graph, "TG", or "document" comprises a header
plus a set of one or more tiers. Each tier is an ordered sequence of
contentless unique nodes separated by optionally contentful arcs. All
tiers in a TG share a common start node and a common end node. Single
nodes may occur in multiple tiers only if the node ordering in all
tiers is consistent (that is, if node N precedes node M in one tier
and both occur in another tier, then N cannot follow M in the second
tier). A single TG may be defined in multiple files, each with a copy
of the same header, and each file typically contains at least one
entire tier. This definition is made more explicit via BNF ::=
rewrites and non-terminals as follows:
Format of (Text representation for) TGs:
A TG is represented using tags and tagged text. A tag is a string
enclosed in <>'s, the string begins with a tag name and is followed as
appropriate by further text. Tagged text is surrounded by tag pairs
where the name of the second is the name of the first preceded by a
slash, as in <name> .. </name>.
1) An optional header within
.. tags containing
a) one or more types within .. tags including
i) the type's name and
ii) optionally an encoding for the labels of the given type within
tags
b) one or more classes within .. tags including
i) the class name. This might be the name of the underlying document
which may have its layers stored separately in a variety of files.
Class identity across files enables separate storage if
desired while also providing for merging (or, one might say,
zipping) files together into a multi-layer translation graph
structure, subject to compatible node labelling and
sequencing. Class could encode a classification richer even
than mere hierarchy for document versioning
e.g., Torah Old_Testament King_James
Torah Old_Testament UnitarianVersion_21.8
Note: types and classes absent in the header but found in the arcs
within the TG are acceptable; a program that writes tidy TGs
should add all arc types & classes to the header for completeness
but programs that read TGs should not expect that all will be there,
since TGs may be written by hand, and new types/classes added
by the writer/translator/editor.
Note: The type for implicit arcs is "implicit" (an 8-character
string). Implicit and explicit arcs must have different type(name)s.
Note: The class for unspecified arcs is "unspecified" (an 11-character
string), unless specified in the header. To specify a class for
implicit arcs in the header, prefix the class name with
"implicit:" (a 9-character string); the prefix is removed
before the class name is interpreted.
Note: The encoding for implicit arcs is the name of the charset
in which the document itself is encoded.
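
The three notes on implicit arcs can be sketched as one lookup. This is
an illustration only: the function name implicit_arc_defaults and the
representation of the header as a plain list of class names are
assumptions, not part of the format.

```python
IMPLICIT_TYPE = "implicit"        # type of every implicit arc
DEFAULT_CLASS = "unspecified"     # class when the header names none
IMPLICIT_PREFIX = "implicit:"     # header prefix marking the implicit-arc class


def implicit_arc_defaults(header_classes, document_charset):
    """Return (type, class, encoding) for implicit arcs.

    header_classes: class names listed in the header (possibly empty).
    document_charset: charset in which the document itself is encoded.
    """
    arc_class = DEFAULT_CLASS
    for name in header_classes:
        if name.startswith(IMPLICIT_PREFIX):
            # Strip the prefix before the class name is interpreted.
            arc_class = name[len(IMPLICIT_PREFIX):]
            break
    return IMPLICIT_TYPE, arc_class, document_charset
```

With an empty header the result is ("implicit", "unspecified", charset);
a header class "implicit:Torah" yields the class "Torah" instead.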
2) A body comprising
a) optional tags .. at start and end of the body.
b) zero or more explicit nodes each comprising a single tag
The document is ill-formed if multiple nodes have the same $id.
If a program encounters multiple nodes with the same $id,
and all arcs between the first and last instances of that node
are explicit, the program is authorized to simply delete all but
one, and continue processing; nothing incoherent is implied.
However, if there is any "implicit arc" text between the identical
nodes, behavior is undefined; programs ought to fail with a warning.
c) zero or more explicit arcs each comprising or alternatively $contents. For example, an arc from node 1 to node
2 of type L1 and no specified class may be represented as contents or as .
d) zero or more bits/bytes/characters of "implicit arc" contents.
Note: if "implicit arc" contents occur before the first and/or
after the last tag, then a begin node tag and/or an end node
tag are implicitly considered to be present. A tidying program
should make explicit such implicit begin/end nodes.
Note: if there are no implicit arcs, then sequential order of the
occurrence of node and arc tags in the body is immaterial, since
the ordering of nodes and arcs along any path through the graph can
be reconstructed as the implied sequence of node names from the
arcs. Similarly the arcs might be in separate files, but still
joinably reorientable to the nodes shared between the files. Human
readability will be enhanced if each file has its own type of data
in it, then a display program zips them together onto a shared
spine of nodes, and shows graphically and even audibly and perhaps
even in video some selected components. Annotations by a given
editor can be in their separate file. A TG-capable editing program
could be made to store layers to their separate respective files
while ensuring cross-compatibility of order and labelling of nodes.
Note: if stored separately, each layer may more simply be written
out using implicit arcs (since then linear ordering and arc
endpointing need not be written explicitly but can be derived from
the text between the node tags, thus reducing the explicitness of
the tagging to just node tags, and improving the file's
readability). Key is node consistency between layers. This needs
to be checked upon loading and merging multiple TG layers of a
single class or document name, but can easily be ensured when
writing to files. Where documents have different node labellings,
a UI might be provided to control and supervise a zip-together
operation, identifying nodes that correspond in the merging files.
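
One way to check node consistency when loading several layers is to
treat each layer's node sequence as ordering constraints and look for a
single spine that satisfies all of them. The sketch below assumes each
layer is given as a list of node names in order, and the name
merge_spine is hypothetical. It uses graphlib from the Python standard
library (3.9 or later). Note that it tests a slightly stronger condition
than pairwise agreement: that one shared ordering exists for all layers
together.

```python
from graphlib import CycleError, TopologicalSorter


def merge_spine(layers):
    """Zip several layers onto one shared spine of nodes.

    layers: dict mapping a layer name to its list of node names in order.
    Returns one node ordering compatible with every layer, or raises
    ValueError if the layers cannot share a single node order.
    """
    sorter = TopologicalSorter()
    for nodes in layers.values():
        if nodes:
            sorter.add(nodes[0])
        for before, after in zip(nodes, nodes[1:]):
            # 'after' may only appear once 'before' has appeared.
            sorter.add(after, before)
    try:
        return list(sorter.static_order())
    except CycleError as err:
        raise ValueError("layers disagree on node order") from err
```

A failure here is the point at which a UI for supervised zipping, as
described above, would take over.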
e) Contents is a text string encoded according to the encoding
(charset) for arcs of a given type. Contents may be one of:
i) a direct text representation of a linguistic unit.
The type's encoding refers to the encoding of the directly
included text data in this case.
ii) a reference to another form of the linguistic unit in an
external resource or file. Such a reference must provide
enough information to extract and interpret the data within
the context of the given arc. A global resource could be
specified in the header in a future TG format version; for
example, the filename of the corresponding live audio file.
Local resources generally and all resources in Version 1.0
compliant TGs can be specified by a URL (and if the URL
specifies no method, consider it a filename path). In
addition to regular URLs, methods are included for database
lookup and audio file subsegment extraction. The type's
encoding refers to the encoding of the external data in this
case.
Note: formalize the db lookup method later, as it is used.
Note: formalize the audio reference method later, as used.
But "filename start_time end_time" sounds good,
with the encoding for the arc's type being (e.g.)
"audio:raw 16KHzPCM", this is interpretable by
playback code.
Note: formalize the audio/video reference method later, as
used. But "filename start_time end_time" sounds good,
with the encoding for the arc's type being (e.g.)
"video:mpeg", this should be interpretable by playback
code.
Note: formalize dictionary lookup method later, as used.
Note: formalize alternate-text reference method later, as
used. But "filename start_char end_char" is a good
start, assuming the referred-to file is unchanging.
Or "TG_filename start_node end_node type" would also
work assuming the node names don't change and the type
uniquely identifies a single arc.
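
Until the reference methods are formalized, the proposed
"filename start_time end_time" form could be read as sketched below.
The names MediaRef and parse_media_ref are hypothetical, and treating
the times as decimal seconds is an assumption; the draft does not fix a
unit.

```python
from typing import NamedTuple


class MediaRef(NamedTuple):
    filename: str
    start_time: float
    end_time: float


def parse_media_ref(contents):
    """Parse arc contents of the form 'filename start_time end_time'.

    Splits from the right so that a filename containing spaces survives.
    Times are assumed to be seconds written as decimal numbers.
    """
    filename, start, end = contents.rsplit(None, 2)
    ref = MediaRef(filename, float(start), float(end))
    if ref.start_time > ref.end_time:
        raise ValueError("start_time must not exceed end_time")
    return ref
```

Playback code would then choose a decoder from the encoding of the
arc's type, for example "audio:raw 16KHzPCM".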
----------------------------------------------------------------------
Discussion:
----------------------------------------------------------------------
A TG can usefully be thought of as a text thingy under iterative
editing, translation, and refinement. We will here consider TGs as
texts in a variety of stages of processing.
As discussed in tg.workflow.txt, at the first stage is the raw text of
the document or other text unit. We have defined the TG file format so
that a raw document is a valid TG file with implicit begin/end nodes
and a single implicit arc with its contents being the entire document.
This implicit arc's type and class are "implicit" and "unspecified",
respectively, and the implicit encoding for the arc is the charset of
the document.
At a second stage of processing, we might do any segmentation desired
by inserting nodes to anchor segment ends between linguistic units
such as paragraphs, sentences, lines, words, morphemes, etc.
A node must include a unique node name or identifier (e.g., a numeric
string).
Implicitly the text between nodes is the unique label for that
segment. A unit of tag-external text may be referred to as an
"implicit arc".
At a third stage, for example, we can make the arcs explicit,
by replacing implicit arcs by explicit arc tags labelled with the
same text.
To make an arc explicit, tag it so with
or text
At a fourth stage, we can translate arcs from one representation to
another. To add another representation for a segment that is an
explicit arc linking two nodes a and b, add another arc from a to b,
with the added representation's type (e.g., "L2" if it is a
translation into L2), class, text encoding (or put that in the
header), etc., and the translated contents.
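
The third and fourth stages can be sketched on an in-memory model of
arcs. The Arc record and the two function names are hypothetical, and
the default type "L1" for a newly explicit arc is an assumption made
for the example.

```python
from typing import NamedTuple


class Arc(NamedTuple):
    start: str          # name of the node the arc leaves
    end: str            # name of the node the arc enters
    type: str           # e.g. "L1", "L2", or "implicit"
    arc_class: str      # "unspecified" unless given
    contents: str


def make_explicit(start, end, text, arc_type="L1"):
    """Stage 3: turn an implicit arc into an explicit one with the same text.
    Explicit arcs may not reuse the type name "implicit"."""
    if arc_type == "implicit":
        raise ValueError('explicit arcs need a type other than "implicit"')
    return Arc(start, end, arc_type, "unspecified", text)


def add_translation(arcs, source, arc_type, contents):
    """Stage 4: add a parallel arc between the same two nodes as source."""
    translated = Arc(source.start, source.end, arc_type, source.arc_class, contents)
    return arcs + [translated]
```

For instance, making the text "c" between nodes 3 and 4 explicit and
then adding an "L2" arc with contents "C" gives two arcs that share
both endpoints and differ in type and contents.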
To illustrate here are some different TGs derived from an original
text document containing just the string "a b c".
raw: a b c [order matters]
segmented: a b c [order matters]
explicit: [order doesn't matter]
words translated: < arc 3 4 L2 (C)>
[add anywhere]
S's, words: [add anywhere]
word wav's: ...
S wav's: ...
Note that more than one audio resource may be referenced by different
arcs including isolated pronunciations of dictionary entries,
full-text readings by one or various readers, real-live-recorded
originals (if naturally recorded), etc. This is why these are not
annotation graphs, which are designed for annotating single audio
files and which can have times within the relevant audio file
specified at the node. Rather, this type of graph annotates relationships
among linguistic data. A text segment may have some translation
relationships to multiple audio file segments, for example, audio
recordings of different actors reading the same line of Shakespeare.
A TG handles this by referring to the audio file segment as an arc
like any arc, with its appropriate (audio) type and contents specified
by reference to the audio file and time endpoints.
Note that text outside of a tag is implicitly considered as a label
for an arc between the preceding and following nodes.
(In a draft of this spec, I allowed for "0" and "-1" as node names
equivalent to "begin" and "end".)
Assume that begin and end nodes are implicitly inferrable if
not explicitly present in a document (which may have no nodes at all).
Thus a raw text document has two implicit nodes anchoring the ends of
the whole document and is a well-formed multilinear text document.
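
A reader for implicit arcs might work as sketched below. The sketch
assumes, purely for illustration, that a node tag is written <node ID>.
That syntax, the regular expression, and the name implicit_arcs are
assumptions rather than definitions from this draft.

```python
import re

NODE_TAG = re.compile(r"<node\s+([^<>\s]+)\s*>")  # assumed tag syntax


def implicit_arcs(body):
    """Split body text at node tags into (from_node, to_node, text) arcs.

    Text before the first tag starts at an implied "begin" node and text
    after the last tag ends at an implied "end" node.
    """
    arcs = []
    previous, position = "begin", 0
    for match in NODE_TAG.finditer(body):
        text = body[position:match.start()]
        if text:
            arcs.append((previous, match.group(1), text))
        previous, position = match.group(1), match.end()
    tail = body[position:]
    if tail:
        arcs.append((previous, "end", tail))
    return arcs
```

A raw document with no tags comes back as a single arc from "begin" to
"end" whose contents are the entire text.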
---------------- TG Modes of Reference: ---------
Node names can be used as part of referring expressions to identify
content substrings; each character has its offset in its tier's
string, so that sequences can be referenced for translation within
another tier as, tiername[i,j]=..., etc. In this way character
offsets might be used as a method of cross-tier indexing for
translation, indexation, interpretation, etc. However the approach
here uses such references substantially less than the Annotation Graph
approach; in this Translation Graph approach the explicit structure
provides a concrete anchoring that establishes reference across tiers,
since a bit of content in a segment of this tier must necessarily be
within the corresponding segment in every tier of the same document,
by the requirement of consistency of segmentation. If the position of
some bit of content within a Document can be seen directly in its
presented ordering between node tags, then it is unnecessary to
cross-refer with numerical indexes, for example to say that characters
12-16 of the other tier are a certain morpheme. Instead, a morpheme
tier has segmentation boundaries between morphemes, which are
consistent with the boundaries between phonemes, say, at another tier,
and as such the alignment is directly visible.
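
To illustrate alignment by shared boundaries rather than by character
offsets, the sketch below collects the segments of one tier that lie
inside a segment of another tier. It assumes the document's node names
are available in order as a list (the spine) and that a tier is a list
of (from_node, to_node, contents) triples; the name segments_within is
hypothetical.

```python
def segments_within(spine, fine_tier, start, end):
    """Return the contents of fine_tier segments lying between two nodes.

    spine: all node names of the document in order.
    fine_tier: list of (from_node, to_node, contents) for one tier.
    start, end: nodes bounding a segment of some other (coarser) tier.
    """
    rank = {name: index for index, name in enumerate(spine)}
    low, high = rank[start], rank[end]
    return [contents for a, b, contents in fine_tier
            if low <= rank[a] and rank[b] <= high]
```

Given a phoneme tier and a morpheme segment bounded by two of the same
nodes, the phonemes of that morpheme fall out without any numerical
index into the phoneme string.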
An important task here has to do with database structure and
populating and utilizing that structure, as in for example a method of
transduction of database content from one user to another depending on
user's needs. Some of the translation tiers can be pulled out of the database.
----------------------------------------------------------------
Let a tier be represented by default as a computer file having a file
name base and with filename extension .tier, including at the front
and separate from its body a globally-unique document title, a tier
name, and a hash of its content. Other tiers of the same document
should have the same title, a different tier name
and tags are in plain ASCII; *
* CONTENT is encoded in the named charset */
URL="..." /* URL/URI for this file */
title="..." /* globally unique, could be an ISBN or UPN;
* shared identically across tiers of the same document */
tiername="..." /* unique among representations of the titled document */
parentURL="" /* URL to file containing reference tier. Default: "" */
boundary="..." /* Regex of boundary marker: "#", " ", "\s+", "" */
hash="..." /* Auto-generate from CONTENT, or use to compare. */
>CONTENT
Here the hash string is the hash of its CONTENT, ensuring a consistent
baseline of data therefore enabling a consistent treatment of its
segmentation. Since changing any bit in the content modifies the hash,
potentially leading to inconsistent segmentation, autogenerate it after
safe changes, and check it when doing any operation that depends on it.
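
A minimal sketch of generating and checking the hash follows. SHA-256
and the function names are assumptions; the draft does not name a hash
algorithm. CONTENT is encoded in the tier's named charset before
hashing, so that the hash is taken over the same bytes stored in the
file.

```python
import hashlib


def content_hash(content, charset="utf-8"):
    """Hash CONTENT after encoding it in the tier's named charset.
    SHA-256 is an assumption; the draft does not name an algorithm."""
    return hashlib.sha256(content.encode(charset)).hexdigest()


def check_tier(content, stored_hash, charset="utf-8"):
    """Return True if CONTENT still matches the hash stored in the tier."""
    return content_hash(content, charset) == stored_hash
```

An editing program would call content_hash after a safe change and
check_tier before any operation that relies on the segmentation.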
Before the document is locked, each tier's segments should have
explicitly named node tags separating each content arc. Then segments
can continually be split or combined or modified without ruining
some externally-counted unique global ordering.
Upon locking the document, immutability can be assumed and the system
can remove all the explicit node tags with their unique-within-tier
names, and rely on a generic boundary tag or symbol, combined with a
naming or numbering convention.
Thus a lock and unlock pair of operations on a document would
translate between two forms of the document: in the unlocked form,
each tier's segments are separated by uniquely named nodes and
cross-tier reference is certain through using the correct node names.
In the locked form, each tier's segments are separated by a generic
boundary identifier, and other tiers can refer across nodes to a
particular node in the locked tier identified by its
convention-derived name or number.
A naming/numbering might be built up by the editor using a binary tree
system, where starting with a whole document as one segment, inserting
the first node boundary automatically numbers the preceding part as 0
and the following part as 1, and each later split numbers the
preceding part as ...0 and the following part as ...1 (appending to
the name of the segment being split), thus giving a unique and ordered name to any
segment, and the numbering system reflecting the perhaps meaningless
and forgettable time sequence of the editor's divisions of the
document.
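
The binary naming can be sketched as follows, assuming segment names
are strings of the digits 0 and 1 and the unsplit document carries the
empty name. The function name split_segment is hypothetical.

```python
def split_segment(names, index):
    """Split the segment at position index, returning the new name list.

    names: segment names in document order; the unsplit document is [""].
    The preceding part gets the old name plus "0", the following part "1".
    """
    old = names[index]
    return names[:index] + [old + "0", old + "1"] + names[index + 1:]
```

Starting from [""], splitting position 0 gives ["0", "1"]; splitting
position 1 of that gives ["0", "10", "11"]. Because no name is a prefix
of another, plain string sorting reproduces document order.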
Given a succession of segments, a renaming could be done automatically
via a decimal counting sequence as in: #0, #1, ... #n-1, n-1# or via a
hashed naming #4s8ulkj #98ulaki #kjh92n3 .. where there is no sort
ordering on the names themselves, indeed sorting is not needed since
the content data (text) contains its own ordering.
Lock(separator,document) would go through the document and remove all
the nodes not referred to in multiple tiers, replacing them with
separator, the generic boundary marker. The nodes referred to in
multiple tiers should be retained with a name so that each tier knows
how to cross-reference a segment to segments in other tiers.
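
A sketch of Lock under stated assumptions: the document is held as a
mapping from tier name to a token list, each token being either
("node", name) or ("text", contents), and node names are unique within
a tier. This token model is hypothetical and chosen only to make the
operation concrete.

```python
from collections import Counter


def lock(separator, document):
    """Replace nodes used by only one tier with a generic separator.

    document: dict mapping tier name to a list of tokens, each token
    either ("node", name) or ("text", contents).
    Nodes named in two or more tiers are kept so tiers can cross-refer.
    """
    uses = Counter(name
                   for tokens in document.values()
                   for kind, name in tokens if kind == "node")
    locked = {}
    for tier, tokens in document.items():
        locked[tier] = [
            ("separator", separator)
            if kind == "node" and uses[value] < 2 else (kind, value)
            for kind, value in tokens
        ]
    return locked
```

An Unlock operation would need the naming or numbering convention to
reassign names to the separators; it is not sketched here.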
A hierarchy of immutable tiers can be represented in this way. A
mostly implicit naming system for segments in tiers might be a dotted
numbering system with parallel hierarchies. A top tier might be no
more than the whole document in one segment. A tier hierarchy
specified, for example, as TopTier.NextTier1.ThisTier could be used to
interpret a dotted numerical reference for a segment at ThisTier such
as #23.#300.#8291, meaning that the current segment #8291 within
ThisTier is within segment #300 in NextTier1, which in turn is within
segment #23 in TopTier. Naming segments proceeds automatically from
beginning to end within each tier from the beginning of the document
starting from #0 and up to #n-1,n-1#.
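
A dotted reference can be derived mechanically once each tier's
boundaries are known. The sketch below assumes each tier is given as a
sorted list of boundary positions (for example character offsets) that
share the same first and last position; this representation and the
name dotted_reference are assumptions.

```python
from bisect import bisect_right


def dotted_reference(hierarchy, index):
    """Build a dotted reference such as "#0.#1.#3" for a segment.

    hierarchy: list of tiers from top to bottom, each a sorted list of
    boundary positions sharing the same first and last position.
    index: zero-based number of the segment within the bottom tier.
    """
    start = hierarchy[-1][index]          # where the target segment begins
    parts = []
    for boundaries in hierarchy:
        # The enclosing segment is the last one starting at or before start.
        parts.append("#%d" % (bisect_right(boundaries, start) - 1))
    return ".".join(parts)
```

With tiers [0, 100], [0, 40, 100] and [0, 10, 40, 70, 100], segment 3
of the bottom tier starts at 70 and is referenced as "#0.#1.#3".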
A naming convention with # representing the boundary as traditional in
formal linguistic morphology, and consistently with a numbering of
content segments as numbered arcs between nodes or boundaries,
provides that nodes can be named with # and a number, whereby the
number identifies the (zero-based) arc number and # before or after
represents the boundary preceding or following the selected arc. In
this convention, then, #A is the name of the boundary preceding A
while A# is the boundary immediately after A. Dual naming
of boundaries follows: #B = (B-1)# and B# = #(B+1).
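
The dual naming can be made concrete by mapping either form of a
boundary name to one number. The function name boundary_index is
hypothetical.

```python
import re

BOUNDARY = re.compile(r"^(#?)(\d+)(#?)$")


def boundary_index(name):
    """Map a boundary name to its zero-based position among boundaries.

    "#A" is the boundary before arc A (position A); "A#" is the boundary
    after arc A (position A + 1), so "#B" and "(B-1)#" give the same value.
    """
    match = BOUNDARY.match(name)
    if not match or (match.group(1) == "#") == (match.group(3) == "#"):
        raise ValueError("expected '#N' or 'N#', got %r" % name)
    number = int(match.group(2))
    return number if match.group(1) else number + 1
```

Both "#3" and "2#" map to position 3, which is the equality
#B = (B-1)# for B = 3.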
Where agglutinative sequencing in morphology or higher levels applies,
the above is true. However, where morphemes overlap, the first
segmental indication of a following morpheme might precede rather than
follow the final segmental indication of a preceding morpheme. Then an
abstract sequence #A#B# might appear at a lower level of overlappingly
influenced subsegments as [#A]aaa[#B]baabbbaaa[A#]bbbbb[B#]. Here the
subsegments influenced by #A# are bounded by #A and A#, ditto #B#.
For example, sandhi:
[#1]iti[1#][#2]ahur[2#] agglutinatively, abstract morphemes in order
[#1]it[#2]y[1#]ahur[2#] after sandhi, where /y/ is influenced by both.
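
The bracketed notation can be read back into per-morpheme surface
spans, overlaps included. The sketch assumes marks are written exactly
as [#N] and [N#] with numeric N; the name morpheme_spans is
hypothetical.

```python
import re

MARK = re.compile(r"\[(#\d+|\d+#)\]")


def morpheme_spans(marked):
    """Return {morpheme: surface substring} from bracketed boundary marks.

    "[#N]" opens morpheme N and "[N#]" closes it; spans may overlap.
    """
    surface, opened, spans = "", {}, {}
    position = 0
    for match in MARK.finditer(marked):
        surface += marked[position:match.start()]
        position = match.end()
        mark = match.group(1)
        if mark.startswith("#"):
            opened[mark[1:]] = len(surface)      # remember where N begins
        else:
            name = mark[:-1]
            spans[name] = surface[opened.pop(name):]
    return spans
```

On the post-sandhi string above, morpheme 1 spans "ity" and morpheme 2
spans "yahur", so the shared /y/ appears in both.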