Structured text

From Meta, a Wikimedia project coordination wiki
This is a proposal for a new Wikimedia sister project.
structured text
no logo
Status of the proposal
Status: under discussion
Details of the proposal
Project description: use structured text content, in the form of binary trees, instead of traditional text
Is it a multilingual wiki? yes, many language versions
Potential number of languages: many
Proposed tagline: none
Proposed URL: none
Technical requirements
New features to require: something like Wikifunctions. A database/dictionary of morphemes and constructions and their features, like Wikidata or Wiktionary. Optional tools: a GUI binary tree editor, to easily rearrange binary tree branches by clicks or drag-and-drop, and a viewer; and/or a text editor that runs in a web browser and supports indentation (tabs); a tool to convert from a simpler text format to XML and vice versa; a tool to divide words into morphemes. Machine learning algorithms could be trained on user input and then used to automatically translate traditional "linear" text into binary trees. ML could also be used with parallel trees in different languages to translate.
Development wiki: no
Interested participants
List of project participants


Overview

Binary tree structure of texts

I think that, probably, any sentence of any natural language is semantically a binary tree, like this:

(
	(
		(San Francisco)
		(
			be
			(
				the
				(
					(
						(
							(
								(culture al)
								(
									,
									(commerce ial)
								)
							)
							(
								,
								(
									and
									(finance ial)
								)
							)
						)
						center
					)
					(
						of
						(
							(North ern)
							California
						)
					)
				)
			)
		)
	)
	s
)
.

(
	(
		it 
		(
			be
			(
				(
					the
					(
						(four th)
						(	
							(
								(much est)
								(populous city)
							)
							(in California)
						)
					)
				)
				(
					,
					(
						after
						(
							(
								(Los Angeles)
								(
									,
									(San Diego)
								)
							)
							(
								and
								(San Jose)
							)
						)
					)
				)
			)
		)
	)
	s
)
.
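
As a rough illustration of how the parenthesized notation above could be read by a program, here is a minimal Python sketch. The function names `parse_tree` and `is_binary` are hypothetical, and the sketch assumes that tokens are separated by whitespace or parentheses; it does not handle quoted tokens.

```python
import re

def parse_tree(text):
    """Parse '(a (b c))' style text into nested Python lists of strings."""
    tokens = re.findall(r"[()]|[^\s()]+", text)
    pos = 0

    def parse_node():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            children = []
            while tokens[pos] != ")":
                children.append(parse_node())
            pos += 1  # skip the closing ")"
            return children
        if token == ")":
            raise ValueError("unexpected ')'")
        return token  # a leaf (word or morpheme)

    node = parse_node()
    if pos != len(tokens):
        raise ValueError("text continues after the first tree")
    return node

def is_binary(node):
    """True if every group in the tree has exactly 2 children."""
    if isinstance(node, str):
        return True
    return len(node) == 2 and all(is_binary(child) for child in node)

tree = parse_tree("((most (populous city)) (in California))")
print(tree)
print(is_binary(tree))
```

A check like `is_binary` could be used by an editor to warn about groups that still need to be split into pairs.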

Synonymic structures

Some parts of a text can be represented as binary trees in several ways. For example:

((San Francisco) ((be X) s))

(((San Francisco) (be X)) s)

fourth
(	
	(
		most
		(populous city)
	)
	(in California)
)

(
fourth
most
)
(	
	
	(populous city)
	(in California)
)

(
	fourth
	(	
		(
			most
			(populous city)
		)
	)
)
(in California)

fourth
(	
	(
		(
			most
			populous
		)
		city
	)
	(in California)
)


("fourth" and "most" are not split into morphemes here, so that the main point of this example is easier to see.)

Ways to use this

It is possible to use a format like this for every language and use functions to transform it into the usual form of that language. Speech synthesis can also be done better using the parentheses, and these trees can be transformed from language to language.

Such content, text as a binary tree, should be a great resource for learning languages. It is even better if every element is clickable and opens Wiktionary/Wikipedia/Wikidata in a new background tab, or shows that information in a tooltip (pop-up balloon).

Binary structure of paragraphs and other texts

I think any paragraph can probably also be structured into a binary tree, like this:

(
	(
		(
			creating a new language is a huge effort, and only few people are going to know it.
			you have to discuss different limits of every word in it to come to some consensus...
		)
		(
			and all that work is just to create just another language no better, by its structure, than existing thousands natural languages.
			(lexicon can be bigger than of some languages).
		)
	)
	(
		(
			(
				what you should do instead is just use a format like this for every language and use functions to transform it to usual form of that language.
				also, speech synthesis can be done better using the parentheses.
			)
			also you can transform these formats from language to language.
		)
	)
)

(The regular sentences are intentionally not structured into binary trees in this example.) This structure can be useful for connecting sentences via pronouns more reliably. Different languages may have different limits and preferences for using one sentence versus several sentences with pronouns, and these parentheses may help to translate such places properly into other languages.

Below, in "#Order (mainness) of items in a pair, XML, JSON", I discuss the semantic order of the items in every pair. In the example above, I think the "٧" sign (meaning "greater than", written top-to-bottom) should be put in every pair, because in every pair the latter item is like an additional explanation of the former.

To demonstrate the binary tree concept further, here is a tree of the MediaWiki discussion signature "--QDinar (talk) 12:55, 17 August 2021 (UTC)":

(
	--
	(
		(
			(Q Dinar)
			("()" talk)
		)
		(
			(
				12
				(: 55)
			)
			(
				,
				(
					(
						(17 August)
						2021
					)
					(
						"()"
						(
							(U T)
							C
						)
					)
				)
			)
		)
	)
)

I think that probably not only natural language texts, but also constructed language and programming language texts, can be shown as binary trees.

Relation to discussion of Abstract Wikipedia

I first wrote this proposal (roughly the part above) in Talk:Abstract_Wikipedia. Since it is out of scope of that project, I submit it as a separate project.

Precomputed data, XML

The form of the tree shown above is not optimal for modification by a computer, because a program would have to descend deep into the subtrees before it could decide what to do with the current pair of branches. Important data can be precomputed and saved in every pair. So 2 items in every pair are not enough: additional items can be saved not as regular items but as properties, and XML can be used for that.

Order (mainness) of items in a pair, XML, JSON

The most important thing to store in every pair seems to be the type of the pair. A pair takes the type of either its first child or its second child. For example, ((most (populous city)) (in California)) has the type of its first child, (most (populous city)). In turn, (most (populous city)) has the type of its second child, (populous city), and (populous city) has the type of its second child, city. So "((most (populous city)) (in California))" can also be said to have the type "city". This can be written in XML like this:

<pair type="city" mainchild="1">
	<pair type="city" mainchild="2">
		<element>most</element>
		<pair type="city" mainchild="2">
			<element>populous</element>
			<element>city</element>
		</pair>
	</pair>
	<pair type="in" mainchild="1">
		<element>in</element>
		<element>California</element>
	</pair>
</pair>

.
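
The `type` and `mainchild` attributes in the XML above can be derived mechanically once the main child of every pair is known. The following Python sketch shows one possible way, using the standard `xml.etree.ElementTree` module. It assumes, purely for illustration, that a pair is stored as a list `[left, right, head]` where `head` is 0 or 1; the names `head_leaf` and `to_xml` are hypothetical.

```python
import xml.etree.ElementTree as ET

def head_leaf(node):
    """Follow the main child down to a single item; that item is the type."""
    while not isinstance(node, str):
        node = node[node[2]]
    return node

def to_xml(node):
    """Build <pair>/<element> nodes with precomputed type and mainchild."""
    if isinstance(node, str):
        element = ET.Element("element")
        element.text = node
        return element
    left, right, head = node
    # mainchild is written 1-based ("1" or "2"), as in the XML example
    pair = ET.Element("pair", {"type": head_leaf(node), "mainchild": str(head + 1)})
    pair.append(to_xml(left))
    pair.append(to_xml(right))
    return pair

tree = [["most", ["populous", "city", 1], 1], ["in", "California", 0], 0]
print(ET.tostring(to_xml(tree), encoding="unicode"))
```

The output is a single unindented line; it carries the same pairs, types and main-child marks as the hand-written XML.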

Manual marking of the main item in a pair can potentially be avoided if some program marks them automatically. Until such a program is ready, manual marking can be useful or required.

The proper, that is, standard and accepted, term for "main" is "head". (I have not replaced it throughout this text, but I am going to use "head" from here on, and I may edit the earlier text later.)

Showing headness in text

How can the headness of items in pairs be shown if formats simpler than XML are used and the language's positioning is retained? Some ideas:

*(
	*(
		most
		*(populous *city)
	)
	(*in California)
)

*( *( most *( populous *city ) ) ( *in California ) )

(
	(
		most
		٨
		(populous < city)
	)
	٧
	(in > California)
)

((most<(populous<city))>(in>California))

[
	[
		"most",
		["populous", "city", "1"],
		"1"
	],
	["in", "California", "0"],
	"0"
]

.
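
To show that these notations carry the same information, here is a small Python sketch that turns an array form into the one-line form with "<" and ">". It assumes pairs stored as `[left, right, head]` with an integer `head` of 0 or 1 (the array idea above uses the strings "0" and "1"); the function name `to_inline` is hypothetical.

```python
def to_inline(node):
    """Render a head-marked tree in the '(a<b)' / '(a>b)' notation.

    The sign opens towards the head: '>' if the left child is the head,
    '<' if the right child is the head.
    """
    if isinstance(node, str):
        return node
    left, right, head = node
    sign = ">" if head == 0 else "<"
    return "(" + to_inline(left) + sign + to_inline(right) + ")"

tree = [["most", ["populous", "city", 1], 1], ["in", "California", 0], 0]
print(to_inline(tree))  # ((most<(populous<city))>(in>California))
```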

Showing headness in GUI


            |
           / \
         /\ >  \
       /   \     \
     /   <   \     \
   /        / \     / \
most populous<city in>California

.

Ideas to try to make binary tree with positioning and headness easy for programmers

Using empty array items, where the 0th is the left head, the 1st the left subordinate, the 2nd the right head, and the 3rd the right subordinate:

[
	[
		,
		"most"
		,
		[,"populous","city",]
		,
	]
	,
	,
	,
	["in",,,"California"]
]

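Using references/pointers, where the 0th item is the left child, the 1st the right child, the 2nd a reference to the main child, and the 3rd a reference to the subordinate child. This is PHP; maybe it can be done more easily in some other language:

$a=[
	[
		"most",
		["populous","city"]
	],
	["in","California"]
];
// root pair: the head is the left child
$a[2]=&$a[0];
$a[3]=&$a[1];
// (in California): the head is "in"
$a[1][2]=&$a[1][0];
$a[1][3]=&$a[1][1];
// (most (populous city)): the head is the right child
$a[0][2]=&$a[0][1];
$a[0][3]=&$a[0][0];
// (populous city): the head is "city"
$a[0][1][2]=&$a[0][1][1];
$a[0][1][3]=&$a[0][1][0];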
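
In a language with object references, the same idea might be expressed without explicit reference assignments. The Python sketch below is only one possible shape: the class name `Pair` and its fields are hypothetical. It keeps the language's positioning in `left` and `right` and exposes the head and the subordinate as computed properties.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Pair:
    left: "Node"         # child in first position (language's order)
    right: "Node"        # child in second position
    head_is_left: bool   # which of the two children is the head

    @property
    def head(self):
        return self.left if self.head_is_left else self.right

    @property
    def subordinate(self):
        return self.right if self.head_is_left else self.left

Node = Union[str, Pair]

tree = Pair(
    Pair("most", Pair("populous", "city", head_is_left=False), head_is_left=False),
    Pair("in", "California", head_is_left=True),
    head_is_left=True,
)
print(tree.head.head.head)    # city
print(tree.subordinate.head)  # in
```

Because `head` and `subordinate` are computed from one flag, they cannot get out of sync with `left` and `right` when the tree is edited.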

Idea of getting rid of parentheses in linear text format

((most<(populous<city))>(in>California))

can be written in the following ways.

Showing only the nesting level, without parentheses:

most,,populous,city,,,in,California

most 2 populous 1 city 3 in 1 California

Showing nesting level and headness, without parentheses:

most<<populous<city>>>in>California

most<<populous<city>3>in>California
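
One way to read these parenthesis-free forms is that the number of signs between two neighbouring items equals the height of the smallest pair that contains both. Under that assumption (it matches the example, but it is an interpretation), the following Python sketch encodes and decodes the repeated-sign form. Pairs are assumed to be `[left, right, head]` lists with `head` 0 or 1, and the names `to_flat` and `from_flat` are hypothetical.

```python
import re

def _flatten(node):
    """Return (leaves, separators, height) for a head-marked tree."""
    if isinstance(node, str):
        return [node], [], 0
    left, right, head = node
    left_leaves, left_seps, left_height = _flatten(left)
    right_leaves, right_seps, right_height = _flatten(right)
    height = max(left_height, right_height) + 1
    sign = ">" if head == 0 else "<"
    return left_leaves + right_leaves, left_seps + [sign * height] + right_seps, height

def to_flat(node):
    leaves, seps, _ = _flatten(node)
    text = leaves[0]
    for sep, leaf in zip(seps, leaves[1:]):
        text += sep + leaf
    return text

def from_flat(text):
    parts = re.split(r"([<>]+)", text)
    leaves, seps = parts[0::2], parts[1::2]

    def build(lo, hi):
        # Rebuild the subtree that covers leaves[lo..hi] (inclusive).
        if lo == hi:
            return leaves[lo]
        # The longest separator in the range marks the top-level split.
        k = max(range(lo, hi), key=lambda i: len(seps[i]))
        head = 0 if seps[k][0] == ">" else 1
        return [build(lo, k), build(k + 1, hi), head]

    return build(0, len(leaves) - 1)

tree = [["most", ["populous", "city", 1], 1], ["in", "California", 0], 0]
print(to_flat(tree))  # most<<populous<city>>>in>California
print(from_flat(to_flat(tree)) == tree)
```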

Unusual word order (positioning)

It seems that which child is the main one is not always computable, because unusual word order can be used. (By "order" I mean "position" or "positioning" in this section.)

Examples of computable cases: in "language-independent", the main child is "independent"; in "independent language", the main child is "language". In the first case this is shown by the "-" mark of usual punctuation; in the second it follows from the usual word order, in which the adjective comes first and the thing described comes second. In "(universal time) coordinated" the word order is unusual, but "coordinated" can still be computed to be a describer/modifier rather than the main child, because of its "ed" suffix.

But what if such unusual word order (I think English speakers may know it as "French" word order) is used in English only rarely, for example in poems? And what if the modifier has no "ed" suffix but is a regular adjective? For example, what if it is possible to say "city populous" meaning just "populous city"? Then it becomes harder to compute automatically. This case is not very hard yet, because one word is an adjective and the other a noun. A harder case would be something like "phone book": with the usual English word order, "book" must be the main part, but what if the main part is really "phone", that is, the meaning is "book phone", something like "a phone like a book"? That would be unusual word order.

I do not remember seeing such unusual word order in English Wikipedia texts, but it may occur in languages other than English, and perhaps in quotations from poems in English Wikipedia.

3 equivalent forms of tree

So this binary tree is an ordered binary tree: one child is the main one, and the parent pair retains the type of the main child. Instead of showing that order with XML attributes, it can be shown by the positions of the children, but that makes the tree hard to understand for speakers of the language. Positions can be changed to "prepositional" ("Polish notation", "prefix notation"), where the main part is always first, or to "postpositional" ("reverse Polish notation"), where the main part is always second. In those 2 forms of the tree, XML attributes need to be used to save the precomputed usual position of the items in the ordinary form of the language. These 3 forms can be converted from one to another quite easily; that function could be built into the MediaWiki text editor, if it is useful.

An example of the 3 forms of the binary tree:

Language's positioning:

(
	(
		most
		(populous city)
	)
	(in California)
)

prepositional positioning:

(
	(
		(city populous)
		most
	)
	(in California)
)

postpositional positioning:

(
	(California in)
	(
		most
		(populous city)
	)
)

.
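
The conversion from the language's positioning to the other 2 forms can be sketched in a few lines of Python. The sketch assumes pairs stored as `[left, right, head]` with `head` 0 or 1, and the names `reposition` and `show` are hypothetical. For brevity it drops the original positions; a fuller version would keep them in an attribute, as described above.

```python
def reposition(node, head_first):
    """Reorder every pair so the head comes first (or last)."""
    if isinstance(node, str):
        return node
    left, right, head = node
    children = [reposition(left, head_first), reposition(right, head_first)]
    main, sub = children[head], children[1 - head]
    return [main, sub] if head_first else [sub, main]

def show(node):
    """Print a tree on one line with parentheses."""
    if isinstance(node, str):
        return node
    return "(" + " ".join(show(child) for child in node) + ")"

tree = [["most", ["populous", "city", 1], 1], ["in", "California", 0], 0]
print(show(reposition(tree, head_first=True)))   # (((city populous) most) (in California))
print(show(reposition(tree, head_first=False)))  # ((California in) (most (populous city)))
```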

Items equal by their grammatical mainness in a pair

In some languages, in some cases, the branch order in a pair may not be important (both the semantic order and the position may be swappable). Take something like "truck car" and "car truck". This is not a proper example, because in English the second item is the main one, but it shows the idea: 2 components may be very similar and swappable, and a language may use them in either order without distinguishing which is the main one. In such cases, either one can be selected as the main one. It is probably useful to mark such cases, for example with an attribute like "equal" in the XML tag of the pair (or in some equivalent way in a format other than XML).

XML, Wikidata

XML can also be used to write links to Wikidata, like this:

<pair type="in" mainchild="1">
	<element wikidata=".......">in</element>
	<element wikidata="Q99">California</element>
</pair>

.

Other ways to link to deeper main element

It may be useful to show the relative path to the deep main element, this way:

<pair type="city" mainchild="122">
	<pair type="city" mainchild="22">
		<element>most</element>
		<pair type="city" mainchild="2">
			<element>populous</element>
			<element>city</element>
		</pair>
	</pair>
	<pair type="in" mainchild="1">
		<element>in</element>
		<element>California</element>
	</pair>
</pair>

. (That is, for example, writing "122" instead of just "1".)
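
Such a path can be computed by following the main child from the pair down to a single item. A minimal Python sketch, assuming pairs stored as `[left, right, head]` with `head` 0 or 1 and using the hypothetical name `head_path`:

```python
def head_path(node):
    """Return the 1-based path from a pair to its deepest main element."""
    path = ""
    while not isinstance(node, str):
        path += str(node[2] + 1)  # "1" for the first child, "2" for the second
        node = node[node[2]]
    return path

tree = [["most", ["populous", "city", 1], 1], ["in", "California", 0], 0]
print(head_path(tree))     # 122
print(head_path(tree[0]))  # 22
print(head_path(tree[1]))  # 1
```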

The deep main element may also be referred to by a unique identifier.

Traditional punctuation by spaces is not always semantical

Traditional punctuation by spaces does not follow the semantic structure and should not be followed blindly. (It follows principles such as writing suffixes joined to their words.) For example, "frankly speaking" is not

(
	(frank ly)
	(speak ing)
)

but rather:

(
	(
		(frank ly)
		speak
	)
	ing
)

.


Lexicalised constructions, loanwords

In the example above I wrote "most" as "much est". Maybe it is better to write it just as "most" instead. In the mind of English speakers, "most" probably has the "st" suffix; they probably feel it. If "st" is used, its usage is more universal. If "most" is used, there are effectively 2 synonyms, "most" and "st". If the parts are written separately, it is still possible to show the result as "most" in an XML attribute of the pair/node. The downside of separating is that in every such case the tree has more complicated branches and uses more computer memory and more space in the user's viewer/editor. Such things can be shown collapsed to the user, so that is not a problem.

Not all lexicalised structures can be replaced with a single item. Some lexicalised constructions have places for putting items (subbranches) inside them. So I think it is easier to keep all separable things separate. This also makes their components more universal, and it may be useful for translation into other languages (if a construction has no translation in the dictionary, it is possible to fall back to its components).

Also, in the examples on this page, "Francisco" could be divided as "(franc isco)". I did not divide it, partly out of laziness and partly because the division may be a bit doubtful for some English speakers. Likewise, "Los Angeles" is (los (angel es)); I leave it undivided because the structure is probably not clear to many English speakers, and because the examples already show the main idea. But I think such things (loanwords) should be divided, and their structures should be shown. Geographical names usually do not have internal places for other text, it is OK not to know their meaning, and their structure is explained in dictionaries. So loanwords like these can be saved as a single item, and their structure can be shown by downloading it from the web or by getting it from the local machine. Also, in the examples on this page, "populous" could be divided as "(popul ous)".

Homonyms and sub-meanings

The "s" in the example above is written that way only for the purpose of showing the binary tree. Homonyms should be distinguished in some way, either with links to Wikidata, or with codes like "s1", "s2", or codes like "s-plural" or "s-pl", "s-3p-s-i-pr" or "s-verb". It is also possible to refer to sub-meanings of words, like after1, after2, perhaps using the meanings shown in Wiktionary. Sub-meanings can be referred to by XML tag attributes, etc.

In later examples on this page, I started to write "s" as "s-verb". I accused the Abstract Wikipedia project of creating a new language, but writing "s-verb" instead of "s" is also like creating a new language. I meant that it should be perceived the way Chinese and Japanese speakers perceive their characters. Still, I am afraid that if writing like "s-verb" is widely used, it can affect the spoken language. It can be written just as "s", which should still be understandable to speakers of the language, and the additional information can be written in XML attributes, or like [0,"s","verb"], where "0" means that the type of the array is element (a single item, not a pair). That information can be hidden in a GUI or removed in a simplified text format.

Virtual morphemes

In the example above, the dot and the comma are used just like morphemes. This is because, semantically, they have their own special role and change the meaning. In speech, they are present as intonation. Such things can be called "virtual morphemes". Different languages have further things of this kind.

More examples:

In Arabic, a word's vowels change (in nouns and verbs). That pattern of vowel change should be treated as such a virtual morpheme. For example, "kitab" is "book", and "kutub" is "books". It can be shown like (kitab u-u-plural).

Also, in English, "woke" is the past simple of "wake"; it can be shown as (wake past-simple). See also wikt:Appendix:English_irregular_verbs.

In Russian, a morpheme can be added to several components. For example, "bolsaya kniga" is "big book", which can be shown as ((bols knig) a-nominative-singular-feminine), and "bolsoy knige" is "to big book", which can be shown as ((bols knig) e-dative-singular-feminine). Those morphemes can also be named like aya-a-nominative-singular-feminine and oy-e-dative-singular-feminine.

Also, in Arabic, "kitabun kabirun" is "big book" and "kitabin kabirin" is "of big book". They can be written in the semantic binary tree like ((kitab kabir) un-nominative) and ((kitab kabir) in-genitive).

In English, the subject of a verb is shown by position. This can be represented by a virtual morpheme. For example, in "(((San Francisco) (be X)) s)", X is the object and (San Francisco) is the subject. They differ only by position (that is, there are no morphemes, such as suffixes or prepositions, near them to mark the difference). Maybe it is worth showing that with a virtual morpheme, like this:

(
	(
		((San Francisco) subject-position)
		(be X)
	)
	s-verb
)

.

Subtle positioning of morphemes

Examples:

In Indonesian, there are infixes, which are placed somewhere between the phonemes of a word. Though, as far as I can see from Wikipedia, they seem to be used only for creating new lexemes. Example: "gigi" is "tooth"; add the "er" infix and you get "gerigi", which is "toothed blade". That can be shown like (gigi er-infix).

In Arabic, a morpheme applied to a pair of morphemes may be put between them, like an infix: "kitabukum" means "your book" (nominative/subjective), and "kitabikum" means "of your book". The -u- and -i- morphemes, for nominative and genitive, are put between the morphemes for "book" and "your".

Also in English, "s-verb" in "((X (be Y)) s-verb)" takes the position immediately after "be" ("is" is "be" + "s").

Also in Russian, verb suffixes are put immediately after the verb, not after the objects of the verb.

In the last 3 examples, the applied morpheme, which has a "subtle" position, is the main item of its pair (it governs the object to which it applies), and it is attracted by position to the main element of that object.

Lists are subject to be written by lists instead of binary trees

A list shown as a list, instead of as a binary tree, is easier for a human to read and seems a little easier for a computer to process. So maybe it is worth using lists in that case.

That is,

(
	(
		(
			(Los Angeles)
			(
				,
				(San Diego)
			)
		)
		(
			,
			(San Jose)
		)
	)
	(
		and
		(San Francisco)
	)
)

can be replaced with something like

list(
	(Los Angeles)
	(San Diego)
	(San Jose)
	(San Francisco)
)

or

(
	list(
		(Los Angeles)
		(San Diego)
		(San Jose)
	)
	(
		and
		(San Francisco)
	)
)

.
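
Collecting the items of such a coordination chain into a list can be sketched as follows. The Python code assumes the tree is given as plain nested lists without head marks (as produced by a simple parser for the parenthesized notation) and that only "," and "and" act as connectors; the names `CONNECTORS` and `coordination_to_list` are hypothetical.

```python
CONNECTORS = (",", "and")

def coordination_to_list(node):
    """Flatten a left-nested chain like ((A (, B)) (and C)) into [A, B, C]."""
    if (isinstance(node, list) and len(node) == 2
            and isinstance(node[1], list) and len(node[1]) == 2
            and node[1][0] in CONNECTORS):
        # node is (rest (connector item)): keep the item, continue with the rest
        return coordination_to_list(node[0]) + [node[1][1]]
    return [node]

tree = [
    [
        [["Los", "Angeles"], [",", ["San", "Diego"]]],
        [",", ["San", "Jose"]],
    ],
    ["and", ["San", "Francisco"]],
]
print(coordination_to_list(tree))
```

This sketch discards the connectors themselves, so it corresponds to the first list form above; keeping "and" as a separate pair, as in the second form, would need a small extension.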

How to transform usual linear text into binary tree

The process of "parsing" a sentence into a binary tree may go like this:

1. San Francisco is the cultural, commercial, and financial center of Northern California.

2.

San Francisco is
(the cultural, commercial, and financial center of Northern California)
.

3.

San Francisco is
(the (cultural, commercial, and financial center) of Northern California)
.

4.

San Francisco is
(
	the
	((cultural, commercial, and financial) center)
	(of Northern California)
)
.

5.

San Francisco is
(
	the
	(
		(
			(cultural, commercial, and financial)
			center
		)
		(of Northern California)
	)
)
.

etc

There, bigger structures are marked sooner. Such an algorithm can be called "top-to-bottom parsing/structuring/processing" (in this example, it is not strictly top-to-bottom). The name comes from the fact that trees in linguistics and programming are traditionally drawn with the root on top, where "root" refers not to an underground root with branches but just to the point where the above-ground branching begins. This sequence seems easier to use with a text editor, which is reasonable: when the text is broken into lines sooner, it is easier to move through it.

Bottom-to-top processing looks like this:

1. San Francisco is the cultural, commercial, and financial center of Northern California.

2. (San Francisco) is the (culture al), (commerce ial), and (finance ial) center of (North ern) California.

3. (San Francisco) is the (culture al), (commerce ial), (and (finance ial)) center of ((North ern) California).

4. (San Francisco) is the (culture al)(, (commerce ial))(, (and (finance ial))) center (of ((North ern) California)).

5. (San Francisco) is the ((culture al)(, (commerce ial)))(, (and (finance ial))) center (of ((North ern) California)).

etc

GUI tool may help to build binary trees

If the top-to-bottom sequence is used, the user may select big blocks with the mouse cursor and separate them. If the bottom-to-top sequence is used, the user may drag and drop items onto one another, so that they join into pairs.

Drag-and-drop may work like this: if it is required to edit

           FLSI
         /    \
      FL        SI
    /  \      /   \
frank ly    speak ing

to

              FLSI
             /  \
           FLS   \
         /   \    \
      FL      \    \
    /  \       \    \
frank ly    speak   ing

, then "speak" could be moved (by drag-and-drop or 2 clicks) to the node "FL"; the program should automatically put it into a new pair with "FL", remove the "SI" node above "ing", and connect "ing" directly to the "FLSI" node. Alternatively, "ing" could be moved to "FLSI"; the program should then remove the "SI" node, rename "FLSI" to "FLS", connect "speak" directly (that is, without "SI") to "FLS", and create a new node "FLSI" above "FLS" and "ing".
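
On the data side, the edit described above amounts to what is usually called a tree rotation. A minimal Python sketch, assuming plain nested lists without head marks and using the hypothetical names `rotate_left` and `rotate_right`:

```python
def rotate_left(node):
    """(A (B C)) -> ((A B) C); the second child must itself be a pair."""
    a, (b, c) = node
    return [[a, b], c]

def rotate_right(node):
    """((A B) C) -> (A (B C)); the first child must itself be a pair."""
    (a, b), c = node
    return [a, [b, c]]

before = [["frank", "ly"], ["speak", "ing"]]
after = rotate_left(before)
print(after)  # [[['frank', 'ly'], 'speak'], 'ing']
```

A GUI could map either of the 2 drag-and-drop gestures to the same `rotate_left` call on the "FLSI" node.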

Before that (before the bottom-to-top sequence, or before either type of sequence, and also before other, "free" types of sequence), affixes (suffixes, prefixes, etc.) should be separated; a special tool may help to do that.

If a text editor is used, a simple format such as the one with parentheses may be used, and a special tool may automatically convert it into XML; conversely, in order to edit an XML document, a special tool may convert it into the simple format.

How Wikifunctions can be used

Example: to convert

(
	(
		(
			(commerce ial 1)
			(, (finance ial 1) 0)
			0
		)
		center
		1
	)
	.
	1
)

to usual text, the functions may be like this:

getusualtextrecursive function:
if the argument is a single item, return it as a string.
if the second item of the argument pair is "ial", call fuseialsuffix(1stitem) and return its result.
(else) call getusualtextrecursive(1stitem) and getusualtextrecursive(2nditem), and keep their return values.
if the main element of the second item is a comma, join these 2 values and return the result.
if the second item is a dot, join these 2 values, capitalize the first letter of the resulting string, and return the result.
(else) join these 2 values with a space between them and return the result.

fuseialsuffix function:
if the last letter of the argument is "e", remove it. append "ial". return the result.
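
The 2 functions described above can be sketched in Python as follows. This is only an illustration: pairs are assumed to be lists `[first, second, head]` with `head` 0 or 1 (matching the third item in the example tree), the Python names are hypothetical, and the first item of an "ial" pair is converted recursively before fusing, which is a small generalization of the description.

```python
def fuse_ial_suffix(stem):
    """Drop a final "e" and append "ial"."""
    if stem.endswith("e"):
        stem = stem[:-1]
    return stem + "ial"

def head_leaf(node):
    """Follow the main child down to a single item."""
    while not isinstance(node, str):
        node = node[node[2]]
    return node

def get_usual_text(node):
    if isinstance(node, str):
        return node
    first, second, _head = node
    if second == "ial":
        return fuse_ial_suffix(get_usual_text(first))
    left, right = get_usual_text(first), get_usual_text(second)
    if head_leaf(second) == ",":
        return left + right  # no space before a comma group
    if second == ".":
        joined = left + right
        return joined[0].upper() + joined[1:]
    return left + " " + right

tree = [
    [
        [["commerce", "ial", 1], [",", ["finance", "ial", 1], 0], 0],
        "center",
        1,
    ],
    ".",
    1,
]
print(get_usual_text(tree))  # Commercial, financial center.
```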

Many functions like "fuseialsuffix" are going to appear, and they can be written in Wikifunctions. For now, I understand Wikifunctions as functions that anybody can edit; I do not know much more about it.

Comment about -ial and -al, as they used in the examples

In the examples on this page I used both the "ial" and "al" morphemes. According to Wiktionary, they have almost the same origin (etymology). I suspect that "financial" is "finance al", where the final "e" of "finance" has turned into "i". There are 189 words with "ial" and 2664 with "al" in Wiktionary. Maybe it would be better to treat them as one and the same morpheme, "al", in this writing system and its functions.

Proposed by

User:Qdinar

Alternative names

binary tree text

Related projects/proposals

Abstract Wikipedia

Domain names

wikipedia.org, or it can be put into the MediaWiki CMS

Mailing list links

none

Demos

none

People interested