Texvc

From Meta, a Wikimedia project coordination wiki

Warning: This page is largely out of date. texvc has been implemented for some time now; see TeX. See also Problems with texvc.


texvc is a proposal by Taw to implement a substantial portion of AMS-LaTeX in MediaWiki.

Most discussion of texvc so far has taken place on wikitech-l and on the test wiki. Arguably, it should take place here on Meta.

Description by the author

I have described it on en:Texvc, but very briefly.

My principles were completely different. They were (in order of importance):

  1. security
  2. multiformat output
  3. support for all common math
  4. internationalization
  5. ease of use, also without prior LaTeX knowledge

If we want high security, we must take a whitelist approach, not a blacklist one. If we want multiformat output, we must try to understand the markup as much as possible. That's why texvc parses it. It has a few quirks to make it easier to use and more internationalized. Two-way source compatibility with LaTeX is less important than any of the five points above.

Xypic would almost certainly break requirements 2 and 5, and it isn't that common, so I didn't implement it. But as long as it can be implemented securely and in an i18n-compatible manner, it's perfectly fine with me.

Toby's critique

My principles

  1. Math markup beyond the capabilities of current MediaWiki is not needed right now, although it is certainly desired, and we will definitely want to have it eventually. Thus, we have both opportunity and motive to take the time to make sure that we do it right, or at least that the right solution will be backwards compatible with anything that we do now.
  2. If our markup is patterned after LaTeX (or some other widely used standard), then the specification should be easily comprehensible to somebody familiar only with LaTeX and the code should be easily importable and exportable from other projects using LaTeX (such as PlanetMath).
  3. Whatever markup we use should be easily comprehensible to somebody familiar only with the rest of MediaWiki. Failing that, the difficult markup should be limited to complicated expressions that can't be done without it.
  4. The output should flow with the rest of our HTML output as well as practicable.

Please comment on the principles in this space (or in a completely different section) so as not to break their numbering.

Applications of the principles to texvc

The fundamental flaw of texvc, in my opinion, is that it attempts to be an ad hoc modification of LaTeX. In accordance with principle (2), texvc should be either not particularly like LaTeX or a clearly and simply defined subset or extension of LaTeX. As it is, it's not at all clear to the novice (nor to myself, and I've had some practice) whether a given LaTeX markup command will be supported by texvc or not. texvc in fact attempts to be its own markup language, on its own terms, only resembling LaTeX. For example, texvc's structure doesn't lend itself to the use of \big and its friends (much less to \left and \right), since these act on an immediately following delimiter (a concept that texvc does not recognise) rather than on what would normally be the argument to a LaTeX command.

Indeed, texvc only recognises those LaTeX commands in a specific list. If this is the sort of thing to be done, then it would be best if texvc input didn't look like it was LaTeX input (and the name would have to change too). It could still look to LaTeX to suggest syntax and notation. As an analogy, MediaWiki doesn't look like HTML. However, many HTML tags (such as <br>) are supported in a fashion very much like the way they're used in HTML. texvc could have a similar relationship to LaTeX.

Alternatively, we could allow all LaTeX commands, or all but a specific list of exceptions, or all in a specific list of packages. Then people that think that we have simple LaTeX won't be confused as often by the differences, since our approach will naturally lead to minimising them. Since LaTeX is widely used, especially by the mathematicians that we want to write our articles, this is probably the best way to go.

Taw has made the point that TeX is fundamentally a programming language, not a markup language (like MediaWiki and texvc). This is a good point; we really want a markup language here. (Some example programming commands are \newcount and other \new... commands, which declare variables; \ifodd, and other \if... commands, which make branching decisions; and \catcode, about which more below.) But the high-level LaTeX that is used in almost all LaTeX documents does behave essentially like a markup language; the programming commands are used internally, but not in the document itself. In particular, these commands should never need to appear in Wikipedia. Thus, a new TeX implementation could follow texvc in preparsing the input -- but only to the degree that it searches for these commands. That is, a blacklist, not a whitelist.

A list of banned commands will be easy to generate. The source code to LaTeX and the various common LaTeX packages is freely available and very well documented. The documentation for the source code to TeX itself is The TeXbook, which we hardly want to read through; but the index contains a list of primitive TeX commands (they're the underlined entries), so we can check each of those without trouble. Forbidding TeX input with these commands will be easy for novices from the world of LaTeX to understand; they're quite unlikely to try them in the first place and can readily understand the idea of "no tricky programming stuff that only wizards use". This is in contrast to texvc, which is sometimes still confusing to me (in its particulars, not its overall scope), even though I've watched its entire development (as presented to the wikis and the mailing lists). This plan will still provide the extra level of security (against DoSing and accessing files) that texvc provides by supporting only safe commands; all of the unsafe commands are for programming, not markup, and we will ask of each accepted command "Is this safe?". (In this sense, we still get the security benefits of a whitelist.)
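As a rough illustration of the preparsing step described above, here is a minimal Python sketch that scans a formula for forbidden control words. The FORBIDDEN set and the function name are placeholders invented for this example, not an agreed list, and the scan assumes TeX's default category codes (which holds only as long as \catcode itself stays forbidden).

```python
import re

# Hypothetical, deliberately tiny blacklist for illustration only. A real list
# would be compiled from the TeXbook index and the LaTeX and package sources.
FORBIDDEN = {"catcode", "newcount", "ifodd", "def", "input", "openout", "write"}

# Under default category codes, a control sequence is a backslash followed by
# either a run of letters (a control word) or exactly one other character
# (a control symbol such as "\\" or "\{").
CONTROL_SEQUENCE = re.compile(r"\\([A-Za-z]+|.)", re.DOTALL)


def forbidden_commands(tex):
    """Return the forbidden control words used in tex, in order of first use."""
    found = []
    for match in CONTROL_SEQUENCE.finditer(tex):
        name = match.group(1)
        if name in FORBIDDEN and name not in found:
            found.append(name)
    return found
```

Because the scan consumes an escaped backslash ("\\") as a single control symbol, the letters that follow it are treated as ordinary text rather than as a command name.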

In particular, if we go down this route, then we must not allow \catcode. \catcode changes the meaning of special characters like "\"; by typing "\catcode `\|=0", suddenly commands can begin with "|" as well as "\"! Also, "\catcode `\@=11" will make "@" a letter, allowing it to appear in a command name, which will make available a host of internal commands that are often tricky to understand. \catcode is the lid to Pandora's box, and we can only allow it if the above paragraphs are nullified and we allow everything! (And it is clearly a programming command, so we would naturally forbid it by the above guidelines anyway.)

There is the question of which LaTeX packages to support. I believe that we should only support particularly common ones that an outsider familiar with LaTeX is likely to know. The AMS packages (amssymb, amsmath, and amsfonts) are both very safe and very useful in this regard. Xypic should probably also be supported, since it will make many diagrams (even outside of math) editable in the wiki way, and nobody that doesn't know it will ever be tricked into thinking that its arcane syntax is anything but new to them. (Xypic is also difficult enough to use that it's unlikely to be used when unnecessary, leading to no more complicated input than we have now.) Besides these, PlanetMath apparently uses graphicx (which I'm unfamiliar with) a lot, so we may want that. There are also packages for supporting non-European languages and non-Latin alphabets that might be useful for use inside \mbox in some of the smaller language wikis. (Note that the Latin alphabet as used in European languages is native to LaTeX, even including Icelandic's "þ" and "ð" if we use the ec fonts.)

Another benefit of supporting LaTeX directly is that we add support in broad chunks -- the size of a LaTeX package. We don't need a heavily used texvc request page where LaTeX commands are asked for, with Taw then writing the support for them individually. Instead, we'll need only a lightly used page where LaTeX packages (or broad areas of functionality) are requested; and support for an entire package's worth of commands will happen about as quickly as support for an individual command is added to texvc.

If there are any additional commands that we need in an extension to LaTeX, then these can also be added by creating a special package of our own. If we upload this package to the Comprehensive TeX Archive Network under the public domain (which is more established than the GPL in the TeX community), then this won't cause problems exporting our material to other LaTeX users. texvc, in contrast, has developed some mild extensions (like support for math symbols under the same name as HTML character entities), which make it incompatible with the rest of the TeX world. However, I would try to avoid doing even this, since we want our LaTeX markup to be immediately readable by mathematicians coming from outside Wikipedia. None of the extensions supported by texvc do anything that can't be done without them, so it should be feasible to avoid our own package. Still, that option is available if we ever need it.

There is one major feature of texvc that isn't reflected in any call to support LaTeX directly: the HTML version. I think that this is an excellent idea that we should continue to work on. We will probably never be able to support everything in HTML (and certainly not Xypic's diagrams), but we should support as much as we can. texvc's multiple levels of HTML support are also well designed, in my opinion. I get the impression that most of Taw's work on texvc has gone into this; well, it has not been a waste. That said, I don't think that texvc's HTML output is ready to be used yet. If nothing else, it needs to implement TeX's spacing algorithm (not necessarily precisely), so that the HTML produced by TeX input won't be worse than the HTML produced by the MediaWiki input that we already have on Wikipedia. We can add support for HTML output when it's ready on whatever LaTeX we can manage, while we support PNG output now for (almost) arbitrary LaTeX. (To avoid reinventing any wheels, a look at TtH would probably be in order for anybody that continues to work on this.)

On the basis of principles (3) and (4), I make what I expect to be my most controversial suggestion, but luckily a suggestion that none of my other ideas depend on. Since LaTeX input is so different from MediaWiki, principle (3) is violated if it's used when it's not needed. Principle (4) is also violated if we output PNGs when ordinary HTML will do, since we can never be sure that a PNG will ever flow with text -- and experience suggests that it rarely will (see PlanetMath and MathWorld for numerous examples). Both of these would be, if not solved, at least substantially reduced, if we simply allow LaTeX only in displays, never inline. This may sound drastic, but although I'm one of the primary math contributors to Wikipedia, I've never wanted LaTeX support inline (with one exception mentioned below). I do want it for major displayed equations, and certainly for matrices and diagrams, but I've never needed it inline. OTOH, forbidding inline LaTeX removes the temptation for putting "1 + 1 = 2" (and some slightly more complicated things) in LaTeX input (which both frustrates the nonmathematical editor that doesn't want to get into the TeX stuff and generates slightly flawed output in texvc).

The main point is that any use of fancy mathematical markup, however implemented, should be rare. It was JeLuF that (perhaps to his eventual sorrow) impressed upon me in some old discussion of TeX support on wikipedia-l that wiki markup needs to be kept simple whenever possible. Complementing this, the HTML output that we produce should be as widely supported as possible too. I would go so far as to say that Greek letters shouldn't be used (except in particularly needy cases like "π" for the ratio of a Euclidean circle's circumference to its diameter), on the grounds that Netscape 4 is still widely used and doesn't support them. Now, this example goes far beyond anything that applies to texvc, and you may consider it too extreme. But surely we can agree that we can use an italic letter rather than a Gothic one for an ideal or a bold letter rather than a blackboard bold one for a number field; many recent advanced math textbooks written in LaTeX still stick to these simpler fonts. And surely we can agree that it's better to present the novice Wikipedia editor with "<i>x</i><sup>2</sup>/4 + <i>y</i><sup>2</sup>/9 = 1", which is fully documented on en:Wikipedia:How does one edit a page, than "<math>\frac {1}{4} x^2 + \frac {1}{9} y^2 = 1</math>", which will be documented in a more advanced document, say en:Wikipedia:How does one edit mathematics. The latter may look nicer to the reader, but not to the editor. Saving LaTeX support for the special occasion of displayed formulas will discourage this sort of thing.

(While I'm on the subject, LaTeX will put an inline expression with fractions like the above in TeX's \textstyle, so that the fractions appear smaller, more in keeping with the line height of ordinary text. texvc does not support \textstyle, and writing the HTML support isn't exactly straightforward -- yet another TeX algorithm to programme, when Donald Knuth already programmed it more than 20 years ago.)

Restricting LaTeX to displayed formulas is also not as great a loss as it may seem at first, since one can always resort to a displayed formula if LaTeX really is needed. For example, we could say "A commonly used symbol is{newline}<math>\Vdash</math>{newline}but in limited environments like this HTML document, one can instead write{newline}: ||-{newline}" and then use "||-" throughout the rest of the document. This will be easier to read by editors (as well as by readers using poor browsers), and it will look quite reasonable to a reader -- we're showing pictures of the symbols in a separate display and then using the ASCII version inline in the document. Another possibility is to upload an image specifically for the symbol (compare en:Image:DirectSum.png), which image can easily be generated by displaying the symbol in a preview, saving that PNG, and uploading it back to Wikipedia. I would argue that this too should be avoided, but it would still be easier for editors to understand, since only the special symbol would be a funny image reference.

The one exception, the one bit of inline math that I've wanted but not found available in HTML, is multiscripts. But I don't think that this is enough to override the above considerations. "x<sub>i</sub><sup>2</sup>", or at worst "(x<sub>i</sub>)<sup>2</sup>", will ultimately work, so it's not as great an issue as funny symbols, matrices, and diagrams. Also, MathML is on the horizon, and it supports multiscripts, so we wouldn't be resigning ourselves to never supporting multiscripts inline; once MathML becomes widely available (and at least somewhat legible to others), it should be no problem to design a wiki markup specifically for multiscripts.

Probably the prettiest result of forbidding inline LaTeX (but also the least significant) is that we could use a nicer wiki markup to introduce it. Since it would always appear alone on a line, we could simply make LaTeX any line that begins with "$$". This would be a dangerous delimiter inline, since "$$$" might well appear twice in the same paragraph in some text, especially a quotation, but it would be much less likely to cause problems at the beginning of a line. (If we do this, then we should probably also allow an ending "$$" for confused people from the LaTeX world.) This is hardly a persuasive argument, but it's worth keeping in mind if my other arguments are agreed to and we develop a new LaTeX markup accordingly.
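To make the "$$" line rule concrete, here is a small Python sketch of how a parser might pick out such lines. The function name and the (is_tex, text) return shape are assumptions made for this illustration; nothing here reflects existing MediaWiki code.

```python
def split_display_math(wikitext):
    """Yield (is_tex, text) pairs, one per line of wikitext.

    A line is treated as LaTeX only if it begins with "$$"; the same
    characters appearing later in a line are left alone.
    """
    for line in wikitext.splitlines():
        if line.startswith("$$"):
            body = line[2:].strip()
            # Tolerate a closing "$$" typed out of LaTeX habit.
            if body.endswith("$$"):
                body = body[:-2].rstrip()
            yield True, body
        else:
            yield False, line
```

Since only the start of the line is inspected, dollar signs inside running text never open a formula.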

A further, very minor point regarding principle (4) is that texvc's display is too large. I would suggest 100, rather than 120, after checking Netscape 6 on both Unix and Windows and Internet Explorer 5 on Windows. Or see if you can copy what PlanetMath does -- they look fine. Finally, displayed math needs a little indent (like we normally get with ":"); we can do this by putting ":" in by hand before "<math>", or automatically when using "$$".

My alternative proposal

  1. Find out which LaTeX packages are used by PlanetMath (and any other sources that we might be likely to import from or export to).
  2. In light of the above, decide on a set of common LaTeX packages that we wish to support.
  3. Decide which commands (other than those containing the letter "@") from these packages, LaTeX itself, and the primitive TeX commands are used for programming rather than markup. This list must include \catcode (and anything that calls it without undoing those changes).
  4. Decide which additional commands we need and write a little LaTeX package for them (mediawiki.sty). Place this on CTAN for people that wish to export our documents.
  5. Produce once and for all a TeX format file for LaTeX with these packages.
  6. Create a new wiki markup which consists of "$$" at the beginning of a line.
  7. When such markup is found, take the string up to the end of the line and search it for the forbidden commands (watching out for escaped backslashes).
  8. Output "TeX not compiled; forbidden programming commands <list>." if any are found.
  9. Send "\begin{document}$$<the input line>{newline}\ifmmode$$\fi\end{document}" to our TeX format and store the result in a PNG.
  10. If TeX produced an error, then output "TeX error: <the first error>.".
  11. Regardless of any such errors, cache the resulting PNG and stick the output at the appropriate spot, with an indentation.
  12. Meanwhile, continue to work on an HTML rendering of as much of our LaTeX input as we can. (We should look at MathML too, joining others in the TeX community on this project.)

I hope that (4) won't be needed. (6) (and similar) will of course have more complicated syntax if we allow inline TeX (but can be made backwards compatible if we choose to allow that later).
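Steps 8, 9, and 11 above might be sketched in Python roughly as follows. The function names are hypothetical, and the use of a SHA-1 digest as the cache file name is purely an assumption of this example; the proposal itself does not say how cached PNGs should be named.

```python
import hashlib


def tex_document(line):
    """Wrap one input line as in step 9.

    The newline presumably ensures that a trailing "%" comment in the input
    cannot swallow the closing material. \\ifmmode is true only while TeX is
    still in math mode, so the closing "$$" is supplied only if the input did
    not already end the display itself.
    """
    return "\\begin{document}$$" + line + "\n\\ifmmode$$\\fi\\end{document}"


def forbidden_message(commands):
    """Format the step 8 message for a list of control-word names."""
    listed = ", ".join("\\" + name for name in commands)
    return "TeX not compiled; forbidden programming commands " + listed + "."


def cache_name(line):
    """Derive a PNG file name from the input line (assumed scheme: SHA-1 hex)."""
    return hashlib.sha1(line.encode("utf-8")).hexdigest() + ".png"
```

Keying the cache on the input line means that the same formula appearing on several pages is rendered only once.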

What now?

The big advantage of texvc over any other proposal is, of course, that it's already available. Why not implement it now and then fix it up later? This may seem especially tempting if you disagree with my arguments about inline expressions. The reason is the last clause of principle 1. Since texvc extends LaTeX in ad hoc ways, we won't be able to move to direct support of LaTeX later without running the risk of breaking formulas. For example, texvc allows one to write "X^\cong" to get the same results as "X^{\cong}", but it would be hard even to write a LaTeX package that would make "^" work that way. In short, processing LaTeX directly will not be backwards compatible with texvc.

I hope to write a module like texvc that will implement the simplest case of the above alternative proposal. I won't include any functionality that might be controversial, like Xypic. If we want to add more stuff later, then we can do this without breaking any earlier formulas.