Open Source

Txt2tags: A great lightweight markup language for many tasks

Marco Fioretti explains why he favors the Txt2Tags lightweight markup language (LML) for many text management tasks. Here's what it can do for you.

Computers and the Internet have greatly increased the amount of text that many of us write, edit, reuse or simply archive. When pens and typewriters were the only tools available, they limited both how much text we could produce, and the number of occasions to process it. Today we live in an endless stream of reports, memos, email, websites and tweets that we can copy and paste with a click. Such a situation is a continuous stimulus to write, rearrange, and reuse text. Trying to do it efficiently, however, has some interesting practical implications.

Even when we are aware of how much text we need to manage, we cannot know in advance when we'll want to reuse or adapt some piece, or where, that is, on which medium or platform (paper, website, smartphone...). If we want to make the most of all the texts we write ourselves or keep stored on our computers, we need to be sure:

  • that their starting format is as simple, portable and future-proof as possible.
  • that we can quickly convert it to many other formats.

The OpenDocument format (ODF) satisfies the first requirement and is great for complex documents. However, it is too complicated for simple texts, and it is not the easiest solution when (automatic) conversion to many other formats is important. These considerations have produced a whole family of lightweight markup languages (LML): plain-text formats with very simple special characters or strings that mark up headings, lists, typefaces and so on. The workflow for all LMLs is the same:

  • write and store your text, with any editor, in the LML of your choice
  • whenever you need that text in another format (HTML, LaTeX, wiki, PDF...) generate a copy in that format, using the available conversion software for that LML

The LML I prefer, and have been using for most of my work for a few years now, is Txt2Tags. Here's why.

The main reason I like Txt2Tags is the simplicity and high availability, now and in the future, of its conversion software. I use Txt2Tags because I am sure that I can run it everywhere, with the smallest possible setup effort, without compiling anything or fighting with dependencies.

The Txt2Tags converter is one small script that Just Works, without relying on any particular library, on every platform where Python runs. Its slogan, "download and run", is true and it's a great part of being "as future-proof as possible": I can create, reuse, and convert the same *.t2t files on any computer I may encounter, from the VPS server hosting my websites to my uncle's Windows box or any Android smartphone.

One simple input...

The second great advantage of Txt2Tags is the simplicity of the format itself. A .t2t file is divided into three sections: header, settings, and body. The body, which is the only mandatory part, contains the actual text. The header contains metadata like the document title, author and date. The settings section is the place where you pass instructions to the Python script (more on this in a moment).

The markup rules are simple and keep the source text very readable, which is much more important than you may think. Learning those rules is a breeze, thanks to the online demo and converter. Besides, unlike some other systems, all the marks are symbols, not strings or letters that may confuse spell checkers.
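As a minimal sketch (the title, author, date and settings below are placeholder values, not anything prescribed by Txt2Tags), a complete .t2t source might look like this:

  My Document Title
  Jane Doe
  October 2012

  %!target: html
  %!encoding: UTF-8

  = First heading =

  Plain paragraphs need no markup at all. This line has
  **bold**, //italic// and ``monospaced`` text.

  - a bullet item
  - another bullet item

The first three lines are the header, the %! lines are settings, and everything else is the body.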

...for many great outputs

What next? Oh, yes, output formats. The features page currently lists 18 of them, from DocBook to HTML, several Wiki flavours, MagicPoint presentations and LaTeX. PDF and e-books, you say? No problem. Txt2Tags doesn't support them directly, because it doesn't need to. On most GNU/Linux distributions, once you have converted some text to LaTeX, PDF is just one more command away:

  $ txt2tags -t tex filename.t2t
  $ pdflatex filename.tex

The same applies to e-books. You can convert .t2t sources to HTML, and then generate ePub versions from there in many ways. I've personally used Txt2Tags to generate OpenDocument slideshows automatically, as well as PDF books and other material. With a bit of hacking, you may even add footnotes to your documents!
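For example, one such route (assuming Calibre's ebook-convert tool is installed, and using a hypothetical file name) is to go through HTML:

  $ txt2tags -t html filename.t2t
  $ ebook-convert filename.html filename.epub

Any other HTML-to-ePub converter would work just as well, since the HTML output is the only thing Txt2Tags itself needs to produce.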

The power of pre- and post-processing

I mentioned above that Txt2Tags documents have an optional section in which you can give instructions to the converter. The two most important ones are those called preproc and postproc. A command like this inside a *.t2t file:

%!preproc: _something_

means "apply this substitution NOW", that is, before converting the file to the desired target format. The postproc command, which has the same syntax, works in the opposite way:

%!postproc: _something_else_

thus defining filters that the Txt2Tags script applies after it has finished the conversion. The most common usage of preproc and postproc is to find and replace specific strings, at whichever stage is more convenient for you. Some users, for example, use preproc to create lists of links and other abbreviations. Doing so saves typing and keeps the .t2t source file more readable. This line, for example:

%!preproc: url_tros http://www.techrepublic.com/blog/opensource

tells the script to replace all occurrences of the url_tros string with the URL of the TechRepublic Open Source blog.
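postproc works the same way on the generated output. As a hypothetical sketch (the DRAFTMARK placeholder is made up for this example), a settings line like this would stamp a notice into the converted HTML only, after conversion:

  %!postproc(html): DRAFTMARK <em>draft copy, do not distribute</em>

The optional target name in parentheses restricts the filter to one output format, so the same source can carry different replacements for different targets.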

Txt2Tags, a format good for almost any task

Combined, preproc, postproc and the other Txt2Tags features can extend the functionality of this LML in all the ways described in the official wiki, and many more. Txt2Tags is not the best solution for works with lots of formulas, cross-references, or pictures with captions. In all other cases, however, Txt2Tags is so simple to use that it would be a shame not to try it!

About

Marco Fioretti is a freelance writer and teacher whose work focuses on the impact of open digital technologies on education, ethics, civil rights, and environmental issues.

19 comments
Tavis

I think if you are looking at the suitability of a text format you might consider input (typing, pasting), editing (from single characters to document-wide changes) and syntax separately. When creating HTML, I tend to use Dreamweaver's split screen (markup in one, WYSIWYG in the other) editor, and choose one or the other mode as it suits. If you use HTML properly, you will separate style from structure, making it easy to control visual aspects with CSS style sheets.

When writing and editing XML, I find a lot of advantages in using a dedicated, schema-aware XML editor which integrates well with typing and editing (valid element/attribute lists on type-ahead, autoclosing tags, syntax checking). Both HTML and XML editors I've used have syntax colouring and checking (for other file types like JavaScript and CSS too), and many keyboard and other productivity shortcuts.

There's an overhead in learning any syntax, and the more varieties you learn, the greater the overhead (although this may not be a problem for some people), and the more syntax mistakes you make (for example, I have to remember at least four separate comment syntaxes for XML/HTML, SQL, VB.NET and CSS/JavaScript). Even in wikis where I have seen syntax like Txt2Tags, they still have to provide buttons and guides, so it is not that intuitive, and there is no real help if you make syntax errors (you have to view and find your mistake in the WYSIWYG preview, generally). I can understand your use cases, but I'm not convinced that it would be a widely-used format, especially if it cannot cope with change (like HTML and CSS have done quite successfully).

jfprog

...and is very KISSy !

mfioretti

Anil_g and l_e_cox, thanks for your input. I do know that this is a topic that can ignite endless discussions, if not flames. Each of us has different needs, so there cannot be one "best" format for everybody. For example, as far as I am concerned, PDF is my LEAST common output format, so the fact that it takes two steps to get there from txt2tags is not an issue at all. When it comes to looks... I didn't say I really like how txt2tags looks, but it's very efficient for me, so it's OK.

Speaking of "With HTML so ubiquitous, I'm not sure why I would plow around another language": the reason for me is that the simpler the starting format is, the easier it is to generate and analyse it automatically in many ways, do incremental backups, version control, conversion to email, and work on it the same way no matter which OS is available... YES, of course you can do the same things even if the starting format is HTML, but I prefer the simplest possible foundation.

anil_g

There are so many human-readable markup languages now. Each one is tuned to the specific purposes of the author, but they all seem to have at least one or two issues. It seems the primary benefit of txt2tags is the lightweight generator tool, but it still needs a two-step process for PDF, my most common output format. I hate the txt2tags markup; it seems so much like code and has too much repetition, like """ and //. I also hate when you have to use specific coded strings to make something happen. I want it to look as much like natural language as possible. RST is better for me (but still has some glitches). txt2tags is more like a conglomeration of several wiki markups. I don't like it.

l_e_cox

I still create a lot of plain text documents and a lot of heavily-tagged word-processed documents. But when I want a simple structured document, I just use XHTML. It is NOT fun to write in a plain text editor, so I have searched, on and off, for an editor that will produce good HTML with less typing effort. My latest find is CKEditor (it used to be FCKeditor). It, likewise, is a script, in this case JavaScript, so it runs in a browser window. It is not my perfect answer, but I like its interface and the documents it produces are fairly well-formatted. With HTML so ubiquitous, I'm not sure why I would plow around for yet another XML-type language to use for documents. However, ease of typing will always be an issue for me, as I have never been that great at using a keyboard.

anil_g

Yes, I agree. I hate that too. The whole point (for me) of an LML is that you eliminate the need for buttons entirely, because the layout is so intuitive and simple that a button wouldn't help. For instance, to enter bullets: * Why would you need a button for this? But some wiki markups required: [list] [ * ] this is one bullet. [/list] Please! Why all the markup? No wonder they provide buttons, because there's so much typing. It's also ugly, and violates the principle that the plain text document should also be readable.

Similarly for links. An LML should definitely support http://this.is.the.link.com Why should you have to type [[http://this.is.the.link.com]] Some wikis (naming no names) get even worse, where you've got to remember different numbers of nested square brackets to deal with different types of links! Aaggh! What was the whole point of introducing an LML again? Have we made the situation worse or better?!

I think some wiki markups have been introduced with the sole point in mind of avoiding the security problems of exposing embedded HTML. They haven't really thought of anything else at all. There's a whole industry involving people who've worked out how to use certain wiki markups in innovative ways. They even have competitions, because it's so difficult to use wiki markup well. A REAL LML is SIMPLER, not HARDER!

mfioretti

Tavis, the whole point (for me at least) of going for an LML (and HTML is NOT an LML) is to have the flexibility of generating as many different formats as possible from the same source. Of course, if one were sure that only HTML/XML are relevant, it would be a different story! Re: "I'm not convinced that it would be a widely-used format, especially if it cannot cope with change": I'm not sure I understand what you mean. LMLs make it quick to write certain classes of text. They are not for formulas or large nested tables. But they're that way from the beginning, so either they're OK or they're not good from the first moment. What did you mean? Which change may find them unable to cope? Thanks.

anil_g

I can't find anything on a GAWK markup language. It's hard to find anything because of AWK and because there is an implementation of AWK called GAWK.

anil_g

Yes, of course, I only said better FOR ME. That's probably the main reason why there are so many flavours; each is tuned to a particular requirement set. My niggle is that, given a particular requirement set, all the targeted syntaxes make mistakes. There still isn't the perfect syntax for my requirements, and I hate to think that one day I may have to write ANOTHER one. I concur completely re HTML. I do NOT want to write HTML although I am very familiar with it. It's verbose, ugly, difficult to read and, if used as HTML in a web page, is full of security holes. I specifically want to get away from HTML.

mfioretti

The answer to "With HTML so ubiquitous, I'm not sure why I would plow around for yet another XML-type language to use for documents" is that HTML is not the only "end format" I need, even if it is the most common. So I prefer to use something that is simple and ready to go many different ways, without the risk that extra complexity makes those other conversions harder.

anil_g

If you're not happy with a keyboard you've chosen the worst language. Why labour with XML type languages, one of the key disadvantages is they are text (typing) heavy! RST and other markup languages try to reduce typing. You might like it.

Tavis

Marco, thanks for your clarification. What I meant by "cope with change" is how good LMLs are at coping with changing requirements. I see that txt2tags has a changelog where it lists changes to things like targets and syntax and bugfixes (I think this means that Txt2tags is not really so simple, especially if you consider different versions and "fatal errors", which do not generally occur in well-formed markup).

Nowadays, we might think of serving two different HTML pages for desktops and mobiles, and I have used XSLT and XPath to transform the same source XML into various different output formats. And HTML5 has a lot of new semantic elements that might be useful for an author to distinguish between, that may not map on to an LML. A lot of thought has been put into things like internationalization, Unicode support, escaping characters and markup, separation of styles and structure and so on in more heavyweight markup languages; perhaps an author does not need to consider these for a range of documents, though.

I can see why an LML might make sense for a single author (comfortable with scripting, downloading updates and debugging, and with good keyboarding skills) publishing a common subset of text to multiple document format outputs, as you say. However, I am not sure that LMLs would be more useful than, say, a subset of XHTML in exchanging or publishing these source documents, or going back and editing them. So maybe LML is not intended for long-lived documents (Txt2tags is described as a document generator). As a personal preference, I would use a keyboard shortcut Ctrl+1 or Cmd+1 in an editor to produce a heading element in a HTML editor (Ctrl+3/Cmd+3 for ) but either typing + heading + or using another keyboard shortcut in an LML editor (if available) could be someone else's preference. I guess I would spend some time learning an LML if I could use it in various places, especially where other formats and editors were not available, perhaps across wiki sites.

anil_g

Markup reduces time spent on everything; even if HTML is your only output, markup will be quicker to type and eliminates all the thinking and effort of deciding how to lay things out. Markup is so much simpler, with far fewer choices, and this restriction is beneficial for certain kinds of communication, especially documentation.

With markup you are basically hardly even thinking about layout. You simply type the content, distinguishing only paragraphs and titles with a few devices such as bullets, lists, quotes, code blocks etc., which are all created by extremely simple and intuitive means, such as position on the page. You can use the text as an original document, and I love being able to vim it, since I find that multiple times faster to edit. Then when you want to distribute the content, a single command (which can be automated) generates a PDF (or whatever) of consistent appearance. The whole process has got to be the most labour-saving way to do written content.

mfioretti

what jfprog's comment was supposed to mean. Unless he means that he writes everything in HTML or something, then uses gawk to process it???

Tavis

I am not familiar enough with LMLs to make a general judgement, but I notice that Txt2Tags has **bold**, //italic//, __underline__ and --strike--, which are often considered to be stylistic markups, rather than functional/semantic/structural, like code or headings or lists. I think that different styles may be needed for different media (for example, you do not normally see underline being used in HTML because of confusion with hyperlinks, although it may be used in some print).

I think you are probably right about an all-Unicode environment being able to successfully transfer properly-authored text, although there may be some edge cases and tricky legacy stuff. In markup like HTML and especially XML, the non-content markup is rigorously separated from content with special characters (<>) and escapes (&;), and nesting rules remove some uncertainties (elements cannot be half in and half out of another element).

I have had a quick look at Txt2tags' markup rules: http://www.txt2tags.org/rules.html I can see use cases for LMLs as part of an authoring process, but I think that if a particular LML has been created to handle all these cases, it is not that simple a language (syntactically). Looked at from another viewpoint, a 'simple' piece of plain text stripped of all context is actually harder to understand. Adding LML markup has to serve both human comprehension and be watertight for processing (in Txt2tags' case: scripting), and each processor has to produce the same result (like each XML DOM processor has to produce the same DOM, or web browser render the equivalent HTML+CSS). In other words, LML or richer markup should be unambiguous, perhaps declaratively consistent (no document produced to its ruleset should be invalid). So if your article had said 'fast' rather than 'simple', I would understand. I think there is an analogy of sorts between CISC and RISC processors, where the complexity is moved from one stage of operations to another.

mfioretti

Hi Tavis, you say: "A lot of thought has been put into things like internationalization, Unicode support, escaping characters and markup, separation of styles and structure". Don't you think an LML (on a modern operating system!) helps in this context? I mean, if the environment is modern enough, it should already support Unicode anyway, and so transfer it to the target formats (Txt2Tags does have an "encoding:" keyword!). Style is also naturally separated from structure with Txt2Tags or other LMLs. No?

anil_g

you're looking for much richer content and features in addition to content.

Tavis

Yes, anil_g, I think we are maybe considering different scenarios, although with some overlap. One of my reasons for using XHTML markup is for accessibility (in the broad sense) purposes, some examples of which include expanding abbreviations in markup like: <abbr title="Great British pounds sterling">£</abbr>141.29 and, as you mention, tables, which can be handled more completely with HTML http://www.w3.org/wiki/HTML_tables

The most complex (and hopefully accessible to screenreaders and such) solution I worked on is archived at: Senior Management Expenses http://web.archive.org/web/20101025080503/http://www.adamsmithcollege.ac.uk/finance/statements/allyears/expenses/seniormanagement/default.aspx where you can see all the table-structure markup. CSS classes such as 'number' are used for style purposes (a right-aligned column); not all such semantics can be inferred from content.

So probably my requirements usually include more markup helpful to users that often lies beneath the surface of the HTML page when published, and these are an overhead that many cases will not require, or other output formats would not in any case support or be relevant for.

anil_g

I think you've got a whole different scenario, Tavis, to the one I'm considering. I don't think I would have spent much time considering an LML for your purposes. It sounds like you want to generate rich presentational documents, but to automatically deliver different versions for different devices: print / screen / desktop / mobile. I've not looked hard but of course first knee-jerk reaction is to use XHTML or XML. I am pushing LML for scenarios where the presentational requirements are minimal and really only need to be consistent, not configurable. This means no horizontal layout requirements, everything goes down the page. Simple in-line image inclusion. Text tables are the most complicated device.
