Transform plain text files into Web pages automatically with this PHP script

Learn how you can use PHP to quickly transform plain ASCII text into perfectly readable HTML markup.

Recently, an old friend of mine rang me up to ask for help. He'd been working as a journalist for many years, and had recently received reprint rights to a number of his earlier columns. He was eager to publish his past work on the Web; however, his columns were all saved as plain-text files and he had neither the time nor the inclination to learn HTML and convert them to Web pages. Since I was the only geek in his phone book, he'd called me to see if I could help him.

"Let me take care of it", I said. "Call me back in an hour", I said. And sure enough, when he called back a couple of hours later, I had a solution waiting for him. It involved a little bit of PHP, and it earned me his eternal thanks and a crate of wine.

So what did I do in that hour? That's where this article comes in. I'm going to show you how you can use PHP to quickly transform plain ASCII text into perfectly readable HTML markup.

To begin, let's look at an example of one of the raw text files my friend wanted to convert:

Green for Mars!
John R. Doe

The idea of little green men from Mars, long a staple of science fiction, may soon turn out to be less fantasy and more fact.

Recent samples sent by the latest Mars exploration team indicate a high presence of chlorophyll in the atmosphere. Chlorophyll, you will recall, is what makes plants green. It's quite likely, therefore, that organisms on Mars will have, through continued exposure to the green stuff, developed a greenish tinge on their outer exoskeleton.

An interview with Dr. Rushel Bunter, the head of ASDA's Mars Colonization Project blah blah...

What does this mean for you? Well, it means blah blahblah...

Track follow-ups to this story online at http://www.mars-connect.dom/. To see pictures of the latest samples, log on to http://www.asdamcp.dom/galleries/220/

Fairly standard text: it has a title (or "slug"), a byline, and many paragraphs of text. All that's really needed to transform this document into HTML is to use HTML line and paragraph break markers to preserve the original layout on a Web page. Special punctuation characters need to be converted into their HTML equivalents, and hyperlinks need to be made clickable.

Here's the PHP code (Listing A) to accomplish all of the above:

Listing A

<?php
// set source file name and path
$source = "toi200686.txt";

// read raw text as array
$raw = file($source) or die("Cannot read file");

// retrieve first and second lines (title and author)
// strip the trailing line breaks and escape special characters
$slug = htmlspecialchars(trim(array_shift($raw)));
$byline = htmlspecialchars(trim(array_shift($raw)));

// join remaining data into string
$data = join('', $raw);

// replace special characters with HTML entities
// replace line breaks with <br />
$html = nl2br(htmlspecialchars($data));

// replace multiple spaces with single spaces
$html = preg_replace('/\s\s+/', ' ', $html);

// replace URLs with <a href...> elements
// the URL stops before any "<" (so it cannot swallow a following <br />)
// and trailing punctuation is left outside the link
$html = preg_replace('/(^|\s)(\w+:\/\/[^\s<]*[^\s<.,;:!?)])/', '\\1<a href="\\2" target="_blank">\\2</a>', $html);

// start building output page
// add page header
$output =<<<HEADER
<html>
<head>
<style>
.slug {font-size: 15pt; font-weight: bold}
.byline { font-style: italic }
</style>
</head>
<body>
HEADER;

// add page content
$output .= "<div class='slug'>$slug</div>";
$output .= "<div class='byline'>By $byline</div><p />";
$output .= "<div>$html</div>";

// add page footer
$output .=<<<FOOTER
</body>
</html>
FOOTER;

// display in browser
echo $output;

// write output to a new .html file
// (same base name as the source file, with the extension replaced)
file_put_contents(pathinfo($source, PATHINFO_FILENAME) . ".html", $output) or die("Cannot write file");

Let's see how this works:

  1. The first step is to read the raw ASCII file into a PHP array. This is easily accomplished with the file() function, which turns every line of the file into an element of a numerically-indexed array.
  2. Next, the title and author lines (I assume these are the first two lines of the file) are extracted from the array into separate variables using the array_shift() function. The remaining members of the array are then concatenated into a single string. This string will now contain the entire body of the article.
  3. Special characters like &, < and > within the body are converted into their HTML equivalents using the htmlspecialchars() function. To preserve the original formatting of the article, line and paragraph breaks are converted into HTML <br /> elements with the nl2br() function. Multiple spaces within the article body are compressed into a single space using a regular expression replacement.
  4. URLs within the body are detected using regular expressions, and are surrounded by <a href=...></a> elements. This turns the URLs into clickable hyperlinks when the page is viewed in a Web browser.
  5. The output HTML page is then constructed using standard HTML rules. The article title, author and body are formatted using CSS style rules. Although this script doesn't do it, this is the point at which you would customize the appearance of the final page, perhaps by adding graphical elements, colors or other whiz-bangs to the template.
  6. Once the HTML page has been constructed, it can be sent to the browser or saved to a static file with file_put_contents(). Note that when saving, the original file name is decomposed and a new file (named filename.html) is created for the newly-minted Web page. You can then publish this Web page to a Web server, save it to a CD-ROM or edit it further.
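
To try steps 3 and 4 in isolation, here is a minimal sketch that wraps them in a function and runs it on a short string. The function name text_to_html_body() and the sample text are assumptions made for illustration; they are not part of Listing A. The URL pattern used here stops at a < character and leaves trailing punctuation outside the link.

```php
<?php
// Hypothetical helper (not in Listing A): converts an article body
// from plain text into an HTML fragment.
function text_to_html_body(string $text): string
{
    // escape &, < and >, then mark line breaks with <br />
    $html = nl2br(htmlspecialchars($text));

    // compress runs of whitespace into a single space
    $html = preg_replace('/\s\s+/', ' ', $html);

    // wrap URLs in <a> elements; the pattern stops before "<" and
    // leaves trailing punctuation outside the link
    return preg_replace(
        '/(^|\s)(\w+:\/\/[^\s<]*[^\s<.,;:!?)])/',
        '$1<a href="$2" target="_blank">$2</a>',
        $html
    );
}

// made-up sample input, for illustration only
$sample = "Fish & chips <today>\nMore at http://www.example.com/menu.";
echo text_to_html_body($sample);
```

In the result, the ampersand and angle brackets appear as entities, the line break is preceded by <br />, and the sentence-ending period stays outside the hyperlink.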

Note: When using this script to create and save HTML files to disk, ensure that the script has write privileges on the directory to which the files are being saved.
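
One way to catch a permissions problem before attempting the write is to test the target directory first. The sketch below is an assumption about how you might do this, not part of the original script; the helper name can_write_to() is hypothetical, while is_dir(), is_writable() and dirname() are standard PHP functions.

```php
<?php
// Hypothetical helper (not in Listing A): reports whether the directory
// that will hold the output file exists and is writable by the script.
function can_write_to(string $targetFile): bool
{
    // dirname() returns "." for a bare file name, i.e. the current directory
    $dir = dirname($targetFile);
    return is_dir($dir) && is_writable($dir);
}

// output file name as it would be derived from the source file in Listing A
$target = "toi200686.html";
echo can_write_to($target)
    ? "OK to write $target\n"
    : "No write permission for " . dirname($target) . "\n";
```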

As you can see, assuming you have ASCII plain-text data files in a standard format, you can convert them fairly quickly into usable Web pages with PHP. And if you have an existing Web site into which you plan to inject your new Web pages, it's also quite easy to tweak the template used by the page generator to match the look and feel of your existing Web site. So go on, try it out for yourself!
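
If you expect to tweak the template, it may help to isolate it in a single function, so that matching an existing site means editing only that function. The following is a rough sketch under stated assumptions: the function name render_page() and the stylesheet name site.css are hypothetical, and the title, byline and body are assumed to have been escaped and converted already.

```php
<?php
// Hypothetical template function (not in Listing A). All three
// arguments are assumed to be HTML-safe already.
function render_page(string $slug, string $byline, string $html): string
{
    // "site.css" stands in for an existing site's stylesheet
    return <<<PAGE
<html>
<head>
<title>$slug</title>
<link rel="stylesheet" href="site.css" />
</head>
<body>
<div class="slug">$slug</div>
<div class="byline">By $byline</div>
<div>$html</div>
</body>
</html>
PAGE;
}

echo render_page("Green for Mars!", "John R. Doe", "The idea of little green men...");
```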
