
Hands-on programming: Extract plain text from documents with Syncfusion's components

Justin James recently tried Syncfusion's Essential DocIO and Essential PDF to help him extract text from documents that he downloaded from the Internet. Here's the code that he wrote to get the plain text from the document.

I recently tried Syncfusion's Essential DocIO and Essential PDF to help me extract text from documents that I downloaded from the Internet. Essential DocIO and Essential PDF are part of a larger suite called Syncfusion Essential Studio Enterprise Edition. The components support a ton of functionality that I'm not using, such as creating PDFs and Word documents, but for my very specific purpose (i.e., getting plain text out of a source document), the tools work like a charm.

Here's the code that I wrote to download a document from the Internet and get the plain text from it. It picks a parser based on the Content-Type header of the response, and it handles HTML, XHTML, PDF, Word DOC, Word DOCX, RTF, and plain text.

    // Needs System, System.IO, System.Net, System.Text.RegularExpressions,
    // and the Syncfusion DocIO and PDF namespaces.
    private string GetDocumentText(string url)
    {
        var documentText = string.Empty;

        try
        {
            // The using blocks dispose the client and the response stream, even on failure.
            using (var webClient = new WebClient())
            {
                webClient.Headers["User-Agent"] = ".NET Framework";

                using (var data = webClient.OpenRead(url))
                {
                    var document = new WordDocument();

                    // Capture the media type only, ignoring parameters such as "; charset=utf-8".
                    // The pattern allows '+', '.', and '-' so that "application/xhtml+xml",
                    // "application/x-pdf", and the DOCX type match in full.
                    var contentTypeFinder = new Regex(@"^[\w.+-]+/[\w.+-]+");
                    var contentTypeHeader = webClient.ResponseHeaders["Content-Type"] ?? string.Empty;
                    var contentType = contentTypeFinder.Match(contentTypeHeader).Value.ToLowerInvariant();

                    switch (contentType)
                    {
                        case "application/xhtml+xml":
                        case "text/html":
                            // HTML or XHTML
                            document.Open(data, FormatType.Html);
                            documentText = document.GetText();
                            document.Close();
                            break;

                        case "application/pdf":
                        case "application/x-pdf":
                            // PDF: buffer the whole response, because the loader takes a byte array
                            using (var buffer = new MemoryStream())
                            {
                                data.CopyTo(buffer); // Stream.CopyTo needs .NET Framework 4 or later
                                var pdf = new PdfLoadedDocument(buffer.ToArray());
                                foreach (PdfLoadedPage page in pdf.Pages)
                                {
                                    documentText += page.ExtractText();
                                }
                                pdf.Close();
                                pdf.Dispose();
                            }
                            break;

                        case "text/plain":
                            // Plain text
                            using (var reader = new StreamReader(data))
                            {
                                documentText = reader.ReadToEnd();
                            }
                            break;

                        case "application/msword":
                            // "Classic" Word format
                            document.Open(data, FormatType.Doc);
                            documentText = document.GetText();
                            document.Close();
                            break;

                        case "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
                            // DOCX Word format
                            document.Open(data, FormatType.Docx);
                            documentText = document.GetText();
                            document.Close();
                            break;

                        case "application/rtf":
                        case "text/rtf":
                            // RTF format (servers use either media type)
                            document.Open(data, FormatType.Rtf);
                            documentText = document.GetText();
                            document.Close();
                            break;

                        default:
                            // Unsupported content type
                            return null;
                    }
                }
            }
        }
        catch (Exception)
        {
            // Deliberately swallow exceptions; the caller gets whatever text was extracted so far
        }

        return documentText;
    }
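The switch statement in the method above only works if the media type is cleanly separated from the rest of the Content-Type header, which often carries parameters such as a charset. As a minimal sketch of that one step in isolation, here is a version that splits on the semicolon instead of using a regular expression. The MediaTypeHelper class and its two methods are hypothetical names that I am introducing for illustration; they are not part of Syncfusion's libraries or of the method above.

```csharp
using System;

// Hypothetical helper for illustration; not part of Syncfusion's libraries.
public static class MediaTypeHelper
{
    // Returns the lower-cased media type without parameters,
    // or an empty string when the header is missing.
    public static string GetMediaType(string contentTypeHeader)
    {
        if (string.IsNullOrEmpty(contentTypeHeader))
        {
            return string.Empty;
        }

        // Everything after the first ';' is a parameter (for example, charset).
        return contentTypeHeader.Split(';')[0].Trim().ToLowerInvariant();
    }

    // Maps a media type to a short label for the kind of document it names.
    // Returns null for types that the extraction method does not handle.
    public static string DescribeFormat(string mediaType)
    {
        switch (mediaType)
        {
            case "application/xhtml+xml":
            case "text/html":
                return "html";
            case "application/pdf":
            case "application/x-pdf":
                return "pdf";
            case "text/plain":
                return "text";
            case "application/msword":
                return "doc";
            case "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
                return "docx";
            case "application/rtf":
            case "text/rtf":
                return "rtf";
            default:
                return null;
        }
    }
}
```

With this in place, MediaTypeHelper.GetMediaType("Text/HTML; charset=utf-8") yields "text/html", and the result can be passed straight to DescribeFormat or used in a switch statement. Keeping this logic in its own method also makes it easy to test without touching the network.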

An important note: I put in a generic User-Agent header, which is fine; there is no need to try to emulate a particular browser. However, many sites will actively reject the connection if you don't have any kind of User-Agent header.

I hope you find this code useful. And in case anyone is wondering, I'm very happy with Syncfusion's components, even though I'm using fewer than a dozen calls from the entire library. By the time you read this article, I will have purchased the full suite, Syncfusion Essential Studio Enterprise Edition.

J.Ja

Disclosure of Justin's industry affiliations: Justin James has a contract with Spiceworks to write product buying guides. He is also under contract to OpenAmplify, which is owned by Hapax, to write a series of blogs, tutorials, and other articles.


About Justin James

Justin James is the Lead Architect for Conigent.
