Software with an eye for structure
Chirag Kasbekar

Nature offers wonderful solutions. Take your — or, for that matter, your pet cat’s — ability to recognise patterns and structure in the things around you. No matter how crumpled it is, you know that’s a piece of paper lying in the wastebasket. No matter how twisted it is, you know that’s a tree the bird is sitting on. It’s a mark of your intelligence, this flexibility.

Software strives to achieve intelligence. And organisations, which in some respects are like biological organisms, are hoping to use some of it for themselves.

The Exegenix solution
Exegenix Canada, a wholly owned subsidiary of Tata Infotech, has developed just such an intelligent software solution. The Exegenix Conversion System (ECS) is, in many ways, a breakthrough. (To learn more about Exegenix Canada, see ‘Creativity, initiative, experience and reputation’ below.)

"Human intelligence is quite remarkable in its ability to see patterns and structures very quickly," says Bill Clarke, CEO of Exegenix and an astrophysicist at the University of Toronto. "Our technology reads electronic documents much as a human being does; hence it is able to understand the structure of documents. That’s enormously important for organisations wanting to use their data effectively."

Organisations, if they wish to be adaptive and intelligent, need something like a central nervous system through which information flows smoothly and in real time. For that, they need to have a memory which, much like human memory, will allow people to immediately get the information they want. Electronic documents need to be in a format that allows this kind of immediate and effective access.

A question of structure
As Dr Clarke points out, XML, or Extensible Markup Language, is one such format. "One of the big problems that organisations have now (organisations with a tremendous amount of information assets) is to structure those assets so they can be easily exploited.

"If you want to, for example, research some topic on the Internet, you’ll go to one of these search engines and say, ‘OK, I want to find everything there is out there on some topic or the other’. And it comes back and tells you that there are 15,000 hits. But 14,900 of these are really not what you needed at all. That’s because, most of the time, documents themselves don’t have structure. You can make some attempt to say you’re really interested in this topic within this framework, within this context, but it’s pretty tough to do that.

"XML enables you to specify all kinds of structural contexts for documents. This enables people who are mining that information to get at what they really want."

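To make Dr Clarke's point concrete, here is a minimal sketch in Python using the standard library's `xml.etree.ElementTree`. The sample document, its element names (`report`, `section`, `heading`, `para`) and the search term are all invented for illustration; they are not taken from any Exegenix schema or product. The sketch contrasts a plain text search, which matches every mention of a word, with a search that uses structure and looks only at headings.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the element names are illustrative assumptions.
DOC = """
<report>
  <section>
    <heading>Pension rules</heading>
    <para>Contributions are made monthly.</para>
  </section>
  <section>
    <heading>Travel policy</heading>
    <para>Pension queries go to the HR desk, not the travel desk.</para>
  </section>
</report>
"""

root = ET.fromstring(DOC)


def sections_mentioning(term):
    """Unstructured search: any section whose text mentions the term anywhere."""
    term = term.lower()
    return [
        sec.findtext("heading")
        for sec in root.iter("section")
        if term in "".join(sec.itertext()).lower()
    ]


def sections_about(term):
    """Structured search: only sections whose heading mentions the term."""
    term = term.lower()
    return [
        sec.findtext("heading")
        for sec in root.iter("section")
        if term in sec.findtext("heading", default="").lower()
    ]


print(sections_mentioning("pension"))  # both sections mention the word
print(sections_about("pension"))       # only one section is about it
```

The second function is only possible because the headings are marked up as headings; in an unstructured file there would be nothing to tell a heading from a passing mention.
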
ECS completely automates the process of converting an organisation’s documents into XML or any other such format. It handles not just archival documents but also live data, as it is generated.

Giving computers vision
Why is this difficult? Well, try telling a computer that the crumpled object in your wastebasket is a piece of paper. Usually, it lacks the flexibility, the mental dexterity to be able to understand something like that. The traditional approach to developing software only adds to the rigidity.

Before ECS came along, organisations struggled with solutions. Says Dr Clarke: "Some organisations would get teams of programmers to look at archival documents and say something like, ‘If I find that there are two tabs followed by a capital letter, that’s probably the beginning of a paragraph. And if I’ve got tabs in the middle of the text, that’s probably a table’, and so on. That is the hardwired approach: as long as a document conforms to a very specific structure, then those programs can deal with them."
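
The hardwired approach Dr Clarke describes can be sketched in a few lines of Python. The two rules below are a loose rendering of the examples in his quote; the function name, the labels and the sample lines are assumptions made for illustration, not code from any real conversion project.

```python
import re

# Hypothetical rules mirroring the ones in the quote above.
PARAGRAPH_START = re.compile(r"^\t\t[A-Z]")  # two tabs, then a capital letter
TABLE_ROW = re.compile(r"\S\t+\S")           # tabs between pieces of text


def classify_line(line):
    """Label a line using fixed rules; anything unexpected falls through."""
    if PARAGRAPH_START.match(line):
        return "paragraph-start"
    if TABLE_ROW.search(line):
        return "table-row"
    return "unknown"


samples = [
    "\t\tThe quarterly results were strong.",  # fits the paragraph rule
    "Region\tSales\tGrowth",                   # fits the table rule
    "    The quarterly results were strong.",  # same paragraph, indented with spaces
    "Region   Sales   Growth",                 # same table, aligned with spaces
]

for line in samples:
    print(repr(line), "->", classify_line(line))
```

The last two samples carry the same content as the first two, yet the rules no longer recognise them. That brittleness is the limitation the quote points to.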

Unfortunately, as organisations are well aware, their documents very rarely follow a uniform structure. The image below, taken from ‘There are no unstructured documents’, a paper presented at the XML Europe 2002 convention by David Slocombe and Rodney Boyd of Exegenix, shows different ways in which a list can be presented.

You will notice that even though these lists are constructed differently, they’re all trying to say something similar. A computer, ordinarily, won’t notice. This makes things very difficult for any attempt to automate the process of conversion of documents. And this is where the ability to recognise structure like humans do — through visual cues — becomes handy.
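
As a small illustration of the problem (and not of the Exegenix method, which this article does not describe), the Python sketch below takes the same three items typed in three different list styles. The sample lists and the marker pattern are invented for this example. A pattern-based reader copes only with the styles someone thought to write down in advance.

```python
import re

# Invented examples: the same three items typed in three different styles.
VARIANTS = [
    "1. apples\n2. pears\n3. plums",
    "(a) apples\n(b) pears\n(c) plums",
    "- apples\n- pears\n- plums",
]

# Recognises "1." or "1)", "(a)", and "-" or "*" markers, and nothing else.
MARKER = re.compile(r"^\s*(?:\d+[.)]|\([a-z]\)|[-*])\s+")


def list_items(text):
    """Return the item text of every line that starts with a known marker."""
    items = []
    for line in text.splitlines():
        if MARKER.match(line):
            items.append(MARKER.sub("", line, count=1))
    return items


for variant in VARIANTS:
    print(list_items(variant))

# A style nobody anticipated (roman numerals) is not seen as a list at all.
print(list_items("i) apples\nii) pears\niii) plums"))
```

A human reader sees four lists of fruit; this code sees three lists and one block of unrecognised text.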

Dr Clarke elaborates: "Human beings look at documents and know intuitively what the headings are, what the paragraphs are, what the tables are, what the illustrations are, what the captions are, what the footnotes are, what the sidebars are, etc. But to make software that approaches an electronic document in that way is extremely difficult."

100 per cent accuracy
It’s all the more remarkable then that Shrikant Pathak, vice-president and regional director, Americas, at Tata Infotech, can say with confidence that "absolutely no content is lost through our solution. It results in a hundred per cent accurate conversion".

The core Exegenix technology (the part that actually does the recognition) has two qualities that give it great adaptability. One is its ability to translate documents into any format, not just XML. If tomorrow another technology comes along that is an improvement on XML, the Exegenix technology can be used to convert documents to that technology without much change in its defining features. (So, for those wondering if the ‘Ex’ in the company’s name refers to XML, it doesn’t.)

Mr Pathak offers an example. "A book can be converted from a paper book or an electronic text version of that to an open electronic book format," he says. "It can also be converted to be read on a tablet PC. Once you have the structure that this technology provides you, it becomes enormously easy to re-purpose it into any output format you want."
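
The idea of re-purposing can be shown with a short Python sketch. It assumes the structure has already been recovered and is held in a simple dictionary; the `BOOK` data, its field names and the two output functions are hypothetical and stand in for whatever formats an organisation actually needs.

```python
from html import escape

# Hypothetical structured form of a short book, as if already recovered.
BOOK = {
    "title": "A Short Guide",
    "chapters": [
        {"heading": "Getting started",
         "paragraphs": ["Open the box.", "Read this guide."]},
        {"heading": "Next steps",
         "paragraphs": ["Plug it in & switch on."]},
    ],
}


def to_html(book):
    """Render the structure as simple HTML, escaping special characters."""
    parts = [f"<h1>{escape(book['title'])}</h1>"]
    for chapter in book["chapters"]:
        parts.append(f"<h2>{escape(chapter['heading'])}</h2>")
        parts.extend(f"<p>{escape(p)}</p>" for p in chapter["paragraphs"])
    return "\n".join(parts)


def to_plain_text(book):
    """Render the same structure as plain text with underlined headings."""
    lines = [book["title"].upper(), ""]
    for chapter in book["chapters"]:
        lines.append(chapter["heading"])
        lines.append("-" * len(chapter["heading"]))
        lines.extend(chapter["paragraphs"])
        lines.append("")
    return "\n".join(lines)


print(to_html(BOOK))
print(to_plain_text(BOOK))
```

Both outputs come from the same structured source, so adding a third format means writing one more small function and leaves the content untouched.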

The other feature is its language independence. If this technology is to be successful across the globe, it has to be able to translate documents that are in languages other than English. And there’s no reason that Exegenix’s technology can’t.

Given that the Exegenix solution is the only one of its kind at the moment, it seems like Tata Infotech has a winner here.

An interesting future
The future promises some interesting times for Exegenix.

Not only does Exegenix expect ECS to be the accepted technology for the conversion or marking up of information five to ten years from now, it also expects to take this technology beyond XML conversion.

The company’s customers, many of them Fortune 500 companies at the cutting edge of their fields, keep coming up with interesting ideas for further application of Exegenix’s core technology. Says Dr Clarke, "People we’re working with are thinking of new ideas all the time. They keep saying, ‘What about this? Can we do this application? What about offering this service to our customers?’"

Economists inspired by biology like to talk of technological co-evolution as a process by which pioneering technologies make other pioneering technologies possible. It is clear that Exegenix’s technology has opened up a whole new area of opportunities for businesses, and we can expect this pioneering technology to lead to many more interesting technologies in the future. Whatever those might be, there is a good chance that Exegenix will remain at the vanguard of this process.

Creativity, initiative, experience and reputation
Exegenix’s history is a story of creativity and initiative. Exegenix Canada’s chief technology officer, David Slocombe — who had earlier spent some years with Bill Clarke and a few other current Exegenix hands in a company called SoftQuad, a pioneer in publishing technology — came to India to work with a small group of developers at Tata Infotech to create a ‘proof of concept’ for ideas he had about the conversion of documents to XML.

In February 2001 the Tata Infotech board decided to establish a North American operation to take this proof of concept and a small team of developers from India and, as Bill Clarke puts it, "marry that group with some experienced product developers and architects in North America and jointly create the technology and take it to market".

Two companies were set up. One was called Exegenix Canada Inc, a wholly owned subsidiary of Tata Infotech. Exegenix Canada has a substantial stake in the other company, Exegenix Research Inc, which is a pure research and development company. Exegenix Canada, the sales and marketing wing of the initiative, markets the ECS solution to large organisations, including Fortune 500 companies, across North America and Europe, and hopes to expand sales to countries like India in the near future.

Experience counts
Exegenix’s history is also a story of experience. Says Dr Clarke, the company’s CEO: "Exegenix was formed in May 2001, but the staff, particularly the senior staff, have a minimum of 15 years in structural mark-up.

"We’re the people who brought out the first SGML editing software, who produced the first HTML (HyperText Markup Language) software. So we’re known in the XML world for the background that we have, our experience and for our professionalism." (SGML, or Standard Generalized Markup Language, is the parent language from which XML and HTML are derived.)

Even though he has a PhD in astrophysics from the University of California, and still teaches the subject at the University of Toronto, Dr Clarke brings to Exegenix decades of experience in the computer industry, initially as a scientific programmer, and in the publishing industry, where he applied his knowledge of computer technology to publishing.

When it comes to big clients, Exegenix feeds off Tata Infotech’s reputation as a well-established company. "Exegenix’s relationship with Tata Infotech is a very important one, because it gives us greater credibility with large customers who are not prepared to deal with organisations that may not be there tomorrow," says Dr Clarke. "Having that tremendous legacy, that strength behind us, coupled with our acknowledged experience, understanding and proven track record in this area, is a very important factor for a young company like ours."
