Chirag Kasbekar
Nature
offers wonderful solutions. Take your — or, for that matter,
your pet cat’s — ability to recognise patterns and structure
in the things around you. No matter how crumpled it is, you
know that’s a piece of paper lying in the wastebasket. No
matter how twisted it is, you know that’s a tree the bird
is sitting on. It’s a mark of your intelligence, this flexibility.
Software strives for this kind of intelligence. And organisations,
which in some respects resemble biological organisms, are hoping
to put some of that intelligence to work for themselves.
The Exegenix solution
Exegenix
Canada, a wholly owned subsidiary of Tata Infotech, has developed
just such an intelligent software solution. The Exegenix Conversion
System (ECS) is, in many ways, a breakthrough. (To learn more
about Exegenix Canada, see ‘Creativity, initiative, experience
and reputation’ below.)
"Human
intelligence is quite remarkable in its ability to see patterns
and structures very quickly," says Bill Clarke, CEO of
Exegenix and an astrophysicist at the University of Toronto.
"Our technology reads electronic documents much as a
human being does; hence it is able to understand the structure
of documents. That’s enormously important for organisations
wanting to use their data effectively."
Organisations,
if they wish to be adaptive and intelligent, need something
like a central nervous system through which information flows
smoothly and in real time. For that, they need to have
a memory which, much like human memory, will allow people
to immediately get the information they want. Electronic documents
need to be in a format that allows this kind of immediate
and effective access.
A question of structure
As
Dr Clarke points out, XML, or extensible mark-up language,
is one such format. "One of the big problems that organisations
have now — organisations with a tremendous amount of information
assets — is to structure those assets so they can be easily
exploited.
"If
you want to, for example, research some topic on the Internet,
you’ll go to one of these search engines and say, ‘OK, I want
to find everything there is out there on some topic or the
other’. And it comes back and tells you that there are 15,000
hits. But 14,900 of these are really not what you needed at
all. That’s because, most of the time, documents themselves
don’t have structure. You can make some attempt to say you’re
really interested in this topic within this framework, within
this context, but it’s pretty tough to do that.
"XML
enables you to specify all kinds of structural contexts for
documents. This enables people who are mining that information
to get at what they really want."
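To make Dr Clarke's point concrete, here is a minimal sketch in Python of how structure narrows a search. It is an illustration only, not part of ECS: the element names (`report`, `section`, `heading`, `para`) and the helper `sections_titled` are invented for this example. A plain keyword search for "pricing" would match both sections of the sample document; asking for the word in a heading matches only the one that is actually about pricing.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element names are assumptions
# made for this illustration, not a format described in the article.
doc = ET.fromstring("""
<report>
  <section>
    <heading>Pricing</heading>
    <para>Our pricing changed in 2002.</para>
  </section>
  <section>
    <heading>History</heading>
    <para>The word pricing appears here only in passing.</para>
  </section>
</report>
""")

def sections_titled(root, word):
    """Return sections whose heading (not body text) contains the word."""
    hits = []
    for section in root.findall("section"):
        heading = section.findtext("heading", default="")
        if word.lower() in heading.lower():
            hits.append(section)
    return hits

matches = sections_titled(doc, "pricing")
titles = [s.findtext("heading") for s in matches]
```

The search is restricted to a structural context (the heading), which is the kind of query that an unstructured document cannot support.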
ECS
completely automates the process of converting an organisation’s
documents into XML or any other such format. Not just archival
documents, but also live data, as it is generated.
Giving computers vision
Why
is this difficult? Well, try telling a computer that the crumpled
object in your wastebasket is a piece of paper. Usually, it
lacks the flexibility, the mental dexterity to be able to
understand something like that. The traditional approach to
developing software only adds to the rigidity.
Before
ECS came along, organisations struggled with solutions. Says
Dr Clarke: "Some organisations would get teams of programmers
to look at archival documents and say something like, ‘If
I find that there are two tabs followed by a capital letter,
that’s probably the beginning of a paragraph. And if I’ve
got tabs in the middle of the text, that’s probably a table’,
and so on. That is the hardwired approach: as long as a document
conforms to a very specific structure, those programs
can deal with it."
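The hardwired approach Dr Clarke describes can be sketched in a few lines of Python. This is a hypothetical illustration of the two rules he mentions, not code from any real conversion project; the function name `classify_line` and the labels it returns are assumptions made for this example.

```python
import re

def classify_line(line: str) -> str:
    """Label one raw text line using two fixed layout rules."""
    # Rule 1: two leading tabs followed by a capital letter
    # is taken to be the beginning of a paragraph.
    if re.match(r"\t\t[A-Z]", line):
        return "paragraph-start"
    # Rule 2: tabs in the middle of the text are taken to be a table row.
    if re.search(r"[^\t]\t+[^\t]", line):
        return "table-row"
    return "text"

samples = [
    "\t\tNature offers wonderful solutions.",   # indented with two tabs
    "Region\tSales\tGrowth",                    # tab-separated columns
    "    Nature offers wonderful solutions.",   # same paragraph, space-indented
]
labels = [classify_line(s) for s in samples]
```

The third sample is the same paragraph as the first, indented with spaces instead of tabs, and the rules no longer recognise it as a paragraph start. That is the rigidity the article refers to: each new layout needs a new rule.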
Unfortunately,
as organisations are well aware, their documents very rarely
follow a uniform structure. The image below, taken from ‘There
are no unstructured documents’, a paper presented at the
XML Europe 2002 convention by David Slocombe and Rodney Boyd
of Exegenix, shows different ways in which a list can be presented.
You
will notice that even though these lists are constructed differently,
they’re all trying to say something similar. A computer, ordinarily,
won’t notice. This makes it very difficult to automate the
conversion of documents. And this is where the ability to
recognise structure as humans do, through visual cues,
comes in handy.
Dr
Clarke elaborates: "Human beings look at documents and
know intuitively what the headings are, what the paragraphs
are, what the tables are, what the illustrations are, what
the captions are, what the footnotes are, what the sidebars
are, etc. But to make software that approaches an electronic
document in that way is extremely difficult."
100 per cent accuracy
It’s all
the more remarkable then that Shrikant Pathak, vice-president
and regional director, Americas, at Tata Infotech, can say
with confidence that "absolutely no content is lost through
our solution. It results in a hundred per cent accurate conversion".
The
core Exegenix technology (the part that actually does the
recognition) has two qualities that give it great adaptability.
One is its ability to translate documents into any format,
not just XML. If tomorrow another technology comes along that
is an improvement on XML, the Exegenix technology can be used
to convert documents to that technology without much change
in its defining features. (So, for those wondering if the
‘Ex’ in the company’s name refers to XML, it doesn’t.)
Mr
Pathak offers an example. "A book can be converted from
a paper book or an electronic text version of that to an open
electronic book format," he says. "It can also be
converted to be read on a tablet PC. Once you have the structure
that this technology provides you, it becomes enormously easy
to re-purpose it into any output format you want."
The
other feature is its language independence. If this technology
is to be successful across the globe, it has to be able to
translate documents that are in languages other than English.
And there’s no reason that Exegenix’s technology can’t.
Given
that the Exegenix solution is the only one of its kind at
the moment, it seems like Tata Infotech has a winner here.
An interesting future
The
future promises some interesting times for Exegenix.
Not
only does Exegenix expect ECS to be the accepted technology
for the conversion or marking up of information five to 10
years from now, it also expects to take this technology beyond
XML conversion.
The
company’s customers, many of them Fortune 500 companies at
the cutting edge of their fields, keep coming up with interesting
ideas for further application of Exegenix’s core technology.
Says Dr Clarke, "People we’re working with are thinking
of new ideas all the time. They keep saying, ‘What about this?
Can we do this application? What about offering this service
to our customers?’"
Economists
inspired by biology like to talk of technological co-evolution
as a process by which pioneering technologies make other pioneering
technologies possible. It is clear that Exegenix’s technology
has opened up a whole new area of opportunities for businesses,
and we can expect this pioneering technology to lead to many
more interesting technologies in the future. Whatever those
might be, there is a good chance that Exegenix will remain
at the vanguard of this process.
Creativity, initiative, experience and reputation
Exegenix’s history is a
story of creativity and initiative. Exegenix Canada’s
chief technology officer, David Slocombe — who had earlier
spent some years with Bill Clarke and a few other current
Exegenix hands in a company called SoftQuad, a pioneer
in publishing technology — came to India to work with
a small group of developers at Tata Infotech to create
a ‘proof of concept’ for ideas he had about the conversion
of documents to XML.
In
February 2001 the Tata Infotech board decided to establish
a North American operation to take this proof of concept
and a small team of developers from India and, as Bill
Clarke puts it, "marry that group with some experienced
product developers and architects in North America and
jointly create the technology and take it to market".
Two
companies were set up. One was called Exegenix Canada
Inc, a wholly owned subsidiary of Tata Infotech. Exegenix
Canada has a substantial stake in the other company,
Exegenix Research Inc, which is a pure research and
development company. Exegenix Canada, the sales and
marketing wing of the initiative, markets the ECS solution
to large organisations, including Fortune 500 companies,
across North America and Europe, and hopes to expand
sales to countries like India in the near future.
Experience counts
Exegenix’s history is also
a story of experience. Says Dr Clarke, the company’s
CEO: "Exegenix was formed in May 2001, but the
staff, particularly the senior staff, have a minimum
of 15 years in structural mark-up.
"We’re
the people who brought out the first SGML editing software,
who produced the first HTML (hypertext mark-up language)
software. So we’re known in the XML world for the background
that we have, our experience and for our professionalism."
(SGML, or standard generalised mark-up language, is
the mother language from which XML and HTML were spawned.)
Even
though he has a PhD in astrophysics from the University
of California, and still teaches the subject at the
University of Toronto, Dr Clarke brings to Exegenix
decades of experience in the computer industry, initially
as a scientific programmer, and in the publishing industry,
where he applied his knowledge of computer technology
to publishing.
When
it comes to big clients, Exegenix draws on Tata Infotech’s
reputation as a well-established company. "Exegenix’s
relationship with Tata Infotech is a very important
one, because it gives us greater credibility with large
customers who are not prepared to deal with organisations
that may not be there tomorrow," says Dr Clarke.
"Having that tremendous legacy, that strength behind
us, coupled with our acknowledged experience, understanding
and proven track record in this area, is a very important
factor for a young company like ours."