November 01, 2005
Open Office Document Connector: A Perl Interface for OpenOffice.org and OpenDocument Files
Jean-Marie Gouarne
The OpenDocument format is poised to become a de facto standard for free office software and could be used as a common basis for large-scale content management applications.
Jean-Marie Gouarne is CTO at Genicorp (http://www.genicorp.fr), an IT services provider, and partner in Ars Aperta (http://www.arsaperta.com/en/index.html), a consulting firm focusing on open source software related strategies. His main consulting areas are business intelligence, information systems management and architecture, and software-related legal issues. In addition, Jean-Marie is an OpenDocument specialist and a Perl programmer; he created and maintains the OpenOffice::OODoc CPAN module. He can be reached at jmgdoc@cpan.org.
The OASIS OpenDocument format was officially born this year. Its basic principles and the majority of its semantics and syntax came from the OpenOffice.org project. The OpenDocument format (ODF) is poised to become a de facto standard for free office software and, in the long term, it could be used as a common basis for large-scale content management applications.
So now is the right time to talk about Perl/OpenDocument integration.
The OpenDocument Concept
The ODF is fully documented [1], and everybody can get and use the specification for free. Nobody should be deterred by legal issues or technical obscurities. An ODF-compliant file is nothing more than a compressed archive which contains a few XML members. Both the compression algorithm and the XML schema are open. But, above all, a new philosophy is gaining ground in the office software world: any document should have a life independent of the tool which was used to create it.
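To make the "compressed archive which contains a few XML members" idea concrete, here is a minimal sketch that uses only core Perl modules (IO::Compress::Zip and IO::Uncompress::Unzip) rather than any OpenDocument library. The build_sample_archive and read_members helpers are hypothetical names introduced for this illustration. The sample archive borrows the member names found in real packages (content.xml, styles.xml, meta.xml), but its XML is placeholder text and it is not a valid document.

```perl
use strict;
use warnings;
use IO::Compress::Zip     qw($ZipError);
use IO::Uncompress::Unzip qw($UnzipError);

# Hypothetical helper: build a small zip archive in memory from
# (member name => content) pairs. The result only mimics the layout
# of a document package; it is not a valid OpenDocument file.
sub build_sample_archive {
    my %members = @_;
    my $archive = '';
    my $zip;
    for my $name (sort keys %members) {
        if ($zip) {
            # start the next member of the same archive
            $zip->newStream(Name => $name);
        }
        else {
            $zip = IO::Compress::Zip->new(\$archive, Name => $name)
                or die "Cannot create archive: $ZipError\n";
        }
        $zip->print($members{$name});
    }
    $zip->close if $zip;
    return $archive;
}

# Hypothetical helper: return a hash (member name => uncompressed
# content) for an archive given as a file name or a scalar reference.
sub read_members {
    my ($source) = @_;
    my $unzip = IO::Uncompress::Unzip->new($source)
        or die "Cannot read archive: $UnzipError\n";
    my %content;
    my $status;
    for ($status = 1; $status > 0; $status = $unzip->nextStream) {
        my $name = $unzip->getHeaderInfo->{Name};
        $content{$name} = '';
        my $chunk;
        # read() overwrites $chunk on each call, so accumulate it
        while (($status = $unzip->read($chunk)) > 0) {
            $content{$name} .= $chunk;
        }
        last if $status < 0;
    }
    die "Error while reading archive: $UnzipError\n" if $status < 0;
    return %content;
}

my $archive = build_sample_archive(
    'content.xml' => '<placeholder>text and tables</placeholder>',
    'styles.xml'  => '<placeholder>named styles</placeholder>',
    'meta.xml'    => '<placeholder>title, author</placeholder>',
);
my %members = read_members(\$archive);
printf "%-12s %3d bytes\n", $_, length $members{$_} for sort keys %members;
```

With a real document, passing its file name to read_members instead of the in-memory sample should list the same kind of members, which is all a program needs in order to reach the content without the editing software.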
Such an idea should have been obvious from the beginning, because the document belongs to its author and not to the author of the editing software. However, the "open document" concept sounded strange until recently, due to the particular context of the office software marketplace in the '90s. For most users, there is an unbreakable link between the tool and the content. So the editing software is both a tool and a lock.
But neither a market singularity nor a vendor lock-in policy can fully explain the delay between the first large-scale deployments of desktop software ('80s) and the emergence of a really open document format. The other factor is the cultural frontier between structured and unstructured data. Most IT specialists regarded office documents as unstructured data and, as a consequence, left them out of the scope of mainstream enterprise software, which is dedicated to structured data.
In the last few years, thanks to XML, this frontier has begun to vanish. Technically speaking, it's more and more difficult to see a document as "unstructured data" when you can describe its structure with a DTD or one of the publicly available schema definition dialects (XSD, RelaxNG). Paradoxically, some so-called "unstructured data", such as office documents, may be nothing other than hyper-structured data, i.e., data with complex, non-tabular, and flexible structures. In addition, while XML provides the technological background, information systems must become increasingly able to process documents as well as tabular databases in order to meet business needs, simply because documents are at the heart of business processes and contain a large part of the business knowledge. So minds are slowly changing, and the availability of open formats is triggering more and more direct document processing projects, which can now rely on standard APIs.
From an information systems management point of view, direct document processing has a significant advantage: it avoids the misuse of proprietary macro languages. Macros are useful tools for individual productivity tasks, but they should not be used as a development tool for mission-critical enterprise applications, because they can't be properly supported in the long term and they can't run outside the desktop editing software (which is a poor and insecure platform). On the other hand, direct document processing applications can be written in open and powerful scripting languages, according to good software engineering practices, and can run in more robust environments. All that is good news for the IT department and, ultimately, for the business.
Introducing OpenOffice::OODoc
The OpenOffice::OODoc module is one of the many answers to these recent needs.
Born as a private project at Genicorp [2], this toolbox was first used in a few mid-sized organizations in several business sectors (legal, food, and healthcare) to allow enterprise applications to generate very simple operational documents or reports automatically. Then the module became open source and was made available as a CPAN package at the beginning of 2004 [3]. The number of users began to grow and, as a consequence, some bugs were reported and fixed. Two major changes were introduced later.
The first one (1.301) was a rework of the basic access layer, due to performance issues when processing very large documents.
More recently,
This toolbox is, to some extent, intended to allow a simple, database-like access to any object in any OpenDocument-compliant file. It relies on the file format, and not on the features of a particular office software. More precisely,
Before going further, let's look at a very short script which illustrates the general logic of the interface:
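    use OpenOffice::OODoc;

    # open the document
    my $doc = ooDocument(file => 'myfile.sxw');
    # change the content of cell B4 in the table named 'MyTable'
    $doc->cellValue('MyTable', 'B4', 'Hello');
    # append a paragraph at the end of the document
    $doc->appendParagraph
        (
        text    => 'The last paragraph',
        style   => 'Text body'
        );
    # write the changes back to the file
    $doc->save;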
This example is (I hope) self-documenting.
The ooDocument() function is a constructor; it returns an object, $doc, which is a document interface associated with an OpenOffice.org Writer (SXW) physical file.
The next two instructions do something with the content of the document. The example shows a cellValue() method, which retrieves or changes the content of a table cell (as you can see, the cell is selected by table name and user-oriented logical coordinates, not with an arcane XPath expression), then an appendParagraph() method which, unsurprisingly, creates a new paragraph with the given text and style at the end of the document. But the most important thing, for now, is the $doc object. It owns every content processing method and, above all, it hides the details of the physical access to the document.
At the end, a save() call is issued in order to physically commit the changes made in the document by the previous instructions. Before this last instruction, the original file remains unchanged.
This example is only intended to show the basic principles of this interface:
OpenOffice::OODoc is an OpenDocument-aware layer above Archive::Zip and XML::Twig. It hides the file compression and decompression steps, spares the user from learning the OpenDocument XML specification, and provides a compact but readable document-focused language.
Simply put, its application area covers
Basic Functionality
The design goal of OpenOffice::OODoc is document processing automation, with a particular focus on integration between documents and enterprise data. In other words, this API allows the user to retrieve, read, update, delete, or create any part of a document considered as a data structure, but it contains neither a layout rendering nor a format conversion utility. For example, you can use it to update a table, to create a bulleted list, to change the font size and the background color of a given paragraph, or to switch a page orientation from portrait to landscape, but you need OpenOffice.org or another OpenDocument-compliant desktop application to print the result or export it to PDF or some proprietary office document format.
Another point must be clearly explained. OpenOffice::OODoc works with open documents in general and is not limited to a particular class of document. In other words, it can be used with spreadsheets, presentations, or drawings as well as text documents. It's a logical consequence of the OpenDocument definition itself. In proprietary office suites, there is an ad hoc format for each document class. In the OpenDocument world, a given object is always described by the same data structure, whatever the class of the containing document. For example, a table cell can be retrieved in the same way in a spreadsheet (Calc) as in a text (Writer) document.
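The 'B4' address passed to cellValue() in the first script is the kind of user-oriented coordinate that makes sense for a Writer table and a Calc sheet alike. As an illustration only, the following sketch shows one way such an address could be translated into numeric positions. The cell_address_to_indices helper is hypothetical and is not part of OpenOffice::OODoc; the zero-based numbering is an assumption of this sketch, and the module's own conversion may differ.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of OpenOffice::OODoc: convert a
# spreadsheet-style address such as 'B4' into (row, column) indices.
# Assumption of this sketch: both indices are zero-based.
sub cell_address_to_indices {
    my ($address) = @_;
    my ($letters, $digits) = $address =~ /^([A-Za-z]+)([1-9][0-9]*)$/
        or die "Invalid cell address: $address\n";
    # Column letters behave like a base-26 number without a zero digit:
    # A..Z are 1..26, AA is 27, and so on.
    my $column = 0;
    for my $letter (split //, uc $letters) {
        $column = $column * 26 + (ord($letter) - ord('A') + 1);
    }
    return ($digits - 1, $column - 1);
}

for my $address (qw(A1 B4 AA10)) {
    my ($row, $column) = cell_address_to_indices($address);
    print "$address => row $row, column $column\n";
}
```

Because the address says nothing about the class of the containing document, the same translation would serve for a text document and for a spreadsheet.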
There are two possible levels of use:
Content, Styles, and Metadata
In the OpenDocument world, there is an explicit separation between several logical spaces, corresponding to several XML members in the physical archive. Because it's very close to the OpenDocument specification (and far from an interactive editing tool), OpenOffice::OODoc doesn't hide this separation, so the user must know about it.
The most important spaces (at least for a first article) are:
document-content (to select the correct paragraphs and provide the correct style identifier to each one), document-styles (to define and register the new style) and document-meta (because the title belongs to the metadata) respectively.
The following script, where the file name, the search filter and the title come from the command line, shows one of the possible ways to get such a result:
    use OpenOffice::OODoc;

    my $filename    = $ARGV[0];
    my $filter      = $ARGV[1];
    my $title       = $ARGV[2];
    # create the document connectors
    my $c = ooDocument
        (
        file    => $filename,
        member  => "content"
        );
    # the second and third connectors share the file already opened by $c
    my $s = ooDocument
        (
        file    => $c,
        member  => "styles"
        );
    my $m = ooMeta(file => $c);
    # create the named paragraph style
    my $bgcolor = rgb2oo("yellow");
    $s->createStyle
        (
        "MyCenteredText",
        family      => 'paragraph',
        parent      => 'Text body',
        properties  =>
            {
            'fo:margin-left'        => '1.8cm',
            'fo:margin-right'       => '1.8cm',
            'fo:text-align'         => 'center',
            'fo:background-color'   => $bgcolor
            }
        );
    # select every matching element and apply the style to the paragraphs
    foreach my $p ($c->selectElementsByContent($filter))
        {
        $c->textStyle($p, "MyCenteredText") if $p->isParagraph;
        }
    # put the new title
    $m->title($title);
    # commit the changes
    $c->save;
In this example, the first instruction creates a document interface (
The third instruction instantiates a metadata-focused interface. This object must be created using the ooMeta function, because it differs deeply from the content- and styles-focused objects. It's the only interface with predefined title, subject, author, description, ... accessors.
The
Then we register the title of our choice using, without surprise, the title accessor from
Finally, the save method is called from the
Conclusion
References
[1] See http://www.oasis-open.org/committees/download.php/12573/OpenDocument-v1.0-os.sxw for the full specification and http://en.wikipedia.org/wiki/OpenDocument for a good abstract and some interesting links.
[2] http://www.genicorp.fr
[3] http://search.cpan.org/dist/OpenOffice-OODoc
TPJ