Dr. Dobb's | BibPort: Creating Bibliographic References

BibPort extracts relevant information from a set of legacy documents to create a database of bibliographic references using Visual Studio Tools for Office in combination with Visual Studio 2008 and Word 2007.

Bibliographic software such as BibTeX and EndNote help you construct customized databases of literature references and manage the citation and formatting of a list of references, making the insertion of new citations into documents a seamless operation. A database of citations serves as the repository for each of these tools and can record all pertinent information about specific publications—author names, title, publication type, and date of publication.

However, when a large collection of articles already exists, the task of creating the database of references can be daunting. Manually typing or copying-and-pasting from existing documents into databases can be tedious and error prone. And with existing bibliographic software, there is no pre-packaged solution to assist in automating the initial manual entry of references into the database.

In this article, we present BibPort, an automated tool that extracts relevant information from a set of legacy documents to create a database of bibliographic references using Word 2007's reference-management facilities. In the context of bibliographic text mining, this article is a case study in using Microsoft Visual Studio Tools for Office (VSTO) (msdn .microsoft.com/office/understanding/vsto) in combination with Visual Studio 2008 and Word 2007.

Design Principles for BibPort

Many organizations have specific formatting styles for bibliographic citation; the IEEE, ACM, APA, and MLA. A cursory web search yields several dozen citation styles in widespread use. Therefore, an important BibPort design goal addresses the flexibility needed to work with multiple citation formats. BibPort is not hard-coded to a specific format; new citation formats can be added by supplying new parsers. Furthermore, while individual bibliographic-management tools tend to have their own format, the choice of database format is flexible within BibPort.

BibPort has a plug-in architecture that has a parser manager as a front-end to any number of citation formats. Similarly, an export manager serves as a back-end for multiple export engines for citation database formats. Input/output processes are modularized such that any export engine is able to understand the output of any parser.

Parsing a Citation

Distinguishing the particular type of a reference—a journal, conference paper, or book—within a specific citation format requires an inference to be made based on textual cues appearing in each entry. The inference of the reference type can be extended to differentiate between numerous citation formats. When possible, users may be consulted to disambiguate the conflicts resulting from typographical errors.

Ambiguity in a Typical Reference List Entry

Figure 1 is a typical entry in a reference list formatted in APA style. In this entry, the author "A. Provenzale," has a paper entitled "Page Swapping is Bad. What You Should Know" that has been accepted by "Memory Management and You: Novel Memory Management Techniques." A noticeable problem occurs in this entry due to the title containing two sentences. It is impossible for the parser to determine where the specific title ends and journal name begins. Underlining the journal name is therefore crucial to the successful interpretation of the reference. This illustrates the need for the parser to understand the types of visual formatting cues that can be found in reference lists. The capability to distinguish visual cues requires at least rudimentary understanding of text formatting (underlining and italicization) of a wordprocessor's file format.

Text Formatting Cues to Assist in Resolving Ambiguities

Properly interpreting a reference requires nontextual cues offered by text formatting. To obtain nontextual properties, BibPort requires a mechanism for understanding the document's underlying file format. A manuscript usually follows the standard format of the host wordprocessing tool, such as the structure of a Word document. BibPort understands the Word format and extracts the nontextual information in the citation.

Mining a reference through visual cues is not easy. Many common formats, including the Word document format, encode information in a binary format rather than plaintext. Furthermore, mining the visual cues is tied to a specific wordprocessor's output and is difficult to generalize due to differences in the file formats of wordprocessors. The situation has additional complexity because a parser may need to evolve with the host wordprocessor. For example, each new version of Word tends to change the standard document format in some way, requiring either a more complicated parser or several parsers to decode all formats.

BibPort includes a format-agnostic parser that processes the tokens of a document without understanding the specific underlying format. BibPort's own internal data structure can be extended with a wrapper to integrate with the host wordprocessor's data structure to deliver tokens to the parser. Many wordprocessors provide an API to access the internal structure of documents. For example, Word allows such interaction through VSTO and the Word Object Model (WOM).

Ambiguity Within Reference Types

Ambiguity within a single grammar rule can be resolved partially through nontextual information, which can serve as a guide to delimit the beginning and ending of certain elements (a journal name) of a reference list entry. However, additional challenges remain when parsing a full set of reference types.

The Typical Style

Creating the BibPort grammar for a citation format is not difficult—the process is similar to defining the grammar for a simple programming language. However, there are several citation formats that fail to distinguish between similar types of publications. In the example BibPort grammar for APA (Figure 2), you can see that the grammar productions distinguish between a book and a magazine. At a visual level, the title of the book is completely underlined, but the article title is not. Such visual cues can serve as attributes to assist the parser for a specific grammar. Therefore, there is no ambiguity between a book and magazine in the APA grammar when nontextual information is available.

However, note that the grammar for the journal simply references the grammar for the magazine. This is because the APA style does not distinguish between magazines and journals. However, other citation formats may decide to treat magazines and articles differently, and many management tools distinguish between magazine articles and journal articles. Therefore, even though the grammar may be ambiguous, it's important that the parser have some mechanism for disambiguation so that integration between different styles is possible.

In a grammar specification, optional information is specified by square brackets; for instance, the volume and issue information for the magazine in Figure 2. It is difficult to analyze a citation grammar to produce a parser that's aware of all ambiguities. User-directed parsing needs to disambiguate some citation formats. Most citation formats permit the description of several dozen types of references. In terms of selecting candidates for ambiguous references, BibPort streams every reference list entry to the parser for each reference type. This way, parsers can remain simple, yet all ambiguities are always detected with candidate suggestions.

After an ambiguity is found, users must be involved. Of course, BibPort output should be reliable, so BibPort does not guess about the correct classification of an ambiguous reference without consulting users. Because BibPort is integrated with Microsoft Word, a pop-up dialog menu for each ambiguous reference list entry is displayed to users, who are asked to classify the reference from among the recommended candidates. A similar action can be taken for a reference for which the parser cannot find a suitable production match. An interface lets users populate a data structure with appropriate information, as well as indicate the reference type.

BibPort in Action

One of the new concepts used in Office 2007 is the "ribbon," which replaces the old menu system from Office 2003. With ribbons, users interact with toolbar-like entities that allow switching to different contexts based on the task at hand. Because VSTO is designed to work with Office 2007, it lets programmers design ribbon elements. In Figure 3, we have placed the BibPort controls in the "References" ribbon, which contains all of the native Office bibliographic tools.

Because BibPort is a C# application hosted within the VSTO framework, it can use the standard mechanisms to present dialogs and other types of elements to users. Although BibPort has no UI elements during much of its execution, user-driven disambiguation is made available through a standard Windows Form. Figure 4 demonstrates this process via a journal article, which is not distinguishable from a magazine article in APA style. BibPort simply presents users with each possible interpretation of the reference, letting them guide the tool.

As references are filtered through BibPort, they are programmatically inserted into Word's Source Manager, which is part of Word 2007's new reference capabilities. The Source Manager acts as a database for references, letting users easily insert citations. Although BibPort supports this out-of-the-box, you can also use alternative database formats. BibPort can simply ask for an output filename at the beginning of the process and export the references to the file in another program's native format. However, because the Source Manager is available for manipulation via VSTO, we found it more interesting to target this type of operation. Figure 5 shows the Source Manager after importing a set of references via BibPort.

Implementing BibPort with VSTO

Microsoft's VSTO provides a framework that can host plug-ins written in .NET languages within Office programs. The application and its open documents are represented as objects within a VSTO plug-in. This collection of VSTO objects permits manipulation of the program, as well as navigation of the underlying contents of documents in a structured manner. These objects are available for all supported file types used by Office programs in addition to those supported by after-market file conversion filters. VSTO represents an important abstraction that offers a means of processing many documents quickly and easily. To provide access to Word and its documents, VSTO provides a set of objects conforming to the WOM. With this set of objects, you can inspect the contents of a document in a detailed way.

User Interface Elements

BibPort has been ported through three iterations of Visual Studio, beginning with Word 2003 and the first version of VSTO. Although some changes have been made through the first two revisions, the BibPort code base has remained relatively unchanged between Visual Studio 2003 and 2005. The first two versions targeted Word 2003, which used traditional menu-driven systems. Presenting BibPort to users involved inserting a new menu specifically for BibPort functionality. This insertion was accomplished via several lines of code that registered the menu and created the hooks into BibPort at Word startup.

With Word 2007, ribbons replaced toolbars, requiring some change in the creation of these UI elements. Visual Studio 2008 includes a Ribbon Designer that assists in the creation of new ribbons and extensions. Figure 6 demonstrates the evolution from creating menus programmatically to creating ribbons through a wizard. This capability brings menu creation to the same level of point-and-click design that is common in Windows Forms in Visual Studio.

Nontextual Cues

To ascertain nontextual cues, BibPort accesses a sequential container called "Words" located within a Document object, which corresponds to an open document within Word. In Figure 7, the Words container holds each individual parsed word of the document separately, as well as flags indicating whether specific formatting is activated (such as underlining, italics, or boldface). BibPort accesses the attribute flags within the Word object to obtain the visual cues used to assist in disambiguating a reference.

Using the WOM solves the difficulty of requiring understanding of file formats, but introduces a second problem: Because Word defines the manner in which words and characters are tokenized in documents, there will be several categories of tokens that are not split into the form needed by BibPort's parsers. One particular instance of this is the distribution of punctuation among tokens. The Words array was designed to be used with English words as its logical unit instead of more discrete units with punctuation being independent. Hence, WOM tends to group punctuation marks with words instead of representing them as separate tokens.

Listing One is a sample of the BibPort parsing code. This particular code listing is invoked to determine whether the current word is the beginning of the title for a book. There are several features in this snippet worth noting. First, because BibPort is programmed as a Word AddIn, the various WOM entities are inserted directly into a namespace called ThisAddIn. There is a single Document reference that is available within the Application object that indicates the current working document in Word.

A second observation concerns the code's manner of inspection of the token. In the APA style, the only means of detecting a book title is whether it's underlined. If it's not, the reference certainly is not a book, so BibPort stops parsing. As you can see, this is directly tested by a comparison of a particular attribute of a word against another value within the WOM that represents a single underlining.

Bibliography Management

In Word 2007, the WOM has been expanded to allow programmer access to the new Source Manager. However, the interface provided by WOM is not native; inserting a source involves passing a string containing the XML representation of the source. Unfortunately, this schema was not available on the Internet at the time of writing, so we have reverse engineered the tags used to describe a source; see Table 1. The references in the Source Manager can also be found in the XML file in the user's Application Data directory under Microsoft\Bibliography\Sources.xml.

Adding a new reference involves a simple call to the VSTO method called Globals.ThisAddIn.Application.Bibliography.Sources.Add(xml), but this addition can have some level of complexity because the the source's Tag field must be unique. Due to the COM facilities used in VSTO, exceptions are not trivial. These are typically mapped to an integer and returned as an error code.

Tag	Contains
<b:Source>	All tags listed below; represents a single reference
<b:Tag>	A unique identifier string for this reference as a string. This must be present.
<b:SourceType>	A reference type name as a string, e.g. JournalArticle, Book, ArticleInAPeriodical, or Misc.
<b:Guid>	A unique identifier, often provided by Word, as a string. This is not required when inserting a reference, but is generated. This is used in addition to the Tag mentioned previously.
<b:Author>	This tag is interesting: It is used in two places, one inside another. There is a "root" use, which contains another Author tag along with Editor, Translator, Compiler, and another Author. Each of these tags under the root Author tag contains a NameList.
<b:Editor>	A NameList of the editors.
<b:Translator>	A NameList of the translators.
<b:Compiler>	A NameList of the compilers.
<b:Namelist>	Any number of Person entities.
<b:Person>	A First, Last, and Middle entity. Represents a person's name.
<b:First>	A string that is a first name.
<b:Middle>	A string that is a middle name.
<b:Last>	A string that is a last name.
<b:Title>, <b:Year>, <b:Month>, <b:Day>, <b:City>, <b:StateProvince>, <b:CountryRegion>, <b:Pages>, <b:Volume>, <b:Edition>, <b:Issue>, <b:Medium>, <b:ShortTitle>, <b:StandardNumber>, <b:Comments>	Strings corresponding directly to the tag name.

Table 1: XML tags used to describe a source.

Conclusion

There are many challenges that make text mining difficult [1]. The wide variety of file formats—RTF, .doc, .wpd, and the like—require a text-mining application to parse multiple document types, while also being able to output to multiple formats. BibPort hides such complexity transparently behind an abstraction layer to help you in writing text-mining applications in a more generalized manner. The combination of VSTO with VS 2008 removed many of the accidental complexities in processing Word files to extract bibliographic information. Furthermore, Word 2007's Source Manager provided a convenient repository for this information, and VSTO provided the mechanisms necessary to use this new resource.

There are several projects that share similar goals to BibPort. There is a strong need to mine bibliographic citations from documents available from a web search and digital libraries [2]. Perhaps the most well-known example is CiteSeer [3], which is a popular website that understands citations in different formats to allow cross-referencing of research papers.