Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

recls Refactored


September, 2004: recls Refactored

Matthew Wilson is a software development consultant for Synesis Software, and creator of the STLSoft libraries. He is author of the forthcoming book Imperfect C++ (Addison-Wesley, 2004). Matthew can be contacted via http://imperfectcplusplus.com/.


In the previous five columns, I introduced recls, a platform-independent library that provides recursive filesystem searching, and demonstrated techniques for integrating such a C/C++ library with C++ ("normal" classes, and STL sequences), C#, COM, D, Java, and Ruby by implementing recls mappings for those languages. The source for all the versions of the libraries and the mappings are available from http://www.cuj.com/code/ and http://recls.org/downloads.html.

This month, I take a break from new mappings to cast an eye over the library and consider areas for refactoring. I also expand the existing implementation to provide two new areas of functionality. The changes I describe here take the recls library from Version 1.4.1 to 1.5.1. (As usual, recls relies on STLSoft, and Version 1.5.1 requires Version 1.7.1 of STLSoft; http://stlsoft.org/downloads.html.)

API Changes

There've been three changes to the recls library from the perspective of client code. The first of these is the addition of the following four functions, which provide simple aids to directory management:

recls_bool_t Recls_IsDirectoryEmpty
  (recls_char_t const       *dir);
recls_bool_t Recls_IsDirectoryEntryEmpty
  (recls_info_t      hEntry);
recls_filesize_t Recls_CalcDirectorySize
  (recls_char_t const  *dir);
recls_filesize_t Recls_CalcDirectoryEntrySize
  (recls_info_t  hEntry);

(The Recls_IsDirectoryEntryEmpty() function has already been put to good use in the new Remove Empty Directory shell extension; http://shellext.com/.) These functions are implemented using the Recls_SearchProcess() callback function; see Listing 1.

Implementation Changes

Up to Version 1.5, the UNIX and Win32 implementations were largely contained in the recls_api_unix.cpp and recls_api_win32.cpp files. Contained within each of these were the implementations for the ReclsSearchInfo and PlatformDirectoryNode classes, which represent the search object and a directory node, respectively.

Before I could reasonably expect to add FTP searching into the mix, this situation had to be clarified, and the code factored into a sensible delineation from its original form. The definitions of the ReclsFileSearch and ReclsFtpSearch classes are contained in the ReclsFileSearch.h and ReclsFtpSearch.h header files. Since the FTP functionality shares most of the recls C API, the internal classes must use polymorphism. Hence, these two specific classes now derive from the interface class ReclsSearch.

The PlatformDirectoryNode class has been morphed into the two platform-specific versions of the ReclsFileSearchDirectoryNode (in ReclsFileSearchDirectoryNode_unix.h/.cpp and ReclsFileSearchDirectoryNode_win32.h/.cpp) and ReclsFtpSearchDirectoryNode (in ReclsFtpSearchDirectoryNode_win32.h/.cpp) classes.

This new structure has cleaned things up considerably, and I now have a firm footing for the introduction of new types of hierarchically represented data to the library. One thing worth covering is that, as mentioned a couple of months ago, there are considerable similarities between the class definitions and/or implementations for the UNIX and Win32 versions. This is also the case between the FTP and file versions. When you see this level of similarity, it rings the refactoring bell, and you can be assured I'll continue to monitor the situation to avoid copy-paste hell. (All the Recls* prefixes to class and file names are pretty unnecessary, but just slipped off the end of the to-do list for this month. Expect them to contract somewhat in the future—since they're internal implementation classes, this can be done without affecting client code.)

Multifilter Patterns

The first major enhancement to the current functionality is the ability to supply composite filter patterns (for example, "*.cpp:recls*.h:makefile;"). The API uses the platform's nominal separator (":" on UNIX, ";" on Win32). As well as being a convenience for anyone wishing to use recls in this way, it also represents a significant efficiency gain. N filters are tested in each directory searched, rather than conducting N consecutive searches from the search root directory. Although this means that there are N times as many file enumeration class instances created—either unixstl::glob_sequence or winstl::findfile_sequence—this is still the preferred option because the bulk of the cost is in traversing the filesystem hierarchy. The results of a test run over my work drive show the advantage of having the recls API handle the multiple patterns; see Table 1.

Multipattern filtering is handled via the new (with Version 1.7.1) STLSoft searchspec_sequence class. It is parameterized with the underlying sequence type whose model it emulates (glob_sequence), and uses the string_tokenizer template to split the pattern. The iterators that it presents to client code actually iterate over a number of ranges of the underlying search sequences, corresponding to the number of patterns. In other words, it "linear-izes" N ranges into a single one.

The reason I went to the effort of writing such a class was not just to advance the cause of generic programming. In fact, the main reason is that I wanted to be able to insert a sequence class into the recls implementation with minimal disruption to the existing, and tested and working, code. Because searchspec_sequence presents a functionally identical interface to its underlying sequence, this was almost a no-brainer. For the code archaeologists amongst you, just search for the USE_SEARCHSPEC_SEQ preprocessor symbol in the latest recls source to see what a relatively easy operation it proved to be.

Eventually, I plan to use a custom wildcard component to reduce the searches to one per directory, and also to homogenize and enhance the syntax of the patterns between the different platforms.

FTP Recursive Searches

In the first installment of this column, I mentioned that one of my eventual plans for recls was to be able to recursively search other types of hierarchically stored information. The first of these is recursive searching of FTP sites.

Because the latest version of STLSoft introduces the new subproject InetSTL ("STLSoft for Internet programming"), I was able to borrow the tested implementation of the filesystem searching and adapt it to FTP searching using the inetstl::findfile_sequence class. Before I did this, however, I carried out the refactoring described earlier, since the code had turned into something of a big sludgy mess.

When considering ways in which the recls API could be expanded, I looked at several alternatives. One option is to have separate implementations, but this isn't attractive either in terms of code bloat or flexibility. Another option would be to use the existing API functions, but add more flags to the RECLS_FLAGS enumeration for the new search types. The problem with this approach is that there's a limit to the number of values that can be represented by the 32-bit value. Also, there are additional parameters required for FTP: the host, username, and password. I'm considering a string protocol-based solution, whereby any type/protocol may be expressed in the root-directory parameter to Recls_Search(), but am holding fire on that until we see the format and semantics of other recursive systems that may be incorporated into the library. For the moment, I simply added another function.

Because FTP deals with files, all the existing functions dealing with file entries can be used for FTP search results. Hence, the only change to the C API is the addition of the Recls_SearchFTP function:

recls_rc_t Recls_SearchFtp( recls_char_t const   *host
                          , recls_char_t const  *username
                          , recls_char_t const  *password
                          , recls_char_t const  *searchRoot
                          , recls_char_t const  *pattern
                          , recls_uint32_t      flags
                          , hrecls_t           *phSrch);

In expanding the library for FTP, it became clear that I made a few errors in the design of some of the mappings. Conversely, just as I was pleasantly surprised when I ported the library to UNIX, I was similarly relieved to discover how easy it was to port the library to FTP. The reason this was significant is that FTP, unlike filesystem enumeration, supports at most one concurrent directory enumeration for each connection; in the UNIX and Win32 filesystem enumeration implementations, an open enumeration exists for each level in the directory search.

The effect of this is that the ReclsFileSearchDirectoryNode and ReclsFtpSearchDirectoryNode classes have similar implementations, but where the former uses the file sequence to maintain an open enumeration on the filesystem directories, the latter fully enumerates all directories into a container and maintains only a single open file enumeration for the entities. Thankfully, the bulk of the existing implementation works as well with a list of directory names as it did with a directory enumerator. This is another example of STL's power and elegance.

The FTP functionality is currently only available on Win32, using the WinInet API. I plan to provide a UNIX implementation in a future release.

Changes to the Mappings

Although the effect of the introduction of FTP searching on the C API was minimal, the same cannot be said of the effects on the mappings. Some mappings reflect admirable prescience on my part in getting the initial design right; others, sadly, do not.

C++. The original C++ mapping provided a FileSearch class, which acted as an enumerable object returning FileEntry instances. The changes to accommodate the FTP functionality were relatively straightforward. I moved all the functionality that processed the recls search handle into an abstract class Search, from which FileSearch and the new FtpSearch class now derive. Client code has remained untouched, requiring only a recompilation. Because the file and FTP classes are polymorphically related to their base class, you can write functions to process hierarchies of files independently of whether they reside on a local filesystem or on an FTP server.

C#. With the C#/.NET mapping, I followed the same path as the C++ mapping: I introduced a common abstract base class Search, and derived the existing FileSearch class and the new FtpSearch class from it. Although client code does not need to be rewritten, it did have to be recompiled, since the metadata has changed due to the change in the inheritance relationship of the FileSearch class. This is not really the done thing with .NET, which holds a released assembly as inviolate as a COM interface, so I bumped the AssemblyInfo version to 1.1. Anyone that's been using recls and has been concerned with the version number will no doubt have been building it in signed assembly form. With a change in version number, they can upgrade, or not, without concerns about breaking existing client code.

COM. With the various hassles and potential upset to existing client code, it was a relief to work through the changes for the mapping to COM, one of my favorite technologies. Say what you like about it—and one has to accept that it's hard to understand and a lot of work to use—but COM still beats the other languages/technologies hands down when it comes to extensibility. The introduction of FTP functionality was very straightforward and affects existing file-searching client code not one whit. Not even a recompilation of client processes is required.

All that was needed was to add the IFtpSearch interface to the IDL file, and derive the FileSearch class from that, in addition to IFileSearch. The implementation of IFtpSearch::Search() is virtually identical to that of IFileSearch::Search(), except for the fact that it takes the extra host+username+password parameters and calls Recls_SearchFtp(). As well as not requiring recompilation of client code, the COM mapping does not require linking of any client code, to incorporate WinInet.lib, which has to be done for client code of the C API, and the C++, D, and STL mappings.

D. There've been a number of little things wrong with the std.recls module for a while, most notably that I used ThisCase() rather than thisCase() for methods, contrary to the D Standard. I've been itching to upgrade it, and have now done so along with the changes for the FTP mapping. Since D is not yet at the 1.0 stage, I felt comfortable (after canvassing opinion on the D newsgroup) in hijacking the original badly named Search class for the new abstract base class, and adding the FileSearch and FtpSearch classes. The Entry class remains the same.

Java. Alas, for Java, I must confess I pretended for a moment that I was the CTO of an omnipotent software company, and just went ahead and changed things at will. In some senses, this is justified since the original implementation had some mistakes in it anyway. The main problem had been that I'd forgotten to remove my C++ head and don the Java one when writing the Java classes. This meant that the finalize() methods of the original Entry and Search classes were spelled with a capital "F"; since you need to be especially careful with Java to avoid resource leaks, this was hardly good form.

Given the decision to tear into the first version, I took advantage and turned the Search class into an abstract base, from which the new FileSearch and FtpSearch now derive. It all works peachy now, and the test program readily instantiates one or the other depending on whether you've specified an FTP host on which to search. (A nice side effect of the considerable latency of FTP enumeration is that the Java client now seems just as fast as those of its C-family brethren.)

The implementation of the JNI methods has improved somewhat, mainly due to the separation of the original Search class functionality into the Search and FileSearch classes. One point I'd stress: When you've implemented all the new JNI functions, don't forget to export them from your Win32 DLL (via the EXPORTS section in the .DEF file), or your class will fail to load at runtime due to unresolved methods.

Ruby. I made a mistake in the original Ruby mapping, in making the Entry class a member type of the FileSearch class. Clearly, once the FtpSearch class is introduced, the FileSearch::Entry class can't really serve for FTP searches as well. In this case, I decided to break the existing implementation, and any dependent client code, by moving the Entry class into the Recls module, as a peer class of FileSearch and FtpSearch. This really isn't the done thing as a rule, but in this case I felt I could get away with it because the overwhelming majority of uses of the Entry class would be as part of an each{} construct, where the specific type of the entry is not mentioned. Since a (FileSearch::)Entry instance cannot be created in Ruby code, but can only be retrieved from a FileSearch (or an FtpSearch), it's hard to envisage any client code that would have explicitly stipulated the Entry's full type at all.

As far as the implementation goes, it was trivial to introduce the FtpSearch class. The original implementation of FileSearch_each was factored into a common function, Search_ each_helper(), which both FileSearch_each and the new FtpSearch_each call, passing in their search handles obtained from the requisite recls C API function.

STL. The STL mapping was, perhaps, the trickiest to effect. STL sequence (and other) classes are not generally related by runtime polymorphism, but rather by sharing semantic similarities between polymorphically disparate types. Hence, it doesn't seem appropriate to derive from a common base for sequences for file and FTP searching.

Normally, it's better to simply have separate types, and that was my inclination. However, I instead chose to define an overload to the reclstl::basic_search_sequence constructor when the preprocessor symbol RECLS_API_FTP is defined. If I'm honest, this is probably not the best thing, and was mainly done as an expediency, since the refactoring of the iterator, value-type, and shared-handle classes to interoperate with two sequence classes is a nontrivial undertaking, and I was running out of time. Although I expect to have addressed this by the next release, I invite recls users to let me know how they'd prefer to see the STL mapping refactored.

Further Work

There are a few more things I'd like to do with refactoring the library before progressing to handle different types of information, or adding new mappings, but I suspect you would feel a bit cheated if I didn't bring a new language to the party next time. As for what that may be, I shall refrain from prediction (since I've been so bad at it thus far). Whatever language(s) we move on to next, I shall continue to refine the implementation and interface of the library, and to introduce new types of hierarchical information from time to time.


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.