


Introducing recls Mappings


In the November column, I introduced you to recls, a platform-independent library that provides recursive filesystem searching. Hopefully, you've downloaded and used Version 1.0 (http://www.cuj.com/code/ or http://recls.org/). In addition to providing a single API for use from C/C++, the library also provides mappings to different languages and technologies. These mappings form the main focus of this column for the next few installments.

This month, I introduce the first three mappings. The C++ wrapper classes mapping was included in the library's 1.0 release. The other two mappings are C++ STL and C#. Together, these make up Version 1.1 of the library.

Building recls

My intention is for recls to be as standalone as possible (except for system and mapping-related libraries), with the proviso that it relies on the STLSoft libraries. Specifically, the code requires Version 1.6.5 (or later) of the STLSoft libraries (http://stlsoft.org/downloads.html and http://recls.org/). The makefiles also require that you define an environment variable STLSOFT_INCLUDE to refer to the absolute path of the directory in which the STLSoft library files reside. As the author of both libraries, I assure you that they work together. And since the raison d'etre of the recls project (and this column) is to learn about language integration, it seems a reasonable shortcut to what would be a daunting coding task. The STLSoft code is almost entirely used within the recls implementation, and does not form part of the public interface. Other than that, it only appears in the C++ STL mapping.

There are new STLSoft components that have not made it into a distribution but are used within the implementation of recls 1.1. (At this writing, the planned release of STLSoft 1.7.1 is months away, lurking behind too many commitments.) These files are, therefore, included in the recls 1.1 archive and in a patch available from the STLSoft and recls web sites. Once installed, you should have no problems. Let me know if you find any and I'll post updates to the recls web site.

I'm providing makefiles for all of the supported compilers, and some IDDE project files (initially Visual Studio 98), but you have to handle your own static/dynamic-library builds and such. Naturally, I'll respond to bugs in the recls source and try to help where I can. However, the only IDDE I'm anything like expert in is Visual Studio 98, so you're largely on your own. But if you want to create project files and submit them, I'll happily post them on the web site for others.

recls Improvements

There are a few changes to the core library between Versions 1.0 and 1.1. The obvious difference is that there are two new functions:

  • Recls_GetDirectoryPathProperty() provides both directory and, for operating systems that have them, drive. In other words, this can be thought of as the location of a given file entry.

  • Recls_GetSizeProperty() is required for mappings that cannot use the raw structures, such as the C# mapping.

I've also amended how return codes are defined. They were formerly defined as:

#define RECLS_RC_NO_MORE_DATA   \
             ((recls_rc_t)(-1 - 1003))

but because the recls_rc_t type is defined within the recls namespace, C++ client code that might reference the return code but not have used the type (whether by nice using declarations, or naughty using directives) would not compile. I've fixed this by defining the return code via a macro, as in:

#if !defined(RECLS_NO_NAMESPACE)
# define RECLS_QUAL(x)          ::recls::x
#else
# define RECLS_QUAL(x)          x
#endif /* !RECLS_NO_NAMESPACE */
 ...
#define RECLS_RC_NO_MORE_DATA \
      ((RECLS_QUAL(recls_rc_t))(-1 - 1003))

This works fine with both C and C++ compilation. (Of course, I should probably make it an enum, but it's worth pointing out this technique, which is useful for functions and types shared between C and C++.)
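The qualification trick can be tried in isolation. What follows is a minimal sketch rather than the recls headers: the demo namespace, demo_rc_t, DEMO_QUAL, DEMO_RC_NO_MORE_DATA, and is_end_of_search() are all hypothetical names standing in for their recls counterparts, and only the C++ side is shown.

```cpp
#include <cassert>

// Hypothetical stand-in for the library's namespace switch.
#if !defined(DEMO_NO_NAMESPACE)
namespace demo {
#endif

typedef int demo_rc_t;

#if !defined(DEMO_NO_NAMESPACE)
} // namespace demo
# define DEMO_QUAL(x)   ::demo::x
#else
# define DEMO_QUAL(x)   x
#endif

// The cast names the type with full qualification, so the macro expands
// correctly in any scope, whether or not the client has imported the type.
#define DEMO_RC_NO_MORE_DATA ((DEMO_QUAL(demo_rc_t))(-1 - 1003))

// Client code at global scope: no using declaration or directive for demo.
inline bool is_end_of_search(int rc)
{
    return rc == DEMO_RC_NO_MORE_DATA;
}
```

Without the DEMO_QUAL() wrapper, is_end_of_search() would fail to compile, because demo_rc_t is not visible at global scope.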

Under the covers, the library has changed in a couple of ways. First, it now returns directories as search items as well as (or instead of) files. In other words, the Recls_Search() function pays attention to the RECLS_F_FILES and RECLS_F_DIRECTORIES flags, where it previously ignored the flags and always enumerated files. A corollary to this is the introduction of a return code RECLS_RC_INVALID_SEARCH_TYPE. This is returned (by the Win32 implementation) if the requested types are nonzero and do not include the files or directories flags. (If the flags are 0, files is always assumed.) Also defined are RECLS_F_LINKS and RECLS_F_DEVICES, which are currently ignored.

The second behavioral change is that the RECLS_F_RECURSIVE and RECLS_F_DIRECTORY_PARTS flags are acted upon. Hence, if you don't specify the former, you'll only receive matching entries in the search directory. If you don't specify the latter, the directory parts will not be evaluated, resulting in slightly faster searching.

The last change is that Recls_SearchProcess() is now implemented. This function takes a pointer to a function (which takes a pointer to the entry info structure and a user-defined parameter) and conducts an entire (cancelable) search, applying the given function to each matching entry.

int RECLS_CALLCONV_DEFAULT 
EntryFunc(recls_info_t info,
         recls_process_fn_param_t /* param */)
{
  printf("%s\n", info->path.begin);
  return 1; /* Continue search */
}
Recls_SearchProcess("h:\\recls", "*.h", 
             RECLS_F_RECURSIVE, EntryFunc, 0);

This is useful when the operation you wish to apply to the items is simple because it reduces client code considerably, as shown in the new SearchProcess_test test program in the archive.
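To make the callback contract concrete (continue on a nonzero return, cancel on zero), here is a small self-contained sketch. It does not call recls: process_entries(), entry_fn_t, collect_state, and collect_entry() are hypothetical, and the "search" walks a fixed list of names rather than a filesystem.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical callback type: return nonzero to continue, 0 to cancel.
typedef int (*entry_fn_t)(const std::string& path, void* param);

// Hypothetical stand-in for a search-and-process function. It walks a
// fixed list rather than a filesystem, and returns the number of entries
// that were passed to the callback.
inline std::size_t process_entries(const std::vector<std::string>& entries,
                                   entry_fn_t fn, void* param)
{
    std::size_t visited = 0;
    for (std::size_t i = 0; i != entries.size(); ++i)
    {
        ++visited;
        if (0 == fn(entries[i], param))
        {
            break; // the callback asked to cancel
        }
    }
    return visited;
}

// Example user-defined parameter: collected paths plus a cut-off.
struct collect_state
{
    std::vector<std::string> paths;
    std::size_t              limit;
};

// Example callback: stores each path and cancels once the limit is reached.
inline int collect_entry(const std::string& path, void* param)
{
    collect_state* state = static_cast<collect_state*>(param);
    state->paths.push_back(path);
    return state->paths.size() < state->limit ? 1 : 0;
}
```

The user-defined parameter is what lets a plain function pointer carry state, which is why the callback signature includes it.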

Note that the API is still ANSI-only; there's no Unicode yet.

C++

The C++ class mapping is straightforward. There are three classes defined within the recls::cpp namespace (aliased to reclspp [1]). The FileSearch class (see the simplified public interface in Listing 1) manages a search handle hrecls_t and provides enumeration via the HasMoreElements(), GetNext(), and GetCurrentEntry() methods. It presents file entry information in the form of the FileEntry class (Listing 2), which wraps a recls_info_t instance. The string_t type is defined in the C++ mapping root header, reclspp.h, to be std::string, unless you stipulate otherwise via the preprocessor. Once the recls library supports Unicode, this will change to a traits-based approach.

I've kept these classes as simple as possible to represent an efficient and convenient alternative to the raw API for C++ client code. There are no exceptions and no runtime error checking (other than assertions via recls_assert.h). Users of the FileSearch class must call GetCurrentEntry() only when they know that an entry is available, having first called HasMoreElements(). A search is initiated in the constructor and cannot be restarted. Hence, a FileSearch instance is just that: a (single) file search. The constructor does not throw an exception; rather, you must use the usually crummy construct-and-test technique. In this case, it's reasonable because the HasMoreElements() method must be tested before retrieving elements.

The only other notable part of this mapping is that the FileEntry::GetDirectoryParts() method returns an instance of the DirectoryParts class. This is a thin layer over the shared recls_info_t instance, copied (actually, just a reference-count increment) from the source FileEntry instance. The DirectoryParts class provides two methods to give access to the separate parts of the entry's directory:

size_t size() const;
string_t operator [](size_t index) const;
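As an illustration of what such a two-method interface offers client code, here is a hypothetical class with the same shape. It is not the reclspp DirectoryParts class: it splits a std::string itself rather than wrapping a recls_info_t, and the choice to keep each part's trailing separator is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical illustration only; not the reclspp DirectoryParts class.
class DirectoryPartsSketch
{
public:
    // Splits a directory such as "/usr/include/" at each '/' separator.
    // Assumption of this sketch: each part keeps its trailing separator.
    explicit DirectoryPartsSketch(const std::string& directory)
    {
        std::string::size_type start = 0;
        std::string::size_type pos;
        while ((pos = directory.find('/', start)) != std::string::npos)
        {
            m_parts.push_back(directory.substr(start, pos - start + 1));
            start = pos + 1;
        }
        if (start < directory.size())
        {
            m_parts.push_back(directory.substr(start)); // no trailing '/'
        }
    }

    std::size_t size() const { return m_parts.size(); }

    // Returns by value, mirroring the signature shown above.
    std::string operator [](std::size_t index) const
    {
        assert(index < m_parts.size());
        return m_parts[index];
    }

private:
    std::vector<std::string> m_parts;
};
```

Client code can then loop from 0 to size() and index each part, without handling pointer ranges directly.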

STL

The C++ STL mapping consists of several cooperating templates, defined within the recls::stl namespace (aliased to reclstl [1]). One declares instances of basic_search_sequence<char or wchar_t>, then accesses and manipulates the asymmetric range defined by [begin(), end()).

typedef reclstl::basic_search_sequence<char>
                                  sequence_t;
sequence_t search("/usr/include", "*.h",
            RECLS_F_FILES | RECLS_F_RECURSIVE);
std::for_each(search.begin(), search.end(), . . . );

Rather than detail here how to implement STL sequences over other kinds of enumeration APIs, I opt for the technology evangelist and TV chef "here's one I prepared earlier" approach. Specifically, my article "Adapting Win32 Enumeration APIs to STL Iterator Concepts" (Windows Developer Network, March 2003; http://www.windevnet.com/documents/win0303a/) covers this subject in detail. It also explains the implementation of the WinSTL basic_findfile_sequence template, which is used in the Win32 implementation of recls.

In anticipation of the Unicode support, I've used a traits class, recls_traits (Listing 3), to select the appropriate methods from the API. For example, the recls_traits<char>::GetNextDetails method is defined as:

template <>
struct recls_traits<char>
{
  static recls_rc_t GetNextDetails
         (hrecls_t hSrch, entry_type *pinfo)
  {
    return Recls_GetNextDetails(hSrch, pinfo);
  }
  ...
};

This makes it easy to update the STL mapping to handle Unicode as well as ANSI because, when the C API changes to have Recls_GetNextDetailsA() and Recls_GetNextDetailsW(), this method will be changed to be implemented in terms of Recls_GetNextDetailsA(). Similarly, recls_traits<wchar_t>::GetNextDetails will be implemented in terms of Recls_GetNextDetailsW(). All the other reclstl classes will not need to be changed. Voilà!
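The same dispatch-by-character-type idea can be shown in a self-contained form. In this sketch, Demo_LengthA() and Demo_LengthW() are hypothetical stand-ins for an A/W pair of C API functions, and demo_traits and path_length() are likewise invented for illustration; none of them are part of recls.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical stand-ins for the A/W pair of C API functions.
inline std::size_t Demo_LengthA(const char* s)    { return std::strlen(s); }
inline std::size_t Demo_LengthW(const wchar_t* s) { return std::wcslen(s); }

// The primary template is declared but not defined: only the char and
// wchar_t specializations below are usable.
template <typename C>
struct demo_traits;

template <>
struct demo_traits<char>
{
    typedef char char_type;
    static std::size_t Length(const char_type* s) { return Demo_LengthA(s); }
};

template <>
struct demo_traits<wchar_t>
{
    typedef wchar_t char_type;
    static std::size_t Length(const char_type* s) { return Demo_LengthW(s); }
};

// Generic code names only the traits, so it needs no edits when the
// underlying function names change.
template <typename C>
std::size_t path_length(const C* path)
{
    return demo_traits<C>::Length(path);
}
```

Only the two specializations know which underlying function is called; everything written in terms of demo_traits<C> is untouched by a rename.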

Again, this mapping uses some STLSoft headers. stlsoft_iterator.h defines the template iterator_base with which the container implementation is blessedly insulated from the inconsistencies and incompatibilities to be found in the various Standard (!) Library implementations of the last half decade or so. The other header used is stlsoft_proxy_sequence.h, which defines the proxy_sequence template. Let's take a closer look at the reclstl classes:

  • basic_search_sequence. The constructor takes a search root, pattern, and flags, then stores them in member variables. It contains four methods: begin(), end(), empty(), and size(). The end() method returns an empty iterator, which marks the end of the iterable range. The begin() method creates an instance of the iterator class, basic_search_sequence_const_iterator, passing a search handle returned from a call to Recls_Search() with the root, pattern, and flags. Thus, an instance of the search sequence can support multiple enumerations, though the contents of each enumeration varies according to changes in the underlying filesystem. The empty() method simply reports whether begin() == end(), potentially incurring a high cost because begin() returns an iterator representing the first matching item or one equivalent to end(). Given an unsuccessful search criterion over an extensive directory structure, this could take time. This variation in time is one of the ways in which STL-like sequences implemented over operating system or technology-specific APIs fail to conform to the STL sequence-container concept (http://www.sgi.com/tech/stl/Container.html).

  • The sequence class also contains the size() method, which works by conducting an enumeration over the entire range because the iterator model supported by the recls C API is Input and not Random-Access. Thus, size() is an expensive operation and it shouldn't really be in the class at all. Furthermore, its results can be misleading: two successive calls to size() can return different values, even from within a process that does not itself change the contents of the filesystem. It has probably served a purpose in precipitating this discussion, but it'll likely be deprecated in the next version, and gone in the one after that.

  • basic_search_sequence_const_iterator. The iterator class provides the expected operations of an Input Iterator: preincrement and postincrement, dereference, equality, and inequality tests. Listing 4 shows the implementation of two of these methods. The class owns the search handle passed to its nondefault constructor and releases it during the destructor or when enumeration has reached the end point.

  • Dereferencing is implemented by calling Recls_GetDetails() to acquire the recls_info_t for the current point in the search, which is then returned in an instance of the value type.

  • basic_search_sequence_value_type. The value type provides access to the attributes of a search entry, as well as managing the lifetime of the entry, including handling copy construction/assignment correctly by calling Recls_CopyDetails(). Listing 5 is an abridged listing of its public interface and the implementation of a couple of the methods. Most of the methods directly access the elements of the structure for efficiency.

    An interesting feature of the class, representing the second STLSoft dependency, is the get_directory_parts() method, which returns an instance of stlsoft::proxy_sequence<recls::strptrs_t, string_t,...>. Briefly, the template provides an STL-like sequence face over an existing (usually POD) range, and provides translation to a value type (one of the template parameters) from the proxied range iterator's operator *(). In this case, the proxy acts over the directory parts' asymmetric range and returns instances of string_t from operator *(). Hence, it provides a user-friendly interface over a raw range, so that client code need not handle the asymmetric (character) ranges represented by the individual directory parts.
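The overall shape of a sequence whose begin() starts a fresh enumeration, and whose empty() and size() are defined in terms of iteration, can be sketched as follows. This is a simplified illustration, not the reclstl code: name_sequence is hypothetical, it iterates over an in-memory list instead of a search handle, and resetting the iterator's pointer to null merely stands in for releasing a handle at the end of enumeration.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical sequence over a fixed list of names; not the reclstl classes.
class name_sequence
{
public:
    class const_iterator
    {
    public:
        typedef std::input_iterator_tag iterator_category;
        typedef std::string             value_type;
        typedef std::ptrdiff_t          difference_type;
        typedef const std::string*      pointer;
        typedef const std::string&      reference;

        // A default-constructed iterator marks the end of the range.
        const_iterator() : m_names(0), m_index(0) {}
        explicit const_iterator(const std::vector<std::string>* names)
            : m_names(names), m_index(0)
        {
            release_if_exhausted();
        }

        reference operator *() const { return (*m_names)[m_index]; }
        const_iterator& operator ++()
        {
            ++m_index;
            release_if_exhausted();
            return *this;
        }
        const_iterator operator ++(int)
        {
            const_iterator prev(*this);
            ++*this;
            return prev;
        }
        bool operator ==(const const_iterator& rhs) const
        {
            return m_names == rhs.m_names && m_index == rhs.m_index;
        }
        bool operator !=(const const_iterator& rhs) const
        {
            return !(*this == rhs);
        }

    private:
        // On reaching the end, become equal to the end-of-range iterator.
        void release_if_exhausted()
        {
            if (0 != m_names && m_index == m_names->size())
            {
                m_names = 0;
                m_index = 0;
            }
        }

        const std::vector<std::string>* m_names;
        std::size_t                     m_index;
    };

    explicit name_sequence(const std::vector<std::string>& names)
        : m_names(names)
    {}

    // Each call starts a new enumeration from the beginning.
    const_iterator begin() const { return const_iterator(&m_names); }
    const_iterator end() const   { return const_iterator(); }

    // Both are defined by iterating, so their cost follows the enumeration.
    bool empty() const { return begin() == end(); }
    std::size_t size() const
    {
        return static_cast<std::size_t>(std::distance(begin(), end()));
    }

private:
    std::vector<std::string> m_names;
};
```

Here size() is cheap only because the data is in memory; over a real filesystem search the same definition walks every matching entry.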

C#

The first thing you need to do to support .NET Interop via C# is to have a DLL. Thus, I've created a DLL project that's included with Version 1.1. It is implemented by linking the static library with a C file containing a DllMain and a .DEF file containing the exports. The DLL was built with Visual C++ and statically linked to the C Runtime Library. As such, it weighs in at 48 KB. In the long run, I'd like to pare this down because there's lots of cruft in there; it doesn't have any static objects or use stdio.

To use the exported symbols from any DLL in C#, in what is known as "Native Interop," you must use the DllImport attribute from the System.Runtime.InteropServices namespace. There's a lot to learn about Interop [2], but you should be able to glean a fair amount by looking through the implementation of the C# mapping. The essential step is to declare — but not define — your imported functions and decorate them with the DllImport attribute, as in:

 [DllImport( "recls_dll", EntryPoint="Recls_Search"
          , CallingConvention=CallingConvention.StdCall
          , CharSet=CharSet.Ansi, ExactSpelling=true)]
private static extern
  int Recls_Search( string searchRoot, string pattern
                  , uint flags, out hrecls_t hSrch);

This declares a function Recls_Search(), taking two strings and a uint, and returns the search handle via an out parameter. It states that the function resides in recls_dll.dll (if not specified, the extension is assumed to be .DLL), is called Recls_Search, uses the __stdcall convention (callee cleans stack), and expects ANSI rather than Unicode character strings. The ExactSpelling attribute requires that the Interop layer use the exact name, rather than apply A or W postfixes, which it is able to do for you.

One option would be to define the recls_fileinfo_t structure within C#, using the StructLayout attribute, which would arguably be more efficient. But I couldn't face all the mess of dealing with the pointer ranges from within C#. (If you want to do that, I'll be happy to post it on the site.) Also, it would be fragile and difficult to change, especially when expanding the C API to Unicode and ANSI versions.

So the entries are treated as if they are opaque and are defined as IntPtr. One irritant is that, although it's possible to use C#'s alias mechanism to weakly typedef hrecls_t and recls_info_t from IntPtr, they are fundamentally the same type and can be mistakenly interchanged, so it's only a help in porting the code across from C (Listing 6; available at http://www.cuj.com/code/). It'd be better if C# provided a strong typedef, but it doesn't.

One smart move [2] when dealing with Interop is to isolate all the external functions in another class, as I've done by defining a recls_api class within the recls namespace. Hence, the FileSearch, FileEntry, DirectoryParts, and ReclsException types are all implemented in terms of recls_api rather than having to mess around with imported functions. As well as insulating them from change, it also means that they can deal with .NET types only. recls_api handles all the translation from the C API types to .NET types: Win32 FILETIME values to .NET's DateTime, Win32 ULARGE_INTEGER values to C#'s ulong (64-bit unsigned integer), and C-strings to .NET's String. The one strange conversion is when passing character buffers to the C API. This is done by instantiating a StringBuilder instance, ensuring it has sufficient capacity, and passing it to the API as an object reference, as in:

[DllImport("recls_dll", EntryPoint = "Recls_GetDirectoryPartProperty", . . .)]
private static extern
 uint Recls_GetDirectoryPartProperty(recls_info_t fileInfo, int part
                        , StringBuilder buffer, uint cchBuffer);

public static string GetEntryDirectoryPart(recls_info_t entry, int index)
{
  StringBuilder buffer    = new StringBuilder(261);
  uint          capacity  = (uint)buffer.Capacity;
  uint          cch       = Recls_GetDirectoryPartProperty(entry, index
                                                      , buffer, capacity);
  buffer.Length = (int)cch;
  return buffer.ToString();
}
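The native side of this "caller supplies the buffer, callee reports the length" pattern can be sketched in C++. Demo_GetPart() and get_part() are hypothetical, and the stated contract (copy at most cchBuffer characters, write no terminating NUL, return the count copied) is an assumption of this sketch rather than documented behavior of Recls_GetDirectoryPartProperty().

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical native function following a "caller supplies the buffer"
// contract. Assumptions of this sketch: it copies at most cchBuffer
// characters, writes no terminating NUL, and returns the count copied.
inline std::size_t Demo_GetPart(const char* part, char* buffer,
                                std::size_t cchBuffer)
{
    std::size_t cch = std::strlen(part);
    if (cch > cchBuffer)
    {
        cch = cchBuffer; // truncate to the caller's capacity
    }
    if (0 != cch)
    {
        std::memcpy(buffer, part, cch);
    }
    return cch;
}

// Caller side: allocate capacity, call, then trim to the reported length,
// which is the role played by the StringBuilder in the C# code above.
inline std::string get_part(const char* part, std::size_t capacity = 261)
{
    std::vector<char> buffer(capacity);
    if (buffer.empty())
    {
        return std::string();
    }
    std::size_t cch = Demo_GetPart(part, &buffer[0], buffer.size());
    return std::string(&buffer[0], cch);
}
```

Trimming to the returned length, rather than searching for a terminator, means the caller never reads past what the callee actually wrote.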

In the first cut, I had the FileEntry instances copy the IntPtr for their entries from the FileSearch class. Thus, some FileEntry instances were holding the structure while others were releasing it back to the recls C API in their finalizers. Embarrassing, certainly, but easy to fix: except that the failure symptom reported a NullReferenceException. Naturally, this makes you think "C# object reference" rather than EXCEPTION_ACCESS_VIOLATION. However, once I'd stuck in more debugging code and hit myself over the head a couple of times, all was right with the world.
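The underlying hazard, several holders of one native resource with each believing it may release it, is easy to reproduce in any language. One common remedy, sketched here in C++ rather than C#, is for each holder to own a share, so that the release runs exactly once. This is an illustration of the general idea, not the fix applied in the C# mapping; demo_entry, its create and release functions, and EntryHandle are hypothetical.

```cpp
#include <cassert>
#include <memory>

// Hypothetical native resource with an explicit release function, plus a
// counter so that the number of releases can be observed.
struct demo_entry { int id; };

inline int& release_count()
{
    static int count = 0;
    return count;
}

inline demo_entry* demo_entry_create(int id)
{
    demo_entry* e = new demo_entry;
    e->id = id;
    return e;
}

inline void demo_entry_release(demo_entry* e)
{
    ++release_count();
    delete e;
}

// Each wrapper holds a share of the entry; the release function runs once,
// when the last share goes away.
class EntryHandle
{
public:
    explicit EntryHandle(demo_entry* e) : m_entry(e, demo_entry_release) {}
    int id() const { return m_entry->id; }

private:
    std::shared_ptr<demo_entry> m_entry;
};

// Creates two owners of one entry and reports how many releases occurred.
inline int releases_after_two_owners()
{
    const int before = release_count();
    {
        EntryHandle search_owner(demo_entry_create(7));
        EntryHandle entry_owner(search_owner); // shares, does not duplicate
        assert(entry_owner.id() == 7);
    }
    return release_count() - before;
}
```

Copying the raw pointer instead would leave two owners each calling the release function, which is the failure described above.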

There's an interesting design decision in the implementation of FileSearch (Listing 7; available at http://www.cuj.com/code/), which is made enumerable by the provision of a GetEnumerator() method that returns an object implementing the IEnumerator interface. When creating a FileSearch instance, it is desirable to find out at that time whether the search parameters are going to lead to a valid search. Hence, we wish to start the search in the constructor. However, there are two reasons why this cannot be the case:

  • An enumerable type can be placed in multiple foreach statements. Would you create the initial search in the FileSearch constructor and subsequent enumerations in the enumerator objects? This is unbalanced.

  • The enumerator object should implement Reset(), which means that the enumerator object, rather than the enumerable type, should initialize a search.

The consequence of deferring the search until it's used is that the exception is thrown from within the foreach, rather than from the object's construction — which is unappealing from a common-sense point of view, but necessary. Note that an empty, but otherwise valid, search does not cause an exception to be thrown.

FileSearch instances can be used within a foreach loop, so it's trivial to enumerate through the matching entries:

FileSearch  fs = new FileSearch(searchRoot,
                           pattern, flags);

foreach(FileEntry fe in fs)
{
  // do something with fe
  System.Console.WriteLine(fe.Path);
}

The C# compiler evaluates whether a foreach enumerator is "disposable," i.e., implements the IDisposable interface. If it does, then the compiler guarantees that the Dispose method will be called no matter how the foreach loop terminates. Since our enumerators contain unmanaged resources (search handles and entry structures), it is a good idea to implement the IDisposable interface, as can be seen in the code.

Documentation

From Version 1.0, I've created documentation (http://recls.org/help/) for the library using Doxygen (http://doxygen.org/). You can see some of the Doxygen tags in Listing 1. Documentation is hard to write and it's probably not perfect. I'll gladly hear any comments for improvement.

The exception to using Doxygen is the C# mapping, since the C# compiler can generate (via the /doc flag) XML documentation files directly from the source, assuming you've used the correct tags (Listing 7). The resultant files, when installed alongside their assemblies, can provide Intellisense information to the Visual Studio.NET IDDE, which is nice. Also, the free documentation tool NDoc (available at http://ndoc.sourceforge.net/) can be applied to the XML files to produce compiled HTML Help (.CHM) files that also link to all the requisite .NET SDK online documentation. It produces a professional-looking package, so this is what I'm using in the case of C#. (Alas, Visual C++ does not yet perform the same service for Managed C++, so that mapping will be done using Doxygen along with all the others.)

Next Steps

In the next installment, I will address:

  • Unicode and ANSI versions of the API.

  • Renaming of some inappropriately named methods; providing backwards compatibility.

  • UNIX version, initially Linux [3].

  • COM mappings: IEnumXxxx enumerators, and Automation collections (Count, Item, _NewEnum).

  • Maybe the D mapping, given time (and space).

Notes and References

[1] Wilson, Matthew. "Open-Source Flexibility via Namespace Aliasing," C/C++ Users Journal, July 2003.

[2] Clark, Jason. "Calling Win32 DLLs in C# with P/Invoke," MSDN magazine, July 2003.

[3] If someone wants to grant me a sandbox login (with a compiler, of course) to their architecture of choice — Mac, VMS, whatever — I'll be glad to port it. (I have to admit, I just get a big kick out of writing cross-platform code.)


Matthew Wilson is a software development consultant for Synesis Software, creator of the STLSoft libraries, and author of the upcoming Imperfect C++ (Addison-Wesley, 2004). He can be contacted at [email protected] or http://stlsoft.org/.


