FREE Subscription to Dr. Dobb’s Digest: Same Great Content, New Digital Edition
Site Archive (Complete)
Email
Print
Reprint

add to:
Del.icio.us
Digg
Google
Furl
Slashdot
Y! MyWeb
Blink
August 09, 2001
Perl and :CueCat Help You Roll Your Own Data Capture

Brent Michalski
You can write your own software to use the :CueCat without having your activities tracked. I've written a Perl program that scans the ISBN barcode from a book, then looks up the book information online and stores the data in a MySQL database.
Perl

The :CueCat is a barcode scanner, available for free from Radio Shack, that hooks between your keyboard and your computer. It was originally designed to scan special barcodes in magazines: you see an ad that interests you, you scan the barcode, and a related web site with more information pops up in your browser. You can also scan items, such as books or sodas, or really anything with a UPC symbol on it, and be taken to a web site containing more information.

While all of this scanning and web surfing is going on, your every move is tracked and stored in a database so that Digital Convergence can target better marketing at you. This part makes me uneasy; however, user tracking only happens when you use the software provided with the :CueCat scanner. You can write your own software to use the :CueCat without having your activities tracked. I've written a Perl program that scans the ISBN barcode from a book, then looks up the book information online and stores the data in a MySQL database.

Perl has often been called the "duct tape of the Internet," and

rightfully so. It it is a great language for a task like this that requires

accessing data on the Internet and parsing it for a new use. Also, since Perl

has been around for quite some time, there are hundreds of modules. Perl's main

site for modules is http://www.cpan.org

(or http://search.cpan.org for a friendlier

interface).

Putting Perl Modules to Work

We'll be using the DBI, LWP, LWP::UserAgent, HTML::Parse, HTML::FormatText,

and Business::ISBN modules. All of the modules are available from

http://search.cpan.org .

The DBI module is what allows us to connect to a database. DBI stands for DataBase Independent because this module is used for all database connectivity. Along with the DBI module, you need to download the appropriate DataBase Dependent (DBD) driver for the database that you intend to connect to. This is the part that is database specific.

If you don't have MySQL, any database with a DBD driver should work. The Perl

DBI home page is http://www.symbolstone.org/technology/perl/DBI

: there you will find installation instructions, DBD drivers for many different

databases and tutorials.

LWP is the Library for WWW access in Perl. It provides an easy interface to do just about anything web related, such as automatically fetching web pages or posting data to a form somewhere on the web.

LWP::UserAgent is a module that provides a simple WWW user agent (it acts like a browser). It allows you to very simply perform automated web tasks. It is different from the LWP module in that it is a subclass of LWP and provides a much narrower range of functionality -- but makes this functionality easier than if you had used LWP alone.

HTML::Parse is a deprecated module, but I still find it very useful. It allows you to easily parse an HTML document. When we do the ISBN search, we are returned an HTML page as the result. HTML::Parse allows us to easily remove the tags and get to the information on the page that we are interested in.

HTML::FormatText allows us to transform the HTML page into standard text, sort of like what you would get if you used a text-based browser such as Lynx. This, combined with the HTML::Parse module make our work much easier.

Business::ISBN provides a few functions that are used to deal with ISBN numbers. It allows us to check for valid ISBN numbers, check for country-codes from ISBN numbers, check publisher codes, ensure hyphens are placed correctly, fix the ISBN checksum and more.

Hacking the :CueCat

The output of the :CueCat is encrypted, but there are several software solutions to the encryption and also some hardware "hacks" to remove the encryption altogether. Hardware hacks include cutting a wire trace on the :CueCat main board to disable the serial number (which is sent with each scan), to adding wires to disable the encryption altogether. A quick search online will get you a whole bunch of web sites that show you exactly how to "declaw" a :CueCat. The program we'll be covering in this article will work with either a regular :CueCat or a "hacked" :CueCat.

In planning this program, I decided not to have a GUI interface for the data entry. Instead I've used a command-line interface for two reasons: First, you can scan a large number of books much faster without a GUI in your way. Second, the :CueCat was sending some extra characters that caused some unexpected results on web browsers. And third, Laziness. Ok, three reasons.

The first section of code tells the program where to find Perl and loads in

all of the modules that are needed for this application. We load the strict

pragma. Using strict is a very good idea

when you write programs. It forces you to predeclare variables and turns on

more error checking so that if you have any bugs in your code, they will be

easier to track down. The variables declared on Line 9 are outside of any subroutines

or code blocks, so the scope of these variables exists throughout the program.


1: #!/usr/bin/perl;

2: use DBI;
3: use LWP;
4: use LWP::UserAgent;
5: use HTML::Parse;
6: use HTML::FormatText;
7: use Business::ISBN qw(is_valid_checksum);
8: use strict;

9: my ($title, $price, $authors, $format, $publisher, $pub_date, $isbn);

The next section connects to the database and creates the SQL queries that we will use in our program. We connect to the MySQL database that we'll be using to store the book information. "DBI:mysql:books" is the data source; it consists of the module (DBI), the database driver (mysql), and the database name (books). The driver would need to be changed to whichever database you are using if it is not MySQL. Also, case is important here! The string "MySQL" will NOT work in the data source -- it must be lowercase. The second argument, "books," is the username that we are connecting as and the third argument, "cuecat," is the password.

 



10: my $dbh = DBI->connect("DBI:mysql:books","books","cuecat") 
11:    or die "Error! $DBI::ERRSTR\n\n";

This is followed by two separate SQL statements. The first one does a simple select, which looks for an ISBN number. This one will be used to check if the book that was just scanned is already in the database. The second SQL statement is what we use to insert a new record into the database.




12: my $SQL1 = "SELECT isbn FROM library where isbn = ?";
13: my $SQL2 = qq{INSERT INTO library (isbn,title,author,price,format,publisher,
14:                                    pubdate,notes) VALUES (?,?,?,?,?,?,?,?)};

The question marks in the SQL statements are called placeholders. Placeholders are just like variables -- but for SQL statements. When we execute these SQL statements, we must pass the same number of values as the statement has placeholders. If we had two placeholders, but only passed one variable when we executed the SQL statement, an error is generated.

We then create two references that hold the prepared SQL statements. When an SQL statement is "prepared" using the DBI module, it is stored in a precompiled format. This allows for faster execution when it is called. It also allows you to name an SQL statement with something that is easier to remember and work with. For example, the following SQL statements get stored in the $check and $insert variables. The $check statement checks to see if a book currently exists in the database, and the $insert statement is used to insert a new book into the database:




15: my $check  = $dbh->prepare($SQL1);
16: my $insert = $dbh->prepare($SQL2);

The next section is the main loop of the program. Line 17 creates an infinite loop, because we want to keep being prompted for new books until we want to quit the program. This allows us to add books to the database much faster. This infinite loop can be broken out of if the user presses the "q" key.


17: while(1){

18:     print "\nISBN: ";
19:     my @book = ();

20:     my $isbn = <STDIN>;
21:     chomp($isbn);

22:     last if($isbn eq  "q"  );
The user sees a simple prompt to enter an ISBN number. Granted, this is not a very "user-friendly" way to ask for input, but remember that this is a simple command-line program. We create a my variable called @book as an array and clear any contents that it might contain. A my variable has a scope of the block that the variable is in. This means that the array @book is not available outside of the while loop, which is how we want it.

Next, we manipulate the ISBN number and check the database to see if it already exists. An if statement checks to see if there is a period (.) in the $isbn string. If there is a period in the string, we call the Cue_Decrypt subroutine on the $isbn variable and store the results back into the $isbn variable. On a normal :CueCat, the data returned from a scan will have at least one period because of how the output is encrypted. On a "hacked" :CueCat, the output will be in standard, plain-text mode and not need to be decrypted. The code on line 23 ensures that we only decrypt data that needs to be decrypted. All standard ISBN codes begin with 978 when scanned. We do not need the 978, so we remove it. Then we use Perl's substr function to take the first 10 characters from the variable -- ISBN numbers are 10 characters long.

23:     $isbn = Cue_Decrypt($isbn) if($isbn =~  /\./);
24:     $isbn =~ s/^978// ;
25:     $isbn = substr($isbn,0,10);
Most of the Cue_Decrypt subroutine was found on the Internet. I've heard that it's code from Larry Wall, but I have not been able to confirm it. Some of it is quite complex, even though it is only a few lines long. Because some of it is so complex, a full description is beyond the scope of this tutorial.

A new Business::ISBN object is created using the $isbn variable, and we store the reference to this new object in $cs . Then the fix_checksum function is called from the Business::ISBN module. This calculates what the last number of the ISBN should be, and makes any needed corrections to it.


26:     my $cs   = new Business::ISBN($isbn);
27:     $cs->fix_checksum();

We use the as_string function from the Business::ISBN module to store the new, unencrypted ISBN number into the $isbn variable, then remove any dashes from the ISBN number. This step is not needed and was added simply by choice to simplify searching of ISBN numbers. If you create a search form for your database, not having to worry about matching with the dashes, and checking whether or not the user entered them will make things much easier.


28:     $isbn = $cs->as_string();
29:     $isbn =~ s/\-// g;
30:     next if(Check_DB($isbn));  # If in the database, don't add
Line 30 can be a real timesaver. It performs a check of the database for the book that you scanned. If the book is already in the database, it breaks out of the current loop and prompts the user again to enter an ISBN number. It does this check before going to any site -- so it saves quite a bit of time if the record already happens to be in the database.

Line 31 is the URL where we search for the ISBN information. The reason this search engine was picked is because it returns the data in a format that is fairly easy to parse. If anyone knows of an open ISBN search engine/site that returns the data in XML, please let me know. XML would be much easier to parse, and would make the search engine much more open for other applications as well.

31:     my $url = "http://shop.bn.com/bookSearch/isbnInquiry.asp?isbn=$isbn" ;
32:    # Create a user agent object
33:     my $ua = new LWP::UserAgent;
34:     $ua->agent("CueCat/0.1 "   . $ua->agent);
35:    # Create a request
36:     my $req = new HTTP::Request GET => $url;
37:     $req->content_type('application/x-www-form-urlencoded' );
38:    # Pass request to the user agent and get a response back
39:     my $res = $ua->request($req);

40: my $content = $res->content;
41:     my @content = split("\n" , $content);
After we create a new UserAgent object and store the reference to the object in $ua , we call the agent function from LWP::UserAgent and set it to "CueCat/0.1". This will then be the type of agent that gets logged in the web server log files. This allows web server reporting software to track the different web browsers and their versions. You can have a bit of fun here and put something like "Mozilla/12.3" to masquerade as a Netscape 12.3 browser in the server log files.

Line 36 creates a new HTTP::Request object and tells it to GET the resulting page of the URL stored in $url . A reference to this request is stored in $req . Line 40 gets the content from our request and stores it in the $content variable. This should be the full HTML source of the web page that was returned, which is then stored line-by-line in the @content array to allow easier parsing.

Next we strip out unwanted information from the array, including any HTML markup elements. Each line gets pushed onto the @book array if it is not empty, and does not include an IMAGE tag or "TABLE NOT SHOWN."


42:     my $start = 0;
43:     foreach (@content){
44:         if( /<!-- content cell -->/){ $start++; }
45:         if($start > 0){
46:             s/\ / /g;
47:                my $foo =
                HTML::FormatText->new->format(parse_html($_));
48:             if($foo !~ /^\s+$/){
49:                 if($foo !~ /IMAGE/){
50:                     if($foo !~ /TABLE NOT SHOWN/ ){
51:                         push(@book, $foo);
52:                     }
53:                 }
54:             }
55:             if(/Pub\. Date/) { $start = 0; }
56:         }
57: }

Here is an example of actual data that was pushed onto the @book array.


User Friendly: The Comic Strip
J.D. Illiad Frazer Preface by Eric S. Raymond
Our Price: $11.65
Retail Price: $12.95
You Save: $1.30
(10%)
In Stock: 24 hours (Same Day)
Need this delivered by Christmas? No problem. Order today! (Manhattan
orders placed by 11am will be delivered on the Same Day!)
Format: Paperback, 1st ed., 128pp.
ISBN: 1565926730
Publisher: O'Reilly & Associates, Incorporated
Pub. Date: October 1999

As you can see, we removed most of the text that we started with, and are left with just a few, easily parseable lines. If the line contains "Pub. Date" we are done parsing the data, so the if loop from line 45 is exited, and there is no time wasted parsing extra data.

We can now set the variable $title and $authors from the first two elements in the @book array.

These values should always work for us. How do we know this? Trial and error. If you are getting data from a site and want to parse it somehow, the only way to make your code work is to actually get data from the site, study the data that was returned, and create some code that parses the returned HTML document into a format that you want.

Line 62 begins a foreach loop that traverses the @book array. Each if block in this loop then checks for certain keyword values that should be in the text that we are parsing.

For each keyword we look for in these blocks, if the value is found, the block is entered and the data we are looking for is extracted.

In each block, you can see a regular expression that is used to get rid of extra data that we don't want. It looks like this: s/^\W*Publisher:// the ^\W* means start at the beginning of the line (^), look for a non-word character (\W) and match it zero or more times (*). Then, look for the text Publisher: and replace all of this with nothing (//). What this effectively does is remove the labels preceding the values that we are looking to extract.

 



62:     foreach(@book){
63:         if(/Retail Price:/){
64:             $price = $_;
65:             $price =~ s/^\W*Retail Price://;
66:             chomp($price);
67:         }
68:         if(/Format:/){
69:              $format = $_;
70:              $format =~ s/^\W*Format://;
71:              chomp($format);
72:          }
73:          if(/Publisher:/){
74:              $publisher = $_;
75:              $publisher =~ s/^\W*Publisher://;
76:              chomp($publisher);
77:          }
78:          if(/Pub\. Date:/){
79:              $pub_date = $_;
80:              $pub_date =~ s/^\W*Pub\. Date://;
81:              chomp($pub_date);
82:          }
83:      }

Line 84 begins a here document that will output some information to the user. Then lines 85 - 91 print out the book information so that the user can see what is actually getting put into the database and line 92 ends the here document.

84: print<<BOOK;
85: Title:        $title
86: Author(s):    $authors
87: Price:        $price
88: Format:       $format
89: Publisher:    $publisher
90: Publish Date: $pub_date
91: ISBN:         $isbn
92: BOOK

Lines 93 - 94 are a single Perl statement spread across two lines for readability. These lines pass the book data and execute the SQL statement to add the information to the database. Line 95 dings the computer's dinger (causes the computer to beep once).

 93:     $insert->execute($isbn,$title,$authors,$price,$format,$publisher,
 94:                      $pub_date,''  ); 
 95:     print chr(7);  # Dinggggg... 
 96: }
 97: sub Check_DB{
 98:     $check->execute(shift);
 99:     $check->fetch ? return 1 : return 0;
100: }

Lines 97 - 100 are the Check_DB subroutine. This subroutine checks the database to see if the book that was just scanned is already in the database. Line 98 calls the execute function and passes it "shift." This returns the first item passed to this subroutine when it was called. So, if you were to say:


$foo = Check_DB("123abc");

The execute that gets called on line 98 shifts the value "123abc" and that is what gets sent to the execute function.

Line 99 uses the trinary operator. On the left side, we called the fetch function. If any data was returned, this will return a non-zero value and therefore the first item after the ? is executed (return 1). If no data was found, then 0 would be returned from the fetch operation and we would execute the statement after the colon (return 0).

This is a simple way to check if the value was in the database. 0 means nothing was found, 1 means something matched. The trinary operator is very useful and can often save you a lot of if..else statements in your code.

The Cue_Decrypt Subroutine

Lines 101 - 110 make up the Cue_Decrypt subroutine. Line 102 shifts the first value passed to the subroutine and stores it in $ _. Line 103 uses the map function and stores the results in the @data array. The map function allows you to perform a set of operations on each item in an array, and it returns an array which is the results of the operations performed on the data elements.

In our example, it may not be clear what list we are passing in to the map function. If you look at line 108 you will see that we are matching with a regular expression. This regular expression is where the data comes from that we pass into the map function. We are going to use the string ENr7CNz1DNPXDxrWEW as our input and we'll walk through each step. This value is the third part of an ISBN code scanned from a book with a :CueCat. Line 104 performs a translation on the data. The first portion of the tr/// operation ( a-zA-Z0-9+- ) expands out to this:


abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789+-

The second portion ( -_ ) expands to:


 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_

So, when the translation is done, the letter "a" becomes a space, "b" becomes a "!", and so on. With our input value of ENr7CNz1DNPXDxrWEW, we now have>G1[<G9U=GIQ=71P>P

Line 105 unpacks the data, which is uuencoded. This is one of those lines that has a lot going on, and won't be explained in detail. For a detailed explanation of unpack, type perldoc -f unpack from a shell and if Perl is installed properly, you will get a documentation page. Once we execute the code on line 105, we end up with this:

 
zt{rvuvzqutp{

Line 106 is used to remove any null characters from the end of the string, if there are any. Line 107 is another "magic" line. This line does a bitwise XOR on the string. The result after we do this is:

 

9781565926738

Which is exactly what we were looking for! Line 108 is how the list gets fed into the map function. This line uses the matching operator //, to split the scan at each period. The encrypted scans will ALWAYS have three sets of numbers, separated by periods. The first is the :CueCat serial number (so they can track you), the second set is the type of barcode that was read, and the third set is the actual decoded barcode. So it just chunks the scanned input into 3 sections.

Line 109 returns the third element of the array. Remember, the third set of numbers is the decoded barcode, and that is what we were looking for.

Ready to Scan

Now we have finished the program and are ready to scan all of our books into a database! We have used Perl to manipulate a database and to get web pages from the Internet. Both of these things are very powerful and should prove to be extremely useful for many other applications that you may find yourself writing in the future. Next time you have to write a quick program, especially one that deals with text manipulation, use Perl.

Code Listing:

(You can also download this code as a zipped text file.)

#!/usr/bin/perl


use DBI;
use LWP;
use LWP::UserAgent;
use HTML::Parse;
use HTML::FormatText;
use Business::ISBN qw(is_valid_checksum);
use strict;


my ($title, $price, $authors, $format, $publisher, $pub_date, $isbn);


my $dbh = DBI->connect("DBI:mysql:books","root","c600go")
   or die "Error! $DBI::ERRSTR\n\n";


my $SQL1 = "SELECT isbn FROM library where isbn = ?";


my $SQL2 = qq{INSERT INTO library (isbn,title,author,price,format,publisher,
                                   pubdate,notes) VALUES (?,?,?,?,?,?,?,?)};


my $check  = $dbh->prepare($SQL1);
my $insert = $dbh->prepare($SQL2);


while(1){
    print "\nISBN: ";
    my @book = ();


    my $isbn = <STDIN>;
    chomp($isbn);


    last if($isbn eq "q");


    $isbn = Cue_Decrypt($isbn) if($isbn =~ /\./);


    $isbn =~ s/^978//;
    $isbn = substr($isbn,0,10);


    my $cs   = new Business::ISBN($isbn);


    $cs->fix_checksum();


    $isbn = $cs->as_string();
    $isbn =~ s/\-//g;


    next if(Check_DB($isbn));  # If in the database, don't add it again


    my $url = "http://shop.bn.com/bookSearch/isbnInquiry.asp?isbn=$isbn";


   # Create a user agent object
    my $ua = new LWP::UserAgent;
    $ua->agent("CueCat/0.1 " . $ua->agent);


   # Create a request
    my $req = new HTTP::Request GET => $url;
    $req->content_type('application/x-www-form-urlencoded');


   # Pass request to the user agent and get a response back
    my $res = $ua->request($req);


    my $content = $res->content;
    my @content = split("\n", $content);


    my $start = 0;
    foreach (@content){
        if(/<!-- content cell -->/){ $start++; }


 
        if($start > 0){
            s/\ / /g;
            my $foo = HTML::FormatText->new->format(parse_html($_));
            if($foo !~ /^\s+$/){
                if($foo !~ /IMAGE/){
                    if($foo !~ /TABLE NOT SHOWN/){
 print "Foo: $foo\n";


                        push(@book, $foo);
                    }
                }
            }
            if(/Pub\. Date/) { $start = 0; }
        }
    }


    my $title   = $book[0];
    my $authors = $book[1];


    chomp($title);
    chomp($authors);


    foreach(@book){


        if(/Retail Price:/){
            $price = $_;
            $price =~ s/^\W*Retail Price://;
            chomp($price);
        }
        if(/Format:/){
            $format = $_;
            $format =~ s/^\W*Format://;
            chomp($format);
        }
        if(/Publisher:/){
            $publisher = $_;
            $publisher =~ s/^\W*Publisher://;
            chomp($publisher);
        }
        if(/Pub\. Date:/){
            $pub_date = $_;
            $pub_date =~ s/^\W*Pub\. Date://;
            chomp($pub_date);
        }
    }


print<<BOOK;
Title:        $title
Author(s):    $authors
Price:        $price
Format:       $format
Publisher:    $publisher
Publish Date: $pub_date
ISBN:         $isbn


BOOK


    $insert->execute($isbn,$title,$authors,$price,$format,$publisher,
                     $pub_date,'');


    print chr(7);  # Dinggggg...
}


 
sub Check_DB{
    $check->execute(shift);
    $check->fetch ? return 1 : return 0;
}


 
sub Cue_Decrypt{
    $_ = shift;


    my @data = map {
        tr/a-zA-Z0-9+-/ -_/;
        $_ = unpack 'u', chr(32 + length()*3/4) . $_;
        s/\0+$//;
        $_ ^= "C" x length;
    } /\.([^.]+)/g;


    return($data[2]);
}


Brent is the moderator of the Perl channel for Dr. Dobb's Online. He can be reached at brent@perlguy.net.


TOP 5 ARTICLES
No Top Articles.
DR. DOBB'S CAREER CENTER
Ready to take that job and shove it? open | close
Search jobs on Dr. Dobb's TechCareers
Function:

Keyword(s):

State:  
  • Post Your Resume
  • Employers Area
  • News & Features
  • Blogs & Forums
  • Career Resources

    Browse By:
    Location | Employer | City
  • Most Recent Posts:
    MEDIA CENTER  more
    NetSeminar
    Modernize your Development by Moving Build and Code Quality Upstream
    Moderated by Jon Erickson, Editor-in-Chief of Dr. Dobb's, this interactive panel discussion brings industry experts Anders Wallgren, CTO of Electric Cloud and Gwyn Fisher, CTO of Klocwork together for a candid discussion of the cost savings, productivity and quality benefits that can be achieved by stabilizing builds and code quality as early in the development cycle as possible.

    The reality of today's development environment - geographically distributed teams, the use of Agile development practices, increasing application complexity, etc. - is straining the viability of the traditional coding, build and release process. To stay ahead of the curve, development teams are modernizing their approach to dealing with these issues, and as a result are achieving new levels of development productivity. Register for the webcast.
    Date: Wednesday, July 15, 2009
    Time: 11 am PT/2 pm ET
    Modernize your Development by Moving Build and Code Quality Upstream
    Moderated by Jon Erickson, Editor-in-Chief of Dr. Dobb's, this interactive panel discussion brings industry experts Anders Wallgren, CTO of Electric Cloud and Gwyn Fisher, CTO of Klocwork together for a candid discussion of the cost savings, productivity and quality benefits that can be achieved by stabilizing builds and code quality as early in the development cycle as possible.

    The reality of today's development environment - geographically distributed teams, the use of Agile development practices, increasing application complexity, etc. - is straining the viability of the traditional coding, build and release process. To stay ahead of the curve, development teams are modernizing their approach to dealing with these issues, and as a result are achieving new levels of development productivity. Register for the webcast.
    Date: Wednesday, July 15, 2009
    Time: 11 am PT/2 pm ET
                                   
    INFO-LINK

    Resource Links: