Tuesday, June 30, 2009

Clean Code – Programmers are Authors!

I recently read the book Clean Code, by Robert C. Martin, but I waited to comment on it until I saw how it fundamentally changed how I write code. I see there are quite a few blogs discussing it already, but I wanted to give my take on it, specifically on its short-term effects, as I hope other aspects will affect me even more in the long run (like improving my unit testing!).

The first powerful principle, laid down at the beginning, is that “We are authors,” and he develops this point in two key ways that I thought I would share:

1. We read a lot of code.

Yes, it's true. Whether, like me, you are now reading others' code because you have inherited it, or you need to re-read your own code that you wrote six months ago and cannot remember, we are always reading code. And when we write complex code, long functions, etc., even our own code takes far too long to re-read before we remember what it does.

2. Our code is the only reliable way to know what the code does. Comments are unreliable. Yes, it's true – I see it in our application. I have seen someone from our team change code and leave the now-irrelevant comment in place. So, what does that mean? That the only effective and reliable way to document code is to write it in a way that is readable. (Yes, there is a place for comments on a class, but on specific lines of code, no.)
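To illustrate the point (a made-up example of mine, not one from our application or the book): a comment that was true when written quietly becomes a lie after the code changes, while a descriptive name and a named constant cannot drift.

```java
public class CommentDrift {

    // Waits 30 seconds before retrying   <-- stale: the comment was never updated
    static int retryDelayMillis() {
        return 60_000; // someone later changed the delay to a minute; the comment still says 30
    }

    // The self-documenting version: the name and the constant carry the intent,
    // so there is no comment to go stale
    static final int RETRY_DELAY_SECONDS = 60;

    static int retryDelayInMillis() {
        return RETRY_DELAY_SECONDS * 1000;
    }

    public static void main(String[] args) {
        System.out.println(retryDelayMillis());
        System.out.println(retryDelayInMillis());
    }
}
```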

So, how do we do that? Well, go ahead and read the book! But I will start by saying: keep functions short, and keep function names descriptive so that the name alone makes it clear what the function will do (and, of course, don't do too much in one function).
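As a toy illustration of that advice (my own sketch, not an example from the book): the same logic once as a single anonymous loop, and once split into small functions whose names let the top level read like prose.

```java
import java.util.List;

public class ReportExample {

    // Before: one function doing three jobs; you must read the body to know what it does
    static String report(List<Integer> orders) {
        int total = 0;
        for (int amount : orders) {
            if (amount > 0) {
                total += amount;
            }
        }
        return "Total: " + total;
    }

    // After: each function's name says what it does
    static String buildTotalReport(List<Integer> orderAmounts) {
        return formatTotal(sumValidOrders(orderAmounts));
    }

    static int sumValidOrders(List<Integer> orderAmounts) {
        return orderAmounts.stream().filter(a -> a > 0).mapToInt(Integer::intValue).sum();
    }

    static String formatTotal(int total) {
        return "Total: " + total;
    }

    public static void main(String[] args) {
        List<Integer> orders = List.of(10, -5, 20);
        System.out.println(report(orders));
        System.out.println(buildTotalReport(orders));
    }
}
```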

Tuesday, June 16, 2009

Irregular Expressions

A friend sent me the following code, which did not work as he expected:

public static void main(String[] args) {

    String str = " status=\"The word deleted is in this sentence and no carriage return";
    String str1 = " status=\"The word deleted is in this sentence and carriage return\n";

    assert str.matches(".*deleted.*");
    assert str1.matches(".*deleted.*") : "carriage return threw off the regex";
}
If you run this code (with assertions enabled via -ea), you will see that the first assertion passes but the second fails: by default, “.” does not match a line terminator. Of course, a little looking around helped us realize that this is documented:

The regular expression . matches any character except a line terminator unless the DOTALL flag is specified

In other words, we need to use the ugly Pattern class in order to make this work, as follows:

public static void main(String[] args) {

    String str = "The word deleted is in this sentence and no carriage return";
    String str1 = "The word deleted is in this sentence and carriage return\n";

    Pattern p = Pattern.compile(".*deleted.*", Pattern.DOTALL);
    Matcher m = p.matcher(str);
    assert m.matches();
    m = p.matcher(str1);
    assert m.matches();
}

The key to making this work is the parameter Pattern.DOTALL used in the compile.
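If you would rather not construct a Pattern object explicitly, the same flag can be switched on inline with (?s) at the start of the expression – this is standard java.util.regex behavior:

```java
public class DotallDemo {
    public static void main(String[] args) {
        String str1 = "The word deleted is in this sentence and carriage return\n";

        // Without DOTALL, '.' will not match the trailing line terminator,
        // so the whole-string match fails
        System.out.println(str1.matches(".*deleted.*"));     // false

        // (?s) turns on DOTALL for the whole expression
        System.out.println(str1.matches("(?s).*deleted.*")); // true
    }
}
```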

And how about in Groovy? It works, of course!

String str = " status=\"The word deleted is in this sentence and no carriage return";
String str1 = " status=\"The word deleted is in this sentence and carriage return\n";
assert str =~ /.*deleted.*/
assert str1 =~ /.*deleted.*/

As has been written about extensively, one of the really well done features of Groovy is its regular expression syntax. One caveat, though: the reason this works is that the =~ operator performs a find rather than a whole-string match, so the trailing newline never has to be matched by “.*”. For a whole-string match, Groovy's ==~ operator behaves like Java's matches() and hits the same DOTALL issue.

Batch loading to the database – Part 2 – SimpleJdbcInsert

As mentioned in my last post, we will now look at a method for inserting a delimited file into a database that does not involve our DBAs.

In Spring 2.5, the SimpleJdbcInsert class was added, making this task especially easy; the added bonus is that we get to see the elegance of Groovy combined with this class.

Here is the code.

   1: class DBInsert {   
   2:     
   3: DataSource ds;   
   4: SimpleJdbcInsert jdbcInserter;   
   5: int numRowsPerBatchInsert = 500;   
   6: int numRowsProcessed = 0;   
   7: List fieldNames = ['Column1', 'Column2', 'Column3']   
   8: List<Map<String, Object>> rowsToInsert = [];   
   9: java.sql.Date today;   
  10: final static String SchemaName = "myschema";   
  11:     
  12: public void init() {   
  13:   jdbcInserter = new SimpleJdbcInsert(ds).withSchemaName(SchemaName).withTableName("myTable");   
  14:   today = new java.sql.Date(System.currentTimeMillis())   
  15: }   
  16:     
  17: /**   
  18:  * We assume that the row is delimited by commas.   
  19:  * This can be refactored to take some other delimiter as second parameter   
  20:  */   
  21: public void insertRow(String s) {   
  22:   if (numRowsProcessed++ > 0) {   
  23:     def rowAsList = s.split(",")   
  24:     
  25:     def iterator = rowAsList.iterator()   
  26:     Map<String, Object> insertMap = ['eff_dte': today]   
  27:     fieldNames.each {   
  28:       insertMap[it] = iterator.next().trim()   
  29:     }   
  30:     rowsToInsert += insertMap;   
  31:     if (numRowsProcessed % numRowsPerBatchInsert == 0) {   
  32:       completeInsert();   
  33:     }   
  34:   }   
  35: }   
  36:     
  37: public void completeInsert() {   
  38:   jdbcInserter.executeBatch((Map<String, Object>[]) rowsToInsert.toArray());   
  39:   rowsToInsert.clear();   
  40: }   
  41: }
Right off the bat, one nice thing we get in Groovy is that getters and setters are generated for us, automatically making the file smaller. But besides that, lines 26-29 contain the concise yet very readable Groovy closure that takes the column names and builds a map from them to the values of the specific line. And this little class gives us what we need with barely any parsing and NO hand-written SQL. (Note that the caller is expected to invoke completeInsert() one final time to flush any rows left over after the last full batch.)
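For readers without Groovy handy, the mapping step in lines 26-29 looks roughly like this in plain Java (the Column1..Column3 field names are the placeholder names from the listing):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RowMapperDemo {

    static final List<String> FIELD_NAMES = List.of("Column1", "Column2", "Column3");

    // Split a comma-delimited line and pair each value with its column name,
    // preserving column order
    static Map<String, Object> toInsertMap(String line) {
        Map<String, Object> insertMap = new LinkedHashMap<>();
        String[] values = line.split(",");
        for (int i = 0; i < FIELD_NAMES.size(); i++) {
            insertMap.put(FIELD_NAMES.get(i), values[i].trim());
        }
        return insertMap;
    }

    public static void main(String[] args) {
        System.out.println(toInsertMap("a, b ,c"));
    }
}
```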

Monday, June 8, 2009

Batch loading to the database – Part 1 – External Tables

At our company, we have a large amount of data that we load each day to our database. Some of this data can be a few million lines. Since our database is Oracle (11g), there are two very elegant and efficient ways to get this done. The first is via Oracle's SQL*Loader utility, which has many options and configurations to optimize the load. However, we have seen that unless you disable the indexes on the tables you are loading into, the process can still be quite taxing on the database.

Another quite elegant method I recently learned about (new as of 10g) is the use of external tables. The data sits as a file on the disk of the Oracle database server but is accessible using normal SQL. You can then manipulate and transform this data using an “insert select” to put the data in the correct table, which is especially nice if the data requires some transformation before it is in its final format. The simplest case is a CSV file, as you see in this example. However, it gets better: you can even use a pure XML file, which may even have headers and footers that you choose to ignore (as we had), all easily set in the CREATE TABLE DDL.

The following example shows how we did this.

   1: CREATE TABLE titan.external_table1
   2:     (field1                        VARCHAR2(20),
   3:     field2                         VARCHAR2(20),
   4:     field3                         VARCHAR2(20),
   5:     field4                         VARCHAR2(1),
   6:     description                    VARCHAR2(2000))
   7:   ORGANIZATION EXTERNAL (
   8:    DEFAULT DIRECTORY  EXT_TAB_DIR
   9:     ACCESS PARAMETERS(records delimited by "</XMLBEGINTAG>"
  10: badfile EXT_TAB_DIR:'EXCEPTION.bad'
  11: logfile EXT_TAB_DIR:'EXCEPTION.log'
  12: fields
  13: (
  14: filler char(2000) terminated by "<XMLBEGINTAG>",
  15: field1 char(20) enclosed by "<FIELD1>" and "</FIELD1>",
  16: field2 char(20) enclosed by "<FIELD2>" and "</FIELD2>",
  17: field3 char(20) enclosed by "<FIELD3>" AND "</FIELD3>",
  18: field4 char(1) enclosed by "<FIELD4>" AND "</FIELD4>",
  19: description char(2000) enclosed by "<DESCRIPTION>" AND "</DESCRIPTION>"
  20: )
  21: )

The syntax you see on line 7 is what informs Oracle that this is going to be an external table; line 8 indicates the name of the Oracle directory (see the link above for more about that). Line 14 tells Oracle that everything before the first tag in the file (an XML header) should be treated as filler and ignored. And then we define the begin and end tags of each field.

As I started work on another project (not at work, and thus not using Oracle), I saw that the external table feature exists even in HSQLDB (the default Grails database). Using this feature, I was able to find a simple way to load an encoded CSV file into my database and have all the text appear correctly in my Grails application when displaying the data.

The syntax is straightforward and well documented, and here it is for my example:

SET TABLE CC_TRANSACTION SOURCE "src/outFile.txt;fs=|;ignore_first=true;encoding=Windows-1255"

The line above indicates where the file is, what the delimiter is (|), and the encoding. This table now appears as a regular table in the schema and can be used to join with or feed inserts into other tables.

Not all companies are likely to allow access to the Oracle server filesystem, so in the next post I hope to discuss an alternative that works with any database. It also shows the beauty of Groovy syntax together with the ease of use of Spring.

Tuesday, June 2, 2009

FtpClient – the advantage of open source code

Yesterday, it was brought to my attention that one of our apps, whose job is to retrieve files via FTP and process them, had not been processing files for the past few days (OK, a bit longer, and I should NOT have ignored those emails). A look at the stack trace showed the problem was not in our code (even though we HAD just released new code around the time the problem started – whew!).

After investigating, we found that our customer seems to have changed the configuration of their FTP server. If you ever manually FTP somewhere, you see a line that usually looks something like this:

230 User anonymous logged in.
Remote system type is UNIX.

Well, unfortunately, when we logged in to this customer, the message was a bit different:

230 User anonymous logged in.
Remote system type is XXXXXXXXXXXX.

As it turns out, this line about the remote system type is very important to our FTP library – Commons Net's FTPClient. A look at the source code, guided by the stack trace, showed that this string determines which parser is used, and if no match is found, a ParserInitializationException is thrown. The exact error is:

ParserInitializationException: Unknown parser type:

This seems to be the intent of the designer, even though, in my opinion, if a regular FTP client can access the site, FTPClient should be able to as well. A bit of looking around on the net did not help, so we settled on a simple and elegant solution.

We subclassed the DefaultFTPFileEntryParserFactory class and added the following code:

public class MyFTPFileEntryParserFactory extends DefaultFTPFileEntryParserFactory {
    @Override
    public FTPFileEntryParser createFileEntryParser(String key) {
        if (key.equals("XXXXXXXXXXXX")) {
            key = FTPClientConfig.SYST_UNIX;
        }
        return super.createFileEntryParser(key);
    }
}

Now, the only issue left was to tell the application to use our parser instead. There are two ways to do this: one is to use listFiles and pass the parser; in our case it was easier, when creating the FTPClient (in Spring, of course), to call the parser-factory setter with our class.
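For the Spring route, the wiring is a one-line property on the client bean. A minimal sketch, assuming the package com.example and these bean names (both hypothetical) – FTPClient exposes a setParserFactory setter, so the property name is parserFactory:

```xml
<bean id="parserFactory" class="com.example.MyFTPFileEntryParserFactory"/>

<bean id="ftpClient" class="org.apache.commons.net.ftp.FTPClient">
    <property name="parserFactory" ref="parserFactory"/>
</bean>
```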

This is a great example of the convenience of using Spring here. All you do is add your new class to the classpath and update the application context!

I will add that this is not the first time we have been bitten by this library. Last February 29 we were hit by this issue.