Monday, June 8, 2009

Batch loading to the database – Part 1 – External Tables

At our company, we have a large amount of data that we load each day to our database. Some of this data can be a few million lines. Since our database is Oracle (11g), there are two very elegant and efficient ways to get this done. The first is via Oracle sqlloader application, which has many options and configurations to optimize the load. However, we have seen that unless you relax indices on the tables you are loading to, the process can still be quite taxing on the database.

Another quite elegant method I recently learned about (new as of 10g), is the use of external tables. The data sits as a file on the disk of the Oracle database but is accessible using normal SQL. You can then manipulate and transform this data using an “insert select” to put the data in the correct table which is especially nice if the data requires some transformation before it is in it final format. The simplest example of this is using a CSV file as you see in this example. However, it gets better. You can even user a pure XML File, which may even have headers and footers that you choose to ignore (as we had), all easily set in the CREATE TABLE DML.

The following example shows how we did this.

   1: CREATE TABLE titan.external_table1
   2:     (field1                        VARCHAR2(20),
   3:     field2                         VARCHAR2(20),
   4:     field3                         VARCHAR2(20),
   5:     field4                         VARCHAR2(1),
   6:     description                    VARCHAR2(2000))
   7:   ORGANIZATION EXTERNAL (
   8:    DEFAULT DIRECTORY  EXT_TAB_DIR
   9:     ACCESS PARAMETERS(records delimited by "</XMLBEGINTAG>"
  10: badfile EXT_TAB_DIR:'EXCEPTION.bad'
  11: logfile EXT_TAB_DIR:'EXCEPTION.log'
  12: fields
  13: (
  14: filler char(2000) terminated by "<XMLBEGINTAG>",
  15: field1 char(20) enclosed by "<FIELD1>" and "</FIELD1>",
  16: field2 char(20) enclosed by "<FIELD2>" and "</FIELD2>",
  17: field3 char(20) enclosed by "<FIELD3>" AND "</FIELD3>",
  18: field4 char(1) enclosed by "<FIELD4>" AND "</FIELD4>",
  19: description char(2000) enclosed by "<DESCRIPTION>" AND "</DESCRIPTION>"
  20: )
  21: )

The syntax you see on Line 7 is what informs Oracle that this is going to be an external table line 8 indicates the name of the oracle directory (see the link above for more about that). Line 14 tells us that everything before the first tag in the file (an XML header) should be considered filler and ignored. And then we define the begin and end tag of each field.

As i started work on another project (not at work, and thus not using Oracle), i saw that the external table feature is even in HSQLDB (the default Grails database). Using this feature, i was able find a simple way to load an encoded CSV file to my database and have all the text appear correctly in my grails application when displaying the data.

The syntax is straightforward and well documented, and here it is for my example:

set TABLE CC_TRANSACTION SOURCE "src/outFile.txt;fs=|;ignore_first=true;encoding=Windows-1255

The line above indicates where the file is, what the delimiter is (|) and the encoding. This table now appears as a regular table in the schema and be used to join or manipulate data for insert elsewhere.

Not all companies are likely to allow access to the Oracle Server filesystem. So, in the next post I hope to discuss an alternative that would work with any database. It also shows the beauty of the groovy syntax together with the ease of use of Spring.

No comments:

Post a Comment