What I like about Office 2007
I've been using Office 2007 for a while. Honestly, my initial reaction was:
1) things seemed harder to find - for example, how do you get to the Visual Basic macro editor in Excel 2007?
2) why would you use the new formats - docx, xlsx, and pptx? I figured that for years you would have to save files back to 97-2003 format if you wanted anybody else to read them.
I finally found out why #2 is an immediate benefit to me. Not to me as an end user, but as a software vendor. The sooner the 97-2003 format is gone, the better. (Yes, I'm sure the answer to that is "never", but one can hope).
Helpstream fully indexes binary documents that are in the knowledge base, are attached to cases or case history, or are attached to community discussions. Given our 100% Java/Linux architecture, the proprietary Office 97-2003 format documents present an annoying technical challenge. In past products, I've done as Joel Spolsky suggests (see "Let Office do the heavy work for you"): I've written C++/ATL code to automate the Office applications, extracting the text to feed into Lucene. This requires either running your app on Windows and using JNI from your Java app, or throwing a Windows box in the data center and making an RPC/Web Service network request. I went the JNI route in the past, and it worked well.
For Helpstream, however, I wanted to keep a 100% Linux architecture and still generate and index Office documents. Generation is pretty easy - for Word you can use open source packages to generate RTF. Extracting the text from the proprietary format is still a challenge. For Office 97-2003 docs, after some Google searching, I found an obscure 100% pure Java library from a commercial vendor named Davisor, which allowed me to convert the documents to XML and therefore extract the text.
With Office 2007 files, there is a new opportunity. Indexing those documents in Java becomes trivial! The .docx, .pptx, and .xlsx files are zip files that contain a bunch of files, among them some XML files. The following code will give you good input to Lucene for any Office 2007 document from pure Java:
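import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.Writer;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.apache.xpath.XPathAPI; // Apache Xalan (assumed; the post does not say which XPathAPI)
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.NodeIterator;

public void indexOffice2007Document(InputStream inputStream, Writer writer) throws Exception {
    ZipInputStream zis = new ZipInputStream(inputStream);
    ZipEntry zi = null;
    while ((zi = zis.getNextEntry()) != null) {
        String file = zi.getName();
        if (file != null && file.toLowerCase().endsWith(".xml")) {
            // copy the current zip entry to a temp file so it can be parsed on its own
            File tempFile = File.createTempFile("tmp", ".xml");
            try {
                byte[] chunk = new byte[8192];
                int bytesRead = 0;
                FileOutputStream fs = new FileOutputStream(tempFile);
                try {
                    while ((bytesRead = zis.read(chunk)) != -1) {
                        fs.write(chunk, 0, bytesRead);
                    }
                } finally {
                    fs.close(); // close even if the copy fails, so the temp file can be deleted
                }
                // index it: write out every text node, separated by spaces
                try {
                    Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(tempFile);
                    NodeIterator iter = XPathAPI.selectNodeIterator(doc.getDocumentElement(), "//text()");
                    Node node = null;
                    while ((node = iter.nextNode()) != null) {
                        String val = node.getTextContent();
                        if (val != null && val.length() > 0) {
                            writer.write(val);
                            writer.write(" ");
                        }
                    }
                } catch (Exception e) {
                    // ignore XML parse errors -- log if you want to
                }
            } finally {
                tempFile.delete();
            }
        }
    }
}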
Now, things are easy! No 3rd party licensing issues or support/upgrades to deal with, and no need to deploy a Windows box at all. And no need to dig up the C++/ATL skills that you had hoped you were done with for good when you last used them years ago. And I'm certain it scales better than automating Office COM objects (which are probably running out-of-process).
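If you would rather not depend on a separate XPathAPI class, the same idea can be sketched with nothing but classes that ship with the JDK. This is a minimal sketch under some assumptions, not a drop-in replacement: the class name Office2007TextExtractor and the helper readEntry are made up for illustration, each XML part is buffered in memory instead of a temp file, and DOCTYPE declarations are refused on the assumption that uploaded documents are untrusted.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical class name; not part of the original code.
final class Office2007TextExtractor {

    /** Writes the text nodes of every .xml entry in the zip, each followed by a space. */
    static void extractText(InputStream in, Writer writer) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Assumption: documents are untrusted, so refuse DOCTYPE declarations.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        XPath xpath = XPathFactory.newInstance().newXPath();

        ZipInputStream zis = new ZipInputStream(in);
        ZipEntry entry;
        while ((entry = zis.getNextEntry()) != null) {
            if (!entry.getName().toLowerCase().endsWith(".xml")) {
                continue; // skip images, .rels files, and other non-XML parts
            }
            // Buffer the entry in memory so the parser never touches the zip stream itself.
            byte[] xml = readEntry(zis);
            try {
                Document doc = builder.parse(new ByteArrayInputStream(xml));
                NodeList texts = (NodeList) xpath.evaluate("//text()", doc, XPathConstants.NODESET);
                for (int i = 0; i < texts.getLength(); i++) {
                    String val = texts.item(i).getNodeValue();
                    if (val != null && !val.isEmpty()) {
                        writer.write(val);
                        writer.write(" ");
                    }
                }
            } catch (SAXException e) {
                // Not well-formed XML: skip this entry and keep going.
            }
        }
        writer.flush();
    }

    /** Reads the rest of the current zip entry into a byte array. */
    private static byte[] readEntry(InputStream zis) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int bytesRead;
        while ((bytesRead = zis.read(chunk)) != -1) {
            buffer.write(chunk, 0, bytesRead);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java Office2007TextExtractor <file.docx|.xlsx|.pptx>");
            return;
        }
        try (InputStream in = new FileInputStream(args[0]);
             Writer out = new OutputStreamWriter(System.out, StandardCharsets.UTF_8)) {
            extractText(in, out);
        }
    }
}
```

Buffering in memory keeps the sketch short, but a very large spreadsheet part could make the temp-file approach above the safer choice. The Writer passed in can be anything, for example a StringWriter whose contents you then hand to your indexer.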
