The New Halloween Document
How to use the HSSF API
Capabilities
This release of the how-to outlines functionality for the current svn trunk. Those looking for information on previous releases should look in the documentation distributed with that release.
HSSF allows numeric, string, date or formula cell values to be written to or read from an XLS file. This release also includes row and column sizing, cell styling (bold, italics, borders, etc.), and support for both built-in and user-defined data formats. An event-based API for reading XLS files is also available. It differs greatly from the read/write API and is intended for intermediate developers who need a smaller memory footprint.
Different APIs
There are a few different ways to access the HSSF API. They have different characteristics, so read about each of them before choosing the one that suits you best.
General Use
User API (HSSF and XSSF)
Writing a new file
The high level API (package: org.apache.poi.ss.usermodel) is what most people should use. Usage is very simple.
Workbooks are created by creating an instance of org.apache.poi.ss.usermodel.Workbook. Either create a concrete class directly (org.apache.poi.hssf.usermodel.HSSFWorkbook or org.apache.poi.xssf.usermodel.XSSFWorkbook), or use the handy factory class org.apache.poi.ss.usermodel.WorkbookFactory.
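Both ways of obtaining an empty workbook can be sketched as follows. This assumes POI 4.0 or later (with poi-ooxml on the classpath for XSSF), where WorkbookFactory.create(boolean) is available; the class and method names WorkbookCreation, createDirectly and createViaFactory are made up for the illustration.

```java
import java.io.IOException;

import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

// Hypothetical helper class, not part of POI.
class WorkbookCreation {
    /** Creates an empty workbook by naming the concrete class directly. */
    static Workbook createDirectly(boolean xlsx) {
        return xlsx ? new XSSFWorkbook() : new HSSFWorkbook();
    }

    /** Creates an empty workbook through the factory (true = XSSF, false = HSSF). */
    static Workbook createViaFactory(boolean xlsx) throws IOException {
        return WorkbookFactory.create(xlsx);
    }
}
```

Code written against the Workbook interface works unchanged with either implementation, which is the main reason to prefer the interface type for variables and parameters.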
Sheets are created by calling createSheet() from an existing instance of Workbook; the created sheet is automatically added in sequence to the workbook. Sheets do not in themselves have a sheet name (the tab at the bottom); you set the name associated with a sheet by calling Workbook.setSheetName(sheetindex,"SheetName",encoding). For HSSF, the name may be in 8bit format (HSSFWorkbook.ENCODING_COMPRESSED_UNICODE) or Unicode (HSSFWorkbook.ENCODING_UTF_16). The default encoding for HSSF is 8 bits per character. For XSSF, the name is automatically handled as Unicode.
Rows are created by calling createRow(rowNumber) from an existing instance of Sheet. Only rows that have cell values should be added to the sheet. To set the row's height, call setHeight(height) on the row object. The height must be given in twips (1/20th of a point). If you prefer, there is also a setHeightInPoints method.
Cells are created by calling createCell(column, type) from an existing Row. Only cells that have values should be added to the row. Cells should have their cell type set to either Cell.CELL_TYPE_NUMERIC or Cell.CELL_TYPE_STRING depending on whether they contain a numeric or textual value. Cells must also have a value set. Set the value by calling setCellValue with either a String or double as a parameter. Individual cells do not have a width; you must call setColumnWidth(colindex, width) (use units of 1/256th of a character) on the Sheet object. (You can't do it on an individual basis in the GUI either.)
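The sheet, row and cell steps can be put together in a few lines. This is a sketch that assumes a recent POI release (4.x or later), where setSheetName takes only the index and the name, and where the cell type follows from the value passed to setCellValue. The class name SheetRowCellSketch and the sample values are invented for the example.

```java
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;

// Hypothetical helper class, not part of POI.
class SheetRowCellSketch {
    /** Builds a one-sheet workbook in memory; the caller closes it. */
    static Workbook build() {
        Workbook wb = new HSSFWorkbook();
        Sheet sheet = wb.createSheet();       // appended at index 0
        wb.setSheetName(0, "Prices");         // the tab name is set on the workbook

        Row row = sheet.createRow(0);
        row.setHeightInPoints(30f);           // 30 points = 600 twips

        Cell label = row.createCell(0);
        label.setCellValue("Widget");         // a String value gives a string cell
        Cell price = row.createCell(1);
        price.setCellValue(9.95);             // a double value gives a numeric cell

        sheet.setColumnWidth(0, 20 * 256);    // 20 characters, in 1/256ths of a character
        return wb;
    }
}
```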
Cells are styled with CellStyle objects, which in turn contain a reference to a Font object. These are created via the Workbook object by calling createCellStyle() and createFont(). Once you create the object you must set its parameters (colors, borders, etc.). To set a font for a CellStyle, call setFont(fontobj).
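A short sketch of styling one cell, assuming POI 4.x or later (where borders are expressed with the BorderStyle enum and Font has setBold). StyleSketch is a made-up class name.

```java
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.ss.usermodel.BorderStyle;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.CellStyle;
import org.apache.poi.ss.usermodel.Font;
import org.apache.poi.ss.usermodel.Workbook;

// Hypothetical helper class, not part of POI.
class StyleSketch {
    /** Returns a workbook whose cell A1 is bold, italic and has a thin bottom border. */
    static Workbook build() {
        Workbook wb = new HSSFWorkbook();

        Font font = wb.createFont();             // fonts are created by the workbook
        font.setBold(true);
        font.setItalic(true);

        CellStyle style = wb.createCellStyle();  // and so are cell styles
        style.setFont(font);
        style.setBorderBottom(BorderStyle.THIN);

        Cell cell = wb.createSheet("Styled").createRow(0).createCell(0);
        cell.setCellValue("Total");
        cell.setCellStyle(style);                // the same style can be shared by many cells
        return wb;
    }
}
```

Creating one style and reusing it for many cells keeps the number of style records in the workbook small.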
Once you have generated your workbook, you can write it out by calling write(outputStream) from your instance of Workbook, passing it an OutputStream (for instance, a FileOutputStream or ServletOutputStream). You must close the OutputStream yourself. HSSF does not close it for you.
Here is some example code (excerpted and adapted from org.apache.poi.hssf.dev.HSSF test class):
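As a minimal sketch of the steps above (it is not the HSSF test class itself), the following writes a small multiplication table to an XLS file. It assumes POI 4.x or later on the classpath; the class name WriteWorkbook, the sheet name and the default file name are invented for the example.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;

// Hypothetical example class, not part of POI.
class WriteWorkbook {
    /** Writes a 10 x 5 multiplication table to the given .xls path. */
    static void write(String path) throws IOException {
        // try-with-resources closes the stream, because write() leaves it open
        try (Workbook wb = new HSSFWorkbook();
             OutputStream out = new FileOutputStream(path)) {
            Sheet sheet = wb.createSheet("Table");
            for (int r = 0; r < 10; r++) {
                Row row = sheet.createRow(r);
                for (int c = 0; c < 5; c++) {
                    // the int product is widened to double, giving a numeric cell
                    row.createCell(c).setCellValue((r + 1) * (c + 1));
                }
            }
            wb.write(out);
        }
    }

    public static void main(String[] args) throws IOException {
        write(args.length > 0 ? args[0] : "workbook.xls");
    }
}
```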
Reading or modifying an existing file
Reading in a file is equally simple. To read in a file, create a new instance of org.apache.poi.poifs.filesystem.POIFSFileSystem, passing in an open InputStream, such as a FileInputStream for your XLS, to the constructor. Construct a new instance of org.apache.poi.hssf.usermodel.HSSFWorkbook, passing the POIFSFileSystem instance to the constructor. From there you have access to all of the high level model objects through their accessor methods (workbook.getSheetAt(sheetNum), sheet.getRow(rownum), etc).
Modifying the file you have read in is simple. You retrieve the object via an accessor method, remove it via a parent object's remove method (sheet.removeRow(hssfrow)) and create objects just as you would if creating a new XLS. When you are done modifying cells, call workbook.write(outputstream) as you did above.
An example of this can be seen in org.apache.poi.hssf.usermodel.examples.HSSFReadWrite.
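For orientation, a reduced read-modify-write cycle might look like the sketch below. It is not the HSSFReadWrite example; it assumes POI 4.x or later, and the class name ReadModifyWrite and the replacement value are made up.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;

// Hypothetical example class, not part of POI.
class ReadModifyWrite {
    /** Reads the .xls at inPath, changes A1 on the first sheet, writes to outPath. */
    static void modify(String inPath, String outPath) throws IOException {
        try (InputStream in = new FileInputStream(inPath);
             HSSFWorkbook wb = new HSSFWorkbook(new POIFSFileSystem(in))) {
            Sheet sheet = wb.getSheetAt(0);
            Row row = sheet.getRow(0);
            if (row == null) {            // rows without cells are not stored
                row = sheet.createRow(0);
            }
            Cell cell = row.getCell(0);
            if (cell == null) {           // neither are cells without values
                cell = row.createCell(0);
            }
            cell.setCellValue("modified");
            try (OutputStream out = new FileOutputStream(outPath)) {
                wb.write(out);
            }
        }
    }
}
```

The null checks matter: getRow and getCell return null for anything that is not physically present in the file.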
Event API (HSSF Only)
The event API is newer than the User API. It is intended for intermediate developers who are willing to learn a little bit of the low level API structures. It's relatively simple to use, but requires a basic understanding of the parts of an Excel file (or willingness to learn). The advantage provided is that you can read an XLS with a relatively small memory footprint.
One important thing to note with the basic Event API is that it triggers events only for things actually stored within the file. With the XLS file format, it is quite common for things that have yet to be edited to simply not exist in the file. This means there may well be apparent "gaps" in the record stream, which you either need to work around, or use the Record Aware extension to the Event API.
To use this API you construct an instance of org.apache.poi.hssf.eventusermodel.HSSFRequest. Register a class you create that implements the org.apache.poi.hssf.eventusermodel.HSSFListener interface by calling HSSFRequest.addListener(yourlistener, recordsid). The recordsid should be a static reference number (such as BOFRecord.sid) contained in the classes in org.apache.poi.hssf.record. The trick is you have to know what these records are. Alternatively you can call HSSFRequest.addListenerForAllRecords(mylistener). In order to learn about these records you can either read all of the javadoc in the org.apache.poi.hssf.record package or you can just hack up a copy of org.apache.poi.hssf.dev.EFHSSF and adapt it to your needs. TODO: better documentation on records.
Once you've registered your listeners in the HSSFRequest object you can construct an instance of org.apache.poi.poifs.filesystem.POIFSFileSystem (see POIFS howto) and pass it your XLS file InputStream. You can either pass this, along with the request you constructed, to an instance of HSSFEventFactory via the HSSFEventFactory.processWorkbookEvents(request, filesystem) method, or you can get an instance of DocumentInputStream from POIFSFileSystem.createDocumentInputStream("Workbook") and pass it to HSSFEventFactory.processEvents(request, inputStream). Once you make this call, the listeners that you constructed receive calls to their processRecord(Record) methods with each Record they are registered to listen for until the file has been completely read.
A code excerpt from org.apache.poi.hssf.dev.EFHSSF (which is in CVS or the source distribution) is reprinted below with excessive comments:
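The listener below is a reduced sketch in the spirit of that example, not the EFHSSF source itself. It assumes POI 4.x or later, listens for all records and only reacts to sheet names, numeric cells and shared-string cells. The class name EventDump and the "row,column=value" line format are invented for the illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.poi.hssf.eventusermodel.HSSFEventFactory;
import org.apache.poi.hssf.eventusermodel.HSSFListener;
import org.apache.poi.hssf.eventusermodel.HSSFRequest;
import org.apache.poi.hssf.record.BoundSheetRecord;
import org.apache.poi.hssf.record.LabelSSTRecord;
import org.apache.poi.hssf.record.NumberRecord;
import org.apache.poi.hssf.record.Record;
import org.apache.poi.hssf.record.SSTRecord;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

// Hypothetical example class, not part of POI.
class EventDump implements HSSFListener {
    final List<String> lines = new ArrayList<>();
    private SSTRecord sst;   // shared string table, stored before the cell records

    @Override
    public void processRecord(Record record) {
        switch (record.getSid()) {
            case BoundSheetRecord.sid:   // one per sheet, carries the sheet name
                lines.add("sheet " + ((BoundSheetRecord) record).getSheetname());
                break;
            case SSTRecord.sid:          // remember the table for later lookups
                sst = (SSTRecord) record;
                break;
            case NumberRecord.sid: {     // numeric cell
                NumberRecord n = (NumberRecord) record;
                lines.add(n.getRow() + "," + n.getColumn() + "=" + n.getValue());
                break;
            }
            case LabelSSTRecord.sid: {   // string cell, stored as an index into the SST
                LabelSSTRecord l = (LabelSSTRecord) record;
                String text = sst.getString(l.getSSTIndex()).getString();
                lines.add(l.getRow() + "," + l.getColumn() + "=" + text);
                break;
            }
            default:                     // every other record is ignored here
                break;
        }
    }

    /** Streams the workbook once and returns the collected lines. */
    static List<String> dump(InputStream xls) throws IOException {
        EventDump listener = new EventDump();
        HSSFRequest request = new HSSFRequest();
        request.addListenerForAllRecords(listener);
        try (POIFSFileSystem fs = new POIFSFileSystem(xls)) {
            new HSSFEventFactory().processWorkbookEvents(request, fs);
        }
        return listener.lines;
    }
}
```

Registering with addListener(listener, sid) for just the record types of interest avoids the switch on every record, at the cost of one registration call per type.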
Record Aware Event API (HSSF Only)
This is an extension to the normal Event API. With this, your listener will be called with extra, dummy records. These dummy records should alert you to records which aren't present in the file (eg cells that have yet to be edited), and allow you to handle these.
There are three dummy records that your HSSFListener will be called with:
- org.apache.poi.hssf.eventusermodel.dummyrecord.MissingRowDummyRecord
This is called during the row record phase (which typically occurs before the cell records), and indicates that the row record for the given row is not present in the file.
- org.apache.poi.hssf.eventusermodel.dummyrecord.MissingCellDummyRecord
This is called during the cell record phase. It is called when a cell record is encountered which leaves a gap between it and the previous one. You can get several of these before the real cell record.
- org.apache.poi.hssf.eventusermodel.dummyrecord.LastCellOfRowDummyRecord
This is called after the last cell of a given row. It indicates that there are no more cells for the row, and also tells you how many cells you have had. For a row with no cells, this will be the only record you get.
To use the Record Aware Event API, you should create an org.apache.poi.hssf.eventusermodel.MissingRecordAwareHSSFListener, and pass it your HSSFListener. Then, register the MissingRecordAwareHSSFListener to the event model, and start that as normal.
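The wrapping step can be sketched as follows, assuming POI 4.x or later. The listener only counts the dummy records it receives; GapCounter and its fields are hypothetical names chosen for the example.

```java
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.hssf.eventusermodel.HSSFEventFactory;
import org.apache.poi.hssf.eventusermodel.HSSFListener;
import org.apache.poi.hssf.eventusermodel.HSSFRequest;
import org.apache.poi.hssf.eventusermodel.MissingRecordAwareHSSFListener;
import org.apache.poi.hssf.eventusermodel.dummyrecord.LastCellOfRowDummyRecord;
import org.apache.poi.hssf.eventusermodel.dummyrecord.MissingCellDummyRecord;
import org.apache.poi.hssf.record.Record;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

// Hypothetical example class, not part of POI.
class GapCounter implements HSSFListener {
    int missingCells;      // number of MissingCellDummyRecord callbacks seen
    int lastColumn = -1;   // last column reported by the most recent end-of-row record

    @Override
    public void processRecord(Record record) {
        if (record instanceof MissingCellDummyRecord) {
            missingCells++;
        } else if (record instanceof LastCellOfRowDummyRecord) {
            lastColumn = ((LastCellOfRowDummyRecord) record).getLastColumnNumber();
        }
        // real records arrive here too and could be handled as in the plain Event API
    }

    /** Streams the workbook through a MissingRecordAwareHSSFListener wrapper. */
    static GapCounter scan(InputStream xls) throws IOException {
        GapCounter counter = new GapCounter();
        HSSFRequest request = new HSSFRequest();
        // the wrapper injects the dummy records and forwards everything to counter
        request.addListenerForAllRecords(new MissingRecordAwareHSSFListener(counter));
        try (POIFSFileSystem fs = new POIFSFileSystem(xls)) {
            new HSSFEventFactory().processWorkbookEvents(request, fs);
        }
        return counter;
    }
}
```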
One example use for this API is to write a CSV outputter, which always outputs a minimum number of columns, even where the file doesn't contain some of the rows or cells. It can be found at /src/examples/src/org/apache/poi/examples/hssf/eventusermodel/XLS2CSVmra.java, and may be called on the command line, or from within your own code. The latest version is always available from subversion.
In POI versions before 3.0.3, this code lived in the scratchpad section. If you're using one of these older versions of POI, you will either need to include the scratchpad jar on your classpath, or build from a subversion checkout.
XSSF and SAX (Event API)
If memory footprint is an issue, then for XSSF you can get at the underlying XML data and process it yourself. This is intended for intermediate developers who are willing to learn a little bit of the low level structure of .xlsx files, and who are happy processing XML in Java. It's relatively simple to use, but requires a basic understanding of the file structure. The advantage provided is that you can read an XLSX file with a relatively small memory footprint.
One important thing to note with the basic Event API is that it triggers events only for things actually stored within the file. With the XLSX file format, it is quite common for things that have yet to be edited to simply not exist in the file. This means there may well be apparent "gaps" in the record stream, which you need to work around.
To use this API you construct an instance of org.apache.poi.xssf.eventusermodel.XSSFReader. This will optionally provide a nice interface on the shared strings table, and the styles. It provides methods to get the raw XML data from the rest of the file, which you will then pass to SAX.
This example shows how to get at a single known sheet, or at all sheets in the file. It is based on the example in svn: src/examples/src/org/apache/poi/examples/xssf/eventusermodel/FromHowTo.java.
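A reduced sketch of the all-sheets case follows; it is not the FromHowTo source. It assumes POI 4.x or later (where org.apache.poi.xssf.model.SharedStrings exists) and uses the JDK's own SAX parser. It handles only numeric and shared-string cells, ignoring inline strings and number formats; the class name SaxSheetReader and the "ref=value" output format are invented.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.model.SharedStrings;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, not part of POI.
class SaxSheetReader extends DefaultHandler {
    private final SharedStrings sst;
    private final List<String> cells = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private String ref;         // cell reference such as "A1"
    private boolean isShared;   // true when <v> holds a shared-string index

    private SaxSheetReader(SharedStrings sst) {
        this.sst = sst;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if ("c".equals(local)) {                      // <c r="A1" t="s">
            ref = atts.getValue("", "r");
            isShared = "s".equals(atts.getValue("", "t"));
        }
        text.setLength(0);                            // start collecting fresh text
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if ("v".equals(local)) {                      // <v> carries the stored value
            String value = text.toString();
            if (isShared) {
                value = sst.getItemAt(Integer.parseInt(value)).getString();
            }
            cells.add(ref + "=" + value);
        }
    }

    /** Streams every sheet of an .xlsx and returns "ref=value" for each stored cell. */
    static List<String> readAll(InputStream xlsx) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);              // so local names are reported
        try (OPCPackage pkg = OPCPackage.open(xlsx)) {
            XSSFReader reader = new XSSFReader(pkg);
            SaxSheetReader handler = new SaxSheetReader(reader.getSharedStringsTable());
            XMLReader parser = factory.newSAXParser().getXMLReader();
            parser.setContentHandler(handler);
            Iterator<InputStream> sheets = reader.getSheetsData();
            while (sheets.hasNext()) {
                try (InputStream sheet = sheets.next()) {
                    parser.parse(new InputSource(sheet));
                }
            }
            return handler.cells;
        }
    }
}
```

Only the shared strings table is held in memory; the sheet XML itself is consumed as a stream, which is where the memory saving comes from.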
For a fuller example, including support for fetching number formatting information and applying it to numeric cells (e.g. to format dates or percentages), please see the XLSX2CSV example in svn.
An example is also provided showing how to combine the user API and the SAX API by doing a streaming parse of larger worksheets and a traditional user-model parse of the rest of a workbook.
SXSSF (Streaming Usermodel API)
SXSSF (package: org.apache.poi.xssf.streaming) is an API-compatible streaming extension of XSSF to be used when very large spreadsheets have to be produced, and heap space is limited. SXSSF achieves its low memory footprint by limiting access to the rows that are within a sliding window, while XSSF gives access to all rows in the document. Older rows that are no longer in the window become inaccessible, as they are written to the disk.
You can specify the window size at workbook construction time via new SXSSFWorkbook(int windowSize), or you can set it per sheet via SXSSFSheet#setRandomAccessWindowSize(int windowSize).
When a new row is created via createRow() and the total number of unflushed records would exceed the specified window size, then the row with the lowest index value is flushed and cannot be accessed via getRow() anymore.
The default window size is 100 and defined by SXSSFWorkbook.DEFAULT_WINDOW_SIZE.
A windowSize of -1 indicates unlimited access. In this case all records that have not been flushed by a call to flushRows() are available for random access.
Note that SXSSF allocates temporary files that you must always clean up explicitly, by calling the dispose method.
SXSSFWorkbook defaults to using inline strings instead of a shared strings table. This is very efficient, since no document content needs to be kept in memory, but is also known to produce documents that are incompatible with some clients. With shared strings enabled, all unique strings in the document have to be kept in memory. Depending on your document content this could use a lot more resources than with shared strings disabled.
Please note that some features can still consume a large amount of memory: merged regions, hyperlinks, comments and the like are stored only in memory, and may therefore require a lot of it if used extensively.
Carefully review your memory budget and compatibility needs before deciding whether to enable shared strings or not.
The example below writes a sheet with a window of 100 rows. When the row count reaches 101, the row with rownum=0 is flushed to disk and removed from memory; when it reaches 102, the row with rownum=1 is flushed, and so on.
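A sketch of such a windowed write, assuming POI 4.x or later with poi-ooxml on the classpath. The class name WindowedWrite is made up, and the method returns the number of rows still reachable through getRow() purely so that the effect of the window can be observed.

```java
import java.io.IOException;
import java.io.OutputStream;

import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.util.CellReference;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;

// Hypothetical example class, not part of POI.
class WindowedWrite {
    /** Writes 1000 rows through a 100-row window; returns how many rows are still in memory. */
    static int write(OutputStream out) throws IOException {
        try (SXSSFWorkbook wb = new SXSSFWorkbook(100)) {   // keep 100 rows in memory
            try {
                Sheet sh = wb.createSheet();
                for (int rownum = 0; rownum < 1000; rownum++) {
                    Row row = sh.createRow(rownum);
                    for (int cellnum = 0; cellnum < 10; cellnum++) {
                        String address = new CellReference(rownum, cellnum).formatAsString();
                        row.createCell(cellnum).setCellValue(address);
                    }
                }
                int accessible = 0;
                for (int rownum = 0; rownum < 1000; rownum++) {
                    if (sh.getRow(rownum) != null) {        // flushed rows come back as null
                        accessible++;
                    }
                }
                wb.write(out);
                return accessible;
            } finally {
                wb.dispose();                               // delete the temporary files
            }
        }
    }
}
```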
The next example turns off auto-flushing (windowSize=-1), and the code manually controls how portions of data are written to disk.
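A sketch of manual flushing, under the same assumptions as before (POI 4.x or later). ManualFlush is a hypothetical class name; the method returns the index of the first row that is still accessible so the effect of flushRows can be checked.

```java
import java.io.IOException;
import java.io.OutputStream;

import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.xssf.streaming.SXSSFSheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;

// Hypothetical example class, not part of POI.
class ManualFlush {
    /** Writes 1000 rows with auto-flushing off; returns the first row index still in memory. */
    static int write(OutputStream out) throws IOException {
        try (SXSSFWorkbook wb = new SXSSFWorkbook(-1)) {    // -1 turns auto-flushing off
            try {
                SXSSFSheet sh = wb.createSheet();
                for (int rownum = 0; rownum < 1000; rownum++) {
                    Row row = sh.createRow(rownum);
                    row.createCell(0).setCellValue(rownum);
                    if (rownum % 100 == 0) {
                        sh.flushRows(100);                  // keep the last 100 rows, flush the rest
                    }
                }
                int first = 0;
                while (sh.getRow(first) == null) {          // skip rows already written to disk
                    first++;
                }
                wb.write(out);
                return first;
            } finally {
                wb.dispose();                               // delete the temporary files
            }
        }
    }
}
```

Flushing in batches like this is useful when a block of rows has to stay editable until some later point, for example until a subtotal for the block is known.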
SXSSF flushes sheet data to temporary files (one temp file per sheet), and these files can grow very large. For example, for 20 MB of CSV data the temporary XML can exceed a gigabyte. If the size of the temp files is an issue, you can tell SXSSF to use gzip compression:
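A sketch of switching compression on, assuming POI 4.x or later. The setting is applied before any sheet is created so that it covers every temp file; CompressedTemp and its parameters are names invented for the example.

```java
import java.io.IOException;
import java.io.OutputStream;

import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;

// Hypothetical example class, not part of POI.
class CompressedTemp {
    /** Writes the given number of rows, keeping the temp files gzip-compressed. */
    static void write(OutputStream out, int rows) throws IOException {
        try (SXSSFWorkbook wb = new SXSSFWorkbook(100)) {
            wb.setCompressTempFiles(true);   // temp files will be gzipped
            try {
                Sheet sh = wb.createSheet();
                for (int rownum = 0; rownum < rows; rownum++) {
                    sh.createRow(rownum).createCell(0).setCellValue("row " + rownum);
                }
                wb.write(out);
            } finally {
                wb.dispose();                // delete the temporary files
            }
        }
    }
}
```

Compression trades some CPU time for disk space; the resulting .xlsx file is the same either way.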
Low Level APIs
The low level API is not much to look at. It consists of lots of "Records" in the org.apache.poi.hssf.record.* package, and a set of helper classes in org.apache.poi.hssf.model.*. The record classes are consistent with the low level binary structures inside a BIFF8 file (which is embedded in a POIFS file system). You probably need the book "Microsoft Excel 97 Developer's Kit" from Microsoft Press in order to understand how these fit together (out of print but easily obtainable from Amazon's used books). To gain a good understanding of how to use the low level APIs, you should view the source in org.apache.poi.hssf.usermodel.* and the classes in org.apache.poi.hssf.model.*. You should read the documentation for the POIFS libraries as well.
Generating XLS from XML
If you wish to generate an XLS file from some XML, it is possible to write your own XML processing code, then use the User API to write out the document.
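As an illustration of the first option, the sketch below reads a small XML document with the JDK's DOM parser and writes its content through the User API. The XML layout (rows/row/cell elements), the class name XmlToXls and the sheet name are all made up for the example; real input will need its own mapping.

```java
import java.io.InputStream;

import javax.xml.parsers.DocumentBuilderFactory;

import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical example class, not part of POI.
class XmlToXls {
    /** Converts <rows><row><cell>..</cell></row></rows> (an invented layout) into a workbook. */
    static Workbook convert(InputStream xml) throws Exception {
        Element root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(xml).getDocumentElement();
        Workbook wb = new HSSFWorkbook();
        Sheet sheet = wb.createSheet("Imported");
        NodeList rows = root.getElementsByTagName("row");
        for (int r = 0; r < rows.getLength(); r++) {
            Row row = sheet.createRow(r);
            NodeList cells = ((Element) rows.item(r)).getElementsByTagName("cell");
            for (int c = 0; c < cells.getLength(); c++) {
                String text = cells.item(c).getTextContent().trim();
                Cell cell = row.createCell(c);
                try {
                    cell.setCellValue(Double.parseDouble(text));   // numeric if it parses
                } catch (NumberFormatException e) {
                    cell.setCellValue(text);                       // otherwise keep the text
                }
            }
        }
        return wb;   // the caller writes it out with wb.write(outputStream) and closes it
    }
}
```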
The other option is to use Cocoon. In Cocoon, there is the HSSF Serializer, which takes in XML (in the gnumeric format), and outputs an XLS file for you.
HSSF Class/Test Application
The HSSF application is nothing more than a test for the high level API (and indirectly the low level support). The main body of its code is repeated above. To run it:
- download the poi-alpha build and untar it (tar xvzf tarball.tar.gz)
- set up your classpath as follows:
  export HSSFDIR={wherever you put HSSF's jar files}
  export LOG4JDIR={wherever you put LOG4J's jar files}
  export CLASSPATH=$CLASSPATH:$HSSFDIR/hssf.jar:$HSSFDIR/poi-poifs.jar:$HSSFDIR/poi-util.jar:$LOG4JDIR/log4j.jar
- type: java org.apache.poi.hssf.dev.HSSF ~/myxls.xls write
This should generate a test sheet in your home directory called "myxls.xls".
- Type:
java org.apache.poi.hssf.dev.HSSF ~/input.xls output.xls
This is the read/write/modify test. It reads in the spreadsheet, modifies a cell, and writes it back out. Failing this test is not necessarily a bad thing. If HSSF tries to modify a non-existent sheet then this will most likely fail. No big deal.
HSSF Developer's Tools
HSSF has a number of tools useful for developers to debug/develop stuff using HSSF (and more generally XLS files). We've already discussed the app for testing HSSF read/write/modify capabilities; now we'll talk a bit about BiffViewer. Early on in the development of HSSF, it was decided that knowing what was in a record, what was wrong with it, etc. was virtually impossible with the available tools. So we developed BiffViewer. You can find it at org.apache.poi.hssf.dev.BiffViewer. It performs two basic functions and a derivative.
The first is "biffview". To do this you run it (assumes you have everything setup in your classpath and that you know what you're doing enough to be thinking about this) with an xls file as a parameter. It will give you a listing of all understood records with their data and a list of not-yet-understood records with no data (because it doesn't know how to interpret them). This listing is useful for several things. First, you can look at the values and SEE what is wrong in quasi-English. Second, you can send the output to a file and compare it.
The second function is "big freakin dump". Just pass a file and a second argument matching "bfd" exactly. This will make a big hexdump of the file.
Lastly, there is "mixed" mode which does the same as regular biffview, only it includes hex dumps of certain records intertwined. To use that just pass a file with a second argument matching "on" exactly.
In the next release cycle we'll also have something called a FormulaViewer. The class is already there, but it's not very useful yet. When it does something, we'll document it.
What's Next?
Further effort on HSSF is going to focus on the following major areas:
- Performance: POI currently uses a lot of memory for large sheets.
- Charts: This is a hard problem, with very little documentation.
by Andrew C. Oliver, Glen Stampoultzis, Nick Burch, Sergei Kozello