Anatomy of an Output Class
What should be considered when one is given the task of generating PDF reports? Daryl will explain some aspects that come with this daunting task.
Upon changing employers recently, I found that I had inherited a rat’s nest of code used to generate project reports in PDF form on the fly. For each of our clients, there were numerous reports, all broken down into individual files with a staggering amount of code duplication and general redundancy. And I had the dubious honor of combing through these files in order to add columns to some tabular reports.
Generating PDFs is nothing like generating HTML. Everything you print to the page is pegged to coordinates on a grid rather than floating freely on the page within vague dimensional constraints. And so adding a column to an existing tabular report in PDF can be a nightmare.
After spending a frustrating morning crossing my eyes at thousands of lines of code while counting columns and calculating character widths and changing coordinates until the adjustments were finally correct, I decided to write a class that would handle all of this drudgery for me. What’s more, I decided that the class should be useful for more than PDF output, and I built in support for HTML, XLS, and XML output as well.
The resulting code itself, while sufficient to handle the bulk of my needs, isn’t quite ready for public consumption yet, but some of the choices (and mistakes) I made while developing it might prove useful to others working on similar tasks. Accordingly, I offer here an account of how I brought an output class to life.
PDF Basics
Since PDF output was what prompted me to put this class together, let’s start with some PDF basics. First off, while I’ve seen utilities that’ll generate PDFs from scratch by writing raw PDF output, I had PDFlib at my disposal and made good use of PHP’s functions that interact with the library. Among these are functions that allow you to open and close a PDF, write text, set the font specifications, calculate string widths, move the cursor around, draw lines, insert page breaks, and add hyperlinks. I’m not going to go into code specifics here; for that, I recommend php.net’s handy dandy function reference.
To create a PDF, you have to give it a page size and then start moving the cursor around and writing and drawing. You open a page, add what text and drawings you wish to add, close the page, and either move on to the next page or close the PDF.
Since I pull my data from databases, my initial impulse was to pass my class an array of database row arrays. So while I was looping through my result set, I was performing any data validation and crunching and then pushing the current row onto an array, which I then passed to the output class. Our reports tend to have column headers on each page, and these tend to be hyperlinks that allow you to sort the results by column. It was necessary, then, to detach the column headers from the rest of the data so that they could be added to each page.
“Simple,” I thought. “I’ll just allow the user to set a flag determining whether or not column headers are being printed and, if so, print the first row of the data array on each page.”
Of course, extra code to force the class to print the links was also necessary. And I wrote the whole class with these conventions in mind. Then I came across a report that had hyperlinks in places besides the column header rows. The whole first column might contain links, for example, or the 19th column. Further, I discovered that some of our reports had bar graphs on the back page that were generated by taking counts of certain fields within the data set. Approaching the data as a set of rows was going to be cumbersome at best given the way PDFs handle the insertion of hyperlinks, and I hadn’t even given consideration to graphs at this point. I saw myself starting to write lots of spaghetti code in order to format my data before passing it to the class, and I decided there had to be a better way.
The Data Wrapper
Enter the Datum class. Datum is the singular of data, and I named the class on the premise that if you have an array of objects that format data, the array might be called “data,” but each object would be a datum. (Next, perhaps I’ll write a grammar class.) In order to provide abstraction for my row data and my graph data (and after some trial and error with real-life application of the class), I decided that a datum object should hold the following information:
text (A label for graphs or a value for row data)*
link (The link, if applicable)
color (An array of RGB color values from 0 to 1, suited more to PDF output than otherwise)
row (The row number for this object, probably generated by a count)
col (The col number for this object, probably generated by a count)
val (The value, most useful for graph data)*
pdf_width (The PDF width for the given string)
textarray (If the PDF width is too wide for the given column, an array of the text split into lines that fit the column)
The Datum class includes a number of get and set methods that allow external code to manipulate these values. It’s a small class. While looping through my database results, for each piece of data — that is, for each column within each row — I create a Datum object, using row and column counts to set the coordinates for each object. This allows me also to specify links for any given row or column, or any given piece of data. For example, I could do a modulus of the column and row counts and make every fifth column in every third row a hyperlink. Or I could check the text value for each object and make each object that reads “PHP Development” link over to devarticles.com. And since I’m packing all this information into Datum objects, I can pass those off very easily to any other class or set of functions I wish. That said, I did tailor the Datum class to be especially useful with the Output class. I’m not likely to need a PDF width for HTML output; nor am I likely to need to break long values into rows and pass that textarray attribute to the XML Output interface.
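A minimal sketch of what such a class might look like follows. The real Datum class is not published, so everything here is an assumption: the property names come from the list above, the constructor argument order (text, link, color, row, col) is inferred from the constructor call in the article's sample code, and the method names (get_text(), has_link(), set_pdf_width(), and so on) are hypothetical.

```php
<?php
// Hypothetical sketch of a Datum-like value object; not the author's code.
class Datum
{
    private $text;
    private $link;
    private $color;               // array of three RGB values from 0 to 1, or empty
    private $row;
    private $col;
    private $val;
    private $pdf_width = 0.0;     // filled in later by the PDF output code
    private $textarray = array(); // wrapped lines, if the text is too wide

    // Assumed argument order: text, link, color, row, col, then an optional value.
    public function __construct($text, $link = "", $color = "", $row = 0, $col = 0, $val = null)
    {
        $this->text  = (string) $text;
        $this->link  = (string) $link;
        $this->color = is_array($color) ? $color : array();
        $this->row   = (int) $row;
        $this->col   = (int) $col;
        $this->val   = $val;
    }

    public function get_text()  { return $this->text; }
    public function get_link()  { return $this->link; }
    public function has_link()  { return $this->link !== ""; }
    public function get_color() { return $this->color; }
    public function get_row()   { return $this->row; }
    public function get_col()   { return $this->col; }
    public function get_val()   { return $this->val; }

    public function get_pdf_width()       { return $this->pdf_width; }
    public function set_pdf_width($width) { $this->pdf_width = (float) $width; }

    public function get_textarray()             { return $this->textarray; }
    public function set_textarray(array $lines) { $this->textarray = $lines; }
}

// Usage: build one Datum per cell, linking any cell that reads "PHP Development".
$rows = array(
    array("PHP Development", "42"),
    array("Other", "7"),
);
$data = array();
foreach ($rows as $rowcount => $row) {
    foreach ($row as $colcount => $value) {
        $link = ($value === "PHP Development") ? "http://www.devarticles.com" : "";
        $data[] = new Datum($value, $link, "", $rowcount, $colcount);
    }
}
```

Keeping the coordinates on each object, rather than relying on array position, is what lets a link or color be attached to any individual cell.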
Class Architecture
Because I knew I’d ultimately be adding support for XML, XLS, and HTML and that some of the functionality of those sorts of output might overlap, I first wrote a base class named Output that contains properties and methods common to the four types of output. It’s here that I set font attributes, whether or not to display column headers, and other properties common to more than one output type. The function that passes the Datum object array to the class is also common to all types of output and can be found in the base class.
Then I split out four other classes that extend the Output class. They are PDF_Output, HTML_Output, XLS_Output, and XML_Output. These are what I instantiate when building a document. Because PDF_Output does things like calculate column widths and row counts, it’s by far the most complex of these classes, so I’ll devote the bulk of my essay to it. HTML_Output and XML_Output are pretty simple; they both loop through the data array and wrap the values with the appropriate markers to indicate HTML table elements or XML tags. When putting XLS_Output together, I cheated by using an existing class that actually writes the spreadsheet with the proper encoding and just passing my data to that class.
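The shape of that hierarchy can be sketched roughly as follows. This is an illustration under assumptions, not the actual class: cells are plain arrays with text, row, and col keys standing in for Datum objects, the set_show_headers() and rows() names are hypothetical, and the report, row, and col element names in the XML are invented for the example.

```php
<?php
// Hypothetical sketch: shared state in a base class, thin subclasses per format.
abstract class Output
{
    protected $data = array();
    protected $show_headers = false;

    public function set_data(array $data)   { $this->data = $data; }
    public function set_show_headers($flag) { $this->show_headers = (bool) $flag; }

    // Group the flat cell list into rows, ordered by row and then by column.
    protected function rows()
    {
        $rows = array();
        foreach ($this->data as $cell) {
            $rows[$cell['row']][$cell['col']] = $cell['text'];
        }
        ksort($rows);
        foreach ($rows as &$cols) {
            ksort($cols);
        }
        unset($cols);
        return $rows;
    }

    abstract public function execute();
}

class HTML_Output extends Output
{
    public function execute()
    {
        $html = "<table>\n";
        $first = true;
        foreach ($this->rows() as $cols) {
            // Treat the first row as column headers when that option is set.
            $tag = ($this->show_headers && $first) ? 'th' : 'td';
            $html .= "<tr>";
            foreach ($cols as $text) {
                $html .= "<$tag>" . htmlspecialchars($text) . "</$tag>";
            }
            $html .= "</tr>\n";
            $first = false;
        }
        return $html . "</table>\n";
    }
}

class XML_Output extends Output
{
    public function execute()
    {
        $xml = "<report>\n";
        foreach ($this->rows() as $cols) {
            $xml .= "<row>";
            foreach ($cols as $text) {
                $xml .= "<col>" . htmlspecialchars($text, ENT_XML1 | ENT_QUOTES) . "</col>";
            }
            $xml .= "</row>\n";
        }
        return $xml . "</report>\n";
    }
}

// Usage: the same cell array feeds either subclass.
$cells = array(
    array('text' => 'Name',  'row' => 0, 'col' => 0),
    array('text' => 'Hours', 'row' => 0, 'col' => 1),
    array('text' => 'R&D',   'row' => 1, 'col' => 0),
    array('text' => '12',    'row' => 1, 'col' => 1),
);
$o = new HTML_Output();
$o->set_show_headers(true);
$o->set_data($cells);
$html = $o->execute();
```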
Writing PDFs
Adding Links
I’ve mentioned already that one of the early challenges I faced was dynamically adding hyperlinks to the PDF. It’s a tricky thing to do because, unlike HTML, PDF associates links not with the words they’re attached to, but with grid coordinates. So in order to add a hyperlink, you have to know the starting x and y coordinates for the hot spot, and you have to know the width and height of the hot spot. If you want to have your hyperlinks appear as a different color than the standard text, you also have to do that manually. So the routine goes as follows:
1. Determine that a given piece of text should be a hyperlink.
2. Get its x and y coordinates and its height and width.
3. If need be, change the text color.
4. Write the text.
5. If need be, change the text color back.
6. Add the hyperlink based on the x and y coordinates and the width and height.
7. Move on to the next piece of text.
If you’re building your columns and rows manually with set widths and coordinates, this is no big deal, but remember that we’re putting all of this together dynamically, so we don’t know from the start where these coordinates are going to fall. At all times while I’m writing a page of the PDF, I keep track of the x and y coordinates of my cursor and how wide and tall each bit of text is. This makes it easy to add hyperlinks if they’ve been specified. As I move across the page, I increment my x value by the width of the given text plus any horizontal spacing and gutter; as I move down the page, I increment my y value by the height of the given text plus any vertical spacing. But how do we determine these widths on the fly?
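The bookkeeping in the seven-step routine can be sketched without any PDF library at all. In the illustration below, layout_row() is a hypothetical helper that walks one row of cells, advances an x cursor, and records the operations a real implementation would hand to the PDF library. The cell widths are supplied by the caller, and the link color (red) and default color (black) are assumptions.

```php
<?php
// Hypothetical sketch: track the cursor across one row and record where each
// link's hot spot falls. Nothing is drawn; the function only returns a list of
// operations that real code would pass on to the PDF library.
function layout_row(array $cells, $x_start, $y, $row_height, $x_spacing)
{
    $ops = array();
    $x = $x_start;
    foreach ($cells as $cell) {
        $is_link = ($cell['link'] !== "");            // step 1
        if ($is_link) {
            $ops[] = array('op' => 'set_color', 'rgb' => array(1, 0, 0)); // step 3
        }
        $ops[] = array('op' => 'text', 'text' => $cell['text'], 'x' => $x, 'y' => $y); // step 4
        if ($is_link) {
            $ops[] = array('op' => 'set_color', 'rgb' => array(0, 0, 0)); // step 5
            $ops[] = array(                                               // steps 2 and 6
                'op'     => 'link',
                'url'    => $cell['link'],
                'x'      => $x,
                'y'      => $y,
                'width'  => $cell['width'],
                'height' => $row_height,
            );
        }
        $x += $cell['width'] + $x_spacing;            // step 7: move to the next cell
    }
    return $ops;
}

// Usage: two cells, the second of which is a link.
$cells = array(
    array('text' => 'Status',  'width' => 40, 'link' => ''),
    array('text' => 'Project', 'width' => 60, 'link' => 'http://www.somewhere.com'),
);
$ops = layout_row($cells, 36, 100, 12, 6);
```

Because the cursor is updated after every cell, the hot spot for the second cell starts at 36 + 40 + 6 = 82 with no hand-counted coordinates.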
Calculating Column Widths
Calculating column widths and row heights was actually the hardest part of this project. On the surface, it seems fairly easy: Just take the page width, subtract from it the sum of each column’s widest entry, and divide the difference among the columns to get the gutter to place between them. This would be great if you had unlimited page width, but I’m usually working with legal-sized reports, some of which contain 20 or 30 columns, including one for comments of (potentially) several hundred characters.
It doesn’t take long in these circumstances for the total of the widest columns to exceed the page width, giving you negative gutter sizes, which makes for really ugly PDFs. There are two ways to handle this. The first is to compare the actual width with the available width and, if the actual exceeds the available, to truncate any column wider than an even share of the page, that is, the available width divided by the number of columns (less a little gutter). The second is to wrap long columns over multiple rows.
This is where the textarray element of the Datum object comes in handy, though this whole procedure has its own problems: For example, if your output includes horizontal and vertical rules separating cells, your calculations for those are thrown off by multiple-row-spanning data; pagination also becomes a little more difficult to manage; you still have to reconcile the actual width with the available width and adjust the whole grid accordingly. And of course there are still limits to how well this can work.
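As a rough illustration of how a textarray might be built, the sketch below breaks text on spaces into lines that fit a column. Both function names are hypothetical, and approx_width() is only a stand-in for the PDF library's string-width call: it assumes every character is the same width, which real fonts are not.

```php
<?php
// Stand-in for the PDF library's string-width function. Assumes a fixed width
// per character, which is a simplification made only for this sketch.
function approx_width($text, $char_width = 5.0)
{
    return strlen($text) * $char_width;
}

// Split text on whitespace into lines no wider than $col_width.
// A single word wider than the column is left on a line of its own.
function wrap_to_width($text, $col_width)
{
    $lines = array();
    $current = "";
    foreach (preg_split('/\s+/', trim($text)) as $word) {
        $candidate = ($current === "") ? $word : $current . " " . $word;
        if ($current !== "" && approx_width($candidate) > $col_width) {
            // The next word does not fit: close the current line and start another.
            $lines[] = $current;
            $current = $word;
        } else {
            $current = $candidate;
        }
    }
    if ($current !== "") {
        $lines[] = $current;
    }
    return $lines;
}

// Usage: a 60-unit column holds 12 characters at 5 units each.
$textarray = wrap_to_width("Client asked for a revised schedule", 60);
// $textarray is array("Client asked", "for a", "revised", "schedule")
```

The number of lines returned is also the height, in rows, that the cell will occupy, which is what complicates rules and pagination.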
There is always a point at which, no matter how many adjustments and calculations you do, the actual page width and the available page width just can’t be reconciled in a way that makes for pleasing output. I never said the class was perfect, but it does make for pretty rapid development of PDFs where circumstances and display are pretty mundane.
Before I move on, I want to address one more issue with column width calculation. It’s not always appropriate to adjust column widths uniformly. Imagine you’ve got 20 columns to display. Nineteen of them will contain one or two characters of text. The remaining column is a comments column that could contain several hundred characters. Imagine further that your available width is 1000 pixels and that the total of the widths of the widest columns is 1200 (so in our example, 19 of the columns are maybe 10 – 15 pixels apiece and the last column makes up the rest). Let’s go through the basic logic we’d use to reconcile the widths.
1. Actual width is 200 wider than available width, so we’ve got to subtract 200 from the overall width.
2. Divide 200 by 20 to deduct evenly from all columns. This’ll screw up the shorter columns and won’t take enough off the wide column. So instead:
3. Subtract only from the widest column. But that could force it to break into multiple rows, giving us 19 columns of equal size (and probably too wide for the text they actually contain) and one shortish column that spans, potentially, 5 or 10 rows — not the most appealing output. So instead:
4. Find the minimum possible width of each column based on the longest word unbroken by a space. Find the current width of each column. While a column’s current width exceeds its minimum possible width, subtract from it an amount proportional to its share of the full width, stopping when the column reaches its minimum. Break columns into rows or truncate as needed.
The result is that columns containing little text will be short and columns containing more text will be wider. We’re doing the best we can here to subtract proportionally from each column until it reaches its breaking point. Of course we’re still somewhat limited. For example, imagine your available page width was 1000, and you had 20 columns, each of whose longest unbreakable word was 52 pixels wide (like “supercalifragilisticexpialidocious” or “antidisestablishmentarianism”). The sum of the minimum possible widths in this case is 20 × 52 = 1040, and there’s simply nowhere else to subtract from. The output will go haywire. These limitations apply to PDFs created manually too, of course.
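One way to express the proportional subtraction in step 4 is sketched below. The function name shrink_columns() and its return shape are hypothetical, and the sketch makes one assumption the text does not spell out: once a column hits its minimum, the remaining excess is shared among the columns that can still shrink, in proportion to their current widths.

```php
<?php
// Hypothetical sketch of proportional column shrinking with per-column minimums.
// $widths and $mins are parallel arrays; $available is the usable page width.
function shrink_columns(array $widths, array $mins, $available)
{
    $eps = 1e-6;
    $excess = array_sum($widths) - $available;
    while ($excess > $eps) {
        // Only columns still above their minimum can give up any width.
        $shrinkable = array();
        foreach ($widths as $i => $w) {
            if ($w - $mins[$i] > $eps) {
                $shrinkable[$i] = $w;
            }
        }
        if (count($shrinkable) === 0) {
            break; // Every column is at its minimum; the page cannot fit.
        }
        $total = array_sum($shrinkable);
        foreach ($shrinkable as $i => $w) {
            $cut = $excess * $w / $total;              // proportional share of the excess
            $widths[$i] -= min($cut, $w - $mins[$i]);  // never go below the minimum
        }
        $excess = array_sum($widths) - $available;
    }
    return array('widths' => $widths, 'fits' => $excess <= $eps);
}

// Usage: two narrow columns and one wide comments column, 200 units too wide.
$result = shrink_columns(array(12, 12, 1176), array(10, 10, 50), 1000);
// The narrow columns give up 2 units each and the wide one gives up 196.

// The failure case from the text: 20 columns whose minimums sum to 1040.
$stuck = shrink_columns(array_fill(0, 20, 60), array_fill(0, 20, 52), 1000);
// $stuck['fits'] is false
```

Each pass either removes all of the excess or pins at least one more column to its minimum, so the loop runs at most once per column plus one.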
Adding Recurring Elements and Graphs
Our reports tend to have a number of recurring elements, including headers, page numbers, logos, and grid lines. Because PDF generation is page-centric — that is, because you open a page, do everything you wish to do on that page, close it, and open the next — it makes sense to add each recurring element to an array of recurring elements of its type and then to print these on each page by developing a function for that purpose that’s called during the generation of each page.
Grid lines are a little different in that they’re not added to an array. They’re drawn based on the coordinates of the current text element. Nevertheless, we write the code for drawing lines only once and let the looping and the math do the rest.
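The per-type arrays of recurring elements might be organized along these lines. This is a sketch only: the Recurring_Elements class and its method signatures are hypothetical (the real add_text() and add_image() methods take more arguments), and print_recurring() returns descriptions of what would be drawn rather than drawing anything.

```php
<?php
// Hypothetical sketch: one array per type of recurring element, replayed on
// every page. Drawing is replaced by returning a description of each element.
class Recurring_Elements
{
    private $texts  = array();
    private $images = array();

    public function add_text($text, $x, $y)
    {
        $this->texts[] = array('text' => $text, 'x' => $x, 'y' => $y);
    }

    public function add_image($path, $x, $y)
    {
        $this->images[] = array('path' => $path, 'x' => $x, 'y' => $y);
    }

    // Called once while each page is open, before the page is closed.
    public function print_recurring($page_number)
    {
        $ops = array();
        foreach ($this->images as $img) {
            $ops[] = "page $page_number: image {$img['path']} at ({$img['x']}, {$img['y']})";
        }
        foreach ($this->texts as $t) {
            $ops[] = "page $page_number: text \"{$t['text']}\" at ({$t['x']}, {$t['y']})";
        }
        return $ops;
    }
}

// Usage: register the elements once, then replay them for three pages.
$recurring = new Recurring_Elements();
$recurring->add_text("Quarterly Report", 36, 20);
$recurring->add_image("logo.png", 0, 562);
$log = array();
for ($page = 1; $page <= 3; $page++) {
    $log = array_merge($log, $recurring->print_recurring($page));
}
```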
The reports my company formerly wrote made use of a nifty class that draws graphs on the fly and returns PNG images that can be embedded into PDFs. The images are a little fuzzy, however, and they come with some overhead. So I decided to write my own graph handler that generates cleaner graphs complete with drop shadows and that, even better, interacts nicely with my Datum class.
As with adding recurring images or titles to the pages, you simply call the add_graph() function, passing in the array of Datum objects containing the graph info, and the Output class builds scaled graphs for you on the fly. Though I’ve got hooks in place that one day may allow for the insertion of graphs at any point within the reports, the behavior now is to add graphs at the end of the PDF. I usually build one array of Datum objects for grid data and another for graph data.
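The scaling step of a bar graph is simple enough to sketch. The function below is hypothetical and is not the author's graph handler: it takes label => value pairs standing in for Datum objects (text as the label, val as the value), assumes the values are not negative, and scales bar heights so the largest value fills the available height.

```php
<?php
// Hypothetical sketch: scale bar heights so the largest value fills $max_height.
// $points maps labels to non-negative values.
function scale_bars(array $points, $max_height)
{
    $max = (count($points) > 0) ? max($points) : 0;
    $bars = array();
    foreach ($points as $label => $val) {
        // Guard against an all-zero data set to avoid dividing by zero.
        $bars[$label] = ($max > 0) ? $val / $max * $max_height : 0;
    }
    return $bars;
}

// Usage: counts of a status field, drawn in a graph area 200 units tall.
$bars = scale_bars(array('Open' => 30, 'Closed' => 60, 'Deferred' => 15), 200);
// Open is half of the largest value, so its bar is 100 units tall.
```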
Putting it All Together
I’ve referred so far to a series of set and get methods used to initialize the Datum class. The Output class has its own such methods, and of course there are a number of other methods that do things like add recurring elements to their arrays and perform repetitive functions such as the printing of these recurring elements. But the real meat of the PDF_Output class is the execute method, whose basic flow is outlined below.
1. Open a PDF with dimensions x and y.
2. Begin the first page and set the font specifications. Print headers if we’ve got that option set.
3. For each Datum object, determine the PDF string width of its text and set that property of the Datum object to the value returned. Also count the number of rows and columns based on the “row” and “col” attributes of the Datum objects.
4. Find the minimum width for each column based on the longest single word. Also set the row height for each row to 1.
5. Determine the page width difference (available – actual) and distribute among columns proportional to their widths.
6. For each Datum object, [1] if we’re truncating, truncate the text and print the text and vertical lines, or [2] if we’re wrapping, wrap the given text as needed, print vertical lines, and set the cursor position to accommodate wrapping, pagination, etc.
7. If our adjusted row count is greater than the number of rows we’ve determined will fit on the page, print recurring elements, end the current page, start a new page, and reset the cursor to the origin. Print headers if we’ve got that option set.
8. At the end of the loop, end the page and move on to the graph code if any graphs have been set.
A lot more goes on in the code than what might be apparent from this brief outline, but you see the basic idea. One pitfall of the method I’ve chosen to do my output is that it involves looping through the dataset twice from within the class (and that on top of looping through the results once to build the Datum objects) — certainly not the most efficient routine, and less so the larger the data set. So convenience of up-front coding using this class is counterbalanced by less than optimal performance.
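The first of those two passes (steps 3 and 4 of the outline) might look something like the sketch below. It is an illustration under assumptions: measure() and estimate_width() are hypothetical names, cells are plain arrays with text, row, and col keys standing in for Datum objects, and estimate_width() uses a fixed width per character in place of the PDF library's string-width call.

```php
<?php
// Stand-in for the PDF library's string-width function (fixed width per
// character; an assumption made only for this sketch).
function estimate_width($text, $char_width = 5.0)
{
    return strlen($text) * $char_width;
}

// First pass: find each column's widest entry and its minimum width (the
// longest single word), and count rows and columns from the coordinates.
function measure(array $cells)
{
    $widest  = array();
    $minimum = array();
    $rows    = 0;
    foreach ($cells as $cell) {
        $c = $cell['col'];
        $current_widest = isset($widest[$c]) ? $widest[$c] : 0;
        $widest[$c] = max($current_widest, estimate_width($cell['text']));
        foreach (explode(' ', $cell['text']) as $word) {
            $current_min = isset($minimum[$c]) ? $minimum[$c] : 0;
            $minimum[$c] = max($current_min, estimate_width($word));
        }
        $rows = max($rows, $cell['row'] + 1);
    }
    return array(
        'rows'    => $rows,
        'cols'    => count($widest),
        'widest'  => $widest,
        'minimum' => $minimum,
    );
}

// Usage: a two-by-two grid with one long comments cell.
$cells = array(
    array('text' => 'ID',                    'row' => 0, 'col' => 0),
    array('text' => 'Comments',              'row' => 0, 'col' => 1),
    array('text' => '7',                     'row' => 1, 'col' => 0),
    array('text' => 'needs a second review', 'row' => 1, 'col' => 1),
);
$m = measure($cells);
// Column 1 is 105 units at its widest but can shrink to 40 ("Comments").
```

The second pass can only start once these totals are known, which is why the data has to be walked twice.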
For small-to-mid-sized reports, I find that this is a big time saver that lets me keep my sanity. For larger reports, the wait gets a little tiresome, especially since our larger reports are typically doing multiple complex queries to put the data set together. In one case, I converted an existing complex report to the Output class. In order to compensate for the slower load time, I also converted it from using an ODBC connection to using the Sybase driver and found that the final performance more or less matched the original (pre-Output class) performance.
To give you an idea of what kind of up-front code savings this class can lead to, I’m providing below some code that might be used to generate a simple PDF. Compare this tidy code to the couple of thousand lines of repetitive code that are now managed more concisely from within the class. This also gives you a chance to see the black box in action, though as I mentioned at the start, the code’s not yet ready to have the inside of the box exposed.
<?php
//Database connection stuff goes here...
$data=array(); // Will hold Datum objects
$rows1=array();
$rowcount=0;
$fontsize=9; //Assumed value; the original listing never shows where $fontsize is set.
//For each row in the result set...
//(Fetching until odbc_fetch_into() returns false avoids relying on
//odbc_num_rows(), which returns -1 for SELECTs with many drivers.)
while(odbc_fetch_into($result,$rows1)){
    $colcount=0;
    //For each column in the current row...
    foreach($rows1 as $r){
        //Perform any data validation, e.g. converting dates to a readable format.
        //Also, in this case, set links for every fifth column in every third row.
        if($rowcount % 3 == 0 && $colcount % 5 == 0){
            $link="http://www.somewhere.com";
        }
        else{
            $link="";
        }
        //Create a new Datum object.
        array_push($data, new Datum($r,$link,"",$rowcount,$colcount));
        $colcount++;
    }
    $rowcount++;
}
$o=new PDF_Output();
$o->set_link_color(1,0,0);
$o->set_page_height(612);
$o->set_page_width(1008);
$o->set_data($data); //This is where we send the $data array to the object.
$o->set_border(0.1);
$o->set_font_size($fontsize);
$o->set_x_margin(36);
$o->show_page_numbers("Page "); //"Page" here is a prefix for the actual page number and is optional
$o->set_font("Times-Roman");
$date=date('M d, Y');
$o->set_x_spacing(6); //Both horizontal and vertical spacing can be set
$o->set_col_wrap(1);
$o->add_text("Generated on " . $date, $o->get_page_width() - 160, 20, $o->get_font(), 9); //Add text blurb to repeating objects array.
$o->add_image(0, $o->get_page_height() - 50, "png", "/var/www/navreports/images/poweredby.png", "http://www.somewhere.com", 0.1); //Add image to repeating objects array.
$pdf=$o->execute(); //Put the returned PDF into a buffer.
$len=strlen($pdf);
header("Content-Type: application/pdf");
header("Content-Length: $len");
header("Content-Disposition: inline; filename=" . $o->get_filename());
print $pdf;
?>
Conclusion
Generating “spreadsheet” PDFs can be a real hassle, and I sought to relieve myself of that hassle by writing an Output class, which is itself far from perfect. For larger data sets or for PDFs that don’t follow the format of a (potentially) multi-page set of rows and columns of data, my class is of limited use. I suspect that a number of the methods I’ve used to perform certain tasks could be optimized or reworked to be more user-friendly, and I won’t release any code until I’ve had a chance to do that reworking.
I offer this summary of my efforts, with particular emphasis on the parts of the class that gave me problems, in hopes that it may be useful to others developing similar pieces of code. My primary flaw was one of design — I hadn’t fully considered some of the problems that might arise before I waded in and developed the first draft of the class. I have attempted here to expose some of the issues that escaped my initial attention and to offer groundwork toward workable solutions for those issues.
The end result has been very satisfying for me. Where before, the creation of a new PDF report meant copying an old report and spending a few hours doing the tedious work of tweaking widths and positions buried in a couple of thousand lines of code, I can now copy the above code sample, apply a few tweaks to handle data formatting and building of graphs, and be done with it.
Time for development of our routine reports has been reduced significantly, and code maintenance is much easier. The class is currently about 1150 lines. (The graph code itself is about 200 lines.) Of course, that count includes code for the other three types of output as well, along with ample comments. The code is cleaner and much more portable than what we had been using before building the class. With all its headaches and boils and bruises, the Output class has proven most beneficial.
Note
* In retrospect, the way the text and val fields share duties could stand to be changed. The value should always be in the val field instead of crossing over this way.



