One Year at Zenoss

Yesterday marked my one-year anniversary at Zenoss. When I decided to take the job at Zenoss I knew it would be a challenging position, and time has proven this assumption correct. The new (to me, anyway) technology, the remote work environment and the realities of a startup company with limited resources have all added their own challenges to the job.

I found the Zenoss product itself to be more mature than I had expected. While it still has plenty of rough edges, especially in the user interface, it is a rather feature-rich application with a very large user base in both the free community and paid enterprise versions. There is a tremendous amount of work left to do on the application, but with each version we see a significant step forward and an expanded user interest because of the changes.

When I came to Zenoss, my new co-workers and I brought with us a culture of using agile development methodologies and processes to the company. We expected, and were asked by company management, to implement those processes at Zenoss. Over the past year, we have seen the right things happen from this shift in engineering culture:

  • development time has become more predictable
  • planning and priority selection take a much more important role than ever before, and not just in engineering (in fact, most importantly not in engineering, but at the executive level)
  • quality assurance is now a fundamental part of the engineering process, not a luxury for later
  • code and design quality has improved dramatically as the engineering team begins to understand how much control they really do have over the process

There still remain many challenges, but these are the same ones I have seen with all teams that struggle with the discipline needed for agile methodologies. Effectively breaking down large development tasks into two-week iterations remains a hurdle; I suspect at a fundamental level many of the developers do not fully believe this is possible, which is a long, long discussion. There was also a reluctance, until recently, to fully engage with teammates on performing effective code and design reviews. These will both improve over time as the team matures.

When I first started at Zenoss there were 5 developers in Annapolis, 2 developers in Austin and me, usually working remotely in Houston. Since then, the team dynamics have changed so that I and one other developer are remote most of the time, 3 developers are full-time in Austin and 1 in Annapolis. This team dynamic change has made the infrastructure needed to support remote development much more important, but at the same time it has made the remote developers somewhat more isolated since a bit more now happens only in one location. Since I travel to Austin frequently I can avoid the worst of this unavoidable trend, but it will likely always remain a challenge as long as we have remote employees.

On a personal level, working remotely full-time for this long has been much more difficult than I expected. I last did this much remote work in the mid-90s, and it was difficult to do then, as well. On the surface, I spend all of my work hours focused on some problem on the computer, but in reality much of my job is very social and without face time it starts to wear on me very quickly. Frequent visits to Austin help tremendously, but in reality this will likely always be an issue until I relocate.

I’ve encountered almost all of the traditional problems people have while working at home, especially keeping focus and balancing the time spent between work and home life. Focus is purely a self-discipline issue, but it is remarkable how much of this is provided by the routine of commuting to an office. Maintaining a good work-life balance is also tremendously hard when working at home, unless you dedicate a specific place in the house for work. It took a few months to really learn this lesson but eventually I bought a second desk and made myself an office separate from my normal home office. Before then, I found it way too easy to bring my laptop into the living room and I’d continue to check e-mail and work on problems throughout the evening. Sometimes I still do, but at least it is not a mindless occurrence.

Zenoss is a very small company, and naturally we have very limited resources. Of course, today’s economic environment makes this true for almost every single company. But more specifically, we find ourselves working around what is effectively a lack of resources in our IT department. I often have to build my own virtual machines at home, or buy little widgets to help productivity because I am remote. At some level, I don’t mind this, since it’s nice being able to actually work somewhere you can do this, rather than being told you can’t.

After a year, I’m still not a huge fan of the Python programming language, but I’ve come to accept it and I can be productive with it. Working with it feels a whole lot like I stepped back in time 15 years, and saying that is a great way to start an argument, so I’ll just leave it at that.

One great surprise from the past year is that I became the guru for Zenoss’s Windows monitoring technology. We have a custom implementation of MS-RPC that is based upon the Samba 4 source tree. On top of this layer, we have provided our own implementations of the DCOM, WMI, and Windows registry interfaces so we can call these services directly on any Windows device. This is a great feature, as it allows us to communicate directly to Windows devices from our Linux based product, without requiring a Windows device to be running our software just to provide Windows-based connectivity, unlike some other systems management products.

Over the next few months I will be taking on more and more of the implementation of the new Zenoss user interface, which is a great project that should dramatically improve both the user acceptance and the quality of the application. It also gives me more time to work on user interfaces, which is an area of software development I always enjoy but rarely get a chance to do.

Secure Windows Monitoring with Zenoss

Starting with version 2.3.x, Zenoss can monitor computers running Microsoft Windows with a variety of data collection protocols: SNMP, WMI over DCOM/MS-RPC and Perfmon over MS-RPC.

In Zenoss Core, the status of Windows services and the Windows Event Log are monitored using Windows Management Instrumentation (WMI) queries over the DCOM/MS-RPC protocol. In the MS-RPC implementation that Zenoss uses, authentication credentials are presented to the remote server using the Windows Challenge / Response (NTLM) mechanism. With this mechanism the actual password is never sent across the network. Instead, the server produces a “challenge” value, and the client must compute a response from that challenge using the password; only the response travels across the network.

NTLM authentication is the same mechanism that Windows devices themselves use for client/server communications, such as file sharing and remote administration.
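
The general shape of a challenge/response exchange can be sketched in a few lines of Python. This is only an illustration of the idea, not the NTLM algorithm itself: the real protocol uses its own hashes and message formats, and the function names, the SHA-256/HMAC construction and the 8-byte challenge size below are all assumptions chosen to keep the example short.

```python
import hashlib
import hmac
import os


def make_challenge():
    """Server side: produce a fresh random challenge for each login attempt."""
    return os.urandom(8)


def compute_response(password, challenge):
    """Client side: derive a response from the password and the challenge.

    Hypothetical construction (HMAC-SHA256 keyed by a hash of the password);
    real NTLM uses different primitives.
    """
    key = hashlib.sha256(password.encode("utf-8")).digest()
    return hmac.new(key, challenge, hashlib.sha256).digest()


def verify(stored_password, challenge, response):
    """Server side: recompute the expected response and compare in constant time."""
    expected = compute_response(stored_password, challenge)
    return hmac.compare_digest(expected, response)


challenge = make_challenge()                       # server -> client
response = compute_response("s3cret", challenge)   # client -> server
print(verify("s3cret", challenge, response))       # matching password
print(verify("wrong", challenge, response))        # mismatched password
```

Only the challenge and the response cross the wire in this sketch; the password itself stays on each end.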

Starting with version 2.3.x, Zenoss Enterprise gathers Perfmon data using the remote Windows registry API over the MS-RPC protocol. This technique is both more efficient and more secure than the mechanism used in earlier versions. It uses the same authentication mechanism as Zenoss’s WMI library, providing the same level of security.

Prior to version 2.3.x, Zenoss Enterprise used a different mechanism to collect Perfmon data from Windows devices. This mechanism used a utility known as winexe to remotely execute commands on the Windows device (in this case, the typeperf.exe Windows utility). Unfortunately, the winexe utility sends the username and password used for authentication across the network in clear text, providing a less than ideal configuration for security.

Zenoss users monitoring Windows devices should be running version 2.3.3 or newer for the best possible security when communicating with those devices.

Getting a Native Code Stack Trace from a Zenoss Daemon

Zenoss uses the Python programming language for the vast majority of its code, and all of the daemons and commands that run are Python scripts. Several daemons also make use of native code (i.e. code written in languages like C or C++ that must be compiled into object files and organized into libraries) to perform functions such as remote Windows and SNMP connectivity.

Occasionally, one of these daemons crashes in these native libraries, and not in the actual Python code. When this happens, the Python interpreter is unable to produce the relatively friendly stack trace that it would for pure Python code. For example, a crash in a Python script would produce something that looks familiar to most programmers:

Traceback (most recent call last):
  File "test.py", line 5, in ?
    z = y / x
ZeroDivisionError: integer division or modulo by zero

By contrast, if you have a crash inside of a native library you likely would not see anything more than Bus Error or a similar message, and often nothing at all — the daemon process will just exit. For example, here we have a dynamic library written in C with a single function: doit. This function will attempt to access a NULL pointer when called, which results in the following output:

$ python -i
Python 2.4.4 (#1, Feb 23 2009, 09:17:03) 
[GCC 4.0.1 (Apple Inc. build 5490)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from ctypes import *
>>> from ctypes.util import *
>>> lib = CDLL(find_library('test'))
>>> lib.doit()
Bus error
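
The same kind of crash can be reproduced without building a custom library. The sketch below is an assumption-laden stand-in for the doit example: it uses ctypes.string_at(0) to read from a NULL pointer, and runs it in a child interpreter so the calling script survives. It is written for a current Python 3 rather than the 2.4 interpreter shown above. On Linux and Mac OS X the child is normally killed by a signal and reports a negative exit status; on Windows, ctypes may instead turn the access violation into a Python exception.

```python
import subprocess
import sys

# Child program: ask ctypes to read a C string from address 0 (a NULL pointer).
CRASHING_CODE = "import ctypes; ctypes.string_at(0)"


def run_crashing_child():
    """Run the crashing snippet in a separate interpreter and report how it ended."""
    result = subprocess.run(
        [sys.executable, "-c", CRASHING_CODE],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    return result.returncode, result.stderr.decode("utf-8", "replace")


returncode, stderr_text = run_crashing_child()
print("exit status:", returncode)    # non-zero; negative means killed by a signal
print("stderr:", repr(stderr_text))  # typically no Python traceback for a native crash
```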

To get a stack trace from the native code, the Python interpreter must be run from the GNU debugger, or gdb. All of the Zenoss daemons share a common architecture in how they are started, so the process for running the daemon from within gdb will be similar no matter which daemon you use.

  1. Determine your ZENHOME directory location by running echo $ZENHOME from the shell prompt. In the remainder of this example, we will assume it is set to /Users/cgibbons/zenoss
  2. Pick the daemon you want to run. In this example, we will use zenwin — the daemon responsible for monitoring the state of services on Windows devices.
  3. Look at the actual daemon script:
    $ cat $ZENHOME/bin/zenwin
    #! /usr/bin/env bash
    #############################################################################
    # This program is part of Zenoss Core, an open source monitoring platform.
    # Copyright (C) 2007, Zenoss Inc.
    #
    # This program is free software; you can redistribute it and/or modify it
    # under the terms of the GNU General Public License version 2 as published by
    # the Free Software Foundation.
    #
    # For complete information please visit: http://www.zenoss.com/oss/
    #############################################################################
     
    . $ZENHOME/bin/zenfunctions
     
    PRGHOME=$ZENHOME/Products/ZenWin
    PRGNAME=zenwin.py
    CFGFILE=$CFGDIR/zenwin.conf
     
    generic "$@"
  4. Run the python interpreter in gdb:
    $ gdb python
    GNU gdb 6.3.50-20050815 (Apple version gdb-962) (Sat Jul 26 08:14:40 UTC 2008)
    Copyright 2004 Free Software Foundation, Inc.
    GDB is free software, covered by the GNU General Public License, and you are
    welcome to change it and/or distribute copies of it under certain conditions.
    Type "show copying" to see the conditions.
    There is absolutely no warranty for GDB.  Type "show warranty" for details.
    This GDB was configured as "i386-apple-darwin"...Reading symbols for shared libraries .... done
     
    (gdb)
  5. Set the program arguments. Note how the zenwin script above is used to build the actual argument string:
    (gdb) set args /Users/cgibbons/zenoss/Products/ZenWin/zenwin.py --configfile=/Users/cgibbons/zenoss/etc/zenwin.conf run -v10 -c
  6. Finally, run the daemon process within the debugger:
    (gdb) run
    Starting program: /Users/cgibbons/zenoss/bin/python /Users/cgibbons/zenoss/Products/ZenWin/zenwin.py --configfile=/Users/cgibbons/zenoss/etc/zenwin.conf run -v10 -c
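
Steps 3 through 6 can also be collapsed into a single command line. The helper below is a rough sketch, not part of Zenoss: the function name and its parameters are made up for illustration, and it assumes the layout seen in the zenwin example ($ZENHOME/bin/python, $ZENHOME/Products/<product>/<daemon>.py and $ZENHOME/etc/<daemon>.conf). It relies on gdb's --args option, which passes everything after the program name to the program being debugged; an older gdb may need the set args approach shown above instead.

```python
import posixpath


def gdb_command(zenhome, product_dir, daemon, extra_args=("run", "-v10", "-c")):
    """Build an argv list that starts a daemon's Python script under gdb.

    zenhome, product_dir and daemon are supplied by the caller; the path
    layout is an assumption based on the zenwin example.
    """
    python = posixpath.join(zenhome, "bin", "python")
    script = posixpath.join(zenhome, "Products", product_dir, daemon + ".py")
    config = posixpath.join(zenhome, "etc", daemon + ".conf")
    return ["gdb", "--args", python, script, "--configfile=" + config] + list(extra_args)


cmd = gdb_command("/Users/cgibbons/zenoss", "ZenWin", "zenwin")
print(" ".join(cmd))
```

Typing run at the resulting (gdb) prompt then starts the daemon with those arguments already in place.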

The daemon will then run as if it were started directly from the command-line. Any pdb trace statements will still be activated and you can use pdb commands as expected. But, once a native code crash is detected by the debugger, the gdb prompt will be provided and the gdb where command may be used to view the native code stack trace. For example, if we do this with our previous doit test, we’ll see this output:

>>> lib.doit()
 
Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_PROTECTION_FAILURE at address: 0x00000000
0x93ebe457 in __vfprintf ()
(gdb) where
#0  0x93ebe457 in __vfprintf ()
#1  0x93ef2da7 in vfprintf_l ()
#2  0x93f17fbb in printf ()
#3  0x003b9ffe in doit () at test.c:6
#4  0x0039476d in .LCFI1 () at /Users/cgibbons/src/zenoss/trunk/inst/build/ctypes-1.0.1/source/libffi/src/x86/darwin.S:81
#5  0x00394701 in ffi_call (cif=0xbffff1a8, fn=0x3b9fe6 <doit>, rvalue=0xa0414584, avalue=0xbffff120) at /Users/cgibbons/src/zenoss/trunk/inst/build/ctypes-1.0.1/source/libffi/src/x86/ffi_darwin.c:249
#6  0x0038f21e in _CallProc (pProc=0x3b9fe6 <doit>, argtuple=0x15a030, flags=4097, argtypes=0x0, restype=0x21b600, checker=0x0) at source/callproc.c:665
#7  0x00389f02 in CFuncPtr_call (self=0x174880, inargs=0x15a030, kwds=0x0) at source/_ctypes.c:3357
#8  0x00007e12 in PyObject_Call (func=0x174880, arg=0x15a030, kw=0x0) at Objects/abstract.c:1795
#9  0x00080dcb in do_call [inlined] () at Python/ceval.c:3776
#10 0x00080dcb in PyEval_EvalFrame (f=0x209960) at Python/ceval.c:3591
#11 0x0008327f in PyEval_EvalCodeEx (co=0x1ac5e0, globals=0x173a50, locals=0x173a50, args=0x0, argcount=0, kws=0x0, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:2741
#12 0x00083547 in PyEval_EvalCode (co=0xbffff064, globals=0xbffff064, locals=0xbffff064) at Python/ceval.c:484
#13 0x000aae36 in PyRun_InteractiveOneFlags (fp=0xbffff064, filename=0xdb4a6 "<stdin>", flags=0xbffff818) at Python/pythonrun.c:1287
#14 0x000aaf73 in PyRun_InteractiveLoopFlags (fp=0xa04175e0, filename=0xdb4a6 "<stdin>", flags=0xbffff818) at Python/pythonrun.c:706
#15 0x000abe29 in PyRun_AnyFileExFlags (fp=0xa04175e0, filename=0xdb4a6 "<stdin>", closeit=0, flags=0xbffff818) at Python/pythonrun.c:669
#16 0x000b5f8a in Py_Main (argc=0, argv=0xbffff8a4) at Modules/main.c:493
#17 0x00001d0b in _start ()
#18 0x00001c39 in start ()
(gdb)

If the native library was built with debugging symbols, a nice programmer-friendly stack trace will be generated, as in the above example. Here we can see exactly which line our doit function crashed at. Now the bug should be easy to find and fix, right?

New MacBook Setup

I bought another Mac today, a nice 2.4 GHz 13-inch unibody MacBook. I had planned on buying a 17-inch unibody MacBook Pro, and very nearly did, but luckily sanity won out and I remembered how much of a hassle it was to carry around those giant things, even if they are “only” 6.6 lbs.

I had a 2.0 GHz 13-inch MacBook a couple of years ago when the first Intel-based models came out and I do remember the screen resolution, while not abundant, was more than adequate for browsing, e-mail and even development. And, the unibody model is only 4.5 lbs, so 2 lbs lighter will be a lot nicer to carry around. I also sprung for a spare battery to try and help get a little closer to the awesome battery life of the 17-inch model.

Of course, the obvious question is why buy another laptop when I’ve already got a nice 15-inch MacBook Pro that work provides? The answer there is easy: I don’t want to do anything personal, even development, on the work provided machine.

Now, on to the actual system setup, documented here for posterity.

  1. After account creation, run software update and get all the latest updates installed first.
  2. Create an applecare account with an easy, but secure, password. This way the Apple store geeks can have that account should there need to be any repair work done.
  3. Change the battery display in the menu bar to show remaining time with Show -> Time.
  4. Secure the screensaver by using System Preferences -> Security -> General and checking the Require password to wake this computer from sleep or screen saver option.
  5. Disable the Front Row remote by using System Preferences -> Security -> General and checking the Disable remote control infrared receiver option.
  6. Disable the Front Row keyboard shortcut by using System Preferences -> Keyboard & Mouse -> Keyboard Shortcuts and disabling the Hide and show Front Row shortcut.
  7. Enable full keyboard shortcuts by checking the All controls option in the bottom of the same Keyboard Shortcuts screen.
  8. Enable the Use secure virtual memory option in System Preferences -> Security -> General.
  9. Encrypt my home directory with System Preferences -> Security -> FileVault.
  10. Install Growl 1.1.4 from http://growl.info/
    1. Install the GrowlSafari extra package.
    2. Install the HardwareGrowler extra package.
      1. Drag HardwareGrowler.app to /Applications
      2. Disable the HardwareGrowler dock icon by following the instructions at http://growl.info/documentation/hardwaregrowler.php
      3. Add HardwareGrowler to the start at login list by using System Preferences -> Accounts -> Login Items and dragging HardwareGrowler to the list.
    3. Enable Growl starting at login with System Preferences -> Growl and enabling the Start Growl at login option.
  11. Remove unused printer drivers by deleting the appropriate folders in /Library/Printers folder (everything but Brother, hp and PPDs in my case).
  12. Install the XcodeTools package from the Installation DVD’s Optional Installs directory.
  13. Drag Xcode to the dock by going to /Developer/Applications and dragging the icon to the dock.
  14. Add Activity Monitor to the dock by going to /Applications/Utilities and dragging the icon to the dock. Secondary-click on the icon and enable Open at Login.
  15. Add Terminal to the dock by going to /Applications/Utilities and dragging the icon to the dock.
    1. Change the default Terminal settings by starting Terminal.app, selecting Preferences (Cmd-,) and then changing the “new window with settings” to Pro.
    2. Select the Pro scheme in the Settings tab and click default.
    3. Choose the Window tab with the Pro scheme selected, click the Background color chooser and set the opacity level to 90%.
    4. Change the window size to 80 columns and 36 rows.
  16. Customize vim by creating ~/.vimrc with the following content:

    :color elflord
    :syntax enable
    :set shiftwidth=4
    :set expandtab
    :set autoindent
    :set cindent
    :set enc=utf-8
    :set nu
    :set showmatch
    :set laststatus=2
    :set nocompatible
    :set gfn=Monaco:h15:a
  17. Enable color highlighting for ls by adding the following lines to /etc/bashrc:

    alias ls='ls -CFG'
    alias dir='ls -FGlas'
  18. Install the Safari 4 beta from http://www.apple.com/safari/download
  19. Install Firefox 3 from http://getfirefox.com/ and drag it to the dock.
  20. Install iStat pro from http://www.islayer.com/apps/istatpro/
  21. Install MySQL 5.1 x86 community edition from http://dev.mysql.com/downloads/mysql/5.1.html and be sure to install the StartupItem package as well as the preference pane.
  22. Add MySQL to the shell profile by appending the following to /etc/bashrc:

    export PATH=/usr/local/mysql/bin:$PATH
  23. Install EverNote from http://www.evernote.com/
  24. Install DropBox from http://www.getdropbox.com/
  25. Install the Windows Media Components for QuickTime from http://www.microsoft.com/windows/windowsmedia/player/wmcomponents.mspx
  26. Install Twitterrific from http://iconfactory.com/software/twitterrific
  27. Disable automatic synchronization for iPhones and iPods since this won’t be the primary iTunes machine by going to iTunes Preferences and enabling Disable automatic syncing for iPhones and iPods on the Devices tab.
  28. Install the iPhone SDK from http://developer.apple.com/
  29. Party! Or maybe just nap.

Oddities in Gathering Windows Performance Data

At Zenoss we do quite a bit of remote monitoring of computers running Windows. In the Enterprise edition of the product, we collect raw performance counter data using the conventional remote Windows Registry APIs.

We ran into an issue recently with a customer running Windows 2000 where the data from the remote server was being truncated prematurely. Since we implement our own remote API (so we can run natively on Linux and with Python, rather than requiring Windows), there was some immediate concern that we had run into a low-level bug in our protocol implementation. Thanks to the release of the Windows Communications Protocols (MCPP) last year we have great detail on how our API layer should function.

Reviewing our implementation in detail against the MCPP showed no bugs relative to the specification, but I did notice some odd behavior. Normally when using the RegQueryValue API you specify a NULL buffer pointer and a zero-length buffer size so that the call will provide the actual size of the buffer needed. With this particular customer’s server I noticed that the call wasn’t behaving as documented in the MCPP.

An error code of ERROR_MORE_DATA was being returned. The MCPP says that when this value is returned the server will populate the size output variable with the actual size in bytes of the needed buffer. In this case, the returned size was always the same as the input size. After some experimentation I found that if I passed in a buffer approximately 64 Kbytes larger the call would finally succeed.

While quite odd, this is actually the documented and expected behavior in the Win32 API documentation for RegQueryValueEx, but not in the MCPP. Specifically, when using the HKEY_PERFORMANCE_DATA key, ERROR_MORE_DATA behaves differently and the caller has more responsibility for guessing an appropriate buffer size.

The following pseudo-code shows the basic flow for how RegQueryValueEx should be used, for either local or remote performance data access.

size = 65536 # starting size, probably computed from a previous registry call
params.in.data = params.out.data = buffer(size)
while 1:
    params.in.size = size
    params.out.size = 0
    dcerpc_winreg_QueryValue(params)
    if params.out.result == ERROR_MORE_DATA:
        size = size + 65536 # add another 64 Kbytes of data to the buffer
        params.in.data = params.out.data = buffer(size)
        continue
    break

After fixing that issue I was still left with one oddity. Let’s say, for example, that the buffer had to grow to 293,500 bytes before the RegQueryValueEx call was successful. And yet, the actual amount of returned data would only be 195,000 bytes, or something similar. This behavior seems quite different than on the other Windows operating systems we have tried so far.

This is the first time we’ve tried our data collection against a Windows 2000 server running Exchange locally. Windows 2000 has also been the source of several other key behavior differences in how performance data is returned, so my current speculation is that how the server actually determines what data to return varies greatly between operating system versions. We normally query the performance counter registry for only a subset of values. It may well be that on Windows 2000 a buffer size large enough to retrieve all performance counters is required, even though once the call is complete it has actually used quite a bit less.
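
Both behaviors can be exercised without a Windows server by simulating them. The sketch below is purely illustrative: FakePerfServer and read_perf_data are invented names, the fake server simply mimics what was observed (ERROR_MORE_DATA with the size echoed back until the buffer is large enough, then less data than the buffer could hold), and the byte counts reuse the example figures above. The two error code values are the standard Win32 ones.

```python
ERROR_SUCCESS = 0      # standard Win32 value
ERROR_MORE_DATA = 234  # standard Win32 value
STEP = 65536           # grow the buffer 64 Kbytes at a time


class FakePerfServer:
    """Hypothetical stand-in for the remote HKEY_PERFORMANCE_DATA query."""

    def __init__(self, required_size, returned_size):
        self.required_size = required_size  # buffer size the server insists on
        self.returned_size = returned_size  # bytes it actually hands back

    def query_value(self, buffer_size):
        if buffer_size < self.required_size:
            # Mimic the observed behavior: the size comes back unchanged,
            # so the caller learns nothing about how much to allocate.
            return ERROR_MORE_DATA, buffer_size, b""
        return ERROR_SUCCESS, self.returned_size, b"\0" * self.returned_size


def read_perf_data(server, size=STEP):
    """Keep enlarging the buffer until the query stops returning ERROR_MORE_DATA."""
    attempts = 0
    while True:
        attempts += 1
        result, out_size, data = server.query_value(size)
        if result == ERROR_MORE_DATA:
            size += STEP
            continue
        return data, size, attempts


server = FakePerfServer(required_size=293500, returned_size=195000)
data, final_size, attempts = read_perf_data(server)
print(len(data), final_size, attempts)  # 195000 327680 5
```

Growing in fixed 64 Kbyte steps means the final buffer overshoots what the server needs, which is a cheap price for a call that refuses to say how much it wants.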

Quirky, but another bug gone.