Saturday, May 12, 2018

-- How to defy the elasticity property --

Leadership in corporate America often struggles with the demands of doing more with less. This is especially true when you work for a public company where results must either meet or exceed expectations, and those expectations increase with every quarter. The pressure forces each service unit in a public company, including IT, to face the same question: How do we comply with corporate objectives while maintaining the ability to service natural organic growth without sacrificing quality? Although each company's situation is unique, the answer always seems to require a combination of technology and efficiency.

They say necessity is the source of all innovation, and it is by necessity that I have found innovative ways to use technology and efficiency to accomplish more with fewer resources. With a team of just three full-time database administrators, I manage the 24x7 operation of more than 1,300 databases, spread over multiple platforms including Oracle, SQL Server, MySQL, Teradata, and MongoDB. For those of you who are not familiar with the database administration side of IT, the market standard recommends one DBA manage between 30 and 50 databases (depending on database size and complexity). Using the leanest end of that standard, my team should be composed of 26 database administrators, which doesn’t take into consideration the added complexity of having to support databases on five different platforms.

This is taking the “do more with less” mantra to the extreme. I have only 15% of the resources that industry standards say I need in order to run my department successfully. Luckily, I am a technical manager with a background in database administration, so I could count myself as a resource (if I ever find a way out of the back-to-back meetings). I am especially grateful that I was allowed to hand-pick my team, and I selected some of the best resources I’ve worked with over the last thirty years. Each of my three DBAs has 20+ years of experience, specializes in a different technology, and possesses different professional strengths. You may argue that since we need coverage for so many databases, I should have hired more junior DBAs to stretch my payroll budget rather than just a select few with substantial experience. My response would be that time is literally money; every minute an essential database is down, the company’s earnings potential is negatively impacted. It’s important to have the right tools for the job, and my experienced DBAs know what warning signs to watch for to prevent systems from breaking down, rather than just responding when things go wrong.
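As a rough check of the staffing arithmetic above, here is a minimal sketch. The database count and the 30-to-50 ratio come from the text; the function name is illustrative, and counting the manager as a fourth resource is an assumption made only to show which headcount lines up with the 15% figure.

```python
import math


def recommended_dbas(databases: int, databases_per_dba: int) -> int:
    """Smallest whole number of DBAs needed at the given ratio."""
    return math.ceil(databases / databases_per_dba)


databases = 1300

# Leanest end of the standard: one DBA per 50 databases.
lean = recommended_dbas(databases, 50)
# Most generous end of the standard: one DBA per 30 databases.
generous = recommended_dbas(databases, 30)

# Share of the lean recommendation actually staffed.
dbas_only = 3 / lean      # the three full-time DBAs
with_manager = 4 / lean   # assumption: the manager counted as a resource

print(f"Recommended team size: {lean} to {generous} DBAs")
print(f"Three DBAs cover {dbas_only:.1%} of the lean recommendation")
print(f"Counting the manager, {with_manager:.1%}")
```

Under these assumptions, the 15% figure corresponds to four people measured against the lean recommendation of 26; the three DBAs alone come to a little under 12%.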

I am proud of what my team has been able to accomplish, but if someone had told me at the interview that we would have to support such a complex operation with such a lean team… well, my mind would have been screaming, “Danger Will Robinson, danger!” and I likely would not have taken the position. To be fair, my employer was unaware of the real situation, and it was only through several years of diligent investigation that we finally uncovered all the information systems we are responsible for. It didn’t take long for me to realize that in order to succeed, I would need to:
1) automate and standardize as much as possible,
2) focus on proactive support,
3) set multi-year objectives to bring the ecosystem to a supportable version, and, most importantly,
4) hire and retain the best talent my budget would allow in order to reduce the technical gap.

Having information systems properly supported is key to the success of any organization. While we have found ways to make a lean team work, I am not suggesting you reduce the size of your team. Remember, we got to this formula out of necessity rather than choice. With the proper use of efficiencies and technologies, sometimes it really is possible to work smarter, not harder (or longer hours, or with more people). Some of the “smart” changes we have implemented are listed below.

Monitoring and diagnostics. We standardized our monitoring of different technologies with a single tool, Oracle Enterprise Manager, which offers multiple plug-ins for different flavors of databases. This provides a single, streamlined process for monitoring and diagnostics regardless of the technology behind it.

Seamless response to production support emergencies. We integrated Oracle Enterprise Manager’s alert system with a SaaS product (PagerDuty). The service receives the alert and contacts the on-call resource with the technical knowledge required to resolve the issue. PagerDuty has the entire team’s contact information, from the Database Administrators all the way up to the Senior Director of Database Engineering. If an alert does not receive a response within a set timeframe, it continues up the chain of command until a response is received. This ensures all alerts are responded to in a timely manner and encourages everyone to watch for alerts.
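The escalation behavior described above can be illustrated with a small simulation. This is only a sketch of the idea, not PagerDuty's API or configuration: the `Responder` class, the `escalate` function, the names in the chain, and the intermediate manager role are all hypothetical, and the timeout is reduced to a simple yes/no acknowledgement.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Responder:
    name: str
    role: str


def escalate(
    chain: List[Responder],
    acknowledges: Callable[[Responder], bool],
) -> Optional[Responder]:
    """Walk the chain in order and return the first responder who
    acknowledges the alert, or None if nobody does."""
    for responder in chain:
        # A real service would page this person and wait for a set
        # timeframe; here the callback stands in for that wait.
        if acknowledges(responder):
            return responder
    return None


# Hypothetical chain, ordered from on-call DBA up to the Senior Director.
chain = [
    Responder("dba_on_call", "Database Administrator"),
    Responder("dba_backup", "Database Administrator"),
    Responder("manager", "Database Manager"),
    Responder("senior_director", "Senior Director of Database Engineering"),
]

# Simulate the first two responders missing the page.
unavailable = {"dba_on_call", "dba_backup"}
handled_by = escalate(chain, lambda r: r.name not in unavailable)
print(handled_by.name if handled_by else "nobody acknowledged")
```

The point of the design is that an unanswered alert never stops at one person: each missed acknowledgement moves it one step higher.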

Night and weekend support. With such a lean team, having everyone present during business hours is essential in order to keep up with the changing demands of the business. It would be humanly impossible for the team to support night incidents and show up for work the next day, and working every weekend is a great way to burn out a good employee. Unfortunately, the systems aren’t considerate enough to break only during office hours. We solved this problem by outsourcing our off-hours support to RDX, a US company that assigned named resources with skills and experience equivalent to the internal team. They are capable of working independently to solve problems, and they communicate what went wrong and what steps were taken to remedy the situation. We’ve even used their resources for overflow work during regular business hours. I am honestly impressed by the quality and capabilities of the resources at RDX.

Standardize as much as possible. Having clear and well-documented standard procedures is not only a best practice, it also makes the diagnostic process a lot easier. We try to make documentation as detailed as possible, which allows us to request help from helpdesk engineers who have no database knowledge. No matter who is debugging an issue, every resource is familiar with the environment.

Encourage and promote proactive support. If a problem happened once in a single database, and you have a standardized ecosystem, then it is extremely likely that same problem will happen again in a different database. Proactively preventing problems from recurring in the rest of the enterprise is where I want my team to spend most of their time.
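One way to picture this proactive step is a lookup over a database inventory: after an incident, list the other databases that share the same platform and version, since they are the likeliest to hit the same problem. The sketch below is illustrative only; the `Database` record, the `similar_databases` function, and every name and version number in the sample inventory are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Database:
    name: str
    platform: str
    version: str


def similar_databases(incident_db: Database, inventory: List[Database]) -> List[Database]:
    """Return every other database on the same platform and version
    as the one that had the incident."""
    return [
        db
        for db in inventory
        if db != incident_db
        and db.platform == incident_db.platform
        and db.version == incident_db.version
    ]


# Hypothetical inventory for illustration.
inventory = [
    Database("orders", "Oracle", "12.1"),
    Database("billing", "Oracle", "12.1"),
    Database("reports", "Oracle", "11.2"),
    Database("web", "MySQL", "5.7"),
]

# Suppose the incident happened on the first database in the list.
at_risk = similar_databases(inventory[0], inventory)
print([db.name for db in at_risk])
```

In a standardized ecosystem, the list this produces is the to-do list for applying the same fix before the problem recurs.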

Communication. Having consistent and recurrent communication with the application support team is priceless. They are often challenged to make changes in order to meet the demands of the business, and we have avoided numerous incidents that would have created extended downtimes by having open communication and helping build the vision of what is to come in their world.

Keep up with innovation. Although we can’t afford to patch every quarter, we ensure that every time we perform a database upgrade we apply the latest and greatest patches. While it’s important to take advantage of the latest technology available, you want to be on the leading edge, not the bleeding edge. It’s important to have access to and understand the latest-and-greatest technology, but you don’t want the risk and downtime associated with becoming the debugging guinea pig.

Take ownership. We are often challenged with having to solve issues related to MS Access databases. These databases are usually created by smart business users as a necessary workaround to accommodate software limitations. However, when that smart user is no longer with the company, suddenly no one knows how to maintain a system that is vital to the business operation, and oftentimes no one knows the database exists until it breaks. Needless to say, it is a very bad idea to create this kind of dependency outside IT. Work with the business units, and provide tools under IT governance that give them the freedom to customize their own reports and create basic applications. If you don’t take ownership of this problem, you can expect a proliferation of Excel sheets and decentralized, unsupported information systems, which will turn into an information security and support nightmare.

Attempt to reduce the technical gap. Nowadays it is impossible to be an expert in every new technology. While it is true that database systems are becoming smarter, it is also true that additional features require additional skills, and the traditional role of the Database Administrator has evolved. A lot more is expected from a DBA today compared to 10 years ago. Having a small team precludes any possibility of having technical redundancy, so cross-training is essential. One way to do this is through periodic lunch-and-learn meetings and a training-of-trainers model (ToT). This often lifts the morale of participants, promotes teamwork, and works well if you have a scarce (or non-existent) training budget.

The list above is just a compilation of what has worked for us, and since every situation is unique, these practices may not work for you. When faced with a challenge, keep in mind that unconventional situations likely can’t be resolved with conventional solutions. They require thinking outside the box.

Rafael Orta