Site Reliability Engineer/Production Engineer with 18 years of experience designing and automating complex systems on both Linux and Solaris and a degree in Computer Science and Computer Engineering.
I have a deep interest in open source, automation, systems engineering and security. I also find teaching and mentoring others to grow in these areas rewarding.
I have a strong presence in the open source community: I maintain my own software (iptstate, concordance, pius, check_x509, mime_dump) and contribute to many projects, including Chef and the Linux kernel. I pride myself on always solving problems the right way, not the easiest or quickest way. Putting forth the effort to do something properly the first time may take a little longer initially, but it reduces downtime and cost while increasing manageability, reliability, and scalability in the long run. Lastly, I'm very proactive; I have the drive to seek out projects that need to be done and tackle them.
Thank you for your time,
Phil Dibowitz
415-935-1312
phil@ipom.com
WORK EXPERIENCE
Facebook (2010 - present - Menlo Park, CA)
Production Engineer: Operating Systems (tech lead) (2012-present)
Production Engineer: Traffic Team (2011-2012)
Production Engineer (2010-2011)
Technical and Organizational Vision
- Drove effort and design for rebuilding configuration management with a focus on being able to scale the number of systems independent of team size. Built and led a team to implement it
- Led the team to drive adoption of the aforementioned system, first through organic growth and eventually through strict policy, creating both a unified system and a positive migration experience
- Identified additional infra areas needing improvement or lacking ownership (security updates, virtualization, package building/distribution, automated change testing) and built a plan to transition them to the team and improve them
- Champion an open-source upstream-first mentality within the team: staying close to upstream, pushing features/fixes, releasing tooling wherever possible, etc.
- Identified that future scaling needs would depend on influencing and understanding community direction. In response, built on the existing open-source mentality to help the team reach out and build relationships with strategic upstream open-source teams, including Anaconda, systemd, RPM, and others. Making this a core part of the team's work was a big bet which enabled collaboration with upstream on various long-term visions
Management, Team Leadership & Cultural Leadership
- Grew the team in response to additional responsibility (see point 3 above), including building onboarding documentation and individualized growth plans
- Identify and build key cross-functional relationships between team and others that allow us to collaborate and build better solutions faster with less stress.
- Travel regularly to remote offices to teach technical and non-technical classes to reinforce cultural consistency/growth across the org and ensure remote employees feel connected
- Individual team member development for a team of 8, including weekly 1:1s focused on individual growth, career growth, and project prioritization; co-writing bi-yearly reviews and defending them in org-wide calibrations; etc.
- Plan and drive bi-yearly roadmapping for the team. Collaborate cross-functionally on the team roadmap to ensure org-wide alignment. Socialize the roadmap and the previous half's review
Technical
- Designed a system to route configuration management alarms to the right team, reducing oncall load on central team
- Wrote automated tooling to sync CentOS updates and roll them out safely on a 2-week cycle, ensuring consistent, timely security updates
- Built extensible Chef APIs for a variety of complicated use cases including managing storage devices, mounts, and complex service configs, most of which are now open source.
- Worked with auditors to have clean/easy audit reports, and built transparency into tooling to aid in yearly audits
- Led the OS and load balancer side of the project to bring full-parity IPv6 support to Facebook
- Worked with the upstream kernel community to upstream new features and fixes to the ip6_tunnel module
- One of two primary authors of the automation system used to configure and converge hardware load balancers
- Automated bootstrapping of new clusters, enabling infra to keep up with product growth
- Rebuilt internal LDAP infrastructure, reducing engineer development problems
- Wrote tooling for new engineers to opt into accounts; wrote tooling for reaping of unused accounts (for current employees)
Google (2008 - 2010 - Zurich, Switzerland)
Site Reliability Engineer, Gmail
- Planned and tested migration of Gmail to next-generation internal storage infrastructure, including training of other team members
- Oncall duties for Gmail's infrastructure including web frontend, IMAP/POP frontend, backend, storage, delivery, anti-spam, and anti-abuse components
- Worked with developers to productionize next-generation anti-abuse and anti-spam systems
- Near-complete re-write of Gmail-specific machine-management software
- Developed scripts to ensure correct load balancing configurations
- Extended existing configuration management systems for new products and needs
- Developed new procedures for integrating with other teams and core Google infrastructure
- Restructured how new releases get their first production traffic to provide greater flexibility, monitoring, and reliability
- Wrote software to audit and correct file permissions issues
- Wrote and organized documentation for many of Gmail's existing and upcoming systems
- Taught classes for new employees and engineers transferring to SRE
Ticketmaster (2005 - 2008 - Los Angeles, CA)
Senior UNIX Systems Administrator (2006 - 2008)
UNIX Systems Administrator (2005 - 2006)
- Managed ~3000 Linux systems
- Architected and implemented a large-scale PKI infrastructure using RSA Keon software for more than 60,000 certificates spanning more than 16 certificate authorities (CAs), including writing policy and training staff
- Co-designed the PKI-based authentication system for web-services project for interfacing with partners
- Developed a plug-in to the preexisting system configuration software to effectively handle Identity, User, and Access management (Perl)
- Developed dynamic pluggable software for provisioning, modifying, and decommissioning DNS, NFS storage, and VMware (GSX) virtual machines (Perl)
- Developed daemon to report and graph incoming sessions across load-balancing layer (Perl)
- Developed utility to generate utilization reports for on-sale periods (Perl)
- Part of the team that developed and maintained in-house system configuration and other software (Perl, C, Ruby)
- Wrote various scripts such as Netscaler configuration generator, monitoring aggregator, and others to improve team efficiency (Perl and Ruby)
- Rolled out hardware, OS, and configuration for several new projects such as TicketExchange and Web Services
- Worked directly with application developers to debug various production problems (C++)
- Rolled out keepalived to single-point-of-failure systems to ensure redundancy and reliability
- Trained new staff on our systems, software, and policy
- Wrote documentation for various systems, products, and software
Previous positions left off for brevity
EDUCATION
- University of Southern California
B.S. in Computer Engineering and Computer Science
SKILLS
- UNIX: Linux, Solaris
- Services/Software: chef, yum/dnf/rpm, deb/apt, ipfilter, iptables, apache, varnish, nginx, mysql, bind, kerberos, linux mdraid
- Languages: Ruby, Python, Perl, Shell, C++, C
PERSONAL/OSS PROJECTS
References available on request.