Some years ago, I accidentally cooked the microcontroller of a commercial-grade uninterruptible power supply containing hundreds of dollars' worth of power electronics. I bought another one, of course, but I wasn't happy with throwing out the first one, or simply using it for parts. So I ended up taking the second one apart too, trying not to do dumb stuff a second time as I figured out how to fix the first one. I ended up breaking the second one, but then fixed them both. This is the epic story.
This is the APC Smart UPS 1500, model SMT1500I. It's a 230V uninterruptible power supply running on two 12V 17Ah lead-acid batteries, with a 50/60Hz transformer-based inverter and an add-in card slot for remote monitoring. These devices have been around in some form or other since the 1990s, but the SMT1500I is a more recent model in the series, first released in the early 2010s, with an LCD display, improved power efficiency and better monitoring capabilities. Its successor, the SMT1500IC, features a 'SmartConnect' cloud-connected monitoring system and is quite different internally, so the following isn't really applicable to that model.
Suffice it to say, I was curious about my unit (bought used), and mistakes were made involving multimeter probes. There are plenty of headers inside such a device with small gaps between the pins, and multimeter probes slip off those pins quite easily and short them together. I was trying to figure out how to quieten a noisy switch-mode power supply, and in measuring its output, I managed to short the main battery voltage rail inside the unit (27.5V) directly to one of the I/O pins of the microcontroller, cooking it instantly.
As far as options go for actually fixing such a device when this happens, the average consumer or technician doesn't have (m)any. That's because the microcontroller in question - one of two in the unit - is a PLCC44 chip soldered to the board, with proprietary firmware on it. Once you cook that, there's no going back - and it didn't help that the chip used in the particular unit I had looked to be some custom APC part, with a part number I couldn't find referenced anywhere.
I decided to try pulling the sticker off of the microcontroller. What did I have to lose? Not an awful lot, the thing was already dead. Guess what I found underneath?
The part in question was actually an NXP P89CV51RC2FA - a standard microcontroller, not a custom APC ASIC of some sort. Hahahaha. I'm sure they'll tell you that they put the stickers on for logistical reasons, but I think they're also there to obfuscate. Anyone who knows anything about 8051 microcontroller part numbers will recognise that as an 8051-series 8-bit microcontroller. In this case, the "P" indicates the manufacturer, Philips (whose semiconductor division became NXP), and the second "C" indicates 32kB of flash memory. I found, to my disappointment, that it was discontinued at the end of 2011, so I couldn't get hold of a new one.
An 8051 microcontroller, in a post-2010 UPS? Why are they still using it? Well, because if it ain't broke, don't fix it. APC have been building these UPSs around 8051 microcontrollers pretty much since the beginning - decades ago - and in the SMT series they added a second microcontroller (in my case a much more modern STM32 ARM Cortex-M3 based chip) as a communications processor, which interfaces with the main microcontroller via UART and handles external communications and the LCD front panel of the unit. They finally upgraded to more modern, powerful devices in the SmartConnect series I mentioned earlier, but there's a downside. One of the key advantages of NOT having a SmartConnect model is that on these older ones you can still hack into the UART communications between the two processors and adjust settings like battery charge voltage, even though they're not supposed to be user accessible - because the 8051 internally speaks the same UPSLink protocol that APC's older UPSs used externally.
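If you fancy trying that, a minimal starting point is just to tap one side of the internal UART with a USB-serial adapter (at the board's logic level) and log whatever goes past. The sketch below is a rough Python/pyserial logger; the 2400 baud 8N1 setting matches APC's old external UPS-Link serial port, but whether the internal link runs at the same rate is an assumption you'd want to verify with a scope or logic analyser first, and the port name is whatever your adapter enumerates as.

    # Rough sketch: log raw bytes from one side of the internal UART.
    # Assumptions: 2400 baud 8N1 (same as APC's old external UPS-Link port),
    # adapter at the board's logic level, enumerating as /dev/ttyUSB0.
    import serial, time

    port = serial.Serial("/dev/ttyUSB0", 2400, timeout=0.1)
    with open("upslink_log.txt", "a") as log:
        while True:
            data = port.read(64)
            if data:
                log.write(f"{time.time():.3f} {data.hex(' ')}\n")
                log.flush()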
I managed to get hold of a document from Atmel (now owned by Microchip) that cross-referenced the discontinued NXP part number to a drop-in Atmel replacement - the AT89C51RC2 - which was still in production and available! And since this all happened well before COVID and the semiconductor shortage, I was able to actually get hold of one. Lol.
Getting a replacement microcontroller is easy, and soldering it in is just a matter of logistics. The real problem for me was the firmware.
APC offers firmware upgrade files for this unit, but they're not the sort of thing you can just chuck straight onto the flash of the microcontroller and expect it to work! Through a process of research and experimentation, I was able to figure out that the firmware files APC provides are encrypted, and contain firmware both for the STM32 communications processor and for the 8051 main processor - the part I needed. The communications processor runs a bootloader that accepts the file from a host computer over a USB or serial interface, decrypts its own firmware and updates itself, then sends new firmware to the main processor as well. I could see, inside the file, a 32kB block of data labelled 'MCU 05.0' for the main processor - but it wasn't in a readable format. Long story short, it was a dead end - I'm no hacker (well, that's what they all say, isn't it) and I didn't want to get lost in countless hours of trying to figure out how APC encrypted their firmware files.
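If you want to poke at one of those update files yourself, the quick-and-dirty way to find that block is to search for the label string and carve out the 32kB that follows. Here's a rough Python sketch, under the assumption that the label sits right at the start of the block - the filename is made up, and the carved data is of course still encrypted:

    # Rough sketch: locate the 'MCU 05.0' label and carve the 32kB that follows.
    # The offset/size handling is guesswork based on what I saw in my file.
    data = open("SMT1500_firmware.enc", "rb").read()   # hypothetical filename
    idx = data.find(b"MCU 05.0")
    if idx < 0:
        raise SystemExit("label not found")
    block = data[idx : idx + 32 * 1024]                # still encrypted, sadly
    open("mcu_block.bin", "wb").write(block)
    print(f"label at offset {idx:#x}, carved {len(block)} bytes")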
So instead of being a software hacker, I committed intellectual property theft (well, it's pretty minor isn't it) with hardware. I'd already bought a second UPS of the same model, so at the risk of stuffing that too and losing everything, I plugged my very untrustworthy programming setup into the working UPS's microcontroller and purloined the firmware right out of it. That's right, because they forgot to enable readout protection. APC, if you're reading this, given that you're going to the extent of encrypting your field update firmware files, you really should enable readout protection on your chips. Seriously. :)
But I'm not complaining. I managed to corrupt the firmware on the STM32 communications microcontroller as well, and used exactly the same trick to get that back too - because they didn't enable readout protection there either. Nice.
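For the curious: with readout protection off, dumping an STM32 over SWD is only a dozen lines with an ST-Link and pyocd. Something like the sketch below - the flash size is an assumption you'd look up for the actual part fitted, and the output filename is whatever you like.

    # Sketch: read out STM32 flash via SWD when readout protection is disabled
    # (pyocd + an ST-Link or similar debug probe).
    from pyocd.core.helpers import ConnectHelper

    FLASH_BASE = 0x08000000
    FLASH_SIZE = 256 * 1024   # assumption - check the actual part's flash size

    with ConnectHelper.session_with_chosen_probe() as session:
        target = session.board.target
        data = target.read_memory_block8(FLASH_BASE, FLASH_SIZE)
        open("stm32_dump.bin", "wb").write(bytes(data))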
It should have been easy from then on, but it wasn't. I bought a replacement microcontroller and went through the laborious process of soldering it in (not easy when it has 44 pins and I have only one soldering iron and two hands). I can't say the soldering looked neat - I damaged many pads in my various attempts to get it on, so the final result involves quite a few 'bodges'. However, after closely inspecting everything to ensure there were no dry joints or shorts, I was fairly satisfied that it wouldn't fail on me.
I flashed the firmware, this time using standard microcontroller programming tools and the unencrypted firmware image I'd obtained. Lo and behold, it worked. Well, that is to say, the microcontroller worked. I was lucky that the EEPROM chip, separate from the microcontroller, survived the entire process with its calibration values intact; however, in the general repair process I'd managed to screw up some other parts of the UPS, and had a lot of work to do in fixing it up.
I went through a long and gruelling process of tracking down some complicated problems with the 24V DC to 230V AC inverter, during which the main 12V SMPS of the unit decided to call it quits and produce 25V on the output! This didn't help matters: it fried almost everything on the 12V rail, i.e. a good handful of ICs and ASICs, some of them proprietary. I'd fixed the original problem, but at the same time completely botched up the rest of the repair! Also, I was using the working UPS for comparison purposes while troubleshooting, and managed to stuff that up too - soldering a probe wire onto an SMD resistor caused it to fail from the mechanical stress, and I then desoldered an IC to troubleshoot the problem and killed it with ESD. While testing the broken IC in-circuit, the inverter operated incorrectly, and I suspect inductive spikes from the main transformer killed one of the MOSFETs in the H-bridge, because nek minnit there was a hell of a lot of smoke.
However, I was able to replace everything with the help of some Asian eBay stores which somehow had supplies of 'genuine ,high quality ! [sic]' replacements for the custom inverter ASICs. I theorise that they had got hold of rejected (e.g. due to high failure rate) batches of those ICs from the OEM, because several of the ICs I received were open circuit on all pins while others worked perfectly. Suffice it to say, I just found ones that worked, soldered them in, and fixed all the other problems.
With all that done, there was a heroic moment when the thing finally produced a 230V sinusoidal output for the first time in several years. Soon after, it gained enough trust to be plugged into the mains(!) and some time after that I considered it fully fixed, and my second one got fixed during that process as well.
However, there was one thing still wrong: it wouldn't estimate its runtime correctly. These things are supposed to give you the number of minutes they could supply battery power for at the present load. But mine calculated exactly 168.0 minutes no matter the load on it. Why?
It wasn't a big issue so I left it, but a couple of years later I did some digging, knowing that the estimate was being passed from the main processor to the communications processor and then out to the network card. What I found was that the main processor was actually estimating 9999 minutes - the highest value it could - but the communications processor was suffering from some kind of overflow and displaying something else, clearly not expecting 9999 minutes (normal runtime estimates top out at less than 400 minutes).
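I never confirmed exactly where it wraps, but here's one plausible mechanism that lands suspiciously close to the number I saw - pure speculation on my part, assuming the comms processor holds the figure in seconds in an unsigned 16-bit field:

    # Speculative illustration of the 9999-minute -> ~168-minute wrap.
    est_minutes = 9999
    est_seconds = est_minutes * 60    # 599,940 s
    wrapped = est_seconds % 65536     # 10,116 s after a 16-bit overflow
    print(wrapped / 60)               # ~168.6 minutes - near the 168.0 displayed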
Why? Well after digging in the main processor's configuration for a while, I found that the thing thought it had no less than 85 external battery packs connected! I mean, full credit to APC for making something that can, at least in software, support up to 85 external battery packs, but really? It seems like the EEPROM did somehow get just a little bit corrupted during my work on the PCB, and that was the result. With that fixed - the number set to zero - the problem was gone.
Annoyingly, in early 2021 I found that all of a sudden the UPS wasn't charging its batteries properly. Upon investigation, the problem was a failing SMPS transformer - a custom-made part I could not buy a replacement for. I assumed this had happened due to some historical torture I'd put it through, and managed to scavenge a replacement power supply daughterboard off the internet to fix the problem (it was literally the one and only one out there). But when the same thing happened on the second UPS later in the year (and there were no more replacement boards available on the internet), I realised it was more than a random failure and decided to investigate. I found that some kind of insulation breakdown was occurring between layers of windings in the transformers, which I suspect is going to be a systematic failure in a lot of these UPSs once they reach about 10 years old. So I learnt how to re-wind high-frequency transformers, and fixed the second UPS by winding a new transformer to replace the failing original - buying the appropriate materials and copying the turns ratio and winding structure of the original. Importantly, I placed extra transformer tape between adjacent layers of windings so that the enamel insulation on the wire only sees the turn-to-turn voltage, not the layer-to-layer voltage. The fact that this wasn't done on the original transformers may be the cause of the failures there. Other factors like heat and loading may also play a role, so my recommendation is not to mod in larger external batteries with the SMT series like some people like to - buy a UPS designed to use external batteries.
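To put some numbers on why that inter-layer tape matters - these figures are invented for illustration, not measured from the APC transformer:

    # Illustration only - invented numbers, not the actual APC winding.
    turns = 40           # hypothetical winding, wound as...
    layers = 2           # ...2 layers of 20 turns
    v_winding = 340.0    # hypothetical peak voltage across the whole winding

    v_per_turn = v_winding / turns        # ~8.5 V between adjacent turns
    turns_per_layer = turns / layers
    # Where the second layer folds back over the first, the end turns of the
    # two layers are separated by nearly two layers' worth of turns:
    v_layer_to_layer = 2 * turns_per_layer * v_per_turn   # ~340 V across the enamel
    print(v_per_turn, v_layer_to_layer)

So at the fold-back end of the winding, the enamel can see the better part of the full winding voltage unless tape between the layers takes that stress instead.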
I mentioned earlier that the 12V SMPS in the unit blew up. At the time, I fixed this by replacing it with two off-the-shelf DC-DC modules, producing 12V and -9V from the 24V battery voltage. The reason I never fixed the original was that I'd damaged the custom-made SMPS transformer it required - but now knowing how to rebuild those, in late 2021 I finally went back and fixed the last piece of the puzzle by winding a new transformer for the 12V SMPS based on the original, and restoring the original power supply circuitry. What originally happened, I now realise, is that the power supply rail was shorted elsewhere on the board, which caused the supply to hiccup and eventually killed a diode - and then, in troubleshooting that, I broke one of the output feedback resistors, so when I eventually cleared the output short, the supply could no longer see its own output voltage and put out 25V instead of the intended 12V. With the feedback resistors fixed, a new controller IC and a new transformer, it works as intended again.
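As a sanity check on that failure mode, here's the usual feedback arithmetic with invented values - the real divider and reference in APC's supply are whatever they are:

    # Invented values - just to show why losing the feedback divider sends Vout high.
    v_ref = 2.5                        # e.g. a TL431-style reference
    r_top, r_bottom = 38_000, 10_000   # hypothetical divider from the 12 V output

    print(v_ref * (1 + r_top / r_bottom))   # ~12 V with the divider intact
    # With the sense path open, the controller sees ~0 V at its feedback pin,
    # assumes the output is low, and runs maximum duty - the output then climbs
    # until something else limits it (about 25 V in my case).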
Over four years later, the thing still sits on my desk doing exactly what it's supposed to do - both of them do. Looking on from the outside, you wouldn't have a clue how much I've messed with them.
Apart from the follow up repairs mentioned above, all I've had to do is battery replacements. The thing I like about electronics, as opposed to mechanical devices, is that once you fix it and you get it right, very little maintenance is needed. The actual repair work I did on both UPSs hasn't failed subsequently.
This all goes to show that sometimes you get lucky, and things are repairable even when it looks pretty hopeless to begin with. I'm glad I didn't assume it would be totally impossible, or simply replace the whole motherboard of the UPS. Aside from the environmental and financial value of fixing something instead of throwing it out, this project taught me a lot of practical things about electronics that university simply doesn't cover.
One thing that triggers me is people who call for these devices to be locked down, such that a repair like what I describe above wouldn't be possible. This is called for in the name of embedded security, but in a way that smells like BS to me. Read my rant on APC, UPSs, and the TLStorm set of vulnerabilities here.