As for why the problem took so long to correct, Amazon says that some of its server systems haven't been restarted in "many years." Given how much the S3 system has expanded, "the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected."
Amazon has apologized and promised to do better in the future. It says it has altered the at-fault tool (the code, not the employee) so that it removes capacity more slowly, and it is adding safeguards to stop so many servers from being taken offline at once.