So, you might not have noticed, but I still haven't published my BIPs. That's mainly because I've been having too much fun with Julian, who has been hacking on the benchmark code as we refine the BIP.
In particular, OP_ROLL. BIP-143 already notes this opcode can be slow, and indeed, shifting every stack element by one is bad enough when you can have 1000 of them. If we want to increase that limit to 32k (which we'd like to do so that, for example, you can push every output onto the stack), we can no longer ignore this problem.
This is the only case where stack manipulation itself causes significant overhead. For every other opcode we can treat stack manipulation as part of the cost of interpreting the opcode, which varops does not try to address. Annoying!
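To make that concrete, here's a rough sketch of where the cost comes from. It assumes the stack is a plain contiguous array with the top at the end (a Python list here, standing in for a C++ vector); the names `op_roll` and `worst_case_moves` are made up for illustration, and this is not the actual bitcoind code.

```python
def op_roll(stack, n):
    """Move the item n positions below the top to the top of the stack.

    Returns how many element slots had to shift down to close the gap,
    assuming a contiguous array with the top of the stack at the end.
    """
    if n < 0 or n >= len(stack):
        raise IndexError("OP_ROLL depth out of range")
    idx = len(stack) - 1 - n
    item = stack.pop(idx)  # the n items above idx each move down one slot
    stack.append(item)
    return n


def worst_case_moves(stack_size):
    """Slots shifted when rolling the bottom item of a full stack."""
    return max(stack_size - 1, 0)


stack = [b"a", b"b", b"c", b"d"]
moved = op_roll(stack, 2)  # pulls b"b" up to the top
```

The point is that the work scales with the depth you roll from, not with the size of the item being moved: rolling the bottom item of a 1000-element stack shifts 999 slots, and a larger stack limit raises that ceiling in proportion.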
So we have benchmarks which show how much we should charge for it to limit the damage it can cause. But everything else is derived from a "bytes manipulated" model, and so I would like to extend the model a little to take into account this case. Not just for bitcoind as it is today, but for any reasonable implementation in the future.