If you take W as the window size and N as the input size, this is still O(N * W²), as TFA states. (TFA's implementation is generic over W; yours fixes W as a constant and is essentially -funroll'd in place. For big-O purposes that's the same thing, though; as you mention in your later reply, you could even write a macro that is generic over W.)
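
To make the counting concrete, here is a rough sketch. It assumes the task is "find the first window of W pairwise-distinct items", which is my guess at the problem and not something quoted from TFA. Both function names are made up for illustration. The first version is generic over W and counts comparisons; the second hard-codes W = 4 with the comparisons written out by hand.

```python
def first_distinct_window(data, w):
    """Generic over w: return (start index or -1, comparisons performed).

    Hypothetical helper, not TFA's code. Each window does w*(w-1)/2
    pairwise comparisons (no early exit, so the count is easy to reason
    about), and there are at most N - w + 1 windows: O(N * W^2) overall.
    """
    comparisons = 0
    for start in range(len(data) - w + 1):
        distinct = True
        for i in range(w):
            for j in range(i + 1, w):
                comparisons += 1
                if data[start + i] == data[start + j]:
                    distinct = False
        if distinct:
            return start, comparisons
    return -1, comparisons


def first_distinct_window_4(data):
    """Same idea with w fixed at 4 and the inner loops unrolled by hand."""
    for s in range(len(data) - 3):
        a, b, c, d = data[s:s + 4]
        # Six comparisons = 4*3/2, the same W^2 term, just with W constant.
        if a != b and a != c and a != d and b != c and b != d and c != d:
            return s
    return -1
```

The unrolled version looks like O(N) only because W is baked in. The six comparisons in the `if` are the W² factor, which is why the two agree in big-O terms.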