If you can screen tokens against your grammar fast enough, you can build a bitmask over the entire token vocabulary and apply it right before sampling. As vocabulary sizes grow, doing this in real time gets harder, but we (and other libraries) have found several optimizations that make it extremely fast (e.g. for guidance, we detail some optimizations here https://github.com/guidance-ai/llguidance/blob/main/docs/opt...).
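To give a rough sense of what "apply it right before sampling" means, here's an illustrative sketch (not how llguidance actually implements it -- the real work is making the screening step fast, not this part):

    import numpy as np

    def sample_with_mask(logits, vocab, token_is_allowed):
        # Screen every token in the vocabulary against the grammar and build
        # a boolean mask. (A naive Python loop like this is exactly what the
        # optimizations avoid -- this only shows where the mask is applied.)
        mask = np.array([token_is_allowed(tok) for tok in vocab], dtype=bool)

        # Disallowed tokens get -inf logits, so softmax gives them probability 0.
        masked = np.where(mask, logits, -np.inf)
        probs = np.exp(masked - masked.max())
        probs /= probs.sum()
        return int(np.random.choice(len(vocab), p=probs))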
Other libraries work by essentially pre-computing all the masks for all possible generations, but of course you're restricted to working with simple grammars in that case (like a subset of regular expressions).
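A sketch of that precomputation idea (hypothetical DFA API, just to show why it only works for regular languages):

    def precompute_masks(dfa, token_strings):
        # For every DFA state, record which tokens can be consumed without
        # hitting the dead state. This runs once, up front, and is only
        # tractable because a DFA has a small fixed set of states -- which is
        # exactly why this approach caps out at (a subset of) regular expressions.
        masks = {}
        for state in dfa.states():
            masks[state] = [dfa.step(state, tok) is not None for tok in token_strings]
        return masks

    # At decode time you just look up masks[current_state] -- no per-step
    # grammar work at all.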