Without digging into the source - and I intend to do that, if someone more familiar with this doesn’t chime in - it appears that it’s targeted at reducing resource consumption.
UTF-8 can encode emoji fine. Consider 😁 (“grinning face with smiling eyes”), which is `\xF0\x9F\x98\x81` in UTF-8. That’s four bytes. From the pg-emoji Readme:
> The input data is split into 10 bit fragments, mapped to the corresponding emojis.
If my understanding is correct thus far, then instead of storing four bytes for each emoji, you’d only need 10 bits.
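For my own understanding, here's a minimal sketch of "split into 10-bit fragments, map to symbols" in Python. The alphabet below is hypothetical (1024 consecutive codepoints); pg-emoji's actual table comes from the Unicode emoji-test list and will differ, as may its padding rules.

```python
# Illustrative only: NOT pg-emoji's actual alphabet or padding scheme.
ALPHABET = [chr(0x1F300 + i) for i in range(1024)]  # hypothetical 1024-symbol table

def emoji_encode(data: bytes) -> str:
    bits = len(data) * 8
    pad = (-bits) % 10                       # pad up to a multiple of 10 bits
    n = int.from_bytes(data, "big") << pad   # pack bytes into one integer
    return "".join(
        ALPHABET[(n >> shift) & 0x3FF]       # peel off 10-bit fragments
        for shift in range(bits + pad - 10, -1, -10)
    )

# A 4-byte UTF-8 emoji becomes ceil(32 / 10) = 4 symbols here, and each
# symbol is itself 4 bytes in UTF-8 on disk.
print(len(emoji_encode(b"\xF0\x9F\x98\x81")))  # 4
```

Which, if anything, suggests the encoding isn't a storage win at all: the output symbols are themselves 4-byte UTF-8 sequences.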
I don’t know where this would be worthwhile.
I’m further confused by the purpose of `to_text()` and `from_text()`. Their example shows a string composed of mostly Latin characters being encoded into a string of emoji and back.
> I’m further confused by the purpose of `to_text()` and `from_text()`. Their example shows a string composed of mostly Latin characters being encoded into a string of emoji and back.
This is meant to be used if you want to pass some text containing escape characters or perhaps JSON. Note also that the first emoji is a checksum, which might be useful if you want to make sure a user correctly copy/pasted a string, as opposed to sending a raw text string (without checksum).
> This is meant to be used if you want to pass some text containing escape characters or perhaps JSON.
I guess I don’t understand how this is an improvement. Perhaps it’s because I typically interact with the DB through a language-specific library/protocol like Python’s DB API, which handles escaping strings and parameterization without my really having to think about it.
Could you provide a specific example of when this might solve a real-world problem?
One intended use case for the from_text()/to_text() functions is when information is manually copied from one place and pasted into another, and you are worried the user might make a mistake and select the wrong piece of text.
For instance, if you instruct the user to copy "this text string" and paste it somewhere, some users might copy the text string with the double-quotes and some without them.
By emoji-encoding the string instead, the receiver of the copied emoji string can detect whether all of the emojis were copied.
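To make the idea concrete, here's a sketch of "the first emoji is a checksum". The real pg-emoji checksum algorithm isn't specified in this thread; this stand-in just sums the symbol indices mod 1024, over a hypothetical alphabet.

```python
# Hedged sketch: pg-emoji's actual alphabet and checksum differ from this.
ALPHABET = [chr(0x1F300 + i) for i in range(1024)]
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

def with_checksum(payload: str) -> str:
    chk = sum(INDEX[ch] for ch in payload) % 1024
    return ALPHABET[chk] + payload           # prepend the checksum symbol

def verify(message: str) -> bool:
    head, payload = message[0], message[1:]
    return INDEX[head] == sum(INDEX[ch] for ch in payload) % 1024

msg = with_checksum(ALPHABET[42] + ALPHABET[7])
print(verify(msg))       # True  — intact string
print(verify(msg[:-1]))  # False — a symbol was lost in copy/paste
```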
This strikes me as data validation, which should reside in the application layer - I don't see how pg-emoji helps in any way.
Further... if the receiver can validate the encoded string itself, they implicitly already have the string. Why require the user to copy/paste at all? If you meant "Ensure that the user hasn't copied quotation marks as well", then we're back to it being application logic.
If I'm understanding correctly that the primary benefit is that there is a checksum, then there are already many solutions for this in common use - base58checksum, as used to ensure the validity of Bitcoin addresses, comes immediately to mind. I wrote an implementation of that quite a while ago: https://github.com/lyndsysimon/cryptocoin/blob/primary/crypt...
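For comparison, the Base58Check scheme is small enough to sketch: the payload is followed by the first four bytes of a double SHA-256 digest, and the whole thing is encoded with Bitcoin's 58-character alphabet (which deliberately omits 0/O and I/l). Encode-only sketch, from memory of the scheme rather than any particular implementation:

```python
import hashlib

B58 = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def b58check_encode(payload: bytes) -> str:
    # Checksum: first 4 bytes of SHA-256(SHA-256(payload)).
    chk = hashlib.sha256(hashlib.sha256(payload).digest()).digest()[:4]
    raw = payload + chk
    n = int.from_bytes(raw, "big")
    out = ""
    while n:
        n, r = divmod(n, 58)
        out = B58[r] + out
    # Each leading zero byte is represented by the zero symbol '1'.
    leading = len(raw) - len(raw.lstrip(b"\x00"))
    return "1" * leading + out

print(b58check_encode(b"hello"))
```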
Please don't misunderstand, I'm in no way intending to be argumentative. I don't understand the practical use of this project, which leads me to believe that there is a problem being solved that I lack the context to identify.
In my case, PostgreSQL is the application layer, the application is written in database functions, and I’m using PostgREST to expose it to my front-end.
I believe it is an encoding of data into base 1024, using emoji as the symbol set. It's similar in concept to Base64, which encodes data into a format that can be sent anywhere ASCII is acceptable; the idea, presumably, is to do the same thing more densely in systems that accept emoji.
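If the base-1024 reading is right, a quick back-of-the-envelope comparison (symbol counts only, not pg-emoji's actual output):

```python
import base64, math

# Assumes 6 bits per Base64 symbol and 10 bits per emoji symbol.
data = bytes(range(32))                                  # 256 bits of data
b64_symbols = len(base64.b64encode(data).rstrip(b"="))   # ceil(256 / 6)
emoji_symbols = math.ceil(len(data) * 8 / 10)            # ceil(256 / 10)
print(b64_symbols, emoji_symbols)  # 43 26
```

Fewer symbols on screen, though each emoji is 4 bytes in UTF-8, so the emoji form is larger in raw bytes.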
I'd expect that most places where emoji are accepted and reliably preserved, you could also use things like Han ideographs, which would give you a much larger symbol set to work with.
One reason to pick emoji is visual distinctiveness and user familiarity. While admittedly there are large populations familiar with CJK ideographs and their construction/deconstruction, there are many more people familiar with emoji at this point. In the case of an encoding error or trying to visually "diff" two encodings, many audiences will spot emoji differences and/or problems with badly encoded emoji (much easier than they might spot differences in CJK ideographs).
(Admittedly there are still issues within the emoji space, such as the fact that some of the "faces" are quite similar in appearance in many fonts and still easily confused. And in the larger emoji space, the subtle differences of skin tone and gender can be easily confused if you have to rely on them for distinction. Restricting to only 1024 emoji and fewer ZWJ sequence variations presumably takes care of most of those issues.)
That would certainly make more sense than anything I’ve been able to glean from it.
I’ve definitely used Postgres text columns to store user-provided text values that included emoji in the past. They’re part of my standard test case for any user input.
The idea is to encode binary strings in a visually shorter form than e.g. hex, and also make it easier to visually detect differences. It’s also possibly easier to remember a bunch of emojis than a hex string.
Not really. Do you really notice whether someone uses "Grinning face with smiling eyes" or just plain "Grinning face"? Or was it "Grinning face with big eyes", or maybe "Beaming face with smiling eyes". Or were they "squinting" eyes? Maybe the face was just "smiling", not "grinning". Sheesh.
If you select a good set of clearly distinguishable emojis I could see the use for it. See for example the Matrix spec, which recommends using emojis for verifying E2EE signatures.
Yes, a carefully selected set of 64 symbols would be much more sensible from that point of view. This project, though, apparently uses "the first 1024 emojis from [https://unicode.org/Public/emoji/13.1/emoji-test.txt]", which is an entirely different matter.
It's a fun novelty project; it serves no practical purpose. It's an encoding scheme similar to Base64 or URL encoding, but one that does nothing useful.