Without digging into the source - and I intend to do that, if someone more familiar with this doesn’t chime in - it appears that it’s targeted at reducing resource consumption.
UTF-8 can encode emoji fine. Consider 😁 (“grinning face with smiling eyes”), which is `\xF0\x9F\x98\x81` in UTF-8. That’s four bytes. From the pg-emoji Readme:
> The input data is split into 10 bit fragments, mapped to the corresponding emojis.
If my understanding is correct thus far, then instead of storing four bytes for each emoji, you’d only need 10 bits.
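For my own understanding, here's a minimal sketch of "split into 10-bit fragments, map to symbols" in Python. The alphabet below is hypothetical (1024 consecutive codepoints); pg-emoji's actual table comes from the Unicode emoji-test list and will differ, as may its padding rules.

```python
# Illustrative only: NOT pg-emoji's actual alphabet or padding scheme.
ALPHABET = [chr(0x1F300 + i) for i in range(1024)]  # hypothetical 1024-symbol table

def emoji_encode(data: bytes) -> str:
    bits = len(data) * 8
    pad = (-bits) % 10                       # pad up to a multiple of 10 bits
    n = int.from_bytes(data, "big") << pad   # pack bytes into one integer
    return "".join(
        ALPHABET[(n >> shift) & 0x3FF]       # peel off 10-bit fragments
        for shift in range(bits + pad - 10, -1, -10)
    )

# A 4-byte UTF-8 emoji becomes ceil(32 / 10) = 4 symbols here, and each
# symbol is itself 4 bytes in UTF-8 on disk.
print(len(emoji_encode(b"\xF0\x9F\x98\x81")))  # 4
```

Which, if anything, suggests the encoding isn't a storage win at all: the output symbols are themselves 4-byte UTF-8 sequences.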
I don’t know where this would be worthwhile.
I’m further confused by the purpose of `to_text()` and `from_text()`. Their example shows a string composed of mostly Latin characters being encoded into a string of emoji and back.
> I’m further confused by the purpose of `to_text()` and `from_text()`. Their example shows a string composed of mostly Latin characters being encoded into a string of emoji and back.
This is meant to be used if you want to pass some text containing escape characters or perhaps JSON. Note also that the first emoji is a checksum, which might be useful if you want to make sure a user correctly copy/pasted a string, as opposed to sending a raw text string (without checksum).
> This is meant to be used if you want to pass some text containing escape characters or perhaps JSON.
I guess I don’t understand how this is an improvement. Perhaps it’s because I typically interact with the DB through a language-specific library/protocol like Python’s DB API, which handles escaping strings and parameterization without my really having to think about it.
Could you provide a specific example of when this might solve a real-world problem?
One intended use case for the from_text()/to_text() functions is when information is manually copied from one place and pasted into another, and you are worried the user might make a mistake and select the wrong piece of text.
For instance, if you instruct the user to copy "this text string" and paste it somewhere, some users might copy the text string with the double-quotes and some without them.
By emoji-encoding the string instead, the receiver of the copied emoji string can detect whether all of the emojis were copied.
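To make the idea concrete, here's a sketch of "the first emoji is a checksum". The real pg-emoji checksum algorithm isn't specified in this thread; this stand-in just sums the symbol indices mod 1024, over a hypothetical alphabet.

```python
# Hedged sketch: pg-emoji's actual alphabet and checksum differ from this.
ALPHABET = [chr(0x1F300 + i) for i in range(1024)]
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

def with_checksum(payload: str) -> str:
    chk = sum(INDEX[ch] for ch in payload) % 1024
    return ALPHABET[chk] + payload           # prepend the checksum symbol

def verify(message: str) -> bool:
    head, payload = message[0], message[1:]
    return INDEX[head] == sum(INDEX[ch] for ch in payload) % 1024

msg = with_checksum(ALPHABET[42] + ALPHABET[7])
print(verify(msg))       # True  — intact string
print(verify(msg[:-1]))  # False — a symbol was lost in copy/paste
```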
This strikes me as data validation, which should reside in the application layer - I don't see how pg-emoji helps in any way.
Further... if the receiver can validate the encoded string itself, they implicitly already have the string. Why require the user to copy/paste at all? If you meant "Ensure that the user hasn't copied quotation marks as well", then we're back to it being application logic.
If I'm understanding correctly that the primary benefit is that there is a checksum, then there are already many solutions for this in common use - base58checksum, as used to ensure the validity of Bitcoin addresses, comes immediately to mind. I wrote an implementation of that quite a while ago: https://github.com/lyndsysimon/cryptocoin/blob/primary/crypt...
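For comparison, the Base58Check scheme is small enough to sketch: the payload is followed by the first four bytes of a double SHA-256 digest, and the whole thing is encoded with Bitcoin's 58-character alphabet (which deliberately omits 0/O and I/l). Encode-only sketch, from memory of the scheme rather than any particular implementation:

```python
import hashlib

B58 = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def b58check_encode(payload: bytes) -> str:
    # Checksum: first 4 bytes of SHA-256(SHA-256(payload)).
    chk = hashlib.sha256(hashlib.sha256(payload).digest()).digest()[:4]
    raw = payload + chk
    n = int.from_bytes(raw, "big")
    out = ""
    while n:
        n, r = divmod(n, 58)
        out = B58[r] + out
    # Each leading zero byte is represented by the zero symbol '1'.
    leading = len(raw) - len(raw.lstrip(b"\x00"))
    return "1" * leading + out

print(b58check_encode(b"hello"))
```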
Please don't misunderstand, I'm in no way intending to be argumentative. I don't understand the practical use of this project, which leads me to believe that there is a problem being solved that I lack the context to identify.
In my case, PostgreSQL is the application layer, the application is written in database functions, and I’m using PostgREST to expose it to my front-end.
I believe it is an encoding of data into base 1024, using emoji as the symbol set. It's similar in concept to Base64, which encodes data into a format that can be sent anywhere ASCII is acceptable; the idea, presumably, is to do the same thing more densely in systems that accept emoji.
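If the base-1024 reading is right, a quick back-of-the-envelope comparison (symbol counts only, not pg-emoji's actual output):

```python
import base64, math

# Assumes 6 bits per Base64 symbol and 10 bits per emoji symbol.
data = bytes(range(32))                                  # 256 bits of data
b64_symbols = len(base64.b64encode(data).rstrip(b"="))   # ceil(256 / 6)
emoji_symbols = math.ceil(len(data) * 8 / 10)            # ceil(256 / 10)
print(b64_symbols, emoji_symbols)  # 43 26
```

Fewer symbols on screen, though each emoji is 4 bytes in UTF-8, so the emoji form is larger in raw bytes.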
I'd expect that most places where emoji are accepted and reliably preserved, you could also use things like Han ideographs, which would give you a much larger symbol set to work with.
One reason to pick emoji is visual distinctiveness and user familiarity. While admittedly there are large populations familiar with CJK ideographs and their construction/deconstruction, there are many more people familiar with emoji at this point. In the case of an encoding error or trying to visually "diff" two encodings, many audiences will spot emoji differences and/or problems with badly encoded emoji (much easier than they might spot differences in CJK ideographs).
(Admittedly there are still issues within the emoji space, such as the fact that some of the "faces" are quite similar in appearance in many fonts and still easily confused. And in the larger emoji space, the subtle differences of skin tone and gender can be easily confused if you have to rely on them for distinction. Restricting to only 1024 emoji and fewer ZWJ sequence variations presumably takes care of most of those issues.)
That would certainly make more sense than anything I’ve been able to glean from it.
I’ve definitely used Postgres text columns to store user-provided text values that included emoji in the past. They’re part of my standard test case for any user input.
The idea is to encode binary strings in a visually shorter form than e.g. hex, and also make it easier to visually detect differences. It’s also possibly easier to remember a bunch of emojis than a hex string.
Not really. Do you really notice whether someone uses "Grinning face with smiling eyes" or just plain "Grinning face"? Or was it "Grinning face with big eyes", or maybe "Beaming face with smiling eyes". Or were they "squinting" eyes? Maybe the face was just "smiling", not "grinning". Sheesh.
If you select a good set of clearly distinguishable emojis I could see the use for it. See for example the Matrix spec, which recommends using emojis for verifying E2EE signatures.
Yes, a carefully selected set of 64 symbols would be much more sensible from that point of view. This project, though, apparently uses "the first 1024 emojis from [https://unicode.org/Public/emoji/13.1/emoji-test.txt]", which is an entirely different matter.
It's a fun novelty project; it serves no practical purpose. It's an encoding scheme similar to Base64 or URL encoding, but one that does nothing useful.