
Lazy evaluation, however, has the benefit that, at least in theory, if conversion turns out to be unnecessary, one can skip it and never pay the price of conversion.

One could have an abstract 'String' type with concrete subclasses (ANSIString, UTF8String, UTF16String, EBCDICString, etc.).
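A minimal sketch of that hierarchy in Python (the class names mirror the ones above; they are illustrative placeholders, not any real library's API):

```python
from abc import ABC, abstractmethod

class String(ABC):
    """Abstract string; each subclass keeps its bytes in a native encoding."""

    @abstractmethod
    def to_utf8(self) -> bytes:
        """Convert to UTF-8, the assumed common interchange encoding."""

class UTF8String(String):
    def __init__(self, data: bytes):
        self.data = data

    def to_utf8(self) -> bytes:
        return self.data  # already UTF-8: conversion is free

class Latin1String(String):
    def __init__(self, data: bytes):
        self.data = data

    def to_utf8(self) -> bytes:
        # one byte per character, so a simple decode/re-encode suffices
        return self.data.decode("latin-1").encode("utf-8")
```

An EBCDICString would look like Latin1String with a codec such as cp037 swapped in.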

Assuming that all character strings to be handled can be round-tripped through UTF-8 (probably a workable assumption), any function working with strings could initially be implemented as:

- convert input strings to some encoding that is known to be able to encode all strings (UTF8 or UTF16 are obvious candidates)

- do its work on the converted strings

- return strings in any format it finds most suitable
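The three steps above can be sketched as a naive concatenation that always goes through the universal encoding (representing strings here as a bytes/encoding-name pair for brevity; the names are my own, not from any existing system):

```python
def to_utf8(data: bytes, encoding: str) -> bytes:
    """Round-trip into UTF-8, the encoding assumed to cover all inputs."""
    return data.decode(encoding).encode("utf-8")

def concat(a: bytes, a_enc: str, b: bytes, b_enc: str) -> bytes:
    # 1. convert both inputs to the universal encoding
    # 2. do the work on the converted bytes
    # 3. return the result in whatever format is convenient (UTF-8 here)
    return to_utf8(a, a_enc) + to_utf8(b, b_enc)
```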

Profiling would soon show that certain operations (for example, computing the length of a string) can be sped up by working on the native formats. One could then provide specialized implementations for the functions with the largest memory/time overhead.
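For the length example, a possible fast path (assuming the listed codecs really are one byte per character, which holds for ASCII, Latin-1, and single-byte EBCDIC code pages like cp037):

```python
# single-byte, fixed-width encodings: character count == byte count
FIXED_WIDTH_8BIT = {"ascii", "latin-1", "cp037"}

def char_length(data: bytes, encoding: str) -> int:
    if encoding in FIXED_WIDTH_8BIT:
        return len(data)                  # fast path: no decoding at all
    return len(data.decode(encoding))     # generic path: decode and count
```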

The end result _could_ be that one can write, say, a grep that can work with EBCDIC, UTF8 or ISO8859-1, without ever converting strings internally. For systems working with lots of text, that could decrease memory usage significantly.
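One way such a grep could avoid converting the (large) input: convert the (small) search pattern into the input's native encoding and search the raw bytes. This is a sketch under the assumption that naive byte search is safe for the encoding in question (true for single-byte encodings and for UTF-8, which is self-synchronizing; not safe in general, e.g. for UTF-16):

```python
def contains(haystack: bytes, haystack_enc: str, needle: str) -> bool:
    # Convert the needle into the haystack's native encoding,
    # so the haystack itself is never converted.
    return needle.encode(haystack_enc) in haystack
```

For example, searching EBCDIC (cp037) text this way touches each haystack byte once and allocates nothing proportional to the input size.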

Among the disadvantages of such an approach are:

- supporting multiple encodings efficiently will take significant time that, perhaps, is better spent elsewhere.

- the risk of obscure bugs increases ('string concatenation does not quite work if string a is EBCDIC, and string b is ISO8859-7, and a ends with rare character #x; somehow, the first character of b loses its diacritics in the result')

- a program/library that has that support will be larger. If a program works with multiple encodings internally, its working set will be larger.

- depending on the environment, the work (CPU time and/or programmer time) needed to call the 'correct for the character encoding' variant of a function can be too large (in particular, for functions that take multiple strings, it may be hard to choose the 'best' encoding to work with; if one takes function chains into account, the problem gets harder)

- it would not make text handling any easier, as programmers would, forever, have to keep specifying the encodings for the texts they read from, and write to, files and the network.

[That last one probably is not that significant, as I doubt we will get to the ideal world where all text is Unicode any time soon (and even there, one still has to choose between UTF8 and UTF16, at the least)]

I am not aware of any system that has attempted this approach, but I would like to be educated about any that have.



