
I just don't get the need to figure out a system that represents the one true way of taking notes. Everybody is different: different skills, needs and working styles. Some want a diary, others a second brain. It doesn't matter which system you pick, the result will be the same: it fits you somewhere between 50% and 90%.

To take clothes as an analogy: when you buy a shirt off the shelf, you can (usually) choose between XXS and XXL. But if you wanted to wear that shirt for the rest of your life, wouldn't the natural conclusion be a tailor-made shirt?

The real solution is to take one note-taking system as a starting point and adapt it to your needs. There is no shortcut; it will take a while until you've figured out what your needs are and how to adapt the system you chose.


I'm not sure why gzip still pops up for FASTQ data, as it is quite easy to bin the quality scores, align the reads against a reference genome and compress the result as e.g. CRAM [1,2].

With 8 bins, variant calling accuracy seems to be preserved, while drastically reducing the file size.
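
To make the binning step concrete, here is a toy sketch in Python. The bin edges follow Illumina's 8-level binning scheme as far as I remember (treat them as an assumption), and a real pipeline would do this during alignment/CRAM conversion rather than in a script like this.

    # Toy sketch of 8-bin quality binning; bin edges are assumed to match
    # Illumina's 8-level scheme, adjust them to whatever your variant
    # caller tolerates.
    def bin_quality(phred: int) -> int:
        if phred < 2:
            return 0          # no-call / unusable
        if phred < 10:
            return 6
        if phred < 20:
            return 15
        if phred < 25:
            return 22
        if phred < 30:
            return 27
        if phred < 35:
            return 33
        if phred < 40:
            return 37
        return 40

    def bin_qual_line(qual: str, offset: int = 33) -> str:
        # FASTQ quality lines are Phred+33 ASCII
        return "".join(chr(bin_quality(ord(c) - offset) + offset) for c in qual)

    print(bin_qual_line("IIIIFFF##"))  # -> "IIIIFFF''"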

[1]: https://en.wikipedia.org/wiki/CRAM_%28file_format%29

[2]: https://lh3.github.io/2020/05/25/format-quality-binning-and-...


You don't necessarily have a reference genome to align to. For example, I've recently been working with wastewater metagenomics where (a) the sample consists of a very large number of organisms and (b) we don't have reference genomes for most of these organisms anyway.


That can be a challenge, but you can also build an "artificial" reference genome. You just use it for compression, not for any real analyses. This would allow you to still use alignment-based compression.

But I agree with you: it really depends on the type of the data.


It would also be nice if the artificial reference represented global population structure. For example, the larger the genetic distance between the individual being sequenced and the people who make up the reference (an amalgam of several individuals from a common US population), the less compression you get. Instead, it seems like you could create the "genome that is the shortest distance to all other genomes" (a centroid of cluster centroids), and then the standard deviation of your compressed sizes should be much smaller.
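
To illustrate the centroid idea with a toy example: pick, from a set of sequences, the one with the smallest total distance to all others (a medoid). Hamming distance on equal-length strings stands in for a real genetic distance here; this is purely illustrative, not a pipeline.

    # Illustrative only: choose the sequence that minimises the summed
    # distance to all other sequences (a medoid). Real work would use a
    # proper genetic distance and assembled genomes, not toy strings.
    def hamming(a: str, b: str) -> int:
        return sum(x != y for x, y in zip(a, b))

    def medoid(seqs: list[str]) -> str:
        return min(seqs, key=lambda s: sum(hamming(s, t) for t in seqs))

    samples = ["ACGTACGT", "ACGTACGA", "ACCTACGT", "TCGTACGT"]
    print(medoid(samples))  # the sequence closest, on average, to the rest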


Well, I think the issue with wastewater and other screening tech is that there is no global average reference genome. In that case they're sequencing everything from phages, viruses (human and plant), bacteria, fungi, plants/animals and humans... it's an everything soup.


Oh. From what I can tell, the total world storage for non-human genome data is trivially small (a few petabytes and not growing rapidly). Human is huge: O(petabytes)/year for a single org is not out of the question.


That's true, but we do tremendous amounts of human DNA sequencing for certain causes at scale (e.g. understanding/treating cancer), whereas environmental sequencing is usually done to monitor or search for things at a much lower sample rate (e.g. disease load in wastewater, biodiversity from environmental samples, and looking for natural products produced by the zillions of bacteria/archaea in the oceans). From, e.g., a wastewater sample perspective, the latter type is going to be the majority of the data; we just filter out the stuff of interest and analyze it in situ. But there's no reason to store 1B E. coli genomes, whereas this is necessary if we want to understand cancer evolution.


If you want to use untargeted metagenomics to detect novel human viruses you're going to be generating petabytes all by yourself: https://arxiv.org/pdf/2108.02678.pdf


I can't see any reason why you would need to save petabytes. Remember: at that scale, people think really hard about whether to pay the long-term storage and associated costs (the value of having this system should exceed its costs). The case for this already exists in (for example) cancer and other pharma.


The storage is massively cheaper than the sequencing. At some point it could be worth going back and trying to figure out how much of the raw data can safely be discarded, but at least at first there are so many other things that are more urgent.

(The paper I linked describes more or less what I'm currently working on)


It might be because some popular bioinformatics tools support using gzipped data directly.


That answer somehow reminds me of an article in logicmag: An Interview with an Anonymous Data Scientist [1].

[1]: https://logicmag.io/intelligence/interview-with-an-anonymous...


Funny coincidence: just one week ago a colleague of mine and I started on "pytest-arch" [1], a pytest plugin to test for architectural constraints. We kept it very simple on purpose. It is already usable and works well, at least for our use cases.

You can use it to check e.g. if your domain model is importing stuff that it should not import.

We are planning to publish it on PyPI soon.
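
To give a feel for the kind of constraint we mean, here is a hand-rolled sketch using plain pytest and the ast module; it is not pytest-arch's actual API (which may look quite different), and "myapp.domain" / "myapp.infrastructure" are made-up package names.

    # Hand-rolled sketch, not pytest-arch's API: fail if any module in the
    # (made-up) myapp.domain package imports from myapp.infrastructure.
    import ast
    import importlib.util
    import pathlib

    FORBIDDEN_PREFIX = "myapp.infrastructure"

    def imported_modules(path: pathlib.Path) -> set[str]:
        tree = ast.parse(path.read_text())
        names = set()
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names.update(alias.name for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                names.add(node.module)
        return names

    def test_domain_does_not_import_infrastructure():
        spec = importlib.util.find_spec("myapp.domain")
        domain_dir = pathlib.Path(spec.submodule_search_locations[0])
        for py_file in domain_dir.rglob("*.py"):
            offenders = {m for m in imported_modules(py_file)
                         if m.startswith(FORBIDDEN_PREFIX)}
            assert not offenders, f"{py_file} imports {offenders}"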

[1]: https://github.com/jwbargsten/pytest-arch


In terms of practical application, I've seen Fennel [2] used in the wild (e.g. leap [1]) and enjoyed using it for writing neovim plugins.

[1]: https://github.com/ggandor/leap.nvim

[2]: https://fennel-lang.org/


Do you feel that fennel was merely a slightly nicer way of writing neovim plugins, or did it in fact give you a significant boost in productivity/capabilities?


Difficult to say, as I did not measure anything. From the code I've written I get the "feeling" that it is more compact than plain Lua, reducing (my) cognitive load.

Fennel transpiles to Lua, so it doesn't give you more capabilities, I would say. Productivity (and capability) is confounded by so many other factors anyway that the choice of programming language becomes almost negligible (unless you pick one of the extremes, such as Brainfuck, of course).


Fennel can also be used to write games in the TIC-80 fantasy console (an open source project similar to PICO-8).


I was surprised that nobody mentioned the Collective Code Construction Contract from ZeroMQ/Pieter Hintjens [1]. It tries to minimise the friction of maintaining & contributing to open source projects.

It is not perfect, of course, but at least it is a good start. In particular, it can considerably reduce the "values"- and opinion-based discussions.

[1]: https://rfc.zeromq.org/spec/42/


Basically, you have three approaches to tackling code samples in markdown files:

1. run the code with some kind of plugin as part of your doc pipeline

2. generate documentation from your code

3. take some kind of hybrid approach

I went for 3: annotate snippet "areas" in the source code of a project (mainly in tests) and extract the snippets into a folder, e.g. the mkdocs folder. I commit them to the (docs) repo. If the project changes, I usually fix the tests and update the snippets in mkdocs. This way I can be sure that the code in the documentation actually works and people can copy & paste it. To scratch my own itch, I (surprise, surprise) created a script and even packaged it [1].
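
For illustration, the extraction step boils down to something like the sketch below. The ":snippet name:" / ":endsnippet:" comment markers are invented for this example; they are not snex's actual syntax.

    # Rough sketch of the extraction idea; the marker syntax is made up
    # and is not what snex actually uses.
    import pathlib
    import re

    BEGIN = re.compile(r"#\s*:snippet\s+(\w+):")
    END = re.compile(r"#\s*:endsnippet:")

    def extract_snippets(src: pathlib.Path, out_dir: pathlib.Path) -> None:
        out_dir.mkdir(parents=True, exist_ok=True)
        name, buf = None, []
        for line in src.read_text().splitlines():
            m = BEGIN.search(line)
            if m:
                name, buf = m.group(1), []
            elif END.search(line) and name:
                (out_dir / f"{name}.py").write_text("\n".join(buf) + "\n")
                name = None
            elif name:
                buf.append(line)

    # e.g. pull snippets out of the test suite into the mkdocs folder
    for test_file in pathlib.Path("tests").rglob("test_*.py"):
        extract_snippets(test_file, pathlib.Path("docs/snippets"))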

[1]: https://pypi.org/project/snex/

