Fathom: a framework for understanding web pages

unabst · on April 27, 2017

> The browser could recognize a Log In link, follow it in the background, and log you in,

As a web developer, this is exactly what I don't want. Just the other day I got bit by font boosting on Mobile Chrome. Couldn't figure out for the life of me why my h1 was bigger than I had explicitly specified. No traces of anything going on on the desktop either. Turns out, my page was being tampered with because someone at Chrome had a "brilliant" idea for mobile. The fix is to specify max-height to something inconsequential like 1000000px. My page just had a header, but font boosting destroys navigation menus and tiles too. Thank you but no thank you.

There cannot be any conflict between browser and web developer. If we have to fight and hack one another, we're both failing.

Font boosting needs to be explicit. And nothing should be done without the explicit intent of the author of the page.

If the browser wants to provide automatic login, then great. Please outline exactly how to enable it, make it as easy and automated as possible, and help decrease my workload. Thank you and thank you.

flukus · on April 28, 2017

> There cannot be any conflict between browser and web developer. If we have to fight and hack one another, we're both failing.

It sounds like your understanding is wrong. Your CSS is a polite suggestion to the browser, nothing more.

If you want pixel-perfect rendering then publish PDFs.

pharrlax · on April 28, 2017

The problem is that when browsers interpret features as "polite suggestions", what tends to happen is either

A) Developers hack their way around to the result they want anyway, or

B) Users of that browser lose a chunk of the experience.

Either way, everyone loses.

flukus · on April 28, 2017

C) Users get a much better experience, like with reader mode.

When I read a site I want to read it in the colors and fonts I want, not what some designer thinks looks nice.

The web was designed to work this way.

TeMPOraL · on April 28, 2017

Yeah, this is the root of the conflict.

Web designers think web pages are like magazines, that they get to dictate layout. And most of the development in web browsers and standards lends itself to that view.

As a user, OTOH, I think of the web browser as my User Agent. I accept "polite suggestions" from the website, but ultimately I should have the last word about how it looks. The web was originally designed to work this way.

I'm not sure how to reconcile those views. I'm obviously biased towards the second, and I'd love web developers to be told to bite it. But this, unfortunately, won't work, because there are strong business pressures to the contrary.

zeckalpha · on April 28, 2017

> If you want pixel-perfect rendering then publish PDFs.

I hope that's sarcasm. PDFs are similarly beholden.

unabst · on April 28, 2017

Code should be explicit. There should be no "in between the lines" and the browser should not be guessing what the developer or consumer wants, let alone override what was explicitly declared.

Then having to "politely suggest" max-height: 1000000px is not the right conversation.

derefr · on April 28, 2017

What does it mean to "be explicit" about how to render a page, when the UA is, say, a screen-reader for a blind person? Or Siri? Or an even more novel client-type with no directly-applicable CSS rules? They want to do a completely different thing with the page than what you think of as "rendering" it. They might take some of your CSS as hints on how to do that job, but they're not obeying that CSS; they're inferring their own rule set for their own rendering algorithm from CSS.

Perhaps there needs to be a distinction made between "standard web browsers in Standards Mode"—programs that nominally "obey" CSS, and should be chastised for deviating from it—and all other clients, on which you can place no such expectation.

(But even then, it's perfectly within the rules of CSS to apply UA styles at the beginning of the cascade. Usually those are per-browser, but there's no reason there couldn't be per-document ones generated from heuristics. Want to turn them off? Use a CSS Reset.)

unabst · on April 28, 2017

> when the UA is, say, a screen-reader for a blind person?

The blind person explicitly turns on blind mode. This is not a difficult problem. Trying to guess if the user is blind is a hard problem.

If the user wants bigger fonts, they can zoom. If font boosting is a feature they would like on, then have them turn it on. If the author makes a crappy page it should be on them. The best thing a browser could do is let them know. And to visitors, provide options. But instead, they automatically enable these secret brilliant features that break random things. These are not solutions. These are the cause of many problems.

Automatic login? Great. Have a button. Have a feature. Don't do it in the background. Just let me tell you what I want. And let the page authors embrace those features and tailor them, and not wind up hacking them to tame their functionality.

derefr · on April 28, 2017

I'm not talking about option-switches; I'm talking about purpose-built browsers or browser extensions, where using the browser was the choice you made to get these effects.

If someone wants to develop what's essentially "automatic login, even for sites that would hate you if you did that: the browser", is that wrong?

Hell, this is essentially the same argument as the one behind ad-blocking. If someone wants to build a browser that—by default—alters your page to remove ads (surprise: someone does!), and people want to use that browser (probably: everyone), I don't think you have the moral high ground to tell them to stop.

Sure, you might be able to at least insist that they do some sort of feature-negotiation, where they tell you (maybe with feature-headers? a piece-wise replacement for UA-string heuristics, how lovely) what they're going to do to the page, and your server can then choose to do things like just not serving pages to people who have {font boosting, ad-blockers, etc.}; or redirecting to a page telling them their browser is bad and they should feel bad.
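To make the feature-negotiation idea concrete, here is a minimal sketch of the server side in Python. Everything specific in it is an assumption made for illustration: the `UA-Features` header name, the feature tokens, and the `/unsupported-browser` path are hypothetical, and no real browser is assumed to send such a header.

```python
from typing import Mapping, Set, Tuple

# Hypothetical header name and feature tokens; no real browser is assumed
# to send these. They only illustrate the idea from the comment above.
FEATURE_HEADER = "UA-Features"
REFUSED_FEATURES = {"font-boosting", "ad-blocking"}


def parse_features(headers: Mapping[str, str]) -> Set[str]:
    """Read a comma-separated feature list, ignoring case and whitespace."""
    # HTTP header names are case-insensitive, so normalize the keys first.
    normalized = {name.lower(): value for name, value in headers.items()}
    raw = normalized.get(FEATURE_HEADER.lower(), "")
    return {token.strip().lower() for token in raw.split(",") if token.strip()}


def decide(headers: Mapping[str, str]) -> Tuple[int, str]:
    """Return an (HTTP status, location-or-body) pair for the request."""
    declared = parse_features(headers)
    objectionable = sorted(declared & REFUSED_FEATURES)
    if objectionable:
        # 303 See Other: send the client to an explanatory page instead.
        return 303, "/unsupported-browser?features=" + ",".join(objectionable)
    return 200, "page content"
```

A client that declares nothing gets the page as usual; one that declares a refused feature is redirected. Whether a server should behave this way is exactly what the thread is arguing about.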

unabst · on May 1, 2017

> feature-negotiation

It's words like these. We should not be looking to negotiate with anyone -- not as an author, and not as a consumer.

If someone builds an auto-login browser, or an auto-login feature, then fine. But when Mozilla decides to put this in by default without telling everyone, then not fine.

If font boosting were part of some cross-browser mobile web standard then fine. But if you're Google and it's a feature specific to your browser, and you turn it on by default, then not fine.

Of course, it seems like the premise is always that your feature can do no harm. But in practice it's always the opposite. There are always edge cases where something is broken or not right, and the cause turns out to be one of these features.

Login is actually extremely important. To tamper with the behavior of the browser when it comes to logins, and to do it automatically, seems extremely dangerous.

But either way, it's about communication, not negotiation. If the user wants to turn it on, great. If the author turns it on, great. If no one turns it on, but the browser developer just "thinks it's a good idea" for every site ever built, then not great.

deburo · on April 28, 2017

The author suggested having a standard browser UI for login. The website would only support it. Am I misunderstanding their intent? I don't see the problem here.

braveo · on April 28, 2017

I agree with you with the caveat that the customer themselves can, and should, be changing things to suit their own needs.

But at that point if the customer screws something up, they understand that's on them.

unabst · on April 28, 2017

Exactly. My rant was lopsided, but to be more accurate: The browser should facilitate the communication between the author and the reader of the page, and not contaminate it.

All these features are great. They just need to be made explicit and obvious. Inferring what anyone wants, let alone overriding what they said they wanted, should just be a standard no-no.

The best thing a browser could do for the author and the reader would be to explicitly say why something is broken and how to fix it. For example, "X is broken because of extension Y".

Browsers already have great analysis tools but there could be one more FYI breakdown where the browser would just tell you what it knows about any conflicts or issues it sees, and provide options. Something designed for the average user, but that could be extremely useful for authors also by providing a simple list of issues the browser sees about their page -- from an objective, browser POV. Input is always valuable. Acting on it without permission, not so much.

yeukhon · on April 28, 2017

Well, the polite suggestion becomes the "magic number" we are taught to avoid writing in our programming 101 class.

vertex-four · on April 28, 2017

CSS was never, ever meant to allow for pixel-perfect rendering. That's something that wound up being forced on the web when print designers got a hold of FrontPage, Dreamweaver, etc.

Your fix should be to ensure your site continues to work when some of your CSS is ignored.

nebabyte · on April 28, 2017

I don't get this seeming holy-war argument for what CSS 'should' and 'shouldn't' be used for that somehow attempts to discount the contrasting opinions those print designers (and their ecosystem) likely have.

CSS can be used for many things; that's part of the beauty of an open standard. I don't see what's wrong with those users pushing to maintain their ecosystem on a platform which clearly can support it to create works which benefit everyone.

TeMPOraL · on April 28, 2017

Ultimately there's a strong conflict here, about who gets to decide what gets rendered on my screen. I say it should be me (via my user agent, i.e. browser); web designers who treat webpages as print magazines think they should get to decide.

gavinpc · on April 27, 2017

We've been seeing more Datalog-inspired DSLs around here, and that's a good thing. Fathom surely has uses beyond those the OP mentioned.

But as for those use cases... well, it just makes me sad. Obviously people have perverse incentives to make the kind of noise that Mozilla is bemoaning, and those people will find a way to game any system --- especially a highly readable one!

grincho · on April 28, 2017

> Fathom surely has uses beyond those the OP mentioned.

(Author here.) In fact, Fathom isn't particularly coupled to the DOM, apart from the dom() call that acts as the initial source of data and some of its optional utility procedures. With a few tweaks, you could use it for any score-and-rank problem.
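For readers unfamiliar with the pattern, a score-and-rank problem can be sketched in a few lines. This is not Fathom's API (Fathom is a JavaScript library with its own ruleset syntax). It is a minimal Python illustration that assumes hypothetical rule functions, each returning a multiplicative score, with the highest-scoring candidate winning.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# A rule inspects a candidate and returns a score multiplier
# (1.0 means "no opinion"). All names here are hypothetical.
Rule = Callable[[str], float]


@dataclass
class Scored:
    candidate: str
    score: float


def rank(candidates: Iterable[str], rules: Iterable[Rule]) -> List[Scored]:
    """Score every candidate with every rule, then sort best-first."""
    rule_list = list(rules)
    scored = []
    for candidate in candidates:
        score = 1.0
        for rule in rule_list:
            score *= rule(candidate)
        scored.append(Scored(candidate, score))
    return sorted(scored, key=lambda s: s.score, reverse=True)


# Toy rules for guessing which string is a page title.
def prefers_short(text: str) -> float:
    return 2.0 if len(text) < 60 else 0.5


def prefers_capitalized(text: str) -> float:
    return 1.5 if text[:1].isupper() else 1.0


candidates = [
    "click here to subscribe to our newsletter and get updates every week",
    "Fathom: a framework for understanding web pages",
    "home",
]
ranking = rank(candidates, [prefers_short, prefers_capitalized])
best = ranking[0].candidate
```

Nothing in `rank` knows about web pages: the candidates could be DOM nodes, search results, or log lines, which is the sense in which the approach is not tied to the DOM.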

throwaway2016a · on April 27, 2017

I like this post but I disagree with the premise... the whole "webpages don't implement microformats and RDF so we're going to take away all their control" doesn't settle right with me.

If stores using microformats and semantic markup is important, then give added value to the places that support them and people will start using them. If readability is important start penalizing sites with bad readability indexes. But please don't take over my UX. I can't think of a faster way to stifle innovation.[1]

Try the documentation, much less political https://mozilla.github.io/fathom/intro.html#why

[1] That's an exaggeration. I can think of lots of better ways.

CapacitorSet · on April 28, 2017

>If readability is important start penalizing sites with bad readability indexes.

That's relatively easy to do when you're Google and you develop both a browser and the most used search engine; not so much when you're Mozilla and have no apparent means to apply pressure to websites.

r3bl · on April 27, 2017

I agree with you on the one hand that a browser shouldn't mess with the layout, but it could provide an alternative layout of the content (kind of like reader mode does) at the click of a button, and I'd be more than happy to use it (the same way I got used to reading articles in a single, clean layout).

But, how do you think that a browser could theoretically penalize a website?

throwaway2016a · on April 27, 2017

> But, how do you think that a browser could theoretically penalize a website?

Easier to do on a search engine. But browsers already do this: many browsers call out, or plan to call out, sites that don't use HTTPS.

Or as another example: the warning that happens when you try to access a page that uses a self-signed certificate.

Granted the penalty needs to be proportional to the crime. So maybe the penalty and value add go together. If the value add is a useful feature that everyone wants and demands then not having it becomes a penalty to the sites that don't offer it.

TeMPOraL · on April 28, 2017

> But please don't take over my UX.

The thing is, it's not your UX, it's my UX. The more power I have to easily clean out all the "design" crap websites throw at me, the better my experience and convenience in using the web.

throwaway2016a · on April 28, 2017

> The thing is, it's not your UX, it's my UX. The more power I have to easily clean out all the "design" crap websites throw at me, the better my experience and convenience in using the web.

In my opinion, absolutely not true. If a UX annoys you so much that you don't want to use a site... stop using that site. That is the power you have. Vote with your wallet. You have the power to deprive the site of your business. That's it. Not to remove ads, not to change the UX, just to stop using it.

TeMPOraL · on April 28, 2017

But why? You're sending bytes to my computer, I get to do with them whatever I want. If you want to restrict me from doing that, DRM-ed protocols are the way to go.

throwaway2016a · on April 28, 2017

You are receiving bytes that I pretty clearly own the copyright to. You have an implied limited license to the content. You are not buying the content.

But to be clear, I actually lean towards agreeing with you up to a point. And I think the topic shifted slightly from my original intent.

But it is my opinion that the point where you push it on other people (by bundling it with your browser) is where I draw the line. I'm okay if you, TeMPOraL, do some post-processing of the data, but if Mozilla (a browser) or Comcast (an ISP) does it on your behalf without your say and without the content creator's say, I'm not okay with it.

Especially considering that I am responsible for customer service on my own site.

TeMPOraL · on April 29, 2017

> but if Mozilla (a browser) or Comcast (an ISP) does it on your behalf without your say and without the content creators say.

> Especially considering that I am responsible for customer service on my own site.

That's a reasonable position I haven't considered. Thanks.

hnruss · on April 28, 2017

I looked at mozilla/activity-stream a bit to try to find some examples of fathom usage, but didn't find any. Then I figured out that it doesn't depend directly on fathom-web, it depends on page-metadata-parser (which then depends on fathom-web).

Here's the code in that project which uses Fathom: https://github.com/mozilla/page-metadata-parser/blob/master/...

dlwdlw · on April 28, 2017

Oftentimes, trying to reduce illegibility kills the ecosystem. Central planning, for example, though it made economies easier to understand and control, often destroyed them.

Certain choices cause divergence: they increase your optionality in the future. Other choices cause convergence to one thing. Convergence is desirable if and only if that one thing really is the one true thing.

So neither divergent nor convergent thinking is good in itself unless you apply another filter on how you see the future. You either see messiness as indicative of the future having possibility, or you see the present as a broken world in need of fixing. (Or isolated areas of perfection needing protection and isolation)

untangle · on April 28, 2017

Prev HN Comments:

https://news.ycombinator.com/item?id=12060787

andy_ppp · on April 27, 2017

This looks incredible for starting to understand web content in useful ways. Will definitely give this a try for a product I'm building.

vinceguidry · on April 27, 2017

Would using something like this run you afoul of the CFAA for rules prohibiting web scraping?

throwaway2016a · on April 27, 2017

I haven't read of CFAA and scraping so I had to do some research and it is fascinating. I just read through http://www.sociallyawareblog.com/2014/07/21/data-for-the-tak...

It seems that even though people keep trying to use CFAA for scraping it almost always gets thrown out as long as the source is publicly accessible. Even if the ToS prohibits scraping explicitly.

Seems people have had more luck with just plain old copyright.

But back to this tool in particular... depends entirely on what you are using the data for. Don't use it to gain access to data you wouldn't normally have or give other people access to data they wouldn't normally have and I can't see why CFAA applies.

derefr · on April 28, 2017

Is it scraping if you're doing it on a client device, to present to said client?

acdha · on April 27, 2017

Previous discussion: https://github.com/mozilla/fathom

nebabyte · on April 28, 2017

> That scores within 7% of Readability’s output on a selection of its own test cases

"on a selection of" its own test cases = lol

> Fathom is a data-flow language like Prolog, so data conveniently “turns up” when there are applicable rules that haven’t yet seen it

Why are you trying to explain the concept of declarative programming without just telling people about declarative programming? Just because it's not common in a scene doesn't make it some proprietary concept you've just introduced there.

> The best part is that Fathom rulesets are data

That's actually kinda nice

> In 70 lines,

Insert that comic about 'just a few lines' masking the function calls that actually don't 'replace' a system, just change where work is being done when flexibility is not needed

grincho · on April 28, 2017

You might prefer the more technical introduction at https://mozilla.github.io/fathom/intro.html#specific-areas-w.... It talks about Fathom in terms of declarativeness.

> "on a selection of"

One has to start somewhere! At this early stage, my aim is to demonstrate that Fathom has value for simplifying the implementation of recognizers, not to claim a polished, production-ready Readability alternative.

Though, frankly, getting to the latter would be a fun project for someone: just write more features (in the ML sense), and add more tuning data. There are lots of low-hanging TODOs in the code around https://github.com/mozilla/fathom/blob/master/examples/reada....

TeMPOraL · on April 28, 2017

> Insert that comic about 'just a few lines' masking the function calls that actually don't 'replace' a system, just change where work is being done when flexibility is not needed

Do you have a link for that comic? I don't think I've seen it. The concept you describe is very important though - abstractions like that are usually about moving complexity around (compare Turing tarpit - all Turing-complete languages are in principle equivalent; the difference is in their distribution of complexity along the axis of programs you want to write).