
The whole point of this post is that an error returned from file.Close DOES matter


I think from a Go point of view, the lesson to be drawn from that is "don't defer a function call if you need to check its error value", rather than "defer needs to support checking of function return values".

In the example at hand, it really makes more sense to call Close() as soon as possible after the file is written. It's more of an issue with the underlying OS file API making error checking difficult.

In 99% of cases, the solution to this problem will be to use a WriteFile function that opens, writes and closes the file and does all the error handling for you.


> the lesson to be drawn from that is "don't defer a function call if you need to check its error value"

Isn't the lesson here: If you must have a Close method that might fail in your API, ensure it can safely be called multiple times?

As long as that is true, you can approach it like you would any other API that has resources that might need to be cleaned up.

    f, _ := os.Create(...)
    defer f.Close()
    // perform writes and whatever else
    if err := f.Close(); err != nil {
        // recover from failure
    }
(os.File supports this, as expected)

> the solution to this problem will be to use a WriteFile function

If it were the solution you'd already be using os.WriteFile. It has a time and place, but often it is not suitable, notably because it requires the entire file contents to first be held in memory, which can become problematic.

Certainly you could write a custom WriteFile function that is tuned to your specific requirements, but now you're back to needing to be familiar with the intricacies of a lower-level API in order to facilitate that.


Sure, that's an alternative, although it means there will be some code paths where the error returned by f.Close() becomes the error returned by the entire function and others where it is ignored (though you could easily log it). That might be fine, but you also might want to handle all the cases explicitly and return a combined error in a case where, say, a non-file-related operation fails and then the file also fails to close.


> becomes the error returned by the entire function

If you find the error returned by f.Close to be significant, are you sure returning it again is the right course of action? Most likely you want to do something more meaningful with that state, like retrying the write with an alternate storage device.

Returning the error is giving up, and giving up just because a file didn't close does not make for a very robust system. Not all programs need to be robust, necessarily, but Go is definitely geared towards building systems that are intended to be robust.


It obviously depends on the context. There's no general right or wrong answer to that question.


You seem confused. The article is about writing a file, where it does matter, but the comment example, which is what we're talking about, only reads a file. If close fails after read, who gives a shit? What difference is it going to make? All your read operations have already completed. Close isn't going to trigger a time machine that goes back in time and undoes the reads you've performed. It is entirely inconsequential.


> If close fails after read, who gives a shit?

ulimit -n

You ignore errors on close, and one morning you wake up with your app in CrashLoopBackoff with the final log message "too many files". How do you start debugging this?

Compare the process to the case where you do log errors, and your log is full of "close /mnt/some-terrible-fuse-filesystem/scratch.txt: input/output error". Still baffling of course, but you have some idea where to go next.


To start, you need to figure out why Kubernetes isn't retaining your stack trace/related metadata when the app crashes. That is the most pressing bug. Which is probably best left to the k8s team. You outsourced that aspect of the business for good reason, no doubt.

After they've fixed what they need to fix you need to use the information now being retained to narrow down why your app is crashing at all. Failing to open a file is expected behaviour. It should not be crashing.

Then maybe you can get around to looking at the close issue. But it's the least of your concerns. You've got way bigger problems to tackle first.


The app crashes because "too many files" includes the fd accept(2) wants to allocate so your app can respond to the health check.


A file failing to open is expected, always! accept is no exception here. Your application should not be crashing because of it.

If I recall, Kubernetes performs health checks over HTTP, so presumably your application is using the standard library's HTTP server to provide that? If so, accept is fully abstracted away. So, if that's crashing, that's a bug in Go.

Is that for you to debug, or is it best passed on to the Go team?


There isn't a bug, it's resource exhaustion. You open a bunch of files and they fail to close. You don't log errors on the close, so you have no idea it's happening. Now your app is failing to open new file descriptors to accept HTTP connections. You get a fixed number of fds per app; ulimit -n. If you don't close files you've read, the descriptor is gone.

The bug in this case is in the filesystem that hangs on close. It happens on network filesystems. You can't return the fd to the kernel if your filesystem doesn't let you.


The bug of which we speak is that your app is crashing. Exhausting open file handles is expected behaviour! Expected behaviour should not lead to a crash. Crashing is only for exceptional behaviour.

The filesystem hanging is unlikely to be a bug. The filesystems you'd realistically use in conjunction with Kubernetes are pretty heavily tested. More likely it is supposed to hang under whatever conditions have led to that.

And, sure, maybe you'll eventually want to determine why the filesystem has moved into that failure state, but most pressing is that your app is crashing. All that work you put into gracefully handling the failure situation is going to waste.


You're really hung up on Kubernetes but it was an incidental comment in a hypothetical story.

"You wake up and find out that Heroku's staff is anxiously awaiting your departure from your apartment to tell you that your app is down."


Kubernetes is really neither here nor there. It's the crashing of the app that is our focus. An app should not be crashing on expected behaviour.

That's clearly a bug, and the bug you need to fix first so that you can have your failsafes start working again. You asked where to start and that's the answer, unquestionably.


The app doesn't crash, it's deadlocked. It can't do any more work because to do future work it needs to accept TCP connections. It can't do that because it has hit a resource limit. It hit the resource limit because it didn't correctly close files. It can't close files because of a bug in the filesystem. You don't know this because you didn't log the errors.

I really don't know how I can make my explanation simpler.


> The app doesn't crash

You literally said that it crashes:

   The app crashes because "too many files" includes the fd accept(2) wants to allocate so your app can respond to the health check.
https://news.ycombinator.com/item?id=41505892

> I really don't know how I can make my explanation simpler.

Not making up some elaborate story that you are now trying to say didn't even happen would be a good start. What you are actually trying to communicate is not complicated at all. It didn't need a story. Not sure what you were thinking when you decided fiction writing was a good idea, but I certainly had fun making fun of you for it! So, at least it was not all for naught.


Only if you can safely assume the OS, file system, or std lib cleans up any open file handles that failed to close; I'm 99% sure this is the case in 99% of cases, but there may be edge cases (very specific filesystems or hardware?) where it does matter? I don't know.


You can't safely assume that, but what are you going to do about it when it does fail? There is nothing you can do. There isn't usually a CloseClose function to use when Close fails. If Close fails, that's it. You're otherwise out of luck.

Certainly, in the case of a write failure you'd want to try writing to something else (ideally), notify someone that an operation didn't happen (as a last resort), or something to that effect in order to recover.

But in this case there is no need to try again and nobody really cares. Everything you needed the resources for is already successfully completed. If there is failure when releasing those resources, so what? There is nothing you can do about it.



