I have the impression that the implied conclusion is that, in the situation described, it is better to consult several different LLMs than a single specific one, but that is not what they demonstrate:
To demonstrate this you would have to measure the compute and cost of running the models and of human-verifying their output.
The statistics provided don't at all exclude the possibility that, instead of giving the top 5 models each a single opportunity to propose a solution, it may be more efficient to give all 5 opportunities to the best-scoring model:
At a 24% win rate the null hypothesis (what a researcher ought to predict based on common sense, assuming independent attempts) would be that the probability of a loss is 76%, the probability that it loses N times in a row is (0.76 ^ N), and so the probability of it winning at least once in N attempts is ( 1 - (0.76 ^ N) ).
So for consulting the best-scoring model twice (2x top-1) I would expect 42.24%, which is better than giving the 2 top-scoring models a single try each (1x top-2), as that resulted in 35%.
Same for 3x top-1 vs 1x top-3: 56.10% vs 51%
Same for 4x top-1 vs 1x top-4: 66.64% vs 66%
Same for 5x top-1 vs 1x top-5: 74.64% vs 73%
Same for 6x top-1 vs 1x top-6: 80.73% vs 83%
Same for 7x top-1 vs 1x top-7: 85.35% vs 90%
Same for 8x top-1 vs 1x top-8: 88.87% vs 95%
I can't read the numerical error bars on the top-1 model's win rate; with those we could calculate a likelihood to see whether the deviation is statistically significant.
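The independence baseline above can be reproduced in a few lines. This is only a sketch: it assumes every attempt by the top-1 model is independent with a fixed 24% win rate, the `reported_top_n` values are the 1x top-N percentages quoted above, and the function name is made up for illustration.

```python
def win_within_n(p_single: float, n: int) -> float:
    """Probability of at least one win in n independent attempts."""
    return 1.0 - (1.0 - p_single) ** n

# 1x top-N win rates (percent) as quoted above, for N = 2..8.
reported_top_n = {2: 35, 3: 51, 4: 66, 5: 73, 6: 83, 7: 90, 8: 95}

for n, reported in reported_top_n.items():
    iid = 100.0 * win_within_n(0.24, n)
    print(f"N={n}: {n}x top-1 (iid) = {iid:.2f}%  vs  1x top-{n} = {reported}%")
```

Under that assumption the repeated-attempt baseline is ahead up to N = 5 and behind from N = 6 onward.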
This post measures `1x top-N` (one attempt each from N models), not `Nx top-1` (N attempts from the best-scoring model). We should make that clearer.
Part of why we chose `1x top-N` is that we expect lower error correlation compared to `Nx top-1`, which is also why the iid baseline is likely optimistic.
That said, a direct comparison (`Nx top-1` vs `1x top-N`, with the same review/compute budget) would be useful!
Not exactly; not at all, even, in terms of the way LLMs are trained.
In RL it can happen that you stop getting meaningful data because you are 'too good': you no longer receive the "this is a bad answer" signal, so you can't estimate the gradient.
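A toy illustration of that vanishing signal, as a sketch only. It assumes a policy-gradient setup where each sampled answer is weighted by its reward minus the batch mean (a common baseline choice, not something stated above); the function name and reward values are invented for the example.

```python
def advantages(rewards: list[float]) -> list[float]:
    """Reward minus the batch mean (assumed baseline for this sketch)."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

mixed = advantages([1.0, 0.0, 1.0, 0.0])      # some answers still fail
saturated = advantages([1.0, 1.0, 1.0, 1.0])  # every answer succeeds

print(mixed)      # nonzero weights: there is a direction to push in
print(saturated)  # all zeros: nothing left to estimate a gradient from
```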
What causes LiDAR to fail harder than normal cameras in bad weather conditions? I understand that normal LiDAR algorithms assume a direct path from light source to object to camera pixel, while mist will scatter part of the light, but it would seem like this can be addressed in the pixel depth estimation algorithm that combines the complex amplitudes at the different LiDAR frequencies.
I understand that small lens sizes mean that falling droplets can obstruct the view behind the droplet, while larger lens sizes can more easily see beyond the droplet.
I seldom see discussion of the exact failure modes for specific weather conditions. Even if larger lenses are selected the light source should use similar lens dimensions. Independent modulation of multiple light sources could also dramatically increase the gained information from each single LiDAR sensor.
Do self-driving camera systems (conventional and LiDAR) use variable or fixed tilt lenses? Normal camera systems have the focal plane perpendicular to the viewing direction, but for roads it might be more interesting to have a large swath of the horizontal road in focus. At least having 1 front facing camera with a horizontal road in focus may prove highly beneficial.
To a certain extent an FSD system predicts the best course of action. When different courses of action have similar logits of expected fitness for the next best course of action, we can speak of doubt. With RMAD (reverse-mode automatic differentiation) we can figure out which features, which facets of the input, or which part of the view is causing the doubt.
A camera has motion blur (unless you can strobe the illumination source, but in daytime the sun is very hard to outshine). It would seem like an interesting experiment to:
1. identify in real time which doubts have the most significant influence on the determination of best course of action
2. have a camera that can track an object to eliminate motion blur but still enjoy optimal lighting (under the sun, or at night), just like our eyes can rotate
3. rerun the best course of action prediction and feed back this information to the company, so it can figure out the cost-benefit of adding a free tracking camera dedicated to eliminating doubts caused by motion blur.
You reference an ICE article "currently" on the front page. I think this comment would benefit from an explicit link to that discussion, since the front page is ephemeral and I can't be sure I would find the right one.
> Space is a vacuum. i.e. The lack-of-a-thing that makes a thermos great at keeping your drink hot.
1) The heat can be transported by a heat carrier conducting heat standing still.
2) The heat can be transported by a heat carrier in motion.
3) The heat can be transported by thermal radiation.
The first 2 require massive particles; the third is carried by spontaneously emitted photons.
A thermos bottle does not simply work by eliminating the mobile massive particles.
Let's consider room temperature as the outer thermos temperature and boiling hot water as the inner temperature, that is roughly 300 K and 400 K.
Thermal radiation is proportional to the fourth power of temperature and proportional to emissivity (which is between 0 and 1).
Let's pretend you are correct, so that thermally blackened glass (emissivity 1) inside the vacuum flask would be fine according to you. That would mean that the radiation from your tea to the room-temperature side would be proportional to 400^4, while the thermal radiation from room temperature to the tea would be proportional to 300^4. Since (400/300) ^ 4 = 3.16, the heat transport from hot tea to room temperature is about 3 times higher than the transport back.
If on the other hand the glass was aluminized before being evacuated, then in the ideal limit of zero emissivity the heat transports are proportional to 0 * (400 K) ^ 4 and 0 * (300 K) ^ 4. So the heat transport in either direction would be 0, and no net heat transport remains.
If you believe the shiny inside of your thermos flask is an aesthetic gimmick, think again.
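To put rough numbers on the comparison, here is a sketch using the same simplified model as above: each wall radiates emissivity * sigma * T^4 and both walls share one emissivity, ignoring reflections between the walls. The emissivity of 0.03 for an aluminized surface is an illustrative assumption, and the function names are made up.

```python
SIGMA = 5.67e-8  # Stefan-Boltzmann constant, W / (m^2 K^4)

def emitted_flux(emissivity: float, temp_k: float) -> float:
    """Radiated power per unit area of a grey surface, in W/m^2."""
    return emissivity * SIGMA * temp_k ** 4

def net_flux(emissivity: float, hot_k: float, cold_k: float) -> float:
    """Net flux from hot to cold wall in the simplified one-emissivity model."""
    return emitted_flux(emissivity, hot_k) - emitted_flux(emissivity, cold_k)

ratio = emitted_flux(1.0, 400.0) / emitted_flux(1.0, 300.0)
print(f"hot/cold emission ratio: {ratio:.2f}")
print(f"blackened walls: {net_flux(1.0, 400.0, 300.0):.0f} W/m^2")
# 0.03 is an assumed, illustrative emissivity for an aluminized surface.
print(f"aluminized walls: {net_flux(0.03, 400.0, 300.0):.0f} W/m^2")
```

In this model the net loss scales linearly with emissivity, which is the whole point of the shiny coating.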
You are making a non-comparison.
Imagine comparing a diesel engine car to an electric car, but first removing the electric motor. Does that make a fair comparison???
I literally quote a person attributing the thermal insulation capabilities of vacuum flasks mono-causally to the lack of gas. I didn't imagine this, it's right there to verify. Reminding people of laws of nature that have been known for 150 years and have withstood the test of time of investigation by physicists isn't "dunking on" anything, just reminding people of how the universe works.
The original vacuum flasks were made of glass, and a lot of laboratory grade vacuum flasks and Dewars are still made of glass. The consumer level Thermoses eventually switched to stainless steel.
Let's assume an electrical consumption of 1 MW, which turns into heat, and a concomitant 3 MW of heat that is a byproduct of acquiring that 1 MW of electrical energy.
So the total heat load is 4 MW (of which 1 MW was temporarily electrical energy before it was used by the datacenter or whatever).
Let's assume a single planar radiator, with emissivity ~1 over the thermal infrared range.
Let's assume the target temperature of the radiator is 300 K (~27 deg C).
What size radiator do you need?
4 MW / (5.67 * 10 ^ -8 W / ( m ^2 K ^4 ) * (300 K) ^4) = 8710 m ^2 = (93 m) ^2
So basically 100m x 100m. That's not insanely large.
The solar panels would have to be about 3000 m ^2 = 55m x 55m
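The estimate can be checked with a short script. This is a sketch under the assumptions stated above (4 MW total, emissivity 1, 300 K, radiating from one face). The panel figure additionally assumes the panels intercept the full 4 MW at a solar constant of about 1361 W/m^2, which is my reading of where the 3000 m^2 comes from, not something stated in the comment.

```python
import math

SIGMA = 5.67e-8          # Stefan-Boltzmann constant, W / (m^2 K^4)
SOLAR_CONSTANT = 1361.0  # W/m^2 near Earth (assumption, not stated above)

def radiator_area(heat_w: float, temp_k: float, emissivity: float = 1.0) -> float:
    """Area in m^2 needed to radiate heat_w from one face at temp_k."""
    return heat_w / (emissivity * SIGMA * temp_k ** 4)

area = radiator_area(4e6, 300.0)
print(f"radiator: {area:.0f} m^2, side about {math.sqrt(area):.0f} m")

# Assumes the panels must intercept the whole 4 MW of sunlight.
panel_area = 4e6 / SOLAR_CONSTANT
print(f"panels: {panel_area:.0f} m^2, side about {math.sqrt(panel_area):.0f} m")
```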
The radiator could be aluminum foil, and something amounting to a remote controlled toy car could drive around with a small roll of aluminum wire and locally weld shut small holes due to micrometeorites. The wheels are rubberized but have a magnetic rim, and on the outside there are complementary steel spheres, so the radiator foil is sandwiched between wheel and steel sphere. Then the wheels have traction. The radiator could easily weigh less than the solar panels, and expand to much larger areas. Better to divide the entire radiator up into a few inflatable surfaces, so that you can activate a spare while a severe leak is being fixed.
It may be more elegant to have rovers on both inside and outside of the radiator: the inner one can drop a heat resistant silicone rubber disc / sheet over the hole, while the outside rover could do the welding of the hole without obstruction of the hole by a stopgap measure.
As I've pointed out to you elsewhere -- how do you couple the 4MW of heat to the aluminum foil? You need to spread the power somewhat evenly over this massive surface area.
Low pressure gas doesn't convect heat well and heat doesn't conduct down the foil well.
It's just like how on Earth we can't cool datacenters by hoping that free convection will transfer heat to the outer walls.
Let's assume you truly believe the difficulty is the heat transport. Then you correct me, but I never see you correct people who believe the thermal radiation step is the issue. It's a very selective form of correcting.
Let's assume you truly believe the difficulty is the heat transport to the radiator: how is it solved on Earth?
> Lets assume you truly believe the difficulty is the heat transport, then you correct me, but I never see you correct people who believe the thermal radiation step is the issue
It's both. You have to spread a lot of heat very evenly over a very large surface area. This makes a big, high-mass structure.
> how is it solved on earth?
We pump fluids (including air) around to move large amounts of heat both on Earth and in space. The problem is, in space, you need to pump them much further and cover larger areas, because the only way the heat leaves the system is radiation. As a result, you end up proposing a system that is larger than the cooling tower for many nuclear power plants on Earth to move 1/5th of the energy.
The problem is, pumping fluids around in space has 3 ways it sucks compared to Earth:
1. Managing fluids in space is a pain.
2. We have to pump fluids much longer distances to cover the large area of radiators. So the systems tend to get orders of magnitude physically larger. In practice, this means we need to pump a lot more fluid, too, to keep a larger thing close to isothermal.
3. The mass of fluids and all their hardware matters more in space. Even if launch gets cheaper, this will still be true compared to Earth.
I explained this all to you 15 hours ago:
> If this wasn't a concern, you could fly a big inflated-and-then-rigidized structure and getting lots of area wouldn't be scary. But since you need to think about circulating fluids and actively conducting heat this is much less pleasant.
You may notice that the areas, etc, we come up with here to reject 70kW are similar to those of the ISS's EATCS, which rejects 70kW using white-colored radiators and ammonia loops. Despite the use of a lot of exotic and expensive techniques to reduce mass, the radiators mass about 10 tonnes-- and this doesn't count all the hardware to drive heat to them on the other end.
So, to reject 105W on Earth, I spend about 500g of mass; if I'm as efficient as EATCS, it would be about 15000g of mass.
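The 15000 g figure follows from scaling the EATCS numbers linearly. A minimal sketch, assuming radiator mass scales in proportion to rejected heat (the linearity and the function name are assumptions of this example):

```python
EATCS_HEAT_W = 70_000.0       # heat rejected, as quoted above
EATCS_RADIATOR_KG = 10_000.0  # about 10 tonnes of radiators, as quoted above

def radiator_mass_kg(heat_w: float) -> float:
    """Radiator mass for heat_w, scaled linearly from the EATCS figures."""
    return heat_w * EATCS_RADIATOR_KG / EATCS_HEAT_W

space_g = 1000.0 * radiator_mass_kg(105.0)
print(f"105 W in space: about {space_g:.0f} g, vs about 500 g on Earth")
```

That is a factor of 30 in mass before counting the hardware that drives heat to the radiators.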
By saying that something is impossible to do cost-effectively, one is implicitly claiming to have rigorously combed through the whole problem space, all possible configurations and materials, and exhaustively concluded it is not possible cost-effectively.
Imagine now instead of a pyramid, a cone. Imagine the cone is spinning along its symmetry axis. One now has a local radial pseudoforce, a fake gravitational force along the radial direction (away from the symmetry axis).
Now any fluid with a liquid-gas phase transition above the desired radiator temperature but below the intended maximum compute operating temperature (and there is a lot of room for play in the fluid choice, because the pressure is a free parameter) can be chosen to operate in heat-pipe fashion. Suppose you place the compute at a certain point along the outer rim of the cone; fluid that condenses on the cone wall will flow to the circular rim at the base. The compute is inside a kind of "chimney", and the lower half of the chimney (and the compute in it) is submerged in the fluid. The fluid boils and vaporizes, rises up the chimney, is piped to the central axis, and flows out in a controlled, distributed fashion. All of the pipes could be floppy aluminum foil (or mylar etc.) pipes, since they are all pressurized during normal operation.
Some of the liquid phase could be pumped up to the central axis at the base and cool the rear side of the solar panels as well. I don't see the problem. The power density of solar panel heating (and thus the power density on the cone surface) is very similar and perfectly manageable with phase-transition cooling / condensing.
At some point you are just prodding until people hand you working designs on a silver platter.
Then you picked the wrong thread to insert yourself, it's literally about that.
Which is funny, there are multiple other replies to you, explaining at length that while your ideas are physically possible, they are completely impractical. And yet you think they still could be "minor".