☆ Yσɠƚԋσʂ ☆
  • 1.42K Posts
  • 1.12K Comments
Joined 6Y ago
Cake day: Jan 18, 2020


Technology such as LLMs is just automation, and that’s what the base is; how it is applied within a society is what’s dictated by the superstructure. Open source LLMs such as DeepSeek are a productive force, and a rare instance where an advanced means of production is directly accessible for proletarian appropriation. It’s a classic base-level conflict over the relations of production.



Nah, I don’t think I’m going to take as gospel what a CIA asset says.

Instead, go read Marx to understand the relationship between the technology and the social relations that dictate its use within a society.


Elections are just the surface of the problem. The real issue is who owns the factories and funds the research. In the West that’s largely done by private capital, putting it entirely outside the sphere of public debate. Even universities are heavily reliant on funding from companies now, which obviously influences what their programs focus on.






or maybe it’s the capitalist relations and not the technology that’s the actual problem here





Right, I think the key difference is that we have a feedback loop and we’re able to adjust our internal model dynamically based on it. I expect that embodiment and robotics will be the path towards general intelligence. Once you stick the model in a body and it has to deal with the environment, and learn through experience, then it will start creating a representation of the world based on that.


It seemed pretty clear to me. If you have any clue on the subject then you presumably know about the interconnect bottleneck in traditional large models. The data moving between layers often consumes more energy and time than the actual compute operations, and the surface area for data communication explodes as models grow to billions of parameters. The mHC paper introduces a new way to link neural pathways by constraining hyper-connections to a low-dimensional manifold.

In a standard transformer architecture, every neuron in layer N potentially connects to every neuron in layer N+1. This is mathematically exhaustive, making it computationally inefficient. Manifold-constrained connections operate on the premise that most of this high-dimensional space is noise. DeepSeek basically found a way to significantly reduce the networking bandwidth for a model by using manifolds to route communication.
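To make the bandwidth point concrete, here’s a toy sketch of the general idea of constraining connections to a low-dimensional subspace. This is not DeepSeek’s actual routing scheme, and the shapes and names are made up; it just shows why shipping a k-dimensional projection across the interconnect instead of the full activation cuts the data moved:

```python
import numpy as np

# Illustrative only: compare the data that has to cross a link when a full d×d
# connection is replaced by a rank-k bottleneck (one simple "low-dimensional" constraint).
d, k, batch = 4096, 64, 32
rng = np.random.default_rng(0)

x = rng.standard_normal((batch, d)).astype(np.float32)

# Dense case: a full d-dimensional activation per token has to move between layers.
W_full = rng.standard_normal((d, d)).astype(np.float32)
y_dense = x @ W_full

# Low-dimensional case: project down to k dims locally, move the small tensor,
# expand back up on the other side.
A = rng.standard_normal((d, k)).astype(np.float32)   # "down" projection (stays local)
B = rng.standard_normal((k, d)).astype(np.float32)   # "up" projection (on the far side)
z = x @ A                     # only batch×k values need to cross the interconnect
y_lowrank = z @ B

print("bytes moved, dense:   ", y_dense.nbytes)   # batch*d*4
print("bytes moved, low-rank:", z.nbytes)         # batch*k*4, about 64x less here
```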

Not really sure what you think the made up nonsense is. 🤷



I’m personally against copyright as a concept and absolutely don’t care about this aspect, especially when it comes to open models. The way I look at it is that the model is unlocking this content and making this knowledge available to humanity.




DeepSeek team just published a paper on Manifold-Constrained Hyper-Connections. It addresses a pretty specific bottleneck we are seeing with recent attempts to scale residual streams. The core issue they are tackling is that while widening the residual stream (Hyper-Connections or HC) gives you better performance by adding more information capacity, it usually breaks the identity mapping property that makes ResNets and Transformers trainable in the first place. When you just let those connection matrices learn freely, your signal magnitudes go haywire during deep network training which leads to exploding gradients. Their solution is actually quite elegant. They force the learnable matrices to live on a specific manifold, specifically the Birkhoff polytope. Practically, this means they use the Sinkhorn-Knopp algorithm to ensure the connection matrices are "doubly stochastic," meaning all rows and columns sum to 1. This is clever because it turns the signal propagation into a weighted average rather than an unbounded linear transformation. That preserves the signal mean and keeps the gradient norms stable even in very deep networks. What I found most interesting though was the engineering side. Usually, these multi-stream ideas die because of memory bandwidth rather than FLOPs. Expanding the width by times typically creates a massive I/O bottleneck. They managed to get around this with some heavy kernel fusion and a modified pipeline schedule they call DualPipe to overlap communication. The results look solid. They trained a 27B model and showed that mHC matches the stability of standard baselines while keeping the performance gains of the wider connections. It only added about 6.7% time overhead compared to a standard baseline, which is a decent trade-off for the gains they are seeing in reasoning tasks like GSM8K and math. It basically makes the "wider residual stream" idea practical for actual large-scale pre-training. Expanding the residual stream adds more pathways for information to flow which helps with training on constrained hardware by decoupling the model's capacity from its computational cost. Usually if you want a model to be "smarter" or maintain more state depth, you have to increase the hidden dimension size which makes your Attention and Feed-Forward layers quadratically more expensive to run. The mHC approach lets you widen that information highway without touching the expensive compute layers. The extra connections they add are just simple linear mappings which are computationally negligible compared to the heavy matrix multiplications in the rest of the network. They further combined this technique with a Mixture-of-Experts (MoE) architecture, which is the component that actually reduces the number of active parameters during any single forward pass. The mHC method ensures that even with that sparsity, the signal remains stable and creates a mathematically sound path for gradients to flow without exploding VRAM usage. The intermediate states of those extra streams are now discarded during training and get computed on the fly during the backward pass. This allows you to train a model that behaves like a much larger dense network while fitting into the memory constraints of cheaper hardware clusters.
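For anyone who hasn’t run into Sinkhorn-Knopp before, here’s a minimal sketch of the projection step, just to show why the doubly stochastic constraint turns stream mixing into a weighted average. Toy shapes and names, not the paper’s fused kernels:

```python
import numpy as np

def sinkhorn_knopp(M, n_iters=20, eps=1e-8):
    """Push a matrix toward the Birkhoff polytope (doubly stochastic:
    every row and column sums to 1) by alternating row/column normalization."""
    P = np.exp(M)  # make all entries strictly positive first
    for _ in range(n_iters):
        P /= P.sum(axis=1, keepdims=True) + eps  # normalize rows
        P /= P.sum(axis=0, keepdims=True) + eps  # normalize columns
    return P

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))        # unconstrained "learnable" mixing weights
P = sinkhorn_knopp(W)                  # now (approximately) doubly stochastic

streams = rng.standard_normal((4, 8))  # 4 residual streams, toy hidden dim of 8
mixed = P @ streams                    # propagation becomes a weighted average of streams

print(P.sum(axis=0))  # columns ≈ 1
print(P.sum(axis=1))  # rows ≈ 1, so the signal mean is preserved and magnitudes stay bounded
```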




Device-optimized quant variants of Qwen3-30B-A3B-Instruct-2507 without output quality falling off a cliff. The 30B model runs on a Raspberry Pi 5 (16GB) at 8.03 TPS at 2.70 BPW, while retaining 94.18% of BF16 quality. ShapeLearn tends to find better TPS/quality tradeoffs versus alternatives.

What’s new/interesting in this one:

1) CPU behavior is mostly sane. On CPUs, once you’re past “it fits,” smaller tends to be faster in a fairly monotonic way. The tradeoff curve behaves like you’d expect.

2) GPU behavior is quirky. On GPUs, performance depends as much on kernel choice as on memory footprint, so you often get sweet spots (especially around ~4-bit) where the kernels are on the “golden path,” and pushing to lower bit widths can get weird.

Models: https://huggingface.co/byteshape/Qwen3-30B-A3B-Instruct-2507-GGUF
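As a sanity check on the Raspberry Pi claim, here’s the back-of-the-envelope arithmetic (my own, not from the model card): at 2.70 bits per weight, a ~30B-parameter model’s weights come to roughly 10 GB, which is why it fits on a 16 GB board while BF16 would need about 60 GB.

```python
# Rough weight footprint at a given bits-per-weight (BPW); ignores KV cache and runtime overhead.
params = 30e9           # ~30B parameters
bpw = 2.70              # quant level quoted above
weight_bytes = params * bpw / 8
print(f"{weight_bytes / 1e9:.1f} GB")   # ≈ 10.1 GB -> fits in a 16 GB Pi 5
print(f"{params * 16 / 8 / 1e9:.0f} GB")  # BF16 baseline ≈ 60 GB, far beyond the Pi
```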



This paper asks whether LLMs can estimate the probability of their own success before they start solving a task, and whether these estimates become more accurate as the work progresses. Turns out this is a separate ability, and a poorly developed one. The authors test it across three different scenarios, ranging from single-step problems to multi-step agentic processes.

First, they use BigCodeBench, a set of 1,140 single-step Python tasks. For each task, the model is asked in advance to state the probability that it will succeed, and only then does it actually attempt to solve the task. This allows a direct comparison between confidence and real performance. The result is consistent across all models: all of them are systematically overconfident. Predicted success probabilities are consistently higher than actual success rates. Importantly, increasing model capability does not guarantee better self-calibration. For the GPT and LLaMA families, calibration does not meaningfully improve. Within the Claude family there is some reduction in overconfidence, but it never disappears. On average, models can distinguish easier tasks from harder ones better than chance. In other words, they have some sense of relative difficulty, but their absolute confidence remains inflated.

The second experiment introduces a more realistic setting: contracts with risk. The model receives a sequence of nine tasks. Each success earns +1, each failure costs −1. Before each task, the model must decide whether to accept or decline the contract, based on its predicted probability of success. The tasks are chosen so that success probability is roughly 50/50, so blindly accepting everything does not yield an advantage. Here the core issue becomes clear. Even after a series of failures, models continue to believe that the next task will succeed. Their subjective probability of success stays above 0.5, despite the evidence. Some models (notably Claude Sonnet and GPT-4.5) do end up earning more, but not because they become better at judging which tasks they can solve. Instead, they simply accept fewer tasks overall, becoming more risk-averse. Their gains come from declining more often, not from better self-assessment. The authors also check whether the models’ decisions are rational given their own stated probabilities, and they largely are. The problem is not decision-making; it is that the probabilities themselves are too optimistic.

The third experiment is the most relevant for agentic systems. Using SWE-Bench Verified, the authors evaluate real multi-step tasks involving tools. Models are given budgets of up to 70 steps. After each step, the model is asked to estimate the probability that it will ultimately complete the task successfully. For most models, overconfidence does not decrease, and for some it actually increases as the task unfolds. Claude Sonnet shows this particularly clearly: confidence rises during execution even when final success does not become more likely. Among all tested models, only GPT-4o shows a noticeable reduction in overconfidence over time. Notably, so-called reasoning models do not show an advantage in self-assessment. The ability to reason for longer does not translate into the ability to accurately judge one’s chances of success.

The overall conclusion of the paper is blunt: LLMs are already fairly good at solving tasks, but still poor at understanding the limits of their own capabilities. They can act, but they cannot reliably tell when they are likely to fail.
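A quick toy simulation of the contract setup (my own sketch, not the paper’s code) shows why inflated probabilities are the whole problem: under the ±1 payoff, the rational rule is to accept only when your believed success probability exceeds 0.5, so an overconfident agent accepts everything even when the true odds give it no edge.

```python
import random

def expected_value(p_success: float) -> float:
    # Contract payoff from the paper's setup: +1 for success, -1 for failure.
    return p_success * 1 + (1 - p_success) * (-1)

def run_contracts(true_p: float, stated_p: float, n_tasks: int = 9, seed: int = 0) -> int:
    """Accept a contract whenever the agent's *stated* probability gives positive
    expected value; outcomes are drawn from the *true* probability."""
    rng = random.Random(seed)
    score = 0
    for _ in range(n_tasks):
        if expected_value(stated_p) > 0:      # rational given its own belief
            score += 1 if rng.random() < true_p else -1
    return score

# A calibrated agent (stated == true == 0.5) declines everything and scores 0.
# An overconfident one (stated 0.8, true 0.5) accepts all nine contracts even
# though its true expected payoff per contract is zero.
print(run_contracts(true_p=0.5, stated_p=0.5))
print(run_contracts(true_p=0.5, stated_p=0.8))
```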

This paper basically shows that treating the prompt as an external variable is a surprisingly effective way to handle massive contexts. The authors argue that instead of shoving ten million tokens directly into the model and hoping for the best, we should put the text into a Python REPL environment where the model can interact with it programmatically. This setup allows the LLM to write code that slices the text into manageable chunks and recursively calls new instances of itself to process those pieces individually. It is essentially the same logic as out-of-core algorithms, which process datasets far larger than the available memory by fetching only what is needed at any given moment.

One of the most interesting parts of the study is how it exposes the reality of context rot in frontier models like GPT-5. The results show that while base models handle simple needle-in-a-haystack tasks just fine, they fall apart completely on information-dense tasks that require aggregating data across the entire input. For example, on the OOLONG-Pairs benchmark, which has quadratic complexity, the base GPT-5 model scores less than 0.1 percent accuracy once the context gets long enough. Meanwhile, the recursive language model manages to hold steady even up to a million tokens and achieves a 58% score on that same difficult task.

Turns out that for retrieval tasks like CodeQA, simply having the REPL to grep through files was enough to beat the base model, because the model could filter data before reading it. Having the recursive capability turned out to be essential for reasoning tasks like OOLONG where the model needs to process every line. The version of the system that could not make subcalls performed significantly worse because it could not offload the thinking process to fresh contexts and prevent its own window from getting polluted. Since the model writes code to filter the text using tools like regex before it actually reads anything, it processes fewer tokens on average than a summary agent that is forced to read everything to compress it. The only downside is that the variance can be pretty wild, since the model sometimes gets stuck in a loop or decides to verify its own answer multiple times in a row, which blows up the compute cost for that specific run.

We are clearly seeing a shift where inference time compute and smart context management are becoming more important than just having a massive raw context window. The fact that this method beats retrieval-based agents on deep research tasks suggests that giving the model a loop to think and code is the future for tasks that need a large persistent context.
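The recursive pattern itself is simple enough to sketch in a few lines. This is my own minimal illustration, not the paper’s implementation; `call_llm` is a hypothetical function standing in for whatever model endpoint you’re using:

```python
from typing import Callable

def recursive_answer(question: str,
                     document: str,
                     call_llm: Callable[[str], str],
                     chunk_chars: int = 50_000,
                     max_depth: int = 3) -> str:
    """Answer a question over a document far larger than the context window
    by splitting it and delegating each piece to a fresh model call."""
    if len(document) <= chunk_chars or max_depth == 0:
        return call_llm(f"Document:\n{document}\n\nQuestion: {question}")

    # Split into manageable chunks and recurse: each sub-call gets a fresh,
    # unpolluted context, like an out-of-core algorithm fetching pages on demand.
    chunks = [document[i:i + chunk_chars] for i in range(0, len(document), chunk_chars)]
    partials = [recursive_answer(question, c, call_llm, chunk_chars, max_depth - 1)
                for c in chunks]

    # Aggregate the partial findings in one final call.
    combined = "\n---\n".join(partials)
    return call_llm(f"Partial findings:\n{combined}\n\nCombine these to answer: {question}")
```

In the actual system the model also writes its own filtering code (grep, regex, slicing) inside the REPL before reading anything, which is where most of the token savings come from.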















It’s amazing how people just can’t learn the lesson that the problem isn’t that a particular oligarch owns a public forum, but that public forums are privately owned in the first place.









https://www.modelscope.cn/models/Qwen/Qwen-Image-Edit-2511 https://huggingface.co/lightx2v/Qwen-Image-Edit-2511-Lightning https://huggingface.co/unsloth/Qwen-Image-Edit-2511-GGUF

Open sourcing these things would definitely be the right way to go, and you’re absolutely right that it’s a general solver that would be useful in any scenario where you have a system that requires dynamic allocation.



Yeah for sure, I do think it’s only a matter of time before people figure out a new substrate. It’s really just a matter of allocating time and resources to the task, and that’s where state level planning comes in.





lmfao imagine trotting out MBFC like it means anything, you terminally online lib 🤣


It’s like saying silicon chips being orders of magnitude faster than vacuum tubes sounds too good to be true. A different substrate will have fundamentally different properties from silicon.



What I find most unfortunate is that these scam companies convinced people that you can make AI speech detectors in the first place. Like the reason LLMs structure text in a certain way is because these are the patterns in human text that they’ve been trained on.


Yeah, that would work too, assuming the disk was made out of a sufficiently hard material that won’t degrade over time.


Yeah, I don’t think billions of years is really a meaningful metric here. It’s more that it’s a stable medium where we could record things that will persist for an indefinite amount of time without degradation.


I mean, you can always make new hardware. The idea of media that basically lasts forever is really useful in my opinion. We currently don’t have anything that would last as long as regular paper. Most of the information we have is stored on volatile media. Using something like this to permanently record accumulated knowledge like scientific papers, technology blueprints, and so on, would be a very good idea in my opinion.


Incidentally, manual moderation is much easier to do on a federated network where no individual instance grows huge. Some people complain that Lemmy isn’t growing to the size of Reddit, but I see that as a feature myself. Smaller communities tend to be far more interesting and are much easier to moderate than giant sites.


It’s the logical end point of a particular philosophy of the internet where cyberspace is treated as a frontier with minimal oversight. History offers a pretty clear pattern here with any ungoverned commons eventually getting overrun by bad actors. These spam bots and trolls are a result of the selection pressures that are inherent in such environments.

The libertarian cyber-utopian dream assumed that perfect freedom would lead to perfect discourse. What it ignored was that anonymity doesn’t just liberate the noble dissident. It also liberates the grifter, the propagandist, and every other form of toxicity. What you get in the end is a marketplace of attention-grabbing performances and adversarial manipulation. And that problem is now supercharged by scale and automation. The chaos of 4chan or the bot-filled replies on Reddit are the inevitable ecosystem that grows in the nutrient-rich petri dish of total laissez-faire.

We can now directly contrast the Western approach with the Chinese model that the West has vilified and refused to engage with seriously. While the Dark Forest theory predicts a frantic retreat to private bunkers, China built an accountable town square from the outset. They created a system where the economic and legal incentives align towards maintaining order. The result is a network where the primary social spaces are far less susceptible to the botpocalypse and the existential distrust the article describes.

I’m sure people will immediately scream about censorship and control, and that’s a valid debate. But viewed purely through the lens of the problem outlined in the article, which is the degradation of public digital space into an uninhabitable Dark Forest, the Chinese approach is simply pragmatic urban planning. The West chose to build a digital world with no regulations and no building codes, run by corporate landlords. Now people are acting surprised that it’s filled with trash, scams, and bots, and the only thing left to do is for everyone to hide in their own private clubs. China’s model suggests that perhaps you can have a functional public square if you establish basic rules of conduct. It’s not a perfect model, but it solved the core problem of the forest growing dark.



Nobody is talking about defying laws of physics here. Your whole premise rests on fossil fuels running out and being essential for energy production. This is simply false.




Again, I’m explaining to you that society is a conscious and intentional construct that we make. The USSR could have made changes the way China did and moved in a different direction. As your own chart shows, there was no shortage of energy, since output rebounded. The problems were political and rooted in the way the economy was structured.


Carbon footprint shows how much energy is being used per capita. Population density is way past the point where it’s practical for people to live off the land in some subsistence scenario. What is more likely to happen is that we’ll see things like indoor farming being developed so that cities can feed themselves. This will become particularly important as the climate continues to deteriorate, since indoor farms will make it possible to have a stable environment to grow food in.


Having grown up in the USSR, I know there was in fact a huge difference. The economy wasn’t structured around consumption, and goods were built to last. People weren’t spending their time constantly shopping and consuming things. The idea that the USSR was destined to collapse is also pure nonsense. There were plenty of different ways it could’ve developed, and it certainly didn’t collapse because it was running out of energy.


The point is that capitalist relations are absolutely the problem here. Social systems do not have to be built around consumption. You’re also talking about natural systems that evolve based on selection pressures as opposed to systems we design consciously.


First of all, carbon footprint in China is already far lower than in any developed country. Second, as I already pointed out, most countries simply outsourced their production to China.




That’s just saying that China is one of the most populous countries in the world that also happens to be a global manufacturing hub. China still uses fossil fuels, but I think it’s fair to call it an electrostate at this point.

Finally, it’s also worth noting that China has a concrete plan for becoming carbon neutral, which it’s already ahead of.


The fact of the matter is that air is an incredibly inefficient thermal conductor, so data centers have to burn a massive amount of extra electricity just to run powerful fans and chillers to force that heat away. That extra energy consumption means an air-cooled facility is responsible for generating significantly more total heat for the planet than a liquid-cooled one.

When you put servers in the ocean you utilize the natural thermal conductivity of water, which is about 24 times higher than that of air, and that allows you to strip out the active cooling infrastructure entirely. You end up with a system that puts far less total energy into the environment because you aren’t wasting power fighting thermodynamics. Even though the ocean holds that heat longer, the volume of water is so vast that the local temperature impact dissipates to nothing within a few meters of the vessel.


Yes, it is a fallacy, because the problem is with the economic system as opposed to a specific technology. The liberal tendency often defaults to a form of procedural opposition such as voting against, boycotting, or attempting to regulate a problem out of existence without seizing the means to effect meaningful change. It’s an idealist mindset that mistakes symbolic resistance for tangible action. Capitalism is a system based around consumption, and it will continue to use up resources at an accelerating rate regardless of what specific technology is driving the consumption.


The fallacy here is the assumption that if LLMs didn’t exist then we wouldn’t find other ways to use that power.