☆ Yσɠƚԋσʂ ☆
  • 1.44K Posts
  • 1.14K Comments
Joined 6Y ago
Cake day: Jan 18, 2020


What’s even funnier is that Meta literally spent millions on each one of them.


It’s a paper about an open source model, and it describes a new algorithm which essentially builds privacy into the model as part of training. Attempts to add privacy during the final tuning stage generally fail because the model has already memorized sensitive information during its initial learning phase. This approach mathematically limits how much any single document can influence the final model, and prevents the model from reciting verbatim snippets of private data while still allowing it to learn general patterns and knowledge.
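To make the “limit any single document’s influence” idea concrete, here’s a rough sketch in the style of DP-SGD: clip each example’s gradient and add noise before the update. This is an illustration of the general technique, not the paper’s actual algorithm, and all names and shapes here are made up.

```typescript
// Sketch only: per-example gradient clipping plus Gaussian noise (DP-SGD style).

function l2Norm(v: number[]): number {
  return Math.sqrt(v.reduce((s, x) => s + x * x, 0));
}

// Clip one example's gradient so no single document can dominate the update.
function clip(grad: number[], maxNorm: number): number[] {
  const scale = Math.min(1, maxNorm / (l2Norm(grad) + 1e-12));
  return grad.map((g) => g * scale);
}

// Gaussian noise via the Box-Muller transform.
function gaussianNoise(sigma: number): number {
  const u = Math.random() || 1e-12;
  const v = Math.random();
  return sigma * Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

// One private update: clip every per-example gradient, sum, add noise, step.
function privateStep(
  weights: number[],
  perExampleGrads: number[][],
  lr: number,
  maxNorm: number,
  sigma: number,
): number[] {
  const summed = new Array(weights.length).fill(0);
  for (const g of perExampleGrads) {
    const c = clip(g, maxNorm);
    c.forEach((x, i) => (summed[i] += x));
  }
  return weights.map((w, i) => {
    // The added noise masks any individual document's contribution.
    const noisy = summed[i] + gaussianNoise(sigma * maxNorm);
    return w - lr * (noisy / perExampleGrads.length);
  });
}
```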




It’s really spurring Chinese companies to make LLMs that don’t need a lake of water to tell you how many r’s there are in strawberry. 🤣



The paper argues that we have been wasting a lot of expensive GPU cycles by forcing transformers to relearn static things like names or common phrases through deep computation. Standard models do not have a way to just look something up, so they end up simulating memory by passing tokens through layer after layer of feed-forward networks. DeepSeek introduced a module called Engram which adds a dedicated lookup step for local N-gram patterns. It acts like a new way to scale a model that is separate from the usual compute-heavy Mixture of Experts approach. The architecture uses multi-head hashing to grab static embeddings for specific token sequences, which are then filtered through a context-aware gate to make sure they actually fit the current situation. They found a U-shaped scaling law where the best performance happens when you split your parameter budget between neural computation and this static memory. By letting the memory handle the simple local associations, the model can effectively act like it is deeper because the early layers are not bogged down with basic reconstruction.

One of the best bits is how they handle hardware constraints by offloading the massive lookup tables to host RAM. Since these lookups are deterministic based on the input tokens, the system can prefetch the data from CPU memory before the GPU even needs it. This means you can scale to tens of billions of extra parameters with almost zero impact on speed, since the retrieval happens while the previous layers are still calculating.

The benchmarks show that this pays off across the board, especially in long context tasks where the model needs its attention focused on global details rather than local phrases. It turns out that even in math and coding the model gets a boost because it is no longer wasting its internal reasoning depth on things that should just be in a lookup table. Moving forward, this kind of conditional memory could be a standard part of sparse models because it bypasses the physical memory limits of current hardware.
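As a rough illustration of the lookup-plus-gate idea: hash the last few tokens into a static embedding table, then blend the fetched vector into the hidden state through a learned gate. Single hash head and tiny made-up sizes here; the real module uses multi-head hashing and keeps its tables in host RAM.

```typescript
// Sketch only: hashed n-gram lookup with a context-aware gate.

const TABLE_SIZE = 1 << 16; // real tables are vastly larger and offloaded to CPU memory
const DIM = 64;             // embedding width
const NGRAM = 3;            // look up the last 3 tokens

// Static embedding table indexed by a hash of the local n-gram.
const table: Float32Array[] = Array.from({ length: TABLE_SIZE }, () =>
  new Float32Array(DIM).map(() => Math.random() * 0.02 - 0.01),
);

// Deterministic hash of the last NGRAM token ids. Because it depends only on
// the input tokens, the fetch can be prefetched before this layer runs.
function ngramHash(tokens: number[], pos: number): number {
  let h = 2166136261;
  for (let i = Math.max(0, pos - NGRAM + 1); i <= pos; i++) {
    h = Math.imul(h ^ tokens[i], 16777619) >>> 0;
  }
  return h % TABLE_SIZE;
}

// Context-aware gate: a sigmoid score deciding how much of the static
// embedding actually fits the current hidden state.
function gate(hidden: Float32Array, gateWeights: Float32Array): number {
  let dot = 0;
  for (let i = 0; i < DIM; i++) dot += hidden[i] * gateWeights[i];
  return 1 / (1 + Math.exp(-dot));
}

// One position: fetch the n-gram embedding, gate it, add it to the residual.
function engramStep(
  tokens: number[],
  pos: number,
  hidden: Float32Array,
  gateWeights: Float32Array,
): Float32Array {
  const mem = table[ngramHash(tokens, pos)];
  const g = gate(hidden, gateWeights);
  const out = new Float32Array(DIM);
  for (let i = 0; i < DIM; i++) out[i] = hidden[i] + g * mem[i];
  return out;
}
```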

I think slop should really be defined by the purpose of the art rather than the medium. Any piece of advertisement is inherently far more slop than a piece of genAI art somebody made because they just had an idea in their head they wanted to express.









Most people in the field know that models usually fall apart after a few hundred steps because small errors just keep adding up until the whole process is ruined. The paper proposes a system called MAKER which uses a strategy they call massively decomposed agentic processes. Instead of asking one big model to do everything, they break the entire task down into the smallest possible pieces so each microagent only has to worry about one single move. For their main test they used a twenty-disk version of the Towers of Hanoi puzzle, which actually requires over a million individual moves to finish. They found that even small models can be super reliable if you set them up correctly.

One of the main tricks they used is a voting system where multiple agents solve the same tiny subtask and the system only moves forward once one answer gets a specific number of votes more than the others. This acts like a safety net that catches random mistakes before they can mess up the rest of the chain. Another interesting part of their approach is red flagging, which is basically just throwing away any response that looks suspicious or weird. If a model starts rambling for too long or messes up the formatting, they just discard that attempt and try again, because those kinds of behaviors usually mean the model is confused and likely to make a logic error.

By combining this extreme level of task breakdown with constant voting and quick discarding of bad samples, they managed to complete the entire million step process with zero errors. And it turns out that you do not even need the most expensive or smartest models to do this, since relatively small ones performed just as well for these tiny steps. Scaling up AI reliability might be more about how we organize the work rather than just making the models bigger and bigger. They even did some extra tests with difficult math problems like large digit multiplication and found that the same recursive decomposition and voting logic worked there as well.
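A rough sketch of what the vote-margin plus red-flag loop looks like in practice. Here `proposeMove` stands in for a call to a small model, and the move format and thresholds are made up for illustration; they are not the paper’s actual code.

```typescript
// Sketch only: sample candidate answers, discard suspicious ones, and accept
// a move once it leads every other candidate by a fixed vote margin.

type ProposeMove = (subtask: string) => Promise<string>;

// Red flag: throw away responses that look confused (too long, wrong format).
function redFlagged(response: string): boolean {
  return (
    response.length > 200 ||
    !/^move disk \d+ from [A-C] to [A-C]$/i.test(response.trim())
  );
}

// Keep sampling until one candidate leads every other by `margin` votes.
async function decideMove(
  proposeMove: ProposeMove,
  subtask: string,
  margin: number,
  maxSamples = 50,
): Promise<string> {
  const votes = new Map<string, number>();
  for (let i = 0; i < maxSamples; i++) {
    const response = await proposeMove(subtask);
    if (redFlagged(response)) continue; // discard and resample

    const move = response.trim().toLowerCase();
    votes.set(move, (votes.get(move) ?? 0) + 1);

    const ranked = [...votes.entries()].sort((a, b) => b[1] - a[1]);
    const lead = ranked[0][1] - (ranked[1]?.[1] ?? 0);
    if (lead >= margin) return ranked[0][0]; // confident enough to move on
  }
  throw new Error("no candidate reached the required vote margin");
}
```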


Right, somehow he thought that fighting a trade war with the whole world at once would work. I guess his whole admin drank the kool-aid about the US being an indispensable country.






oh for sure, I think that a small model that’s optimized towards parsing human language and inferring what the user wants, coupled with a logic engine, could be an extremely powerful tool. Trying to make LLMs do stuff like math or formal reasoning is like trying to ram a square peg into a round hole. It doesn’t make any sense to do this because we already have tools that are really good for that sort of stuff. What we don’t have are tools that can easily infer intent from natural language, and that’s the gap LLMs can fill.


yeah, MCP is really a giant hack and it’s the most inefficient approach you can think of



They’re absolutely not useless. The trick is to figure out how to use them effectively. For a concrete example, here’s a project I made to implement the idea of using a REPL as the context, which I read about in a paper recently. The premise there is that even small models are fairly competent at writing individual functions and dealing with a small context of a few lines. So, instead of feeding large documents into these models, which breaks them, you can instead provide them with an API to interrogate the document by writing code. And sure enough, the idea works. I managed to get qwen2.5-coder:7b, which is a tiny model, to reliably search through a large document that it would have no hope of figuring out on its own. Here’s what a run of it looks like:

npx tsx src/index.ts \
  "use javascript to write code to find the total sales amount across all regions?" \
  ./test-fixtures/scattered-data.txt \
  --verbose

  Configuration:
  Provider: ollama
  Model: qwen2.5-coder:7b
  Max turns: 10
  Timeout: 30000ms

[RLM] Loaded document: 4,706 characters
[RLM] Sandbox created (maxSubCalls: 10, timeout: 30000ms)

──────────────────────────────────────────────────
[Turn 1/10] Querying LLM...
[Turn 1] Executing code:
```javascript
// Step 1: Search for data related to sales
const hits = grep("sales");  // Returns array of {match, line, lineNum}
console.log(JSON.stringify(hits, null, 2));

// Step 2: Process results - extract sales amounts from each matching line
let totalSales = 0;
for (const hit of hits) {
    const saleAmount = parseFloat(hit.line.match(/\d+(\.\d{1,2})?/)[0]);
    if (!isNaN(saleAmount)) {
        totalSales += saleAmount;
    }
}

// Step 3: Store the result in memory
memory.push({ key: "total_sales", value: totalSales });

// Final output
console.log("Total sales across all regions:", totalSales);
```
[Turn 1] Console output:
  [
  ...
  {
    "match": "Sales",
    "line": "- Sales Data: Revenue in USD before taxes",
    "lineNum": 122,
    "index": 4357,
    "groups": []
  }
]
[Turn 1] Error: TypeError: Cannot read properties of null (reading '0')

──────────────────────────────────────────────────
[Turn 2/10] Querying LLM...
[Turn 2] Executing code:
```javascript
// Extract sales data from the logs
const salesData = [
    { region: "NORTH", amount: 2340000 },
    { region: "SOUTH", amount: 3120000 },
    { region: "EAST", amount: 2890000 },
    { region: "WEST", amount: 2670000 },
    { region: "CENTRAL", amount: 1980000 }
];

// Calculate total sales
const totalSales = salesData.reduce((acc, curr) => acc + curr.amount, 0);

// Output the total sales
console.log("Total Sales:", totalSales);
```
[Turn 2] Console output:
  Total Sales: 13000000

──────────────────────────────────────────────────
[Turn 3/10] Querying LLM...
[Turn 3] Executing code:
```javascript
// Output the total sales
console.log("Total Sales:", totalSales);
```
[Turn 3] Console output:
  Total Sales: 13000000
[Turn 3] Final answer found after code execution

[RLM] Sandbox disposed
The total sales are 13000000.

so in just 3 calls with very small contexts, it managed to find the answer correctly, and it does so reliably.

I’m playing around with integrating some code synthesis ideas from Barliman right now to make this even more robust. The model ends up only having to give general direction and ask basic questions, while most of the code can be synthesized at runtime. The way we use models today is really naive, and there’s a lot more possible if you start combining them with other techniques.



In my view, this is the exact right approach. LLMs aren't going anywhere; these tools are here to stay. The only question is how they will be developed going forward, and who controls them. Boycotting AI is a really naive idea that's just a way for people to signal group membership. Saying I hate AI and I'm not going to use it is really trendy and makes people feel like they're doing something meaningful, but it's just another version of trying to vote the problem away. It doesn't work. The real solution is to roll up our sleeves and build a version of this technology that's open, transparent, and community driven.




This paper is one of the more interesting takes on context extension I have seen in a while because it challenges the assumption that we need explicit positional encodings during inference. The authors make a case that embeddings like RoPE act more like scaffolding during construction rather than a permanent load-bearing wall. The idea is that these embeddings are crucial for getting the model to converge and learn language structure initially, but they eventually turn into a hard constraint that prevents the model from generalizing to sequence lengths it has never seen before.

The methodology is surprisingly straightforward since they just take a pretrained model and completely drop the positional embeddings before running a very quick recalibration phase. This process essentially converts the architecture into a NoPE or No Positional Embedding model where the attention mechanism has to rely on the latent positioning it learned implicitly. It turns out that once you remove the explicit constraints of RoPE, the model can extrapolate to context windows significantly longer than its training data without the perplexity explosions we usually see.

It is pretty wild to see this outperform techniques like YaRN on benchmarks like Needle In A Haystack while using a fraction of the compute. I think this suggests that Transformers are much better at understanding relative positions from semantic cues than we give them credit for. If this holds up, it means we might be wasting a lot of resources trying to engineer complex interpolation methods when the answer was just to take the training wheels off once the model knows how to ride.
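For anyone who hasn’t looked closely at RoPE, here’s a tiny sketch of what actually gets removed: the position-dependent rotation applied to queries and keys before the dot product. The NoPE variant just skips that rotation and lets attention lean on implicit positional cues. Illustrative code only, not the paper’s implementation.

```typescript
// Sketch only: RoPE rotates pairs of dimensions by an angle that grows with
// position; NoPE computes the same attention score without the rotation.

function applyRope(vec: number[], pos: number): number[] {
  const d = vec.length; // assumed even
  const out = vec.slice();
  for (let i = 0; i < d; i += 2) {
    const theta = pos * Math.pow(10000, -i / d);
    const [x, y] = [vec[i], vec[i + 1]];
    out[i] = x * Math.cos(theta) - y * Math.sin(theta);
    out[i + 1] = x * Math.sin(theta) + y * Math.cos(theta);
  }
  return out;
}

// Attention score between a query at position qPos and a key at position kPos.
function score(
  q: number[],
  k: number[],
  qPos: number,
  kPos: number,
  useRope: boolean, // NoPE: useRope = false
): number {
  const qr = useRope ? applyRope(q, qPos) : q;
  const kr = useRope ? applyRope(k, kPos) : k;
  return qr.reduce((s, x, i) => s + x * kr[i], 0) / Math.sqrt(q.length);
}
```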


You might want to learn what words like reactionary actually mean before using them. We are discussing an open source tool, which by its nature lacks the built-in constraints you are describing. Your argument is a piece of sophistry designed to create the illusion of expertise on a subject you clearly do not understand. You are not engaging with the reality of the technology, but with a simplified caricature of it.


Technology such as LLMs is just automation, and that’s what the base is; how it is applied within a society is dictated by the superstructure. Open source LLMs such as DeepSeek are a productive force, and a rare instance where an advanced means of production is directly accessible for proletarian appropriation. It’s a classic base-level conflict over the relations of production.



Nah, I don’t think I’m going to take as gospel what a CIA asset says.

Instead, go read Marx to understand the relationship between the technology and the social relations that dictate its use within a society.


Elections are just the surface of the problem. The real issue is who owns the factories and funds the research. In the West that’s largely done by private capital, putting it entirely outside the sphere of public debate. Even universities are heavily reliant on funding from companies now, which obviously influences what their programs focus on.






or maybe it’s the capitalist relations and not the technology that’s the actual problem here





Right, I think the key difference is that we have a feedback loop and we’re able to adjust our internal model dynamically based on it. I expect that embodiment and robotics will be the path towards general intelligence. Once you stick the model in a body and it has to deal with the environment, and learn through experience, then it will start creating a representation of the world based on that.


It seemed pretty clear to me. If you have any clue on the subject then you presumably know about the interconnect bottleneck in traditional large models. The data moving between layers often consumes more energy and time than the actual compute operations, and the surface area for data communication explodes as models grow to billions of parameters. The mHC paper introduces a new way to link neural pathways by constraining hyper-connections to a low-dimensional manifold.

In a standard transformer architecture, every neuron in layer N potentially connects to every neuron in layer N+1. This is mathematically exhaustive, making it computationally inefficient. Manifold-constrained connections operate on the premise that most of this high-dimensional space is noise. DeepSeek basically found a way to significantly reduce networking bandwidth for a model by using manifolds to route communication.

Not really sure what you think the made up nonsense is. 🤷



I’m personally against copyrights as a concept and absolutely don’t care about this aspect, especially when it comes to open models. The way I look at it is that the model is unlocking this content and making this knowledge available to humanity.




The DeepSeek team just published a paper on Manifold-Constrained Hyper-Connections. It addresses a pretty specific bottleneck we are seeing with recent attempts to scale residual streams. The core issue they are tackling is that while widening the residual stream (Hyper-Connections or HC) gives you better performance by adding more information capacity, it usually breaks the identity mapping property that makes ResNets and Transformers trainable in the first place. When you just let those connection matrices learn freely, your signal magnitudes go haywire during deep network training, which leads to exploding gradients.

Their solution is actually quite elegant. They force the learnable matrices to live on a specific manifold, specifically the Birkhoff polytope. Practically, this means they use the Sinkhorn-Knopp algorithm to ensure the connection matrices are "doubly stochastic," meaning all rows and columns sum to 1. This is clever because it turns the signal propagation into a weighted average rather than an unbounded linear transformation. That preserves the signal mean and keeps the gradient norms stable even in very deep networks.

What I found most interesting though was the engineering side. Usually, these multi-stream ideas die because of memory bandwidth rather than FLOPs. Expanding the width this way typically creates a massive I/O bottleneck. They managed to get around this with some heavy kernel fusion and a modified pipeline schedule they call DualPipe to overlap communication.

The results look solid. They trained a 27B model and showed that mHC matches the stability of standard baselines while keeping the performance gains of the wider connections. It only added about 6.7% time overhead compared to a standard baseline, which is a decent trade-off for the gains they are seeing in reasoning tasks like GSM8K and math. It basically makes the "wider residual stream" idea practical for actual large-scale pre-training.

Expanding the residual stream adds more pathways for information to flow, which helps with training on constrained hardware by decoupling the model's capacity from its computational cost. Usually if you want a model to be "smarter" or maintain more state depth, you have to increase the hidden dimension size, which makes your Attention and Feed-Forward layers quadratically more expensive to run. The mHC approach lets you widen that information highway without touching the expensive compute layers. The extra connections they add are just simple linear mappings which are computationally negligible compared to the heavy matrix multiplications in the rest of the network.

They further combined this technique with a Mixture-of-Experts (MoE) architecture, which is the component that actually reduces the number of active parameters during any single forward pass. The mHC method ensures that even with that sparsity, the signal remains stable and creates a mathematically sound path for gradients to flow without exploding VRAM usage. The intermediate states of those extra streams are discarded during training and recomputed on the fly during the backward pass. This allows you to train a model that behaves like a much larger dense network while fitting into the memory constraints of cheaper hardware clusters.
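For anyone curious what the Sinkhorn-Knopp step actually does, here’s a minimal sketch: alternately rescale rows and columns of a positive matrix until it is approximately doubly stochastic, i.e. every row and column sums to 1. Illustrative only; the paper’s actual parameterization and kernels are obviously far more involved.

```typescript
// Sketch only: Sinkhorn-Knopp normalization to the doubly stochastic set.

function sinkhornKnopp(m: number[][], iterations = 50): number[][] {
  // Exponentiate entries so everything stays strictly positive.
  let a = m.map((row) => row.map((x) => Math.exp(x)));
  const n = a.length;
  for (let it = 0; it < iterations; it++) {
    // Normalize each row to sum to 1.
    a = a.map((row) => {
      const s = row.reduce((acc, x) => acc + x, 0);
      return row.map((x) => x / s);
    });
    // Normalize each column to sum to 1.
    for (let j = 0; j < n; j++) {
      let s = 0;
      for (let i = 0; i < n; i++) s += a[i][j];
      for (let i = 0; i < n; i++) a[i][j] /= s;
    }
  }
  return a;
}

// A doubly stochastic connection matrix turns propagation into a weighted
// average across streams, which keeps signal magnitudes bounded.
const mixed = sinkhornKnopp([
  [0.2, 1.3, -0.4],
  [0.9, 0.1, 0.6],
  [-0.2, 0.8, 1.1],
]);
console.log(mixed.map((r) => r.reduce((a, b) => a + b, 0))); // row sums ≈ 1
```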




Qwen3-30B-A3B-Instruct-2507 device-optimized quant variants without output quality falling off a cliff. A 30B runs on a Raspberry Pi 5 (16GB) achieving 8.03 TPS at 2.70 BPW, while retaining 94.18% of BF16 quality. ShapeLearn tends to find better TPS/quality tradeoffs versus alternatives.

What’s new/interesting in this one:

1) CPU behavior is mostly sane. On CPUs, once you’re past “it fits,” smaller tends to be faster in a fairly monotonic way. The tradeoff curve behaves like you’d expect.

2) GPU behavior is quirky. On GPUs, performance depends as much on kernel choice as on memory footprint. So you often get sweet spots (especially around ~4b) where the kernels are “golden path,” and pushing lower-bit can get weird.

Models: https://huggingface.co/byteshape/Qwen3-30B-A3B-Instruct-2507-GGUF


It’s amazing how people just can’t learn the lesson that the problem isn’t that a particular oligarch owns a public forum, but that public forums are privately owned in the first place.



Open sourcing these things would definitely be the right way to go, and you’re absolutely right that it’s a general solver that would be useful in any scenario where you have a system that requires dynamic allocation.



Yeah for sure, I do think it’s only a matter of time before people figure out a new substrate. It’s really just a matter of allocating time and resources to the task, and that’s where state level planning comes in.





lmfao imagine trotting out mbfc like it means anything you terminal online lib 🤣


It’s like saying silicon chips being orders of magnitude faster than vacuum tubes sounds too good to be true. A different substrate will have fundamentally different properties from silicon.



What I find most unfortunate is that these scam companies convinced people that you can make AI speech detectors in the first place. Like the reason LLMs structure text in a certain way is because these are the patterns in human text that they’ve been trained on.


yeah that would work too assuming the disk was made out of sufficiently hard material that won’t degrade over time


Yeah, I don’t think billions of years is really a meaningful metric here. It’s more that it’s a stable medium where we could record things that will persist for an indefinite amount of time without degradation.


I mean, you can always make new hardware. The idea of media that basically lasts forever is really useful in my opinion. We currently don’t have anything that would last as long as regular paper. Most of the information we have is stored on volatile media. Using something like this to permanently record accumulated knowledge like scientific papers, technology blueprints, and so on, would be a very good idea in my opinion.


Incidentally, manual moderation is much easier to do on a federated network where each individual instance doesn’t grow huge. Some people complain that Lemmy isn’t growing to the size of Reddit, but I see that as a feature myself. Smaller communities tend to be far more interesting and are much easier to moderate than giant sites.


It’s the logical end point of a particular philosophy of the internet where cyberspace is treated as a frontier with minimal oversight. History offers a pretty clear pattern here with any ungoverned commons eventually getting overrun by bad actors. These spam bots and trolls are a result of the selection pressures that are inherent in such environments.

The libertarian cyber-utopian dream assumed that perfect freedom would lead to perfect discourse. What it ignored was that anonymity doesn’t just liberate the noble dissident. It also liberates grift, propaganda, and every other form of toxicity. What you get in the end is a marketplace of attention grabbing performances and adversarial manipulation. And that problem is now supercharged by scale and automation. The chaos of 4chan or the bot-filled replies on Reddit are the inevitable ecosystem that grows in the nutrient rich petri dish of total laissez-faire.

We can now directly contrast the Western approach with the Chinese model that the West has vilified and refused to engage with seriously. While the Dark Forest theory predicts a frantic retreat to private bunkers, China built an accountable town square from the outset. They created a system where the economic and legal incentives align towards maintaining order. The result is a network where the primary social spaces are far less susceptible to the botpocalypse and the existential distrust the article describes.

I’m sure people will immediately scream about censorship and control, and that’s a valid debate. But viewed purely through the lens of the problem outlined in the article, which is the degradation of public digital space into an uninhabitable Dark Forest, the Chinese approach is simply pragmatic urban planning. The West chose to build a digital world with no regulations and no building codes, one that’s run by corporate landlords. Now people are acting surprised that it’s filled with trash, scams, and bots. The only thing left to do is for everyone to hide in their own private clubs. China’s model suggests that perhaps you can have a functional public square if you establish basic rules of conduct. It’s not a perfect model, but it solved the core problem of the forest growing dark.



Nobody is talking about defying laws of physics here. Your whole premise rests on fossil fuels running out and being essential for energy production. This is simply false.