☆ Yσɠƚԋσʂ ☆
  • 1.64K Posts
  • 1.33K Comments
Joined 6Y ago
Cake day: Jan 18, 2020



For sure, I think they genuinely believe that if AI got good enough, they could just cull all the pesky workers at that point and live like gods.


For sure, they’ve probably dropped more significant papers in the past year than any other group. It does seem like the mindset in China is very different overall, though. In the States, it’s basically a cult at this point where they’re trying to build a god with AGI. In China, it’s just treated like another tool for automation, and companies see it as common infrastructure, akin to Linux, that people will build interesting things on. Hence why pretty much all the models in China are developed on an open basis. Everybody there seems to realize that there’s no real path towards monetizing the models themselves.


It looks like you can run a low-quant version on a 125 GB machine, and apparently performance is still really good. https://github.com/makepad/llama_antirez_deepseek



The hardware efficiency gains are honestly the most interesting part of the paper. The main reason DeepSeek-V4 is so cheap to run comes down to how they completely bypassed the quadratic cost of standard attention for massive context windows. They built a hybrid attention architecture that interleaves Compressed Sparse Attention and Heavily Compressed Attention. Standard models keep every single token in the KV cache, which absolutely kills memory. CSA fixes this by compressing the KV cache of multiple tokens into a single entry, then using a sparse routing mechanism to only compute attention over the top-k most relevant compressed blocks. HCA takes it a step further by compressing an even larger number of tokens into one entry but computing dense attention over them. The result is that the 1.6T-parameter Pro model only uses a third of the compute FLOPs and 10% of the KV cache memory compared to DeepSeek-V3.2 at a one million token context.

They also aggressively pushed low-precision formats, applying FP4 quantization-aware training to the Mixture-of-Experts weights and the attention Query-Key paths. MoE models are notoriously memory bound because you have to constantly shuttle massive expert weights into the GPU cores. Dropping these to FP4 slashes the memory bandwidth bottleneck and lets the model run way faster during inference without ruining accuracy, since they handle the quantization dynamically during training.

On the infrastructure side, they wrote a custom fused kernel using TileLang that overlaps communication and computation. When running expert parallelism across multiple GPUs you usually hit a wall waiting for the network. DeepSeek slices the experts into micro-waves so the GPU is crunching matrix math on the first wave while the network is simultaneously pulling the data for the second wave. They basically hid the network latency behind the compute time, which means you don’t need super expensive interconnects to get peak hardware utilization out of the cluster.
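To make the CSA half of that concrete, here’s a minimal sketch of attention over a compressed KV cache with top-k block routing. The mean-pool compressor and every name here are my own illustrative assumptions, not the paper’s actual mechanism:

```python
import torch
import torch.nn.functional as F

def csa_attention(q, k, v, block_size=64, top_k=8):
    """Sketch: compress per-token KV into per-block entries, route each
    query to its top-k blocks, attend only over those (causality omitted)."""
    T, d = k.shape
    nb = T // block_size

    # Compress each block of block_size tokens into a single KV entry.
    # Mean pooling stands in for whatever learned compressor they use.
    k_c = k[: nb * block_size].reshape(nb, block_size, d).mean(dim=1)  # [nb, d]
    v_c = v[: nb * block_size].reshape(nb, block_size, d).mean(dim=1)  # [nb, d]

    # Sparse routing: score all compressed blocks, keep only the top-k,
    # so per-query cost scales with top_k instead of with T.
    scores = q @ k_c.T / d ** 0.5             # [T, nb]
    top = scores.topk(min(top_k, nb), dim=-1)

    # Dense softmax attention, but only over the selected entries.
    w = F.softmax(top.values, dim=-1)          # [T, top_k]
    return torch.einsum("tk,tkd->td", w, v_c[top.indices])
```

The point is that each query touches top_k compressed entries instead of the full sequence, which is where the KV-memory and FLOP savings come from.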









I’m expecting we’ll see the same dynamic we saw with solar panels and EVs. Once Chinese companies ramp up production, there’s going to be a flood of affordable GPUs. The big question is whether Western countries will ban them for ‘national security’ reasons.








I do love how steam engines remain the peak energy production technology humanity has managed to come up with.



I’m using 3.6 at 27B and Q8 on an M1 with 64 GB.


I think so yeah, searxng is definitely the most privacy-focused option.


I’ve just been using the built-in one, but searxng might be better. Seems like a lot of people prefer it.


I’ve stopped bothering with editor integrations for LLMs. I just get the model to make a phased plan, write using TDD, and tell it to do staged commits for each feature. Then I just review the diffs after.


I ended up settling on opencode, but I find all of them work more or less the same nowadays. Pi is an interesting one that’s very minimalist.


I haven’t tried comparing them myself, I guess you just kind of have to gauge if it works well enough. :)


It’s honestly incredible how good the local stack is nowadays. It’s literally better than any frontier model you could’ve rented like a year ago.


It’s entirely possible we’ll see fairly capable models that can run with 16 gigs of RAM in the near future. Qwen 3.5 came out in February, and you needed a server with hundreds of gigs of memory to run its 397B-param model. Fast forward to a couple of weeks ago: 3.6 comes out with a 27B-param version that beats the old 397B-param one in every way. Just stop and think about how phenomenal that is. https://qwen.ai/blog?id=qwen3.6-27b

So, it’s entirely possible people will find ways to optimize this stuff even further this year or the next, and we’ll get an even smaller model that’s more capable.


I’m sure it’s applicable outside AI as well, but I guess they’re developing these chips to run and train models.


Mainly data sovereignty. Running a local model means all your data stays on your machine. Any time you use a service, you’re sending whatever the model is working on to the company. Another advantage is the price: with services you have to pay a subscription, while local models run for the price of electricity.


16 GB is a bit low, unfortunately. You could run a 2-bit quant of the latest Qwen, but performance is going to be severely degraded. https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF

Might be worth trying though to see if it does what you need.
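If you want to try it without much setup, here’s a minimal sketch using llama-cpp-python; the exact Q2 filename in the unsloth repo is an assumption on my part, so check the repo page:

```python
from llama_cpp import Llama

# Pull the 2-bit quant straight from the HF repo (filename is a glob match).
llm = Llama.from_pretrained(
    repo_id="unsloth/Qwen3.6-35B-A3B-GGUF",
    filename="*Q2_K.gguf",   # assumed quant name, check what the repo ships
    n_gpu_layers=-1,         # offload whatever layers fit
    n_ctx=8192,
)

out = llm("Write a haiku about MoE models:", max_tokens=100)
print(out["choices"][0]["text"])
```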









A GGUF port of DFlash speculative decoding. Standalone C++/CUDA stack on top of ggml, runs on a single 24 GB RTX 3090, hosts the new Qwen3.6-27B. ~1.98x mean speedup over autoregressive decoding on Qwen3.6 across HumanEval / GSM8K / Math500, with zero retraining.

If you have CUDA 12+ and an NVIDIA GPU like an RTX 3090 / 4090 / 5090, then all you need to do is:

```
# clone the repo
cd lucebox-hub/dflash
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release
cmake --build build --target test_dflash -j

# fetch target (~16 GB)
hf download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B-Q4_K_M.gguf --local-dir models/

# matched 3.6 draft is gated: accept terms + set HF_TOKEN first
hf download z-lab/Qwen3.6-27B-DFlash --local-dir models/draft/

# run
DFLASH_TARGET=models/Qwen3.6-27B-Q4_K_M.gguf python3 scripts/run.py --prompt "def fibonacci(n):"
```

That's it. No Python runtime in the engine, no llama.cpp install, no vLLM, no SGLang. Luce DFlash will:

1. Load the Qwen3.6-27B Q4_K_M target weights (~16 GB) plus the matched DFlash bf16 draft (~3.46 GB) and run DDTree tree-verify speculative decoding (block size 16, default budget 22, greedy verify).
2. Compress the KV cache to TQ3_0 (3.5 bpv, ~9.7x vs F16) and roll a 4096-slot target_feat ring so 256K context fits in 24 GB. Q4_0 is the legacy path and tops out near 128K.
3. Auto-bump the prefill ubatch from 16 to 192 for prompts past 2048 tokens (~913 tok/s prefill on 13K prompts).
4. Apply sliding-window flash attention at decode (default 2048-token window, 100% speculative acceptance retained) so 60K context still decodes at 89.7 tok/s instead of 25.8 tok/s.
5. Serve over an OpenAI-compatible HTTP endpoint or a local chat REPL.

Running on an RTX 3090 with the Qwen3.6-27B UD-Q4_K_XL (unsloth Dynamic 2.0) target, 10 prompts/dataset, n_gen=256:

| Bench | AR tok/s | DFlash tok/s | AL | Speedup |
| --- | --- | --- | --- | --- |
| HumanEval | 34.90 | 78.16 | 5.94 | 2.24x |
| Math500 | 35.13 | 69.77 | 5.15 | 1.99x |
| GSM8K | 34.89 | 59.65 | 4.43 | 1.71x |
| Mean | 34.97 | 69.19 | 5.17 | 1.98x |
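For anyone new to speculative decoding, here’s a minimal sketch of the basic block-verify loop the numbers above come from (plain linear verify; the DDTree tree-verify variant is more involved). The function names and model interfaces are illustrative, not this repo’s API:

```python
import torch

# Sketch of plain block-verify speculative decoding (batch of 1, greedy).
# `target` and `draft` are assumed callables returning [1, L, vocab] logits.
def spec_decode_step(target, draft, ctx, block=16):
    # 1) The cheap draft model proposes `block` tokens autoregressively.
    proposal, d_ctx = [], ctx
    for _ in range(block):
        tok = draft(d_ctx)[:, -1].argmax(-1, keepdim=True)  # [1, 1]
        proposal.append(tok)
        d_ctx = torch.cat([d_ctx, tok], dim=-1)

    # 2) One target forward pass scores all proposed positions at once.
    full = torch.cat([ctx] + proposal, dim=-1)
    logits = target(full)
    n = ctx.shape[-1]
    preds = logits[:, n - 1 : -1].argmax(-1)  # target's pick per position

    # 3) Accept the longest prefix where the target agrees with the draft;
    #    the target's own next token after that prefix comes for free, so
    #    one expensive pass yields `accepted + 1` tokens.
    accepted = 0
    while accepted < block and preds[0, accepted] == proposal[accepted][0, 0]:
        accepted += 1
    bonus = logits[:, n - 1 + accepted].argmax(-1, keepdim=True)
    return torch.cat([ctx] + proposal[:accepted] + [bonus], dim=-1)
```

If I’m reading the table right, the AL column is the average number of tokens accepted per verify step, which is why the speedup roughly tracks it.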




I’m still convinced this was just pure theatre to present them as the good and independent company that’s totally not working with the fascist state.




Yeah, I expect it’s going to be rolled out a lot more sanely in China than it will be in the west.


I mean if I was paralyzed I’d take my chances with the chip.



There’s plenty to understand. Synthetic apertures allow getting far higher resolution than you could otherwise, which means you can place satellites in a higher orbit and get the same coverage you’d get with more satellites in a lower orbit. This is what the article says, but you clearly have no clue regarding the subject and just need to argue for the sake of arguing. Go touch some grass.
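For a rough sense of why the synthetic aperture matters, textbook radar theory says a real aperture’s ground resolution is about λR/D (so it degrades with orbit altitude R), while SAR’s azimuth resolution is about D/2, independent of range. The numbers below are illustrative, not from the article:

```python
# Real aperture: resolution ~ lambda * R / D, worsens with range.
# Synthetic aperture: azimuth resolution ~ D / 2, range-independent.
wavelength = 0.03  # X-band radar, ~3 cm
D = 5.0            # physical antenna length, meters

for R in (500e3, 2000e3):  # low vs high orbit slant range, meters
    real_res = wavelength * R / D
    sar_res = D / 2
    print(f"{R/1e3:>5.0f} km: real aperture ~{real_res:,.0f} m, SAR ~{sar_res:.1f} m")
```

Raising the orbit 4x makes the real-aperture resolution 4x worse but leaves the SAR figure untouched, which is exactly the coverage trade the article describes.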


Sounds like you need to read up on what synthetic aperture radar is to understand the article.




Huawei Ascend supernode to support Deepseek V4
There is no longer any CUDA dependency anywhere in its stack, which is probably the biggest deal of all. For those who don't know, CUDA is Nvidia's software layer, the foundation nearly every frontier AI model in the world is built on. Except, as of today, DeepSeek V4, which can run entirely on Huawei Ascend chips via Huawei's CANN framework. China now has its own domestic AI stack, top to bottom.









I absolutely hate the fact that every chat app now has a proprietary protocol and you have to install a bunch of apps to talk to different people. It’s such a travesty.





True, but only as prototypes that never went into mass production.


Perhaps they mean commercially produced ones. To my knowledge, these have never gone past the experimental prototype stage.


Ground effect vehicles are basically airplanes that are forced to fly really, really low. They take off from water and cruise just a few meters above the surface. At that altitude, the air gets compressed between the wing and the ground or water, which creates a huge cushion of extra lift. This lets the vehicle carry way more weight than a normal plane of the same size and power, making it incredibly efficient for hauling cargo over water. The trick is that it only works over flat surfaces like oceans or lakes, and the piloting can be tricky because you’re skimming the waves at high speed without actually being able to climb to a higher altitude. It’s a neat piece of engineering that trades operational flexibility for raw lifting power.
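To put rough numbers on the efficiency claim, there’s a common textbook approximation (McCormick’s) for how much ground effect cuts induced drag; treat this as illustrative, not a design formula:

```python
# McCormick's approximation: induced drag in ground effect relative to
# free air is ~ (16h/b)^2 / (1 + (16h/b)^2), with h = height above the
# surface and b = wingspan.
for h_over_b in (0.05, 0.1, 0.25, 1.0):
    ratio = (16 * h_over_b) ** 2 / (1 + (16 * h_over_b) ** 2)
    print(f"h/b = {h_over_b:<4}: induced drag at {ratio:.0%} of free-air value")
```

Skimming at a twentieth of a wingspan above the water, induced drag drops to under 40% of its free-air value; by one wingspan up, the effect has basically vanished, which is why these vehicles have to stay glued to the surface.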




You may not realize this, but there was more than one robot at the competition. Westie cope is truly taking on epic proportions now.




He singlehandedly killed the petrodollar and made clean energy a global imperative. He truly may be the best president the empire has had.



Exactly, I really hope that renewables start being reframed as energy sovereignty.