How will people run LLaMa 3 405B locally by 2025?
* 91%: Gaming GPUs + heavy quantization (e.g. 6x4090 @ Q2_0)
* 65%: Unified memory (e.g. Apple M4 Ultra)
* 60%: Tensor GPUs + modest quantization (e.g. 4xA100 2U rackmount)
* 60%: Distributed across clustered machines (e.g. Petals)
* 41%: Server CPU (e.g. AMD EPYC with 512GB DDR5)
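
For scale, here is a rough back-of-envelope estimate of weight-only memory for a 405B-parameter model at a few precisions. The bits-per-weight figures are approximate averages for typical quantization formats and ignore KV cache and activations.

```python
# Approximate weight-only memory for a 405B-parameter model.
# Bits-per-weight values are rough averages, not any exact format.
PARAMS = 405e9

def weight_gb(bits_per_weight: float) -> float:
    """Weight storage in GB (decimal) at the given average precision."""
    return PARAMS * bits_per_weight / 8 / 1e9

for name, bpw in [("FP16", 16.0), ("~8-bit", 8.5), ("~4-bit", 4.5), ("~2-bit", 2.6)]:
    print(f"{name:>7}: ~{weight_gb(bpw):.0f} GB")
# FP16 ~810 GB, ~8-bit ~430 GB, ~4-bit ~228 GB, ~2-bit ~132 GB
```

At roughly 2.6 bits per weight the weights alone are around 130 GB, which is why the gaming-GPU option (6x4090 is 144 GB of VRAM total) is paired with heavy quantization, while modest ~4-bit quantization fits in a 4xA100 80GB rackmount.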

"Cloud" is a boring answer. User base of interest is somewhere between hobbyists with a budget and companies with a couple of self-hosted racks.



My bet is locally on Apple CPU / GPU (by whatever name it's called).
And since this will still be expensive, the rest will run in a datacenter on server-class GPUs / inference chips (not sure what those look like yet).

* Apple will find a way to compress/store weights in firmware such that you can work with, say, 64 GB of RAM (quick arithmetic check below).
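
A quick check on that 64 GB figure, assuming a 405B-parameter model and counting weight storage only:

```python
# How many bits per weight would 405e9 parameters need to fit in 64 GB?
# Counts weights only; ignores KV cache, activations, and OS overhead.
PARAMS = 405e9
BUDGET_BYTES = 64e9
print(BUDGET_BYTES * 8 / PARAMS)  # ~1.26 bits per weight
```

That is well below the roughly 2.5-4.5 bits per weight of common post-training quantization formats, so it would take a noticeably more aggressive compression scheme than what hobbyists run today.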

@VishalDoshi Have any "texture compression"-style decoders for LLM weights been prototyped?
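
For intuition, here is a minimal, hypothetical sketch of blockwise low-bit quantization with a per-block scale, which is the basic idea shared by GPU texture (block) compression and llama.cpp-style 2-bit weight formats. It is not any real file format; the block size and level placement are just illustrative.

```python
import numpy as np

BLOCK = 32  # weights per block; each block stores one fp16 scale + 2-bit codes

def quantize_block(w: np.ndarray):
    """Map one block of weights to 2-bit codes {0..3} plus one scale."""
    amax = float(np.abs(w).max())
    scale = amax / 1.5 if amax > 0 else 1.0
    # Quantization levels are (-1.5, -0.5, +0.5, +1.5) * scale.
    codes = np.clip(np.round(w / scale + 1.5), 0, 3).astype(np.uint8)
    return scale, codes

def dequantize_block(scale: float, codes: np.ndarray) -> np.ndarray:
    """Reconstruct approximate weights: one multiply-add per weight."""
    return (codes.astype(np.float32) - 1.5) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(BLOCK).astype(np.float32)
scale, codes = quantize_block(w)
w_hat = dequantize_block(scale, codes)
print("max abs error:", float(np.abs(w - w_hat).max()))
print("bits per weight:", (2 * codes.size + 16) / BLOCK)  # -> 2.5
```

Since decoding is a single multiply-add per weight, doing it on the fly during inference (the way GPUs decode compressed textures) seems at least plausible; whether anyone ships a dedicated decoder for LLM weights is the open question.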


How many answers are you picking? I'm sure someone somewhere will do each of these.

@MingweiSamuel Whatever looks Pareto-dominant based on vibes from Twitter, /g/, and r/LocalLLaMa. For example, the current 70B meta looks like multiple gaming GPUs or Apple unified memory, with very rare DIY-adapted A100 frankenracks.

If the community doesn't settle on viable 405B solutions by EoY, everything gets a NO.
