Neural Magic (Acquired by Red Hat)
Software Development
Somerville, Massachusetts · 18,286 followers
We are on a mission to bring open-source LLMs and vLLM to every enterprise on the planet. The future of AI is open.
About us
Together with our community, we engineer sparse LLM, CV, and NLP models that are more efficient and performant in production. Why does this matter? Sparse models are more flexible and can achieve unrivaled latency and throughput performance on your private CPU and GPU infrastructure. Check us out on GitHub and join the Neural Magic Slack Community to get started with software-delivered AI.
- Website
- http://neuralmagic.com/
- Industry
- Software Development
- Company size
- 51-200 employees
- Headquarters
- Somerville, Massachusetts
- Type
- Privately Held
- Founded
- 2018
- Specialties
- machine learning, deep learning, and artificial intelligence
Products
Locations
- Primary
55 Davis Sq
Floor 3
Somerville, Massachusetts 02144, US
Updates
Neural Magic (Acquired by Red Hat) reposted this
Check out our latest work on LLM compression and efficient training & deployment!
Excited to share our latest preprint detailing our team's recent work at LinkedIn: https://lnkd.in/dWHTuKJm! Our focus has been on training and deploying efficient Large Language Models (LLMs) across various predictive and generative applications. Through techniques like knowledge distillation, model compression via pruning and quantization, and CUDA kernel optimization, we've successfully developed and deployed small language models that largely maintain the quality of larger foundation models while offering significantly higher inference throughput and lower latency. Notably, we've achieved over a 20x reduction in model size with minimal impact on model quality.

In our paper, we discuss the specifics of our approach to model compression and efficiency, sharing practical insights gained along the way. The paper covers both the methodology and the practice of efficient LLM deployment. In particular, we demonstrate the power of model pruning through combinatorial optimization, adding to the growing list of real-world applications of discrete optimization.

Read more about our work:
- Efficient AI in Practice: Training and Deployment of Efficient LLMs for Industry Applications: https://lnkd.in/dWHTuKJm
- Structured pruning with OSSCAR: https://lnkd.in/d8emmFQM
- Model quantization with QuantEase: https://lnkd.in/dZna796n
- 360Brew: A foundation model for personalized recommendation: https://lnkd.in/dUXydhaZ

Kudos to our amazing team, and especially Aman Gupta, Yun Dai, Qingquan Song, and Ata Fatahi, who made this work possible!
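The post above describes pruning via combinatorial optimization (OSSCAR). As a much simpler illustration of the general idea of weight pruning — not the combinatorial approach used in the paper — here is a hypothetical magnitude-pruning sketch in plain Python:

```python
# Illustrative sketch only: simple magnitude pruning, NOT the
# combinatorial-optimization approach (OSSCAR) described in the post.
# All names here are hypothetical.

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights.

    weights:  flat list of floats (a layer's weight matrix, flattened)
    sparsity: fraction in [0, 1] of weights to remove
    """
    n_prune = int(len(weights) * sparsity)
    if n_prune == 0:
        return list(weights)
    # Threshold = magnitude of the n_prune-th smallest |w|
    threshold = sorted(abs(w) for w in weights)[n_prune - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]

w = [0.9, -0.01, 0.5, 0.02, -0.7, 0.03]
pruned = magnitude_prune(w, 0.5)  # zeroes the 3 smallest-magnitude weights
```

Magnitude pruning treats each weight independently; the appeal of a combinatorial formulation, as in the paper, is that it can account for interactions between pruned weights when choosing which ones to remove.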
[vLLM Office Hours #21] vLLM Production Stack Deep Dive

Join us for an overview of the components in the vLLM Production Stack (https://lnkd.in/gsSnNb9K) and practical guidance on deploying it effectively. We'll dive into the technical details, including an in-depth look at the prefix-aware router and its role in optimizing request routing, as well as KV cache offloading and its impact on performance and scalability.
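The core idea behind prefix-aware routing is that requests sharing a prompt prefix should land on the same replica, so that replica's KV prefix cache can be reused. A minimal sketch of that idea — not the actual vLLM Production Stack router, and with all names hypothetical:

```python
# Toy sketch of prefix-aware routing: hash only the leading prompt
# prefix so prompts that share it map deterministically to the same
# replica, maximizing prefix-cache hits there. This is NOT the actual
# vLLM Production Stack implementation.
import hashlib

class PrefixAwareRouter:
    def __init__(self, replicas, prefix_len=32):
        self.replicas = replicas      # e.g. list of backend identifiers
        self.prefix_len = prefix_len  # characters of prompt used as the key

    def route(self, prompt):
        key = prompt[: self.prefix_len].encode("utf-8")
        digest = hashlib.sha256(key).digest()
        idx = int.from_bytes(digest[:8], "big") % len(self.replicas)
        return self.replicas[idx]

router = PrefixAwareRouter(["replica-0", "replica-1", "replica-2"])
a = router.route("You are a helpful assistant. Summarize: doc A")
b = router.route("You are a helpful assistant. Summarize: doc B")
# Identical 32-char prefix -> identical routing decision (a == b)
```

A real router would additionally track actual cached prefixes and replica load rather than hashing a fixed-length prefix, but the sketch shows why shared-prefix affinity improves cache hit rates.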
Neural Magic (Acquired by Red Hat) reposted this
Friends from the East Coast! Join us on Tuesday, March 11 in Boston for the first ever East Coast vLLM Meetup. You will meet vLLM contributors from Neural Magic (Acquired by Red Hat), Red Hat, Google, and more. Come share how you are using vLLM and see what's on the roadmap!
Neural Magic (Acquired by Red Hat) reposted this
📚 Our very own Yihua Cheng will be presenting in the vLLM office hour this Thursday with Neural Magic (Acquired by Red Hat)! ⏰ It is time to learn more about large-scale LLM serving with the vLLM Production-Stack!

We will discuss in depth the design and functionality of the vLLM Production-Stack, an open-source framework for serving LLM models at scale with low cost and easy management. We will also share updates on the latest vLLM integration. Mark your calendar for March 6, 2:00 PM ET / 11:00 AM PT and register at this link: https://lnkd.in/euF8m73q

Comment below for anything else you want us to talk about in the office hour!
🔗 Code: https://lnkd.in/gsSnNb9K
📝 Blog: https://lnkd.in/gdXdRhEj
📚 Tutorials: https://lnkd.in/gWz7gW6T
🎥 30-sec demo: youtu.be/RLk8zbQ-eqM
#LLM #vLLM #opensource #K8s #AI #AIInfra #Systems
Neural Magic (Acquired by Red Hat) reposted this
🚀 Quantized DeepSeek-R1 LLMs: Optimized for Reasoning and Compressed for Deployment!

While many companies are still reasoning through the implications of the open-source DeepSeek-R1 launch, we've been heads-down refining and compressing these models for deployment, maximizing efficiency with quantization while maintaining accuracy.

To our knowledge, this is the first comprehensive exploration of quantization for reasoning LLMs, spanning six different models, hundreds of thousands of evaluated inferences, and thousands of benchmarked scenarios. Our findings offer key insights into model behavior, trade-offs, and performance improvements across different quantization techniques.

📊 Specifically, we found:
• Larger models are easier to quantize: 7B+ models retained full accuracy at 8-bit (FP8, INT8) and 97% at 4-bit (INT4)
• Smaller models need more tuning: after more thorough hyperparameter tuning, the 1.5B model recovered 97% at 8-bit and 94% at 4-bit
• Consistent performance boosts: speedups averaged around 1.5x, with some up to 4x, depending on the model size, hardware, and inference scenario

🔍 Some interesting insights:
• Reasoning benchmarks have high variance: AIME had up to a 7-point standard deviation in pass@1, and to address this we ran 20 different random seeds for each eval
• Tokenizer issues delayed results: the DeepSeek tokenizer on Hugging Face was missing the <think> token in chat templates, leading to degraded accuracy even in baseline models

If you're interested in diving in more, check out our latest Red Hat research blog (our first piece not published solely through Neural Magic (Acquired by Red Hat)!): https://lnkd.in/eENPT8xz

Or get hands-on with the Hugging Face collection and start deploying now: https://lnkd.in/eQu_bhMR

Let me know what you think and what you want to see next!
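The variance point above is worth making concrete: pass@1 is just the fraction of problems solved on the first attempt, so a single run is a noisy estimate, and repeating the eval across seeds lets you report a mean and spread. A minimal sketch with made-up data (the numbers below are purely illustrative, not the AIME results):

```python
# Illustrative sketch: estimating pass@1 and its run-to-run spread by
# repeating the evaluation across random seeds, as described in the post.
# The outcome data below is invented for illustration.
import statistics

def pass_at_1(results):
    """pass@1 for one run: fraction of problems solved on the first attempt."""
    return sum(results) / len(results)

# One list of per-problem pass/fail outcomes per seed (hypothetical data).
runs_by_seed = [
    [1, 0, 1, 1, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 1, 1],
    [1, 0, 1, 0, 0],
]

scores = [pass_at_1(run) for run in runs_by_seed]
mean = statistics.mean(scores)    # central estimate of pass@1
spread = statistics.stdev(scores)  # seed-to-seed standard deviation
```

With only a handful of problems per run (as in AIME), even identical models can swing several points between seeds, which is why averaging over 20 seeds matters.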
Accurately quantized DeepSeek-R1 is here!
🐋 Introducing state-of-the-art quantized DeepSeek-R1 reasoning models for blazingly fast inference!

DeepSeek recently released a suite of distilled reasoning models across the Llama and Qwen families. These models demonstrated impressive performance on a wide range of reasoning benchmarks and applications. Given their potential, we set out to quantize them while preserving their reasoning capabilities, ensuring faster inference without compromising accuracy.

🔍 Key findings from quantization:
- Larger models quantize exceptionally well: minimal tuning was required to retain near-lossless accuracy.
- Smaller models require careful tuning: we employed techniques like MSE-optimal clipping and activation reordering to help stabilize quantization.

📌 How do different quantization schemes compare?
- FP W8A8: practically lossless accuracy recovery.
- INT W8A8: competitive, recovering ~99% of the original accuracy.
- INT W4A16: modest drop on AIME & GPQA-Diamond, but strong on MATH-500.

You can find the models fully open-sourced on our Hugging Face Hub: https://lnkd.in/d3stvmvi

For more details about evaluations and performance benchmarking, please check out our Neural Magic (Acquired by Red Hat) blog: https://lnkd.in/dr-D5WTB
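In these scheme names, W8A8 means 8-bit weights with 8-bit activations, and W4A16 means 4-bit weights with 16-bit activations. As a purely illustrative sketch of the "W8" half — not the calibrated GPTQ-style pipeline used for the actual models — symmetric round-to-nearest INT8 weight quantization looks like:

```python
# Toy sketch of symmetric round-to-nearest INT8 weight quantization
# (the "W8" in W8A8). This is NOT the production quantization pipeline;
# real schemes use calibration data and per-channel/group scales.

def quantize_int8(weights):
    """Map floats to int8 levels with a single per-tensor scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [1.27, 0.8, -0.33, 0.0]
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Reconstruction error shrinks as bit-width grows; 4-bit (the "W4" in
# W4A16) has only 15 levels per sign instead of 127, so its rounding
# error is correspondingly larger -- hence the accuracy gap the post
# reports between W8A8 and W4A16.
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
```

The "A" half works analogously on activations at runtime, which is why W8A8 needs activation scales (often calibrated) while W4A16 keeps activations in 16-bit and only compresses the weights.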
[vLLM Office Hours #20] DeepSeek and vLLM

DeepSeek is dropping a lot of exciting goodies this week during their Open Source Week, and we're thrilled to spotlight them at our bi-weekly vLLM Office Hours! We'll dive into "DeepSeek on vLLM: New Features, Optimizations, and More," making it a hands-on learning opportunity for anyone curious about DeepSeek's innovations and how they work with vLLM.

DeepSeek's advancements aren't just cool tech; they're reshaping how we build and deploy AI. DeepSeek's focus on efficiency means you can tackle bigger problems with fewer resources. Plus, with vLLM's seamless integration, you get these benefits without the headache. This week's Office Hours is your chance to learn how DeepSeek and vLLM team up. We'll unpack DeepSeek's features, demo their vLLM integration, and brainstorm what's next, together.
Neural Magic (Acquired by Red Hat) reposted this
Make sure to join the first vLLM East Coast meetup in Cambridge, MA! Great opportunity to learn more about production-grade inference serving with vLLM. We are excited to share project updates! The best feature requests are made in person: https://lu.ma/7mu4k4xx

vLLM | Red Hat | Neural Magic (Acquired by Red Hat)
Neural Magic (Acquired by Red Hat) reposted this
It has been an amazing week for open source AI infrastructure, as the DeepSeek team releases key components of the infrastructure stack that supports the complex and innovative V3/R1 architecture: https://lnkd.in/eWV7wQgr

So far, we have seen:
- *FlashMLA*: an efficient MLA decoding kernel for Hopper GPUs, which helps to accelerate attention (the bottleneck for long-context "reasoning-style" workloads)
- *DeepEP*: the first open-source EP communication library for MoE model training and inference, which helps to enable more complex parallelism schemes for serving the 600B+ parameter model with 256 experts
- *DeepGEMM*: an FP8 matmul library that supports both dense and MoE layers, helping to accelerate the first foundation model trained in FP8
- *EPLB*: an expert-parallel load balancer for V3/R1, enabling more complex expert-parallel deployments for at-scale workloads

At 2pm ET during the Neural Magic (Acquired by Red Hat) Office Hours, Lucas Wilkinson, Tyler Michael Smith, Michael Goin, Simon Mo, and I will dive into these items and cover our progress integrating them into vLLM! Stop by to ask questions!

Link to signup: https://lnkd.in/ePqDYgpT
Neural Magic (Acquired by Red Hat) reposted this
Multimodal Quantization with LLM Compressor!

Compression isn't just for text models anymore. With the latest v0.4.0 release of LLM Compressor, we now support multi-modal models, including vision-language (Llama 3.2 Vision) and audio (Whisper Large V2) models! Underlying this enablement are the same productized, state-of-the-art compression algorithms powering smaller, faster models without sacrificing accuracy.

Key highlights:
🔹 Vision & audio data pipeline support added.
🔹 Model tracing abilities for vision & audio models for performant GPTQ compression.
🔹 Verified implementations showing >99% accuracy recovery.
🔹 Open-source, fully customizable, and natively supported through vLLM for seamless deployments.

Want to try it out? Dive into the open-source examples: https://lnkd.in/eSQgHS2m
Want to learn more? Read through our blog post: https://lnkd.in/e8MNy8kN
What would you like us to work on next? Let me know in the comments!
Similar pages
Red Hat
Software Development
Raleigh, NC
Deci AI (Acquired by NVIDIA)
Software Development
Cerebras Systems
Computer Hardware
Sunnyvale, California
Nebius
Technology, Information and Internet
Pillar VC
Venture Capital and Private Equity Principals
Boston, Massachusetts
Anthropic
Research Services
Roboflow
Software Development
Anyscale
Software Development
San Francisco, California
Weights & Biases
Software Development
San Francisco, California
Hugging Face
Software Development