Business Wire

Kinara Ara-2 Processor Hits 12 Tokens Per Second Running 7 Billion Parameter LLMs

Generative AI capabilities of this leading-edge AI processor are demonstrated in new video available on YouTube

SANTA CLARA, Calif.–(BUSINESS WIRE)–


Qwen, available as open source under the Apache 2.0 license and backed by Alibaba Cloud (Tongyi Qianwen), is similar to LLaMA2 and comprises a series of models across diverse sizes (e.g., 0.5B, 4B, 7B, 14B, 32B, 72B) and various functions including chat, language understanding, reasoning, math, and coding. From a Natural Language Processing (NLP) perspective, Qwen can be used to process the commands a user issues in day-to-day operations on their computer. And unlike the voice command processing typically available in cars, Qwen and other Generative AI chat models are multilingual and accurate, and are not restricted to specific text sequences.

Beyond generating output text for simple and complex prompts at 12 tokens per second, effectively running Qwen1.5-7B and any other LLM on the edge requires the Kinara Ara-2 to support three high-level features: 1) the ability to aggressively quantize LLMs and other generative AI workloads while still delivering near floating-point accuracy; 2) extreme flexibility and capability to run all LLM operators end-to-end without relying on the host (this includes all model layers and activation functions); and 3) sufficient memory size and bandwidth to effectively handle these extremely large neural networks.
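To give a rough sense of why quantization and memory bandwidth matter together, the following is a back-of-envelope sketch rather than a description of how Ara-2 works. It assumes a nominal 7 billion weights, assumes that generating each token reads every weight once, and tries several hypothetical bit widths; the release does not state which precision Kinara uses, and the function names are illustrative only.

```python
# Back-of-envelope sizing; every number here is an illustrative assumption,
# not a published Ara-2 or Qwen specification.

NOMINAL_PARAMS = 7_000_000_000  # "7 billion parameter" figure, taken at face value


def weight_bytes(params: int, bits_per_weight: int) -> float:
    """Bytes needed to store the weights at a given precision."""
    return params * bits_per_weight / 8


def decode_bandwidth_bytes_per_s(
    params: int, bits_per_weight: int, tokens_per_s: float
) -> float:
    """Rough weight-read traffic, assuming each token reads every weight once."""
    return weight_bytes(params, bits_per_weight) * tokens_per_s


if __name__ == "__main__":
    # Hypothetical precisions: 16-bit floating point versus 8- and 4-bit integers.
    for bits in (16, 8, 4):
        size_gb = weight_bytes(NOMINAL_PARAMS, bits) / 1e9
        bw_gbs = decode_bandwidth_bytes_per_s(NOMINAL_PARAMS, bits, 12) / 1e9
        print(f"{bits:>2}-bit weights: {size_gb:.1f} GB, ~{bw_gbs:.0f} GB/s at 12 tokens/s")
```

Under these assumptions, halving the bits per weight halves both the storage footprint and the weight traffic needed to sustain a given token rate, which is why the first and third features above are closely linked.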

“Running any LLM on a low-power edge AI processor is quite a feat but hitting 12 output tokens per second on a 7B parameter LLM is a major accomplishment,” said Wajahat Qadeer, Kinara’s chief architect. “However, the best is yet to come, as we are on target to hit 15 output tokens per second by applying advanced software techniques while leaving the model itself unmodified.”

With existing LLMs and new LLMs that become available on Hugging Face and elsewhere, Kinara can quickly bring up these models by leveraging its innovative software and architectural flexibility, executing these models with floating-point accuracy while offering the low power dissipation of an integer processor. And beyond Generative AI applications, Ara-2 is capable of handling 16-32+ video streams fed into edge servers for high-end object detection, recognition, and tracking, using its advanced compute engines to process higher resolution images quickly and with high accuracy. Ara-2 is available as a stand-alone device, a USB module, an M.2 module, and a PCIe card featuring multiple Ara-2s.

Interested parties are invited to contact Kinara directly to see for themselves the Qwen1.5-7B and other LLM applications running on Ara-2.

About Kinara

Kinara provides the world’s most power- and price-efficient Edge AI inference platform supported by comprehensive AI software development tools. Enabling Generative AI and smart applications across retail, medical, industry 4.0, automotive, and smart cities, Kinara’s AI processors, modules, and software can be found at the heart of the AI industry’s most exciting and influential innovations. Kinara envisions a world of exceptional customer experiences, better manufacturing efficiency, and greater safety for all. Learn more at https://kinara.ai/

All registered trademarks and other trademarks belong to their respective owners.

Contacts

Kinara Contact

Napier Partnership:
Nesbert Musuwo, Account Manager, Napier B2B

Email Address: [email protected]
