Cost-Effective AI: Speeding Up Large Model Inference


The 2024 edition of re:Invent, the annual conference hosted by Amazon Web Services (AWS), put a spotlight on the rapidly evolving world of generative AI. The buzzwords that dominated discussions revolved around efficiency, cost reduction, and practical applications of AI. The industry is witnessing a notable shift: from a focus on all-encompassing, pre-trained large models to a more pragmatic approach that emphasizes reducing the scale of pre-training while enhancing model inference and applications.

According to industry insiders, the field of general-purpose large models is narrowing significantly: globally, perhaps no more than fifty players will ultimately be capable of pre-training such models. This shift indicates that many companies are pivoting toward the practical application of large models rather than focusing solely on expansive pre-training efforts.

A key characteristic shared by these emerging players is a heightened focus on return on investment (ROI) and on how large models can cut costs while increasing efficiency.

However, the journey toward real-world application of large models is no small feat. As Matt Garman, the new CEO of Amazon Web Services, aptly noted, "Artificial intelligence is a race without a finish line, and it will continue indefinitely." This statement encapsulates the relentless pursuit of innovation within the generative AI arena, suggesting that the competition will only grow fiercer over time.

As a leader in global cloud computing, AWS recognizes the immense potential of generative AI and is committed to harnessing it effectively. To meet the growing demand for practical applications of large models, AWS showcased a sweeping set of updates at re:Invent spanning computing, storage, databases, inference, AI, and generative AI applications.

They have built a comprehensive AI technology stack, ranging from foundational AI chips to intermediate large model platforms and upper-level generative AI applications.

The evolution from a relatively understated presence last year to a decisive offensive this year, culminating in a full-scale launch at year-end, illustrates AWS's strategic positioning in a generative AI race that has no finish line.

Moreover, during the conference, Andy Jassy, CEO of Amazon, reaffirmed AWS's ambition in the generative AI era, stating, "We will certainly prioritize technologies that genuinely matter to our customers, focusing resolutely on solving real problems." This client-centric approach underlines how AWS leverages its expertise across infrastructure, tools/models, and applications to deliver quick, cost-effective solutions, thereby solidifying its leadership in the cloud computing domain.

Central to this vision is AWS's state-of-the-art chip technology, which aims to make large model training and deployment more cost-effective.

Andy Jassy emphasized that computing costs become critical once generative AI applications reach significant scale. Currently, most generative AI applications rely predominantly on a narrow class of chips, chiefly GPUs, prompting the industry to seek more cost-competitive alternatives.

Peter DeSantis, a senior vice president at AWS, outlined the two foundational pillars of building AI infrastructure: the construction of more powerful servers and the establishment of larger and more efficient server clusters, both of which hinge critically on high-performance GPU chips.

Since its launch in 2020, Amazon's Trainium AI chip has been recognized for its efficiency in training AI models. At re:Invent, AWS unveiled the Trainium2 chip and introduced EC2 instances powered by it, alongside the Trainium2 Server and Trainium2 UltraServer, enabling users to achieve better performance and cost efficiency when training and deploying AI models.

For instance, the Amazon EC2 Trn2 instance integrates sixteen Trainium2 chips, interconnected with high-bandwidth, low-latency NeuronLink technology. A single node can deliver 20.8 petaflops (PFLOPS) of FP8 compute, roughly 1.3 PFLOPS per chip, offering 30% to 40% better price performance than comparable GPU instances and tailored for generative AI training and inference. In testing against similar offerings from other cloud providers, the Amazon EC2 Trn2 instance delivered more than a three-fold improvement in token generation throughput for the Llama 3.1 405B model.
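For readers who want to experiment, a minimal boto3 sketch for requesting such an instance might look like the following. The instance type name and AMI ID are illustrative assumptions, not confirmed values from the announcement.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Request a single Trainium2-powered instance. The instance type name
# and AMI ID below are illustrative assumptions; check the EC2 console
# for the actual Trn2 offerings and Neuron SDK AMIs in your region.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder Neuron deep learning AMI
    InstanceType="trn2.48xlarge",     # assumed name of the 16-chip Trn2 instance
    MinCount=1,
    MaxCount=1,
)
print(response["Instances"][0]["InstanceId"])
```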

Notably, leading companies such as Adobe, Poolside, Databricks, and Qualcomm have begun using the Trainium2 chip at scale. AWS also announced that the Trainium3 chip will arrive in 2025, built on a cutting-edge 3nm process and expected to deliver twice the performance of Trainium2 with a 40% improvement in energy efficiency.

As demand for large model training grows, single chips alone will not be sufficient.

AWS has leveraged its proprietary NeuronLink technology to interconnect 64 Trainium2 chips into a single UltraServer, amplifying computational capacity to five times that of current AI servers. Bandwidth reaches an impressive 2 TB per second, with latency capped at just one microsecond. Anthropic, meanwhile, has announced that its next-generation Claude model will be trained on Project Rainier, a cluster comprising hundreds of thousands of Trainium2 chips.

At the conference, AWS also tackled the storage and database layers that influence large model training and inference. It introduced Amazon S3 Tables, a new storage class designed specifically for Apache Iceberg, to meet rapidly rising demand from data lakes. The offering improves the performance and scalability of Iceberg tables, achieving three times the query performance and a tenfold increase in transactions per second for Parquet files stored in S3.
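As a rough sketch of the developer experience, creating a table bucket and an Iceberg table with boto3 might look like this. The s3tables client and operation names follow the launch-time API and should be treated as assumptions to verify.

```python
import boto3

# S3 Tables exposes "table buckets" purpose-built for Apache Iceberg.
# Client and operation names follow the launch-time API; verify them
# against the current boto3 documentation.
s3tables = boto3.client("s3tables", region_name="us-east-1")

bucket = s3tables.create_table_bucket(name="analytics-tables")
bucket_arn = bucket["arn"]

# Namespaces group tables, much like schemas in a relational database.
s3tables.create_namespace(tableBucketARN=bucket_arn, namespace=["sales"])

# Create an Iceberg table inside the namespace.
s3tables.create_table(
    tableBucketARN=bucket_arn,
    namespace="sales",
    name="orders",
    format="ICEBERG",
)
```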

As storage scales to the petabyte or exabyte level, metadata becomes increasingly crucial, giving organizations insight into the objects stored in S3 and enabling efficient data retrieval.

To address this, AWS launched the Amazon S3 Metadata service, which automatically extracts metadata from objects and stores it in near real time in new S3 Tables buckets, where it can be analyzed without building redundant infrastructure.
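Because the extracted metadata lands in queryable Iceberg tables, it can be explored with ordinary SQL engines such as Athena. The sketch below assumes an illustrative database and table name for the metadata; the actual naming convention is documented by S3 Metadata.

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Find recently modified objects larger than 1 MiB by querying the
# auto-maintained metadata table. The database and table names here
# are illustrative assumptions.
query = """
    SELECT key, size, last_modified_date
    FROM "s3_metadata_db"."my_bucket_metadata"
    WHERE size > 1048576
    ORDER BY last_modified_date DESC
    LIMIT 20
"""
athena.start_query_execution(
    QueryString=query,
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```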

In databases, AWS unveiled Amazon Aurora DSQL, a distributed SQL database that offers a maintenance-free experience, global cross-region deployment, and virtually unlimited scalability. It is distinguished by 99.999% multi-region availability and strong data consistency at low latency; AWS bills it as the fastest distributed SQL database for global deployment, with reads and writes up to four times faster than Google Spanner.
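Since Aurora DSQL is PostgreSQL-compatible, everyday use looks like ordinary Postgres once a connection is authenticated. The sketch below pairs psycopg2 with an IAM auth token; the dsql client and its token-generation method are assumptions based on launch materials.

```python
import boto3
import psycopg2

ENDPOINT = "my-cluster.dsql.us-east-1.on.aws"  # placeholder endpoint

# Generate a short-lived IAM auth token. The client and method names
# are assumptions from launch materials; verify against current boto3.
dsql = boto3.client("dsql", region_name="us-east-1")
token = dsql.generate_db_connect_admin_auth_token(
    Hostname=ENDPOINT, Region="us-east-1"
)

# From here on, it is ordinary PostgreSQL.
conn = psycopg2.connect(
    host=ENDPOINT, user="admin", password=token,
    dbname="postgres", sslmode="require",
)
with conn.cursor() as cur:
    cur.execute("SELECT now()")  # reads/writes stay strongly consistent across regions
    print(cur.fetchone())
```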

Furthermore, AWS introduced multi-region strong consistency for its NoSQL database, Amazon DynamoDB global tables. Together, these offerings ensure AWS can deliver highly available, globally scalable databases with low read and write latencies, whether customers need SQL or NoSQL.
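As a hedged sketch, opting a global table into the new strong-consistency mode might look like the snippet below; the MultiRegionConsistency parameter reflects the announcement, and its exact name and values should be confirmed against current documentation.

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Turn a table into a global table with a replica in eu-west-1 and
# request strongly consistent multi-region behavior. The parameter
# name and value reflect the re:Invent announcement; treat them as
# assumptions to verify.
dynamodb.update_table(
    TableName="orders",
    ReplicaUpdates=[{"Create": {"RegionName": "eu-west-1"}}],
    MultiRegionConsistency="STRONG",
)
```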

Within the cloud service sector today, computing power is still recognized as a more lucrative business than AI large model services.

Going forward, however, AI large models will be paramount. Jassy noted in a financial review that Amazon's cloud services showed a clear growth trajectory this year, with its AI services already generating billions of dollars in annual revenue.

AWS is not abandoning its efforts to develop proprietary large models, as Andy Jassy underlined during his address. Internal developers at Amazon have raised a comprehensive range of requests, asking for lower latency, reduced costs, fine-tuning capabilities, and stronger image and video processing in AI model applications.

To address these demands, AWS officially introduced Amazon Nova, a suite of large models covering text dialogue, image generation, and video creation, with the ambition of enabling seamless multi-modal interactions in the future.

The newly launched Amazon Nova line features four core models designed for different workloads: the high-efficiency Micro model for simple tasks, the Lite and Pro models in between, and the Premier model aimed at complex reasoning tasks and custom distillation.

These models support fine-tuning and distillation training, raising operational efficiency while lowering costs. Notably, Amazon Nova integrates with the existing knowledge bases in Amazon Bedrock, strengthening its retrieval-augmented generation capabilities.
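In practice, Nova models are invoked through Amazon Bedrock like any other hosted model. The sketch below uses the Converse API with an assumed Nova Lite model identifier; confirm the exact ID for your region.

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Call a Nova model through Bedrock's unified Converse API. The model
# ID follows launch naming and is an assumption to confirm per region.
response = bedrock.converse(
    modelId="amazon.nova-lite-v1:0",
    messages=[{
        "role": "user",
        "content": [{"text": "Draft a one-paragraph product description."}],
    }],
    inferenceConfig={"maxTokens": 256, "temperature": 0.5},
)
print(response["output"]["message"]["content"][0]["text"])
```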

Other announcements included two new models: Amazon Nova Canvas, a high-quality image generation model, and Amazon Nova Reel, a video generation model that currently produces six-second clips, with two-minute videos planned within a few months.

Looking ahead, AWS plans to roll out a speech-to-speech model around the first quarter of 2025, allowing users to speak to the model and receive spoken output, followed by an any-to-any multi-modal model by mid-2025 that supports text, voice, image, and video as both input and output.

Beyond proprietary large models, AWS is committed to nurturing an ecosystem of choices. As Swami Sivasubramanian put it at the conference, "A selection of foundational models at your fingertips!" To that end, AWS launched the Amazon Bedrock Marketplace, giving users effortless access to more than a hundred leading large models.

These include prominent models such as Poolside Assistant, Stable Diffusion 3.5, and Luma AI.

As large models take shape in practical applications, inference is set to become a core element of the generative AI workflow. Matt Garman highlighted the significance of inference in AI model applications, particularly for sophisticated models like large language models, which demand high computing power and low latency.

To meet customers' ongoing needs for large model inference, AWS introduced numerous enhancements to Amazon Bedrock, giving users simple access to the hardware optimizations of its Inferentia and Trainium chips. One illustration is the model distillation feature, which makes inference up to 500% faster while cutting costs by 75%. Given just example prompts from an application, Amazon Bedrock automates the distillation process, producing a tailored distilled model with specialized knowledge, reasonable latency, and lower cost.
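Bedrock exposes model customization through the create_model_customization_job operation, so kicking off a distillation might look roughly like the sketch below. The DISTILLATION customization type, model IDs, and S3 paths are assumptions based on the announcement rather than verified request fields.

```python
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# Rough sketch of launching a distillation job: a large "teacher" model
# transfers its behavior to a smaller, cheaper "student" model using
# example prompts. Field values below (customization type, model IDs,
# role ARN, S3 paths) are illustrative assumptions; a teacher-model
# config block may also be required per the Bedrock documentation.
bedrock.create_model_customization_job(
    jobName="support-bot-distillation",
    customModelName="support-bot-distilled",
    roleArn="arn:aws:iam::123456789012:role/BedrockCustomizationRole",
    baseModelIdentifier="amazon.nova-micro-v1:0",  # assumed student model
    customizationType="DISTILLATION",
    trainingDataConfig={"s3Uri": "s3://my-bucket/example-prompts.jsonl"},
    outputDataConfig={"s3Uri": "s3://my-bucket/distillation-output/"},
)
```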

AWS also addressed enterprise-grade applications by launching an automated reasoning check feature that helps detect inaccuracies in model output and provides verifiable evidence of large language model accuracy.

Additionally, a multi-agent collaboration feature was introduced, enabling complex workflows built on Amazon Bedrock, along with latency-optimized inference for users working with state-of-the-art large models.

Recent data indicates that tens of thousands of customers use Amazon Bedrock daily to build applications, a fivefold increase year over year. Through Amazon Bedrock, AWS has forged solid partnerships with a range of model providers while integrating its own large model products, ultimately giving users exceptional functionality and versatile options.

Although AWS has laid the groundwork with remarkable cloud infrastructure and a wide array of large model choices, the journey to high-quality generative AI applications still faces numerous challenges. Beyond a powerful model, a successful application requires robust security, effective communication, user-friendly interfaces, and fast response times, since users expect seamless, rapid performance.

For Jassy, the common misconception is believing that a strong model is sufficient; in reality, an estimated 30% of the development work still remains.

Customers are often intolerant of applications with lingering issues.

Consequently, the key to effective AI applications lies in cloud vendors providing ready-to-use AI development tools. AWS's AI application platform, Amazon Q, offers a comprehensive suite of such tools; Amazon Q Developer, for example, now provides three agents for generating unit tests, documentation, and code reviews, covering users' complete development cycle.

According to Swami, "Amazon Q Developer has topped the SWE-bench leaderboard! It solves 55.8% of software issues effortlessly, with numerous enterprises, including Bundesliga clubs, American Airlines, and British Telecom, leveraging its capabilities."

Additionally, Amazon Q Business enables companies to index their data and connect disparate business systems and data sources, improving the search experience and fostering interactions across various databases in a secure and privacy-conscious manner.

A significant milestone is the integration of QuickSight's capabilities with Amazon Q Business, creating robust dashboards that quickly combine data from multiple sources, such as Salesforce, and enhancing QuickSight's value as a business intelligence tool.

A recent repositioning has transformed Amazon SageMaker into a central hub for data, analytics, and AI, driving the effort to streamline these workflows.

The newly launched Amazon SageMaker Unified Studio offers a cohesive data and AI development environment, granting clients access to their organization's entire dataset and leveraging the best-fit tools.

Swami stated that "Amazon SageMaker has emerged as a one-stop platform for all data analytics and AI needs, transforming complexity into simplicity and redefining the framework of generative AI." In increasingly competitive markets, global enterprises are pursuing efficiency and cost reduction through generative AI while also deploying it internally in their own operations.

This internal pursuit mirrors that of Amazon, where generative AI has been incorporated into various business scenarios, including improvements to the Alexa voice assistant, e-commerce advertising creation, palm payments, contactless retail areas, and prescription drug reading capabilities.

The capabilities honed across Amazon's complex internal digital ecosystem are poised to give AWS customers an enriched experience.
