The texts in this article were partly composed with the help of artificial intelligence and corrected and revised by us.
Introduction: The Need for Scalable Machine Learning Infrastructure
As the development and integration of artificial intelligence and machine learning applications continue to accelerate, one of the most significant bottlenecks in building machine learning systems has become apparent: infrastructure.
With the exponential growth of data and the computational demands that come with it, organizations face mounting pressure to deploy and manage their neural networks efficiently, without sacrificing performance or accuracy. This has given rise to a pressing need for scalable machine learning infrastructure that can keep pace with the rapid evolution of these technologies 1 and with offerings such as Inference as a Service.
In this article, we’ll explore one innovative solution to this challenge: Ollama, a state-of-the-art platform designed specifically to host neural networks on demand.
Ollama as a Service for Neural Network Hosting
Ollama is a platform designed to deploy neural networks, providing a standardized infrastructure for machine learning applications. By utilizing this service, developers can effortlessly deploy, manage, and optimize their neural networks without relying on costly hardware or intricate infrastructural complexities.
While Ollama helps users reduce overall costs and increase performance, they still need to provide suitable hardware of their own to make proper use of the service.
Key Features
Utilizing Ollama for neural network workloads offers several key features that make it a powerful tool:
- Local Execution: Run language models on your own hardware, ensuring privacy and security. 2 3
- User-Friendly Installation: Ollama provides an easy setup process for Windows, macOS, and Linux. 3
- Wide Range of Models: Access to a diverse library of pre-trained models, including Llama 3.3, Mistral, and Phi 3. 3
- Custom Model Creation: Ability to customize and create your own models using Ollama’s lightweight Modelfile structure (see the sketch after this list). 2 3
- GPU Acceleration: Ollama provides support for GPU acceleration to enhance performance and speed up inference. 3
- Integration with Popular Platforms: Seamless integration with other platforms via a built-in REST API for web app interactions. 3
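To illustrate the Modelfile mentioned above, here is a minimal sketch of a customized model; the model name my-penguin-assistant, the temperature value and the system prompt are our own example choices, not defaults shipped with Ollama:
FROM llama3.2
PARAMETER temperature 0.7
SYSTEM You are a concise assistant that answers every question in at most two sentences.
Saving this as a file named Modelfile and running ollama create my-penguin-assistant -f Modelfile registers the customized model, which can afterwards be started with ollama run my-penguin-assistant.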
Running Ollama
In this section, we delve into the process of running Ollama, covering essential steps such as installation, downloading a model, and exemplary usage of a large language model. To ensure a seamless experience, we need to familiarize ourselves with these basic elements before exploring Ollama’s API. By understanding each step involved in running Ollama, we will be able to harness its full potential, ultimately enhancing our machine learning projects and applications.
Installation
To begin the installation process for Ollama, we need to go to this page. Here we will be presented with the installation guidelines.
From this point on, we assume that you have installed either CUDA or ROCm on your system in order to utilize your GPU. Skip this step if you do not intend to use your GPU for faster inference.
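As a quick sanity check that your GPU drivers are working before installing Ollama, you can run the vendor’s monitoring tool; for NVIDIA cards this is nvidia-smi, and ROCm ships a comparable rocm-smi utility:
nvidia-smi
If the command prints a table listing your GPU and driver version, the driver stack is in place.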
A simple one-file installer is available for both Windows and macOS. For Linux, Ollama recommends using curl to download the installation script and executing it directly.
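At the time of writing, the Linux installation boils down to the following single command; please check the official download page for the current instructions, as the script location may change:
curl -fsSL https://ollama.com/install.sh | sh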
After downloading and installing Ollama, we can verify that it is installed correctly. Open a terminal and run the following command to start Ollama: 4
ollama serve
If the installation was successful, this command starts Ollama without reporting any errors. To check this, open another terminal and run the following command:
ollama -v
This should return the Ollama version your system is currently running.
Downloading & Running a Model
In order to download and run a model, we need to select one that suits our needs. A list of all the models available for download from Ollama can be found here. For this example, we will download Llama 3.2. By clicking on the entry in the list of available models or on the link we have provided, we will see the following page content:
The page now shows us all the contents of the llama3.2-3b repository, which we will download in order to use the model. In the top right-hand corner, we also see a command that allows us to run the model locally. To do so, we open a terminal and run the following command to access the selected model:
ollama run llama3.2
Running this command will begin the download process for all necessary files and continue with running the model. Upon completion, we can begin to utilize the command line interface to interact with the model.
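A few companion commands are helpful while working with downloaded models; the examples below assume the llama3.2 model from this section, and their exact output depends on your installation:
ollama list          # show all models that have been downloaded
ollama ps            # show models currently loaded into memory (recent Ollama versions)
ollama rm llama3.2   # remove a model to free up disk space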
Accessing the API
Alongside its simple command-line interface, Ollama provides integration via a local API endpoint. 5 This endpoint can be used to expose downloaded large language models to your local network or to integrate them into other applications.
To showcase Ollama’s API endpoint, we are going to use curl to make a web request.
If you do not have access to curl on your system and want to reproduce our results or are interested in experimenting with the API yourself, we can recommend the following alternative tools: Python Requests, Yet Another REST Client, Postman or VS Code with REST Client extension.
For our example, we are interested in requesting a joke revolving around penguins from Llama 3.2. For this, we open the terminal and enter the following command:
curl http://localhost:11434/api/chat -d '{
"model": "llama3.2",
"messages": [
{
"role": "user",
"content": "Please generate a joke revolving around penguins."
}
],
"stream": false
}'
This sends a basic request to the Ollama endpoint for execution inside the service. As a response, we get the following JSON:
{
"model":"llama3.2",
"created_at":"2024-12-14T12:07:07.948287554Z",
"message":
{
"role":"assistant",
"content":"Why did the penguin take his credit card to the Antarctic?\n\nBecause he wanted to freeze his assets! (get it?)"
},
"done_reason":"stop",
"done":true,
"total_duration":210528850,
"load_duration":34427940,
"prompt_eval_count":34,
"prompt_eval_duration":23000000,
"eval_count":26,
"eval_duration":151000000
}
This output gives us the actual response from the large language model inside the “message” block, while also showing a variety of interesting statistics about the generation process. For example, the “done” flag confirms that the message contains the entire output, and the duration fields (reported in nanoseconds) show that the response was computed in roughly 0.2 seconds. Blazingly fast!
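Alongside /api/chat, Ollama also offers a completion-style endpoint at /api/generate, which takes a single prompt instead of a message history. A minimal request, analogous to the example above, looks like this:
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2",
"prompt": "Please generate a joke revolving around penguins.",
"stream": false
}'
The response has a similar structure, with the generated text returned in a response field instead of a message object.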
Real-World Applications
Ollama can be used wherever large language models play a role. Its fast local inference combined with a standardized API interface makes it well suited to many kinds of language processing. We will outline a few examples where it could be used to increase productivity or provide a second layer of verification alongside human supervision:
- Financial Services: Banks and financial institutions can deploy large language models for personalized customer service, reducing latency and helping customers at times when human support teams are already stretched.
- Retail: Retailers can implement Ollama for personalized marketing, optimizing stock levels and enhancing customer satisfaction.
- Healthcare: Hospitals and medical research institutions can use large language models for tasks such as medical record analysis, patient monitoring and predictive diagnostics, ensuring privacy and security while improving the quality of patient care.
However, because Ollama is limited to processing textual or numerical input, it cannot be used to generate images or other non-text content.
TL;DR
Ollama is a service specifically designed to host and manage large language models. It provides developers, researchers and enterprises with a robust platform that handles the storage, processing and deployment of these complex models. With its user-friendly interface and advanced features such as its API endpoint, Ollama enables users to work with their language models efficiently and effectively. By providing a reliable and flexible solution for managing large language models, it supports faster innovation, improved performance and reduced operational costs in natural language processing.
In our next post, we will explore the ways in which we can interact with Ollama outside of a command line interface, focusing on integration via its API endpoint.