Local AI assistant in vim

Local AI assistant in vim with codellama

The Problem

I have been using Bryley's neoai.nvim in combination with OpenAI's gpt API for a while now, but mainly for playing around and creating regexes, to be honest.

I do most of my coding for my employer, and they are understandably sceptical when it comes to pushing their sensitive code and content to a black-box cloud service, which means that gpt, copilot et al. are off limits for sensitive use cases. Additionally, even for personal projects, I don't like not being in control of what is sent where and when, and of what happens with it once it has been sent.

Possible solution

There are large language models that can be run locally! However, it's not as easy as visiting a website or "buying" an API key, and gluing it all together always looked a bit challenging.

I had been wanting to play around with llama for a long while now, and today was finally the day!

Solution

I created an integrated configuration/solution which downloads the model, runs the llama API service locally, and integrates it into neovim. The individual parts are described below.

Automation

I have been using a dotfiles repository for a long time now, making use of tuning for the setup automation. Everything is in the code and should bootstrap automagically.

Downloading the model

The model is downloaded when running tuning, via a little script.
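
The script itself lives in the dotfiles repo; a minimal sketch of what such a download script could look like, assuming the model is pulled from Hugging Face (the exact source URL is my assumption) into the directory that the container later mounts:

#!/usr/bin/env bash
# Sketch: fetch the quantized codellama model if it is not there yet.
# The download URL is an assumption; the target path matches the volume
# mounted into the llama container below.
set -euo pipefail

MODEL_DIR="$HOME/.local/lib/llama/models"
MODEL="codellama-13b-instruct.Q4_K_M.gguf"
URL="https://huggingface.co/TheBloke/CodeLlama-13B-Instruct-GGUF/resolve/main/$MODEL"

mkdir -p "$MODEL_DIR"
if [ ! -f "$MODEL_DIR/$MODEL" ]; then
  curl -L -o "$MODEL_DIR/$MODEL" "$URL"
fi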

Running the API Service

The llama API service runs in a podman container, orchestrated and managed by systemd. The quadlet configuration is part of the repo.
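
To give an idea of what that looks like, here is a sketch of a quadlet unit built from the image, volume and port used in the podman command further below; the actual file in the repo may differ in details:

# ~/.config/containers/systemd/llama.container (sketch)
[Unit]
Description=llama.cpp API server

[Container]
Image=ghcr.io/gierdo/dotfiles/llama-cpp-python-server:0.2.19
Volume=%h/.local/lib/llama/models:/models
Environment=MODEL=/models/codellama-13b-instruct.Q4_K_M.gguf
PublishPort=9741:8000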

With the configuration and model in place, the API can be brought up with a simple

$ systemctl --user start llama

After that, it listens on http://localhost:9741.
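
The server speaks the OpenAI chat completions API, so a quick smoke test from the shell could look like this (the prompt is just an example):

$ curl http://localhost:9741/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{"model": "codellama", "messages": [{"role": "user", "content": "Write a haiku about vim."}]}'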

However, the nvim integration starts it automatically, if needed.

nvim integration

There is a pretty great nvim plugin for the OpenAI API: neoai.nvim. I created a fork which allows it to use an unauthenticated API at a configurable url, e.g. http://localhost:9741/v1/chat/completions. The first attempt was not nice and just added one hard-coded bit of lua to another hard-coded bit of lua, but now it's configurable, and the merge request is open.

Configuration and installation happen in the neoai.lua configuration file, which also starts the llama API service via systemd when the plugin is loaded.

local utils = require("utils")

-- Start the llama API service via its systemd user unit; runs when the plugin's config is executed
local function startLlama()
  os.execute('systemctl --user start llama')
end

local keymap_opts = { expr = true }

return {
  {
    'gierdo/neoai.nvim',
    branch = 'local-llama',
    cmd = {
      "NeoAI",
      "NeoAIOpen",
      "NeoAIClose",
      "NeoAIToggle",
      "NeoAIContext",
      "NeoAIContextOpen",
      "NeoAIContextClose",
      "NeoAIInject",
      "NeoAIInjectCode",
      "NeoAIInjectContext",
      "NeoAIInjectContextCode",
    },
    keys = {
      {
        mode = "n",
        '<A-a>',
        ':NeoAI<CR>',
        keymap_opts
      },
      {
        mode = "n",
        '<A-i>',
        ':NeoAIInject<Space>',
        keymap_opts
      },

      {
        mode = "v",
        '<A-a>',
        ':NeoAIContext<CR>',
        keymap_opts
      },
      {
        mode = "v",
        '<A-i>',
        ':NeoAIInjectContext<Space>',
        keymap_opts
      }
    },
    config = function()
      startLlama()
      require("neoai").setup({
        ui = {
          output_popup_text = "AI",
          input_popup_text = "Prompt",
          width = 45,               -- As percentage eg. 45%
          output_popup_height = 80, -- As percentage eg. 80%
          submit = "<Enter>",       -- Key binding to submit the prompt
        },
        -- Reuse neoai's "openai" adapter; it talks to the local llama.cpp endpoint set in open_ai.url below
        models = {
          {
            name = "openai",
            model = "codellama",
            params = nil,
          },
        },
        register_output = {
          ["a"] = function(output)
            return output
          end,
          ["c"] = require("neoai.utils").extract_code_snippets,
        },
        inject = {
          cutoff_width = 75,
        },
        prompts = {
          context_prompt = function(context)
            return "I'd like to provide some context for future "
                .. "messages. Here is what I want to refer "
                .. "to in our upcoming conversations:\n\n"
                .. context
          end,
        },
        mappings = {
          ["select_up"] = "<C-k>",
          ["select_down"] = "<C-j>",
        },
        open_ai = {
          url = "http://localhost:9741/v1/chat/completions",
          display_name = "llama.cpp",
          api_key = {
            env = "OPENAI_API_KEY",
            value = nil,
            -- `get` is a function that retrieves an API key; it can be used to override the default method.
            -- get = function() ... end

            -- Here is some code for a function that retrieves an API key. You can use it with
            -- the Linux 'pass' application.
            -- get = function()
            --     local key = vim.fn.system("pass show openai/mytestkey")
            --     key = string.gsub(key, "\n", "")
            --     return key
            -- end,
          },
        },
        shortcuts = {
          --        {
          --            name = "textify",
          --            key = "<leader>as",
          --            desc = "fix text with AI",
          --            use_context = true,
          --            prompt = [[
          --                Please rewrite the text to make it more readable, clear,
          --                concise, and fix any grammatical, punctuation, or spelling
          --                errors
          --            ]],
          --            modes = { "v" },
          --            strip_function = nil,
          --        },
          --        {
          --            name = "gitcommit",
          --            key = "<leader>ag",
          --            desc = "generate git commit message",
          --            use_context = false,
          --            prompt = function()
          --                return [[
          --                    Using the following git diff generate a consise and
          --                    clear git commit message, with a short title summary
          --                    that is 75 characters or less:
          --                ]] .. vim.fn.system("git diff --cached")
          --            end,
          --            modes = { "n" },
          --            strip_function = nil,
          --        },
        },
      })
    end,
    dependencies = {
      'MunifTanjim/nui.nvim',
    }

  },
  'MunifTanjim/nui.nvim',
}

Limitations

The proposed solution only uses CPU vectorization and should (TM) be portable. I don't have a machine with a GPU capable of doing shiny Cuda/hipBLAS/ROCm acceleration. This means that the performance is OK(ish) for a single user at a time, which is exactly what the setup was intended for from the start. Be aware, though, that the running model needs 9 GiB of memory while idling and will make use of pretty much all of your CPU while working. That's also why I set it up to start only when it's actually used.
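
If you want to check what the container is actually consuming on your machine, podman can show its live resource usage (the container name llama matches the setup above):

$ podman stats --no-stream llama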

If you’re done using it and you want to free up the resources again, shut it down with

$ systemctl --user stop llama

If your setup has a ROCm-supported AMD graphics card or a CUDA-supported Nvidia card, you will probably want to spend some time getting llama.cpp to make use of it for a significant performance increase, at the price of reduced portability and added maintenance.

Old versions of podman + systemd

If you have an older version of podman, without quadlet support, the systemd unit for the llama service has to be created differently:

$ podman run -d --name llama -v ~/.local/lib/llama/models/:/models -e MODEL=/models/codellama-13b-instruct.Q4_K_M.gguf -p 9741:8000 ghcr.io/gierdo/dotfiles/llama-cpp-python-server:0.2.19
$ podman generate systemd llama > ~/.config/systemd/user/llama.service
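
After the unit file has been generated, reload the user systemd instance so that it picks up the new unit; from then on the service can be started and stopped with systemctl, just like in the quadlet setup:

$ systemctl --user daemon-reload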
