
🦙 LLaMA Kotlin Android


A Kotlin-first Android library for running LLaMA models on-device using llama.cpp. A lightweight, easy-to-use API with full coroutine support, following modern Android best practices.



📸 Screenshots from Sample App

Chat Interface (screenshot)

🚀 Quick Start

1. Add Dependency

dependencies {
    implementation("io.github.it5prasoon:llama-kotlin-android:0.1.0")
}

2. Download a Model

Recommended models for Android:

Model            Size     RAM     Download
Phi-3.5-mini ⭐   ~2.4GB   4-6GB   Download
TinyLlama-1.1B   ~670MB   2GB     Download
Llama-3.2-3B     ~2GB     6-8GB   Download
Qwen2.5-1.5B     ~1GB     3-4GB   Download
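
If you want to fetch a model at runtime instead of sideloading it, a minimal download sketch follows (the downloadModel function and its arguments are illustrative, not part of the library; substitute a real GGUF URL and call it from Dispatchers.IO, never the main thread):

import java.io.File
import java.net.HttpURLConnection
import java.net.URL

// Streams a GGUF file to app-private storage. For multi-gigabyte files,
// production code should add resume support (e.g. Android's DownloadManager).
fun downloadModel(modelUrl: String, destination: File) {
    val connection = URL(modelUrl).openConnection() as HttpURLConnection
    try {
        connection.inputStream.use { input ->
            destination.outputStream().use { output ->
                input.copyTo(output)
            }
        }
    } finally {
        connection.disconnect()
    }
}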

3. Basic Usage

import android.os.Bundle
import androidx.appcompat.app.AppCompatActivity
import androidx.lifecycle.lifecycleScope
import com.llamakotlin.android.LlamaModel
import kotlinx.coroutines.launch

class MyActivity : AppCompatActivity() {
    private var model: LlamaModel? = null

    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)

        // Load the model off the main thread
        lifecycleScope.launch {
            model = LlamaModel.load("/path/to/model.gguf") {
                contextSize = 2048
                threads = 4
                temperature = 0.7f
            }
        }
    }

    // Generate a response (streaming)
    private fun generate() {
        lifecycleScope.launch {
            model?.generateStream("Hello, how are you?")
                ?.collect { token ->
                    textView.append(token)  // textView: a TextView from your layout
                }
        }
    }

    // Clean up native resources
    override fun onDestroy() {
        super.onDestroy()
        model?.close()
    }
}

4. Run Sample App

# Clone and build
git clone --recursive https://github.com/it5prasoon/llama-kotlin-android.git
cd llama-kotlin-android
./gradlew :sample:installDebug

# Open app, select model file, start chatting!

✨ Features

  • On-device inference - No internet required, complete privacy
  • Kotlin-first API - Idiomatic, DSL-style configuration
  • Full coroutine support - Flow<String> for streaming, structured concurrency
  • Conversation history - Sample app maintains multi-turn context
  • Multiple quantization formats - Q4_0, Q4_K_M, Q5_K_M, Q8_0
  • Auto-load models - Sample app remembers the last used model
  • Small footprint - ~15 MB library size (without models)
  • Memory safe - Automatic resource cleanup with the Closeable pattern

📚 API Reference

LlamaModel

class LlamaModel : Closeable {
    
    companion object {
        // Load a GGUF model
        suspend fun load(
            modelPath: String, 
            config: LlamaConfig.() -> Unit = {}
        ): LlamaModel
        
        // Get library version
        fun getVersion(): String
    }
    
    // One-shot generation
    suspend fun generate(prompt: String): String
    
    // Streaming generation (recommended)
    fun generateStream(prompt: String): Flow<String>
    
    // Cancel ongoing generation
    fun cancelGeneration()
    
    // Check if model is loaded
    val isLoaded: Boolean
    
    // Clean up resources
    override fun close()
}
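
Since generateStream returns a Flow, generation can be tied to a coroutine scope and stopped either by cancelling the collecting job or via cancelGeneration(). A minimal sketch (the helper functions here are illustrative, not library API):

import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Job
import kotlinx.coroutines.flow.launchIn
import kotlinx.coroutines.flow.onEach

// Launch generation in a scope so it is cancelled along with the UI that owns it.
fun startGeneration(scope: CoroutineScope, model: LlamaModel, prompt: String): Job =
    model.generateStream(prompt)
        .onEach { token -> print(token) }  // replace with a UI update
        .launchIn(scope)

fun stopGeneration(job: Job, model: LlamaModel) {
    job.cancel()              // stop collecting the Flow
    model.cancelGeneration()  // ask the native side to stop producing tokens
}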

Configuration DSL

val model = LlamaModel.load(modelPath) {
    // Context
    contextSize = 2048          // Max context length
    batchSize = 512             // Batch size for prompt processing
    
    // Threading
    threads = 4                 // Number of threads
    threadsBatch = 4            // Threads for batch processing
    
    // Sampling
    temperature = 0.7f          // Randomness (0.0 - 2.0)
    topP = 0.9f                // Nucleus sampling
    topK = 40                   // Top-K sampling
    repeatPenalty = 1.1f       // Repetition penalty
    
    // Generation limits
    maxTokens = 512            // Max tokens to generate
    seed = -1                  // Random seed (-1 = random)
    
    // Memory options
    useMmap = true             // Memory-map model file
    useMlock = false           // Lock model in RAM
    gpuLayers = 0              // GPU layers (0 = CPU only)
}
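
Thread count is the most device-dependent knob. One reasonable starting point (an assumption, not a library recommendation) is to derive it from the core count and leave headroom for the UI:

// Derive a thread count from the device instead of hard-coding it.
val cores = Runtime.getRuntime().availableProcessors()
val model = LlamaModel.load(modelPath) {
    threads = (cores - 2).coerceAtLeast(1)       // leave headroom for the UI thread
    threadsBatch = (cores - 2).coerceAtLeast(1)
}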

Exception Handling

try {
    val model = LlamaModel.load(path)
} catch (e: LlamaException.ModelNotFound) {
    // File doesn't exist
} catch (e: LlamaException.ModelLoadError) {
    // Failed to load (invalid format, OOM, etc.)
} catch (e: LlamaException.GenerationError) {
    // Generation failed
}
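
If you prefer Result-style error handling, the same failures can be wrapped in one place. A minimal sketch, assuming LlamaException is the common supertype of the variants caught above:

suspend fun loadModelSafely(path: String): Result<LlamaModel> =
    try {
        Result.success(LlamaModel.load(path))
    } catch (e: LlamaException) {
        Result.failure(e)  // ModelNotFound, ModelLoadError, etc.
    }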

💬 Chat Templates

Different models require different prompt formats. Here are examples:

Llama 3.2 / 3.1 Format

fun formatLlama3Prompt(system: String, user: String): String {
    return buildString {
        append("<|begin_of_text|>")
        append("<|start_header_id|>system<|end_header_id|>\n\n")
        append(system)
        append("<|eot_id|>")
        append("<|start_header_id|>user<|end_header_id|>\n\n")
        append(user)
        append("<|eot_id|>")
        append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    }
}

Phi-3 Format

fun formatPhi3Prompt(system: String, user: String): String {
    return "<|system|>\n$system<|end|>\n<|user|>\n$user<|end|>\n<|assistant|>\n"
}

ChatML Format (Qwen, etc.)

fun formatChatML(system: String, user: String): String {
    return buildString {
        append("<|im_start|>system\n$system<|im_end|>\n")
        append("<|im_start|>user\n$user<|im_end|>\n")
        append("<|im_start|>assistant\n")
    }
}
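
For multi-turn chat (as in the sample app's conversation history), earlier turns can be replayed into the prompt. A sketch built on the ChatML format above; the Turn class and history handling are illustrative, not the sample app's actual code:

data class Turn(val user: String, val assistant: String)

// Replay earlier turns so the model sees the whole conversation.
fun formatChatMLWithHistory(system: String, history: List<Turn>, user: String): String =
    buildString {
        append("<|im_start|>system\n$system<|im_end|>\n")
        for (turn in history) {
            append("<|im_start|>user\n${turn.user}<|im_end|>\n")
            append("<|im_start|>assistant\n${turn.assistant}<|im_end|>\n")
        }
        append("<|im_start|>user\n$user<|im_end|>\n")
        append("<|im_start|>assistant\n")
    }

Note that the replayed history must still fit within contextSize; older turns should be dropped or summarized once the prompt grows too long.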

📦 Supported Models

Recommended for Mobile

Model            Size    Quality  Speed   Best For
Phi-3.5-mini     2.4GB   ⭐⭐⭐⭐     ⭐⭐⭐     General use
TinyLlama-1.1B   670MB   ⭐⭐       ⭐⭐⭐⭐⭐    Testing, low-end devices
Qwen2.5-1.5B     1GB     ⭐⭐⭐      ⭐⭐⭐⭐    Coding, reasoning
Llama-3.2-3B     2GB     ⭐⭐⭐⭐⭐    ⭐⭐      High-quality chat
Gemma-2B         1.2GB   ⭐⭐⭐      ⭐⭐⭐⭐    Google alternative

Quantization Formats

Format   Size      Quality  When to Use
Q4_0     Smallest  Lower    Memory constrained
Q4_K_M   Small     Good     Recommended default
Q5_K_M   Medium    Better   Quality priority
Q8_0     Large     Best     Accuracy critical
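
A model only fits if the device has RAM to spare beyond the file size. A rough pre-flight check using Android's ActivityManager (the 1.5x headroom factor is an assumption covering KV cache and runtime overhead, not a measured requirement):

import android.app.ActivityManager
import android.content.Context

// Rough heuristic: weights plus KV cache and runtime overhead.
fun hasRoomForModel(context: Context, modelSizeBytes: Long): Boolean {
    val activityManager =
        context.getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager
    val memoryInfo = ActivityManager.MemoryInfo()
    activityManager.getMemoryInfo(memoryInfo)
    return memoryInfo.availMem > modelSizeBytes * 3 / 2
}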

πŸ—οΈ Architecture

┌─────────────────────────────────────────────┐
│              Your Application               │
│    (Activity/ViewModel using LlamaModel)    │
└─────────────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────┐
│             Kotlin API Layer                │
│  ┌─────────────┐  ┌─────────────────────┐   │
│  │ LlamaModel  │  │   LlamaConfig DSL   │   │
│  │  (suspend)  │  │   LlamaException    │   │
│  └─────────────┘  └─────────────────────┘   │
│             Coroutines + Flow               │
└─────────────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────┐
│                 JNI Bridge                  │
│          llama_jni.cpp + Wrappers           │
└─────────────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────┐
│                Native Layer                 │
│     llama.cpp (GGML, GGUF, Inference)       │
│          ARM NEON optimizations             │
└─────────────────────────────────────────────┘

Project Structure

llama-kotlin-android/
├── app/                          # 📦 Library module
│   └── src/main/
│       ├── cpp/                  # C++ native code
│       │   ├── llama.cpp/        # llama.cpp submodule
│       │   ├── llama_jni.cpp     # JNI bridge
│       │   └── llama_context_wrapper.cpp
│       └── java/com/llamakotlin/
│           ├── LlamaModel.kt     # Main API
│           ├── LlamaConfig.kt    # Configuration
│           └── exception/        # Exceptions
│
├── sample/                       # 📱 Sample app
│   └── src/main/
│       ├── java/.../MainActivity.kt
│       └── res/layout/activity_main.xml
│
└── README.md

πŸ› οΈ Building from Source

Prerequisites

  • Android Studio Ladybug+
  • NDK 27.3.13750724
  • CMake 3.22.1+

Build Steps

# Clone with submodules
git clone --recursive https://github.com/it5prasoon/llama-kotlin-android.git
cd llama-kotlin-android

# Update submodule if needed
git submodule update --init --recursive

# Build library
./gradlew :app:assembleRelease

# Build and install sample
./gradlew :sample:installDebug

Build Outputs

  • AAR: app/build/outputs/aar/app-release.aar
  • Sample APK: sample/build/outputs/apk/debug/sample-debug.apk

📋 Requirements

Requirement    Minimum     Recommended
Android API    24 (7.0)    26+ (8.0+)
RAM            3 GB        6+ GB
Storage        1 GB        4+ GB
Architecture   arm64-v8a   arm64-v8a

Supported ABIs

  • ✅ arm64-v8a (64-bit ARM)
  • ❌ armeabi-v7a (disabled; llama.cpp incompatible)
  • ✅ x86_64 (emulator)
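
If your app declares its own ABI set, make sure it matches the list above. A minimal sketch for the app's build.gradle.kts (illustrative, not configuration the library requires):

android {
    defaultConfig {
        ndk {
            // Restrict packaging to the ABIs this library actually ships.
            abiFilters += listOf("arm64-v8a", "x86_64")
        }
    }
}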

🤝 Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create feature branch (git checkout -b feature/amazing)
  3. Commit changes (git commit -m 'Add amazing feature')
  4. Push (git push origin feature/amazing)
  5. Open Pull Request

📄 License

MIT License - see LICENSE file.


πŸ™ Acknowledgments

  • llama.cpp - Incredible C++ inference engine
  • ggml - Tensor library for ML
  • Meta AI - LLaMA models

Made with ❤️ for the Android community
