
🦙 LLaMA Kotlin Android


A Kotlin-first Android library for running LLaMA models on-device using llama.cpp. A lightweight, easy-to-use API with full coroutine support, following modern Android best practices.



📸 Screenshots from Sample App

Chat Interface (screenshot)

🚀 Quick Start

1. Add Dependency

dependencies {
    implementation("io.github.it5prasoon:llama-kotlin-android:0.1.0")
}

2. Download a Model

Recommended models for Android:

Model            Size     RAM     Download
Phi-3.5-mini ⭐   ~2.4GB   4-6GB   Download
TinyLlama-1.1B   ~670MB   2GB     Download
Llama-3.2-3B     ~2GB     6-8GB   Download
Qwen2.5-1.5B     ~1GB     3-4GB   Download
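
If you want to fetch a model at runtime instead of sideloading it, a minimal download sketch follows (the downloadModel function and its arguments are illustrative, not part of the library; substitute a real GGUF URL and call it from Dispatchers.IO, never the main thread):

import java.io.File
import java.net.HttpURLConnection
import java.net.URL

// Streams a GGUF file to app-private storage. For multi-gigabyte files,
// production code should add resume support (e.g. Android's DownloadManager).
fun downloadModel(modelUrl: String, destination: File) {
    val connection = URL(modelUrl).openConnection() as HttpURLConnection
    try {
        connection.inputStream.use { input ->
            destination.outputStream().use { output ->
                input.copyTo(output)
            }
        }
    } finally {
        connection.disconnect()
    }
}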

3. Basic Usage

import android.os.Bundle
import androidx.appcompat.app.AppCompatActivity
import androidx.lifecycle.lifecycleScope
import com.llamakotlin.android.LlamaModel
import kotlinx.coroutines.launch

class MyActivity : AppCompatActivity() {
    private var model: LlamaModel? = null

    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)

        // Load the model off the main thread
        lifecycleScope.launch {
            model = LlamaModel.load("/path/to/model.gguf") {
                contextSize = 2048
                threads = 4
                temperature = 0.7f
            }
        }
    }

    // Generate a response (streaming)
    private fun generate() {
        lifecycleScope.launch {
            model?.generateStream("Hello, how are you?")
                ?.collect { token ->
                    textView.append(token)  // textView: a TextView from your layout
                }
        }
    }

    // Clean up native resources
    override fun onDestroy() {
        super.onDestroy()
        model?.close()
    }
}

4. Run Sample App

# Clone and build
git clone --recursive https://github.com/it5prasoon/llama-kotlin-android.git
cd llama-kotlin-android
./gradlew :sample:installDebug

# Open app, select model file, start chatting!

✨ Features

  • On-device inference - No internet required, complete privacy
  • Kotlin-first API - Idiomatic, DSL-style configuration
  • Full coroutine support - Flow<String> for streaming, structured concurrency
  • Conversation history - Sample app maintains multi-turn context
  • Multiple quantization formats - Q4_0, Q4_K_M, Q5_K_M, Q8_0
  • Auto-load models - Sample app remembers the last used model
  • Small footprint - ~15 MB library size (without models)
  • Memory safe - Automatic resource cleanup with the Closeable pattern

📚 API Reference

LlamaModel

class LlamaModel : Closeable {
    
    companion object {
        // Load a GGUF model
        suspend fun load(
            modelPath: String, 
            config: LlamaConfig.() -> Unit = {}
        ): LlamaModel
        
        // Get library version
        fun getVersion(): String
    }
    
    // One-shot generation
    suspend fun generate(prompt: String): String
    
    // Streaming generation (recommended)
    fun generateStream(prompt: String): Flow<String>
    
    // Cancel ongoing generation
    fun cancelGeneration()
    
    // Check if model is loaded
    val isLoaded: Boolean
    
    // Clean up resources
    override fun close()
}
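
Since generateStream returns a Flow, generation can be tied to a coroutine scope and stopped either by cancelling the collecting job or via cancelGeneration(). A minimal sketch (the helper functions here are illustrative, not library API):

import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Job
import kotlinx.coroutines.flow.launchIn
import kotlinx.coroutines.flow.onEach

// Launch generation in a scope so it is cancelled along with the UI that owns it.
fun startGeneration(scope: CoroutineScope, model: LlamaModel, prompt: String): Job =
    model.generateStream(prompt)
        .onEach { token -> print(token) }  // replace with a UI update
        .launchIn(scope)

fun stopGeneration(job: Job, model: LlamaModel) {
    job.cancel()              // stop collecting the Flow
    model.cancelGeneration()  // ask the native side to stop producing tokens
}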

Configuration DSL

val model = LlamaModel.load(modelPath) {
    // Context
    contextSize = 2048          // Max context length
    batchSize = 512             // Batch size for prompt processing
    
    // Threading
    threads = 4                 // Number of threads
    threadsBatch = 4            // Threads for batch processing
    
    // Sampling
    temperature = 0.7f          // Randomness (0.0 - 2.0)
    topP = 0.9f                // Nucleus sampling
    topK = 40                   // Top-K sampling
    repeatPenalty = 1.1f       // Repetition penalty
    
    // Generation limits
    maxTokens = 512            // Max tokens to generate
    seed = -1                  // Random seed (-1 = random)
    
    // Memory options
    useMmap = true             // Memory-map model file
    useMlock = false           // Lock model in RAM
    gpuLayers = 0              // GPU layers (0 = CPU only)
}
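
Thread count is the most device-dependent knob. One reasonable starting point (an assumption, not a library recommendation) is to derive it from the core count and leave headroom for the UI:

// Derive a thread count from the device instead of hard-coding it.
val cores = Runtime.getRuntime().availableProcessors()
val model = LlamaModel.load(modelPath) {
    threads = (cores - 2).coerceAtLeast(1)       // leave headroom for the UI thread
    threadsBatch = (cores - 2).coerceAtLeast(1)
}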

Exception Handling

try {
    val model = LlamaModel.load(path)
} catch (e: LlamaException.ModelNotFound) {
    // File doesn't exist
} catch (e: LlamaException.ModelLoadError) {
    // Failed to load (invalid format, OOM, etc.)
} catch (e: LlamaException.GenerationError) {
    // Generation failed
}
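
If you prefer Result-style error handling, the same failures can be wrapped in one place. A minimal sketch, assuming LlamaException is the common supertype of the variants caught above:

suspend fun loadModelSafely(path: String): Result<LlamaModel> =
    try {
        Result.success(LlamaModel.load(path))
    } catch (e: LlamaException) {
        Result.failure(e)  // ModelNotFound, ModelLoadError, etc.
    }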

💬 Chat Templates

Different models require different prompt formats. Here are examples:

Llama 3.2 / 3.1 Format

fun formatLlama3Prompt(system: String, user: String): String {
    return buildString {
        append("<|begin_of_text|>")
        append("<|start_header_id|>system<|end_header_id|>\n\n")
        append(system)
        append("<|eot_id|>")
        append("<|start_header_id|>user<|end_header_id|>\n\n")
        append(user)
        append("<|eot_id|>")
        append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    }
}

Phi-3 Format

fun formatPhi3Prompt(system: String, user: String): String {
    return "<|system|>\n$system<|end|>\n<|user|>\n$user<|end|>\n<|assistant|>\n"
}

ChatML Format (Qwen, etc.)

fun formatChatML(system: String, user: String): String {
    return buildString {
        append("<|im_start|>system\n$system<|im_end|>\n")
        append("<|im_start|>user\n$user<|im_end|>\n")
        append("<|im_start|>assistant\n")
    }
}
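
For multi-turn chat (as in the sample app's conversation history), earlier turns can be replayed into the prompt. A sketch built on the ChatML format above; the Turn class and history handling are illustrative, not the sample app's actual code:

data class Turn(val user: String, val assistant: String)

// Replay earlier turns so the model sees the whole conversation.
fun formatChatMLWithHistory(system: String, history: List<Turn>, user: String): String =
    buildString {
        append("<|im_start|>system\n$system<|im_end|>\n")
        for (turn in history) {
            append("<|im_start|>user\n${turn.user}<|im_end|>\n")
            append("<|im_start|>assistant\n${turn.assistant}<|im_end|>\n")
        }
        append("<|im_start|>user\n$user<|im_end|>\n")
        append("<|im_start|>assistant\n")
    }

Note that the replayed history must still fit within contextSize; older turns should be dropped or summarized once the prompt grows too long.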

📦 Supported Models

Recommended for Mobile

Model            Size    Quality  Speed   Best For
Phi-3.5-mini     2.4GB   ⭐⭐⭐⭐     ⭐⭐⭐     General use
TinyLlama-1.1B   670MB   ⭐⭐       ⭐⭐⭐⭐⭐    Testing, low-end devices
Qwen2.5-1.5B     1GB     ⭐⭐⭐      ⭐⭐⭐⭐    Coding, reasoning
Llama-3.2-3B     2GB     ⭐⭐⭐⭐⭐    ⭐⭐      High-quality chat
Gemma-2B         1.2GB   ⭐⭐⭐      ⭐⭐⭐⭐    Google alternative

Quantization Formats

Format   Size      Quality  When to Use
Q4_0     Smallest  Lower    Memory constrained
Q4_K_M   Small     Good     Recommended default
Q5_K_M   Medium    Better   Quality priority
Q8_0     Large     Best     Accuracy critical
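
A model only fits if the device has RAM to spare beyond the file size. A rough pre-flight check using Android's ActivityManager (the 1.5x headroom factor is an assumption covering KV cache and runtime overhead, not a measured requirement):

import android.app.ActivityManager
import android.content.Context

// Rough heuristic: weights plus KV cache and runtime overhead.
fun hasRoomForModel(context: Context, modelSizeBytes: Long): Boolean {
    val activityManager =
        context.getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager
    val memoryInfo = ActivityManager.MemoryInfo()
    activityManager.getMemoryInfo(memoryInfo)
    return memoryInfo.availMem > modelSizeBytes * 3 / 2
}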

πŸ—οΈ Architecture

┌─────────────────────────────────────────────┐
│              Your Application               │
│    (Activity/ViewModel using LlamaModel)    │
└─────────────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────┐
│             Kotlin API Layer                │
│  ┌─────────────┐  ┌─────────────────────┐   │
│  │ LlamaModel  │  │   LlamaConfig DSL   │   │
│  │  (suspend)  │  │   LlamaException    │   │
│  └─────────────┘  └─────────────────────┘   │
│             Coroutines + Flow               │
└─────────────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────┐
│                 JNI Bridge                  │
│          llama_jni.cpp + Wrappers           │
└─────────────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────┐
│                Native Layer                 │
│     llama.cpp (GGML, GGUF, Inference)       │
│          ARM NEON optimizations             │
└─────────────────────────────────────────────┘

Project Structure

llama-kotlin-android/
├── app/                          # 📦 Library module
│   └── src/main/
│       ├── cpp/                  # C++ native code
│       │   ├── llama.cpp/        # llama.cpp submodule
│       │   ├── llama_jni.cpp     # JNI bridge
│       │   └── llama_context_wrapper.cpp
│       └── java/com/llamakotlin/
│           ├── LlamaModel.kt     # Main API
│           ├── LlamaConfig.kt    # Configuration
│           └── exception/        # Exceptions
│
├── sample/                       # 📱 Sample app
│   └── src/main/
│       ├── java/.../MainActivity.kt
│       └── res/layout/activity_main.xml
│
└── README.md

πŸ› οΈ Building from Source

Prerequisites

  • Android Studio Ladybug+
  • NDK 27.3.13750724
  • CMake 3.22.1+

Build Steps

# Clone with submodules
git clone --recursive https://github.com/it5prasoon/llama-kotlin-android.git
cd llama-kotlin-android

# Update submodule if needed
git submodule update --init --recursive

# Build library
./gradlew :app:assembleRelease

# Build and install sample
./gradlew :sample:installDebug

Build Outputs

  • AAR: app/build/outputs/aar/app-release.aar
  • Sample APK: sample/build/outputs/apk/debug/sample-debug.apk

📋 Requirements

Requirement    Minimum     Recommended
Android API    24 (7.0)    26+ (8.0+)
RAM            3 GB        6+ GB
Storage        1 GB        4+ GB
Architecture   arm64-v8a   arm64-v8a

Supported ABIs

  • ✅ arm64-v8a (64-bit ARM)
  • ❌ armeabi-v7a (disabled; llama.cpp incompatible)
  • ✅ x86_64 (emulator)
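
If your app declares its own ABI set, make sure it matches the list above. A minimal sketch for the app's build.gradle.kts (illustrative, not configuration the library requires):

android {
    defaultConfig {
        ndk {
            // Restrict packaging to the ABIs this library actually ships.
            abiFilters += listOf("arm64-v8a", "x86_64")
        }
    }
}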

🤝 Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create feature branch (git checkout -b feature/amazing)
  3. Commit changes (git commit -m 'Add amazing feature')
  4. Push (git push origin feature/amazing)
  5. Open Pull Request

📄 License

MIT License - see LICENSE file.


πŸ™ Acknowledgments

  • llama.cpp - Incredible C++ inference engine
  • ggml - Tensor library for ML
  • Meta AI - LLaMA models

Made with ❤️ for the Android community
