A Kotlin-first Android library for running LLaMA models on-device using llama.cpp. It offers a lightweight, easy-to-use API with full coroutine support, following modern Android best practices.
- Screenshots
- Quick Start
- Features
- API Reference
- Chat Templates
- Supported Models
- Architecture
- Building from Source
- Requirements
- Contributing
- License
## Screenshots

| Chat Interface |
|---|
| ![]() |
## Quick Start

Add the dependency:

```kotlin
dependencies {
    implementation("io.github.it5prasoon:llama-kotlin-android:0.1.0")
}
```

Recommended models for Android:
| Model | Size | RAM | Download |
|---|---|---|---|
| Phi-3.5-mini ⭐ | ~2.4GB | 4-6GB | Download |
| TinyLlama-1.1B | ~670MB | 2GB | Download |
| Llama-3.2-3B | ~2GB | 6-8GB | Download |
| Qwen2.5-1.5B | ~1GB | 3-4GB | Download |
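
For on-device use, the GGUF file has to reach local storage first. A hypothetical helper sketch (placeholder URL and filename, assuming kotlinx-coroutines); for multi-gigabyte models a resumable downloader such as WorkManager or OkHttp is a better fit:

```kotlin
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext
import java.io.File
import java.net.URL

// Download a GGUF into app-private storage once, then pass its path
// to LlamaModel.load(). Call as fetchModel(context.filesDir).
suspend fun fetchModel(targetDir: File): File = withContext(Dispatchers.IO) {
    val target = File(targetDir, "model.gguf")
    if (!target.exists()) {
        URL("https://example.com/model-q4_k_m.gguf").openStream().use { input ->
            target.outputStream().use { output -> input.copyTo(output) }
        }
    }
    target
}
```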
Then load the model and generate:

```kotlin
import android.os.Bundle
import androidx.appcompat.app.AppCompatActivity
import androidx.lifecycle.lifecycleScope
import com.llamakotlin.android.LlamaModel
import kotlinx.coroutines.launch

class MyActivity : AppCompatActivity() {
    private var model: LlamaModel? = null

    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)

        lifecycleScope.launch {
            // Load model
            model = LlamaModel.load("/path/to/model.gguf") {
                contextSize = 2048
                threads = 4
                temperature = 0.7f
            }

            // Generate response (streaming) once loading has finished
            model?.generateStream("Hello, how are you?")
                ?.collect { token ->
                    textView.append(token) // textView: a TextView in your layout
                }
        }
    }

    // Clean up
    override fun onDestroy() {
        super.onDestroy()
        model?.close()
    }
}
```

Or build and run the sample app:

```bash
# Clone and build
git clone --recursive https://github.com/it5prasoon/llama-kotlin-android.git
cd llama-kotlin-android
./gradlew :sample:installDebug
# Open app, select model file, start chatting!
```

## Features

- On-device inference - No internet required, complete privacy
- Kotlin-first API - Idiomatic, DSL-style configuration
- Full Coroutine Support - `Flow<String>` for streaming, structured concurrency
- Conversation History - Sample app maintains multi-turn context
- Multiple quantization - Q4_0, Q4_K_M, Q5_K_M, Q8_0 support
- Auto-load models - Sample app remembers last used model
- Small footprint - ~15 MB library size (without models)
- Memory safe - Automatic resource cleanup with the `Closeable` pattern (see the sketch below)
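
Because `LlamaModel` implements `Closeable`, Kotlin's `use` block releases the native context even if generation throws. A minimal sketch, assuming only the documented `load`/`generate` API:

```kotlin
import com.llamakotlin.android.LlamaModel

// One-shot generation with automatic cleanup: `use` closes the model
// whether generate() succeeds or throws.
suspend fun oneShot(modelPath: String, prompt: String): String =
    LlamaModel.load(modelPath).use { model ->
        model.generate(prompt)
    }
```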

## API Reference

```kotlin
class LlamaModel : Closeable {
    companion object {
        // Load a GGUF model
        suspend fun load(
            modelPath: String,
            config: LlamaConfig.() -> Unit = {}
        ): LlamaModel

        // Get library version
        fun getVersion(): String
    }

    // One-shot generation
    suspend fun generate(prompt: String): String

    // Streaming generation (recommended)
    fun generateStream(prompt: String): Flow<String>

    // Cancel ongoing generation
    fun cancelGeneration()

    // Check if model is loaded
    val isLoaded: Boolean

    // Clean up resources
    override fun close()
}
```
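
The streaming and cancellation pieces combine naturally with a coroutine `Job`. A sketch of hypothetical wiring (the `ChatController` name and the exact interplay of `cancelGeneration()` with Flow collection are assumptions, not part of the documented API):

```kotlin
import com.llamakotlin.android.LlamaModel
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Job
import kotlinx.coroutines.launch

// Wires generateStream() into a cancellable coroutine so a UI event
// (e.g. a Stop button) can end generation mid-stream.
class ChatController(
    private val model: LlamaModel,
    private val scope: CoroutineScope
) {
    private var job: Job? = null

    fun start(prompt: String, onToken: (String) -> Unit) {
        job = scope.launch {
            model.generateStream(prompt).collect { token -> onToken(token) }
        }
    }

    fun stop() {
        model.cancelGeneration() // ask the native side to stop decoding
        job?.cancel()            // stop collecting the Flow
    }
}
```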

### Configuration

```kotlin
val model = LlamaModel.load(modelPath) {
    // Context
    contextSize = 2048      // Max context length
    batchSize = 512         // Batch size for prompt processing

    // Threading
    threads = 4             // Number of threads
    threadsBatch = 4        // Threads for batch processing

    // Sampling
    temperature = 0.7f      // Randomness (0.0 - 2.0)
    topP = 0.9f             // Nucleus sampling
    topK = 40               // Top-K sampling
    repeatPenalty = 1.1f    // Repetition penalty

    // Generation limits
    maxTokens = 512         // Max tokens to generate
    seed = -1               // Random seed (-1 = random)

    // Memory options
    useMmap = true          // Memory-map model file
    useMlock = false        // Lock model in RAM
    gpuLayers = 0           // GPU layers (0 = CPU only)
}
```

### Error Handling

```kotlin
try {
    val model = LlamaModel.load(path)
} catch (e: LlamaException.ModelNotFound) {
    // File doesn't exist
} catch (e: LlamaException.ModelLoadError) {
    // Failed to load (invalid format, OOM, etc.)
} catch (e: LlamaException.GenerationError) {
    // Generation failed
}
```

## Chat Templates

Different models require different prompt formats. Here are examples:

### Llama 3

```kotlin
fun formatLlama3Prompt(system: String, user: String): String {
    return buildString {
        append("<|begin_of_text|>")
        append("<|start_header_id|>system<|end_header_id|>\n\n")
        append(system)
        append("<|eot_id|>")
        append("<|start_header_id|>user<|end_header_id|>\n\n")
        append(user)
        append("<|eot_id|>")
        append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    }
}
```

### Phi-3

```kotlin
fun formatPhi3Prompt(system: String, user: String): String {
return "<|system|>\n$system<|end|>\n<|user|>\n$user<|end|>\n<|assistant|>\n"
}
```

### ChatML (Qwen)

```kotlin
fun formatChatML(system: String, user: String): String {
    return buildString {
        append("<|im_start|>system\n$system<|im_end|>\n")
        append("<|im_start|>user\n$user<|im_end|>\n")
        append("<|im_start|>assistant\n")
    }
}
```
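
Putting a template together with streaming generation, using the documented API and the `formatChatML` helper above (the system prompt is just an example):

```kotlin
import com.llamakotlin.android.LlamaModel

// Format the prompt for a ChatML model (e.g. Qwen), then collect the
// streamed tokens into a single response string.
suspend fun ask(model: LlamaModel, question: String): String {
    val prompt = formatChatML(
        system = "You are a concise assistant.",
        user = question
    )
    val out = StringBuilder()
    model.generateStream(prompt).collect { token -> out.append(token) }
    return out.toString()
}
```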

## Supported Models

| Model | Size | Quality | Speed | Best For |
|---|---|---|---|---|
| Phi-3.5-mini | 2.4GB | ⭐⭐⭐⭐ | ⭐⭐⭐ | General use |
| TinyLlama-1.1B | 670MB | ⭐⭐ | ⭐⭐⭐⭐⭐ | Testing, low-end devices |
| Qwen2.5-1.5B | 1GB | ⭐⭐⭐ | ⭐⭐⭐⭐ | Coding, reasoning |
| Llama-3.2-3B | 2GB | ⭐⭐⭐⭐⭐ | ⭐⭐ | High quality chat |
| Gemma-2B | 1.2GB | ⭐⭐⭐ | ⭐⭐⭐⭐ | Google alternative |

### Quantization Formats

| Format | Size | Quality | When to Use |
|---|---|---|---|
| Q4_0 | Smallest | Lower | Memory constrained |
| Q4_K_M | Small | Good | Recommended |
| Q5_K_M | Medium | Better | Quality priority |
| Q8_0 | Large | Best | Accuracy critical |
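
One way to map these tables onto a concrete device is to read total RAM and pick a tier. A minimal sketch; the thresholds are illustrative, not benchmarked:

```kotlin
import android.app.ActivityManager
import android.content.Context

// Suggest a quantization format from total device RAM.
fun suggestQuant(context: Context): String {
    val am = context.getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager
    val info = ActivityManager.MemoryInfo().also { am.getMemoryInfo(it) }
    val totalGb = info.totalMem / (1024.0 * 1024.0 * 1024.0)
    return when {
        totalGb >= 8 -> "Q8_0"   // accuracy critical
        totalGb >= 6 -> "Q5_K_M" // quality priority
        totalGb >= 4 -> "Q4_K_M" // recommended default
        else -> "Q4_0"           // memory constrained
    }
}
```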

## Architecture

```
┌─────────────────────────────────────────────┐
│              Your Application               │
│    (Activity/ViewModel using LlamaModel)    │
└─────────────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────┐
│              Kotlin API Layer               │
│  ┌─────────────┐  ┌─────────────────────┐   │
│  │ LlamaModel  │  │  LlamaConfig DSL    │   │
│  │  (suspend)  │  │  LlamaException     │   │
│  └─────────────┘  └─────────────────────┘   │
│             Coroutines + Flow               │
└─────────────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────┐
│                 JNI Bridge                  │
│         llama_jni.cpp + Wrappers            │
└─────────────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────┐
│                Native Layer                 │
│     llama.cpp (GGML, GGUF, Inference)       │
│          ARM NEON optimizations             │
└─────────────────────────────────────────────┘
```
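
For orientation, the Kotlin side of a JNI bridge in this layering typically declares `external` functions backed by `llama_jni.cpp`. This is a hypothetical sketch of the pattern, not the library's actual internals; the library name, function names, and signatures are assumptions:

```kotlin
// Hypothetical JNI surface: the Long is an opaque pointer to the
// native llama.cpp context owned by the C++ side.
internal object LlamaNative {
    init {
        System.loadLibrary("llama_jni") // assumed .so name
    }

    external fun loadModel(path: String, contextSize: Int, threads: Int): Long
    external fun generateNext(ctx: Long): String? // null when generation ends
    external fun free(ctx: Long)
}
```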

### Project Structure

```
llama-kotlin-android/
├── app/                        # 📦 Library module
│   └── src/main/
│       ├── cpp/                # C++ native code
│       │   ├── llama.cpp/      # llama.cpp submodule
│       │   ├── llama_jni.cpp   # JNI bridge
│       │   └── llama_context_wrapper.cpp
│       └── java/com/llamakotlin/
│           ├── LlamaModel.kt   # Main API
│           ├── LlamaConfig.kt  # Configuration
│           └── exception/      # Exceptions
│
├── sample/                     # 📱 Sample app
│   └── src/main/
│       ├── java/.../MainActivity.kt
│       └── res/layout/activity_main.xml
│
└── README.md
```

## Building from Source

Prerequisites:

- Android Studio Ladybug+
- NDK 27.3.13750724
- CMake 3.22.1+
```bash
# Clone with submodules
git clone --recursive https://github.com/it5prasoon/llama-kotlin-android.git
cd llama-kotlin-android
# Update submodule if needed
git submodule update --init --recursive
# Build library
./gradlew :app:assembleRelease
# Build and install sample
./gradlew :sample:installDebug
```

Outputs:

- AAR: `app/build/outputs/aar/app-release.aar`
- Sample APK: `sample/build/outputs/apk/debug/sample-debug.apk`
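
To consume the locally built AAR from another project, one option is to copy it into that project's `libs/` directory and reference the file directly (a sketch in Gradle Kotlin DSL; the published Maven coordinate from Quick Start is the usual path):

```kotlin
dependencies {
    // Direct file dependency on the AAR built above
    implementation(files("libs/app-release.aar"))
}
```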

## Requirements

| Requirement | Minimum | Recommended |
|---|---|---|
| Android API | 24 (7.0) | 26+ (8.0+) |
| RAM | 3 GB | 6+ GB |
| Storage | 1 GB | 4+ GB |
| Architecture | arm64-v8a | arm64-v8a |

Supported ABIs:

- ✅ arm64-v8a (64-bit ARM)
- ❌ armeabi-v7a (disabled - llama.cpp incompatible)
- ✅ x86_64 (Emulator)
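
Since only arm64-v8a and x86_64 binaries ship, a runtime guard before loading a model lets an app fail gracefully on other devices. A minimal sketch:

```kotlin
import android.os.Build

// True if the device can run the bundled native libraries.
fun isAbiSupported(): Boolean =
    Build.SUPPORTED_ABIS.any { it == "arm64-v8a" || it == "x86_64" }
```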

## Contributing

Contributions welcome! Please:
- Fork the repository
- Create feature branch (`git checkout -b feature/amazing`)
- Commit changes (`git commit -m 'Add amazing feature'`)
- Push (`git push origin feature/amazing`)
- Open Pull Request

## License

MIT License - see the LICENSE file.
Made with ❤️ for the Android community