# Models for inference

### **Platform Capabilities**

Arkane Cloud enables experimentation and deployment of open-source AI models for various generative tasks, including conversational AI development, virtual assistant creation, and visual content generation.

### **Supported Model Categories**

The platform supports two primary model categories:

* Text-to-text generation (language models)
* Text-to-image generation

View the complete model catalog within the Arkane Cloud interface.

### **Model Information Display**

Each model's details page presents key pricing metrics:

* Request token cost: Pricing for input tokens sent via API, calculated per million tokens in USD
* Response token cost: Pricing for tokens returned by the model, calculated per million tokens in USD
* Visual output cost: Per-image pricing for generated visuals, in USD
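Because text pricing is quoted per million tokens, the cost of a single request is simple arithmetic. A minimal sketch, using hypothetical prices (the actual rates appear on each model's details page):

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     input_price_per_million: float,
                     output_price_per_million: float) -> float:
    """Estimate the USD cost of one text-generation request."""
    return (input_tokens * input_price_per_million
            + output_tokens * output_price_per_million) / 1_000_000

# Hypothetical rates: $0.50 per 1M input tokens, $1.50 per 1M output tokens.
cost = request_cost_usd(input_tokens=2_000, output_tokens=500,
                        input_price_per_million=0.50,
                        output_price_per_million=1.50)
print(f"${cost:.6f}")  # $0.001750
```

Note that input and output tokens are priced separately, so capping the response length (see the generation parameters below) directly bounds the output portion of the cost.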

### **Configuration Options**

Customize model behavior through adjustable parameters that balance output quality, latency, and cost. The platform exposes the same parameter set as the vLLM inference framework.

Different interfaces offer varying levels of control:

* Playground: Features commonly used settings
* API: Provides complete vLLM parameter access
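The split between the two interfaces can be sketched as two parameter sets. The parameter names below are standard vLLM sampling parameters; the values are purely illustrative, and which parameters the playground actually surfaces is an assumption:

```python
# Commonly exposed sampling parameters (typical playground controls):
common_params = {
    "temperature": 0.7,   # randomness of sampling (0 = deterministic)
    "top_p": 0.9,         # nucleus-sampling probability cutoff
    "max_tokens": 256,    # cap on generated tokens (bounds response cost)
}

# The API accepts the wider vLLM set; a few examples of additional knobs:
extended_params = {
    **common_params,
    "top_k": 40,                # sample only from the k most likely tokens
    "repetition_penalty": 1.1,  # discourage repeated phrases
    "stop": ["\n\n"],           # sequences that end generation early
}
```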

### **Testing Workflow**

1. Browse the Models section and select your preferred option, then click **Go to Deploy**
2. Enter your test prompt and select **Submit**

Once you've compared models in the playground, pick the best fit and integrate it into your application through API calls.
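Moving from the playground to code might look like the following sketch. The endpoint URL, API key, and model ID are placeholders, and the OpenAI-compatible chat-completions request shape is an assumption (vLLM-backed platforms commonly expose it); consult the platform's API reference for the exact contract:

```python
import requests

# Hypothetical values: replace with your actual endpoint, key, and model ID.
API_URL = "https://api.example.com/v1/chat/completions"
API_KEY = "YOUR_API_KEY"

def build_payload(prompt: str, model: str, **sampling) -> dict:
    """Assemble the JSON body for an OpenAI-compatible chat request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        **sampling,  # e.g. temperature, top_p, max_tokens
    }

def generate(prompt: str, model: str) -> str:
    """Send the request and return the generated text."""
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json=build_payload(prompt, model, temperature=0.7, max_tokens=256),
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```

Keeping payload construction separate from transport makes it easy to reuse the exact sampling settings you tuned in the playground.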

### **Performance Enhancement**

The language model inference system implements multiple acceleration methods to boost processing speed while preserving output quality:

1. Key-value (KV) caching: Stores the attention keys and values of already-processed tokens so they are not recomputed at every decoding step
2. Chunked processing: Splits input into manageable segments for efficient memory utilization
3. Optimized attention kernels: Restructured attention computation (such as vLLM's PagedAttention) that streamlines attention operations
4. Quantization: Reduces model parameter precision (for example, to FP16 or INT8) to lower memory and compute requirements
5. Request batching: Groups multiple requests together to raise hardware utilization
6. Computation reuse: Retains intermediate results, such as shared prompt prefixes, to avoid redundant processing
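The first technique, key-value caching, can be illustrated with a toy single-head attention loop. This is a didactic numpy sketch, not the platform's implementation; the identity "projections" are a simplification so the example stays short:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # head dimension

def attention(q, K, V):
    """Scaled dot-product attention: one query against all cached keys/values."""
    scores = K @ q / np.sqrt(d)             # shape (t,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # softmax over past tokens
    return weights @ V                      # shape (d,)

# Simulate autoregressive decoding: each step appends only the new token's
# key and value to the cache instead of recomputing K and V for the whole
# sequence, turning an O(t) recomputation per step into an O(1) append.
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
outputs = []
for step in range(4):
    x = rng.standard_normal(d)          # stand-in for the new token's hidden state
    k, v, q = x, x, x                   # toy identity projections, for brevity
    K_cache = np.vstack([K_cache, k])
    V_cache = np.vstack([V_cache, v])
    outputs.append(attention(q, K_cache, V_cache))
```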

### **Quality Preservation**

These enhancement methods are engineered to preserve model accuracy. Comprehensive evaluation demonstrates that optimized versions retain roughly 99% of baseline model performance.

Optimized models produce outputs virtually identical to the standard versions, with minimal variation. Each optimization is assessed both individually and in combination to confirm that stacking techniques does not degrade overall quality. The goal is high-speed inference that preserves accuracy and reliability while minimizing resource consumption.

### **Advantages**

Performance optimization delivers several concrete benefits:

* Enhanced processing capacity: Optimized systems handle increased request volumes efficiently
* Faster response times: Reduced computational overhead enables quicker user interactions
* Better resource efficiency: Lower hardware requirements support broader deployment scenarios

These enhancements collectively enable a superior language model service that balances efficiency with effectiveness.
