Weave is a flexible framework for generating high-quality synthetic data using Language Models (LLMs). It provides a modular and extensible architecture that allows users to easily create, customize, and validate synthetic datasets for various applications.
Note: This project is in its very early stages and is being actively developed in public. Expect frequent changes and improvements.
GitHub Repository: https://github.com/ashikshafi08/weave.git
You can install weave directly from GitHub using pip:
pip install git+https://github.com/ashikshafi08/weave.git
For development, you can clone the repository and install it in editable mode:
git clone https://github.com/ashikshafi08/weave.git
cd weave
pip install -e .
- π Modular Architecture: Easily extend and customize components
- π€ Multiple LLM Support: Use OpenAI, Hugging Face, or custom LLM providers
- π Flexible Prompt Management: Customize and version prompts for different use cases
- β¨ Data Validation: Ensure quality and correctness of generated data
- π Pipeline-based Processing: Chain operations for complex data generation
- π§ Plugin System: Add custom functionality through plugins
The Weave framework consists of several core components that work together to generate synthetic data:
-
WeaveFramework (
framework.py
)- Central orchestrator for the entire process
- Manages configuration and component lifecycle
- Coordinates data generation pipeline
-
Pipeline (
pipeline.py
)- Defines data generation workflow
- Manages sequence of operations
- Handles data flow between components
-
DataSource (
data_source.py
)- Provides initial data or templates
- Supports multiple data sources (files, databases, APIs)
- Handles data loading and sampling
-
DataProcessor (
data_processor.py
)- Transforms and prepares data
- Implements data cleaning and normalization
- Supports custom processing logic
-
LLMInterface (
llm_interface.py
)- Manages LLM interactions
- Handles prompt submission and response processing
- Implements rate limiting and error handling
-
PromptManager (
prompt_manager.py
)- Manages prompt templates
- Supports dynamic prompt generation
- Handles template versioning
-
DataGenerator (
data_generator.py
)- Converts LLM outputs to structured data
- Implements parsing and formatting logic
- Ensures data structure consistency
-
DataValidator (
data_validator.py
)- Validates generated data
- Implements quality checks
- Enforces domain-specific rules
-
PluginManager (
plugin_manager.py
)- Manages framework extensions
- Handles plugin lifecycle
- Provides plugin discovery and loading
graph TD
A[WeaveFramework] --> B[Pipeline]
B --> C[DataSource]
B --> D[DataProcessor]
B --> E[LLMInterface]
B --> F[DataGenerator]
B --> G[DataValidator]
E --> H[PromptManager]
A --> I[PluginManager]
import asyncio
from weave import WeaveFramework
# Configure the framework
config = {
'pipeline': {
'type': 'default',
'stages': ['source', 'process', 'generate', 'validate']
},
'data_source': {
'type': 'json',
'path': 'templates.json'
},
'llm_interface': {
'type': 'openai',
'model': 'gpt-4',
'api_key': 'your-api-key'
}
}
# Initialize and run
async def generate_data():
framework = WeaveFramework(config)
dataset = await framework.generate_dataset(num_samples=10)
return dataset
if __name__ == "__main__":
asyncio.run(generate_data())
from weave.core.data_source import DataSource
from weave.core.decorators import register_module
@register_module('data_sources', 'custom')
class CustomDataSource(DataSource):
async def fetch(self, num_samples: int) -> List[Dict[str, Any]]:
# Implement custom data fetching logic
pass
async def load_data(self, source: str) -> None:
# Implement custom data loading logic
pass
from weave.core.data_processor import DataProcessor
from weave.core.decorators import register_module
@register_module('data_processors', 'custom')
class CustomProcessor(DataProcessor):
async def process(self, data: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
# Implement custom processing logic
pass
Weave uses YAML configuration files to define component settings and pipeline stages:
# config.yaml
pipeline:
type: default
stages:
- source
- process
- generate
- validate
data_source:
type: json
path: templates.json
cache: true
llm_interface:
type: openai
model: gpt-4
api_key: ${OPENAI_API_KEY}
rate_limit: 60
validator:
type: default
rules:
- schema_validation
- quality_check
Weave supports plugins for extending functionality:
from weave.core.plugin_manager import PluginManager
# Register a plugin
plugin_manager = PluginManager()
plugin_manager.register_plugin('custom_processor', CustomProcessor)
# Use in framework
framework = WeaveFramework(config, plugin_manager=plugin_manager)
Here's a complete example of generating math olympiad problems:
from weave import WeaveFramework
from weave.data_sources import MathOlympiadSource
from weave.data_processors import MathProcessor
from weave.prompt_templates import MathPromptTemplate
config = {
'pipeline': {
'type': 'math_olympiad',
'stages': ['source', 'process', 'generate', 'validate']
},
'data_source': {
'type': 'math_olympiad',
'difficulty_range': [3, 5],
'topics': ['algebra', 'geometry', 'number_theory']
}
}
async def generate_math_problems():
framework = WeaveFramework(config)
problems = await framework.generate_dataset(num_samples=10)
return problems
pip install weave-framework
pytest tests/
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- π Documentation
- π¬ Discord Community
- π§ Email Support
As this project is in its early stages, contributions, suggestions, and feedback are highly welcome! Please feel free to submit issues, feature requests, or pull requests.
This project is licensed under the MIT License - see the LICENSE file for details.
This project is under active development. APIs may change, and features may be added or removed. It's a learning project and is not intended for production use as of now.