Example: Synthetic Data Generation
This example generates diverse synthetic data samples from a text description. It first identifies a comprehensive set of sample cases, then creates a detailed profile for each one.
Get the code
What it demonstrates
- Concept refinement (DataDescription refines Text, NbOfSamples refines Number)
- Rich structured concept with constrained choices (Sample with current_performance, learns_best_with, etc.)
- Two-step generation: identify diverse cases, then create a profile for each
- Batching over identified cases to generate samples in parallel
The Method: bundle.mthds
Sample concept
The Sample concept represents a student profile with constrained fields:
[concept.Sample]
description = "A complete synthetic data record representing a student profile."
[concept.Sample.structure]
student_name = { type = "text", description = "The name of the student", required = true }
current_performance = { type = "text", description = "The student's current performance level", choices = [
"Struggling", "Average", "Advanced",
], required = true }
learns_best_with = { type = "text", description = "The learning modality that works best", choices = [
"Visual examples", "Step-by-step text", "Hands-on practice", "Videos",
], required = true }
# ... pace, complexity, strengths, needs_help_with, etc.
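To make the effect of required = true and choices = [...] concrete, here is a minimal Python sketch of the same constraints. It is an illustration only, not how Pipelex implements concept structures: the class name SampleSketch and the two tuple constants are hypothetical, and only the three fields shown above are modeled.

```python
from dataclasses import dataclass

# Allowed values copied from the `choices` lists in the concept definition above.
PERFORMANCE_CHOICES = ("Struggling", "Average", "Advanced")
MODALITY_CHOICES = (
    "Visual examples",
    "Step-by-step text",
    "Hands-on practice",
    "Videos",
)


@dataclass(frozen=True)
class SampleSketch:
    """Hypothetical stand-in for the Sample concept (three fields only)."""

    student_name: str
    current_performance: str
    learns_best_with: str

    def __post_init__(self) -> None:
        # required = true: an empty name is rejected.
        if not self.student_name:
            raise ValueError("student_name is required")
        # choices = [...]: any value outside the allowed list is rejected.
        if self.current_performance not in PERFORMANCE_CHOICES:
            raise ValueError(
                f"invalid current_performance: {self.current_performance!r}"
            )
        if self.learns_best_with not in MODALITY_CHOICES:
            raise ValueError(
                f"invalid learns_best_with: {self.learns_best_with!r}"
            )


# A record that satisfies every constraint.
valid_sample = SampleSketch("Ada", "Advanced", "Hands-on practice")
```

Constraining a field to a short list of choices keeps generated records comparable: every sample uses the same vocabulary, so downstream code can group or filter on these fields without normalizing free text.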
Pipeline
[pipe.generate_synthetic_data_samples]
type = "PipeSequence"
inputs = { data_description = "DataDescription", nb_samples = "NbOfSamples" }
output = "Sample[]"
steps = [
  { pipe = "identify_sample_cases", result = "cases" },
  { pipe = "create_profile_for_case", result = "samples", batch_over = "cases", batch_as = "case_characterization" },
]
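The first step identifies diverse real-world scenarios to ensure comprehensive coverage (edge cases, common scenarios, important variations) and stores them as cases. The second step is batched over cases: it runs once per case, receives that case as case_characterization, and creates a full profile for it. The resulting profiles are collected as samples.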
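The control flow of this sequence can be sketched in plain Python. This is only an analogy for how the two steps and the batching relate, assuming the batched step behaves like one independent call per case. The function bodies are hypothetical placeholders and do not call Pipelex or any LLM.

```python
import asyncio


async def identify_sample_cases(data_description: str, nb_samples: int) -> list[str]:
    """Hypothetical stand-in for step 1: one short case label per requested sample."""
    return [f"{data_description} / case {i + 1}" for i in range(nb_samples)]


async def create_profile_for_case(case_characterization: str) -> dict[str, str]:
    """Hypothetical stand-in for step 2: builds one profile from one case."""
    await asyncio.sleep(0)  # placeholder for the real, slower generation call
    return {"case": case_characterization}


async def generate_synthetic_data_samples(
    data_description: str, nb_samples: int
) -> list[dict[str, str]]:
    # Step 1 (result = "cases"): identify the cases to cover.
    cases = await identify_sample_cases(data_description, nb_samples)
    # Step 2 (batch_over = "cases", batch_as = "case_characterization"):
    # one call per case, run concurrently. asyncio.gather returns results
    # in the same order as the inputs.
    samples = await asyncio.gather(
        *(create_profile_for_case(case_characterization=case) for case in cases)
    )
    return list(samples)


samples = asyncio.run(generate_synthetic_data_samples("student profiles", 3))
```

Splitting the work this way means the diversity decision is made once, up front, and each profile is then generated from its own case without depending on the other profiles.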
How to run
pipelex run bundle examples/c_advanced/gen_synthetic_data/bundle.mthds \
-i examples/c_advanced/gen_synthetic_data/inputs.json
Related Documentation
- PipeLLM Operator - The core operator for LLM interactions
- PipeSequence Controller - Chain pipes into sequential workflows
- Understanding Multiplicity - How batching works
- Refining Concepts - How concept refinement works