Quickstart Guide

This guide shows how to get started with pandas-query-generator (pqg), a tool for generating synthetic pandas DataFrame queries.

Installation

Install using pip:

pip install pqg

Basic Usage

Generate Queries via CLI

The simplest way to generate queries is through the command-line interface:

# Generate 100 queries using the example schema
pqg --num-queries 100 --schema examples/customer/schema.json --verbose

# Save queries to a file
pqg --num-queries 100 --schema examples/customer/schema.json --output-file queries.txt

# Generate multi-line queries
pqg --num-queries 100 --schema examples/customer/schema.json --multi-line
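
When the CLI needs to be driven from a script, the invocations above can be assembled as an argument list. The following is a minimal sketch and not part of pqg: build_pqg_command is a hypothetical helper that uses only the flags shown above, and combining them in one call is assumed to work. Running the command requires the pqg executable to be on the PATH.

```python
import shlex


def build_pqg_command(schema_path, num_queries=100, output_file=None,
                      multi_line=False, verbose=False):
    """Assemble a pqg command line from the flags shown in this guide.

    Hypothetical helper for illustration; it is not part of pqg.
    """
    command = ["pqg", "--num-queries", str(num_queries), "--schema", schema_path]
    if output_file is not None:
        command += ["--output-file", output_file]
    if multi_line:
        command.append("--multi-line")
    if verbose:
        command.append("--verbose")
    return command


command = build_pqg_command(
    "examples/customer/schema.json",
    output_file="queries.txt",
    multi_line=True,
)

# Show the command as it would be typed in a shell.
print(shlex.join(command))

# To actually run it (requires pqg to be installed):
#   import subprocess
#   subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems when a schema path contains spaces.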

Generate Queries in Python

Here’s a simple example of generating queries programmatically:

from pqg import Generator, Schema, QueryStructure

# Load schema definition
schema = Schema.from_file('examples/customer/schema.json')

# Configure query generation parameters
structure = QueryStructure(
  max_merges=2,                       # Max number of table joins
  max_selection_conditions=3,         # Max WHERE conditions
  max_projection_columns=4,           # Max columns in SELECT
  selection_probability=0.7,          # 70% chance of WHERE clause
  projection_probability=0.8,         # 80% chance of column projection
  groupby_aggregation_probability=0.3 # 30% chance of GROUP BY
)

# Create generator
generator = Generator(schema, structure)

# Generate pool of queries
query_pool = generator.generate(num_queries=100)

# Print queries
for query in query_pool:
  print(query)
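
To build intuition for the probability parameters, the sketch below simulates how often each operation would appear if every query decided independently whether to include it. That independence is an assumption made for illustration: plan_operations is a hypothetical function, and pqg's actual sampling logic may differ.

```python
import random


def plan_operations(rng, selection_probability=0.7,
                    projection_probability=0.8,
                    groupby_aggregation_probability=0.3):
    """Decide which operations one query includes (illustrative only)."""
    return {
        "selection": rng.random() < selection_probability,
        "projection": rng.random() < projection_probability,
        "groupby_aggregation": rng.random() < groupby_aggregation_probability,
    }


rng = random.Random(0)  # fixed seed so the run is repeatable
plans = [plan_operations(rng) for _ in range(10_000)]

# Report the share of simulated queries that include each operation.
for name in ("selection", "projection", "groupby_aggregation"):
    share = sum(plan[name] for plan in plans) / len(plans)
    print(f"{name}: {share:.2f}")
```

Over many queries, each share should land close to the configured probability, which is a useful check when tuning a QueryStructure for a particular mix of operations.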

Schema Definition

A schema is a JSON file that defines each entity (table) along with its primary key, properties (columns), and foreign keys:

{
  "entities": {
    "customer": {
      "primary_key": "id",
      "properties": {
        "id": {
          "type": "int",
          "min": 1,
          "max": 1000
        },
        "name": {
          "type": "string",
          "starting_character": ["A", "B", "C"]
        },
        "status": {
          "type": "enum",
          "values": ["active", "inactive"]
        }
      },
      "foreign_keys": {}
    }
  }
}
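
A schema like the one above can also be assembled in Python and written out with the standard json module. The sketch below does that and first runs a basic sanity check that each entity's primary key is one of its properties. The check_primary_keys helper and the temporary output location are assumptions for illustration; pqg may apply its own validation when it loads the file.

```python
import json
import tempfile
from pathlib import Path

# Same structure as the JSON example above.
schema = {
    "entities": {
        "customer": {
            "primary_key": "id",
            "properties": {
                "id": {"type": "int", "min": 1, "max": 1000},
                "name": {"type": "string", "starting_character": ["A", "B", "C"]},
                "status": {"type": "enum", "values": ["active", "inactive"]},
            },
            "foreign_keys": {},
        }
    }
}


def check_primary_keys(schema):
    """Return entity names whose primary key is not a declared property.

    Hypothetical helper; entities without a primary key are skipped.
    """
    problems = []
    for name, entity in schema["entities"].items():
        primary_key = entity.get("primary_key")
        if primary_key is not None and primary_key not in entity.get("properties", {}):
            problems.append(name)
    return problems


problems = check_primary_keys(schema)
if problems:
    raise ValueError(f"primary key missing from properties: {problems}")

# Write the schema to a temporary directory so the example leaves no files behind.
schema_path = Path(tempfile.mkdtemp()) / "schema.json"
schema_path.write_text(json.dumps(schema, indent=2))
print(f"wrote {schema_path}")
```

The resulting file can then be passed to the --schema option or to Schema.from_file.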

Property Types

The following property types are supported:

  • int: Integer values with min/max range

  • float: Floating point values with min/max range

  • string: Text with configurable starting characters

  • enum: Fixed set of possible string values

  • date: Dates within a min/max range
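
To make the property types concrete, the sketch below draws one random value of each type from a property definition. It is a hypothetical illustration rather than pqg's own data generator: sample_value is an invented name, the balance and signup properties are made up, the float and date definitions reuse the min/max keys described above, dates are assumed to be ISO-formatted strings, and the five-letter random suffix for string values is an arbitrary choice.

```python
import random
import string
from datetime import date, timedelta


def sample_value(prop, rng):
    """Draw one random value that satisfies a property definition."""
    kind = prop["type"]
    if kind == "int":
        return rng.randint(prop["min"], prop["max"])
    if kind == "float":
        return rng.uniform(prop["min"], prop["max"])
    if kind == "string":
        # Pick an allowed first character, then pad with random letters.
        first = rng.choice(prop["starting_character"])
        rest = "".join(rng.choices(string.ascii_lowercase, k=5))
        return first + rest
    if kind == "enum":
        return rng.choice(prop["values"])
    if kind == "date":
        # Assumption: date bounds are given as ISO strings (YYYY-MM-DD).
        start = date.fromisoformat(prop["min"])
        end = date.fromisoformat(prop["max"])
        return start + timedelta(days=rng.randint(0, (end - start).days))
    raise ValueError(f"unsupported property type: {kind}")


rng = random.Random(42)
properties = {
    "id": {"type": "int", "min": 1, "max": 1000},
    "balance": {"type": "float", "min": 0.0, "max": 500.0},  # hypothetical property
    "name": {"type": "string", "starting_character": ["A", "B", "C"]},
    "status": {"type": "enum", "values": ["active", "inactive"]},
    "signup": {"type": "date", "min": "2020-01-01", "max": "2020-12-31"},  # hypothetical property
}

row = {name: sample_value(prop, rng) for name, prop in properties.items()}
print(row)
```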

Advanced Usage

Filtering Queries

Filter queries based on their execution results:

from pqg import QueryFilter

# Keep only queries that return data
query_pool.filter(QueryFilter.NON_EMPTY)

# Keep only queries that failed
query_pool.filter(QueryFilter.HAS_ERROR)
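
The idea behind result-based filtering can be sketched without pqg: run each query, record whether it returned rows, returned nothing, or raised an error, and keep only the outcomes of interest. In the sketch below, plain Python functions over a list of dictionaries stand in for generated pandas queries. The Outcome enum and the classify function are hypothetical names and do not reflect how pqg implements QueryFilter.

```python
from enum import Enum


class Outcome(Enum):
    """Hypothetical result categories for an executed query."""
    NON_EMPTY = "non-empty"
    EMPTY = "empty"
    HAS_ERROR = "has-error"


def classify(query, rows):
    """Run a query against the rows and categorize its result."""
    try:
        result = query(rows)
    except Exception:
        return Outcome.HAS_ERROR
    return Outcome.NON_EMPTY if len(result) > 0 else Outcome.EMPTY


rows = [
    {"id": 1, "status": "active"},
    {"id": 2, "status": "inactive"},
]

# Stand-ins for generated queries; the last one refers to a missing column.
queries = {
    "active customers": lambda rows: [r for r in rows if r["status"] == "active"],
    "id above 1000": lambda rows: [r for r in rows if r["id"] > 1000],
    "missing column": lambda rows: [r for r in rows if r["age"] > 30],
}

outcomes = {name: classify(query, rows) for name, query in queries.items()}
kept = [name for name, outcome in outcomes.items() if outcome is Outcome.NON_EMPTY]

print(outcomes)
print(kept)
```

Because the category depends on the data a query runs against, the same query can be non-empty on one dataset and empty on another.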

Query Statistics

Get statistics about generated queries:

stats = query_pool.statistics()
print(stats)  # Shows operation frequencies, complexity metrics

CLI Options

Common command-line options:

--ensure-non-empty        # Only generate queries returning data
--filter non-empty|empty  # Filter queries by result type
--multi-line              # Format queries across multiple lines
--sort                    # Sort queries by complexity
--verbose                 # Print detailed information

Next Steps

  • Check out the API Reference for detailed documentation

  • See example schemas in the examples/ directory

  • Read the technical paper in docs/paper.pdf