API Reference

This reference documents the key classes and functions in pqg.

Schema Components

Schema

class pqg.Schema(entities: Set[Entity] = <factory>)[source]

A dictionary-like class representing a database schema containing multiple entities.

The Schema class provides dictionary-style access to Entity objects, allowing for both iteration over entities and direct access to specific entities by name.

entities

Set of Entity objects in the schema.

Type:

t.Set[Entity]

_entity_map

Internal mapping of entity names to Entity objects.

Type:

t.Dict[str, Entity]

__getitem__(key: str) → Entity[source]

Get an entity by name using dictionary-style access.

Parameters:

key (str) – The name of the entity to retrieve.

Returns:

The requested entity.

Return type:

Entity

Raises:

KeyError – If no entity exists with the given name.

__iter__() → Iterator[Entity][source]

Iterate over all entities in the schema.

Returns:

Iterator yielding Entity objects.

Return type:

Iterator[Entity]

__len__() → int[source]

Get the number of entities in the schema.

Returns:

Total number of entities.

Return type:

int
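
Example

A minimal sketch of dictionary-style access (the schema path and entity name are illustrative):

from pqg import Schema

schema = Schema.from_file('examples/schema.json')

customer = schema['customer']  # direct access by name (__getitem__)
print(len(schema))             # number of entities (__len__)
for entity in schema:          # iterate over Entity objects (__iter__)
    print(entity.name)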

static from_dict(data: dict) → Schema[source]

Create a Schema instance from a dictionary.

Similar to from_file but accepts a dictionary directly instead of reading from a file. This is useful for creating schemas programmatically or when the schema definition is already in memory.

Parameters:

data (dict) – Dictionary containing the schema configuration. Expected to have an 'entities' key mapping to entity configs.

Returns:

A new Schema instance containing the entities defined in the dictionary.

Return type:

Schema

Raises:

ValueError – If the dictionary structure is invalid.

Example

schema_dict = {
  "entities": {
    "customer": {
      "primary_key": "id",
      "properties": {...},
      "foreign_keys": {...}
    }
  }
}

schema = Schema.from_dict(schema_dict)

static from_file(path: str) → Schema[source]

Create a Schema instance by loading entity configurations from a JSON file.

This method reads a JSON file containing entity configurations and creates a Schema object with Entity instances for each configured entity.

Parameters:

path (str) – The file path to the JSON configuration file.

Returns:

A new Schema instance containing the entities defined in the file.

Return type:

Schema

Raises:
  • json.JSONDecodeError – If the file contains invalid JSON.

  • FileNotFoundError – If the specified file does not exist.

  • PermissionError – If the file cannot be read due to permission issues.

Note

If the 'entities' key is not present in the JSON file, an empty Schema will be returned.

Example

schema = Schema.from_file('path/to/schema_config.json')

Entity

class pqg.Entity(name: str, primary_key: str | List[str] | None, properties: Dict[str, PropertyInt | PropertyFloat | PropertyEnum | PropertyString | PropertyDate], foreign_keys: Dict[str, List[str]])[source]

Represents entity information parsed from schema files.

This class is used to generate well-formed and meaningful queries based on entity information and a high-level structure describing how queries should generally look (see QueryStructure).

name

The name of the entity.

Type:

str

primary_key

The primary key(s) of the entity.

Type:

str | t.List[str] | None

properties

A dictionary of property names to their definitions.

Type:

t.Dict[str, Property]

foreign_keys

A dictionary of foreign key relationships.

Type:

t.Dict[str, t.List[str]]

__eq__(other: object) → bool[source]

Compare this entity with another for equality.

Entities are considered equal if they have the same name, as names must be unique within a schema.

Parameters:

other – The object to compare with.

Returns:

True if the objects are equal, False otherwise.

Return type:

bool

__hash__() → int[source]

Generate a hash based on the entity’s name.

Since entity names must be unique within a schema, using the name as the hash basis ensures proper hash table behavior.

Returns:

Hash value for the entity.

Return type:

int

property data_ranges: Dict[str, Tuple[int, int] | List[str]]

Get the data ranges for all properties of the entity.

Returns:

A dictionary mapping property names to their respective ranges or possible values.

static from_configuration(name: str, config: Dict) → Entity[source]

Create an Entity instance from a configuration dictionary.

Parameters:

name (str) – The name of the entity.

config (t.Dict) – A dictionary containing entity configuration.

Returns:

An instance of the Entity class.

Return type:

Entity

Raises:

ValueError – If an unknown property type is encountered.

generate_dataframe(num_rows=1000) → DataFrame[source]

Generate a Pandas dataframe using this entity’s information.

Parameters:

num_rows (int) – The number of rows to generate. Default is 1000.

Returns:

A dataframe populated with randomly generated data based on the entity’s properties.

Return type:

pd.DataFrame

Note

If the entity has a unique primary key of type int, the number of rows may be limited to the range of possible values for that key.
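
Example

A hedged sketch (assumes a schema loaded as in the Schema example above; 'customer' is illustrative):

customer = schema['customer']
df = customer.generate_dataframe(num_rows=100)
print(df.head())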

property has_unique_primary_key: bool

Check if the entity has a single, unique primary key.

Property Types

class pqg.PropertyInt(min: int, max: int, type: str = 'int')[source]

Integer property definition with range constraints.

Used to define integer columns in entity schemas, providing valid value ranges for query generation.

min

Minimum allowed value (inclusive)

Type:

int

max

Maximum allowed value (inclusive)

Type:

int

type

Always “int”, used for schema parsing

Type:

str
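
Example

A minimal sketch with illustrative bounds:

# Ages between 0 and 120
age = PropertyInt(min=0, max=120)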

class pqg.PropertyFloat(min: float, max: float, type: str = 'float')[source]

Floating point property definition with range constraints.

Used to define decimal/float columns in entity schemas, providing valid value ranges for query generation.

min

Minimum allowed value (inclusive)

Type:

float

max

Maximum allowed value (inclusive)

Type:

float

type

Always “float”, used for schema parsing

Type:

str

class pqg.PropertyEnum(values: List[str], type: str = 'enum')[source]

Enumeration property definition with allowed values.

Used to define columns with a fixed set of possible values in entity schemas, such as status fields or categories.

values

List of valid string values for this property

Type:

List[str]

type

Always “enum”, used for schema parsing

Type:

str

Example

status = PropertyEnum(values=["active", "inactive", "pending"])

class pqg.PropertyString(starting_character: List[str], type: str = 'string')[source]

String property definition with character constraints.

Used to define text columns in entity schemas, with optional constraints on what characters strings can start with. This allows generating realistic-looking text data.

starting_character

List of valid first characters for generated strings

Type:

List[str]

type

Always “string”, used for schema parsing

Type:

str

Example

# Names starting with uppercase letters
name = PropertyString(starting_character=list(string.ascii_uppercase))

class pqg.PropertyDate(min: date, max: date, type: str = 'date')[source]

Date property definition with range constraints.

Used to define date columns in entity schemas, providing valid date ranges for query generation. Generates dates between the min and max values inclusive.

min

Earliest allowed date

Type:

datetime.date

max

Latest allowed date

Type:

datetime.date

type

Always “date”, used for schema parsing

Type:

str

Example

# Dates in the year 2023
created_at = PropertyDate(
  min=date(2023, 1, 1),
  max=date(2023, 12, 31)
)

Query Generation

Generator

class pqg.Generator(schema: Schema, query_structure: QueryStructure, with_status: bool = False)[source]

Generator for creating pools of pandas DataFrame queries.

This class handles the generation of valid pandas DataFrame queries based on a provided schema and query structure parameters. It supports both parallel and sequential query generation with optional progress tracking.

The generator can ensure that queries produce non-empty results by retrying failed generations, and supports formatting queries in both single-line and multi-line styles.

schema

Schema defining the database structure and relationships

query_structure

Parameters controlling query complexity and features

sample_data

Dictionary mapping entity names to sample DataFrames

with_status

Whether to display progress bars during operations

generate(options: GenerateOptions) → QueryPool[source]

Generate a pool of queries using parallel or sequential processing.

This method creates multiple queries according to the specified options, either concurrently using a process pool or sequentially. Progress is tracked with a progress bar when with_status is True.

Parameters:

options – Configuration options controlling generation behavior

Returns:

QueryPool containing the generated queries and sample data

Note

When using parallel processing, the progress bar accurately tracks completion across all processes. The resulting QueryPool contains all successfully generated queries in an arbitrary order.
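
Example

A hedged end-to-end sketch (the schema path and all parameter values are illustrative):

from pqg import GenerateOptions, Generator, QueryStructure, Schema

schema = Schema.from_file('examples/schema.json')

structure = QueryStructure(
  groupby_aggregation_probability=0.5,
  max_groupby_columns=5,
  max_merges=2,
  max_projection_columns=5,
  max_selection_conditions=5,
  projection_probability=0.5,
  selection_probability=0.5,
)

generator = Generator(schema, structure, with_status=True)
pool = generator.generate(GenerateOptions(num_queries=100))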

GenerateOptions

class pqg.GenerateOptions(ensure_non_empty: bool = False, multi_line: bool = False, multi_processing: bool = True, num_queries: int = 1000)[source]

Configuration options for controlling query generation behavior.

This class provides settings that determine how queries are generated and validated, including performance options like parallel processing.

ensure_non_empty

If True, only generate queries that return data

Type:

bool

multi_line

If True, format queries with line breaks for readability

Type:

bool

multi_processing

If True, generate queries in parallel

Type:

bool

num_queries

Total number of queries to generate

Type:

int

Example

options = GenerateOptions(
  ensure_non_empty=True,
  num_queries=1000,
  multi_processing=True
)

generator.generate(options)

static from_args(arguments: Arguments) → GenerateOptions[source]

Create GenerateOptions from command-line arguments.

Parameters:

arguments – Parsed command-line arguments

Returns:

GenerateOptions configured according to provided arguments

QueryStructure

class pqg.QueryStructure(groupby_aggregation_probability: float, max_groupby_columns: int, max_merges: int, max_projection_columns: int, max_selection_conditions: int, projection_probability: float, selection_probability: float)[source]

Configuration parameters controlling query generation behavior.

This class encapsulates the probability and limit settings that determine what kinds of queries are generated. It controls aspects like how likely different operations are to appear and how complex they can be.

groupby_aggregation_probability

Probability (0-1) of including a GROUP BY operation

Type:

float

max_groupby_columns

Maximum number of columns that can be grouped on

Type:

int

max_merges

Maximum number of table joins allowed in a query

Type:

int

max_projection_columns

Maximum number of columns that can be selected

Type:

int

max_selection_conditions

Maximum number of WHERE clause conditions

Type:

int

projection_probability

Probability (0-1) of including a SELECT operation

Type:

float

selection_probability

Probability (0-1) of including a WHERE operation

Type:

float

static from_args(arguments: Arguments) → QueryStructure[source]

Create a QueryStructure instance from command-line arguments.

Parameters:

arguments – Instance of Arguments containing parsed command-line parameters

Returns:

Instance configured according to the provided arguments

Return type:

QueryStructure

Query Components

Query

class pqg.Query(entity: str, operations: List[Operation], multi_line: bool, columns: Set[str])[source]

Represents a complete database query with tracking for query complexity.

A query consists of a target entity and a sequence of operations to be applied to that entity. Query complexity is determined primarily by the number of merge operations and their nesting depth.

entity

The name of the target entity.

Type:

str

operations

List of operations to apply.

Type:

List[Operation]

multi_line

Whether to format output across multiple lines.

Type:

bool

columns

Columns available for operations.

Type:

Set[str]
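
Example

A hedged sketch of constructing a query by hand (queries are normally produced by Generator; names and values are illustrative):

from pqg import Query, Selection

query = Query(
  entity='customer',
  operations=[Selection(conditions=[('age', '>=', 25, None)])],
  multi_line=False,
  columns={'age', 'name'},
)

print(query)  # e.g. "customer[customer['age'] >= 25]"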

__eq__(other: object) → bool[source]

Equality comparison based on complexity and string representation.

__hash__() → int[source]

Hash based on complexity and string representation.

__lt__(other: object) → bool[source]

Less than comparison based on complexity and string representation.

__str__() → str[source]

Return str(self).

property complexity: int

Calculate query complexity based on all operations and their details.

Complexity is determined by:

  1. Base complexity: the total number of operations

  2. Merge complexity: each merge adds a weight of 3 (more complex than other operations), plus additional complexity from its nested queries

  3. Selection complexity: the number of conditions in each selection

  4. Projection complexity: the number of columns being projected

  5. GroupBy complexity: the number of grouping columns plus the weight of the aggregation

Returns:

Complexity score for the query

Return type:

int

format_multi_line(start_counter: int = 1) → Tuple[str, int][source]

Format the query across multiple lines for better readability.

Transforms the query into a sequence of DataFrame operations where each operation is assigned to a numbered DataFrame variable (df1, df2, etc.). Handles nested operations by recursively formatting sub-queries and maintaining proper DataFrame references.

Parameters:

start_counter (int) – Initial counter value for DataFrame numbering. Defaults to 1.

Returns:

A tuple containing:

  • The formatted multi-line query string

  • The final counter value after processing all operations

Return type:

Tuple[str, int]

Example

For a query with multiple operations, might return:

("df1 = customer[customer['age'] >= 25]\n"
 "df2 = df1.merge(orders, left_on='id', right_on='customer_id')", 3)

property merge_count: int

Count the total number of merge operations in the query, including nested merges.

Returns:

Total number of merge operations

Return type:

int

property merge_entities: Set[str]

Get the set of all entities involved in this query, including the base entity and all merged entities.

This property maintains a complete picture of table dependencies by tracking:

  1. The base entity of the query

  2. All entities that have been merged directly into this query

  3. All entities that have been merged into sub-queries (nested merges)

The tracking helps prevent:

  • Circular dependencies (e.g., orders → customers → orders)

  • Redundant joins (e.g., merging the same table multiple times)

  • Invalid join paths

Returns:

A set of entity names (table names) that are part of this query’s join graph. Includes both the base entity and all merged entities.

Return type:

Set[str]

Operations

class pqg.Selection(conditions: List[Tuple[str, str, Any, str | None]] = <factory>)[source]

Represents a selection operation in a query.

conditions

List of selection conditions and operators. Each tuple contains (column, operation, value, next_condition_operator); the last tuple's next_condition_operator is ignored.

Type:

List[Tuple[str, str, Any, str | None]]

apply(entity: str) → str[source]

Apply the operation to the given entity name.

Parameters:

entity – Name of the entity (table) to apply the operation to.

Returns:

A pandas query string fragment representing this operation. For example: ".groupby(['col'])" or "[['col1', 'col2']]"
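
Example

A hedged sketch of a two-condition selection (columns and values are illustrative; the '&' combinator is an assumption based on pandas boolean indexing syntax):

selection = Selection(conditions=[
  ('age', '>=', 25, '&'),   # combined with the next condition via '&'
  ('age', '<=', 65, None),  # the final tuple's operator is ignored
])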

class pqg.Projection(columns: List[str] = <factory>)[source]

Represents a projection operation in a query.

columns

A list of column names to project.

Type:

List[str]

apply(entity: str) → str[source]

Apply the operation to the given entity name.

Parameters:

entity – Name of the entity (table) to apply the operation to.

Returns:

A pandas query string fragment representing this operation. For example: ".groupby(['col'])" or "[['col1', 'col2']]"
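
Example

A minimal sketch (column names are illustrative):

projection = Projection(columns=['name', 'age'])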

class pqg.GroupByAggregation(group_by_columns: List[str], agg_function: str)[source]
apply(entity: str) → str[source]

Apply the operation to the given entity name.

Parameters:

entity – Name of the entity (table) to apply the operation to.

Returns:

A pandas query string fragment representing this operation. For example: ".groupby(['col'])" or "[['col1', 'col2']]"
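
Example

A hedged sketch (the 'mean' aggregation function is an illustrative assumption):

groupby = GroupByAggregation(group_by_columns=['region'], agg_function='mean')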

class pqg.Merge(right: pqg.merge.Query, left_on: str, right_on: str)[source]
apply(entity: str) → str[source]

Apply the operation to the given entity name.

Parameters:

entity – Name of the entity (table) to apply the operation to.

Returns:

A pandas query string fragment representing this operation. For example: ".groupby(['col'])" or "[['col1', 'col2']]"

property entities: Set[str]

Get the set of all entities involved in the right side of this merge operation.

This property provides a complete view of all tables involved in the right-hand side of the merge, including:

  1. The immediate right entity being merged

  2. Any entities that have been merged into the right entity through nested merges

The difference between this and Query.merge_entities is that this property only tracks entities from the right side of the merge operation, while merge_entities includes the base entity and all merged entities from both sides.

Returns:

A set of entity names that are part of the right-hand side of this merge operation’s join graph.

Return type:

Set[str]
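
Example

A hedged sketch joining a base entity to an 'order' entity (entity and column names are illustrative; right takes a Query describing the right-hand side):

from pqg import Merge, Query

merge = Merge(
  right=Query('order', [], False, {'id', 'customer_id'}),
  left_on='id',
  right_on='customer_id',
)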

Query Management

QueryPool

class pqg.QueryPool(queries: List[Query], query_structure: QueryStructure, sample_data: Dict[str, DataFrame], multi_processing: bool = True, with_status: bool = False)[source]

Manages a collection of database queries with execution and analysis capabilities.

Provides functionality for executing queries in parallel, filtering based on results, computing statistics, and managing query persistence. The pool maintains execution results to avoid redundant computation when performing multiple operations.

_queries

List of Query objects in the pool

_query_structure

Parameters controlling query generation

_sample_data

Dictionary mapping entity names to sample DataFrames

_results

Cached query execution results (DataFrame/Series, error message)

_with_status

Whether to display progress bars during operations

__iter__() → Iterator[Query][source]

Iterate over the queries in the pool.

__len__() → int[source]

Return the number of queries in the pool.

execute(force_execute: bool = False, num_processes: int | None = None) → List[Tuple[DataFrame | Series | None, str | None]][source]

Execute all queries against the sample data, either in parallel or sequentially.

Evaluates queries using either multiprocessing for parallel execution or sequential processing. Results are cached to avoid re-execution unless explicitly requested.

Parameters:
  • force_execute – If True, re-execute all queries even if results exist

  • num_processes – Number of parallel processes to use. Defaults to CPU count. Only used when multi_processing is True.

Returns:

List of tuples containing (result, error) for each query

filter(filter_type: QueryFilter, force_execute: bool = False) → None[source]

Filter queries based on their execution results.

Modifies the query pool in-place to keep only queries matching the filter criteria. Executes queries if results don’t exist.

Parameters:
  • filter_type – Criteria for keeping queries (NON_EMPTY, EMPTY, etc.)

  • force_execute – If True, re-execute queries before filtering

items() → Iterator[tuple[Query, Tuple[DataFrame | Series | None, str | None]]][source]

Iterate over query-result pairs.

Each iteration yields a tuple containing a query and its execution result. If results haven’t been computed yet, executes the queries first.

Yields:

tuple[Query, QueryResult] – Pairs of (query, (result, error))

results() → Iterator[Tuple[DataFrame | Series | None, str | None]][source]

Iterate over query results.

If results haven’t been computed yet, executes the queries first.

Yields:

QueryResult – Pairs of (result, error) for each query

save(output_file: str, create_dirs: bool = True) → None[source]

Save all queries to a file.

Each query is saved on a separate line in its string representation. Empty queries are filtered out and whitespace is trimmed.

Parameters:
  • output_file – Path to the output file

  • create_dirs – If True, creates parent directories if needed

sort() → None[source]

Sort queries by their complexity.

Orders queries based on their complexity score while maintaining the association between queries and their execution results if they exist.

statistics(force_execute: bool = False) → QueryStatistics[source]

Generate comprehensive statistics about the query pool.

Analyzes query characteristics and execution results to measure how well the generated queries match the target parameters.

Parameters:

force_execute – If True, re-execute queries before analysis

Returns:

Statistics comparing actual vs. target characteristics
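
Example

A hedged sketch of a typical pool workflow (continuing from the Generator example above):

from pqg import QueryFilter

pool.filter(QueryFilter.NON_EMPTY)  # keep only queries that return data
pool.sort()                         # order queries by complexity
print(pool.statistics())            # actual vs. target characteristics
pool.save('queries.txt')            # one query per line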

QueryFilter

class pqg.QueryFilter(value, names=None, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Enumeration of query filter options (e.g. NON_EMPTY, EMPTY), used with QueryPool.filter().

Statistics

class pqg.query_pool.ExecutionStatistics(successful: int = 0, failed: int = 0, non_empty: int = 0, empty: int = 0, errors: Counter[str] = <factory>)[source]

Statistics about query execution results.

This class tracks success/failure rates and categorizes execution outcomes to provide insights into query generation quality.

successful

Number of queries that executed without error

Type:

int

failed

Number of queries that raised an error

Type:

int

non_empty

Number of queries returning data (rows/values)

Type:

int

empty

Number of queries returning empty results

Type:

int

errors

Count of each error type encountered

Type:

Counter[str]

Example

stats = ExecutionStatistics()
stats.successful = 95
stats.failed = 5
stats.errors["KeyError"] = 3
print(stats)  # Shows execution summary with percentages

class pqg.query_pool.QueryStatistics(query_structure: QueryStructure, total_queries: int = 0, queries_with_selection: int = 0, queries_with_projection: int = 0, queries_with_groupby: int = 0, queries_with_merge: int = 0, selection_conditions: List[int] = <factory>, projection_columns: List[int] = <factory>, groupby_columns: List[int] = <factory>, merge_count: List[int] = <factory>, execution_results: ExecutionStatistics = <factory>)[source]

Statistics comparing actual query characteristics against target parameters.

Analyzes how closely generated queries match the desired structure parameters and collects distributions of various operation characteristics.

query_structure

Target parameters for query generation

Type:

pqg.query_structure.QueryStructure

total_queries

Total number of queries analyzed

Type:

int

queries_with_selection

Number of queries containing selection operations

Type:

int

queries_with_projection

Number of queries containing projection operations

Type:

int

queries_with_groupby

Number of queries containing groupby operations

Type:

int

queries_with_merge

Number of queries containing merge operations

Type:

int

selection_conditions

List of condition counts from each selection operation

Type:

List[int]

projection_columns

List of column counts from each projection operation

Type:

List[int]

groupby_columns

List of column counts from each groupby operation

Type:

List[int]

merge_count

List of merge counts from each query

Type:

List[int]

execution_results

Statistics about query execution outcomes

Type:

pqg.query_pool.ExecutionStatistics

Command Line Interface

The package provides a command-line interface through the pqg command. See the Quickstart Guide for common usage examples.

Arguments

The following command-line arguments are available:

--disable-multi-processing

Generate and execute queries sequentially

--ensure-non-empty

Only generate queries that return data

--filter {non-empty,empty,has-error,without-error}

Filter queries by execution results

--groupby-aggregation-probability FLOAT

Probability of including GROUP BY operations (default: 0.5)

--max-groupby-columns INT

Maximum columns in GROUP BY clauses (default: 5)

--max-merges INT

Maximum number of table joins (default: 2)

--max-projection-columns INT

Maximum columns to select (default: 5)

--max-selection-conditions INT

Maximum WHERE conditions (default: 5)

--multi-line

Format queries across multiple lines

--num-queries INT

Number of queries to generate [required]

--output-file FILE

Save queries to a file

--projection-probability FLOAT

Probability of including projections (default: 0.5)

--schema FILE

Schema definition file [required]

--selection-probability FLOAT

Probability of including selections (default: 0.5)

--sort

Sort queries by complexity

--verbose

Print detailed information
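
For example, a typical invocation might look like this (file names are illustrative):

pqg --num-queries 1000 --schema examples/schema.json --ensure-non-empty --sort --output-file queries.txt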

Error Handling

The following exceptions may be raised:

ValueError

Raised for invalid configurations or parameters

IOError

Raised for schema file read errors

Examples

See the Quickstart Guide for usage examples.