API Reference¶
This reference documents the key classes and functions in pqg.
Schema Components¶
Schema¶
- class pqg.Schema(entities: ~typing.Set[~pqg.entity.Entity] = <factory>)[source]¶
-
A dictionary-like class representing a database schema containing multiple entities.
The Schema class provides dictionary-style access to Entity objects, allowing for both iteration over entities and direct access to specific entities by name.
- __getitem__(key: str) Entity [source]¶
-
Get an entity by name using dictionary-style access.
- Parameters:
-
key (str) – The name of the entity to retrieve.
- Returns:
-
The requested entity.
- Return type:
Entity
- Raises:
-
KeyError – If no entity exists with the given name.
- __iter__() Iterator[Entity] [source]¶
-
Iterate over all entities in the schema.
- Returns:
-
Iterator yielding Entity objects.
- Return type:
-
Iterator[Entity]
- __len__() int [source]¶
-
Get the number of entities in the schema.
- Returns:
-
Total number of entities.
- Return type:
int
- static from_dict(data: dict) Schema [source]¶
-
Create a Schema instance from a dictionary.
Similar to from_file but accepts a dictionary directly instead of reading from a file. This is useful for creating schemas programmatically or when the schema definition is already in memory.
- Parameters:
-
data (dict) – Dictionary containing the schema configuration. Expected to have an ‘entities’ key mapping to entity configs.
- Returns:
-
A new Schema instance containing the entities defined in the dictionary.
- Return type:
Schema
- Raises:
-
ValueError – If the dictionary structure is invalid.
Example
schema_dict = {
  "entities": {
    "customer": {
      "primary_key": "id",
      "properties": {...},
      "foreign_keys": {...}
    }
  }
}
schema = Schema.from_dict(schema_dict)
- static from_file(path: str) Schema [source]¶
-
Create a Schema instance by loading entity configurations from a JSON file.
This method reads a JSON file containing entity configurations and creates a Schema object with Entity instances for each configured entity.
- Parameters:
-
path (str) – The file path to the JSON configuration file.
- Returns:
-
A new Schema instance containing the entities defined in the file.
- Return type:
Schema
- Raises:
-
-
json.JSONDecodeError – If the file contains invalid JSON.
-
FileNotFoundError – If the specified file does not exist.
-
PermissionError – If the file cannot be read due to permission issues.
-
Note
If the ‘entities’ key is not present in the JSON file, an empty Schema will be returned.
Example
schema = Schema.from_file('path/to/schema_config.json')
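The JSON file consumed by from_file mirrors the dictionary shape accepted by from_dict. A minimal standalone sketch of such a config, written to disk and read back with the standard library (the customer entity and its concrete property definitions are illustrative, not part of pqg):

```python
import json
import tempfile

# Illustrative schema config; property fields follow the PropertyInt and
# PropertyEnum attribute names documented below (type, min, max, values).
config = {
    "entities": {
        "customer": {
            "primary_key": "id",
            "properties": {
                "id": {"type": "int", "min": 1, "max": 1000},
                "status": {"type": "enum", "values": ["active", "inactive"]},
            },
            "foreign_keys": {},
        }
    }
}

# Write the config to disk; this file could then be passed to Schema.from_file(path).
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(config, f)
    path = f.name

with open(path) as f:
    loaded = json.load(f)

print(sorted(loaded["entities"]))  # ['customer']
```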
Entity¶
- class pqg.Entity(name: str, primary_key: str | List[str] | None, properties: Dict[str, PropertyInt | PropertyFloat | PropertyEnum | PropertyString | PropertyDate], foreign_keys: Dict[str, List[str]])[source]¶
-
Represents entity information parsed from schema files.
This class is used to generate well-formed and meaningful queries based on entity information and a high-level structure describing how queries should generally look (see QueryStructure).
- name¶
-
The name of the entity.
- Type:
str
- primary_key¶
-
The primary key(s) of the entity.
- Type:
-
str | t.List[str] | None
- properties¶
-
A dictionary of property names to their definitions.
- Type:
-
t.Dict[str, Property]
- foreign_keys¶
-
A dictionary of foreign key relationships.
- Type:
-
t.Dict[str, t.List[str]]
- __eq__(other: object) bool [source]¶
-
Compare this entity with another for equality.
Entities are considered equal if they have the same name, as names must be unique within a schema.
- Parameters:
-
other – The object to compare with.
- Returns:
-
True if the objects are equal, False otherwise.
- Return type:
bool
- __hash__() int [source]¶
-
Generate a hash based on the entity’s name.
Since entity names must be unique within a schema, using the name as the hash basis ensures proper hash table behavior.
- Returns:
-
Hash value for the entity.
- Return type:
int
- property data_ranges: Dict[str, Tuple[int, int] | List[str]]¶
-
Get the data ranges for all properties of the entity.
- Returns:
-
A dictionary mapping property names to their respective ranges or possible values.
- static from_configuration(name: str, config: Dict) Entity [source]¶
-
Create an Entity instance from a configuration dictionary.
- Parameters:
-
-
name (str) – The name of the entity.
-
config (t.Dict) – A dictionary containing entity configuration.
-
- Returns:
-
An instance of the Entity class.
- Return type:
Entity
- Raises:
-
ValueError – If an unknown property type is encountered.
- generate_dataframe(num_rows=1000) DataFrame [source]¶
-
Generate a Pandas dataframe using this entity’s information.
- Parameters:
-
num_rows (int) – The number of rows to generate. Default is 1000.
- Returns:
-
A dataframe populated with randomly generated data based on the entity’s properties.
- Return type:
pd.DataFrame
Note
If the entity has a unique primary key of type int, the number of rows may be limited to the range of possible values for that key.
- property has_unique_primary_key: bool¶
-
Check if the entity has a single, unique primary key.
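The note above about integer primary keys capping the row count can be pictured with a standalone sketch (this is not pqg's implementation; generate_dataframe handles the cap internally):

```python
import random

def generate_unique_ids(min_val: int, max_val: int, num_rows: int) -> list[int]:
    """Generate unique integer keys, capping num_rows at the size of the key range."""
    available = max_val - min_val + 1
    # An int key range of 100 possible values can yield at most 100 unique rows.
    n = min(num_rows, available)
    return random.sample(range(min_val, max_val + 1), n)

ids = generate_unique_ids(1, 100, 1000)
print(len(ids))  # 100: capped by the 100 possible key values
```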
Property Types¶
- class pqg.PropertyInt(min: int, max: int, type: str = 'int')[source]¶
-
Integer property definition with range constraints.
Used to define integer columns in entity schemas, providing valid value ranges for query generation.
- min¶
-
Minimum allowed value (inclusive)
- Type:
int
- max¶
-
Maximum allowed value (inclusive)
- Type:
int
- type¶
-
Always “int”, used for schema parsing
- Type:
str
- class pqg.PropertyFloat(min: float, max: float, type: str = 'float')[source]¶
-
Floating point property definition with range constraints.
Used to define decimal/float columns in entity schemas, providing valid value ranges for query generation.
- min¶
-
Minimum allowed value (inclusive)
- Type:
float
- max¶
-
Maximum allowed value (inclusive)
- Type:
float
- type¶
-
Always “float”, used for schema parsing
- Type:
str
- class pqg.PropertyEnum(values: List[str], type: str = 'enum')[source]¶
-
Enumeration property definition with allowed values.
Used to define columns with a fixed set of possible values in entity schemas, such as status fields or categories.
- values¶
-
List of valid string values for this property
- Type:
List[str]
- type¶
-
Always “enum”, used for schema parsing
- Type:
str
Example
status = PropertyEnum(values=["active", "inactive", "pending"])
- class pqg.PropertyString(starting_character: List[str], type: str = 'string')[source]¶
-
String property definition with character constraints.
Used to define text columns in entity schemas, with optional constraints on what characters strings can start with. This allows generating realistic looking text data.
- starting_character¶
-
List of valid first characters for generated strings
- Type:
List[str]
- type¶
-
Always “string”, used for schema parsing
- Type:
str
Example
# Names starting with uppercase letters
name = PropertyString(starting_character=list(string.ascii_uppercase))
- class pqg.PropertyDate(min: date, max: date, type: str = 'date')[source]¶
-
Date property definition with range constraints.
Used to define date columns in entity schemas, providing valid date ranges for query generation. Generates dates between the min and max values inclusive.
- min¶
-
Earliest allowed date
- Type:
datetime.date
- max¶
-
Latest allowed date
- Type:
datetime.date
- type¶
-
Always “date”, used for schema parsing
- Type:
str
Example
# Dates in the year 2023
created_at = PropertyDate(
  min=date(2023, 1, 1), max=date(2023, 12, 31)
)
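How these range-constrained property types translate into generated values can be sketched independently of pqg (the helper functions below are illustrative, not the library's internals):

```python
import random
from datetime import date, timedelta

random.seed(0)  # deterministic for the sake of the example

def random_int(min_val: int, max_val: int) -> int:
    return random.randint(min_val, max_val)   # PropertyInt: inclusive range

def random_enum(values: list[str]) -> str:
    return random.choice(values)              # PropertyEnum: one of the allowed values

def random_date(min_date: date, max_date: date) -> date:
    span = (max_date - min_date).days         # PropertyDate: inclusive range
    return min_date + timedelta(days=random.randint(0, span))

d = random_date(date(2023, 1, 1), date(2023, 12, 31))
print(date(2023, 1, 1) <= d <= date(2023, 12, 31))  # True
```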
Query Generation¶
Generator¶
- class pqg.Generator(schema: Schema, query_structure: QueryStructure, with_status: bool = False)[source]¶
-
Generator for creating pools of pandas DataFrame queries.
This class handles the generation of valid pandas DataFrame queries based on a provided schema and query structure parameters. It supports both parallel and sequential query generation with optional progress tracking.
The generator can ensure that queries produce non-empty results by retrying failed generations, and supports formatting queries in both single-line and multi-line styles.
- schema¶
-
Schema defining the database structure and relationships
- query_structure¶
-
Parameters controlling query complexity and features
- sample_data¶
-
Dictionary mapping entity names to sample DataFrames
- with_status¶
-
Whether to display progress bars during operations
- generate(options: GenerateOptions) QueryPool [source]¶
-
Generate a pool of queries using parallel or sequential processing.
This method creates multiple queries according to the specified options, either concurrently using a process pool or sequentially. Progress is tracked with a progress bar when with_status is True.
- Parameters:
-
options – Configuration options controlling generation behavior
- Returns:
-
QueryPool containing the generated queries and sample data
Note
When using parallel processing, the progress bar accurately tracks completion across all processes. The resulting QueryPool contains all successfully generated queries in an arbitrary order.
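The "ensure non-empty" behavior mentioned above amounts to a generate-and-retry loop. A simplified standalone sketch (gen_once below is a stand-in for pqg's internal per-query generation and execution, not its actual API):

```python
import random

random.seed(1)

def gen_once() -> int:
    """Stand-in for generating and executing one query; returns a row count."""
    return random.choice([0, 0, 5])  # some "queries" come back empty

def generate_non_empty(max_retries: int = 100) -> int:
    """Retry until a query yields rows, mirroring ensure_non_empty."""
    for _ in range(max_retries):
        rows = gen_once()
        if rows > 0:
            return rows
    raise RuntimeError("no non-empty query found")

print(generate_non_empty() > 0)  # True
```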
GenerateOptions¶
- class pqg.GenerateOptions(ensure_non_empty: bool = False, multi_line: bool = False, multi_processing: bool = True, num_queries: int = 1000)[source]¶
-
Configuration options for controlling query generation behavior.
This class provides settings that determine how queries are generated and validated, including performance options like parallel processing.
- ensure_non_empty¶
-
If True, only generate queries that return data
- Type:
bool
- multi_line¶
-
If True, format queries with line breaks for readability
- Type:
bool
- multi_processing¶
-
If True, generate queries in parallel
- Type:
bool
- num_queries¶
-
Total number of queries to generate
- Type:
int
Example
options = GenerateOptions(
  ensure_non_empty=True, num_queries=1000, multi_processing=True
)
generator.generate(options)
- static from_args(arguments: Arguments) GenerateOptions [source]¶
-
Create GenerateOptions from command-line arguments.
- Parameters:
-
arguments – Parsed command-line arguments
- Returns:
-
GenerateOptions configured according to provided arguments
QueryStructure¶
- class pqg.QueryStructure(groupby_aggregation_probability: float, max_groupby_columns: int, max_merges: int, max_projection_columns: int, max_selection_conditions: int, projection_probability: float, selection_probability: float)[source]¶
-
Configuration parameters controlling query generation behavior.
This class encapsulates the probability and limit settings that determine what kinds of queries are generated. It controls aspects like how likely different operations are to appear and how complex they can be.
- groupby_aggregation_probability¶
-
Probability (0-1) of including a GROUP BY operation
- Type:
float
- max_groupby_columns¶
-
Maximum number of columns that can be grouped on
- Type:
int
- max_merges¶
-
Maximum number of table joins allowed in a query
- Type:
int
- max_projection_columns¶
-
Maximum number of columns that can be selected
- Type:
int
- max_selection_conditions¶
-
Maximum number of WHERE clause conditions
- Type:
int
- projection_probability¶
-
Probability (0-1) of including a SELECT operation
- Type:
float
- selection_probability¶
-
Probability (0-1) of including a WHERE operation
- Type:
float
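The probability parameters above gate whether each operation appears in a given generated query. A minimal sketch of that coin-flip logic (not pqg's actual generator):

```python
import random

random.seed(42)

def choose_operations(selection_probability: float,
                      projection_probability: float,
                      groupby_aggregation_probability: float) -> list[str]:
    """Decide which operations a single query will include."""
    ops = []
    if random.random() < selection_probability:
        ops.append("selection")
    if random.random() < projection_probability:
        ops.append("projection")
    if random.random() < groupby_aggregation_probability:
        ops.append("groupby")
    return ops

# With all probabilities at 1.0 every operation is always included.
print(choose_operations(1.0, 1.0, 1.0))  # ['selection', 'projection', 'groupby']
```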
- static from_args(arguments: Arguments) QueryStructure [source]¶
-
Create a QueryStructure instance from command-line arguments.
- Parameters:
-
arguments – Instance of Arguments containing parsed command-line parameters
- Returns:
-
Instance configured according to the provided arguments
- Return type:
QueryStructure
Query Components¶
Query¶
- class pqg.Query(entity: str, operations: List[Operation], multi_line: bool, columns: Set[str])[source]¶
-
Represents a complete database query with tracking for query complexity.
A query consists of a target entity and a sequence of operations to be applied to that entity. Query complexity is determined primarily by the number of merge operations and their nesting depth.
- entity¶
-
The name of the target entity.
- Type:
str
- multi_line¶
-
Whether to format output across multiple lines.
- Type:
bool
- columns¶
-
Columns available for operations.
- Type:
Set[str]
- __eq__(other: object) bool [source]¶
-
Equality comparison based on complexity and string representation.
- __lt__(other: object) bool [source]¶
-
Less than comparison based on complexity and string representation.
- property complexity: int¶
-
Calculate query complexity based on all operations and their details.
Complexity is determined by:
-
Base complexity: the total number of operations
-
Merge complexity: each merge adds a weight of 3 (more complex than other operations), plus additional complexity from nested queries
-
Selection complexity: the number of conditions in each selection
-
Projection complexity: the number of columns being projected
-
GroupBy complexity: the number of grouping columns plus the weight of the aggregation
- Returns:
-
Complexity score for the query
- Return type:
int
-
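The scoring rules above can be condensed into a small standalone function. The merge weight of 3 is stated in the description; the +1 aggregation weight and the (kind, detail) operation encoding are illustrative assumptions, not pqg's internal types:

```python
def complexity(operations: list) -> int:
    """Score a query from a list of (kind, detail) pairs.

    kind is one of 'selection', 'projection', 'groupby', 'merge';
    detail is a count (conditions/columns) or, for merges, a nested operation list.
    """
    score = len(operations)  # base complexity: one point per operation
    for kind, detail in operations:
        if kind == "merge":
            score += 3 + complexity(detail)   # merge weight plus its nested query
        elif kind == "selection":
            score += detail                   # number of conditions
        elif kind == "projection":
            score += detail                   # number of projected columns
        elif kind == "groupby":
            score += detail + 1               # grouping columns plus aggregation weight (assumed 1)
    return score

# A selection with 2 conditions, then a merge whose right side projects 3 columns.
print(complexity([("selection", 2), ("merge", [("projection", 3)])]))  # 11
```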
- format_multi_line(start_counter: int = 1) Tuple[str, int] [source]¶
-
Format the query across multiple lines for better readability.
Transforms the query into a sequence of DataFrame operations where each operation is assigned to a numbered DataFrame variable (df1, df2, etc.). Handles nested operations by recursively formatting sub-queries and maintaining proper DataFrame references.
- Parameters:
-
start_counter (int): Initial counter value for DataFrame numbering. Defaults to 1.
- Returns:
-
- Tuple[str, int]: A tuple containing:
-
-
The formatted multi-line query string
-
The final counter value after processing all operations
-
- Example:
-
For a query with multiple operations, might return:
("df1 = customer[customer['age'] >= 25]\ndf2 = df1.merge(orders, left_on='id', right_on='customer_id')", 3)
- property merge_count: int¶
-
Count the total number of merge operations in the query, including nested merges.
- Returns:
-
Total number of merge operations
- Return type:
int
- property merge_entities: Set[str]¶
-
Get the set of all entities involved in this query, including the base entity and all merged entities.
This property maintains a complete picture of table dependencies by tracking:
-
The base entity of the query
-
All entities that have been merged directly into this query
-
All entities that have been merged into sub-queries (nested merges)
The tracking helps prevent:
-
Circular dependencies (e.g., orders → customers → orders)
-
Redundant joins (e.g., merging the same table multiple times)
-
Invalid join paths
- Returns:
-
A set of entity names (table names) that are part of this query’s join graph. Includes both the base entity and all merged entities.
- Return type:
Set[str]
Operations¶
- class pqg.Selection(conditions: ~typing.List[~typing.Tuple[str, str, ~typing.Any, str | None]] = <factory>)[source]¶
-
Represents a selection operation in a query.
- conditions¶
-
List of selection conditions and operators. Each tuple contains (column, operation, value, next_condition_operator). The last tuple's next_condition_operator is ignored.
- Type:
-
List[Tuple[str, str, Any, str | None]]
- class pqg.Projection(columns: ~typing.List[str] = <factory>)[source]¶
-
Represents a projection operation in a query.
- columns¶
-
A list of column names to project.
- Type:
-
List[str]
- class pqg.Merge(right: pqg.merge.Query, left_on: str, right_on: str)[source]¶
-
- apply(entity: str) str [source]¶
-
Apply the operation to the given entity name.
- Parameters:
-
entity – Name of the entity (table) to apply the operation to.
- Returns:
-
A pandas query string fragment representing this operation. For example: “.groupby([‘col’])” or “[[‘col1’, ‘col2’]]”
- property entities: Set[str]¶
-
Get the set of all entities involved in the right side of this merge operation.
This property provides a complete view of all tables involved in the right-hand side of the merge, including:
-
The immediate right entity being merged
-
Any entities that have been merged into the right entity through nested merges
The difference between this and Query.merge_entities is that this property only tracks entities from the right side of the merge operation, while merge_entities includes the base entity and all merged entities from both sides.
- Returns:
-
A set of entity names that are part of the right-hand side of this merge operation’s join graph.
- Return type:
Set[str]
Query Management¶
QueryPool¶
- class pqg.QueryPool(queries: List[Query], query_structure: QueryStructure, sample_data: Dict[str, DataFrame], multi_processing: bool = True, with_status: bool = False)[source]¶
-
Manages a collection of database queries with execution and analysis capabilities.
Provides functionality for executing queries in parallel, filtering based on results, computing statistics, and managing query persistence. The pool maintains execution results to avoid redundant computation when performing multiple operations.
- _queries¶
List of Query objects in the pool
- _query_structure¶
Parameters controlling query generation
- _sample_data¶
-
Dictionary mapping entity names to sample DataFrames
- _results¶
-
Cached query execution results (DataFrame/Series, error message)
- _with_status¶
-
Whether to display progress bars during operations
- execute(force_execute: bool = False, num_processes: int | None = None) List[Tuple[DataFrame | Series | None, str | None]] [source]¶
-
Execute all queries against the sample data, either in parallel or sequentially.
Evaluates queries using either multiprocessing for parallel execution or sequential processing. Results are cached to avoid re-execution unless explicitly requested.
- Parameters:
-
-
force_execute – If True, re-execute all queries even if results exist
-
num_processes – Number of parallel processes to use. Defaults to CPU count. Only used when multi_processing is True.
-
- Returns:
-
List of tuples containing (result, error) for each query
- filter(filter_type: QueryFilter, force_execute: bool = False) None [source]¶
-
Filter queries based on their execution results.
Modifies the query pool in-place to keep only queries matching the filter criteria. Executes queries if results don’t exist.
- Parameters:
-
-
filter_type – Criteria for keeping queries (NON_EMPTY, EMPTY, etc.)
-
force_execute – If True, re-execute queries before filtering
-
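Filtering reduces to classifying each cached (result, error) pair. A standalone sketch of that classification (the filter names follow the CLI's non-empty/empty/has-error/without-error choices documented below; the helper itself is illustrative, not pqg's implementation):

```python
import pandas as pd

def matches(result, error, filter_type: str) -> bool:
    """Decide whether one (result, error) pair survives the filter."""
    if filter_type == "has-error":
        return error is not None
    if filter_type == "without-error":
        return error is None
    empty = result is None or (hasattr(result, "empty") and result.empty)
    if filter_type == "non-empty":
        return error is None and not empty
    if filter_type == "empty":
        return error is None and empty
    raise ValueError(f"unknown filter: {filter_type}")

print(matches(pd.DataFrame({"a": [1]}), None, "non-empty"))  # True
print(matches(pd.DataFrame(), None, "empty"))                # True
```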
- items() Iterator[tuple[Query, Tuple[DataFrame | Series | None, str | None]]] [source]¶
-
Iterate over query-result pairs.
Each iteration yields a tuple containing a query and its execution result. If results haven’t been computed yet, executes the queries first.
- Yields:
-
tuple[Query, QueryResult] – Pairs of (query, (result, error))
- results() Iterator[Tuple[DataFrame | Series | None, str | None]] [source]¶
-
Iterate over query results.
If results haven’t been computed yet, executes the queries first.
- Yields:
-
QueryResult – Pairs of (result, error) for each query
- save(output_file: str, create_dirs: bool = True) None [source]¶
-
Save all queries to a file.
Each query is saved on a separate line in its string representation. Empty queries are filtered out and whitespace is trimmed.
- Parameters:
-
-
output_file – Path to the output file
-
create_dirs – If True, creates parent directories if needed
-
- sort() None [source]¶
-
Sort queries by their complexity.
Orders queries based on their complexity score while maintaining the association between queries and their execution results if they exist.
- statistics(force_execute: bool = False) QueryStatistics [source]¶
-
Generate comprehensive statistics about the query pool.
Analyzes query characteristics and execution results to measure how well the generated queries match the target parameters.
- Parameters:
-
force_execute – If True, re-execute queries before analysis
- Returns:
-
Statistics comparing actual vs. target characteristics
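Comparing actual query characteristics against the target parameters boils down to frequency counts; a minimal standalone sketch of that comparison (illustrative, not pqg's internals):

```python
def operation_frequency(queries_with_op: int, total_queries: int) -> float:
    """Observed fraction of queries containing an operation, to be compared
    against the corresponding target probability in QueryStructure."""
    return queries_with_op / total_queries if total_queries else 0.0

# e.g. 480 of 1000 queries contained a selection vs. a 0.5 selection_probability target
print(operation_frequency(480, 1000))  # 0.48
```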
QueryFilter¶
Statistics¶
- class pqg.query_pool.ExecutionStatistics(successful: int = 0, failed: int = 0, non_empty: int = 0, empty: int = 0, errors: ~collections.Counter[str] = <factory>)[source]¶
-
Statistics about query execution results.
This class tracks success/failure rates and categorizes execution outcomes to provide insights into query generation quality.
- successful¶
-
Number of queries that executed without error
- Type:
int
- failed¶
-
Number of queries that raised an error
- Type:
int
- non_empty¶
-
Number of queries returning data (rows/values)
- Type:
int
- empty¶
-
Number of queries returning empty results
- Type:
int
- errors¶
-
Count of each error type encountered
- Type:
Counter[str]
Example
stats = ExecutionStatistics()
stats.successful = 95
stats.failed = 5
stats.errors["KeyError"] = 3
print(stats)  # Shows execution summary with percentages
- class pqg.query_pool.QueryStatistics(query_structure: ~pqg.query_structure.QueryStructure, total_queries: int = 0, queries_with_selection: int = 0, queries_with_projection: int = 0, queries_with_groupby: int = 0, queries_with_merge: int = 0, selection_conditions: ~typing.List[int] = <factory>, projection_columns: ~typing.List[int] = <factory>, groupby_columns: ~typing.List[int] = <factory>, merge_count: ~typing.List[int] = <factory>, execution_results: ~pqg.query_pool.ExecutionStatistics = <factory>)[source]¶
-
Statistics comparing actual query characteristics against target parameters.
Analyzes how closely generated queries match the desired structure parameters and collects distributions of various operation characteristics.
- query_structure¶
-
Target parameters for query generation
- total_queries¶
-
Total number of queries analyzed
- Type:
int
- queries_with_selection¶
-
Number of queries containing selection operations
- Type:
int
- queries_with_projection¶
-
Number of queries containing projection operations
- Type:
int
- queries_with_groupby¶
-
Number of queries containing groupby operations
- Type:
int
- queries_with_merge¶
-
Number of queries containing merge operations
- Type:
int
- selection_conditions¶
-
List of condition counts from each selection operation
- Type:
List[int]
- projection_columns¶
-
List of column counts from each projection operation
- Type:
List[int]
- groupby_columns¶
-
List of column counts from each groupby operation
- Type:
List[int]
- merge_count¶
-
List of merge counts from each query
- Type:
List[int]
- execution_results¶
-
Statistics about query execution outcomes
Command Line Interface¶
The package provides a command-line interface through the
pqg
command. See the
Quickstart Guide
for common usage examples.
Arguments¶
The following command-line arguments are available:
- --disable-multi-processing
Generate and execute queries sequentially
- --ensure-non-empty
Only generate queries that return data
- --filter {non-empty,empty,has-error,without-error}
Filter queries by execution results
- --groupby-aggregation-probability FLOAT
-
Probability of including GROUP BY operations (default: 0.5)
- --max-groupby-columns INT
-
Maximum columns in GROUP BY clauses (default: 5)
- --max-merges INT
Maximum number of table joins (default: 2)
- --max-projection-columns INT
Maximum columns to select (default: 5)
- --max-selection-conditions INT
Maximum WHERE conditions (default: 5)
- --multi-line
Format queries across multiple lines
- --num-queries INT
Number of queries to generate [required]
- --output-file FILE
Save queries to a file
- --projection-probability FLOAT
-
Probability of including projections (default: 0.5)
- --schema FILE
Schema definition file [required]
- --selection-probability FLOAT
-
Probability of including selections (default: 0.5)
- --sort
Sort queries by complexity
- --verbose
Print detailed information
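A typical invocation combining the required flags with a few common options (the schema and output file paths here are placeholders):

```shell
pqg --schema examples/schema.json \
    --num-queries 1000 \
    --ensure-non-empty \
    --sort \
    --output-file queries.txt
```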
Error Handling¶
The following exceptions may be raised:
- ValueError
-
Raised for invalid configurations or parameters
- IOError
Raised for schema file read errors
Examples¶
See the Quickstart Guide for usage examples.