API Reference¶
This reference documents the key classes and functions in pqg.
Schema Components¶
Schema¶
- class pqg.Schema(entities: ~typing.Set[~pqg.entity.Entity] = <factory>)[source]¶
-
A dictionary-like class representing a database schema containing multiple entities.
The Schema class provides dictionary-style access to Entity objects, allowing for both iteration over entities and direct access to specific entities by name.
- __getitem__(key: str) Entity [source]¶
-
Get an entity by name using dictionary-style access.
- Parameters:
-
key (str) – The name of the entity to retrieve.
- Returns:
-
The requested entity.
- Return type:
Entity
- Raises:
-
KeyError – If no entity exists with the given name.
- __iter__() Iterator[Entity] [source]¶
-
Iterate over all entities in the schema.
- Returns:
-
Iterator yielding Entity objects.
- Return type:
-
Iterator[Entity]
- __len__() int [source]¶
-
Get the number of entities in the schema.
- Returns:
-
Total number of entities.
- Return type:
int
- static from_dict(data: dict) Schema [source]¶
-
Create a Schema instance from a dictionary.
Similar to from_file but accepts a dictionary directly instead of reading from a file. This is useful for creating schemas programmatically or when the schema definition is already in memory.
- Parameters:
-
data (dict) – Dictionary containing the schema configuration. Expected to have an ‘entities’ key mapping to entity configs.
- Returns:
-
A new Schema instance containing the entities defined in the dictionary.
- Return type:
Schema
- Raises:
-
ValueError – If the dictionary structure is invalid.
Example
schema_dict = {
  "entities": {
    "customer": {
      "primary_key": "id",
      "properties": {...},
      "foreign_keys": {...}
    }
  }
}
schema = Schema.from_dict(schema_dict)
- static from_file(path: str) Schema [source]¶
-
Create a Schema instance by loading entity configurations from a JSON file.
This method reads a JSON file containing entity configurations and creates a Schema object with Entity instances for each configured entity.
- Parameters:
-
path (str) – The file path to the JSON configuration file.
- Returns:
-
A new Schema instance containing the entities defined in the file.
- Return type:
Schema
- Raises:
-
-
json.JSONDecodeError – If the file contains invalid JSON.
-
FileNotFoundError – If the specified file does not exist.
-
PermissionError – If the file cannot be read due to permission issues.
-
Note
If the ‘entities’ key is not present in the JSON file, an empty Schema will be returned.
Example
schema = Schema.from_file('path/to/schema_config.json')
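The JSON file consumed by from_file mirrors the dictionary shape accepted by from_dict. A minimal standalone sketch of such a config, written to disk and read back with the standard library (the customer entity and its concrete property definitions are illustrative, not part of pqg):

```python
import json
import tempfile

# Illustrative schema config; property fields follow the PropertyInt and
# PropertyEnum attribute names documented below (type, min, max, values).
config = {
    "entities": {
        "customer": {
            "primary_key": "id",
            "properties": {
                "id": {"type": "int", "min": 1, "max": 1000},
                "status": {"type": "enum", "values": ["active", "inactive"]},
            },
            "foreign_keys": {},
        }
    }
}

# Write the config to disk; this file could then be passed to Schema.from_file(path).
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(config, f)
    path = f.name

with open(path) as f:
    loaded = json.load(f)

print(sorted(loaded["entities"]))  # ['customer']
```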
Entity¶
- class pqg.Entity(name: str, primary_key: str | List[str] | None, properties: Dict[str, PropertyInt | PropertyFloat | PropertyEnum | PropertyString | PropertyDate], foreign_keys: Dict[str, List[str]])[source]¶
-
Represents entity information parsed from schema files.
This class is used to generate well-formed and meaningful queries based on entity information and a high-level structure describing how queries should generally look (see QueryStructure).
- name¶
-
The name of the entity.
- Type:
str
- primary_key¶
-
The primary key(s) of the entity.
- Type:
-
str | t.List[str] | None
- properties¶
-
A dictionary of property names to their definitions.
- Type:
-
t.Dict[str, Property]
- foreign_keys¶
-
A dictionary of foreign key relationships.
- Type:
-
t.Dict[str, t.List[str]]
- __eq__(other: object) bool [source]¶
-
Compare this entity with another for equality.
Entities are considered equal if they have the same name, as names must be unique within a schema.
- Parameters:
-
other – The object to compare with.
- Returns:
-
True if the objects are equal, False otherwise.
- Return type:
bool
- __hash__() int [source]¶
-
Generate a hash based on the entity’s name.
Since entity names must be unique within a schema, using the name as the hash basis ensures proper hash table behavior.
- Returns:
-
Hash value for the entity.
- Return type:
int
- property data_ranges: Dict[str, Tuple[int, int] | List[str]]¶
-
Get the data ranges for all properties of the entity.
- Returns:
-
A dictionary mapping property names to their respective ranges or possible values.
- static from_configuration(name: str, config: Dict) Entity [source]¶
-
Create an Entity instance from a configuration dictionary.
- Parameters:
-
-
name (str) – The name of the entity.
-
config (t.Dict) – A dictionary containing entity configuration.
-
- Returns:
-
An instance of the Entity class.
- Return type:
Entity
- Raises:
-
ValueError – If an unknown property type is encountered.
- generate_dataframe(num_rows=1000) DataFrame [source]¶
-
Generate a Pandas dataframe using this entity’s information.
- Parameters:
-
num_rows (int) – The number of rows to generate. Default is 1000.
- Returns:
-
A dataframe populated with randomly generated data based on the entity’s properties.
- Return type:
pd.DataFrame
Note
If the entity has a unique primary key of type int, the number of rows may be limited to the range of possible values for that key.
- property has_unique_primary_key: bool¶
-
Check if the entity has a single, unique primary key.
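The note above about integer primary keys capping the row count can be pictured with a standalone sketch (this is not pqg's implementation; generate_dataframe handles the cap internally):

```python
import random

def generate_unique_ids(min_val: int, max_val: int, num_rows: int) -> list[int]:
    """Generate unique integer keys, capping num_rows at the size of the key range."""
    available = max_val - min_val + 1
    # An int key range of 100 possible values can yield at most 100 unique rows.
    n = min(num_rows, available)
    return random.sample(range(min_val, max_val + 1), n)

ids = generate_unique_ids(1, 100, 1000)
print(len(ids))  # 100: capped by the 100 possible key values
```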
Property Types¶
- class pqg.PropertyInt(min: int, max: int, type: str = 'int')[source]¶
-
Integer property definition with range constraints.
Used to define integer columns in entity schemas, providing valid value ranges for query generation.
- min¶
-
Minimum allowed value (inclusive)
- Type:
int
- max¶
-
Maximum allowed value (inclusive)
- Type:
int
- type¶
-
Always “int”, used for schema parsing
- Type:
str
- class pqg.PropertyFloat(min: float, max: float, type: str = 'float')[source]¶
-
Floating point property definition with range constraints.
Used to define decimal/float columns in entity schemas, providing valid value ranges for query generation.
- min¶
-
Minimum allowed value (inclusive)
- Type:
float
- max¶
-
Maximum allowed value (inclusive)
- Type:
float
- type¶
-
Always “float”, used for schema parsing
- Type:
str
- class pqg.PropertyEnum(values: List[str], type: str = 'enum')[source]¶
-
Enumeration property definition with allowed values.
Used to define columns with a fixed set of possible values in entity schemas, such as status fields or categories.
- values¶
-
List of valid string values for this property
- Type:
List[str]
- type¶
-
Always “enum”, used for schema parsing
- Type:
str
Example
status = PropertyEnum(values=["active", "inactive", "pending"])
- class pqg.PropertyString(starting_character: List[str], type: str = 'string')[source]¶
-
String property definition with character constraints.
Used to define text columns in entity schemas, with optional constraints on what characters strings can start with. This allows generating realistic looking text data.
- starting_character¶
-
List of valid first characters for generated strings
- Type:
List[str]
- type¶
-
Always “string”, used for schema parsing
- Type:
str
Example
# Names starting with uppercase letters
name = PropertyString(starting_character=list(string.ascii_uppercase))
- class pqg.PropertyDate(min: date, max: date, type: str = 'date')[source]¶
-
Date property definition with range constraints.
Used to define date columns in entity schemas, providing valid date ranges for query generation. Generates dates between the min and max values inclusive.
- min¶
-
Earliest allowed date
- Type:
datetime.date
- max¶
-
Latest allowed date
- Type:
datetime.date
- type¶
-
Always “date”, used for schema parsing
- Type:
str
Example
# Dates in the year 2023
created_at = PropertyDate(
  min=date(2023, 1, 1), max=date(2023, 12, 31)
)
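How these range-constrained property types translate into generated values can be sketched independently of pqg (the helper functions below are illustrative, not the library's internals):

```python
import random
from datetime import date, timedelta

random.seed(0)  # deterministic for the sake of the example

def random_int(min_val: int, max_val: int) -> int:
    return random.randint(min_val, max_val)   # PropertyInt: inclusive range

def random_enum(values: list[str]) -> str:
    return random.choice(values)              # PropertyEnum: one of the allowed values

def random_date(min_date: date, max_date: date) -> date:
    span = (max_date - min_date).days         # PropertyDate: inclusive range
    return min_date + timedelta(days=random.randint(0, span))

d = random_date(date(2023, 1, 1), date(2023, 12, 31))
print(date(2023, 1, 1) <= d <= date(2023, 12, 31))  # True
```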
Query Generation¶
Generator¶
- class pqg.Generator(schema: Schema, query_structure: QueryStructure, with_status: bool = False)[source]¶
-
Generator for creating pools of pandas DataFrame queries.
This class handles the generation of valid pandas DataFrame queries based on a provided schema and query structure parameters. It supports both parallel and sequential query generation with optional progress tracking.
The generator can ensure that queries produce non-empty results by retrying failed generations, and supports formatting queries in both single-line and multi-line styles.
- schema¶
-
Schema defining the database structure and relationships
- query_structure¶
-
Parameters controlling query complexity and features
- sample_data¶
-
Dictionary mapping entity names to sample DataFrames
- with_status¶
-
Whether to display progress bars during operations
- generate(options: GenerateOptions) QueryPool [source]¶
-
Generate a pool of queries using parallel or sequential processing.
This method creates multiple queries according to the specified options, either concurrently using a process pool or sequentially. Progress is tracked with a progress bar when with_status is True.
- Parameters:
-
options – Configuration options controlling generation behavior
- Returns:
-
QueryPool containing the generated queries and sample data
Note
When using parallel processing, the progress bar accurately tracks completion across all processes. The resulting QueryPool contains all successfully generated queries in an arbitrary order.
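The "ensure non-empty" behavior mentioned above amounts to a generate-and-retry loop. A simplified standalone sketch (gen_once below is a stand-in for pqg's internal per-query generation and execution, not its actual API):

```python
import random

random.seed(1)

def gen_once() -> int:
    """Stand-in for generating and executing one query; returns a row count."""
    return random.choice([0, 0, 5])  # some "queries" come back empty

def generate_non_empty(max_retries: int = 100) -> int:
    """Retry until a query yields rows, mirroring ensure_non_empty."""
    for _ in range(max_retries):
        rows = gen_once()
        if rows > 0:
            return rows
    raise RuntimeError("no non-empty query found")

print(generate_non_empty() > 0)  # True
```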
GenerateOptions¶
- class pqg.GenerateOptions(ensure_non_empty: bool = False, multi_line: bool = False, multi_processing: bool = True, num_queries: int = 1000)[source]¶
-
Configuration options for controlling query generation behavior.
This class provides settings that determine how queries are generated and validated, including performance options like parallel processing.
- ensure_non_empty¶
-
If True, only generate queries that return data
- Type:
bool
- multi_line¶
-
If True, format queries with line breaks for readability
- Type:
bool
- multi_processing¶
-
If True, generate queries in parallel
- Type:
bool
- num_queries¶
-
Total number of queries to generate
- Type:
int
Example
options = GenerateOptions(
  ensure_non_empty=True, num_queries=1000, multi_processing=True
)
generator.generate(options)
- static from_args(arguments: Arguments) GenerateOptions [source]¶
-
Create GenerateOptions from command-line arguments.
- Parameters:
-
arguments – Parsed command-line arguments
- Returns:
-
GenerateOptions configured according to provided arguments
QueryStructure¶
- class pqg.QueryStructure(groupby_aggregation_probability: float, max_groupby_columns: int, max_merges: int, max_projection_columns: int, max_selection_conditions: int, projection_probability: float, selection_probability: float)[source]¶
-
Configuration parameters controlling query generation behavior.
This class encapsulates the probability and limit settings that determine what kinds of queries are generated. It controls aspects like how likely different operations are to appear and how complex they can be.
- groupby_aggregation_probability¶
-
Probability (0-1) of including a GROUP BY operation
- Type:
float
- max_groupby_columns¶
-
Maximum number of columns that can be grouped on
- Type:
int
- max_merges¶
-
Maximum number of table joins allowed in a query
- Type:
int
- max_projection_columns¶
-
Maximum number of columns that can be selected
- Type:
int
- max_selection_conditions¶
-
Maximum number of WHERE clause conditions
- Type:
int
- projection_probability¶
-
Probability (0-1) of including a SELECT operation
- Type:
float
- selection_probability¶
-
Probability (0-1) of including a WHERE operation
- Type:
float
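The probability parameters above gate whether each operation appears in a given generated query. A minimal sketch of that coin-flip logic (not pqg's actual generator):

```python
import random

random.seed(42)

def choose_operations(selection_probability: float,
                      projection_probability: float,
                      groupby_aggregation_probability: float) -> list[str]:
    """Decide which operations a single query will include."""
    ops = []
    if random.random() < selection_probability:
        ops.append("selection")
    if random.random() < projection_probability:
        ops.append("projection")
    if random.random() < groupby_aggregation_probability:
        ops.append("groupby")
    return ops

# With all probabilities at 1.0 every operation is always included.
print(choose_operations(1.0, 1.0, 1.0))  # ['selection', 'projection', 'groupby']
```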
- static from_args(arguments: Arguments) QueryStructure [source]¶
-
Create a QueryStructure instance from command-line arguments.
- Parameters:
-
arguments – Instance of Arguments containing parsed command-line parameters
- Returns:
-
Instance configured according to the provided arguments
- Return type:
QueryStructure
Query Components¶
Query¶
- class pqg.Query(entity: str, operations: List[Operation], multi_line: bool, columns: Set[str])[source]¶
-
Represents a complete database query with tracking for query complexity.
A query consists of a target entity and a sequence of operations to be applied to that entity. Query complexity is determined primarily by the number of merge operations and their nesting depth.
- entity¶
-
The name of the target entity.
- Type:
str
- multi_line¶
-
Whether to format output across multiple lines.
- Type:
bool
- columns¶
-
Columns available for operations.
- Type:
Set[str]
- __eq__(other: object) bool [source]¶
-
Equality comparison based on complexity and string representation.
- __lt__(other: object) bool [source]¶
-
Less than comparison based on complexity and string representation.
- property complexity: int¶
-
Calculate query complexity based on all operations and their details.
Complexity is determined by:
-
Base complexity: the total number of operations
-
Merge complexity: each merge adds a weight of 3 (more complex than other operations), plus additional complexity from nested queries
-
Selection complexity: the number of conditions in each selection
-
Projection complexity: the number of columns being projected
-
GroupBy complexity: the number of grouping columns plus the weight of the aggregation
- Returns:
-
Complexity score for the query
- Return type:
int
-
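The scoring rules above can be condensed into a small standalone function. The merge weight of 3 is stated in the description; the +1 aggregation weight and the (kind, detail) operation encoding are illustrative assumptions, not pqg's internal types:

```python
def complexity(operations: list) -> int:
    """Score a query from a list of (kind, detail) pairs.

    kind is one of 'selection', 'projection', 'groupby', 'merge';
    detail is a count (conditions/columns) or, for merges, a nested operation list.
    """
    score = len(operations)  # base complexity: one point per operation
    for kind, detail in operations:
        if kind == "merge":
            score += 3 + complexity(detail)   # merge weight plus its nested query
        elif kind == "selection":
            score += detail                   # number of conditions
        elif kind == "projection":
            score += detail                   # number of projected columns
        elif kind == "groupby":
            score += detail + 1               # grouping columns plus aggregation weight (assumed 1)
    return score

# A selection with 2 conditions, then a merge whose right side projects 3 columns.
print(complexity([("selection", 2), ("merge", [("projection", 3)])]))  # 11
```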
- format_multi_line(start_counter: int = 1) Tuple[str, int] [source]¶
-
Format the query across multiple lines for better readability.
Transforms the query into a sequence of DataFrame operations where each operation is assigned to a numbered DataFrame variable (df1, df2, etc.). Handles nested operations by recursively formatting sub-queries and maintaining proper DataFrame references.
- Parameters:
-
start_counter (int): Initial counter value for DataFrame numbering. Defaults to 1.
- Returns:
-
- Tuple[str, int]: A tuple containing:
-
-
The formatted multi-line query string
-
The final counter value after processing all operations
-
- Example:
-
For a query with multiple operations, might return:
("df1 = customer[customer['age'] >= 25]\ndf2 = df1.merge(orders, left_on='id', right_on='customer_id')", 3)
- property merge_count: int¶
-
Count the total number of merge operations in the query, including nested merges.
- Returns:
-
Total number of merge operations
- Return type:
int
- property merge_entities: Set[str]¶
-
Get the set of all entities involved in this query, including the base entity and all merged entities.
This property maintains a complete picture of table dependencies by tracking:
-
The base entity of the query
-
All entities that have been merged directly into this query
-
All entities that have been merged into sub-queries (nested merges)
The tracking helps prevent:
-
Circular dependencies (e.g., orders → customers → orders)
-
Redundant joins (e.g., merging the same table multiple times)
-
Invalid join paths
- Returns:
-
A set of entity names (table names) that are part of this query’s join graph. Includes both the base entity and all merged entities.
- Return type:
Set[str]
Operations¶
- class pqg.Selection(conditions: ~typing.List[~typing.Tuple[str, str, ~typing.Any, str | None]] = <factory>)[source]¶
-
Represents a selection operation in a query.
- conditions¶
-
List of selection conditions and operators. Each tuple contains (column, operation, value, next_condition_operator). The last tuple's next_condition_operator is ignored.
- Type:
-
List[Tuple[str, str, Any, str | None]]
- class pqg.Projection(columns: ~typing.List[str] = <factory>)[source]¶
-
Represents a projection operation in a query.
- columns¶
-
A list of column names to project.
- Type:
-
List[str]
- class pqg.Merge(right: pqg.merge.Query, left_on: str, right_on: str)[source]¶
-
- apply(entity: str) str [source]¶
-
Apply the operation to the given entity name.
- Parameters:
-
entity – Name of the entity (table) to apply the operation to.
- Returns:
-
A pandas query string fragment representing this operation. For example: “.groupby([‘col’])” or “[[‘col1’, ‘col2’]]”
- property entities: Set[str]¶
-
Get the set of all entities involved in the right side of this merge operation.
This property provides a complete view of all tables involved in the right-hand side of the merge, including:
-
The immediate right entity being merged
-
Any entities that have been merged into the right entity through nested merges
The difference between this and Query.merge_entities is that this property only tracks entities from the right side of the merge operation, while merge_entities includes the base entity and all merged entities from both sides.
- Returns:
-
A set of entity names that are part of the right-hand side of this merge operation’s join graph.
- Return type:
Set[str]
Query Management¶
QueryPool¶
- class pqg.QueryPool(queries: List[Query], query_structure: QueryStructure, sample_data: Dict[str, DataFrame], multi_processing: bool = True, with_status: bool = False)[source]¶
-
Manages a collection of database queries with execution and analysis capabilities.
Provides functionality for executing queries in parallel, filtering based on results, computing statistics, and managing query persistence. The pool maintains execution results to avoid redundant computation when performing multiple operations.
- _queries¶
List of Query objects in the pool
- _query_structure¶
Parameters controlling query generation
- _sample_data¶
-
Dictionary mapping entity names to sample DataFrames
- _results¶
-
Cached query execution results (DataFrame/Series, error message)
- _with_status¶
-
Whether to display progress bars during operations
- execute(force_execute: bool = False, num_processes: int | None = None) List[Tuple[DataFrame | Series | None, str | None]] [source]¶
-
Execute all queries against the sample data, either in parallel or sequentially.
Evaluates queries using either multiprocessing for parallel execution or sequential processing. Results are cached to avoid re-execution unless explicitly requested.
- Parameters:
-
-
force_execute – If True, re-execute all queries even if results exist
-
num_processes – Number of parallel processes to use. Defaults to CPU count. Only used when multi_processing is True.
-
- Returns:
-
List of tuples containing (result, error) for each query
- filter(filter_type: QueryFilter, force_execute: bool = False) None [source]¶
-
Filter queries based on their execution results.
Modifies the query pool in-place to keep only queries matching the filter criteria. Executes queries if results don’t exist.
- Parameters:
-
-
filter_type – Criteria for keeping queries (NON_EMPTY, EMPTY, etc.)
-
force_execute – If True, re-execute queries before filtering
-
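Filtering reduces to classifying each cached (result, error) pair. A standalone sketch of that classification (the filter names follow the CLI's non-empty/empty/has-error/without-error choices documented below; the helper itself is illustrative, not pqg's implementation):

```python
import pandas as pd

def matches(result, error, filter_type: str) -> bool:
    """Decide whether one (result, error) pair survives the filter."""
    if filter_type == "has-error":
        return error is not None
    if filter_type == "without-error":
        return error is None
    empty = result is None or (hasattr(result, "empty") and result.empty)
    if filter_type == "non-empty":
        return error is None and not empty
    if filter_type == "empty":
        return error is None and empty
    raise ValueError(f"unknown filter: {filter_type}")

print(matches(pd.DataFrame({"a": [1]}), None, "non-empty"))  # True
print(matches(pd.DataFrame(), None, "empty"))                # True
```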
- items() Iterator[tuple[Query, Tuple[DataFrame | Series | None, str | None]]] [source]¶
-
Iterate over query-result pairs.
Each iteration yields a tuple containing a query and its execution result. If results haven’t been computed yet, executes the queries first.
- Yields:
-
tuple[Query, QueryResult] – Pairs of (query, (result, error))
- results() Iterator[Tuple[DataFrame | Series | None, str | None]] [source]¶
-
Iterate over query results.
If results haven’t been computed yet, executes the queries first.
- Yields:
-
QueryResult – Pairs of (result, error) for each query
- save(output_file: str, create_dirs: bool = True) None [source]¶
-
Save all queries to a file.
Each query is saved on a separate line in its string representation. Empty queries are filtered out and whitespace is trimmed.
- Parameters:
-
-
output_file – Path to the output file
-
create_dirs – If True, creates parent directories if needed
-
- sort() None [source]¶
-
Sort queries by their complexity.
Orders queries based on their complexity score while maintaining the association between queries and their execution results if they exist.
- statistics(force_execute: bool = False) QueryStatistics [source]¶
-
Generate comprehensive statistics about the query pool.
Analyzes query characteristics and execution results to measure how well the generated queries match the target parameters.
- Parameters:
-
force_execute – If True, re-execute queries before analysis
- Returns:
-
Statistics comparing actual vs. target characteristics
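Comparing actual query characteristics against the target parameters boils down to frequency counts; a minimal standalone sketch of that comparison (illustrative, not pqg's internals):

```python
def operation_frequency(queries_with_op: int, total_queries: int) -> float:
    """Observed fraction of queries containing an operation, to be compared
    against the corresponding target probability in QueryStructure."""
    return queries_with_op / total_queries if total_queries else 0.0

# e.g. 480 of 1000 queries contained a selection vs. a 0.5 selection_probability target
print(operation_frequency(480, 1000))  # 0.48
```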
QueryFilter¶
Statistics¶
- class pqg.query_pool.ExecutionStatistics(successful: int = 0, failed: int = 0, non_empty: int = 0, empty: int = 0, errors: ~collections.Counter[str] = <factory>)[source]¶
-
Statistics about query execution results.
This class tracks success/failure rates and categorizes execution outcomes to provide insights into query generation quality.
- successful¶
-
Number of queries that executed without error
- Type:
int
- failed¶
-
Number of queries that raised an error
- Type:
int
- non_empty¶
-
Number of queries returning data (rows/values)
- Type:
int
- empty¶
-
Number of queries returning empty results
- Type:
int
- errors¶
-
Count of each error type encountered
- Type:
Counter[str]
Example
stats = ExecutionStatistics()
stats.successful = 95
stats.failed = 5
stats.errors["KeyError"] = 3
print(stats)  # Shows execution summary with percentages
- class pqg.query_pool.QueryStatistics(query_structure: ~pqg.query_structure.QueryStructure, total_queries: int = 0, queries_with_selection: int = 0, queries_with_projection: int = 0, queries_with_groupby: int = 0, queries_with_merge: int = 0, selection_conditions: ~typing.List[int] = <factory>, projection_columns: ~typing.List[int] = <factory>, groupby_columns: ~typing.List[int] = <factory>, merge_count: ~typing.List[int] = <factory>, execution_results: ~pqg.query_pool.ExecutionStatistics = <factory>)[source]¶
-
Statistics comparing actual query characteristics against target parameters.
Analyzes how closely generated queries match the desired structure parameters and collects distributions of various operation characteristics.
- query_structure¶
-
Target parameters for query generation
- total_queries¶
-
Total number of queries analyzed
- Type:
int
- queries_with_selection¶
-
Number of queries containing selection operations
- Type:
int
- queries_with_projection¶
-
Number of queries containing projection operations
- Type:
int
- queries_with_groupby¶
-
Number of queries containing groupby operations
- Type:
int
- queries_with_merge¶
-
Number of queries containing merge operations
- Type:
int
- selection_conditions¶
-
List of condition counts from each selection operation
- Type:
List[int]
- projection_columns¶
-
List of column counts from each projection operation
- Type:
List[int]
- groupby_columns¶
-
List of column counts from each groupby operation
- Type:
List[int]
- merge_count¶
-
List of merge counts from each query
- Type:
List[int]
- execution_results¶
-
Statistics about query execution outcomes
Command Line Interface¶
The package provides a command-line interface through the
pqg
command. See the
Quickstart Guide
for common usage examples.
Arguments¶
The following command-line arguments are available:
- --disable-multi-processing
Generate and execute queries sequentially
- --ensure-non-empty
Only generate queries that return data
- --filter {non-empty,empty,has-error,without-error}
Filter queries by execution results
- --groupby-aggregation-probability FLOAT
-
Probability of including GROUP BY operations (default: 0.5)
- --max-groupby-columns INT
-
Maximum columns in GROUP BY clauses (default: 5)
- --max-merges INT
Maximum number of table joins (default: 2)
- --max-projection-columns INT
Maximum columns to select (default: 5)
- --max-selection-conditions INT
Maximum WHERE conditions (default: 5)
- --multi-line
Format queries across multiple lines
- --num-queries INT
Number of queries to generate [required]
- --output-file FILE
Save queries to a file
- --projection-probability FLOAT
-
Probability of including projections (default: 0.5)
- --schema FILE
Schema definition file [required]
- --selection-probability FLOAT
-
Probability of including selections (default: 0.5)
- --sort
Sort queries by complexity
- --verbose
Print detailed information
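A typical invocation combining the required flags with a few common options (the schema and output file paths here are placeholders):

```shell
pqg --schema examples/schema.json \
    --num-queries 1000 \
    --ensure-non-empty \
    --sort \
    --output-file queries.txt
```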
Error Handling¶
The following exceptions may be raised:
- ValueError
-
Raised for invalid configurations or parameters
- IOError
Raised for schema file read errors
Examples¶
See the Quickstart Guide for usage examples.