PDG Query Language (Cypher)

The Program Dependence Graph (PDG) Query Language uses Cypher, a declarative graph query language, to query dependencies, perform slicing, and verify security policies on program dependence graphs.

Overview

Cypher provides a declarative way to:

  • Query program dependencies (data and control)

  • Perform forward and backward slicing

  • Check information flow properties

  • Verify security policies

  • Find shortest paths between program elements

  • Analyze parameter dependencies

The language supports:

  • Pattern matching (nodes and relationships)

  • WHERE clauses for filtering

  • Path queries with variable-length patterns

  • Set operations (UNION, intersection via WHERE EXISTS)

  • Aggregations and ordering

Getting Started

Building the PDG

First, compile your program and build the PDG:

# Compile to LLVM IR
clang -emit-llvm -g -c program.c -o program.bc

# Run PDG query tool
./build/bin/pdg-query program.bc

Interactive Mode

Use -i flag for interactive queries:

./build/bin/pdg-query -i program.bc

PDG> MATCH (n:FUNC_ENTRY) WHERE n.name = 'main' RETURN n
[Results displayed]

PDG> MATCH (input:FUNC_ENTRY)-[:PARAMETER_OUT]->(ret:INST_RET)-[*]->(n)
PDG> WHERE input.name = 'getInput'
PDG> RETURN n
[Slice displayed]

PDG> exit

Single Query Mode

Use -q flag for single query:

./build/bin/pdg-query -q "MATCH (n:FUNC_ENTRY) WHERE n.name = 'main' RETURN n" program.bc

Batch Query Mode

Use -f flag to run queries from file:

./build/bin/pdg-query -f queries.txt program.bc

Language Reference

Basic Node Queries

Get All Nodes

Returns all nodes in the PDG.

Syntax: MATCH (n) RETURN n

Example:

MATCH (n) RETURN n

Returns: All nodes in the PDG.

Query Nodes by Type

Select nodes by their label (type).

Syntax: MATCH (n:LABEL) RETURN n

Node Types (Labels):

  • INST_FUNCALL - Function call instructions

  • INST_RET - Return instructions

  • INST_BR - Branch instructions

  • INST_LOAD - Load instructions

  • INST_STORE - Store instructions

  • INST_ALLOCA - Alloca instructions

  • INST_GEP - GetElementPtr instructions

  • INST_OTHER - Other instructions

  • FUNC_ENTRY - Function entry points

  • PARAM_FORMALIN - Formal input parameters

  • PARAM_FORMALOUT - Formal output parameters (return values)

  • PARAM_ACTUALIN - Actual input parameters

  • PARAM_ACTUALOUT - Actual output parameters

  • GLOBAL - Global variables

  • FUNC - Function nodes

  • CLASS - Class nodes

Example:

MATCH (n:INST_FUNCALL) RETURN n
MATCH (n:INST_BR) RETURN n

Function-Specific Queries

Return Values of a Function

Returns the return values of a function.

Syntax: MATCH (n:FUNC_ENTRY)-[:PARAMETER_OUT]->(ret:INST_RET) WHERE n.name = 'functionName' RETURN ret

Example:

MATCH (n:FUNC_ENTRY)-[:PARAMETER_OUT]->(ret:INST_RET)
WHERE n.name = 'main'
RETURN ret

MATCH (n:FUNC_ENTRY)-[:PARAMETER_OUT]->(ret:INST_RET)
WHERE n.name = 'getInput'
RETURN ret

Returns: Nodes representing return values of the specified function.

Formal Parameters of a Function

Returns the formal parameters of a function.

Syntax: MATCH (n:FUNC_ENTRY)-[:PARAMETER_IN]->(param:PARAM_FORMALIN) WHERE n.name = 'functionName' RETURN param

Example:

MATCH (n:FUNC_ENTRY)-[:PARAMETER_IN]->(param:PARAM_FORMALIN)
WHERE n.name = 'process_data'
RETURN param

Returns: Nodes representing formal input parameters.

Function Entry Points

Returns the entry points of a function.

Syntax: MATCH (n:FUNC_ENTRY) WHERE n.name = 'functionName' RETURN n

Example:

MATCH (n:FUNC_ENTRY)
WHERE n.name = 'main'
RETURN n

Returns: Function entry nodes.

Edge Queries

Query Edges by Type

Select edges by their relationship type.

Syntax: MATCH ()-[r:TYPE]->() RETURN r

Edge Types (Relationship Types):

  • DATA_DEF_USE - Data definition-use edges

  • DATA_RAW - Read-after-write edges

  • DATA_READ - Read edges

  • DATA_WRITE - Write edges

  • DATA_ALIAS - Alias edges

  • CONTROL_DEP - Control dependency edges

  • CONTROLDEP_BR - Control dependency from branches

  • CONTROLDEP_ENTRY - Control dependency from entry

  • PARAMETER_IN - Parameter input edges

  • PARAMETER_OUT - Parameter output edges

  • PARAMETER_FIELD - Field parameter edges

  • CALL_DEP - Call dependency edges

  • GLOBAL_DEP - Global dependency edges

Example:

MATCH ()-[r:DATA_DEF_USE]->() RETURN r
MATCH ()-[r:CONTROLDEP_BR]->() RETURN r

Connected Nodes

Get nodes connected by edges.

Syntax: MATCH (a)-[r]->(b) RETURN a, b

Example:

MATCH (a)-[r:DATA_DEF_USE]->(b) RETURN a, b
MATCH (a)-[r:CONTROLDEP_BR]->(b) RETURN a, b

Slicing Operations

Forward Slice

Compute forward slice from given nodes (all nodes reachable from source).

Syntax: MATCH (source)-[*]->(n) RETURN DISTINCT n

Example:

MATCH (input:FUNC_ENTRY)-[:PARAMETER_OUT]->(inputRet:INST_RET)
WHERE input.name = 'getInput'
MATCH (inputRet)-[*]->(n)
RETURN DISTINCT n

Returns: All nodes reachable from the input nodes.

Use Case: Find what program elements are affected by a source.

Backward Slice

Compute backward slice from given nodes (all nodes that can reach the sink).

Syntax: MATCH (n)-[*]->(sink) RETURN DISTINCT n

Example:

MATCH (output:FUNC_ENTRY)-[:PARAMETER_IN]->(outputParam:PARAM_FORMALIN)
WHERE output.name = 'printOutput'
MATCH (n)-[*]->(outputParam)
RETURN DISTINCT n

Returns: All nodes that can reach the input nodes.

Use Case: Find what program elements influence a sink.

Path Between Nodes

Find nodes on paths between two sets of nodes.

Syntax: MATCH path = (start)-[*]->(end) RETURN path

Example:

MATCH (input:FUNC_ENTRY)-[:PARAMETER_OUT]->(inputRet:INST_RET)
WHERE input.name = 'getInput'
MATCH (output:FUNC_ENTRY)-[:PARAMETER_IN]->(outputParam:PARAM_FORMALIN)
WHERE output.name = 'system'
MATCH path = (inputRet)-[*]->(outputParam)
RETURN path

Returns: Paths from sources to sinks.

Use Case: Information flow analysis.

Shortest Path

Find shortest path between two sets of nodes.

Syntax: MATCH path = shortestPath((start)-[*]->(end)) RETURN path

Example:

MATCH (malloc:FUNC_ENTRY)-[:PARAMETER_OUT]->(mallocRet:INST_RET)
WHERE malloc.name = 'malloc'
MATCH (free:FUNC_ENTRY)-[:PARAMETER_IN]->(freeParam:PARAM_FORMALIN)
WHERE free.name = 'free'
MATCH path = shortestPath((mallocRet)-[*]->(freeParam))
RETURN path

Returns: Shortest path between nodes.

Set Operations

Union

Union of two sets using UNION keyword.

Syntax: MATCH ... RETURN ... UNION MATCH ... RETURN ...

Example:

MATCH (func1:FUNC_ENTRY)-[:PARAMETER_OUT]->(ret1:INST_RET)
WHERE func1.name = 'func1'
RETURN ret1
UNION
MATCH (func2:FUNC_ENTRY)-[:PARAMETER_OUT]->(ret2:INST_RET)
WHERE func2.name = 'func2'
RETURN ret2

Returns: All nodes in either set.

Intersection

Intersection using WHERE EXISTS subquery.

Syntax: MATCH (n) WHERE EXISTS { MATCH (m) WHERE ... } RETURN n

Example:

MATCH (dataNodes:INST_FUNCALL)
WHERE EXISTS {
  MATCH (controlNodes:INST_BR)
  WHERE dataNodes.id = controlNodes.id
}
RETURN dataNodes

Returns: Nodes in both sets.

Use Case: Find nodes that are both data and control dependent.

Difference

Set difference using WHERE NOT.

Syntax: MATCH (n) WHERE NOT (condition) RETURN n

Example:

MATCH (calls:INST_FUNCALL)
WHERE NOT (calls:INST_BR)
RETURN calls

Returns: Nodes in first set but not matching condition.

Filtering with WHERE

Property Filtering

Filter nodes by properties.

Syntax: MATCH (n) WHERE n.property = value RETURN n

Example:

MATCH (n:FUNC_ENTRY)
WHERE n.name = 'main'
RETURN n

MATCH (n)
WHERE n.line > 100 AND n.line < 200
RETURN n

Multiple Conditions

Combine conditions with AND, OR, NOT.

Syntax: MATCH (n) WHERE condition1 AND condition2 RETURN n

Example:

MATCH (n:INST_FUNCALL)
WHERE n.name = 'malloc' OR n.name = 'calloc'
RETURN n

Security Analysis Examples

Information Flow Analysis

Check if secret data flows to network:

MATCH (secret:FUNC_ENTRY)-[:PARAMETER_OUT]->(secretRet:INST_RET)
WHERE secret.name = 'getPassword'
MATCH (network:FUNC_ENTRY)-[:PARAMETER_IN]->(networkParam:PARAM_FORMALIN)
WHERE network.name = 'sendToNetwork'
MATCH path = (secretRet)-[*]->(networkParam)
RETURN path

If result is non-empty, there’s a potential information leak.

Sanitization Verification

Verify all user inputs are sanitized before use in SQL:

MATCH (sources:FUNC_ENTRY)-[:PARAMETER_OUT]->(sourceRet:INST_RET)
WHERE sources.name = 'getUserInput'
MATCH (sanitizers:FUNC_ENTRY)-[:PARAMETER_OUT]->(sanitizerRet:INST_RET)
WHERE sanitizers.name = 'sanitizeSQL'
MATCH (sinks:FUNC_ENTRY)-[:PARAMETER_IN]->(sinkParam:PARAM_FORMALIN)
WHERE sinks.name = 'executeSQL'
MATCH path = (sourceRet)-[*]->(sinkParam)
WHERE NOT EXISTS {
  MATCH (sourceRet)-[*]->(sanitizerRet)-[*]->(sinkParam)
}
RETURN path

If result is empty, all flows are sanitized.

Access Control Policy

Verify sensitive file operations require authorization:

MATCH (auth:FUNC_ENTRY)-[:PARAMETER_OUT]->(authRet:INST_RET)
WHERE auth.name = 'checkPermission'
MATCH (authRet)-[:CONTROLDEP_BR]->(check)
MATCH (fileOps:FUNC_ENTRY)-[:PARAMETER_IN]->(fileParam:PARAM_FORMALIN)
WHERE fileOps.name = 'openFile' OR fileOps.name = 'writeFile'
WHERE NOT EXISTS {
  MATCH (check)-[:CONTROLDEP_BR]->(fileParam)
}
RETURN fileParam

Null Pointer Analysis

Find potential null dereferences:

MATCH (nullRet:INST_RET)
WHERE nullRet.function = 'malloc' OR nullRet.function = 'fopen'
MATCH (nullRet)-[*]->(deref)
WHERE deref:INST_LOAD OR deref:INST_STORE
RETURN deref

Control Flow Analysis

Find which conditionals affect a computation:

MATCH (branches:INST_BR)
MATCH (computation:FUNC_ENTRY)-[:PARAMETER_IN]->(compParam:PARAM_FORMALIN)
WHERE computation.name = 'compute'
MATCH (branches)-[*]->(compParam)
RETURN branches

Parameter Dependency

Find dependencies between function parameters:

MATCH (inputs:FUNC_ENTRY)-[:PARAMETER_IN]->(inputParam:PARAM_FORMALIN)
WHERE inputs.name = 'processData'
MATCH (outputs:FUNC_ENTRY)-[:PARAMETER_OUT]->(outputRet:INST_RET)
WHERE outputs.name = 'processData'
MATCH path = (inputParam)-[*]->(outputRet)
RETURN path

Taint Analysis

Complete taint analysis from sources to sinks:

MATCH (sources:FUNC_ENTRY)-[:PARAMETER_OUT]->(sourceRet:INST_RET)
WHERE sources.name = 'read' OR sources.name = 'recv' OR sources.name = 'scanf'
MATCH (sinks:FUNC_ENTRY)-[:PARAMETER_IN]->(sinkParam:PARAM_FORMALIN)
WHERE sinks.name = 'system' OR sinks.name = 'exec' OR sinks.name = 'popen'
MATCH path = (sourceRet)-[*]->(sinkParam)
RETURN path

Query File Format

Batch query files support:

  • Comments (# prefix)

  • Multiple queries (one per line or separated by ;)

  • Multi-line queries

Example query file (security_policy.txt):

# Security Policy Verification

# Check for direct flows from sources to sinks
MATCH (sources:FUNC_ENTRY)-[:PARAMETER_OUT]->(sourceRet:INST_RET)
WHERE sources.name = 'read' OR sources.name = 'recv'
MATCH (sinks:FUNC_ENTRY)-[:PARAMETER_IN]->(sinkParam:PARAM_FORMALIN)
WHERE sinks.name = 'system' OR sinks.name = 'exec'
MATCH path = (sourceRet)-[*]->(sinkParam)
RETURN path

# Check sensitive operations are authorized
MATCH (auth:FUNC_ENTRY)-[:PARAMETER_OUT]->(authRet:INST_RET)
WHERE auth.name = 'checkAuth'
MATCH (authRet)-[:CONTROLDEP_BR]->(check)
MATCH (sensitiveOps:FUNC_ENTRY)-[:PARAMETER_IN]->(opParam:PARAM_FORMALIN)
WHERE sensitiveOps.name = 'deleteFile' OR sensitiveOps.name = 'changePassword'
WHERE NOT EXISTS {
  MATCH (check)-[:CONTROLDEP_BR]->(opParam)
}
RETURN opParam

Run with:

./build/bin/pdg-query -f security_policy.txt program.bc

Best Practices

  1. Use Descriptive Variable Names:

    # Good
    MATCH (userInput:FUNC_ENTRY)-[:PARAMETER_OUT]->(inputRet:INST_RET)
    WHERE userInput.name = 'scanf'
    
    # Bad
    MATCH (x:FUNC_ENTRY)-[:PARAMETER_OUT]->(y:INST_RET)
    WHERE x.name = 'scanf'
    
  2. Break Complex Queries:

    Use multiple MATCH clauses for clarity:

    MATCH (sources:FUNC_ENTRY)-[:PARAMETER_OUT]->(sourceRet:INST_RET)
    WHERE sources.name = 'getInput'
    MATCH (sanitizers:FUNC_ENTRY)-[:PARAMETER_OUT]->(sanitizerRet:INST_RET)
    WHERE sanitizers.name = 'sanitize'
    MATCH (sinks:FUNC_ENTRY)-[:PARAMETER_IN]->(sinkParam:PARAM_FORMALIN)
    WHERE sinks.name = 'output'
    MATCH path = (sourceRet)-[*]->(sinkParam)
    WHERE NOT EXISTS {
      MATCH (sourceRet)-[*]->(sanitizerRet)-[*]->(sinkParam)
    }
    RETURN path
    
  3. Comment Your Policies:

    # Policy: User input must be sanitized before database queries
    MATCH (input:FUNC_ENTRY)-[:PARAMETER_OUT]->(inputRet:INST_RET)
    WHERE input.name = 'getUserInput'
    MATCH (sanitize:FUNC_ENTRY)-[:PARAMETER_OUT]->(sanitizerRet:INST_RET)
    WHERE sanitize.name = 'sanitizeSQL'
    MATCH (dbQuery:FUNC_ENTRY)-[:PARAMETER_IN]->(queryParam:PARAM_FORMALIN)
    WHERE dbQuery.name = 'executeQuery'
    MATCH path = (inputRet)-[*]->(queryParam)
    WHERE NOT EXISTS {
      MATCH (inputRet)-[*]->(sanitizerRet)-[*]->(queryParam)
    }
    RETURN path
    
  4. Test Incrementally:

    Start with simple queries and build up:

    # Step 1: Verify sources exist
    MATCH (n:FUNC_ENTRY) WHERE n.name = 'getInput' RETURN n
    
    # Step 2: Verify sinks exist
    MATCH (n:FUNC_ENTRY) WHERE n.name = 'system' RETURN n
    
    # Step 3: Check flows
    MATCH (input:FUNC_ENTRY)-[:PARAMETER_OUT]->(inputRet:INST_RET)
    WHERE input.name = 'getInput'
    MATCH (output:FUNC_ENTRY)-[:PARAMETER_IN]->(outputParam:PARAM_FORMALIN)
    WHERE output.name = 'system'
    MATCH path = (inputRet)-[*]->(outputParam)
    RETURN path
    
  5. Use DISTINCT for Slices:

    MATCH (start)-[*]->(n)
    RETURN DISTINCT n
    

Limitations

  1. Imprecision: PDG construction uses alias analysis which may be imprecise, leading to spurious dependencies.

  2. Scalability: Large programs with complex dependencies may result in very large PDGs.

  3. Context Sensitivity: PDG is context-insensitive by default. Multiple calls to the same function are merged.

  4. Array Handling: Array dependencies may be approximated.

  5. Pointer Analysis: Depends on underlying pointer analysis precision.

Performance Tips

  1. Limit Path Length: Use bounded variable-length patterns when possible:

    MATCH (start)-[*1..5]->(end)
    RETURN path
    
  2. Use Specific Labels: Always specify node labels when possible:

    # Good
    MATCH (n:INST_FUNCALL) RETURN n
    
    # Less efficient
    MATCH (n) WHERE n:INST_FUNCALL RETURN n
    
  3. Filter Early: Apply WHERE clauses as early as possible:

    MATCH (n:FUNC_ENTRY)
    WHERE n.name = 'main'
    MATCH (n)-[:PARAMETER_OUT]->(ret:INST_RET)
    RETURN ret
    

Troubleshooting

Query Returns Empty Set

  1. Check function names are correct (case-sensitive)

  2. Verify PDG was built successfully

  3. Try simpler queries first (MATCH (n:FUNC_ENTRY) WHERE n.name = 'main' RETURN n)

  4. Use -v verbose flag

Query Too Slow

  1. Reduce scope of query

  2. Use more specific node/edge selection

  3. Limit path length with [*1..10]

  4. Analyze smaller program subset

Syntax Errors

  1. Check quotes around string values: WHERE n.name = 'main'

  2. Ensure proper Cypher syntax

  3. Balance parentheses and brackets

  4. Check relationship types are correct

See Also