Machine Learning Features (CanaryML)
This section documents the machine learning feature extraction capabilities in Lotus,
specifically the CanaryML library for memory-related feature extraction using Sea-DSA.
Overview
The CanaryML library (built from lib/ML/) provides utilities for extracting
memory-related features from program call sites. These features are designed for
machine learning applications, particularly for predicting memory safety properties
or learning memory access patterns.
Location: lib/ML/

Library Name: CanaryML

Dependencies:

- Sea-DSA for points-to graph construction
- LLVM analysis passes (CallGraph, TargetLibraryInfo, AllocWrapInfo)
MemoryMLFeaturesPass
The primary component is MemoryMLFeaturesPass, an LLVM ModulePass that extracts
features for each call site in a module.
LLVM Pass Name: -Pmem-ml-features
Header:

.. code-block:: cpp

   #include "ML/MemoryMLFeatures.h"
Basic Usage:

.. code-block:: cpp

   #include "ML/MemoryMLFeatures.h"

   // Create and run the pass
   previrt::MemoryMLFeaturesPass ml_features_pass;
   ml_features_pass.runOnModule(M);

   // Extract features for a specific call site
   for (auto &F : M) {
     // ...
   }
Command-Line Options:
-Pinclude-expensive-ml-features (default: true): Include expensive-to-compute memory access features.
Features Extracted
The pass extracts two sets of features for each call site:

- Callee Summary Graph Features (prefix ``callee_``)
- Top-Down Unification Features (prefix ``td_callee_``)
Available Feature Accessors:
Graph Structure:
- getCalleeNumNodes() / getTdCalleeNumNodes()
- getCalleeNumAccessed() / getTdCalleeNumAccessed()
- getCalleeNumCollapsed() / getTdCalleeNumCollapsed()
- getCalleeNumSequence() / getTdCalleeNumSequence()
Memory Allocation:
- getCalleeNumAlloca() / getTdCalleeNumAlloca() (stack)
- getCalleeNumHeap() / getTdCalleeNumHeap() (heap)
- getCalleeNumExternal() / getTdCalleeNumExternal() (external)
- getCalleeNumAllocSites() / getTdCalleeNumAllocSites()
Pointers and Accesses:
- getCalleeNumPointers() / getTdCalleeNumPointers()
- getCalleePointersPerNode() / getTdCalleePointersPerNode()
- getCalleeNumMemAccesses() / getTdCalleeNumMemAccesses()
- getCalleeNumSafeAllocSites() / getTdCalleeNumSafeAllocSites()
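For ML use, the per-call-site values returned by these accessors typically need to be flattened into a fixed-order numeric vector. The following is a minimal sketch of that step. It assumes a hypothetical ``GraphFeatures`` struct whose fields mirror the accessor names. This struct is a stand-in, not the CanaryML type, and the field types and the pointers-per-node formula (pointers divided by nodes) are assumptions.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for one feature set (callee summary or top-down).
// Field names mirror the accessors listed above; the real CanaryML types
// and return types may differ.
struct GraphFeatures {
  unsigned numNodes = 0;
  unsigned numAccessed = 0;
  unsigned numCollapsed = 0;
  unsigned numSequence = 0;
  unsigned numAlloca = 0;
  unsigned numHeap = 0;
  unsigned numExternal = 0;
  unsigned numAllocSites = 0;
  unsigned numPointers = 0;
  unsigned numMemAccesses = 0;
  unsigned numSafeAllocSites = 0;
};

constexpr std::size_t kFeaturesPerGraph = 12;

// Assumed definition of the pointers-per-node ratio, guarded for empty graphs.
inline double pointersPerNode(const GraphFeatures &g) {
  return g.numNodes == 0 ? 0.0
                         : static_cast<double>(g.numPointers) / g.numNodes;
}

// Flatten one feature set in the same order as the accessor list above.
inline std::array<double, kFeaturesPerGraph> flatten(const GraphFeatures &g) {
  return {double(g.numNodes),      double(g.numAccessed),
          double(g.numCollapsed),  double(g.numSequence),
          double(g.numAlloca),     double(g.numHeap),
          double(g.numExternal),   double(g.numAllocSites),
          double(g.numPointers),   pointersPerNode(g),
          double(g.numMemAccesses), double(g.numSafeAllocSites)};
}

// Concatenate the callee summary features and the top-down features.
inline std::array<double, 2 * kFeaturesPerGraph>
toFeatureVector(const GraphFeatures &callee, const GraphFeatures &tdCallee) {
  std::array<double, 2 * kFeaturesPerGraph> out{};
  const auto a = flatten(callee);
  const auto b = flatten(tdCallee);
  std::copy(a.begin(), a.end(), out.begin());
  std::copy(b.begin(), b.end(), out.begin() + kFeaturesPerGraph);
  return out;
}
```

Keeping the column order fixed and identical for both feature sets makes it easy to compare the callee summary view against the top-down view of the same call site.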
Feature Output
Features can be written to a stream:

.. code-block:: cpp

   features.write(llvm::errs());
Output format:

.. code-block:: text

   ## ML features for call @function_name
   Callee's Summary graph
   Number of nodes : <N>
   Number of collapsed nodes: <N>
   Number of accessed nodes : <N>
   ...
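When the textual dump is captured from a stream, it can be turned back into numeric records for downstream tooling. The sketch below handles only the line shapes shown above: a ``## ML features for call @name`` header, section title lines without a colon, and ``key : value`` lines. The ``ParsedFeatures`` type and ``parseFeatureDump`` function are hypothetical helpers, not part of CanaryML, and the elided portion of the format may need extra handling.

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical record for one call site parsed from the text dump.
struct ParsedFeatures {
  std::string callee;                    // name following '@' in the header
  std::map<std::string, double> values;  // "section/key" -> numeric value
};

// Strip leading and trailing whitespace.
static std::string trim(const std::string &s) {
  const char *ws = " \t\r\n";
  const auto b = s.find_first_not_of(ws);
  if (b == std::string::npos) return "";
  const auto e = s.find_last_not_of(ws);
  return s.substr(b, e - b + 1);
}

// Parse a dump containing zero or more call-site records.
inline std::vector<ParsedFeatures> parseFeatureDump(std::istream &in) {
  const std::string header = "## ML features for call @";
  std::vector<ParsedFeatures> records;
  std::string line, section;
  while (std::getline(in, line)) {
    line = trim(line);
    if (line.empty()) continue;
    if (line.compare(0, header.size(), header) == 0) {
      // A header starts a new record.
      records.push_back({line.substr(header.size()), {}});
      section.clear();
      continue;
    }
    if (records.empty()) continue;  // ignore text before the first header
    const auto colon = line.rfind(':');
    if (colon == std::string::npos) {
      section = line;  // e.g. "Callee's Summary graph"
      continue;
    }
    const std::string key = trim(line.substr(0, colon));
    const std::string val = trim(line.substr(colon + 1));
    try {
      records.back().values[section + "/" + key] = std::stod(val);
    } catch (const std::exception &) {
      // Non-numeric value: skip the line rather than guess.
    }
  }
  return records;
}
```

Prefixing each key with its section title keeps identically named counters from the two feature sets apart.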
Analysis Dependencies
The pass requires the following LLVM analyses:

- TargetLibraryInfoWrapperPass
- CallGraphWrapperPass
- AllocWrapInfo
- DsaLibFuncInfo
Integration Notes
The MemoryMLFeaturesPass is part of the previrt namespace and integrates
Sea-DSA for memory modeling. It is particularly useful for:

- Building datasets for memory safety prediction
- Learning patterns in library function effects
- Analyzing memory access characteristics of call sites
- Generating features for ML-based bug detection
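As an illustration of the dataset-building use case, the sketch below writes per-call-site features to CSV with a stable column order. Everything here is hypothetical: the ``DatasetRow`` type, the ``writeCsv`` helper, the column names, and the ``label`` field (whose source, such as a bug database, is outside the scope of this pass). Features missing from a row are written as 0, which is one possible way to handle rows produced without the expensive features.

```cpp
#include <cassert>
#include <map>
#include <ostream>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical dataset row: one call site, its numeric features, and a label.
struct DatasetRow {
  std::string callSite;
  std::map<std::string, double> features;
  int label;  // e.g. 1 = flagged, 0 = not flagged (assumed encoding)
};

// Write rows as CSV. Columns are the sorted union of all feature names, so
// every row has the same layout even if some features are absent.
// Note: no quoting is done, so names containing commas are not supported.
inline void writeCsv(std::ostream &os, const std::vector<DatasetRow> &rows) {
  std::set<std::string> columns;
  for (const auto &r : rows)
    for (const auto &kv : r.features) columns.insert(kv.first);

  os << "call_site";
  for (const auto &c : columns) os << ',' << c;
  os << ",label\n";

  for (const auto &r : rows) {
    os << r.callSite;
    for (const auto &c : columns) {
      const auto it = r.features.find(c);
      os << ',' << (it == r.features.end() ? 0.0 : it->second);
    }
    os << ',' << r.label << '\n';
  }
}
```

A call such as ``writeCsv(std::cout, rows)`` would then produce a table that common ML tooling can load directly.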