Hill Climbing Algorithm
Code for Searching through the space of possible Bayesian Network structures.
Various optimization procedures are employed, from greedy search to simulated annealing, and so on - mostly using scipy.optimize.
Local search - possible moves: - Add edge - Delete edge - Invert edge
Strategies to improve Greedy Hill-Climbing: - Random Restarts
when we get stuck, take some number of
random steps and then start climbing again.
- Tabu List
keep a list of the K steps most recently taken,
and say that the search cannt reverse (undo) any of these steps.
- bamt.redef_HC.hc(data, metric='MI', max_iter=200, debug=False, init_nodes=None, restriction=None, init_edges=None, remove_geo_edges=True, black_list=None)[source]
Greedy Hill Climbing search proceeds by choosing the move which maximizes the increase in fitness of the network at the current step. It continues until it reaches a point where there does not exist any feasible single move that increases the network fitness.
It is called “greedy” because it simply does what is best at the current iteration only, and thus does not look ahead to what may be better later on in the search.
For computational saving, a Priority Queue (python’s heapq) can be used to maintain the best operators and reduce the complexity of picking the best operator from O(n^2) to O(nlogn). This works by maintaining the heapq of operators sorted by their delta score, and each time a move is made, we only have to recompute the O(n) delta-scores which were affected by the move. The rest of the operator delta-scores are not affected.
For additional computational efficiency, we can cache the sufficient statistics for various families of distributions - therefore, computing the mutual information for a given family only needs to happen once.
- The possible moves are the following:
add edge
delete edge
invert edge
Arguments
- datapd.DataFrame
The data from which the Bayesian network structure will be learned.
- metrica string
Which score metric to use. Options:
AIC
BIC / MDL
LL (log-likelihood)
- max_iteran integer
The maximum number of iterations of the hill-climbing algorithm to run. Note that the algorithm will terminate on its own if no improvement is made in a given iteration.
- debugboolean
Whether to print(the scores/moves of the) algorithm as its happening.
init_nodes : a list of initialize nodes (number of nodes according to the dataset)
- restrictiona list of 2-tuples
For MMHC algorithm, the list of allowable edge additions.
Returns
bn : a BayesNet object