DES-Lab/AALpyPublic

NotificationsYou must be signed in to change notification settings
Fork34
Star199

Active learning of RegEx with existing samples#53

Answeredbyemuskardin

leonbett asked this question inQ&A

leonbett

Nov 3, 2023

· 3 comments· 3 replies

AnsweredbyemuskardinReturn to top

Discussion options

leonbett
Nov 3, 2023

Hi!

Suppose I want to learn the following regex
regex = r'((0x[0-9a-fA-F]+)|infinity)'

The string "infinity" is fairly long and hard to bruteforce. I know that it's accepted by the regex I want to learn though, and I would like to pass this information to the regex learner.

I've tried to pass it via thesamples argument to the StatePrefixEqOracle, which does not seem to work. Please find the complete program below:

import reimport stringimport aalpyfrom aalpy.base import SULfrom aalpy.automata import Dfafrom aalpy.oracles import StatePrefixEqOraclefrom aalpy.learning_algs import run_Lstarclass RegexSUL(SUL):    def __init__(self, regex: str):        super().__init__()        self.regex = regex if regex[-1] == '$' else regex + '$'        self.string = ""    def pre(self):        self.string = ""        pass    def post(self):        pass    def step(self, letter):        if letter is not None:            self.string += str(letter)        return True if re.match(self.regex, self.string) else Falsedef main():  regex = r'((0x[0-9a-fA-F]+)|infinity)'  alphabet = [c for c in "x0123456789abcdefABCDEFintyINTY"]  regex_sul = RegexSUL(regex)  sample_inp = ("i", "n", "f", "i", "n", "i", "t", "y",)  sample_out = (False, False, False, False, False, False, False, True,)  samples = [(sample_inp, sample_out)]  eq_oracle = StatePrefixEqOracle(alphabet, regex_sul, walks_per_state=1000,walk_len=20)  learned_dfa: Dfa = run_Lstar(alphabet, regex_sul, eq_oracle, automaton_type='dfa', print_level=3, samples=samples)  print("automaton states:")  print(learned_dfa.states)if __name__ == "__main__":  main()

I'd appreciate your suggestions very much.

On a side note, is there a recommended way to convert the learned DFA to a RegEx? Currently I am using the Pythonautomata package for this. If there is a better way, kindly let me know.

Best regards,
Leon

You must be logged in to vote

Answered by emuskardin

Nov 3, 2023

Hi,

another alternative is to useProvidedSequencesOracleWrapper:
https://github.com/DES-Lab/AALpy/blob/master/aalpy/oracles/ProvidedSequencesOracleWrapper.py

You just wrap one oracle and this counterexample in this one, and pass it to the learning algorithm.

samples parameter adds samples to the cache, therefore making learning more efficient, but those are not included in the EqOracle.

View full answer

Replies: 3 comments 3 replies

Comment options

zwergziege
Nov 3, 2023
Collaborator

You could create a custom oracle that checks whether the model conforms to your domain knowledge in addition to using some other strategy. The code below should do the trick.

import refrom aalpy.base import Oraclefrom aalpy.automata import Dfafrom aalpy.oracles import StatePrefixEqOraclefrom aalpy.learning_algs import run_Lstarfrom aalpy.utils import visualize_automatonfrom aalpy.SULs import RegexSULclass ChainedOracle(Oracle):  def __init__(self, alphabet, sul, oracles : list[Oracle]):    super().__init__(alphabet, sul)    self.oracles = oracles  def find_cex(self, hypothesis):    for oracle in self.oracles:      cex = oracle.find_cex(hypothesis)      if cex is not None:        return cexclass DomainKnowledgeOracle(Oracle):  def __init__(self, alphabet, sul, queries):    super().__init__(alphabet, sul)    self.queries = queries  def find_cex(self, hypothesis):    for query in self.queries:      sul_result = self.sul.query(query)      hyp_result = hypothesis.compute_output_seq(hypothesis.initial_state, query)      if sul_result != hyp_result:        return query    return Nonedef main():  regex = r'((0x[0-9a-fA-F]+)|infinity)'  alphabet = [c for c in "x0123456789abcdefABCDEFintyINTY"]  regex_sul = RegexSUL(regex)  sample_inp = ("i", "n", "f", "i", "n", "i", "t", "y",)  sample_out = (False, False, False, False, False, False, False, True,)  samples = [(sample_inp, sample_out)]  sp_oracle = StatePrefixEqOracle(alphabet, regex_sul, walks_per_state=1000,walk_len=20)  dk_oracle = DomainKnowledgeOracle(alphabet, regex_sul, [sample_inp])  eq_oracle = ChainedOracle(alphabet, regex_sul, [dk_oracle, sp_oracle])  learned_dfa: Dfa = run_Lstar(alphabet, regex_sul, eq_oracle, automaton_type='dfa', samples=samples)  print("automaton states:")  print(learned_dfa.states)  visualize_automaton(learned_dfa, "test", "pdf")if __name__ == "__main__":  main()

As for the conversion between DFAs and RegEx: I'm afraid I can't help you with this one since I'm not really a DFA person...

You must be logged in to vote

0 replies

Comment options

emuskardin
Nov 3, 2023
Maintainer

Hi,

another alternative is to useProvidedSequencesOracleWrapper:
https://github.com/DES-Lab/AALpy/blob/master/aalpy/oracles/ProvidedSequencesOracleWrapper.py

You just wrap one oracle and this counterexample in this one, and pass it to the learning algorithm.

samples parameter adds samples to the cache, therefore making learning more efficient, but those are not included in the EqOracle.

You must be logged in to vote

0 replies

Answer selected byleonbett

Comment options

leonbett
Nov 3, 2023
Author

Hi,

Thanks for your quick replies!

I am now using ProvidedSequencesOracleWrapper as@emuskardin suggested, and it works.

I have a follow-up question. Suppose I want to learn the regexr'((0x[0-9a-fA-F]+)|infinity|ainfinity|infinityb)'.
I passed 'infinity' to theProvidedSequencesOracleWrapper. The learned regex is:0x(0|1|2|3|4|5|6|7|8|9|a|b|c|d|e|f|A|B|C|D|E|F)(0|1|2|3|4|5|6|7|8|9|a|b|c|d|e|f|A|B|C|D|E|F)*|infinityb?
That is, it learns the suffix afterinfinity, but not the prefix. Is there a way to explore both prefixes and suffixes?

Thanks!

You must be logged in to vote

3 replies

Comment options

emuskardin Nov 4, 2023
Maintainer

I am not sure if I get what you mean a 100%, but "ainfinity" would then be a new long and hard to randomly sample cex.

Inifinityb is still long and hard, especially with such big input alphabet, but it is basically a new state reached after reaching infinity, so state coverage oracles can find it (with sufficient testing).

It seems to me, that if you are learning this kind of languages, a extension to the EqOracle would be needed.

Something whete you take elements of characterization set (see randomWEqOracle) or previous counterexamples, and make explicit tests with those.

If you paste your code, I can take a closer look on Mon/Tue :)

Comment options

leonbett Nov 6, 2023
Author

Hi,

Please find my code below. I've included adfa_to_regex function that makes it easier to debug.
It would be great if it were possible to learn "ainfinity" in this example. I think it's really cool that it's already able to learn "infinityb", so this is not urgent (and I don't know if this is query is feasible).

Thanks for your help!
Leon

# If you want to install automata, use this version: pip install automata-lib==6.0.0# Latest version is buggy w.r.t non-alphanumeric charactersimport reimport stringimport aalpyfrom aalpy.base import SUL, Automatonfrom aalpy.automata import Dfafrom aalpy.automata.Dfa import DfaStatefrom aalpy.oracles import StatePrefixEqOracle, UserInputEqOracle, ProvidedSequencesOracleWrapperfrom aalpy.learning_algs import run_Lstar, run_KVfrom aalpy.utils.FileHandler import visualize_automatonfrom automata.fa.gnfa import GNFAfrom automata.fa.dfa import DFAclass RegexSUL(SUL):    """    An example implementation of a system under learning that can be used to learn any regex expression.    Note that the $ is added to the expression as in this SUL only exact matches are learned.    """    def __init__(self, regex: str):        super().__init__()        self.regex = regex if regex[-1] == '$' else regex + '$'        self.string = ""    def pre(self):        self.string = ""        pass    def post(self):        pass    def step(self, letter):        if letter is not None:            self.string += str(letter)        return True if re.match(self.regex, self.string) else Falsedef dfa_to_regex(learned_dfa: Dfa):  states = set()  input_symbols = set()  transitions = dict()  initial_state: str = learned_dfa.initial_state.state_id  final_states = set()  state: DfaState  for state in learned_dfa.states:    states.add(state.state_id)    if state.is_accepting:       final_states.add(state.state_id)    if state.state_id not in transitions:       transitions[state.state_id] = dict()    next_state: DfaState    for inp, next_state in state.transitions.items():      # "Escaping" reserved characters.      inp = inp.replace('{', '⓪')      inp = inp.replace('}', '①')      inp = inp.replace('(', '②')      inp = inp.replace(')', '③')      inp = inp.replace('[', '④')      inp = inp.replace(']', '⑤')      inp = inp.replace('.', '⑥')      inp = inp.replace('+', '⑦')      inp = inp.replace('*', '⑧')      inp = inp.replace('|', '⑨')      inp = inp.replace('?', 'Ⓐ')      input_symbols.add(inp)      assert inp not in transitions[state.state_id] # deterministic automaton      transitions[state.state_id][inp] = next_state.state_id  assert initial_state in states  dfa =DFA(    states=states,    input_symbols=input_symbols,    transitions=transitions,    initial_state=initial_state,    final_states=final_states  )  gnfa = GNFA.from_dfa(dfa)  regex = gnfa.to_regex()  regex = regex.replace('⓪', '{')  regex = regex.replace('①', '}')  regex = regex.replace('②', '(')  regex = regex.replace('③', ')')  regex = regex.replace('④', '[')  regex = regex.replace('⑤', ']')  regex = regex.replace('⑥', ".")  regex = regex.replace('⑦', "+")  regex = regex.replace('⑧', "*")  regex = regex.replace('⑨', "|")  regex = regex.replace('Ⓐ', "?")  return regexdef main():  regex = r'((0x[0-9a-fA-F]+)|infinity|ainfinity|infinityb)'  alphabet = [c for c in "x0123456789abcdefABCDEFintyINTY"]  regex_sul = RegexSUL(regex)  eq_oracle = StatePrefixEqOracle(alphabet, regex_sul, walks_per_state=10000,walk_len=50)  wrapped_eq_oracle = ProvidedSequencesOracleWrapper(alphabet, regex_sul, eq_oracle, [["i", "n", "f", "i", "n", "i", "t", "y"]])  learned_dfa: Dfa = run_Lstar(alphabet, regex_sul, wrapped_eq_oracle, automaton_type='dfa', print_level=3)  print("automaton states:")  print(learned_dfa.states)  print("learned regex:")  regex = dfa_to_regex(learned_dfa)  print(regex)if __name__ == "__main__":  main()

Comment options

emuskardin Nov 8, 2023
Maintainer

You get this
0x(0|1|2|3|4|5|6|7|8|9|a|b|c|d|e|f|A|B|C|D|E|F)(0|1|2|3|4|5|6|7|8|9|a|b|c|d|e|f|A|B|C|D|E|F)*|infinityb?,
but would this be considered as correct
a*(0x(0|1|2|3|4|5|6|7|8|9|a|b|c|d|e|f|A|B|C|D|E|F)(0|1|2|3|4|5|6|7|8|9|a|b|c|d|e|f|A|B|C|D|E|F)*|infinityb?)?

If yes, replaceStatePrefixEqOracle withRandomWMethodEqOracle, with the same parameterization.
eq_oracle = RandomWMethodEqOracle(alphabet, regex_sul, walks_per_state=10000, walk_len=12)

Movatterモバイル変換

Active learning of RegEx with existing samples#53

Uh oh!

leonbettNov 3, 2023

Replies: 3 comments· 3 replies

Uh oh!

zwergziegeNov 3, 2023 Collaborator

Uh oh!

emuskardinNov 3, 2023 Maintainer

Uh oh!

leonbettNov 3, 2023 Author

Uh oh!

Uh oh!

emuskardinNov 4, 2023 Maintainer

Uh oh!

Uh oh!

leonbettNov 6, 2023 Author

Uh oh!

emuskardinNov 8, 2023 Maintainer

Uh oh!

leonbett
Nov 3, 2023

Replies: 3 comments 3 replies

zwergziege
Nov 3, 2023
Collaborator

emuskardin
Nov 3, 2023
Maintainer

leonbett
Nov 3, 2023
Author

emuskardin Nov 4, 2023
Maintainer

leonbett Nov 6, 2023
Author

emuskardin Nov 8, 2023
Maintainer