Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Active learning of RegEx with existing samples#53

Answeredbyemuskardin
leonbett asked this question inQ&A
Discussion options

Hi!

Suppose I want to learn the following regex
regex = r'((0x[0-9a-fA-F]+)|infinity)'

The string "infinity" is fairly long and hard to bruteforce. I know that it's accepted by the regex I want to learn though, and I would like to pass this information to the regex learner.

I've tried to pass it via thesamples argument to the StatePrefixEqOracle, which does not seem to work. Please find the complete program below:

import reimport stringimport aalpyfrom aalpy.base import SULfrom aalpy.automata import Dfafrom aalpy.oracles import StatePrefixEqOraclefrom aalpy.learning_algs import run_Lstarclass RegexSUL(SUL):    def __init__(self, regex: str):        super().__init__()        self.regex = regex if regex[-1] == '$' else regex + '$'        self.string = ""    def pre(self):        self.string = ""        pass    def post(self):        pass    def step(self, letter):        if letter is not None:            self.string += str(letter)        return True if re.match(self.regex, self.string) else Falsedef main():  regex = r'((0x[0-9a-fA-F]+)|infinity)'  alphabet = [c for c in "x0123456789abcdefABCDEFintyINTY"]  regex_sul = RegexSUL(regex)  sample_inp = ("i", "n", "f", "i", "n", "i", "t", "y",)  sample_out = (False, False, False, False, False, False, False, True,)  samples = [(sample_inp, sample_out)]  eq_oracle = StatePrefixEqOracle(alphabet, regex_sul, walks_per_state=1000,walk_len=20)  learned_dfa: Dfa = run_Lstar(alphabet, regex_sul, eq_oracle, automaton_type='dfa', print_level=3, samples=samples)  print("automaton states:")  print(learned_dfa.states)if __name__ == "__main__":  main()

I'd appreciate your suggestions very much.

On a side note, is there a recommended way to convert the learned DFA to a RegEx? Currently I am using the Pythonautomata package for this. If there is a better way, kindly let me know.

Best regards,
Leon

You must be logged in to vote

Hi,

another alternative is to useProvidedSequencesOracleWrapper:
https://github.com/DES-Lab/AALpy/blob/master/aalpy/oracles/ProvidedSequencesOracleWrapper.py

You just wrap one oracle and this counterexample in this one, and pass it to the learning algorithm.

samples parameter adds samples to the cache, therefore making learning more efficient, but those are not included in the EqOracle.

Replies: 3 comments 3 replies

Comment options

You could create a custom oracle that checks whether the model conforms to your domain knowledge in addition to using some other strategy. The code below should do the trick.

import refrom aalpy.base import Oraclefrom aalpy.automata import Dfafrom aalpy.oracles import StatePrefixEqOraclefrom aalpy.learning_algs import run_Lstarfrom aalpy.utils import visualize_automatonfrom aalpy.SULs import RegexSULclass ChainedOracle(Oracle):  def __init__(self, alphabet, sul, oracles : list[Oracle]):    super().__init__(alphabet, sul)    self.oracles = oracles  def find_cex(self, hypothesis):    for oracle in self.oracles:      cex = oracle.find_cex(hypothesis)      if cex is not None:        return cexclass DomainKnowledgeOracle(Oracle):  def __init__(self, alphabet, sul, queries):    super().__init__(alphabet, sul)    self.queries = queries  def find_cex(self, hypothesis):    for query in self.queries:      sul_result = self.sul.query(query)      hyp_result = hypothesis.compute_output_seq(hypothesis.initial_state, query)      if sul_result != hyp_result:        return query    return Nonedef main():  regex = r'((0x[0-9a-fA-F]+)|infinity)'  alphabet = [c for c in "x0123456789abcdefABCDEFintyINTY"]  regex_sul = RegexSUL(regex)  sample_inp = ("i", "n", "f", "i", "n", "i", "t", "y",)  sample_out = (False, False, False, False, False, False, False, True,)  samples = [(sample_inp, sample_out)]  sp_oracle = StatePrefixEqOracle(alphabet, regex_sul, walks_per_state=1000,walk_len=20)  dk_oracle = DomainKnowledgeOracle(alphabet, regex_sul, [sample_inp])  eq_oracle = ChainedOracle(alphabet, regex_sul, [dk_oracle, sp_oracle])  learned_dfa: Dfa = run_Lstar(alphabet, regex_sul, eq_oracle, automaton_type='dfa', samples=samples)  print("automaton states:")  print(learned_dfa.states)  visualize_automaton(learned_dfa, "test", "pdf")if __name__ == "__main__":  main()

As for the conversion between DFAs and RegEx: I'm afraid I can't help you with this one since I'm not really a DFA person...

You must be logged in to vote
0 replies
Comment options

Hi,

another alternative is to useProvidedSequencesOracleWrapper:
https://github.com/DES-Lab/AALpy/blob/master/aalpy/oracles/ProvidedSequencesOracleWrapper.py

You just wrap one oracle and this counterexample in this one, and pass it to the learning algorithm.

samples parameter adds samples to the cache, therefore making learning more efficient, but those are not included in the EqOracle.

You must be logged in to vote
0 replies
Answer selected byleonbett
Comment options

Hi,

Thanks for your quick replies!

I am now using ProvidedSequencesOracleWrapper as@emuskardin suggested, and it works.

I have a follow-up question. Suppose I want to learn the regexr'((0x[0-9a-fA-F]+)|infinity|ainfinity|infinityb)'.
I passed 'infinity' to theProvidedSequencesOracleWrapper. The learned regex is:0x(0|1|2|3|4|5|6|7|8|9|a|b|c|d|e|f|A|B|C|D|E|F)(0|1|2|3|4|5|6|7|8|9|a|b|c|d|e|f|A|B|C|D|E|F)*|infinityb?
That is, it learns the suffix afterinfinity, but not the prefix. Is there a way to explore both prefixes and suffixes?

Thanks!

You must be logged in to vote
3 replies
@emuskardin
Comment options

I am not sure if I get what you mean a 100%, but "ainfinity" would then be a new long and hard to randomly sample cex.

Inifinityb is still long and hard, especially with such big input alphabet, but it is basically a new state reached after reaching infinity, so state coverage oracles can find it (with sufficient testing).

It seems to me, that if you are learning this kind of languages, a extension to the EqOracle would be needed.

Something whete you take elements of characterization set (see randomWEqOracle) or previous counterexamples, and make explicit tests with those.

If you paste your code, I can take a closer look on Mon/Tue :)

@leonbett
Comment options

Hi,

Please find my code below. I've included adfa_to_regex function that makes it easier to debug.
It would be great if it were possible to learn "ainfinity" in this example. I think it's really cool that it's already able to learn "infinityb", so this is not urgent (and I don't know if this is query is feasible).

Thanks for your help!
Leon

# If you want to install automata, use this version: pip install automata-lib==6.0.0# Latest version is buggy w.r.t non-alphanumeric charactersimport reimport stringimport aalpyfrom aalpy.base import SUL, Automatonfrom aalpy.automata import Dfafrom aalpy.automata.Dfa import DfaStatefrom aalpy.oracles import StatePrefixEqOracle, UserInputEqOracle, ProvidedSequencesOracleWrapperfrom aalpy.learning_algs import run_Lstar, run_KVfrom aalpy.utils.FileHandler import visualize_automatonfrom automata.fa.gnfa import GNFAfrom automata.fa.dfa import DFAclass RegexSUL(SUL):    """    An example implementation of a system under learning that can be used to learn any regex expression.    Note that the $ is added to the expression as in this SUL only exact matches are learned.    """    def __init__(self, regex: str):        super().__init__()        self.regex = regex if regex[-1] == '$' else regex + '$'        self.string = ""    def pre(self):        self.string = ""        pass    def post(self):        pass    def step(self, letter):        if letter is not None:            self.string += str(letter)        return True if re.match(self.regex, self.string) else Falsedef dfa_to_regex(learned_dfa: Dfa):  states = set()  input_symbols = set()  transitions = dict()  initial_state: str = learned_dfa.initial_state.state_id  final_states = set()  state: DfaState  for state in learned_dfa.states:    states.add(state.state_id)    if state.is_accepting:       final_states.add(state.state_id)    if state.state_id not in transitions:       transitions[state.state_id] = dict()    next_state: DfaState    for inp, next_state in state.transitions.items():      # "Escaping" reserved characters.      inp = inp.replace('{', '⓪')      inp = inp.replace('}', '①')      inp = inp.replace('(', '②')      inp = inp.replace(')', '③')      inp = inp.replace('[', '④')      inp = inp.replace(']', '⑤')      inp = inp.replace('.', '⑥')      inp = inp.replace('+', '⑦')      inp = inp.replace('*', '⑧')      inp = inp.replace('|', '⑨')      inp = inp.replace('?', 'Ⓐ')      input_symbols.add(inp)      assert inp not in transitions[state.state_id] # deterministic automaton      transitions[state.state_id][inp] = next_state.state_id  assert initial_state in states  dfa =DFA(    states=states,    input_symbols=input_symbols,    transitions=transitions,    initial_state=initial_state,    final_states=final_states  )  gnfa = GNFA.from_dfa(dfa)  regex = gnfa.to_regex()  regex = regex.replace('⓪', '{')  regex = regex.replace('①', '}')  regex = regex.replace('②', '(')  regex = regex.replace('③', ')')  regex = regex.replace('④', '[')  regex = regex.replace('⑤', ']')  regex = regex.replace('⑥', ".")  regex = regex.replace('⑦', "+")  regex = regex.replace('⑧', "*")  regex = regex.replace('⑨', "|")  regex = regex.replace('Ⓐ', "?")  return regexdef main():  regex = r'((0x[0-9a-fA-F]+)|infinity|ainfinity|infinityb)'  alphabet = [c for c in "x0123456789abcdefABCDEFintyINTY"]  regex_sul = RegexSUL(regex)  eq_oracle = StatePrefixEqOracle(alphabet, regex_sul, walks_per_state=10000,walk_len=50)  wrapped_eq_oracle = ProvidedSequencesOracleWrapper(alphabet, regex_sul, eq_oracle, [["i", "n", "f", "i", "n", "i", "t", "y"]])  learned_dfa: Dfa = run_Lstar(alphabet, regex_sul, wrapped_eq_oracle, automaton_type='dfa', print_level=3)  print("automaton states:")  print(learned_dfa.states)  print("learned regex:")  regex = dfa_to_regex(learned_dfa)  print(regex)if __name__ == "__main__":  main()
@emuskardin
Comment options

You get this
0x(0|1|2|3|4|5|6|7|8|9|a|b|c|d|e|f|A|B|C|D|E|F)(0|1|2|3|4|5|6|7|8|9|a|b|c|d|e|f|A|B|C|D|E|F)*|infinityb?,
but would this be considered as correct
a*(0x(0|1|2|3|4|5|6|7|8|9|a|b|c|d|e|f|A|B|C|D|E|F)(0|1|2|3|4|5|6|7|8|9|a|b|c|d|e|f|A|B|C|D|E|F)*|infinityb?)?

If yes, replaceStatePrefixEqOracle withRandomWMethodEqOracle, with the same parameterization.
eq_oracle = RandomWMethodEqOracle(alphabet, regex_sul, walks_per_state=10000, walk_len=12)

Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment
Category
Q&A
Labels
None yet
3 participants
@leonbett@zwergziege@emuskardin
Converted from issue

This discussion was converted from issue #52 on November 03, 2023 16:19.


[8]ページ先頭

©2009-2025 Movatter.jp