should be able to recognize CFG with left-recursion, hopefully in O(n^3) time. there's a benchmark there for the ambiguous "n+n+n+"-grammar, and that seems to fit best in O(n^2) actually
@jamey I should probably write something proper about it. it might be hard to understand if you're not in my head
@judofyr This is super cool, and, you're right, I don't entirely understand it. Like, I'm ignoring the context-free rule stuff entirely for the moment. 😅
Your computation of the first/last/pair sets is clear and follows nicely from the definitions of the corresponding operators, which is really neat to see!
But am I guessing correctly that if the same literal character appears multiple places in the grammar, that this will accept strings which it shouldn't?
@jamey I don't think it will be a problem. do you have an example of what it might accept then?
@judofyr I _think_ the regex "aab" would match the input "ab" as well as "aaaaab"? and I think the regex "aba" would match any length of "a(ba)*". if it doesn't do that, then that's just another thing I don't yet understand about how it works. 😁
@jamey ah, it's crucial that the literal character parsers are not re-used in the grammar, but you can just instantiate two different "a"-parsers.
c("a") >> c("a") >> c("b")
a = c("a")
a >> a >> c("b")
actually, Rust's type system would be perfect for this case. Seq/Alt would take ownership of the children to ensure correctness.
@judofyr ohhh, the transitions are over objects, not characters, I guess? I had thought _compute_transitions was just looping over characters. but if this is actually forming a graph between Char objects that effectively encodes the NFA, then this makes a lot more sense to me!
@jamey yess! Char#first_set returns a set of self and not a set of the character. the NFA will have n number of states where n = number of Char objects
@judofyr and Grammar's active_states variable is a set that quite directly represents the set of states that are currently viable in the NFA. which seems like on each character in the input, it would do work proportional to how much ambiguity there is in the grammar, and that seems good. this is cool!
@jamey yes! I think patterns like (a*a*a*)* are causing the most transitions, but even then I think the work we do per character is quadratic in terms of grammar size (and not dependent on the length)
want a couple of friendly challenges for additional features? 😁
@jamey hah. what are you thinking of? extracting parse tree?
@judofyr oh yeah, that'd be a good one too 😁
first off, I'm not sure how to fit intersection and complement operators into this approach. do you think that's feasible?
second, I think you could do the semiring-weights thing if you replace active_states with a map/dictionary/whatever it's called in Ruby, and make the values be the weight currently stored at that state. when multiple transitions enter the same target state, you'd "add" the weights together. does that make sense?
@jamey not quite sure about intersection. I guess it’s doable by introducing two end states (one per branch) and then have a special case which says that if those two are active at the same time, we proceed as usual. would definitely be easier if active_states was a bitset (which it should be anyway)
why is intersection interesting btw? I looked into Boolean grammars a few years ago, but I failed to see the use cases.
I haven’t looked into weights at all :)
@judofyr I asked for a couple of reasons. the simple one is I'm just curious how easy or hard it is to slot into this approach. it was almost embarrassingly trivial with the "Play" paper approach.
the more complicated answer is http://www.megacz.com/berkeley/research/papers/sbp-preprint.pdf; context-free grammars can't parse things like the longest-match rule for identifiers without having a lexer in front helping them out, and I don't like using two different tools to describe a language. boolean grammars can even parse indentation!
@jamey it's a bit more complicated here because we throw away the parser structure and base it only on each_pair/first_set/last_set. the automaton must then be able to handle boolean transitions.
I just felt it would be easier to extend the parser with support with precedence than deal with weird complement/intersection operators (at least from a practical point of view).
I've liked the approach of e.g. https://arxiv.org/abs/1207.0443
@judofyr on the one hand I definitely want practical solutions, and on the other hand I dislike tools that add weird special cases instead of working out a general solution. like http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.37.7828&rep=rep1&type=pdf which described a bunch of issues I care about and then proposed to solve them with very limited extensions to CFGs.
btw, precedence is a trivial rewrite if you have a Boolean parser: a>b => a|(b&~a). I was thinking of adding a convenience for it to my library.
@judofyr I tend to think that if the underlying theory has fewer special cases, that there will be more surprising things you'll discover you can do with it? for example, I feel like unparsing should be almost as easy as parsing in my library, although I haven't worked out how.
I'm particularly interested in the semiring stuff because that can encode a lot of context-sensitivity in ways I think feel reasonable--like carrying a symbol table around to disambiguate C's type-vs-variable thing.
@jamey in the Glushkov automaton the performance is relative to grammar size, and naive use of complement/intersection might cause grammar growth
@jamey I'm mainly afraid of the complexity of the resulting grammar (especially with respect to error messages, but also performance and debugability)
@judofyr yup, that's absolutely a fair concern!
also I want to back up and say again that I still think what you've written is awesome, even if you go in a different direction than I wanted to explore. you've captured an algorithm that made no sense to me and made it much more clear.
@jamey also: active_states should be a set to not completely explode on stuff like a*a*. wasn’t needed in my test cases since it only happens on obviously ambiguous grammars
A Mastodon instance for Rubyists & friends