Conversation
erezsh left a comment
Awesome! I only gave it a quick look so far, but looks pretty good.
I left a few comments. I'll take a deeper look later, and I might come up with more.
Btw, can we drop support for the now completely unsupported versions 3.6 and 3.7? https://devguide.python.org/versions/

Yeah, I think we can. Users of <=3.7 will just stay stuck on this version of Lark, which is fine. But I think it should be in a separate PR.
erezsh left a comment
Overall, looks really good!
There is one design idea that I have; I hope you'll keep an open mind. What if we change the definition of text to `text: str | TextSlice`, where `TextSlice` has 3 fields: `text`, `start`, `end`?
That would follow the principle of keeping related data together, and also make all the `None` checks a little less awkward (I think).
And as a bonus, that way we don't have to add new parameters to every function on the way.
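A minimal sketch of what such a `TextSlice` might look like (the three field names come from the comment above; everything else, like using a frozen dataclass and adding `__len__`, is my own guess at a reasonable shape, not the eventual Lark API):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextSlice:
    """A region of `text` between `start` (inclusive) and `end` (exclusive)."""
    text: str
    start: int
    end: int

    def __len__(self) -> int:
        return self.end - self.start


s = TextSlice("hello world", 6, 11)
print(s.text[s.start:s.end], len(s))  # world 5
```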
lark/lexer.py (outdated)

```python
        if m:
            return m.group(0), m.lastgroup

    def search(self, text, start_pos, end_pos):
```
I propose a different way to write this function:
```python
def search(self, text, start_pos, end_pos):
    results = list(filter(None, [
        mre.search(text, start_pos, end_pos)
        for mre in self._mres
    ]))
    if not results:
        return None
    best = min(results, key=lambda m: m.start())
    return (best.group(0), best.lastgroup), best.start()
```

```python
# We don't want to check early since this can be expensive.
valid_end = []
ip = self.parse_interactive(text, start=chosen_start, start_pos=found.start_pos, end_pos=end_pos)
tokens = ip.lexer_thread.lex(ip.parser_state)
```
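To make the intent of the proposed `search` concrete, here is a self-contained version (the `mres` list of compiled patterns is passed in explicitly, standing in for `self._mres`): among all patterns that match inside `[start_pos, end_pos)`, the one whose match starts earliest wins, regardless of pattern order.

```python
import re


def search(mres, text, start_pos, end_pos):
    # Try every compiled pattern and keep only the ones that matched.
    results = list(filter(None, [
        mre.search(text, start_pos, end_pos)
        for mre in mres
    ]))
    if not results:
        return None
    # The earliest-starting match wins, regardless of pattern order.
    best = min(results, key=lambda m: m.start())
    return (best.group(0), best.lastgroup), best.start()


mres = [re.compile(r"(?P<NUMBER>\d+)"), re.compile(r"(?P<WORD>[a-z]+)")]
print(search(mres, "foo 42", 0, 6))  # (('foo', 'WORD'), 0)
```

Note that NUMBER does match (`"42"` at index 4), but WORD's match starts earlier, so it is returned even though NUMBER comes first in the list.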
I had in mind that it doesn't work, gotta check.
Yep, sounds like a good idea. I think in a later PR it might also make sense to go a step further and make

I'm not sure what exactly you have in mind, but overall it sounds like a positive thing. Are you referring to the text also accepting bytes, or something else?

Specifically, the thing I had in mind is some kind of externally generated token stream with a custom lexer.

Hmm... I'm not sure I get it.

Imagine using something like stdlib's
I am going to bed for now, but an observation I am already making is that using

Sounds like that might be a good idea.

@MegaIng Do you have plans to submit the TextSlice PR in the near future? If not, maybe I'll take a crack at it.

Sorry, no, not next-few-days soon. I forgot about this. It was a bit more work than I thought (since it also needed to touch a chunk of the earley code). I am not currently on my main PC, so I can't even provide the work-in-progress, but I would be able to in ~2 days.

Sure, I'll wait until you're back at your PC. I'm not in a rush, I'm just in a mood to work on something, and might as well do this. The scan feature is pretty cool, so I'll be happy to unblock it if I can.
If you want to work on something, maybe you can come up with a solution to the issue of multi-path imports, as seen in this SO question? The problem is that the same terminal gets imported with different names and then of course collides with "itself". I don't really know if there is a good solution to this, maybe you can come up with something.

I gave it some thought, and it's a tricky problem. I wrote my thoughts here: #1448

@erezsh Pushed a version of
An implementation of `Lark.scan`. Also adds `start_pos` and `end_pos` to `Lark.parse`, `Lark.parse_interactive` and `Lark.lex`.

TODO:

- `start_pos` and `end_pos` mirroring the behavior of stdlib `re` with regard to look-behind and look-ahead.

But I do think the core logic is pretty stable and I would like a review of that already @erezsh.

Future work:

- `mmap` to not have to load the text into memory at all (also involves checking up on the byte parsing implementation)
- the stdlib `tokenize` module, which would have a few benefits especially with regard to the new f-string syntax, and how well that would play with this feature.

This PR is based on #1428, so merging it first would be better.
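For reference, the stdlib `re` behavior the TODO mentions can be demonstrated directly: `Pattern.search(string, pos, endpos)` does not hide the text before `pos` from look-behind, while `endpos` acts as if the string were truncated, so look-ahead cannot see past it (a small sketch, using plain `re` rather than Lark):

```python
import re

# Look-behind may inspect text *before* pos: searching "ab" from
# index 1 still lets (?<=a) see the "a" at index 0.
behind = re.compile(r"(?<=a)b")
print(behind.search("ab", 1))    # matches "b"

# endpos behaves as if the string were endpos characters long:
# the look-ahead (?=b) cannot see the "b" beyond endpos=1.
ahead = re.compile(r"a(?=b)")
print(ahead.search("ab", 0, 1))  # None
```

This asymmetry is why mirroring `re` is called out explicitly: slicing the string (`text[start_pos:end_pos]`) would give different look-behind results than passing `start_pos`/`end_pos` through.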