Conversation
erezsh left a comment
Awesome! I only gave it a quick look so far, but looks pretty good.
I left a few comments. I'll take a deeper look later, and I might come up with more.
Btw, can we drop support for the now completely unsupported versions 3.6 and 3.7? https://devguide.python.org/versions/

Yeah, I think we can. Users of <=3.7 will just stay stuck on this version of Lark, which is fine. But I think it should be in a separate PR.
erezsh left a comment
Overall, looks really good!
There is one design idea that I have; I hope you'll keep an open mind. What if we change the definition of text to `text: str | TextSlice`, where `TextSlice` has 3 fields: `text`, `start`, `end`?
That would follow the principle of keeping related data together, and also make all the `None` checks a little less awkward (I think).
And as a bonus, that way we don't have to add new parameters to every function on the way.
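A minimal sketch of what such a `TextSlice` might look like (the three field names come from the comment above; everything else, like using a frozen dataclass and adding `__len__`, is my own guess at a reasonable shape, not the eventual Lark API):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextSlice:
    """A region of `text` between `start` (inclusive) and `end` (exclusive)."""
    text: str
    start: int
    end: int

    def __len__(self) -> int:
        return self.end - self.start


s = TextSlice("hello world", 6, 11)
print(s.text[s.start:s.end], len(s))  # world 5
```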
lark/lexer.py (outdated)

```python
        if m:
            return m.group(0), m.lastgroup

    def search(self, text, start_pos, end_pos):
```
I propose a different way to write this function:
```python
def search(self, text, start_pos, end_pos):
    results = list(filter(None, [
        mre.search(text, start_pos, end_pos)
        for mre in self._mres
    ]))
    if not results:
        return None
    best = min(results, key=lambda m: m.start())
    return (best.group(0), best.lastgroup), best.start()
```

```python
# We don't want to check early since this can be expensive.
valid_end = []
ip = self.parse_interactive(text, start=chosen_start, start_pos=found.start_pos, end_pos=end_pos)
tokens = ip.lexer_thread.lex(ip.parser_state)
```
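To make the intent of the proposed `search` concrete, here is a self-contained version (the `mres` list of compiled patterns is passed in explicitly, standing in for `self._mres`): among all patterns that match inside `[start_pos, end_pos)`, the one whose match starts earliest wins, regardless of pattern order.

```python
import re


def search(mres, text, start_pos, end_pos):
    # Try every compiled pattern and keep only the ones that matched.
    results = list(filter(None, [
        mre.search(text, start_pos, end_pos)
        for mre in mres
    ]))
    if not results:
        return None
    # The earliest-starting match wins, regardless of pattern order.
    best = min(results, key=lambda m: m.start())
    return (best.group(0), best.lastgroup), best.start()


mres = [re.compile(r"(?P<NUMBER>\d+)"), re.compile(r"(?P<WORD>[a-z]+)")]
print(search(mres, "foo 42", 0, 6))  # (('foo', 'WORD'), 0)
```

Note that NUMBER does match (`"42"` at index 4), but WORD's match starts earlier, so it is returned even though NUMBER comes first in the list.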
I had in mind that it doesn't work, gotta check.
Yep, sounds like a good idea. I think in a later PR it might also make sense to go a step further and make

I'm not sure what exactly you have in mind, but overall it sounds like a positive thing. Are you referring to the text also accepting bytes, or something else?

Specifically, the thing I had in mind is some kind of externally generated token stream with a custom lexer.

Hmm... I'm not sure I get it.

Imagine using something like stdlib's
I am going to bed for now, but an observation I am already making is that using

Sounds like that might be a good idea.

@MegaIng Do you have plans to submit the TextSlice PR in the near future? If not, maybe I'll take a crack at it.

Sorry, no, not next-few-days soon. I forgot about this. It was a bit more work than I thought (since it also needed to touch a chunk of the earley code). I am not currently on my main PC, so I can't even provide the work-in-progress, but I would be able to in ~2 days.

Sure, I'll wait until you're back at your PC. I'm not in a rush, I'm just in a mood to work on something, and might as well do this. The scan feature is pretty cool, so I'll be happy to unblock it if I can.
If you want to work on something, maybe you can come up with a solution to the issue of multi-path imports, as seen in this SO question? The problem is that the same terminal gets imported with different names and then of course collides with "itself". I don't really know if there is a good solution to this, maybe you can come up with something.

I gave it some thought, and it's a tricky problem. I wrote my thoughts here: #1448

@erezsh Pushed a version of
An implementation of `Lark.scan`. Also adds `start_pos` and `end_pos` to `Lark.parse`, `Lark.parse_interactive` and `Lark.lex`.

TODO:

- `start_pos` and `end_pos` mirroring the behavior of stdlib `re` with regard to look-behind and look-ahead.

But I do think the core logic is pretty stable and I would like a review of that already @erezsh.

Future work:

- `mmap` to not have to load the text into memory at all (also involves checking up on the byte parsing implementation)
- the stdlib `tokenize` module, which would have a few benefits especially with regard to the new f-string syntax, and how well that would play with this feature.

This PR is based on #1428, so merging it first would be better.
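For reference, the stdlib `re` behavior the TODO mentions can be demonstrated directly: `Pattern.search(string, pos, endpos)` does not hide the text before `pos` from look-behind, while `endpos` acts as if the string were truncated, so look-ahead cannot see past it (a small sketch, using plain `re` rather than Lark):

```python
import re

# Look-behind may inspect text *before* pos: searching "ab" from
# index 1 still lets (?<=a) see the "a" at index 0.
behind = re.compile(r"(?<=a)b")
print(behind.search("ab", 1))    # matches "b"

# endpos behaves as if the string were endpos characters long:
# the look-ahead (?=b) cannot see the "b" beyond endpos=1.
ahead = re.compile(r"a(?=b)")
print(ahead.search("ab", 0, 1))  # None
```

This asymmetry is why mirroring `re` is called out explicitly: slicing the string (`text[start_pos:end_pos]`) would give different look-behind results than passing `start_pos`/`end_pos` through.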