Add PDF syntax to Rouge #2058
Merged
+225
−0
Changes from 11 commits
16 commits
02b4e29 Initial PDF COS rouge lexer (petervwyatt)
82785e3 Update pdf.rb (petervwyatt)
e54e1d3 Create demo PDF (functional) (petervwyatt)
91d499c Update pdf.rb (petervwyatt)
062647e Add basic spec checker (petervwyatt)
9cf372f Fixups (petervwyatt)
2488909 Altered tokens for better color (petervwyatt)
a8e8c8b More complex PDF for visual test (petervwyatt)
643179c Added EOL to last line of PDF (petervwyatt)
bf5f842 Merge branch 'rouge-ruby:master' into feature.pdf (petervwyatt)
3d374b0 Merge branch 'rouge-ruby:main' into feature.pdf (petervwyatt)
7509fe3 Update lib/rouge/lexers/pdf.rb (petervwyatt)
3bd9155 Update lib/rouge/lexers/pdf.rb (petervwyatt)
4914c82 Update lib/rouge/lexers/pdf.rb (petervwyatt)
2f803a3 Merge branch 'rouge-ruby:main' into feature.pdf (petervwyatt)
43f3d71 Fix spelling. Ensure PERIOD in "%PDF-x.y". Comment added (petervwyatt)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,29 @@ | ||
| %PDF-1.6 | ||
| %©©©© | ||
|
|
||
| 1 0 obj<</Type/Catalog/Pages 2 0 R/StructTreeRoot null/MarkInfo<</Marked false>>>> | ||
| endobj | ||
| 2 0 obj<</Type/Pages/Kids[3 0 R]/Count 1>> | ||
| endobj | ||
| 3 0 obj<</Type/Page/Parent 2 0 R/MediaBox[.0 0 200 200]/Contents 4 0 R/Resources<<>>>> | ||
| endobj | ||
| 4 0 obj<</Length 60>> | ||
| stream | ||
| +8 w 1 j | ||
| 1.0 0 0 rg | ||
| 0 0 1 RG | ||
| 10 10 180 180 re B | ||
| endstream | ||
| endobj | ||
| xref | ||
| 0 5 | ||
| 0000000000 65535 f | ||
| 0000000021 00000 n | ||
| 0000000113 00000 n | ||
| 0000000165 00000 n | ||
| 0000000261 00000 n | ||
| trailer | ||
| <</Root 1 0 R/Size 5/ID[<18D6B641245C03F28E67D93AD879D6EC><18D6B641245C03F28E67D93AD879D6EC>]>> | ||
| startxref | ||
| 371 | ||
| %%EOF |
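Each line of the cross-reference table in the file above is a fixed-width record: a ten-digit byte offset, a five-digit generation number, and `n` (in use) or `f` (free), separated by single spaces. The following is a minimal Ruby sketch of reading one such entry. `XREF_ENTRY` and `parse_xref_entry` are hypothetical names chosen for illustration; they are not part of this pull request or of Rouge.

```ruby
# Hypothetical helper, for illustration only (not part of Rouge or this PR).
# Matches one cross-reference entry: 10-digit offset, 5-digit generation,
# then "n" (in use) or "f" (free), with optional trailing whitespace/EOL.
XREF_ENTRY = /\A(\d{10}) (\d{5}) ([nf])\s*\z/

# Returns a Hash describing the entry, or nil if the line is not a
# well-formed cross-reference entry.
def parse_xref_entry(line)
  m = XREF_ENTRY.match(line)
  return nil unless m

  { offset: m[1].to_i, generation: m[2].to_i, in_use: m[3] == 'n' }
end

entry = parse_xref_entry("0000000021 00000 n")
puts entry.inspect
```

A line that does not use the fixed widths, such as `21 0 n`, is rejected and yields `nil`.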
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,108 @@ | ||
| # -*- coding: utf-8 -*- # | ||
| # frozen_string_literal: true | ||
| # vim: set ts=2 sw=2 et: | ||
|
|
||
| # PDF = Portable Document Format page description language | ||
| # As defined by ISO 32000-2:2020 including resolved errata from https://pdf-issues.pdfa.org/ | ||
| # | ||
| # The PDF syntax is also known as "COS" and can be used with FDF (Forms Data Format) files as | ||
| # per ISO 32000-2:2020 clause 12.7.8. | ||
| # | ||
| # This is a token-based parser ONLY! It is intended to syntax highlight full or partial fragments | ||
| # of cleanly hand-written PDF syntax in documentation such as ISO specifications. It is NOT | ||
| # intended to cope with real-world PDFs that will contain arbitrary binary data (that form invalid | ||
| # UTF-8 sequences and generate "ArgumentError: invalid byte sequence in UTF-8" Ruby errors) and | ||
| # other types of malformations or syntax errors. | ||
| # | ||
| # Author: Peter Wyatt, CTO, PDF Association. 2024 | ||
| # | ||
| module Rouge | ||
| module Lexers | ||
| class Pdf < RegexLexer | ||
| title "PDF" | ||
| desc "PDF - Portable Document Format (ISO 32000)" | ||
| tag 'pdf' | ||
| aliases "fdf", 'cos' | ||
| filenames '*.pdf', '*.fdf' | ||
| mimetypes 'application/pdf', 'application/fdf' # IANA registered media types | ||
|
|
||
| # PDF and FDF files must start with "%PDF-x.y" or "%FDF-x.y" | ||
| # where x is the single digit major version and y is the single digit minor version. | ||
| def self.detect?(text) | ||
| return true if /^%(P|F)DF-\d\.\d/ =~ text | ||
| end | ||
|
|
||
| # PDF Delimiters (ISO 32000-2:2020, Table 1 and Table 2). | ||
| # Ruby whitespace "\s" is /[ \t\r\n\f\v]/ which does not include NUL (ISO 32000-2:2020, Table 1). | ||
| # PDF also supports two-character EOL sequences. | ||
|
|
||
| state :root do | ||
| # Start-of-file header comment is special (comment is up to EOL) | ||
| rule %r/^%(P|F)DF-\d\.\d.*$/, Comment::Preproc | ||
|
|
||
| # End-of-file marker comment is special (comment is up to EOL) | ||
| rule %r/^%%EOF.*$/, Comment::Preproc | ||
|
|
||
| # PDF only has single-line comments: from "%" to EOL | ||
| rule %r/%.*$/, Comment::Single | ||
|
|
||
| # PDF Boolean and null object keywords | ||
| rule %r/(false|true|null)/, Keyword::Constant | ||
|
|
||
| # PDF Dictionary and array object start and end tokens | ||
| rule %r/(<<|>>|\[|\])/, Punctuation | ||
|
|
||
| # PDF Hex string - can contain whitespace and span multiple lines. | ||
| # This rule must be after "<<"/">>" | ||
| rule %r/<[0-9A-Fa-f\s]*>/m, Str::Other | ||
|
|
||
| # PDF literal strings are complex (multi-line, escapes, etc.). Use separate state machine. | ||
| rule %r/\(/, Str, :stringliteral | ||
|
|
||
| # PDF Name objects - can be empty (i.e., nothing after "/"). | ||
| # No special processing required for 2-digit hex codes that start with "#". | ||
| rule %r/\/[^\(\)<>\[\]\/%\s]*/, Name::Other | ||
|
|
||
| # PDF objects and stream (no checking of object ID) | ||
| # Note that object number and generation numbers do not have sign. | ||
| rule %r/\d+\s\d+\sobj/, Keyword::Declaration | ||
| rule %r/(endstream|endobj|stream)/, Keyword::Declaration | ||
|
||
|
|
||
| # PDF conventional file layout keywords | ||
| rule %r/(startxref|trailer|xref)/, Keyword::Declaration | ||
|
||
|
|
||
| # PDF cross reference section entries (20 bytes including EOL). | ||
| # Explicit single SPACE separators. | ||
| rule %r/^\d{10} \d{5} (n|f)\s*$/, Keyword::Namespace | ||
|
|
||
| # PDF Indirect reference (lax, allows zero as the object number). | ||
| # Requires terminating delimiter lookahead to disambiguate from "RG" operator | ||
| rule %r/\d+\s\d+\sR(?=[\(\)<>\[\]\/%\s])/, Name::Decorator | ||
|
|
||
| # PDF Real object (a period is required, so plain integers fall through to the Integer rule) | ||
| rule %r/(\-|\+)?([0-9]+\.[0-9]*|[0-9]*\.[0-9]+)/, Num::Float | ||
|
|
||
| # PDF Integer object | ||
| rule %r/(\-|\+)?[0-9]+/, Num::Integer | ||
|
|
||
| # A run of non-delimiters is most likely a PDF content stream | ||
| # operator (ISO 32000-2:2020, Annex A). | ||
| rule %r/[^\(\)<>\[\]\/%\s]+/, Operator::Word | ||
|
|
||
| # Whitespace (except inside strings and comments) is ignored = /[ \t\r\n\f\v]/. | ||
| # Ruby doesn't include NUL as whitespace (vs ISO 32000-2:2020 Table 1) | ||
| rule %r/\s+/, Text::Whitespace | ||
| end | ||
|
|
||
| # PDF literal string. See ISO 32000-2:2020 clause 7.3.4.2 and Table 3 | ||
| state :stringliteral do | ||
| rule %r/\(/, Str, :stringliteral # recursive for internal bracketed strings | ||
| rule %r/\\\(/, Str::Escape, :stringliteral # recursive for internal escaped bracketed strings | ||
| rule %r/\)/, Str, :pop! | ||
| rule %r/\\\)/, Str::Escape, :pop! | ||
| rule %r/\\([0-7]{3}|n|r|t|b|f|\\)/, Str::Escape | ||
| rule %r/[^\(\)\\]+/, Str | ||
| end | ||
| end | ||
| end | ||
| end | ||
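Two of the patterns in the lexer above are easy to get subtly wrong: the header check needs a literal period between the version digits, and the indirect-reference rule relies on a delimiter lookahead so that `1 0 R` is not confused with the `RG` operator. The sketch below exercises both patterns as plain Ruby regexes, outside Rouge. The constant and method names (`HEADER`, `INDIRECT_REF`, `pdf_header?`, `indirect_refs`) are hypothetical and exist only for this illustration.

```ruby
# Standalone regexes modelled on the rules in the diff (this is not Rouge itself).
# HEADER uses an escaped period so that "%PDF-16" is not accepted.
HEADER = /^%(P|F)DF-\d\.\d/

# An indirect reference must be followed by a PDF delimiter or whitespace,
# which keeps "0 1 RG" (a colour operator) from matching.
INDIRECT_REF = /\d+\s\d+\sR(?=[\(\)<>\[\]\/%\s])/

# True if any line of the text starts with a PDF or FDF header.
def pdf_header?(text)
  HEADER.match?(text)
end

# Returns every indirect reference found in the text, as strings.
def indirect_refs(text)
  text.scan(INDIRECT_REF)
end

puts pdf_header?("%PDF-1.6")
puts indirect_refs("<</Type/Pages/Kids[3 0 R]/Count 1>>").inspect
puts indirect_refs("0 0 1 RG\n").inspect
```

Because the lookahead requires a following character, a reference that is the very last thing in the input (with no trailing delimiter or EOL) is not matched by this pattern.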
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,29 @@ | ||
| # -*- coding: utf-8 -*- # | ||
| # frozen_string_literal: true | ||
|
|
||
| describe Rouge::Lexers::Pdf do | ||
| let(:subject) { Rouge::Lexers::Pdf.new } | ||
|
|
||
| describe 'guessing' do | ||
| include Support::Guessing | ||
|
|
||
| it 'guesses by filename' do | ||
| assert_guess :filename => 'foo.pdf' | ||
| assert_guess :filename => 'foo.fdf' | ||
| end | ||
|
|
||
| it 'guesses by mimetype' do | ||
| assert_guess :mimetype => 'application/pdf' | ||
| assert_guess :mimetype => 'application/fdf' | ||
| end | ||
|
|
||
| it 'guesses by source' do | ||
| assert_guess :source => '%PDF-1.6' | ||
| assert_guess :source => '%PDF-2.0' | ||
| assert_guess :source => '%PDF-0.3' # Fake PDF version | ||
| assert_guess :source => '%PDF-6.8' # Fake PDF version | ||
| assert_guess :source => '%FDF-1.2' | ||
| end | ||
| end | ||
|
|
||
| end |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,58 @@ | ||
| %PDF-1.7 | ||
| %©© | ||
| 1 0 obj | ||
| <</Type/Catalog/MarkInfo<<%comment after dictionary start | ||
| /Marked true/Suspects true%comment after a boolean | ||
| /UserProperties true>>/StructTreeRoot null/AA<</WP<</S/JavaScript/JS(//JavaScript comment | ||
| app.alert\( "Document Will-Print Action!!"\))>>>>/Pages 3 0 R>>%comment after dictionary close | ||
| endobj | ||
| 2 0 obj | ||
| null%comment after null | ||
| endobj | ||
| 3 0 obj | ||
| <</FakeBigDataArray[true[[[]]]true<686931>null<686932>null[/Dummy](hi3)[(hi4)(hi5)true(hi6)null(hi7)12(hi8)]-1.<</ABC +.123/DEF +.0>>[](hi99)[]null[]<</DEF null>>true<</GHI/JKL>>[<</MNO +.0>>]<686933>1 0 R[.1 -2 +.3]6 0 R<686934>4 0 R(hi9)2 0 R<</QRS true>>[true]<</TUV true>><686935><</XYZ true>>3 0 R<</AAB true>>(hi10)<</AAC true>>null<686936>true(hi11)<686937>(hi12)+.0<686938>] | ||
| /Type/Pages/Count 1/Kids[4 0 R%comment after indirect ref | ||
| ]>>endobj | ||
| 4 0 obj | ||
| <</Type/Page/Parent 3 0 R/MediaBox[%comment after array start | ||
| +0 .0 999 999.]%comment after array end token | ||
| /CropBox[+0 .0 999%comment after an integer | ||
| 999.]/Contents[5 0 R]/UserUnit +0.88 | ||
| /Resources<</Pattern<<>>/ProcSet[null]/ExtGState<</ 6 0 R>>/Font<</F1<</Type/Font/Subtype/Type1/BaseFont/Times-Bold/Encoding/WinAnsiEncoding>>>>>>>> | ||
| endobj | ||
| 5 0 obj | ||
| <</Length 757 >> | ||
| stream | ||
| BX /BreakMyParser <</FakeBigDataArray[true[[[]]]true<686931>null<686932>null[/Dummy](hi3)[(hi4)(hi5)true(hi6)null(hi7)12(hi8)]-1.<</ABC +.123/DEF +.0>>[](hi99)[]null[]<</DEF null>>true<</GHI/JKL>>[<</MNO +.0>>]<686933>[1 2 3]<686934>(hi9)<</QRS true>>[true]<</TUV true>><686935><</XYZ true>><</AAB true>>(hi10)<</AAC true>>null<686936>true(hi11)<686937>(hi12)+.0<686938>]>> DP EX | ||
| BT/F1 30 Tf 0 Tr 1 0 0 1 10 950 Tm(PDF Ruby Rouge test file)Tj 1 0 0 1 10 900 Tm | ||
| (This file must NOT be resaved or modified by any tool!!)Tj ET% 3 colored vector graphic squares that are clipped | ||
| / gs q 40 w 75 75 400 400 re W S % stroke then clip a path with a wide black border | ||
| 1 0. .0 rg 75 75 200 200 re f 0 1 0 rg 275 75 200 200 re f .0 0 1 rg 275 275 200 200 re f Q | ||
| endstream | ||
| endobj | ||
| 6 0 obj<</Type/ExtGState/ca 0.33/CA 0.66%comment after a real | ||
| >> | ||
| endobj | ||
| 7 0 obj | ||
| <</Subject(Compacted Syntax v3.0)%comment after literal string end | ||
| /Title<436f6d7061637465642073796e746178>%comment after hex string end | ||
| /Keywords(PDF,Compacted,Syntax,ISO 32000-2:2020)/CreationDate(D:20200317)/Author(Peter Wyatt)/Creator< 48616e | ||
| 642d65646974>/Producer<48616e 6 4 2 d 6 5646974>>> | ||
| endobj | ||
| xref | ||
| 0 8 | ||
| 0000000000 65535 f | ||
| 0000000017 00000 n | ||
| 0000000332 00000 n | ||
| 0000000374 00000 n | ||
| 0000000837 00000 n | ||
| 0000001198 00000 n | ||
| 0000002009 00000 n | ||
| 0000002084 00000 n | ||
| trailer | ||
| <</Root 1 0 R/Info%comment after name | ||
| 7 0 R/ID[<18D6B6412 | ||
| 45C033A6E67D93AD879D6EC><18D 6B 641245C033A6E67D93AD879D6EC>]/Size 8>> | ||
| startxref | ||
| 2403 | ||
| %%EOF |
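The test file above deliberately includes hex strings with embedded spaces and line breaks (for example the `/Creator` and `/Producer` values). In PDF, whitespace inside a hex string is ignored, and an odd number of digits is treated as if a final `0` were appended. The Ruby sketch below decodes such a token under those rules; `decode_hex_string` is a hypothetical helper written for this illustration and is not part of the lexer, which only tokenizes and does not decode.

```ruby
# Hypothetical helper, for illustration only (the lexer itself never decodes).
# Decodes a PDF hex string token such as "<48 61 6e>" into its raw bytes.
def decode_hex_string(token)
  # Drop the angle brackets and all whitespace (spaces, tabs, line breaks).
  digits = token.delete_prefix('<').delete_suffix('>').gsub(/\s+/, '')
  raise ArgumentError, 'not a hex string' unless digits.match?(/\A\h*\z/)

  # An odd digit count is padded with a trailing zero.
  digits += '0' if digits.length.odd?

  # 'H*' packs hex digits high nibble first; the result is a binary string.
  [digits].pack('H*')
end

puts decode_hex_string("<48616e 6 4 2 d 6 5646974>")
puts decode_hex_string("<436f6d7061637465642073796e746178>")
```

The returned string is binary (ASCII-8BIT), since hex strings in real files may carry arbitrary bytes rather than text.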