Commits · master · xentrac / izodparse

Jul 08, 2021

Fix PDF/PS nested paren handling and add floats · f678ac3c
Kragen Javier Sitaker authored 3 years ago
```
This way of doing nested parens was way easier than what I was trying
before.
```
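
A depth counter is one easy way to handle nested parens in PDF/PS literal strings. The sketch below is not the code from this commit: `parse_paren_string` and `number` are hypothetical names, and backslash escapes are skipped over but not decoded.

```python
import re

def parse_paren_string(data, pos=0):
    """Parse a PDF literal string whose opening paren is at data[pos].

    Returns (contents, position just past the closing paren).
    Backslash escapes are copied through undecoded in this sketch.
    """
    if data[pos:pos + 1] != b'(':
        raise ValueError('expected ( at offset %d' % pos)
    depth = 1
    i = pos + 1
    while i < len(data):
        c = data[i:i + 1]
        if c == b'\\':
            i += 2          # skip the escaped byte so \( and \) don't count
            continue
        if c == b'(':
            depth += 1
        elif c == b')':
            depth -= 1
            if depth == 0:
                return data[pos + 1:i], i + 1
        i += 1
    raise ValueError('unterminated string starting at offset %d' % pos)

# Assumed pattern for PDF numbers: optional sign, then digits with an
# optional fraction, or a bare fraction such as .6
number = re.compile(rb'[+-]?(?:\d+\.?\d*|\.\d+)')
```

For example, `parse_paren_string(b'(a (b) c) rest')` yields the contents `b'a (b) c'` and the offset of the byte after the outer closing paren.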

Make PDF trailer parsing lazy · e39478bb

Kragen Javier Sitaker authored 3 years ago

This facilitates exploring PDF files that I can't actually parse yet;
I can still use the Pdf object to look at parts of the file.  For example:

    >>> d = pdf.read('../Descargas/dercuano.20191230.pdf')
    >>> d.trailer
    Traceback (most recent call last):
    ...
      File "/home/compu/izodparse/izodparse/pdf.py", line 202, in <lambda>
        parenstring.xform = lambda d: ('str', bytes(d[0][1])) # XXX croaks on anything with \()
    TypeError: 'tuple' object cannot be interpreted as an integer
    >>> d.get_indirect_obj(440)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/compu/izodparse/izodparse/pdf.py", line 290, in get_indirect_obj
        offset, plumb = result
    TypeError: cannot unpack non-iterable NoneType object
    >>> d.xrefs
    <izodparse.pdf.XrefSection object at 0x7feaef606a30>
    >>> d.xrefs[440]
    b'0000095363 00000 n\r\n'
    >>> d.xrefs.offset_of(440)
    95363
    >>> d.read(_)
    b'440 0 obj\r\n<< /Border [ 0 0 .1 ] /C [ .6 .6 1 ] /Contents (notes'

In this case there are two separate problems: I need to fix
paren-string parsing for the trailer, and I need to be able to parse
fractional numbers in order to read the .6.
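
The lazy approach can be sketched roughly as follows. This is not izodparse's actual code: `LazyPdf`, `_parse_trailer`, and `xref_offset` are hypothetical names, and the trailer "parse" here only slices out the raw bytes. The point is that constructing the object never touches the trailer, so a parse failure only surfaces when `.trailer` is accessed.

```python
class LazyPdf:
    """Hypothetical stand-in for the Pdf object; parses nothing up front."""

    def __init__(self, data):
        self.data = data        # raw file bytes
        self._trailer = None    # filled in on first access

    @property
    def trailer(self):
        # Parse only when asked, so a broken trailer doesn't block
        # construction or the other accessors.
        if self._trailer is None:
            self._trailer = self._parse_trailer()
        return self._trailer

    def _parse_trailer(self):
        # Assumption: a classic (non-stream) trailer near the end of file.
        start = self.data.rindex(b'trailer')    # ValueError if absent
        end = self.data.index(b'startxref', start)
        return self.data[start + len(b'trailer'):end].strip()

    def read(self, offset, size=64):
        return self.data[offset:offset + size]


def xref_offset(entry):
    # An in-use xref entry looks like b'0000095363 00000 n\r\n':
    # 10-digit byte offset, 5-digit generation number, then n or f.
    offset, generation, kind = entry.split()
    if kind != b'n':
        raise ValueError('not an in-use entry: %r' % entry)
    return int(offset)
```

With this shape, `LazyPdf(data).read(xref_offset(entry))` still works on a file whose trailer cannot be parsed.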


Clean up refactored structure · 3f38b4dd
Kragen Javier Sitaker authored 3 years ago
```
Now it's izodparse.pdf.read instead of izodparse.pdftour.read_pdf.
```
Refactor PEG parser out of CMap and PDF grammar · e1da6aec
Kragen Javier Sitaker authored 3 years ago

Make izodparse a Python package · 92544e8c
Kragen Javier Sitaker authored 3 years ago

Split out pdftour into new izodparse repo · 590ea9d8
Kragen Javier Sitaker authored 3 years ago

Clean up pdftour code slightly · c8ba6bf5
Kragen Javier Sitaker authored 3 years ago


Update pdftour comments and make it handle PDF comments · 9f48c99f

Kragen Javier Sitaker authored 3 years ago

I pointed it at a second file, and it immediately croaked on a
comment, so now it supports comments. But now I need to properly
decode backslash-escapes and nested parens inside strings.
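
One simple way to support comments is to treat them as whitespace. In PDF a comment runs from `%` to the end of the line when it appears outside strings and streams. The sketch below assumes that approach and uses hypothetical names (`whitespace`, `skip_whitespace`); it is not necessarily how pdftour does it.

```python
import re

# PDF whitespace bytes are NUL, tab, LF, FF, CR, and space.
# A comment is % followed by everything up to the end of the line.
whitespace = re.compile(rb'(?:[\x00\t\n\x0c\r ]+|%[^\r\n]*)*')

def skip_whitespace(data, pos=0):
    """Return the offset of the first byte that is neither whitespace
    nor part of a comment.  Only valid outside strings and streams."""
    return whitespace.match(data, pos).end()
```

A tokenizer would call `skip_whitespace` before each token, so a comment between two objects is consumed along with the surrounding blanks.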


Update pdftour comments a bit · 5bffcf2a
Kragen Javier Sitaker authored 3 years ago

Implement stream object decompression in pdftour · 82b63209
Kragen Javier Sitaker authored 3 years ago
```
Not, sadly, object stream decompression, yet.
```
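
For reference, decompressing a single stream object with the common `/FlateDecode` filter takes only the standard `zlib` module. This is a minimal sketch, not the code from this commit: `decode_stream` is a hypothetical helper, and it assumes the stream dictionary has already been parsed into a Python dict with string keys and a single filter name.

```python
import zlib

def decode_stream(stream_dict, raw):
    """Return the decoded contents of one stream object.

    stream_dict: assumed to be a dict such as {'Filter': 'FlateDecode'}
    raw: the bytes between the stream and endstream keywords
    """
    filter_name = stream_dict.get('Filter')
    if filter_name is None:
        return raw                      # unfiltered stream
    if filter_name == 'FlateDecode':
        return zlib.decompress(raw)     # zlib-wrapped deflate data
    raise NotImplementedError('unsupported filter: %r' % (filter_name,))
```

Filter arrays, `/DecodeParms` predictors, and object streams (`/Type /ObjStm`) are all left out of this sketch.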

Actually get a CMap out of a PDF with pdftour · e91c364e

Kragen Javier Sitaker authored 3 years ago

I renamed `parsecmaps.py` to `pdftour.py`.  Now it's possible to use
it to navigate the structure of at least one PDF file well enough to
parse a CMap out of it.

This involved adding some stream support, which involved tweaking the
parsing engine a bit.  In keeping with the rest of the
fast-and-loose-exploration nature of the program, it doesn't even
check for `endstream` after the end of the stream, much less `endobj`.

With that, and a bit of tweaking, my `cmaps_for_pages` code from last
week runs now!
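
The fast-and-loose stream reading described above might look something like this sketch, which trusts the `/Length` value and never looks for `endstream` or `endobj`. `read_stream` is a hypothetical name, not a function from `pdftour.py`, and the length is assumed to be already resolved to an integer.

```python
def read_stream(data, pos, length):
    """Return `length` bytes of stream data, given that the `stream`
    keyword starts at data[pos].  Nothing after the data is checked."""
    if not data.startswith(b'stream', pos):
        raise ValueError('expected stream keyword at offset %d' % pos)
    pos += len(b'stream')
    # The keyword is followed by CRLF or a lone LF before the data begins.
    if data.startswith(b'\r\n', pos):
        pos += 2
    elif data.startswith(b'\n', pos):
        pos += 1
    return data[pos:pos + length]
```

A stricter reader would skip any end-of-line marker after the data and verify that `endstream` follows.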


Enable parsecmaps to navigate PDF file graph structure · d7c5547f
Kragen Javier Sitaker authored 3 years ago


Add PDF file traversal skeleton to parsecmaps · ce731ba0

Kragen Javier Sitaker authored 3 years ago

Now I can open a PDF file and parse some objects out of it.  Soon I'll
be able to traverse the object graph.


Jun 28, 2021

Add readable debug display of grammars · 81a38e26

Kragen Javier Sitaker authored 3 years ago

Now we can see what grammar `csranges_to_grammar` has constructed for
us from the CMap file and whether it makes sense.


Jun 25, 2021

Remove Python3.8 dependency from parsecmaps.py · a43833df
Kragen Javier Sitaker authored 3 years ago

Add initial spike of parsing CMaps · 01710fa8
Kragen Javier Sitaker authored 3 years ago