- Jul 08, 2021
-
-
Kragen Javier Sitaker authored
This way of doing nested parens was way easier than what I was trying before.
-
Kragen Javier Sitaker authored
This facilitates exploring PDF files that I can't actually parse yet; I can still use the Pdf object to look at parts of the file. For example:

    >>> d = pdf.read('../Descargas/dercuano.20191230.pdf')
    >>> d.trailer
    Traceback (most recent call last):
      ...
      File "/home/compu/izodparse/izodparse/pdf.py", line 202, in <lambda>
        parenstring.xform = lambda d: ('str', bytes(d[0][1])) # XXX croaks on anything with \()
    TypeError: 'tuple' object cannot be interpreted as an integer
    >>> d.get_indirect_obj(440)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/compu/izodparse/izodparse/pdf.py", line 290, in get_indirect_obj
        offset, plumb = result
    TypeError: cannot unpack non-iterable NoneType object
    >>> d.xrefs
    <izodparse.pdf.XrefSection object at 0x7feaef606a30>
    >>> d.xrefs[440]
    b'0000095363 00000 n\r\n'
    >>> d.xrefs.offset_of(440)
    95363
    >>> d.read(_)
    b'440 0 obj\r\n<< /Border [ 0 0 .1 ] /C [ .6 .6 1 ] /Contents (notes'

In this case there are two separate problems: I need to fix paren-string parsing for the trailer, and I need to be able to read fractions to read the .6.
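Two of the low-level steps visible in that session, turning a cross-reference entry into a byte offset and reading a number such as `.6`, can be sketched as follows. This is a minimal illustration and not the code in izodparse: `parse_xref_entry` and `parse_number` are hypothetical names, and the sketch assumes classic 20-byte xref entries rather than cross-reference streams.

```python
import re

def parse_xref_entry(entry: bytes):
    """Parse one classic 20-byte xref entry (hypothetical helper).

    Layout: 10-digit offset, space, 5-digit generation, space,
    'n' (in use) or 'f' (free), then a 2-byte end-of-line.
    Returns (offset, generation, in_use).
    """
    if len(entry) != 20:
        raise ValueError('xref entries are exactly 20 bytes long')
    offset = int(entry[0:10])
    generation = int(entry[11:16])
    kind = entry[17:18]
    if kind not in (b'n', b'f'):
        raise ValueError('unknown xref entry type %r' % kind)
    return offset, generation, kind == b'n'

# Optional sign, then digits with an optional decimal point. Either side
# of the point may be empty, but not both: '.6', '4.', and '-0.1' all match.
NUMBER = re.compile(rb'[+-]?(?:\d+\.?\d*|\.\d+)')

def parse_number(data: bytes, pos: int = 0):
    """Read a PDF number at data[pos] (hypothetical helper).

    Returns (value, position_after_number), or None if no number starts there.
    """
    m = NUMBER.match(data, pos)
    if m is None:
        return None
    text = m.group().decode('ascii')
    value = float(text) if '.' in text else int(text)
    return value, m.end()
```

With the entry from the session, `parse_xref_entry(b'0000095363 00000 n\r\n')` gives the same offset that `d.xrefs.offset_of(440)` reported, and `parse_number(b'.6 .6 1 ]')` reads the leading `.6` as a float.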
-
Kragen Javier Sitaker authored
Now it's izodparse.pdf.read instead of izodparse.pdftour.read_pdf.
-
Kragen Javier Sitaker authored
-
Kragen Javier Sitaker authored
-
Kragen Javier Sitaker authored
-
Kragen Javier Sitaker authored
-
Kragen Javier Sitaker authored
I pointed it at a second file, and it immediately croaked on a comment, so now it supports comments. But now I need to properly decode backslash-escapes and nested parens inside strings.
-
Kragen Javier Sitaker authored
-
Kragen Javier Sitaker authored
Not, sadly, object stream decompression, yet.
-
Kragen Javier Sitaker authored
I renamed `parsecmaps.py` to `pdftour.py`. Now it's possible to use it to navigate the structure of at least one PDF file well enough to parse a CMap out of it. This involved adding some stream support, which involved tweaking the parsing engine a bit. In keeping with the rest of the fast-and-loose-exploration nature of the program, it doesn't even check for `endstream` after the end of the stream, much less `endobj`. With that, and a bit of tweaking, my `cmaps_for_pages` code from last week runs now!
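The kind of loose stream reading described here might look like the sketch below. It is an assumption-laden illustration, not the code in `pdftour.py`: `read_stream` is a hypothetical name, `/Length` is assumed to be a direct integer rather than an indirect reference, and, in the same fast-and-loose spirit, nothing checks for `endstream` or `endobj` afterwards.

```python
import re

# A direct integer /Length; the lookahead rejects indirect references
# such as "/Length 12 0 R", which this sketch does not resolve.
LENGTH = re.compile(rb'/Length\s+(\d+)\b(?!\s+\d+\s+R\b)')
# The stream keyword must be followed by CRLF or LF before the data.
STREAM = re.compile(rb'stream(?:\r\n|\n)')

def read_stream(data: bytes, pos: int = 0):
    """Return the raw (still encoded) stream bytes of the object at pos."""
    length = LENGTH.search(data, pos)
    if length is None:
        raise ValueError('no direct /Length found')
    start = STREAM.search(data, length.end())
    if start is None:
        raise ValueError('no stream keyword found')
    n = int(length.group(1))
    # Deliberately no check that 'endstream' follows the data.
    return data[start.end():start.end() + n]
```

The returned bytes are whatever the stream's filters produced; decompressing them would be a separate step.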
-
Kragen Javier Sitaker authored
-
Kragen Javier Sitaker authored
Now I can open a PDF file and parse some objects out of it. Soon I'll be able to traverse the object graph.
-
- Jun 28, 2021
-
-
Kragen Javier Sitaker authored
Now we can see what grammar `csranges_to_grammar` has constructed for us from the CMap file and whether it makes sense.
-
- Jun 25, 2021
-
-
Kragen Javier Sitaker authored
-
Kragen Javier Sitaker authored
-