Commits · 3d043dacc35940eaf3d52caecbb56e4564d87100 · Sven M. Hallberg / pdf

Mar 14, 2023

comment · 3d043dac
Sven M. Hallberg authored 2 years ago

3d043dac

no need to check for presence of trailer dictionary · e1d957b9

The parser does this for us, duh (rule xr_td): If there is an xref
section, it is composed of xref entries and a trailer dictionary.
We can use the H_INDEX() macro to access it and cast to the desired
token type, which also includes all appropriate assertions.

e1d957b9

process page content after main parser and only for text extraction · 76e546ce

Sven M. Hallberg authored 2 years ago

Factors this code out of parse_xrefs() where it never belonged, into a
new function process_page_content() that is called from main after the
main parse has succeeded and only if text extraction was requested, i.e.
-x or -X was passed on the command line.

Also adjusts the code for style and drops some related XXXs.

Fixes #49.

76e546ce

comments · 5ec93b40
Sven M. Hallberg authored 2 years ago

5ec93b40

only call parse_catalog if /Root entry is present and of expected type · 9ce4e0f3

Sven M. Hallberg authored 2 years ago

Fixes #42.

NB: The code in parse_xrefs() assumed the presence of a /Root entry in the
trailer dictionary. Such code arguably does not belong in that function,
whose purpose is to obtain the xref data, not to process the document. But
wherever it belongs, the assumption is incorrect, so this fixes the
immediate issue in place.

9ce4e0f3

Mar 01, 2023

bring the two loops in act_txtobj into alignment · 222781c8

Sven M. Hallberg authored 2 years ago

The function act_txtobj() performs two passes over the sequence of
text operators. The first pass tracks position and length of the
string. Then an array is allocated of the indicated length and a
second pass fills it with the actual characters.

The two passes must be close carbon copies of each other or a mismatch
between the predetermined length and the actual number of characters
produced might occur and cause the assertion "txtlen == idx" to fail.

This code structure is obviously bad, a textbook example of why code
duplication should be avoided. Nevertheless, before rewriting it
entirely, this patch at least corrects an immediate bug.

Fixes #44.

222781c8
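
One way to keep such a measure pass and fill pass from drifting apart is to run both through a single routine. The following is only a sketch of that idea, not the code in act_txtobj(): struct frag, walk_frags() and collect_text() are hypothetical names, and a plain array of string fragments stands in for the sequence of text operators.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for one text operator: a string fragment. */
struct frag {
	const char *data;
	size_t len;
};

/*
 * Visit every fragment once.  With out == NULL only the total length
 * is counted; otherwise the bytes are copied to out as well.
 */
static size_t
walk_frags(const struct frag *fs, size_t n, char *out)
{
	size_t idx = 0;

	for (size_t i = 0; i < n; i++) {
		if (out != NULL)
			memcpy(out + idx, fs[i].data, fs[i].len);
		idx += fs[i].len;
	}
	return idx;
}

/* Concatenate all fragments into a fresh NUL-terminated string. */
char *
collect_text(const struct frag *fs, size_t n, size_t *lenp)
{
	size_t txtlen = walk_frags(fs, n, NULL);	/* pass 1: measure */
	char *buf = malloc(txtlen + 1);

	if (buf == NULL)
		return NULL;

	size_t idx = walk_frags(fs, n, buf);		/* pass 2: fill */
	assert(txtlen == idx);		/* same loop, so same count */

	buf[idx] = '\0';
	if (lenp != NULL)
		*lenp = idx;
	return buf;
}
```

Because both passes share one loop, the assertion can only fail if the input changes between the two calls.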

get rid of ts variable in act_txtobj · fbbf2c68

Sven M. Hallberg authored 2 years ago

The node member of TextState_T seems to be written only in one place in
parse_pagenode(), and in such a way that the assertion "ts->node == node"
must always hold. So we might as well assign node where it said ts->node
and get rid of ts completely.

fbbf2c68

access text state consistently through node · 76cb6955

Sven M. Hallberg authored 2 years ago

This one line going through another variable seems spurious.
Note that ts is always equal to &node->ts.

76cb6955

add some defensive asserts in act_textobj · df03908f
Sven M. Hallberg authored 2 years ago

df03908f

avoid an assert in parse_fonts · dee6c415

Sven M. Hallberg authored 2 years ago

This is an assertion of the type that catches an error (in user-supplied
data) that should be handled, namely the case where the /Font entry
of a dictionary is expected to be itself a dictionary but isn't.
The code already contains a path for the case where the /Font entry
is missing (return false) and I suppose the same, including the
TODO item "figure out how to handle", might as well apply instead
of the assertion.

Fixes #45.

dee6c415
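
The distinction drawn above, assertions for programmer errors versus ordinary failure returns for bad input, might look as follows. This is a minimal sketch under assumptions: struct obj, enum obj_type and font_entry_usable() are hypothetical and far simpler than the real data structures in parse_fonts().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified object model. */
enum obj_type { OBJ_OTHER, OBJ_DICT };

struct obj {
	enum obj_type type;
	const struct obj *font;	/* value of /Font, NULL if absent */
};

/*
 * Return true if the /Font entry of the given dictionary can be
 * processed.  Both "missing" and "not a dictionary" depend on the
 * input file, so both are reported to the caller instead of asserted.
 */
bool
font_entry_usable(const struct obj *res)
{
	/* A NULL argument would be a bug in the caller: assert. */
	assert(res != NULL);

	if (res->font == NULL)
		return false;		/* entry missing */
	if (res->font->type != OBJ_DICT)
		return false;		/* entry present, wrong type */
	return true;
}
```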

Feb 28, 2023
- validate that xref entry types cannot overflow our type field · bf2abc90
  Sven M. Hallberg authored 2 years ago
```
The grammar accepts uint64_t, but our field is an enum, i.e. int.
```
  bf2abc90
- print unknown xref stream entry types · c486ecbe
  Sven M. Hallberg authored 2 years ago
```
Instead of throwing an assertion failure.
Fixes #46.
```
  c486ecbe
Feb 27, 2023
- add missing newline in log message · 5d3897a4
  Sven M. Hallberg authored 2 years ago
  
  5d3897a4
- validate number of objects in object streams · 7a830639
  Sven M. Hallberg authored 2 years ago
```
The validation compares the number of elements in the object and index
sequences. The number of index entries is fixed to the number N given in the
stream dictionary, cf. p_objstm__m().

Fixes #48.
```
  7a830639
- validate the number of entries in xref subsections · e0db6374
  Sven M. Hallberg authored 2 years ago
```
Fixes #47.
```
  e0db6374
Feb 17, 2023
- add note about h_bytes · 6f53ff16
  Sven M. Hallberg authored 2 years ago
  
  2023-02-27_RELEASE
  
  6f53ff16
Jan 13, 2023

remove making obj LALR from TODO · 91fe41c8

Sven M. Hallberg authored 2 years ago

Since we can parse incrementally with packrat now, we can do what is
already suggested in the TODO file and abandon this item.

Closes #13.

91fe41c8

remove elemr and elemd · 11d8191b

Sven M. Hallberg authored 2 years ago

If we are using packrat, we can use objs = MANY_WS(obj) for the object
streams as well.

11d8191b

change array rule back to use h_many · 7e628fb0

Sven M. Hallberg authored 2 years ago

This reverts the part of commit f7dbb2ac that reworked the definition
of arrays into explicit grammar recursion in order to make 'obj'
compile with LALR. That project never came to fruition and with
packrat it causes a recursive function call for every array element,
exhausting the stack with large arrays.

Fixes #26.

This does not yet remove the explicitly recursive rules elemd and
elemr because the latter is still used by the object stream parser.

7e628fb0

add some notes about earlier LALR work for future archeologists · 6f77a691
Sven M. Hallberg authored 2 years ago

6f77a691

Jan 06, 2023
- add manpage in mdoc format · 1deb5ccf
  Sven M. Hallberg authored 2 years ago
```
Moves program invocation details from README to pdf.1.mdoc.

Includes the generated ASCII output for convenience. Make sure to regenerate
with 'make doc' after changing the mdoc source.
```
  1deb5ccf
- document generalized behavior of oid argument · 97152f35
  Sven M. Hallberg authored 2 years ago
  
  97152f35
- allow oid argument to select object to print in general · 604e40e8
  Sven M. Hallberg authored 2 years ago
  
  604e40e8
Jan 05, 2023
- lzw.pdf: fix mislabeling of content stream as type ObjStm · ff715e84
  Sven M. Hallberg authored 2 years ago
  
  ff715e84
- add -ds to dump stream data · 19c7e2b9
  Sven M. Hallberg authored 2 years ago
  
  19c7e2b9
Dec 21, 2022

lzw: make sure 'earlychange' can only be 0 or 1 · d8daf230

Sven M. Hallberg authored 2 years ago

The code in pdf.c actually does this already, but there is no reason
not to be defensive here.

Just for completeness' sake: There is nothing theoretically wrong with
having even "earlier changes" (earlychange > 1), but we don't want that.

d8daf230

lzw: just return NULL in act_clear · 0a0ea6bc

Sven M. Hallberg authored 2 years ago

Returning an empty HBytes was an artefact to satisfy the earlier
structure of the grammar and is no longer necessary.

0a0ea6bc

comment fixes · 62d01e7b
Sven M. Hallberg authored 2 years ago

62d01e7b

style fixes · 9b6088c4
Sven M. Hallberg authored 2 years ago
```
Oh. My. God.
```
9b6088c4

Dec 20, 2022

style/comment · e10c9f98
Sven M. Hallberg authored 2 years ago

e10c9f98

lzw: simplify grammar by using h_bind to choose code length · 8e6bcd9c

Sven M. Hallberg authored 2 years ago

This replaces the validations on code9 etc. with one continuation that
picks the appropriate parser.

Also relaxes the parser to allow further output codes after the table is
full. Looking at the spec, it seems to me at this time that the
requirement for a clear code when the table is full is a requirement on
producers of PDF files, but not on the file format itself. As far as I
understand, conforming files can be created by a non-conforming process.

Note: The implementation uses a slight trick to handle the last code
(4095) correctly. Quoting the comment in act_output():

    Rather than going through the effort of ensuring that the last
    code is only updated once, we simply assign one more code as a
    dummy.

So, the table is now 4097 entries in actual size. The last one will
receive a bogus update every cycle, so that the last real code does not.
This is less work than actually detecting and avoiding the bogus updates.

8e6bcd9c
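
The dummy-entry trick can be illustrated in isolation. This is a sketch under assumptions, not the code in act_output(): struct lzw_table, lzw_reset() and lzw_assign() are hypothetical, and the table stores only a prefix code per entry. The constants 4096 and 258 follow from 12-bit code words and the reserved codes 256 (clear table) and 257 (end of data).

```c
#include <assert.h>

#define LZW_NCODES 4096	/* codes 0..4095 with 12-bit code words */
#define LZW_FIRST  258	/* first free code after 256 and 257 */

struct lzw_table {
	int prefix[LZW_NCODES + 1];	/* one extra slot: the dummy */
	int next;			/* next code to assign */
};

void
lzw_reset(struct lzw_table *t)
{
	t->next = LZW_FIRST;
}

/*
 * Assign the next code unconditionally.  Once the table is full,
 * next stays at LZW_NCODES and every further write lands in the
 * dummy slot, so the entry for code 4095 is never overwritten.
 */
void
lzw_assign(struct lzw_table *t, int prefix)
{
	assert(t->next >= LZW_FIRST && t->next <= LZW_NCODES);

	t->prefix[t->next] = prefix;
	if (t->next < LZW_NCODES)
		t->next++;
}
```

The write itself needs no "is the table full?" test; the only branch is the one that stops the counter from advancing.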

Dec 19, 2022
- rename LZW_context_T to struct context · 515fc9ed
  Sven M. Hallberg authored 2 years ago
```
Since we don't expose the struct (any more), we might as well pick a
simpler name for it.
```
  515fc9ed
- style and comments · a0f2cb6d
  Sven M. Hallberg authored 2 years ago
  
  a0f2cb6d
- remove unused debug code · a7211f06
  Sven M. Hallberg authored 2 years ago
  
  a7211f06
- comment fix · 43c6932a
  Sven M. Hallberg authored 2 years ago
  
  43c6932a
- use init_LZW_context in init_LZW_parser · 793a2723
  Sven M. Hallberg authored 2 years ago
  
  793a2723
- trigger an assert if malloc fails · b3d817dd
  Sven M. Hallberg authored 2 years ago
```
Also removes an unneeded memset.
```
  b3d817dd
- lzw: parse/process input in blocks · b1c02c91
  Sven M. Hallberg authored 2 years ago
```
This avoids creating an HBytes for each and every code word. Instead, the
code words are collected into blocks behind each clear code and translated
together into a single HBytes per block.
```
  b1c02c91
- lzw: switch to a fixed-size table with internally linked codes · a6ea35eb
  Sven M. Hallberg authored 2 years ago
```
This saves us from allocating and freeing the HBytes that were stored in
the table. It should also save memory since it essentially shares common
prefixes between codes.

The only remaining call to malloc() is the one allocating the global
context object itself.
```
  a6ea35eb
- correct a comment · a8734305
  Sven M. Hallberg authored 2 years ago
  
  a8734305
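
A fixed-size table with internally linked codes, as described for a6ea35eb, could be laid out roughly like this. It is a sketch under assumptions rather than the actual implementation: struct code_entry, struct code_table and the three functions are hypothetical names. Each entry stores only its final byte and a link to the code of its prefix, so common prefixes are shared and nothing is allocated per code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NCODES 4096

struct code_entry {
	int16_t prev;	/* code of the prefix string, -1 for none */
	uint8_t last;	/* final byte of this code's string */
	uint16_t len;	/* total length of this code's string */
};

struct code_table {
	struct code_entry e[NCODES];
	int next;	/* next free code */
};

void
table_init(struct code_table *t)
{
	for (int i = 0; i < 256; i++) {
		t->e[i].prev = -1;
		t->e[i].last = (uint8_t)i;
		t->e[i].len = 1;
	}
	t->next = 258;	/* 256 and 257 are reserved, never expanded */
}

/* Add the string "string of prev, then last"; -1 if the table is full. */
int
table_add(struct code_table *t, int prev, uint8_t last)
{
	assert((prev >= 0 && prev < 256) || (prev >= 258 && prev < t->next));

	if (t->next >= NCODES)
		return -1;
	t->e[t->next].prev = (int16_t)prev;
	t->e[t->next].last = last;
	t->e[t->next].len = (uint16_t)(t->e[prev].len + 1);
	return t->next++;
}

/* Write the string for code to out, back to front; return its length. */
size_t
table_expand(const struct code_table *t, int code, uint8_t *out)
{
	size_t len = t->e[code].len;

	for (size_t i = len; i > 0; i--) {
		out[i - 1] = t->e[code].last;
		code = t->e[code].prev;
	}
	return len;
}
```

The caller must provide an output buffer of at least e[code].len bytes; the stored length is what allows the string to be written from its end without a temporary buffer.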