Commits · 1afde767c483e9672beed15c549717de2db061da · Sven M. Hallberg / pdf

Apr 14, 2023

print an error message if /Root not found · 1afde767
Sven M. Hallberg authored 1 year ago
```
If we are actually processing page content, that is.
```
2023-04-28_RELEASE

1afde767

correctly look for /Root in the last trailer section · f0c8a473

A mistake snuck into commit 76e546ce, taking the last element of
the xrefs array as the "last" trailer section. But the array is
filled in reverse order by following the chain of startxref and
/Prev pointers, so the (logical) last/latest section is xrefs[0].

f0c8a473

Apr 13, 2023

fix format specifier for printing HBytes · 06ed0943

Sven M. Hallberg authored 1 year ago

Since HBytes is a length/pointer pair and not a null-terminated
string, we must pass the length as an argument to printf. The correct
format specifier for that is "%.*s" (string with "precision" = length),
not "%*s" (string with minimum field width).

06ed0943

add missing printf argument · 11e873cc
Sven M. Hallberg authored 1 year ago
```
Forgotten in b3dda3fe when adding the input file name to error messages.
```
11e873cc

Mar 30, 2023

remove stale comment · 656f5a3f

Sven M. Hallberg authored 1 year ago

Finished reviewing past modifications to parse_xrefs().

NB: All code attributed to Sumit Ray has been removed from this function.

656f5a3f

improve handling of parse errors in xref stream data · a1014f81

Sven M. Hallberg authored 1 year ago

Improve on the bugfix in commit a5abf1e2:

- Reinstate the assert for 'res->ast != NULL'. If it fails, there is a bug
  in the parser, not an error in the input file.
- Provide a distinct error message for the case where p_xref fails on a
  cross-reference stream because of invalid data.
- Only skip storing the invalid section. Try to follow the /Prev entry in
  the stream dictionary to find more sections.

a1014f81

remove a comment · 512de3c2

Sven M. Hallberg authored 1 year ago

I cannot tell what this refers to. The (nonexistent) else case of the
if statement above it is simply the case of the object number in question
not falling within this subsection.

Anyway, the function lookup_xref() is a low-level utility used during
parsing, not a place to produce error messages.

512de3c2

comments regarding act_ks_value · c8be9e84

Sven M. Hallberg authored 1 year ago

HParseResult was introduced in 6b54ebfa (generally parse stream objects)
to hold the result of parsing the stream data, including the application
of any filters. This is produced in act_ks_value(). The fact that parse
errors in stream data are thus detectable is in fact significant for
xref stream processing, so we should not just return the bare data on
error.

c8be9e84

adjust comments · e6196639
Sven M. Hallberg authored 1 year ago

e6196639

don't emulate VIOL in error messages · b3dda3fe

Sven M. Hallberg authored 1 year ago

While it might seem like a good idea to "grade" errors by severity,
we are not *really* in any place to do so accurately. Our tasks are

 (a) to decide, internally, whether to print a message or silently
     ignore a malformation, and
 (b) to ultimately judge the file valid or invalid as a whole.

Note that the latter part, as stated before, is not the responsibility
of parse_xrefs().

Reinstate the input file name in these error messages. That information
is useful when running the program on multiple files from a script, as
we have been doing.

While we're at it, fix style (line lengths).

b3dda3fe

add test cases for out-of-bounds xref pointers · 9ff8c465
Sven M. Hallberg authored 1 year ago
```
Both currently fail because the parser proper does not validate
these offsets.
```
9ff8c465

drop use of h_seek in parse_xrefs · 9196b5c2

Sven M. Hallberg authored 1 year ago

Now that we are validating the offset ourselves, we no longer need
h_seek() to do our bounds checking. But add a defensive assert just
in case.

9196b5c2

bounds-check /Prev pointers · dd3c8e62

Sven M. Hallberg authored 1 year ago

Mirrors the check for startxref. I considered unifying the two into one
test at the start of the loop, but then we would lose the information
whether we got the offset from startxref or a /Prev.

dd3c8e62

report location of invalid startxref · aa405607

Sven M. Hallberg authored 1 year ago

This is useful information, especially in hex, when looking into the file.
The invalid value itself, on the other hand, is not so useful.

aa405607

adjust error message · 550c070d

Sven M. Hallberg authored 1 year ago

The correct and standard format specifier for values of type size_t is %zu.
There is no need to point out the valid bounds.
Match style with the other messages.

550c070d

remove useless/erroneous condition · 431c7db3

Sven M. Hallberg authored 1 year ago

The offset can never be negative (size_t is unsigned).
And this treated offset = 0 as out of bounds, which is nonsense.
In fact, offset == size is also not invalid (it is the end of file).

431c7db3

Mar 28, 2023

revert parse_xrefs to its original signature · 9883a543

Sven M. Hallberg authored 1 year ago

Passing the aux struct by reference may look cleaner, but it was
deliberate to keep parse_xrefs() independent of that struct, since the
latter is conceptually part of the parser's interface and the former is
not.

Also, this way parse_xrefs() has a proper return value that signals
success or failure.

Plus, no ugly indirection or temporary variable is needed to access sz.

9883a543

move parse_xrefs back next to main · 81dc4dba

Sven M. Hallberg authored 1 year ago

Move parse_xrefs() back in its proper place as a helper to main(),
including the definition of the global variable 'infile' with the rest
of the command line arguments.

It had been moved in fbbe953f when the content processing code was
confusedly hooked into the function.

Also removes marker comments about "Start/End xref parsing". The code
between them is not exclusively concerned with xrefs and their sheer
size clashes with the rest of the coding style.

81dc4dba

restore single exit point in parse_xrefs · 91473d5f

Sven M. Hallberg authored 1 year ago

It turns out that this function was in fact meant to always assign a
result (NULL/0 on failure), accomplished by having a single exit point.
This was changed in 517b81ad for no reason. Reverting.

I'm guessing the goto was considered disagreeable, so I'll explain the
rationale. The function accumulates its result in the *local* variables
xrefs and n. This mainly makes the code nicer to read than writing to
the output directly. Having a single exit point, a property that is easy
to verify, ensures that no update to the local variables can get lost,
i.e. they serve as de-facto aliases for the outputs.

91473d5f

allow partial results from parse_xrefs · f844fce3

Sven M. Hallberg authored 1 year ago

This was intended to be a 'break' statement, returning any xref sections
parsed up to that point. Note that parse_xrefs() is *not* supposed to be
the validating parser proper for the document. It is a utility that is
needed before the actual parser can run, so if it returns partial data
in a best effort, that is fine.

f844fce3

remove erroneous comment · bc05e4c6

Sven M. Hallberg authored 1 year ago

As far as I can tell, this is not a case for SEV_DONTCARE. It's
(conceptually) a parse error.

bc05e4c6

comments · cbf3a1bb
Sven M. Hallberg authored 1 year ago

cbf3a1bb
expand a comment · ecb984e9
Sven M. Hallberg authored 1 year ago

ecb984e9

revert to shadowing a variable instead of overwriting it · 3de6efd6

Sven M. Hallberg authored 1 year ago

As indicated by the XXX, this line of code was never meant to be
considered as "proper", mainly because it allocates a new parser
that is never freed for every xref section in the file.

The reason that this line was introduced in place of the commented-out
original above it is that the latter does not bounds-check the offset
but h_seek() does. It was a quick and reliable way to make invalid
offsets fail the parse.

Anyway, creating a new HParser *p that incidentally shadows the
previous occurance was meant to signal that this is just a temporary
name that we need real quick (on the next line), not a "proper" variable,
and that it probably wasn't there to stay. So with that sense in mind, I
am putting it back.

Remove -Wshadow from CFLAGS.

3de6efd6

remove superfluous assignment · 83bdf6cd
Sven M. Hallberg authored 1 year ago
```
How did that get in there? The very next line overwrites it.
```
83bdf6cd

parse_xrefs: remove unconditional initialization of result · b606013c

Sven M. Hallberg authored 1 year ago

The purpose of parse_xrefs() is not to initialize the aux environment,
that is done in main. The design of the function is to fill a particular
part of that environment *if* it succeeds and not touch it otherwise.
Thus the late write to 'aux' at the end of the function.

b606013c

Mar 14, 2023

always create/truncate text extraction output file(s) · d2961dc2
Sven M. Hallberg authored 2 years ago

d2961dc2
comment · 3d043dac
Sven M. Hallberg authored 2 years ago

3d043dac

no need to check for presence of trailer dictionary · e1d957b9

Sven M. Hallberg authored 2 years ago

The parser does this for us, duh (rule xr_td): If there is an xref
section, it is composed of xref entries and a trailer dictionary.
We can use the H_INDEX() macro to access it and cast to the desired
token type, which also includes all appropriate assertions.

e1d957b9

process page content after main parser and only for text extraction · 76e546ce

Sven M. Hallberg authored 2 years ago

Factors this code out of parse_xrefs() where it never belonged, into a
new function process_page_content() that is called from main after the
main parse has succeeded and only if text extraction was requested, i.e.
-x or -X was passed on the command line.

Also adjusts the code for style and drops some related XXXs.

Fixes #49.

76e546ce

comments · 5ec93b40
Sven M. Hallberg authored 2 years ago

5ec93b40

only call parse_catalog if /Root entry is present and of expected type · 9ce4e0f3

Sven M. Hallberg authored 2 years ago

Fixes #42.

NB: The code in parse_xrefs() assumed the presence of a /Root entry in the
trailer dictionary. Such code arguably does not belong in that function,
whose purpose is to obtain the xref data, not to process the document. But
where ever it belongs, the assumption is incorrect, so this fixes the
immediate issue in place.

9ce4e0f3

Mar 01, 2023

bring the two loops in act_txtobj into alignment · 222781c8

Sven M. Hallberg authored 2 years ago

The function act_txtobj() performs two passes over the sequence of
text operators. The first pass tracks position and length of the
string. Then an array is allocated of the indicated length and a
second pass fills it with the actual characters.

The two passes must be close carbon copies of each other or a mismatch
between the predetermined length and the actual number of characters
produced might occur and cause the assertion "txtlen == idx" to fail.

This code structure is obviously bad, a text book example of why code
duplication should be avoided. Nevertheless, before rewriting it
entirely, this patch at least corrects an immediate bug.

Fixes #44.

222781c8

get rid of ts variable in act_txtobj · fbbf2c68

Sven M. Hallberg authored 2 years ago

The node member of TextState_T seems to be written only in one place in
parse_pagenode(), and in such a way that the assertion "ts->node == node"
must always hold. So we might as well assign node where it said ts->node
and get rid of ts completely.

fbbf2c68

access text state consistently through node · 76cb6955

Sven M. Hallberg authored 2 years ago

This one line going through another variable seems to spurious.
Note that ts is always equal to &node->ts.

76cb6955

add some defensive asserts in act_textobj · df03908f
Sven M. Hallberg authored 2 years ago

df03908f

avoid an assert in parse_fonts · dee6c415

Sven M. Hallberg authored 2 years ago

This is an assertion of the type that catches an error (in user-supplied
data) that should be handled, namely the case where the /Font entry
of a dictionary is expected to be itself a dictionary but isn't.
The code already contains a path for the case where the /Font entry
is missing (return false) and I suppose the same, including the
TODO item "figure out how to handle", might as well apply instead
of the assertion.

Fixes #45.

dee6c415

Feb 28, 2023
- validate that xref entry types cannot overflow our type field · bf2abc90
  Sven M. Hallberg authored 2 years ago
```
The grammar accepts uint64_t, but our field is an enum, i.e. int.
```
  bf2abc90
- print unknown xref stream entry types · c486ecbe
  Sven M. Hallberg authored 2 years ago
```
Instead of throwing an assertion failure.
Fixes #46.
```
  c486ecbe
Feb 27, 2023
- add missing newline in log message · 5d3897a4
  Sven M. Hallberg authored 2 years ago
  
  5d3897a4