- Apr 14, 2023
-
-
Sven M. Hallberg authored
If we are actually processing page content, that is.
-
Sven M. Hallberg authored
A mistake snuck into commit 76e546ce, taking the last element of the xrefs array as the "last" trailer section. But the array is filled in reverse order by following the chain of startxref and /Prev pointers, so the (logical) last/latest section is xrefs[0].
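To illustrate the ordering issue, here is a minimal sketch, not the project's code. It assumes a simplified model in which each section is identified by an index and a hypothetical table prev_of[] stands in for the /Prev pointers; the names collect_chain and NO_PREV are invented for the example. Walking the chain from the section that startxref points to fills the array newest-first, so the latest section ends up at index 0.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical marker for "this section has no /Prev entry". */
#define NO_PREV ((size_t)-1)

/*
 * Follow the chain starting at 'start' (the section startxref points to)
 * through the prev_of[] links, recording each visited section in out[].
 * Returns the number of sections recorded. The outmax limit also stops
 * the walk if the links form a cycle.
 */
size_t collect_chain(const size_t *prev_of, size_t nsections,
                     size_t start, size_t *out, size_t outmax)
{
    size_t n = 0;
    size_t cur = start;

    while (cur != NO_PREV && n < outmax) {
        assert(cur < nsections);  /* the caller's table must be consistent */
        out[n++] = cur;           /* newest first: out[0] is the latest */
        cur = prev_of[cur];
    }
    return n;
}
```

With three sections where section 2 is the newest, the result is {2, 1, 0}: the logically last section is out[0], and out[n-1] is the oldest one.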
-
- Apr 13, 2023
-
-
Sven M. Hallberg authored
Since HBytes is a length/pointer pair and not a null-terminated string, we must pass the length as an argument to printf. The correct format specifier for that is "%.*s" (string with "precision" = length), not "%*s" (string with minimum field width).
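The difference between the two specifiers can be sketched as follows. The Bytes struct is a hypothetical stand-in for a length/pointer pair such as HBytes, and format_bytes is a name made up for this example. Note that the precision argument of "%.*s" must be an int, hence the cast.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a length/pointer pair (not null-terminated). */
typedef struct {
    const unsigned char *token;
    size_t len;
} Bytes;

/*
 * Format exactly b.len bytes of b.token into out.
 * Returns what snprintf returns (the number of characters that would
 * have been written, excluding the terminating null byte).
 */
int format_bytes(char *out, size_t outsz, Bytes b)
{
    assert(b.len <= INT_MAX);   /* the precision is passed as an int */

    /* "%.*s": precision = maximum number of bytes read from the string.
     * "%*s" would only set a minimum field width and keep reading until
     * a null byte, which a length/pointer pair does not guarantee. */
    return snprintf(out, outsz, "%.*s", (int)b.len, (const char *)b.token);
}
```

Given a Bytes value that covers only the first four bytes of "xrefstream", the function yields "xref" and never reads past the stated length.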
-
Sven M. Hallberg authored
Forgotten in b3dda3fe when adding the input file name to error messages.
-
- Mar 30, 2023
-
-
Sven M. Hallberg authored
Finished reviewing past modifications to parse_xrefs(). NB: All code attributed to Sumit Ray has been removed from this function.
-
Sven M. Hallberg authored
Improve on the bugfix in commit a5abf1e2:
- Reinstate the assert for 'res->ast != NULL'. If it fails, there is a bug in the parser, not an error in the input file.
- Provide a distinct error message for the case where p_xref fails on a cross-reference stream because of invalid data.
- Only skip storing the invalid section. Try to follow the /Prev entry in the stream dictionary to find more sections.
-
Sven M. Hallberg authored
I cannot tell what this refers to. The (nonexistent) else case of the if statement above it is simply the case of the object number in question not falling within this subsection. Anyway, the function lookup_xref() is a low-level utility used during parsing, not a place to produce error messages.
-
Sven M. Hallberg authored
HParseResult was introduced in 6b54ebfa (generally parse stream objects) to hold the result of parsing the stream data, including the application of any filters. This is produced in act_ks_value(). The fact that parse errors in stream data are thus detectable is in fact significant for xref stream processing, so we should not just return the bare data on error.
-
Sven M. Hallberg authored
-
Sven M. Hallberg authored
While it might seem like a good idea to "grade" errors by severity, we are not *really* in any place to do so accurately. Our tasks are (a) to decide, internally, whether to print a message or silently ignore a malformation, and (b) to ultimately judge the file valid or invalid as a whole. Note that the latter part, as stated before, is not the responsibility of parse_xrefs().

Reinstate the input file name in these error messages. That information is useful when running the program on multiple files from a script, as we have been doing.

While we're at it, fix style (line lengths).
-
Sven M. Hallberg authored
Both currently fail because the parser proper does not validate these offsets.
-
Sven M. Hallberg authored
Now that we are validating the offset ourselves, we no longer need h_seek() to do our bounds checking. But add a defensive assert just in case.
-
Sven M. Hallberg authored
Mirrors the check for startxref. I considered unifying the two into one test at the start of the loop, but then we would lose the information whether we got the offset from startxref or a /Prev.
-
Sven M. Hallberg authored
This is useful information, especially in hex, when looking into the file. The invalid value itself, on the other hand, is not so useful.
-
Sven M. Hallberg authored
The correct and standard format specifier for values of type size_t is %zu. There is no need to point out the valid bounds. Match style with the other messages.
-
Sven M. Hallberg authored
The offset can never be negative (size_t is unsigned). And this treated offset = 0 as out of bounds, which is nonsense. In fact, offset == size is also not invalid (it is the end of file).
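The bounds rule and the %zu specifier from the last few messages can be sketched together. This is only an illustration under assumptions: the function names and the wording of the message are made up here and are not taken from the program.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * An offset into an input of 'size' bytes is acceptable if it lies
 * anywhere from 0 up to and including size (the end-of-file position).
 * No test for "negative" is needed: size_t is unsigned.
 */
bool offset_in_bounds(size_t offset, size_t size)
{
    return offset <= size;
}

/*
 * Format a (hypothetical) error message that names the input file and
 * prints the offset with the standard size_t specifiers, in decimal
 * (%zu) and hexadecimal (%zx).
 */
int format_offset_error(char *out, size_t outsz,
                        const char *infile, size_t offset)
{
    assert(infile != NULL);
    return snprintf(out, outsz, "%s: xref offset %zu (0x%zx) out of bounds",
                    infile, offset, offset);
}
```

For a 100-byte input, offsets 0 and 100 pass the check and 101 does not.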
-
- Mar 28, 2023
-
-
Sven M. Hallberg authored
Passing the aux struct by reference may look cleaner, but it was deliberate to keep parse_xrefs() independent of that struct, since the latter is conceptually part of the parser's interface and the former is not. Also, this way parse_xrefs() has a proper return value that signals success or failure. Plus, no ugly indirection or temporary variable is needed to access sz.
-
Sven M. Hallberg authored
Move parse_xrefs() back in its proper place as a helper to main(), including the definition of the global variable 'infile' with the rest of the command line arguments. It had been moved in fbbe953f when the content processing code was confusedly hooked into the function. Also removes marker comments about "Start/End xref parsing". The code between them is not exclusively concerned with xrefs and their sheer size clashes with the rest of the coding style.
-
Sven M. Hallberg authored
It turns out that this function was in fact meant to always assign a result (NULL/0 on failure), accomplished by having a single exit point. This was changed in 517b81ad for no reason. Reverting.

I'm guessing the goto was considered disagreeable, so I'll explain the rationale. The function accumulates its result in the *local* variables xrefs and n. This mainly makes the code nicer to read than writing to the output directly. Having a single exit point, a property that is easy to verify, ensures that no update to the local variables can get lost, i.e. they serve as de-facto aliases for the outputs.
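The pattern described here can be sketched in isolation. The following is a generic illustration, not parse_xrefs() itself: the function collect_until_negative and its stopping rule are invented for the example. What it shows is the shape of the code: results accumulate in locals, every path leaves through the one label, and the outputs are written exactly once.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Copy the leading non-negative values of in[0..len) into a freshly
 * allocated array. Always assigns *out and *nout (NULL/0 if nothing was
 * collected or allocation failed). Stops at the first negative value
 * and returns whatever was collected up to that point.
 */
void collect_until_negative(const int *in, size_t len,
                            int **out, size_t *nout)
{
    int *xs = NULL;   /* local accumulators, de-facto aliases ... */
    size_t n = 0;     /* ... for the outputs *out and *nout       */

    assert(out != NULL && nout != NULL);

    for (size_t i = 0; i < len; i++) {
        if (in[i] < 0)
            goto end;             /* best effort: keep the partial result */

        int *tmp = realloc(xs, (n + 1) * sizeof *xs);
        if (tmp == NULL) {        /* allocation failure: report NULL/0 */
            free(xs);
            xs = NULL;
            n = 0;
            goto end;
        }
        xs = tmp;
        xs[n++] = in[i];
    }

end:
    /* single exit point: the only place where the outputs are written */
    *out = xs;
    *nout = n;
}
```

Because there is no early return, it is enough to inspect the two assignments after the label to see that the caller always receives a defined result.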
-
Sven M. Hallberg authored
This was intended to be a 'break' statement, returning any xref sections parsed up to that point. Note that parse_xrefs() is *not* supposed to be the validating parser proper for the document. It is a utility that is needed before the actual parser can run, so if it returns partial data in a best effort, that is fine.
-
Sven M. Hallberg authored
As far as I can tell, this is not a case for SEV_DONTCARE. It's (conceptually) a parse error.
-
Sven M. Hallberg authored
-
Sven M. Hallberg authored
-
Sven M. Hallberg authored
As indicated by the XXX, this line of code was never meant to be considered "proper", mainly because it allocates, for every xref section in the file, a new parser that is never freed. The reason this line was introduced in place of the commented-out original above it is that the latter does not bounds-check the offset but h_seek() does. It was a quick and reliable way to make invalid offsets fail the parse. Anyway, creating a new HParser *p that incidentally shadows the previous occurrence was meant to signal that this is just a temporary name that we need real quick (on the next line), not a "proper" variable, and that it probably wasn't there to stay. So with that sense in mind, I am putting it back. Remove -Wshadow from CFLAGS.
-
Sven M. Hallberg authored
How did that get in there? The very next line overwrites it.
-
Sven M. Hallberg authored
The purpose of parse_xrefs() is not to initialize the aux environment, that is done in main. The design of the function is to fill a particular part of that environment *if* it succeeds and not touch it otherwise. Thus the late write to 'aux' at the end of the function.
-
- Mar 14, 2023
-
-
Sven M. Hallberg authored
-
Sven M. Hallberg authored
-
Sven M. Hallberg authored
The parser does this for us, duh (rule xr_td): If there is an xref section, it is composed of xref entries and a trailer dictionary. We can use the H_INDEX() macro to access it and cast to the desired token type, which also includes all appropriate assertions.
-
Sven M. Hallberg authored
Factors this code out of parse_xrefs() where it never belonged, into a new function process_page_content() that is called from main after the main parse has succeeded and only if text extraction was requested, i.e. -x or -X was passed on the command line. Also adjusts the code for style and drops some related XXXs. Fixes #49.
-
Sven M. Hallberg authored
-
Sven M. Hallberg authored
Fixes #42. NB: The code in parse_xrefs() assumed the presence of a /Root entry in the trailer dictionary. Such code arguably does not belong in that function, whose purpose is to obtain the xref data, not to process the document. But wherever it belongs, the assumption is incorrect, so this fixes the immediate issue in place.
-
- Mar 01, 2023
-
-
Sven M. Hallberg authored
The function act_txtobj() performs two passes over the sequence of text operators. The first pass tracks position and length of the string. Then an array is allocated of the indicated length and a second pass fills it with the actual characters. The two passes must be close carbon copies of each other or a mismatch between the predetermined length and the actual number of characters produced might occur and cause the assertion "txtlen == idx" to fail. This code structure is obviously bad, a textbook example of why code duplication should be avoided. Nevertheless, before rewriting it entirely, this patch at least corrects an immediate bug. Fixes #44.
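One way to keep a count pass and a fill pass from drifting apart is to run both through the same routine. The sketch below is not act_txtobj() and not the eventual rewrite; it is a generic illustration that concatenates plain C strings, with the names emit and concat invented for the example. The variable names txtlen and idx are borrowed from the assertion quoted above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Walk the input strings once. If out is NULL, only count the
 * characters; otherwise also copy them to out. Returns the total
 * number of characters either way.
 */
static size_t emit(const char *const *strs, size_t nstrs, char *out)
{
    size_t idx = 0;

    for (size_t i = 0; i < nstrs; i++) {
        size_t len = strlen(strs[i]);
        if (out != NULL)
            memcpy(out + idx, strs[i], len);
        idx += len;
    }
    return idx;
}

/*
 * Two passes over the same data, both through emit(): the first
 * determines the length, the second fills the allocated buffer.
 * Returns a null-terminated string (caller frees) or NULL on
 * allocation failure.
 */
char *concat(const char *const *strs, size_t nstrs, size_t *lenp)
{
    size_t txtlen = emit(strs, nstrs, NULL);    /* pass 1: count */
    char *buf = malloc(txtlen + 1);
    if (buf == NULL)
        return NULL;

    size_t idx = emit(strs, nstrs, buf);        /* pass 2: fill */
    assert(txtlen == idx);  /* holds by construction: same code path */
    buf[idx] = '\0';
    *lenp = idx;
    return buf;
}
```

Since both passes execute the same loop, a change to how characters are produced cannot affect one pass without the other.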
-
Sven M. Hallberg authored
The node member of TextState_T seems to be written only in one place in parse_pagenode(), and in such a way that the assertion "ts->node == node" must always hold. So we might as well assign node where it said ts->node and get rid of ts completely.
-
Sven M. Hallberg authored
This one line going through another variable seems spurious. Note that ts is always equal to &node->ts.
-
Sven M. Hallberg authored
-
Sven M. Hallberg authored
This is an assertion of the type that catches an error (in user-supplied data) that should be handled, namely the case where the /Font entry of a dictionary is expected to be itself a dictionary but isn't. The code already contains a path for the case where the /Font entry is missing (return false) and I suppose the same, including the TODO item "figure out how to handle", might as well apply instead of the assertion. Fixes #45.
-
- Feb 28, 2023
-
-
Sven M. Hallberg authored
The grammar accepts uint64_t, but our field is an enum, i.e. int.
-
Sven M. Hallberg authored
Instead of throwing an assertion failure. Fixes #46.
-
- Feb 27, 2023
-
-
Sven M. Hallberg authored
-