- Mar 28, 2023
-
-
Sven M. Hallberg authored
-
Sven M. Hallberg authored
-
Sven M. Hallberg authored
As indicated by the XXX, this line of code was never meant to be considered "proper", mainly because, for every xref section in the file, it allocates a new parser that is never freed. The reason that this line was introduced in place of the commented-out original above it is that the latter does not bounds-check the offset but h_seek() does. It was a quick and reliable way to make invalid offsets fail the parse. Anyway, creating a new HParser *p that incidentally shadows the previous occurrence was meant to signal that this is just a temporary name that we need real quick (on the next line), not a "proper" variable, and that it probably wasn't there to stay. So with that sense in mind, I am putting it back. Remove -Wshadow from CFLAGS.
-
Sven M. Hallberg authored
How did that get in there? The very next line overwrites it.
-
Sven M. Hallberg authored
The purpose of parse_xrefs() is not to initialize the aux environment, that is done in main. The design of the function is to fill a particular part of that environment *if* it succeeds and not touch it otherwise. Thus the late write to 'aux' at the end of the function.
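A minimal sketch of the "late write" design described above, under stated assumptions: the Env struct, fill_xrefs(), and the filesize bound are hypothetical stand-ins for illustration, not the actual aux environment or parse_xrefs(). All work happens on a temporary, and the caller's environment is written only once success is certain.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the aux environment; not the real struct. */
typedef struct {
    size_t *xrefs;   /* offsets of the xref sections */
    size_t  nxrefs;
} Env;

/*
 * Collect the n offsets into a fresh array. On any failure, return false
 * and leave *env exactly as the caller set it up.
 */
bool fill_xrefs(Env *env, const size_t *offsets, size_t n, size_t filesize)
{
    size_t *tmp = malloc((n ? n : 1) * sizeof *tmp);
    if (tmp == NULL)
        return false;

    for (size_t i = 0; i < n; i++) {
        if (offsets[i] >= filesize) {   /* invalid offset: fail the parse */
            free(tmp);
            return false;
        }
        tmp[i] = offsets[i];
    }

    /* Success: the only writes to the environment happen here. */
    free(env->xrefs);
    env->xrefs = tmp;
    env->nxrefs = n;
    return true;
}
```

With this shape, the caller (main, in the commit's terms) stays responsible for initializing the environment, and a failed call has no side effects to undo.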
-
- Mar 14, 2023
-
-
Sven M. Hallberg authored
-
Sven M. Hallberg authored
-
Sven M. Hallberg authored
The parser does this for us, duh (rule xr_td): If there is an xref section, it is composed of xref entries and a trailer dictionary. We can use the H_INDEX() macro to access it and cast to the desired token type, which also includes all appropriate assertions.
-
Sven M. Hallberg authored
Factors this code out of parse_xrefs() where it never belonged, into a new function process_page_content() that is called from main after the main parse has succeeded and only if text extraction was requested, i.e. -x or -X was passed on the command line. Also adjusts the code for style and drops some related XXXs. Fixes #49.
-
Sven M. Hallberg authored
-
Sven M. Hallberg authored
Fixes #42. NB: The code in parse_xrefs() assumed the presence of a /Root entry in the trailer dictionary. Such code arguably does not belong in that function, whose purpose is to obtain the xref data, not to process the document. But wherever it belongs, the assumption is incorrect, so this fixes the immediate issue in place.
-
- Mar 01, 2023
-
-
Sven M. Hallberg authored
The function act_txtobj() performs two passes over the sequence of text operators. The first pass tracks position and length of the string. Then an array of the indicated length is allocated and a second pass fills it with the actual characters. The two passes must be close carbon copies of each other, or a mismatch between the predetermined length and the actual number of characters produced might occur and cause the assertion "txtlen == idx" to fail. This code structure is obviously bad, a textbook example of why code duplication should be avoided. Nevertheless, before rewriting it entirely, this patch at least corrects an immediate bug. Fixes #44.
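One way to keep a measure-then-fill structure without duplicating the loop is to run the same walker twice. The following is only a sketch under simplifying assumptions: TextOp, walk_ops(), and collect_text() are hypothetical names, and real text operators carry far more state than a string fragment. It is not the eventual rewrite of act_txtobj().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified text operator: just carries a string fragment. */
typedef struct {
    const char *text;   /* NULL for operators that produce no text */
} TextOp;

/*
 * Walk the operators once. With out == NULL only the length is computed;
 * otherwise the characters are also copied to out. Returns the number of
 * characters produced.
 */
static size_t walk_ops(const TextOp *ops, size_t nops, char *out)
{
    size_t idx = 0;

    for (size_t i = 0; i < nops; i++) {
        if (ops[i].text == NULL)
            continue;
        size_t len = strlen(ops[i].text);
        if (out != NULL)
            memcpy(out + idx, ops[i].text, len);
        idx += len;
    }
    return idx;
}

/* Returns a malloc'd, NUL-terminated string, or NULL if allocation fails. */
char *collect_text(const TextOp *ops, size_t nops)
{
    size_t txtlen = walk_ops(ops, nops, NULL);   /* pass 1: measure */
    char *buf = malloc(txtlen + 1);
    if (buf == NULL)
        return NULL;

    size_t idx = walk_ops(ops, nops, buf);       /* pass 2: fill */
    assert(txtlen == idx);   /* same code ran both times */
    buf[idx] = '\0';
    return buf;
}
```

Because both passes execute the identical loop, the two counts cannot drift apart the way two hand-maintained copies can.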
-
Sven M. Hallberg authored
The node member of TextState_T seems to be written only in one place in parse_pagenode(), and in such a way that the assertion "ts->node == node" must always hold. So we might as well assign node where it said ts->node and get rid of ts completely.
-
Sven M. Hallberg authored
This one line going through another variable seems to be spurious. Note that ts is always equal to &node->ts.
-
Sven M. Hallberg authored
-
Sven M. Hallberg authored
This is an assertion of the type that catches an error (in user-supplied data) that should be handled, namely the case where the /Font entry of a dictionary is expected to be itself a dictionary but isn't. The code already contains a path for the case where the /Font entry is missing (return false) and I suppose the same, including the TODO item "figure out how to handle", might as well apply instead of the assertion. Fixes #45.
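To illustrate the distinction, here is a small sketch in which a value of the wrong type takes the same failure path as a missing one. The Obj type and font_entry_usable() are hypothetical and much simpler than the real token and dictionary types.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, minimal object model for illustration only. */
typedef enum { OBJ_NULL, OBJ_INT, OBJ_DICT } ObjType;

typedef struct {
    ObjType type;
} Obj;

/*
 * font is the looked-up /Font entry, or NULL if the entry is missing.
 * Returns true only if the entry exists and is a dictionary.
 */
bool font_entry_usable(const Obj *font)
{
    if (font == NULL)
        return false;   /* entry missing: existing failure path */
    if (font->type != OBJ_DICT)
        return false;   /* wrong type: bad input data, not a program bug */
    return true;
}
```

The point of the design is that assertions are reserved for conditions the program itself guarantees, while anything that depends on the input file gets an ordinary error return.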
-
- Feb 28, 2023
-
-
Sven M. Hallberg authored
The grammar accepts uint64_t, but our field is an enum, i.e. int.
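A sketch of the kind of range check this mismatch calls for, assuming a hypothetical enum Mode and helper mode_from_u64(); the actual field and its enumerators are not shown here. The value is compared while it is still 64 bits wide, so a large number cannot wrap into a valid enumerator when narrowed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical enum standing in for the real field's type. */
typedef enum { MODE_A = 0, MODE_B = 1, MODE_C = 2 } Mode;

/*
 * Convert a parsed integer to Mode. Returns false and leaves *out
 * untouched if the value does not name an enumerator.
 */
bool mode_from_u64(uint64_t v, Mode *out)
{
    if (v > (uint64_t)MODE_C)   /* check before narrowing to int */
        return false;
    *out = (Mode)v;
    return true;
}
```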
-
Sven M. Hallberg authored
Instead of throwing an assertion failure. Fixes #46.
-
- Feb 27, 2023
-
-
Sven M. Hallberg authored
-
Sven M. Hallberg authored
The validation compares the number of elements in the object and index sequences. The number of index entries is fixed to the number N given in the stream dictionary, cf. p_objstm__m(). Fixes #48.
-
Sven M. Hallberg authored
Fixes #47.
-
- Feb 17, 2023
-
-
Sven M. Hallberg authored
-
- Jan 13, 2023
-
-
Sven M. Hallberg authored
Since we can parse incrementally with packrat now, we can do what is already suggested in the TODO file and abandon this item. Closes #13.
-
Sven M. Hallberg authored
If we are using packrat, we can use objs = MANY_WS(obj) for the object streams as well.
-
Sven M. Hallberg authored
This reverts the part of commit f7dbb2ac that reworked the definition of arrays into explicit grammar recursion in order to make 'obj' compile with LALR. That project never came to fruition and with packrat it causes a recursive function call for every array element, exhausting the stack with large arrays. Fixes #26. This does not yet remove the explicitly recursive rules elemd and elemr because the latter is still used by the object stream parser.
-
Sven M. Hallberg authored
-
- Jan 06, 2023
-
-
Sven M. Hallberg authored
Moves program invocation details from README to pdf.1.mdoc. Includes the generated ASCII output for convenience. Make sure to regenerate with 'make doc' after changing the mdoc source.
-
Sven M. Hallberg authored
-
Sven M. Hallberg authored
-
- Jan 05, 2023
-
-
Sven M. Hallberg authored
-
Sven M. Hallberg authored
-
- Dec 21, 2022
-
-
Sven M. Hallberg authored
The code in pdf.c actually does this already, but there is no reason not to be defensive here. Just for completeness' sake: There is nothing theoretically wrong with having even "earlier changes" (earlychange > 1), but we don't want that.
-
Sven M. Hallberg authored
Returning an empty HBytes was an artefact to satisfy the earlier structure of the grammar and is no longer necessary.
-
Sven M. Hallberg authored
-
Sven M. Hallberg authored
Oh. My. God.
-
- Dec 20, 2022
-
-
Sven M. Hallberg authored
-
Sven M. Hallberg authored
This replaces the validations on code9 etc. with one continuation that picks the appropriate parser. Also relaxes the parser to allow further output codes after the table is full. Looking at the spec, it seems to me at this time that the requirement for a clear code when the table is full is a requirement on producers of PDF files, but not on the file format itself. As far as I understand, conforming files can be created by a non-conforming process. Note: The implementation uses a slight trick to handle the last code (4095) correctly. Quoting the comment in act_output(): Rather than going through the effort of ensuring that the last code is only updated once, we simply assign one more code as a dummy. So, the table is now 4097 entries in actual size. The last one will receive a bogus update every cycle, so that the last real code does not. This is less work than actually detecting and avoiding the bogus updates.
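The dummy-entry trick can be sketched roughly as follows. This is a simplified illustration, not the code in act_output(): the Entry and LZW types and lzw_add() are hypothetical, and the real decoder's update step is more involved. The idea shown is only that an extra slot at index 4096 absorbs every write that arrives once the table is full.

```c
#define LZW_MAX_CODES 4096   /* real codes are 0..4095 */

/* Hypothetical table entry: prefix code plus one appended byte. */
typedef struct {
    int           prefix;
    unsigned char last;
} Entry;

typedef struct {
    Entry table[LZW_MAX_CODES + 1];   /* +1: dummy slot at index 4096 */
    int   next;                       /* index the next write goes to */
} LZW;

/*
 * Record an entry at the current position. Once next has reached 4096
 * it stays there, so all further writes land in the dummy slot and
 * code 4095 is never overwritten.
 */
void lzw_add(LZW *st, int prefix, unsigned char last)
{
    st->table[st->next].prefix = prefix;
    st->table[st->next].last = last;
    if (st->next < LZW_MAX_CODES)
        st->next++;
}
```

The cost is one unused table entry; the benefit is that the write path needs no special case for a full table.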
-
- Dec 19, 2022
-
-
Sven M. Hallberg authored
Since we don't expose the struct (any more), we might as well pick a simpler name for it.
-
Sven M. Hallberg authored
-
Sven M. Hallberg authored
-