TODO
Import of the TODO file in the repository
-
fix the object stream parser to split input at logical boundaries, as provided by the object index ("N pairs of integers") at the beginning of the stream data. #7
this follows discussion with peter wyatt where he initially said that the objects should be delimited by normal PDF token rules, but PDFA then came to the conclusion that, in fact, this was a mistake and the logical begin/end info should delimit things. i.e. if your index says that an object begins at offset 0 and ends at offset 3, followed by one that ends at 6, and the input is "123456", this parses as two numbers, 123 and 456.
currently the code follows the incorrect former approach, (re-)using the "elemr" parser that is otherwise used with arrays. the above example would parse as one element, the number 123456, in contradiction to the index (which we parse but ignore).
we have to explicitly walk the index, run our "obj" parser on each respective snippet of input, and wrap the results up in a parse result. we should also validate conditions on the index beforehand. these are thankfully sane (monotonic offsets etc.) and mentioned in the spec.
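a minimal sketch of the index walk described above, not the actual implementation. assumptions: the index has already been reduced to an array of byte offsets, each object ends where the next one begins (the last one at the end of the data), and offsets must be strictly increasing. `struct slice` and `split_objstm` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* hypothetical helper type: one snippet of the object stream data */
struct slice { const char *p; size_t len; };

/*
 * Split data[0..len) at the offsets given by the index.
 * offs[i] is where object i begins; it ends where object i+1 begins,
 * and the last object ends at the end of the data.
 * Returns 0 on success, -1 if the index is not sane
 * (assumed here: strictly increasing offsets, all within bounds).
 */
static int split_objstm(const char *data, size_t len,
                        const size_t *offs, size_t n, struct slice *out)
{
    assert(data != NULL && offs != NULL && out != NULL);

    for (size_t i = 0; i < n; i++) {
        size_t end = (i + 1 < n) ? offs[i + 1] : len;

        if (offs[i] >= end || end > len)
            return -1;          /* validate before handing anything out */
        out[i].p = data + offs[i];
        out[i].len = end - offs[i];
    }
    return 0;
}
```

with the input "123456" and offsets 0 and 3, this yields the two snippets "123" and "456"; the "obj" parser would then run on each snippet separately.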
-
move main routine(s) and filter implementation(s) into separate source files. e.g.:
- main.c: main function and helpers; starting from its include block
- pdf.c: parser proper; grammar and basic semantic actions
- filter.c: filters
- maybe another file just for xref or stream stuff? #8
-
refactor / clean up the (ascii) filter implementations. #9
-
rework VIOL to produce a "violation" token in the AST (via h_action). then, a validation (h_attr_bool) should let the parse fail if applicable (severity vs. strictness). non-fatal violations should be extracted and printed to stderr after the parse. #10
-
somehow rid VIOL() of the internal parser for getting at the severity parameter. this is, i guess, an artefact of h_action() taking a single void pointer of context, so it was not trivial to pass two arguments (message and severity) to the action. #10
-
(maybe?) change stream parsing to just stop at "endstream endobj" when /Length is indirect and the filter or postordinate parser doesn't delimit itself. this is not strictly to-spec, but probably an OK restriction to make in practice. a consistency check can be made against the length after all objects have been parsed. #17 #15
note: the current design aims to follow the spec to the letter in that the /Length entry of a stream determines its length, and nothing else. from this it follows that we must find and parse these lengths in "island style". thus, the current code is a hybrid of linear and island parsing. if the reliance on /Length can be broken, the island-based resolver can go and we can have a proper split between two separate parsers - one pure linear and one pure island.
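a rough sketch of the "stop at endstream endobj" idea, under stated assumptions: the whole remaining input is in memory, any run of PDF whitespace may separate the two keywords, and line-ending bytes between the declared length and the keyword are tolerated. `find_stream_end` and `length_consistent` are hypothetical names, not existing functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* PDF whitespace characters: NUL, HT, LF, FF, CR, SP */
static int is_ws(unsigned char c)
{
    return c == 0 || c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == ' ';
}

/*
 * Return the offset of the first "endstream" in buf[0..len) that is
 * followed (after optional whitespace) by "endobj", or (size_t)-1.
 */
static size_t find_stream_end(const char *buf, size_t len)
{
    static const char es[] = "endstream", eo[] = "endobj";
    const size_t nes = sizeof es - 1, neo = sizeof eo - 1;

    assert(buf != NULL);
    for (size_t i = 0; i + nes <= len; i++) {
        if (memcmp(buf + i, es, nes) != 0)
            continue;
        size_t j = i + nes;
        while (j < len && is_ws((unsigned char)buf[j]))
            j++;
        if (j + neo <= len && memcmp(buf + j, eo, neo) == 0)
            return i;
    }
    return (size_t)-1;
}

/*
 * Deferred check, once the indirect /Length has been resolved:
 * the declared length may fall short of the keyword only by
 * line-ending bytes (CR and/or LF).
 */
static int length_consistent(const char *buf, size_t found, size_t declared)
{
    if (declared > found)
        return 0;
    for (size_t i = declared; i < found; i++)
        if (buf[i] != '\r' && buf[i] != '\n')
            return 0;
    return 1;
}
```

the restriction this makes is visible in the loop: stream data that itself contains "endstream" followed by "endobj" would be cut short, which is where the later check against /Length comes in.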
-
parse and print content streams. #11
- Note: text extraction already parses some content streams (in text objects). Merging the two passes and extracting the text content by querying the AST should deduplicate some code - Pomp
-
parse/validate additional stream types/filters (images...). #12
-
consider reviving the effort to get "obj" to parse with LALR. the messy grammar for arrays with "elemd", "elemr", etc. still stems from that project, as does the explicit handling of whitespace -- note that TOK() is only used in KW() and that no instances of KW() remain under "obj". #13 (closed)
alternatively, consider fully reverting the grammar to its clearer PEG form. i would probably keep the explicit whitespace, though.
what stopped me before was the difficulty of resolving some things without precedence rules; specifically line endings in string literals. is CR LF a "crlf" or a "cr" followed by an "lf"? LALR cannot decide unless you encode that anything following a "cr" doesn't start with LF. string literals are currently defined differently. the best way to do it, AFAICS, would be to match (in string literals) all subsequent line endings in one nonterminal and to encode there that a plain "cr" is never followed by "lf".
FWIW, the motivation for LALR parsing of "obj" was the prospect of parsing an object stream incrementally, as chunks come in from the decompressor (or an arbitrary filter chain).
NB: the reason why we must distinguish "crlf" from "cr" "lf" at all is of course that in a string literal, the former means "\n" and the latter means "\n\n".
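to make the intended semantics concrete, here is a hand-written sketch (not the grammar itself) of how line endings inside a string literal are meant to come out: CR LF is matched greedily as one "crlf", so a "cr" counts on its own only when no LF follows. `normalize_eol` is a hypothetical name and escape sequences are ignored.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Copy in[0..n) to out, replacing each line ending (CR LF, lone CR,
 * or lone LF) with a single '\n'. out must have room for n bytes.
 * Returns the number of bytes written.
 */
static size_t normalize_eol(const char *in, size_t n, char *out)
{
    size_t w = 0;

    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        if (in[i] == '\r') {
            if (i + 1 < n && in[i + 1] == '\n')
                i++;            /* greedy: CR LF is one "crlf" */
            out[w++] = '\n';
        } else {
            out[w++] = in[i];   /* a lone LF is already '\n' */
        }
    }
    return w;
}
```

so the bytes `a CR LF b` come out as "a\nb", while `a LF CR b` comes out as "a\n\nb".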
-
implement random-access ("island") parser (walking objects from /Root). i'm not sure how much we need to know about the "DOM" for this. maybe nothing? since everything is built out of basic objects and we can just blindly follow references? #14
- Note: if I recall, text extraction uses the page catalog for finding text objects to some extent - Pomp
-
check linear and random-access parses for consistency.
-
replace disparate parsing routines (applied to different pieces of input) with one big HParser that uses h_seek() to move around. this will enable packrat to cache, for instance, the xref tables instead of us parsing them once to resolve references and again as part of the linear parse. #16
-
parse stream objects without reference to their /Length entry by simply trying all possible ways and consistency-checking them against the xref table in the end, via h_attr_bool(). #17 XXX is this actually possible (without unreasonable complications)?
-
investigate memory use on big documents (millions of objects).
- Note: this effort spawned the debug scripts for GDB. I think it already allows gathering individual HParser-level memory statistics - Pomp
-
make custom token types for all appropriate parts of the parse result so that they can be properly distinguished in the output. #18
-
include position information, at least for objects, in the (JSON) output. #19