progressive.txt 13 KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281
  1. How to do progressive loading with MuPDF.
  2. =========================================
  3. What is progressive loading?
  4. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  5. The idea of progressive loading is that as you download a PDF file
  6. into a browser, you can display the pages as they appear.
  7. MuPDF can make use of 2 different mechanisms to achieve this. The
  8. first relies on the file being "linearized", the second relies on
  9. the caller of MuPDF having fine control over the http fetch and on
  10. the server supporting byte-range fetches.
  11. For optimum performance a file should be both linearized and be
  12. available over a byte-range supporting link, but benefits can still
  13. be had with either one of these alone.
  14. Progressive download using "linearized" files
  15. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  16. Adobe defines "linearized" PDFs as being ones that have both a
  17. specific layout of objects and a small amount of extra
  18. information to help avoid seeking within a file. The stated aim
  19. is to deliver the first page of a document in advance of the whole
  20. document downloading, whereupon subsequent pages will become
  21. available. Adobe also refers to these as "Optimized for fast web
  22. view" or "Web Optimized".
  23. In fact, the standard outlines (poorly) a mechanism by which 'hints'
  24. can be included that enable the subsequent pages to be found within
  25. the file too. Unfortunately this is very poorly supported with
  26. many tools, and so the hints have to be treated with suspicion.
  27. MuPDF will attempt to use hints if they are available, but will also
  28. use a linear search of the file to discover pages if not. This means
  29. that the first page will be displayed quickly, and then subsequent
  30. ones will appear with 'incomplete' renderings that improve over time
  31. as more and more resources are gradually delivered.
  32. Essentially the file starts with a slightly modified header, and the
  33. first object in the file is a special one (the linearization object)
  34. that a) indicates that the file is linearized, and b) gives some
  35. useful information (like the number of pages in the file etc).
  36. This object is then followed by all the objects required for the
  37. first page, then the "hint stream", then sets of object for each
  38. subsequent page in turn, then shared objects required for those
  39. pages, then various other random things.
  40. [Yes, really. While page 1 is sent with all the objects that it
  41. uses, shared or otherwise, subsequent pages do not get shared
  42. resources until after all the unshared page objects have been
  43. sent.]
  44. The Hint Stream
  45. ~~~~~~~~~~~~~~~
  46. Adobe intended Hint Stream to be useful to facilitate the display
  47. of subsequent pages, but it has never used it. Consequently you
  48. can't trust people to write it properly - indeed Adobe outputs
  49. something that doesn't quite conform to the spec.
  50. Consequently very few people actually use it. MuPDF will use it
  51. after sanity checking the values, and should cope with illegal/
  52. incorrect values.
  53. So how does MuPDF handle progressive loading?
  54. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  55. MuPDF has made various extensions to its mechanisms for handling
  56. progressive loading.
  57. + Progressive streams
  58. At its lowest level MuPDF reads file data from a fz_stream,
  59. using the fz_open_document_with_stream call. (fz_open_document
  60. is implemented by calling this). We have extended the fz_stream
  61. slightly, giving the system a way to ask for meta information
  62. (or perform meta operations) on a stream.
  63. Using this mechanism MuPDF can query:
  64. + whether a stream is progressive or not (i.e. whether the
  65. entire stream is accessible immediately)
  66. + what the length of a stream should ultimately be (which an
  67. http fetcher should know from the Content-Length header),
  68. With this information MuPDF can decide whether to use its normal
  69. object reading code, or whether to make use of a linearized
  70. object. Knowing the length enables us to check with the length
  71. value given in the linearized object - if these differ, the
  72. assumption is that an incremental save has taken place, thus the
  73. file is no longer linearized.
  74. When data is pulled from a progressive stream, if we attempt to
  75. read data that is not currently available, the stream should
  76. throw a FZ_ERROR_TRYLATER error. This particular error code
  77. will be interpreted by the caller as an indication that it
  78. should retry the parsing of the current objects at a later time.
  79. When a MuPDF call is made on a progressive stream, such as
  80. fz_open_document_with_stream, or fz_load_page, the caller should
  81. be prepared to handle a FZ_ERROR_TRYLATER error as meaning that
  82. more data is required before it can continue. No indication is
  83. directly given as to exactly how much more data is required, but
  84. as the caller will be implementing the progressive fz_stream
  85. that it has passed into MuPDF to start with, it can reasonably
  86. be expected to figure out an estimate for itself.
  87. + Cookie
  88. Once a page has been loaded, if its contents are to be 'run'
  89. as normal (using e.g. fz_run_page) any error (such as failing
  90. to read a font, or an image, or even a content stream belonging
  91. to the page) will result in a rendering that aborts with an
  92. FZ_ERROR_TRYLATER error. The caller can catch this and display
  93. a placeholder instead.
  94. If each pages data was entirely self-contained and sent in
  95. sequence this would perhaps be acceptable, with each page
  96. appearing one after the other. Unfortunately, the linearization
  97. procedure as laid down by Adobe does NOT do this: objects shared
  98. between multiple pages (other than the first) are not sent with
  99. the pages themselves, but rather AFTER all the pages have been
  100. sent.
  101. This means that a document that has a title page, then contents
  102. that share a font used on pages 2 onwards, will not be able to
  103. correctly display page 2 until after the font has arrived in
  104. the file, which will not be until all the page data has been
  105. sent.
  106. To mitigate against this, MuPDF provides a way whereby callers
  107. can indicate that they are prepared to accept an 'incomplete'
  108. rendering of the file (perhaps with missing images, or with
  109. substitute fonts).
  110. Callers prepared to tolerate such renderings should set the
  111. 'incomplete_ok' flag in the cookie, then call fz_run_page etc
  112. as normal. If a FZ_ERROR_TRYLATER error is thrown at any point
  113. during the page rendering, the error will be swallowed, the
  114. 'incomplete' field in the cookie will become non-zero and
  115. rendering will continue. When control returns to the caller
  116. the caller can check the value of the 'incomplete' field and
  117. know that the rendering it received is not authoritative.
  118. Progressive loading using byte range requests
  119. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  120. If the caller has control over the http fetch, then it is possible
  121. to use byte range requests to fetch the document 'out of order'.
  122. This enables non-linearized files to be progressively displayed as
  123. they download, and fetches complete renderings of pages earlier than
  124. would otherwise be the case. This process requires no changes within
  125. MuPDF itself, but rather in the way the progressive stream learns
  126. from the attempts MuPDF makes to fetch data.
  127. Consider for example, an attempt to fetch a hypothetical file from
  128. a server.
  129. + The initial http request for the document is sent with a "Range:"
  130. header to pull down the first (say) 4k of the file.
  131. + As soon as we get the header in from this initial request, we can
  132. respond to meta stream operations to give the length, and whether
  133. byte requests are accepted.
  134. - If the header indicates that byte ranges are acceptable the
  135. stream proceeds to go into a loop fetching chunks of the file
  136. at a time (not necessarily in-order). Otherwise the server
  137. will ignore the Range: header, and just serve the whole file.
  138. - If the header indicates a content-length, the stream returns
  139. that.
  140. + MuPDF can then decide how to proceed based upon these flags and
  141. whether the file is linearized or not. (If the file contains a
  142. linearized object, and the content length matches, then the file
  143. is considered to be linear, otherwise it is not).
  144. If the file is linear:
  145. - we proceed to read objects out of the file as it downloads.
  146. This will provide us the first page and all its resources. It
  147. will also enable us to read the hint streams (if present).
  148. - Once we have read the hint streams, we unpack (and sanity
  149. check) them to give us a map of where in the file each object
  150. is predicted to live, and which objects are required for each
  151. page. If any of these values are out of range, we treat the
  152. file as if there were no hint streams.
  153. - If we have hints, any attempt to load a subsequent page will
  154. cause MuPDF to attempt to read exactly the objects required.
  155. This will cause a sequence of seeks in the fz_stream followed
  156. by reads. If the stream does not have the data to satisfy that
  157. request yet, the stream code should remember the location that
  158. was fetched (and fetch that block in the background so that
  159. future retries will succeed) and should raise an
  160. FZ_ERROR_TRYLATER error.
  161. [Typically therefore when we jump to a page in a linear file
  162. on a byte request capable link, we will quickly see a rough
  163. rendering, which will improve fairly fast as images and fonts
  164. arrive.]
  165. - Regardless of whether we have hints or byte requests, on every
  166. fz_load_page call MuPDF will attempt to process more of the
  167. stream (that is assumed to be being downloaded in the
  168. background). As linearized files are guaranteed to have pages
  169. in order, pages will gradually become available. In the absence
  170. of byte requests and hints however, we have no way of getting
  171. resources early, so the renderings for these pages will remain
  172. incomplete until much more of the file has arrived.
  173. [Typically therefore when we jump to a page in a linear file
  174. on a non byte request capable link, we will see a rough
  175. rendering for that page as soon as data arrives for it (which
  176. will typically take much longer than would be the case with
  177. byte range capable downloads), and that will improve much more
  178. slowly as images and fonts may not appear until almost the
  179. whole file has arrived.]
  180. - When the whole file has arrived, then we will attempt to read
  181. the outlines for the file.
  182. For a non-linearized PDF on a byte request capable stream:
  183. - MuPDF will immediately seek to the end of the file to attempt
  184. to read the trailer. This will fail with a FZ_ERROR_TRYLATER
  185. due to the data not being here yet, but the stream code should
  186. remember that this data is required and it should be prioritized
  187. in the background fetch process.
  188. - Repeated attempts to open the stream should eventually succeed
  189. therefore. As MuPDF jumps through the file trying to read first
  190. the xrefs, then the page tree objects, then the page contents
  191. themselves etc, the background fetching process will be driven
  192. by the attempts to read the file in the foreground.
  193. [Typically therefore the opening of a non-linearized file will
  194. be slower than a linearized one, as the xrefs/page trees for a
  195. non-linear file can be 20%+ of the file data. Once past this
  196. initial point however, pages and data can be pulled from the
  197. file almost as fast as with a linearized file.]
  198. For a non-linearized PDF on a non-byte request capable stream:
  199. - MuPDF will immediately seek to the end of the file to attempt
  200. to read the trailer. This will fail with a FZ_ERROR_TRYLATER
  201. due to the data not being here yet. Subsequent retries will
  202. continue to fail until the whole file has arrived, whereupon
  203. the whole file will be instantly available.
  204. [This is the worst case situation - nothing at all can be
  205. displayed until the entire file has downloaded.]
  206. A typical structure for a fetcher process (see curl-stream.c in
  207. mupdf-curl as an example) might therefore look like this:
  208. + We consider the file as an (initially empty) buffer which we are
  209. filling by making requests. In order to ensure that we make
  210. maximum use of our download link, we ensure that whenever
  211. one request finishes, we immediately launch another. Further, to
  212. avoid the overheads for the request/response headers being too
  213. large, we may want to divide the file into 'chunks', perhaps 4 or 32k
  214. in size.
  215. + We can then have a receiver process that sits there in a loop
  216. requesting chunks to fill this buffer. In the absence of
  217. any other impetus the receiver should request the next 'chunk'
  218. of data from the file that it does not yet have, following the last
  219. fill point. Initially we start the fill point at the beginning of
  220. the file, but this will move around based on the requests made of
  221. the progressive stream.
  222. + Whenever MuPDF attempts to read from the stream, we check to see if
  223. we have data for this area of the file already. If we do, we can
  224. return it. If not, we remember this as the next "fill point" for our
  225. receiver process and throw a FZ_ERROR_TRYLATER error.