diff --git a/CFGBoltzmann.py b/CFGBoltzmann.py
index ed0d9953308971b84639b3f2207a87fef3c5100f..04602ef5036cb795bae9b1cff8f752a2ad9584d8 100644
--- a/CFGBoltzmann.py
+++ b/CFGBoltzmann.py
@@ -471,6 +471,82 @@ rulepack_cooked = z.preprocessor()
 qq = z.Gzero_shimmed(nonterminals.EXPRESSION, 11)
 print ("AND THE FINAL ANSWER IS","\n\n\n",qq, "\n\n\n")
 
+
+# We can now calculate c(n), the number of strings generated for each length n. We can also generate
+# a random string of a given length n from the language. We now add Boltzmann sampling proper.
+
+
+# We have the generating function for a combinatorial class:
+# G(z) = ∑ c(n) * z^n (from n=0 to n=infinity)
+# where n is a length, z is the parameter, and c(n) is the number of members of that class with
+# length exactly n. This sum converges for valid values of the parameter z to a value, call it
+# G.
+
+# To generate a Boltzmann sample, we choose a random real number t∈[0,G) and then find the
+# first integer K such that the partial sum of the terms with indices 0 through K is greater than t.
+# K will be the length of our sample, and to generate the actual object, we choose uniformly at random
+# amongst the members of the combinatorial class with length K and return it.
+
+# The difficulty here lies with the limit of the sum G(z) -- it is not clear how to find a
+# closed-form expression for it for an arbitrary grammar. However, we can estimate it
+# with a procedure that is adequate for our needs:
+
+# To better deal with these series and their sums, we define the following notation:
+#
+# G(z, a, b) is the partial sum of the generating function, with parameter z, calculated
+# on indices [a,b] inclusive.
+
+#               n = b
+#             _________
+#             \       /
+#              \
+#               \               n
+#  G(z,a,b)=    /    c(n) * z
+#              /
+#             /
+#             -------\
+#               n = a
+
+# Motivated by https://notebook.drmaciver.com/posts/2020-07-11-10:49.html, we note that
+# c(n) is always less than or equal to ||A||^n -- the maximum number of strings of
+# length n over an alphabet of size ||A||:
+
+# c(n) * z^n ≤ ||A||^n * z^n = ( ||A|| * z )^n
+
+# We note that each term of G(z) is less than or equal to the corresponding
+# term of a geometric series with ratio R = ||A|| * z.
+
+# We have closed-form expressions for partial sums of this geometric series GS:
+
+# GS(R, 0, N - 1) = (1 - R^N) / (1 - R)
+
+# and, when R < 1, for its limit:
+
+# GS(R, 0, infinity) = 1 / (1 - R)
+
+# To bound the limit of the generating function, G(z, 0, infinity), we find two sequences that
+# both converge to G(z, 0, infinity) and satisfy the following property, so that the sandwich theorem applies:
+
+# LB(z, 0, b) ≤ G(z, 0, b) ≤ UB(z, 0, b) for all b.
+
+# For LB(z, 0, b), we take G(z, 0, b) itself -- which trivially satisfies the inequality condition
+# and the same-limit condition.
+
+# UB(z, 0, b) is defined as G(z, 0, b) + GS(R, b+1, infinity). We can see that this satisfies the
+# inequality condition because GS(R, b+1, infinity) has only non-negative terms, so it cannot
+# force UB(z, 0, b) to be smaller than G(z, 0, b).
+
+# As for the limit condition, we note that UB is constructed in a special way -- the terms
+# [0, b] are from the original generating function, G(z), but the terms from [b + 1, infinity]
+# are from the geometric series. As b tends to infinity, the geometric tail
+# GS(R, b+1, infinity) = R^(b+1) / (1 - R) tends to 0 (provided R < 1), so UB(z, 0, b) tends to
+# G(z, 0, infinity).
+
+
+
+
+
+
 # Furthermore, we also note that the description of a context-free grammar is *itself* context-free
 # so if we take the CFG-description grammar (BNF or something isomorphic to it), and use Boltzmann
 # sampling on *it*, we will generate candidate grammars; which will be used to test the FPGA-based