\( \newcommand{\Ds}{\displaystyle} \newcommand{\PP}{{\mathbb P}} \newcommand{\RR}{{\mathbb R}} \newcommand{\KK}{{\mathbb K}} \newcommand{\CC}{{\mathbb C}} \newcommand{\ZZ}{{\mathbb Z}} \newcommand{\NN}{{\mathbb N}} \newcommand{\TT}{{\mathbb T}} \newcommand{\QQ}{{\mathbb Q}} \newcommand{\Abs}[1]{{\left|{#1}\right|}} \newcommand{\v}[1]{{{\mathbf #1}}} \newcommand{\Floor}[1]{{\left\lfloor{#1}\right\rfloor}} \newcommand{\Ceil}[1]{{\left\lceil{#1}\right\rceil}} \newcommand{\sgn}{{\rm sgn\,}} \newcommand{\Set}[1]{{\left\{{#1}\right\}}} \newcommand{\Norm}[1]{{\left\|{#1}\right\|}} \newcommand{\Prob}[1]{{{{\mathbb P}}\left[{#1}\right]}} \newcommand{\Mean}[1]{{{{\mathbb E}}\left[{#1}\right]}} \newcommand{\cis}{{\rm cis}\,} \newcommand{\one}{{\mathbf 1}} \renewcommand{\Re}{{\rm Re\,}} \renewcommand{\Im}{{\rm Im\,}} \renewcommand{\arg}{{\rm arg\,}} \renewcommand{\Arg}{{\rm Arg\,}} \renewcommand{\deg}{{\rm deg\,}} \newcommand{\ft}[1]{\widehat{#1}} \newcommand{\FT}[1]{\left(#1\right)^\wedge} \newcommand{\Lone}[1]{{\left\|{#1}\right\|_{1}}} \newcommand{\Linf}[1]{{\left\|{#1}\right\|_\infty}} \newcommand{\inner}[2]{{\langle #1, #2 \rangle}} \newcommand{\Inner}[2]{{\left\langle #1, #2 \right\rangle}} \)

Probability in Data Science

A short course

Spring 2024-25

KU Eichstätt - Ingolstadt


Teacher: Mihalis Kolountzakis

Class diary

In reverse chronological ordering

Thursday, 15 May 2025

• Today we started by proving the theorem that we stated last time. The key is to apply Markov's inequality to the variable $e^{\lambda S}$ for an appropriately chosen parameter $\lambda>0$: $$ \Prob{S \ge a} = \Prob{e^{\lambda S} \ge e^{\lambda a}} \le e^{-\lambda a} \Mean{e^{\lambda S}}. $$ Since $S=X_1+\cdots+X_n$ and the $X_j$ are independent we can write $$ \Mean{e^{\lambda S}} = \Mean{e^{\lambda X_1}} \cdots \Mean{e^{\lambda X_n}} = \left( \Mean{e^{\lambda X_1}} \right)^n, $$ and using the inequality $\frac{e^{-\lambda}+e^\lambda}{2} \le e^{\lambda^2/2}$ (easy to prove by comparing the power series expansions of the two sides) we deduce, choosing $\lambda=a/n$, the desired inequality of the theorem (the same bound for $\Prob{S \le -a}$ gives the factor 2).

• We then used the same theorem to prove that if $A$ is an $n\times m$ matrix with 0 or 1 entries then there is a vector of signs $b \in \Set{-1, 1}^m$ such that $\Linf{A b} \le \sqrt{4 m \log n}$. The trick is again to take $b$ to be a random vector of signs (independently for all coordinates and with equal probability for $\pm 1$). It follows that $\Mean{(Ab)_i} = 0$ for all $i=1,\ldots,n$, and we would like all the random variables $(Ab)_i$, $i=1,\ldots,n$, to be close to their expected value. The probability that this fails for a given $i$ is bounded by the theorem above. It turns out that the sum of the probabilities of these "bad events" is smaller than 1, so that with positive probability none of the bad events happens. (A small numerical sketch of this random-signs argument appears at the end of today's notes.) You can read more details in Theorem 4.11 of the book "Mitzenmacher and Upfal, Probability and Computing, 2ed".

• Then we stated the following large deviation inequality without proof. If $X_1, \ldots, X_n$ are independent indicator (0 or 1-valued) random variables and $S = X_1+\cdots+X_n$ with $\mu = \Mean{S}$ then for any $\delta \in (0, 1)$ we have $$ \Prob{\Abs{S-\mu} \ge \delta \mu} \le 2 e^{-\mu \frac{\delta^2}{3}}. $$

• We used this result next to estimate the number $n$ of repetitions of a randomized algorithm, which produces the correct result with probability $p\gt 1/2$, that we need to perform in order to select as the correct result the one which appears in more than half of the repetitions. Writing $X_j = 1$ if the $j$-th repetition of the randomized algorithm gave the correct result and 0 otherwise, and $S=X_1+\cdots+X_n$, we observe that $\Mean{S} = np$ and we would like the bad event $\Set{S \le n/2}$ not to happen. This bad event is contained in the bad event $\Set{\Abs{S-np} \ge n(p-\frac12)} = \Set{\Abs{S-np} \ge \frac{p-\frac12}{p}\, np}$ whose probability, by our inequality with $\delta = \frac{p-\frac12}{p}$, is at most $2 \exp\left(-\frac{n(p-\frac12)^2}{3p}\right)$, and we would like this to be at most a desired $\epsilon$. Solving this inequality for $n$ we get that $$ n \ge \frac{3p}{(p-\frac12)^2} \log\frac{2}{\epsilon} $$ is enough.

• Lastly we talked about a theorem of Erdős on the existence of additive bases of order 2 for the natural numbers. This was proved using the probabilistic method, where we repeatedly use the previous large deviation bound. You can read about this result in Section 3.2 of this paper.
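As a small numerical illustration of the random-signs argument above (a sketch not shown in class, with arbitrarily chosen dimensions $n$ and $m$), the following Python lines draw random sign vectors $b$ for a random 0/1 matrix $A$ and report how often $\Linf{Ab} \le \sqrt{4 m \log n}$. The theorem only needs this to happen with positive probability; in practice it holds for the vast majority of random sign vectors.

import numpy as np

rng = np.random.default_rng(0)

n, m = 200, 500                        # arbitrary dimensions for the demo
A = rng.integers(0, 2, size=(n, m))    # a random 0/1 matrix
bound = np.sqrt(4 * m * np.log(n))

trials, good = 1000, 0
for _ in range(trials):
    b = rng.choice([-1, 1], size=m)         # independent random signs
    if np.max(np.abs(A @ b)) <= bound:      # is ||Ab||_infty <= sqrt(4 m log n) ?
        good += 1

print(f"bound sqrt(4 m log n) = {bound:.1f}")
print(f"fraction of sign vectors within the bound: {good / trials:.3f}")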

Tuesday, 13 May 2025

We quickly summarized what we talked about last time, namely applications of the probabilistic method, in its simplest "average-value argument" form, to prove some theorems that at first sight have nothing to do with probability.

Today we gave one more example of this method. It differs from the ones we saw last time in that the random object constructed is not guaranteed to have the properties that we want. Instead, we need to modify it somewhat, and then it works. These modifications need to be controlled, and this is again done using the average value argument.

• If the graph $G$ has $n$ vertices and $nd/2$ edges (so that $d$ is the average degree) then it contains an independent set of size $\ge n/(2d)$.

Proof: Take a random subset $S$ of the vertices by keeping each vertex independently with probability $p$. Let $X$ be its size and $Y$ the number of edges of $G$ with both endpoints in $S$. It turns out that $$ \Mean{Y} = \frac{nd}{2}p^2. $$ Since $\Mean{X} = pn$ we have $\Mean{X-Y}=np-\frac{nd}{2}p^2$. Choosing $p=1/d$ to make this as large as possible, we obtain $\Mean{X-Y}=n/(2d)$, so there is a choice of $S$ with $X-Y \ge n/(2d)$. Deleting from $S$ one endpoint of each of its $Y$ edges we are left with an independent set of size $\ge X-Y \ge n/(2d)$.

• The next thing we saw is the so-called derandomization by conditional probabilities, which is a general method for turning some probabilistic proofs of existence into constructive methods (algorithms) that find the good object whose existence we have already proved. Read about it here (section 1.3).

• We observed next that the average value argument cannot possibly help us construct an interesting object if the number of quantities we want to control (to have in a certain range, for example) is more than one. For this we need deviation inequalities: inequalities that give us upper bounds on the probability that a certain random variable is far from its mean. The best known (and usually weak) such inequalities are Markov's inequality and Chebyshev's inequality, which we encounter in any standard probability course. When, however, the random variable $X$ in question is a sum of many independent parts, then we have the phenomenon of concentration of $X$ close to its mean $\Mean{X}$. We gave (without proof) an example of such an inequality:

• If $S = X_1+\cdots+X_n$ and the $X_j$ are independent and equal to $\pm 1$ with equal probability then $$ \Prob{\Abs{S} \ge a} \le 2 e^{-a^2/(2n)}. $$

We then compared the inequality given by this theorem (which we will prove next time) with the bound given by Chebyshev's inequality and observed that it is dramatically better.
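As a rough numerical illustration of this comparison (a sketch not shown in class, with arbitrarily chosen values of $t$): since $\Mean{S}=0$ and ${\rm Var}(S)=n$, Chebyshev's inequality gives $\Prob{\Abs{S} \ge t\sqrt{n}} \le 1/t^2$, while the theorem above gives $2e^{-t^2/2}$. The following few lines of Python print both bounds.

import math

# S = X_1 + ... + X_n with X_j = +-1 equally likely, so Var(S) = n.
# For a = t*sqrt(n): Chebyshev gives 1/t^2, the exponential bound gives 2*exp(-t^2/2).
for t in [2, 4, 6, 8, 10]:
    chebyshev = 1 / t**2
    exponential = 2 * math.exp(-t**2 / 2)
    print(f"t = {t:2d}:   Chebyshev <= {chebyshev:.2e}   exponential bound <= {exponential:.2e}")

Already for $t=6$ the exponential bound is about $3\cdot 10^{-8}$, while Chebyshev only gives about $0.028$.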

Thursday, 8 May 2025

Today we examined first some randomized algorithms (algorithms that use probability while running, but may fail to work with a very small probability) and then saw a few proofs by the probabilistic method (probabilistic constructions of objects with some desired properties).
  1. Suppose we have two long strings of numbers $a = a_0 a_1 a_2 \ldots a_N$ and $b = b_0 b_1 b_2 \ldots b_N$ (imagine a large number $N$) and we want to decide if the two strings are the same. String $a$ resides on our local computer but string $b$ on a remote, slowly accessible computer. We could transfer $b$ to our local computer and check that $a_j = b_j$ for $j=0, \ldots, N$, at a cost of $O(N)$ units of communication (which we are assuming are very costly, and which we would like to minimize). Instead of transferring the whole string $b$ we do the following. We define the two polynomials $a(x) = \sum_{j=0}^N a_j x^j$ and $b(x) = \sum_{j=0}^N b_j x^j$. Obviously the two strings are identical if and only if the two polynomials are identical, or, equivalently, if the polynomial $P(x) = a(x)-b(x)$, of degree $\le N$, is identically 0. Let the random number $X$ be uniform in the set $S = \Set{1, 2, \ldots, 2N}$. If the two strings are different then $P$ is not identically zero, so it has at most $N$ roots, hence at most $N$ roots in $S$, and it follows that $\Prob{P(X) = 0} \le \frac12$.

    So our algorithm for checking the identity of the two strings is the following. For a number $k$ we let the remote computer evaluate the numbers $b(X_1), b(X_2), \ldots, b(X_k)$, where the $X_j$ are produced by us (locally), chosen independently and uniformly from $S$. Then the remote computer sends these values to us and we check whether $a(X_j) = b(X_j)$ for $j=1, \ldots, k$. If the two strings are different, the polynomial $P(x)$ is not identically zero, so the probability that all $k$ checks pass is at most $1/2^k$. We pick the number $k$ large enough so that we can tolerate failure probability $1/2^k$ (for instance, $k = 333$ already guarantees failure probability below $10^{-100}$). Let us point out that the two sides may still have to do work $O(N)$ in order to evaluate the two polynomials, but the slow part, the communication, is only $O(k)$. (A small Python sketch of this check appears right after this list.)

    As a demonstration of the principle of checking identities up to a probability of error, and thus saving communication, we mentioned the well-established program rsync. Read about it here.

  2. Suppose someone claims that $A B = C$, for three given $N \times N$ matrices of real numbers. We would like to verify this claim, but we cannot afford to recalculate the product $A B$ from the matrices $A, B$, which takes about $N^3$ time. We can however do this much faster, using only a few matrix-vector multiplications (time $O(N^2)$ each) and with a very small probability of failure. Read about it here. We also showed and ran a small python program that implements this algorithm. It is here.
  3. We saw the algorithm called "reservoir sampling". See the description of the problem here and the solution we gave here.
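Returning to item 1 above, here is a minimal Python sketch of the randomized identity check (written for these notes, not a program we ran in class). It works with exact Python integers; a real implementation would evaluate the polynomials modulo a large prime to keep the numbers small.

import random

def poly_eval(digits, x):
    """Evaluate sum_j digits[j] * x**j by Horner's rule."""
    value = 0
    for d in reversed(digits):
        value = value * x + d
    return value

def probably_equal(a, b, k=30):
    """Declare the digit strings a, b equal if they agree at k random points."""
    N = max(len(a), len(b)) - 1
    points = [random.randint(1, 2 * N) for _ in range(k)]   # uniform in {1, ..., 2N}
    return all(poly_eval(a, x) == poly_eval(b, x) for x in points)

# toy usage: two long digit strings differing in one position
a = [1, 2, 3] * 1000
b = list(a)
b[1234] += 1
print(probably_equal(a, a))   # always True when the strings are equal
print(probably_equal(a, b))   # False, except with probability at most 2**(-30)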
The next examples are typical uses of the probabilistic method relying only on the expected value (mean) of a single random variable. They are probabilistic constructions: setting up the right probabilistic experiment, we construct an object with certain properties based on the expected value of some quantity being in the right range.
  1. We proved that for every $n$ we can find an edge-coloring with two colors, red and blue, of $K_n$, the complete graph on $n$ vertices, such that the number of monochromatic triangles is $\le \frac14 {n \choose 3}$. A set of three vertices is called a monochromatic triangle if the three edges connecting them are colored with the same color. The way to prove this is to take a random coloring of the edges, with both colors having the same probability and independently for all edges. Then we proved that the expected number of monochromatic triangles is precisely $\frac14 {n \choose 3}$. Therefore there is at least one coloring which leads to the number of monochromatic triangles being at most $\frac14 {n \choose 3}$.
  2. For any graph of $n$ vertices and $e$ edges we can find a bipartite subgraph with at least $e/2$ edges. The way to prove this is to separate the vertices of the graph at random into two "sides" L and R and keep only those edges that go from an L-vertex to an R-vertex (crossing edges). The expected value of the number of crossing edges, it is easy to see, is precisely $e/2$, therefore there is a separation of the vertices into the L and R sides that leads to at least $e/2$ crossing edges.
  3. Given $n$ unit-length vectors $v_1, \ldots, v_n \in \RR^n$ we saw that we can always choose signs $\epsilon_1, \ldots, \epsilon_n = \pm 1$ such that the vector $v = \epsilon_1 v_1 + \cdots + \epsilon_n v_n$ has Euclidean norm $\le \sqrt{n}$. It is also possible to choose the signs in such a way that $v$ has Euclidean norm $\ge \sqrt{n}$. To prove this we chose random signs $\epsilon_j$ and we computed the expected value $$ \Mean{\Abs{v}^2} = \Mean{\Inner{v}{v}} = \sum_{i,j} \Mean{\epsilon_i \epsilon_j} \Inner{v_i}{v_j} = \sum_{j=1}^n \Abs{v_j}^2 = n, $$ since $\Mean{\epsilon_i \epsilon_j} = 0$ for $i \neq j$ by independence. This means that there is a choice of the signs so that $\Abs{v}^2 \le n$ and also another choice of the signs that leads to $\Abs{v}^2 \ge n$. (A quick numerical check of this computation appears after this list.)

    It is easy to see, by taking the vectors $v_j$ to be the coordinate vectors $e_j$, that these bounds cannot be improved: in that case, no matter what the signs are, we have $\Abs{v}=\sqrt{n}$.
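As a quick numerical check of the computation in item 3 (a sketch not shown in class, using randomly generated unit vectors): the empirical average of $\Abs{v}^2$ over many random choices of signs is indeed $n$, and individual sign choices land both below and above $n$, consistent with the two existence claims.

import numpy as np

rng = np.random.default_rng(1)

n = 50
V = rng.normal(size=(n, n))
V /= np.linalg.norm(V, axis=1, keepdims=True)    # rows of V are unit vectors v_1, ..., v_n

trials = 20000
sq_norms = []
for _ in range(trials):
    eps = rng.choice([-1.0, 1.0], size=n)        # random signs eps_1, ..., eps_n
    v = eps @ V                                  # v = eps_1 v_1 + ... + eps_n v_n
    sq_norms.append(v @ v)                       # |v|^2

print(f"empirical mean of |v|^2 : {np.mean(sq_norms):.2f}   (theory: {n})")
print(f"smallest and largest |v|^2 seen: {min(sq_norms):.1f}, {max(sq_norms):.1f}")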

Tuesday, 6 May 2025

We started by giving several examples of uses of probability as well as some puzzles. Here they are, briefly described, with pointers to more details elsewhere.

  1. How to conduct a survey when one answer might be sensitive? Suppose you want to measure statistically the percentage of a population who have committed something bad. The participants (the randomly sampled members of the population who are asked if they committed this bad act) are willing to help you measure this number but everyone wants deniability: if they answer YES to the question ("Have you done something bad?") they want to be able to deny this if confronted with this later on. How do we achieve this?

    Read more here.

  2. The young man with two lady friends. Here is the problem, quoted from here:
    A young man lives in Manhattan near a subway express station. He has two girlfriends, one in Brooklyn, one in the Bronx. To visit the girl in Brooklyn, he takes a train on the downtown side of the platform; to visit the girl in the Bronx, he takes a train on the uptown side of the same platform. Since he likes both girls equally well, he simply takes the first train that comes along. In this way, he lets chance determine whether he rides to the Bronx or to Brooklyn. The young man reaches the subway platform at a random moment each Saturday afternoon. Brooklyn and Bronx trains arrive at the station equally often—every 10 minutes. Yet for some obscure reason he finds himself spending most of his time with the girl in Brooklyn: in fact on the average he goes there 9 times out of 10. Can you think of a good reason why the odds so heavily favor Brooklyn?
    Read the solution to the puzzle from the Wikipedia article linked to from the same page.
  3. We discussed the Monty Hall Problem, a classic. Read about it here but do not read too much. Think of the three strategies we talked about:
    1. Choose a box at random and stay.
    2. Choose a box at random and then, after one empty box is out of the way, choose again at random from the remaining two.
    3. Choose a box at random and then always switch to the other closed box.
    Convince yourselves that the probabilities of winning with these three strategies are 1/3, 1/2 and 2/3 respectively.
  4. The host of this game holds (hidden) in his hands two amounts of money: some unknown amount and its double. We do not know which amount is in which hand. After we choose a hand at random, the host opens the hand we chose and we see 1 euro. We are then given the option to change our selection before the contents of the other hand are revealed. Should we change or not?

    The wrong way to think about this is as follows: since we see 1 euro in the open hand, the contents of the other hand are either 1/2 euro or 2 euros, each happening with probability 1/2. Therefore, if we switch to the other hand the expected gain will be $\frac12 \cdot \frac12 + \frac12\cdot 2 = \frac54$, which is $\gt 1$, hence we should switch.

    In reality it does not make any difference whether we switch or not (statistically -- of course it can matter in a specific instance of the game). So where is the error in the above calculation?

  5. Underneath two pieces of paper are written two different real numbers $a$ and $b$, unknown to us.

    We randomly choose a piece of paper, look at the number underneath, and win a candy if we correctly state whether the number we see is the larger or the smaller of the two.

    There is an easy way to win the candy with probability $1/2$. We just always say "larger". Or always say "smaller". Or we flip a fair coin and say "larger" or "smaller" depending on what the coin came up with, heads or tails. (This way we don't even take into account the number we saw written under the piece of paper we chose.)

    Is there a way to play this game so that the probability of getting the candy is greater than $1/2$?

    The surprising answer is yes (though we cannot guarantee by how much this winning probability exceeds $1/2$).

    We can do this as follows. Let $X$ be the number we see after we choose a piece of paper (so $X$ is a random variable taking the two values $a$ and $b$ with equal probability). Let also $Y$ be a random variable, independent of our choice, whose density on the real axis is everywhere strictly positive. For instance we may take $Y$ to follow the standard normal distribution $\mathcal{N}(0, 1)$. We run the experiment that produces $Y$ and then say "larger" if $X \ge Y$ or "smaller" if $X \lt Y$. Indeed, if $Y$ happens to fall strictly between $a$ and $b$ (an event of positive probability) we always answer correctly, while otherwise we answer correctly with probability exactly $1/2$; so the overall probability of success is strictly larger than $1/2$.

  6. If we have at our disposal a random number generator which produces a random variable $X$ uniformly distributed in $[0, 1]$, we saw how to choose a function $\phi(x)$ so that the random variable $Y = \phi(X)$ follows a desired target distribution (e.g. the standard normal distribution $\mathcal{N}(0, 1)$). A small sketch of this, for a distribution whose distribution function can be inverted explicitly, appears at the end of these notes.

    Read about it here.

  7. How to convince your color-blind friend that the two, otherwise identical, balls you are holding in your hands are green and red?

    Read about it here.

  8. We mentioned very briefly the Miller-Rabin primality test: given an integer $n$ decide if $n$ is prime or not.

    Read about it here.
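To illustrate item 6 above, here is a minimal sketch (not shown in class) of the inverse-transform idea: if $F$ is the distribution function of the target distribution and $X$ is uniform in $[0,1]$, then $Y = F^{-1}(X)$ has distribution function $F$. The exponential distribution, whose distribution function inverts in closed form, makes a clean example; for the standard normal one would use a numerical inverse such as scipy.stats.norm.ppf.

import math
import random

def exp_via_inverse_cdf(lam=1.0):
    """One sample from the exponential distribution with rate lam via Y = F^{-1}(X)."""
    u = random.random()                  # X uniform in [0, 1)
    return -math.log(1.0 - u) / lam      # F(y) = 1 - exp(-lam*y), so F^{-1}(u) = -log(1-u)/lam

samples = [exp_via_inverse_cdf(2.0) for _ in range(100000)]
print(f"empirical mean: {sum(samples) / len(samples):.3f}   (theory: 1/2)")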