<!doctype html>
<html>
	<head>
		<meta charset="utf-8">
		<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">

		<title>Streaming Topic Model Training and Inference</title>

    <meta name="description" content="Streaming Topic Model Training and Inference">
    <meta name="author" content="Suneel Marthi; Joey Frazee">

    <meta name="apple-mobile-web-app-capable" content="yes"/>
    <meta name="apple-mobile-web-app-status-bar-style" content="black-translucent"/>

    <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no, minimal-ui">

    <link rel="stylesheet" href="font-awesome-4.7.0/css/font-awesome.min.css">
		<link rel="stylesheet" href="css/reveal.css">
		<link rel="stylesheet" href="css/theme/bbuzz17.css">

		<!-- Theme used for syntax highlighting of code -->
		<link rel="stylesheet" href="lib/css/zenburn.css">

		<!-- Printing and PDF exports -->
		<script>
			var link = document.createElement( 'link' );
			link.rel = 'stylesheet';
			link.type = 'text/css';
			link.href = window.location.search.match( /print-pdf/gi ) ? 'css/print/pdf.css' : 'css/print/paper.css';
			document.getElementsByTagName( 'head' )[0].appendChild( link );
		</script>
    <!--[if lt IE 9]> <script src="lib/js/html5shiv.js"></script> <![endif]-->
	</head>
	<body>
		<div class="reveal">
			<div class="slides">
				<section data-background-image="img/barcelona.jpg">
          <br/>
          <br/>
          <br/>
          <h3>Streaming Topic Model Training and Inference</h3>
          <br/>
          <p>Suneel Marthi</p>
          <p style="font-size: 60%">March 21, 2019<br/>
              DataWorks Summit 2019 - Barcelona</p>
        </section>
        <section data-background-image="img/barcelona.jpg">
            <h3>$WhoAreWe</h3>
            <br/>
            <ul style='list-style: none;'>
                <li>
                    <h6 style='text-align: left; font-size: 80%;'>Suneel Marthi <br/><span style="font-size: 60%"><i class="fa fa-twitter" aria-hidden="true"></i> @suneelmarthi</span></h6>
                    <ul style='font-size: 70%;'>
                        <li>Member of Apache Software Foundation</li>
                        <li>Committer and PMC member on Apache Mahout, Apache OpenNLP, Apache Streams</li>
                    </ul>
                </li>
            </ul>
        </section>
        <section data-background-image="img/ZsMwlm2.jpg">
          <h3>Agenda</h3>
          <ul>
             <li>Motivation for Topic Modeling</li>
             <li>Existing Approaches for Topic Modeling</li>
             <li>Topic Modeling on Streams</li>
          </ul>
        </section>
        <section data-background-image="img/ZsMwlm2-purpleish.jpg">
            <h3>Motivation for Topic Modeling</h3>
        </section>

        <section data-background-image="img/ZsMwlm2-redish.jpg">
            <h3>Topic Models</h3>
            <ul>
                <li>Automatically discovering main topics in collections of documents
                </li>
                <ul>
                    <li>Overall</li>
                    <ul>
                        <li>What are the main themes discussed on a mailing list or a discussion forum?
                        </li>
                        <li>
                            What are the most frequently recurring topics in research papers?
                        </li>
                    </ul>
                </ul>
            </ul>
        </section>
        <section data-background-image="img/ZsMwlm2-yellowish.jpg">
                <h3>Topic Models (Contd)</h3>
                <ul>
                    <li>Per Document</li>
                    <ul>
                        <li>What are the main topics discussed in a single newspaper article?
                        </li>
                        <li>What is the most distinctive topic that sets a given research article apart from the others?</li>
                    </ul>
                </ul>
        </section>
        <section data-background-image="img/ZsMwlm2-greenish.jpg">
            <h3>Uses of Topic Models</h3>
            <ul>
                <li>Organizing text in search engines</li>
                <li>Annotating Documents according to these topics</li>
                <li>Uncovering hidden topical patterns across a document collection</li>
            </ul>
        </section>
        <section data-background-image="img/ZsMwlm2-purpleish.jpg">
            <h3>Common Approaches to Topic Modeling</h3>
        </section>
        <section data-background-image="img/ZsMwlm2-redish.jpg">
            <ul>
                <li>Latent Dirichlet Allocation (LDA)</li>
                <ul>
                    <li>Topics are probability distributions over words</li>
                    <li>Documents are probability distributions over topics</li>
                    <li>Batch Oriented approach</li>
                </ul>
            </ul>
        </section>
				<section data-background-image="img/ZsMwlm2-purpleish.jpg">
					<h3>LDA Model</h3>
					<table align="left">
						<tr>
							<td width=400 style="vertical-align: top"><img src="img/lda.png" border=0/></td>
							<td width=400 style="vertical-align: middle">
								<ul>
									<li style="font-size: 16pt"><em>N</em>: # of words in a document</li>
									<li style="font-size: 16pt"><em>M/D</em>: # of documents</li>
									<li style="font-size: 16pt"><em>alpha</em>: prior on per-document topic dist.</li>
									<li style="font-size: 16pt"><em>beta</em>: prior on topic-word dist.</li>
									<li style="font-size: 16pt"><em>theta</em>: per-document topic dist.</li>
									<li style="font-size: 16pt"><em>phi</em>: per-topic word dist.</li>
									<li style="font-size: 16pt"><em>z</em>: topic for word w</li>
								</ul>
							</td>
						</tr>
						<tr>
							<td colspan=2 style="font-size: 18pt;"><p><b>Intuition</b>: Documents are mixtures of topics, and topics are mixtures of words. So a document can be generated by drawing a topic for each word position from the document's topic mixture, and then drawing a word from that topic.</p></td>
						</tr>
					</table>
				</section>
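        <section data-markdown data-background-image="img/ZsMwlm2-blueish.jpg">
            <script type="text/template">
<h4>LDA Generative Process (sketch)</h4>
<p style="font-size: 50%">A minimal sketch of the intuition on the previous slide, assuming a toy vocabulary and two hand-picked topics. The names <code>sample_dirichlet</code>, <code>generate_document</code> and the <code>PHI</code> table are illustrative only and do not come from any library.</p>

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

# Hypothetical per-topic word distributions (phi), one dict per topic.
PHI = [
    {"basket": 0.5, "ball": 0.3, "dunk": 0.2},
    {"stream": 0.5, "window": 0.3, "state": 0.2},
]

def generate_document(n_words, alpha, rng):
    theta = sample_dirichlet(alpha, rng)  # per-document topic dist.
    words = []
    for _ in range(n_words):
        # z: topic for this word position, drawn from theta
        z = rng.choices(range(len(PHI)), weights=theta)[0]
        vocab, probs = zip(*PHI[z].items())
        # the word itself, drawn from topic z's word distribution
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

theta, words = generate_document(8, alpha=[0.5, 0.5], rng=random.Random(0))
print(theta, words)
```
            </script>
        </section>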
        <section data-background-image="img/ZsMwlm2-yellowish.jpg">
            <ul>
                <li>Latent Semantic Analysis (LSA)</li>
                <ul>
                    <li style="font-size: 85%">Singular value decomposition (SVD) of the TF-IDF document-term matrix</li>
                    <li style="font-size: 85%">Batch Oriented approach</li>
                </ul>
            </ul>
        </section>
        <section data-background-image="img/ZsMwlm2-blueish.jpg">
            <ul>
                <li>Drawbacks of Traditional Topic Modeling Approach</li>
                <ul>
                    <li>Problem: labelling topics is often a manual task
                    </li>
                </ul>
                <pre><code data-trim data-noescape>
                    E.g. LDA topic: 30% basket, 15% ball, 10% dunk, …
                                  can be tagged as basketball
                </code></pre>
            </ul>
        </section>
        <section data-background-image="img/ZsMwlm2-redish.jpg">
            <h3><s>Common Approaches</s> Deep Learning (of course)</h3>
            <ul>
                <li>Learning to perform LDA: train a DNN supervised by LDA output (speeds up inference)
                    <ul>
                        <li>https://arxiv.org/abs/1508.01011</li>
                    </ul>
                </li>
                <li>Deep Belief Networks for Topic Modelling
                    <ul>
                        <li>https://arxiv.org/abs/1501.04325</li>
                    </ul>
                </li>

            </ul>
        </section>

                <section data-background-image="img/ZsMwlm2-greenish.jpg">
                    <h3>What are Embeddings?</h3>
                    <p align="left" style="font-size: 85%">Represent a document as a point in space; semantically similar documents lie close together when plotted</p>
                    <p align="left" style="font-size: 85%">Such a representation is learned via a (shallow) neural network algorithm</p>
                </section>

        <section data-background-image="img/ZsMwlm2-greenish.jpg">
            <h4>Embeddings for Topic Modeling</h4>
            <ul>
                <li>LDA2Vec: Mixing LDA and word2vec word embeddings
                    <ul>
                        <li>https://arxiv.org/abs/1605.02019</li>
                    </ul>
                </li>
                <li>Navigating embeddings: geometric navigation in word
                    and doc embeddings space looking for topics
                </li>
            </ul>
            <p><img src='img/embeddings.png' style='position: absolute; right: 0px; bottom: 0px; width: 30%; height: 30%; margin: 0; border: none; background: none; box-shadow: none;' align="right"/><br/></p>
        </section>
        <section data-background-image="img/ZsMwlm2-yellowish.jpg">
            <h4>Static vs Dynamic Corpora</h4>
            <ul>
                <li>Learning Topic Models over a fixed set of Documents
                    <ul>
                        <li>Fetch the Corpus</li>
                        <li>Train and Fit a topic model</li>
                        <li>Extract topics</li>
                        <li>...other downstream tasks</li>
                    </ul>
                </li>
            </ul>
        </section>
        <section data-background-image="img/ZsMwlm2-redish.jpg">
            <h4>Static vs Dynamic Corpora (Contd)</h4>
            <ul>
                <li>What to do with Corpus updates?
                    <ul>
                        <li>Document collections are not static in the real world</li>
                        <li>There may be no document collection to begin with</li>
                        <li>Memory limits with large corpora</li>
                        <li>Reprocessing the entire corpus whenever new documents are added</li>
                    </ul>
                </li>
            </ul>
        </section>
        <section data-background-image="img/ZsMwlm2-blueish.jpg">
            <h4>Static vs Dynamic Corpora (Contd)</h4>
            <ul>
            <li>Streams, Streams, Streams
                <ul>
                    <li>Twitter stream</li>
                    <li>Reddit posts</li>
                    <li>Papers on ResearchGate and arXiv</li>
                    <li>....</li>
                </ul>
            </li>
            </ul>
        </section>

        <section data-background-image="img/ZsMwlm2-greenish.jpg">
            <h3>Questions for Audience</h3>
            <ul>
                <li style="font-size: 85%">
                    Who’s doing inference or scoring on streaming data?<br/>
                </li>
                <li style="font-size: 85%">
                    Who’s doing online learning and optimization on streaming data?
                </li>
            </ul>
            <p>Online learning is hard, sometimes because the algorithms aren’t there,
                but also because people don’t know that the algorithms exist.
            </p>
        </section>

        <section data-background-image="img/ZsMwlm2-purpleish.jpg">
            <h3>Topic Modeling on Streams</h3>
        </section>

        <section data-background-image="img/ZsMwlm2-blueish.jpg">
            <h4>Two Streaming Approaches</h4>
            <ul>
                <li>Learning Topics from Jira Issues</li>
                <li>Online LDA</li>
            </ul>
        </section>

        <section data-background-image="img/ZsMwlm2-purpleish.jpg">
            <h3>Learning Topics on Jira Issues</h3>
        </section>

        <section data-background-image="img/ZsMwlm2-greenish.jpg">
            <h3>Workflow</h3>
            <img src='img/Jira-Topic-Model.png' style='max-width: 80%; border: none; background: none;box-shadow: none; '/>
        </section>


        <section data-background-image="img/ZsMwlm2-purpleish.jpg">
            <h4>FLINK-9963: Add a single value table format factory</h4>
            <img src='img/Flink9963.png' style='max-width: 80%; border: none; background: none;box-shadow: none; '/>
        </section>

                <section data-background-image="img/ZsMwlm2-redish.jpg">
                    <h4>Topics extracted for FLINK-9963</h4>
                    <ul>
                        <li style="font-size: 85%">format factory</li>
                        <li style="font-size: 85%">table schema</li>
                        <li style="font-size: 85%">table source</li>
                        <li style="font-size: 85%">sink factory</li>
                        <li style="font-size: 85%">table sink</li>
                    </ul>

                </section>

                <section data-background-image="img/ZsMwlm2-purpleish.jpg">
                    <h4>FLINK-8286: Fix Flink-Yarn-Kerberos integration for FLIP-6</h4>
                    <img src='img/Flink8286.png' style='max-width: 80%; border: none; background: none;box-shadow: none; '/>
                </section>

                <section data-background-image="img/ZsMwlm2-redish.jpg">
                    <h4>Topics extracted for FLINK-8286</h4>
                    <ul>
                        <li style="font-size: 85%">yarn kerberos integration</li>
                        <li style="font-size: 85%">kerberos integrations</li>
                        <li style="font-size: 85%">yarn kerberos</li>
                    </ul>

                </section>

        <section data-background-image="img/ZsMwlm2-yellowish.jpg">
            <h3>Bayesian Inference</h3>
            <p><b>Goal</b>: Calculate the posterior distribution of a probabilistic model given data</p>
            <br/>
            <ul>
                <li style="font-size: 85%">Gibbs sampling/MCMC: Repeatedly sample from conditional distribution(s) to approximate the posterior of the joint distribution;
                    typically a batch process<br/>
                </li>
                <li style="font-size: 85%">
                    Variational Bayes: Optimize the parameters of a simpler distribution so that it approximates the posterior distribution
                </li>
            </ul>
        </section>

        <section data-background-image="img/ZsMwlm2-greenish.jpg">
             <h3>Variational Bayes</h3>
             <ul>
                 <li style="font-size: 85%">Expectation Maximization-like optimization</li>
                 <li style="font-size: 85%">Typically faster than sampling; online VB for LDA matched or beat batch VB in Hoffman et al. (2010)</li>
                 <li style="font-size: 85%">Can often be computed online (on windows over streams)</li>
             </ul>
        </section>

				<section data-background-image="img/ZsMwlm2-redish.jpg">
					<h3>Online LDA (1/2)</h3>
					<ul>
						<li style="font-size: 85%">Variational Bayes for LDA</li>
						<li style="font-size: 85%">Batch or online</li>
						<li style="font-size: 85%">Iterate between optimizing per-word topics and per-document topics (“E” step) and topic-word distributions (“M” step)</li>
						<li style="font-size: 85%">Implemented in Gensim, scikit-learn, and Apache Spark MLlib (often as the default optimizer), less because it’s amenable to streaming per se and more because it’s memory efficient and can be applied to large corpora.</li>
					</ul>
					<p style="font-size: 40%; text-align: left;"><a style="color: white;" href="https://www.di.ens.fr/~fbach/mdhnips2010.pdf">Hoffman, Blei &amp; Bach. Online Learning for Latent Dirichlet Allocation. Proceedings of Neural Information Processing Systems. 2010.</a></p>
				</section>

				<section data-background-image="img/ZsMwlm2-yellowish.jpg">
					<h3>Online LDA (2/2)</h3>
					<table align="left">
						<tr>
							<td width=400 style="vertical-align: top"><img src="img/online_vb_for_lda.png" border=0/></td>
							<td width=400 style="vertical-align: top">
								<ul>
									<li style="font-size: 14pt; list-style: none; padding-left: 0; margin-left: -16px">job-constant</li>
								  <li style="font-size: 11pt"><em>K</em>: number of topics</li>
								  <li style="font-size: 11pt"><em>alpha</em>: topic-document dist. prior</li>
								  <li style="font-size: 11pt"><em>eta</em>: topic-word dist. prior</li>
								  <li style="font-size: 11pt"><em>tau0</em>: learning offset (down-weights early iterations)</li>
								  <li style="font-size: 11pt"><em>kappa</em>: decay rate</li>
								</ul>
								<ul>
									<li style="font-size: 14pt; list-style: none; padding-left: 0; margin-left: -16px">(mini)batch-local</li>
									<li style="font-size: 11pt"><em>gamma</em>: variational param. for theta</li>
									<li style="font-size: 11pt"><em>theta</em>: per-document topic dist.</li>
									<li style="font-size: 11pt"><em>phi</em>: variational param. for per-word topic assignments</li>
									<li style="font-size: 11pt"><em>rho</em>: update weight</li>
								</ul>
								<ul>
									<li style="font-size: 14pt; list-style: none; padding-left: 0; margin-left: -16px">job-global</li>
									<li style="font-size: 11pt"><em>lambda</em>: parameter for topic-word dist.</li>
									<li style="font-size: 11pt"><em>t</em>: documents/batches seen (index)</li>
								</ul>
							</td>
						</tr>
						<tr>
							<td width=400 style="font-size: 14pt">
								<p><b>Goal</b>: Optimize the topic-word dist. parameter lambda; it tells us what words belong to the topics.</p>
								<p style="font-size: 10pt">[Image source: Hoffman, Blei &amp; Bach, 2010.]</p>
							</td>
							<td width=400 style="font-size: 14pt"><p style="padding-left: 32px">Job-global parameters can be tracked in a state store. Batch-local parameters can be discarded.</p></td>
						</tr>
					</table>
				</section>
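				<section data-markdown data-background-image="img/ZsMwlm2-blueish.jpg">
					<script type="text/template">
<h4>Online LDA: the global update (sketch)</h4>
<p style="font-size: 50%">A minimal sketch of how the job-global parameter <em>lambda</em> is blended with a mini-batch estimate, following the update rule in Hoffman, Blei and Bach (2010). The function names and toy values are illustrative assumptions; <code>lam_hat</code> stands in for the estimate an E step would produce.</p>

```python
def step_size(t, tau0, kappa):
    """rho_t = (tau0 + t) ** (-kappa); the paper takes kappa in (0.5, 1]."""
    return (tau0 + t) ** (-kappa)

def blend_lambda(lam, lam_hat, rho):
    """lambda <- (1 - rho) * lambda + rho * lambda_hat, element-wise (K x V)."""
    return [[(1.0 - rho) * old + rho * new for old, new in zip(row, row_hat)]
            for row, row_hat in zip(lam, lam_hat)]

# Toy values: K = 2 topics, V = 3 words.
lam = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
lam_hat = [[4.0, 1.0, 1.0], [1.0, 1.0, 4.0]]  # hypothetical mini-batch estimate
for t in range(3):
    rho = step_size(t, tau0=64.0, kappa=0.7)
    lam = blend_lambda(lam, lam_hat, rho)
    print(t, rho, lam[0])
```
					</script>
				</section>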

				<section data-background-image="img/ZsMwlm2-greenish.jpg">
					<h2>Online LDA on Beam</h2>
					<table align="left">
						<tr>
							<td width=400 style="vertical-align: top">
								<ul>
									<li style="font-size: 14pt; list-style: none; padding-left: 0; margin-left: -16px">job-constant</li>
								  <li style="font-size: 11pt"><em>K</em>: number of topics</li>
								  <li style="font-size: 11pt"><em>alpha</em>: topic-document dist. prior</li>
								  <li style="font-size: 11pt"><em>eta</em>: topic-word dist. prior</li>
								  <li style="font-size: 11pt"><em>tau0</em>: learning offset (down-weights early iterations)</li>
								  <li style="font-size: 11pt"><em>kappa</em>: decay rate</li>
								</ul>
							</td>
							<td width=400><p style="font-size: 50%;">In Flink, set job parameters globally with setGlobalJobParameters() on the execution config, or pass them as arguments to operators</p></td>
						</tr>
						<tr>
							<td width=400 style="vertical-align: top">
								<ul>
									<li style="font-size: 14pt; list-style: none; padding-left: 0; margin-left: -16px">(mini)batch-local</li>
									<li style="font-size: 11pt"><em>gamma</em>: variational param. for theta</li>
									<li style="font-size: 11pt"><em>theta</em>: per-document topic dist.</li>
									<li style="font-size: 11pt"><em>phi</em>: variational param. for per-word topic assignments</li>
									<li style="font-size: 11pt"><em>rho</em>: update weight</li>
								</ul>
							</td>
							<td width=400><p style="font-size: 50%;">Internal to an operator/task; they don't need to be accessible by other tasks</p></td>
						</tr>
						<tr>
							<td width=400 style="vertical-align: top">
								<ul>
									<li style="font-size: 14pt; list-style: none; padding-left: 0; margin-left: -16px">job-global</li>
									<li style="font-size: 11pt"><em>lambda</em>: parameter for topic-word dist.</li>
									<li style="font-size: 11pt"><em>t</em>: documents/batches seen (index)</li>
								</ul>
							</td>
							<td width=400 style="vertical-align: top">
								<ul style="margin-left: 18px;">
									<li style="font-size: 14pt; list-style: none; padding-left: 0; margin-left: -24px">Several options:</li>
									<li style="font-size: 11pt">External key-value store</li>
									<li style="font-size: 11pt">Accumulators: <em>lambda</em> update as increment, accumulator + broadcast variables</li>
									<li style="font-size: 11pt">Broadcast state</li>
								</ul>
							</td>
						</tr>
						<tr>
							<td colspan=2><p style="font-size: 40%">(Mini)batches are determined by the window size</p></td>
						</tr>
					</table>
				</section>
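				<section data-markdown data-background-image="img/ZsMwlm2-blueish.jpg">
					<script type="text/template">
<h4>Windows as mini-batches (sketch)</h4>
<p style="font-size: 50%">A minimal sketch of the state flow described on the previous slide, written in plain Python rather than against the Beam or Flink APIs. <code>count_windows</code>, <code>run</code> and <code>count_docs</code> are hypothetical helpers; a real <code>update</code> would perform the E and M steps for one window.</p>

```python
from itertools import islice

def count_windows(stream, size):
    """Group an iterable of documents into fixed-size count windows."""
    it = iter(stream)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def run(stream, size, update, state):
    """Fold each window into job-global state; batch-local values
    live only inside `update` and are discarded afterwards."""
    for batch in count_windows(stream, size):
        state = update(state, batch)
        state["t"] += 1  # batches seen so far
    return state

# Stand-in update that only counts documents, to show the state flow.
def count_docs(state, batch):
    return {**state, "docs": state["docs"] + len(batch)}

final = run(["d1", "d2", "d3", "d4", "d5"], size=2,
            update=count_docs, state={"t": 0, "docs": 0})
print(final)
```
					</script>
				</section>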

                <section data-background-image="img/ZsMwlm2-purpleish.jpg">
                    <h3>OnlineLDA Pipeline</h3>
                    <img src='img/pipeline.png' style='max-width: 80%; border: none; background: none;box-shadow: none; '/>
                </section>

				<section data-background-image="img/ZsMwlm2-redish.jpg">
					<h2>So what?</h2>
				</section>

				<section data-background-image="img/ZsMwlm2-purpleish.jpg">
					<ul>
						<li style="font-size: 90%">For many tasks, the default is batch training/optimization</li>
						<li style="font-size: 90%">Often this holds even when good online training and optimization algorithms exist (why?)</li>
						<li style="font-size: 90%">So if your algorithms only require one step of parameter state, why not just do streaming?</li>
						<li style="font-size: 90%">You can get instant, always up-to-date results by investing in a different class of optimization algorithms and a stream processing engine, so just do that</li>
					</ul>
				</section>

        <section data-background-image="img/ZsMwlm2-redish.jpg">
            <h2>Links</h2>
            <ul>
                <li>Online Learning for Latent Dirichlet Allocation - http://papers.nips.cc/paper/3902-online-learning-for-latent-dirichlet-allocation
                </li><br/>
                <li>Learning from LDA using Deep Neural Networks - https://arxiv.org/pdf/1508.01011.pdf</li><br/>
                <li>Deep Belief Nets for Topic Modeling - https://arxiv.org/pdf/1501.04325.pdf</li><br/>
                <li>Topic Modeling on Jira issues: https://github.com/tteofili/jtm</li>
            </ul>
        </section>

        <section data-background-image="img/ZsMwlm2-blueish.jpg">
          <h3>Credits</h3>
          <ul>
            <li>Tommaso Teofili, Simone Tripodi (Adobe - Rome)</li>
            <li>Joern Kottmann, Bruno Kinoshita (Apache OpenNLP)</li>
            <li>Fabian Hueske (Apache Flink, Data Artisans)</li>
          </ul>
        </section>
        <section data-background-image="img/ZsMwlm2.jpg">
          <h2>Questions ???</h2>
        </section>

			</div>
		</div>

		<script src="lib/js/head.min.js"></script>
		<script src="js/reveal.js"></script>

		<script>
			// More info about config & dependencies:
			// - https://github.com/hakimel/reveal.js#configuration
			// - https://github.com/hakimel/reveal.js#dependencies
			Reveal.initialize({
        controls: false,
        progress: true,
        history: true,
        center: true,
        slideNumber: true,
        transition: 'slide', // none/fade/slide/convex/concave/zoom
        keyboard: true,
        touch: true,
        loop: false,
        fragments: true,
				dependencies: [
					{ src: 'plugin/markdown/marked.js' },
					{ src: 'plugin/markdown/markdown.js' },
					{ src: 'plugin/notes/notes.js', async: true },
					{ src: 'plugin/highlight/highlight.js', async: true, callback: function() { hljs.initHighlightingOnLoad(); } }
				]
			});
		</script>
	</body>
</html>