Text By the Bay has ended

Track B
Friday, April 24


NLP And Deep Learning: Working with Neural Word Embeddings
In this talk, we will cover how to work with neural nets and text. This will encompass a comparison of the different algorithms and analogies to more traditional methods such as bag of words, followed by a basic overview of how to use two different neural network architectures for sequential and document classification.
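For readers unfamiliar with the traditional baseline mentioned above, here is a minimal bag-of-words sketch. It is not taken from the talk; the tokenizer (lowercase plus whitespace split) and the two sample sentences are assumptions chosen for illustration.

```python
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count each token."""
    return Counter(text.lower().split())


def to_vector(counts, vocabulary):
    """Project token counts onto a fixed, ordered vocabulary."""
    return [counts[word] for word in vocabulary]


# Hypothetical sample documents, used only to show the shape of the output.
docs = ["the cat sat on the mat", "the dog sat"]
vocabulary = sorted({word for doc in docs for word in doc.lower().split()})
vectors = [to_vector(bag_of_words(doc), vocabulary) for doc in docs]

for doc, vector in zip(docs, vectors):
    print(doc, vector)
```

Each document becomes a sparse count vector with one dimension per vocabulary word and no notion of word similarity, which is the gap that dense word embeddings aim to close.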


Adam Gibson

@agibsonccc Adam is the founder and tech lead of Skymind, the commercial arm of the open source project deeplearning4j. He is also an O'Reilly author of the upcoming book Deep Learning: A Practitioner's Approach.

Friday April 24, 2015 10:30am - 11:10am


Discovering Knowledge in Linked Data
By building on the foundations of the Semantic Web, we can create tools that help people explore relationships between data, connect information, and discover knowledge. In this talk, we'll look at how to search Wikidata from a graph database via a domain-specific language. We'll be able to ask simple questions such as "What happened on this day in history?", and tricky questions such as "What were some of the fields of work of physicists who worked at institutions where Richard Feynman also worked?".
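The "tricky question" above amounts to a multi-hop traversal over subject-predicate-object triples. The sketch below is an assumption-laden illustration in plain Python, not the speaker's domain-specific language or the Wikidata API: the helper names (`objects`, `subjects`, `fields_of_colleagues`), the predicate labels, and the tiny hand-entered triple list are all hypothetical.

```python
# Toy triples (subject, predicate, object), hand-entered for illustration only.
TRIPLES = [
    ("Richard Feynman", "employer", "Caltech"),
    ("Richard Feynman", "employer", "Cornell University"),
    ("Richard Feynman", "occupation", "physicist"),
    ("Murray Gell-Mann", "employer", "Caltech"),
    ("Murray Gell-Mann", "occupation", "physicist"),
    ("Murray Gell-Mann", "field of work", "particle physics"),
    ("Hans Bethe", "employer", "Cornell University"),
    ("Hans Bethe", "occupation", "physicist"),
    ("Hans Bethe", "field of work", "nuclear physics"),
]


def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is in the graph."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def subjects(predicate, obj):
    """All subjects s such that (s, predicate, obj) is in the graph."""
    return {s for s, p, o in TRIPLES if p == predicate and o == obj}


def fields_of_colleagues(person):
    """Fields of work of physicists who share an employer with `person`."""
    fields = set()
    for institution in objects(person, "employer"):
        for colleague in subjects("employer", institution) - {person}:
            if "physicist" in objects(colleague, "occupation"):
                fields |= objects(colleague, "field of work")
    return fields


print(sorted(fields_of_colleagues("Richard Feynman")))
```

A query language would presumably let the same three hops (person to employer, employer to colleague, colleague to field) be stated declaratively instead of as nested loops.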


James Earl Douglas

@jearldouglas A functional programmer with advanced experience developing production software in Scala and Java, James is passionate about continuous learning and keeping just outside of his comfort zone. When not knee-deep in type theory, James can be found running, cycling, and...

Friday April 24, 2015 11:20am - 11:40am


Organizing Real Estate Photo Collections with Deep Learning
Real estate websites like Trulia and Zillow host millions of property listings, with each listing consisting of a rich textual description and images of the property. While rich in information, the discoverability of this data is limited by its unstructured nature. For example, how do we learn whether "granite countertops" is an interesting real estate term? And if it is, how can we assign it to one of the many photos associated with the property?

In this talk we detail our approach to organizing Trulia's unstructured content into rich photo collections similar to Houzz.com or Zillow Digs, without the need for any explicit user tagging.

By leveraging the recent advances in deep learning for computer vision and NLP, we first automatically construct a knowledge base of relevant real estate terms and then annotate our photo collections by fusing knowledge from a deep convolutional network for image recognition and a word embedding model.

The novelty in our approach lies in our ability to scale to a large vocabulary of real estate terms without explicitly training a vision model for each one of them.


Shourabh Rawat

@shrawat87 Shourabh Rawat is a senior data scientist at Trulia Inc., based in San Francisco. He is an applied researcher at the intersection of machine learning, deep learning, NLP and computer vision. He received his Master's in Language Technologies from Carnegie Mellon University...

Friday April 24, 2015 11:50am - 12:30pm


Unlocking Our Health Data: Transforming Unstructured Data at Scale
Each of us has a plethora of health data that resides in unstructured, non-standard formats and silos. Bringing this data together can reveal powerful insights about our health, but proves to be a staggering technical challenge. Unstructured narratives contain key pieces of information that cannot easily be extracted without additional processing.

We are building a system to organize this unstructured data, classify it into known topics, and apply additional levels of normalizations -- all in near real-time and at scale. This talk will cover some of the technical challenges we are facing and how we are solving them with machine learning and natural language processing techniques.

Ola Wiberg

Co-founder, Human API
@olawiberg As Co-founder and VP of Engineering at Human API, Ola Wiberg is responsible for infrastructure development, data management, and information security.

Friday April 24, 2015 1:30pm - 2:10pm


Unsupervised NLP Tutorial using Apache Spark
Paraphrasing Tim O'Reilly, the person who has the most data wins. That's a neat slogan, but the more data one has, the more likely it is to be unlabeled. Unfortunately, there aren't that many unsupervised learning algorithms out there, for machine learning in general and for NLP in particular. Recent advances in deep learning provide new tools for text mining of large unlabeled datasets. In particular, I will talk about the math, intuition and implementation of the word2vec algorithm, its variants (skip-gram and continuous bag of words), use cases, and extensions (e.g. paragraph2vec, doc2vec). I will wrap up with a simple demonstration at scale using Scala, Apache Spark, MLlib, and the Apache Zeppelin Notebook.
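The two word2vec variants named above differ mainly in how training examples are cut from a token stream. The following sketch shows only that data-preparation step, under the assumption of a fixed symmetric context window; it is not the speaker's Spark code, it trains no model, and the function names are hypothetical.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: each center word predicts every word in its context window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW: the surrounding context words jointly predict the center word."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:
            examples.append((context, target))
    return examples


tokens = "the quick brown fox".split()
print(skipgram_pairs(tokens, window=1))
print(cbow_examples(tokens, window=1))
```

Because the examples come straight from raw text, no labels are required, which is why this family of methods suits the large unlabeled datasets the abstract describes.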

Marek Kolodziej

Principal Research Engineer, Nitro
Marek Kolodziej is a Principal Research Engineer at Nitro, Inc. He's been working on a diverse set of machine learning, distributed computing and big data problems for the past 6 years, and statistics and econometrics for the past 11. His current passion is deep learning and GPU computing...

Friday April 24, 2015 2:20pm - 3:00pm


Statistical Machine Translation Approach for Name Matching in Record Linkage
Record linkage, or entity resolution, is an important area of data mining. Name matching is a key component of systems for record linkage. Alternative spellings of the same name are a common occurrence in many applications. We use the largest collection of genealogy person records in the world together with user search query logs to build name-matching models. The procedure for building a crowd-sourced training set is outlined together with the presentation of our method. We cast the problem of learning alternative spellings as a machine translation problem at the character level. We use information retrieval evaluation methodology to show that, on our data, this method substantially outperforms a number of standard, well-known phonetic and string similarity methods in terms of precision and recall. Our result can lead to a significant practical impact in entity resolution applications.
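As a point of reference for the "standard string similarity methods" the abstract compares against, here is a minimal sketch of one such baseline, Levenshtein edit distance. This is not the character-level translation model from the talk, and the normalization in `similarity` is an assumption made for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Scale edit distance into [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


print(edit_distance("smith", "smyth"), similarity("smith", "smyth"))
```

A baseline like this treats every character edit as equally likely, whereas a model learned from real spelling variants can weight edits by how often they actually occur in names.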


Jeffrey Sukhare

BS, MS Computer Science UC Santa Cruz, PhD candidate Computer Science UC Davis.  Senior Data Scientist at Ancestry.com working on record linkage applications. 

Friday April 24, 2015 3:20pm - 3:40pm


Relation Extraction using Distant Supervision, SVMs, and Probabilistic First Order Logic
Why do we want information? So that we can use it? So that our computers can use it? When we have access to rich, structured information we can make advanced applications that solve real-world pain points.

In this talk, I'll present an effective approach for automatically creating knowledge bases: databases of factual, general information. This relation extraction approach centers around the idea that we can use machine learning and natural language processing to automatically recognize information as it exists in real-world, unstructured text.

I'll cover the NLP tools, special ML considerations, and novel methods for creating a successful end-to-end relation extraction system. I will also cover experimental results with this system architecture in both big-data and search-oriented environments.


Malcolm Greaves

Software engineer at Nitro. Works on crafting solutions to large-scale machine learning and natural language processing problems. Practitioner of functional programming practices. Keeps up to date with latest published algorithms and ideas, innovation through hacking novel solutions...

Friday April 24, 2015 4:00pm - 4:40pm
Saturday, April 25


Semantic Indexing of Four Million Documents with Apache Spark
Latent Semantic Analysis (LSA) is a technique in natural language processing and information retrieval that seeks to better understand the latent relationships and concepts in large corpuses. In this talk, we’ll walk through what it looks like to apply LSA to the full set of documents in English Wikipedia, using Apache Spark. Harnessing the Stanford CoreNLP library for lemmatization and MLlib’s scalable SVD implementation for uncovering a lower-dimensional representation of the data, we’ll undertake the modest task of enabling queries against the full extent of human knowledge, based on latent semantic relationships.


Sandy Ryza

@sandysifting Sandy is a data scientist at Cloudera focusing on Apache Spark and its ecosystem, and an author of the upcoming O'Reilly publication Advanced Analytics with Spark. He is a frequent Spark contributor as well as a member of the Apache Hadoop project management committee...

Saturday April 25, 2015 10:20am - 11:00am


Large Scale Topic Assignment on Multiple Social Networks
Inferring users' interests and expertise is a challenging problem with applications in various data-powered products. In this talk, we present a full production system used at Lithium Technologies (Klout), which mines topical interests and expertise from multiple social networks and assigns over 10,000 topics to hundreds of millions of users on a daily basis.

The system generates a diverse set of features derived from signals such as user-generated posts and profiles, user reactions such as comments and retweets, user attributions such as lists, tags and endorsements, as well as signals based on social graph connections. We show that using cross-network information with a diverse set of features for a user leads to a more complete and accurate understanding of the user's topics, as compared to using any single network or any single source.


Adithya Rao

@AdithyaRao Adithya is the Lead Research Engineer in the Data Science team at Lithium Technologies. Some of the projects he has worked on include the Klout score, topic extraction, user targeting and social content discovery. Before his current role, he graduated with a Master's degree...

Nemanja Spasojevic

@sofronije Nemanja is the Director of Data Science at Lithium Technologies, and leads all the data science and engineering efforts around building scalable systems for topical text mining, user scoring and content relevance. Prior to his current role, he was at Google for...

Saturday April 25, 2015 11:10am - 11:30am


Classifying Text without (many) Labels
Supervised text classification is often hampered by the need to acquire relatively expensive labeled training sets. In some embodiments of the systems and methods disclosed herein, pre-existing Word2Vec or similar algorithms are leveraged to create vector representations of documents that enable a model to be successfully trained with a drastically reduced training set. By using this technique, the implementer needs to acquire only a small volume of labeled examples in order to train proximity thresholds, rather than devoting significant resources to traditional text classification algorithms, which typically require training sets that are orders of magnitude larger.
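One way to read the idea above is: average pre-trained word vectors into a document vector, then label a document by its proximity to a class centroid. The sketch below follows that reading under loud assumptions. The 3-dimensional `WORD_VECTORS` table is made up, the threshold is fixed by hand instead of being tuned on a small labeled set, and none of the function names come from the talk.

```python
import math

# Hypothetical 3-dimensional embeddings; a real system would load
# pre-trained word2vec vectors with hundreds of dimensions.
WORD_VECTORS = {
    "goal": [0.9, 0.1, 0.0],
    "match": [0.8, 0.2, 0.1],
    "team": [0.7, 0.3, 0.0],
    "stock": [0.1, 0.9, 0.1],
    "market": [0.0, 0.8, 0.2],
    "shares": [0.1, 0.9, 0.0],
}


def document_vector(text):
    """Average the vectors of known words; None if no word is known."""
    vectors = [WORD_VECTORS[w] for w in text.lower().split() if w in WORD_VECTORS]
    if not vectors:
        return None
    return [sum(dimension) / len(vectors) for dimension in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify(text, centroids, threshold=0.8):
    """Return the closest label if it clears the proximity threshold."""
    vector = document_vector(text)
    if vector is None:
        return None
    label, score = max(
        ((name, cosine(vector, centroid)) for name, centroid in centroids.items()),
        key=lambda pair: pair[1],
    )
    return label if score >= threshold else None


# A handful of labeled seed words stands in for the small labeled set.
centroids = {
    "sports": document_vector("team match goal"),
    "finance": document_vector("stock market shares"),
}

print(classify("the team scored a goal", centroids))
print(classify("shares fell as the market closed", centroids))
```

In this reading, the few labeled examples are spent on choosing centroids and the threshold, while the bulk of the language knowledge comes from embeddings trained without labels.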


Mike Tamir

@MikeTamir With over a decade of teaching experience at the university level, Mike serves as CSO/Head of Education and Data Science Programming for Galvanize. Mike also helped to found the GalvanizeU Master's program focused on developing the skills required of high performing Data...

Saturday April 25, 2015 11:40am - 12:20pm


Learning From the Diner's Experience
I will talk about how we are using data science to help transform OpenTable into a local dining expert who knows you very well, and can help you and others find the best dining experience wherever you travel! This entails a whole slew of tools from natural language processing, recommendation system engineering, and sentiment analysis that have to work in sync to make that magical experience happen. One of our main sources of insight is the reviews left by diners on our website. In this talk, I will focus on what we are learning from our rich set of diner reviews, especially using topic modeling as a core tool. I will touch upon various possible applications of this technique that we are currently exploring in both restaurateur-facing and diner-facing features.

Sudeep Das

Data Scientist, OpenTable
@datamusing Sudeep Das is a Data Scientist at OpenTable, where his main focus is on mining reviews and restaurant data to extract actionable insights and enable a personalized dining experience. He has broad experience with NLP methods, especially topic modeling and its applications...

Saturday April 25, 2015 1:20pm - 2:00pm


Identifying Events with Tweets
Gabor is a Staff Data Scientist at Twitter, and works on describing and predicting user behavior and modeling large-scale content dynamics on Twitter. Before that he worked on predicting content popularity in crowdsourced ecologies, on the network analysis of online services, and did research on large-scale social and biological systems. Previously, he worked at HP Labs, Harvard Medical School, and the University of Notre Dame.


Gabor Szabo

@gaborjszabo Gabor is a Staff Data Scientist at Twitter, and works on describing and predicting user behavior and modeling large-scale content dynamics on Twitter. Before that he worked on predicting content popularity in crowdsourced ecologies, on the network analysis of online services...

Saturday April 25, 2015 2:10pm - 2:50pm


How Terminal makes Machine Learning Fast and Fun
This talk will explain how Terminal works and how people are using it in machine learning applications and big data analysis (for example, out-of-the-box multi-tenant Spark clusters).


Varun Ganapathi

@varungp Varun Ganapathi was a PhD candidate at Stanford when he co-founded EyeApps and Numovis. EyeApps created a paid photography app, Pro HDR, for iPhone and Android which has sold more than a million copies. After Numovis was acquired by Google, Varun spent two years as a Senior...

Saturday April 25, 2015 3:00pm - 3:20pm


Identifying CrunchBase Entities in News Articles
We will discuss performing record linkage on entities identified in news articles scraped from the web. Further, we will discuss the challenges of working with user-edited entities that are constantly changing.

Gershon Bialer


Saturday April 25, 2015 3:30pm - 4:00pm


A Word is Worth a Thousand Vectors
Standard natural language processing (NLP) is a messy and difficult affair. It requires teaching a computer about English-specific word ambiguities as well as the hierarchical, sparse nature of words in sentences. At Stitch Fix, word vectors help computers learn from the raw text in customer notes. Our systems need to identify a medical professional when she writes that she 'used to wear scrubs to work', and distill 'taking a trip' into a Fix for vacation clothing. Applied appropriately, word vectors are dramatically more meaningful and more flexible than current techniques and let computers peer into text in a fundamentally new way. I'll speak about word2vec and related techniques, and try to convince you that word vectors give us a simple and flexible platform for understanding text.


Christopher Erick Moody

@chrisemoody Chris loves high performance computing, high dimensions & high fashion. He loves learning the beautiful symmetries between physics, data, and analytics. Hails from Spain & South Carolina. Went to Caltech, did astrostats & supercomputing, and now Data Labs at Stitch Fix...

Saturday April 25, 2015 4:00pm - 4:40pm