Text By the Bay has ended

Track C
Friday, April 24


Learning The Semantics of Millions of Entities
What do "Software Developer", "MTS", and "Code Monkey" have in common? No, it's not the start to a bad joke. It's actually a situation where many unique entities turn out to have a small number of distinct semantics. This talk will present new techniques for mapping millions of such entities into a common semantic space with just a few thousand labels. We'll discuss how these techniques have been applied to job titles, skills, majors, and degrees to build candidate recommendation systems.


Vlad Giverts

@vladgiverts
Vlad is a Sr Director at Workday responsible for building prediction and recommendation systems for the company's cloud HR and Financial Management products. He was previously CTO at Identified, a data and analytics startup, which was acquired by Workday in early 2014...

Friday April 24, 2015 10:30am - 11:10am


Increasing Honesty in Airbnb Reviews
Reviews and reputation scores are increasingly important for decision-making, especially in the case of online marketplaces. However, online reviews may not provide an accurate depiction of the characteristics of a product, either because many people do not leave reviews or because some reviewers omit salient information.

At Airbnb, we study the causes and magnitude of bias in online reviews by using large-scale field experiments that change the incentives of buyers and sellers to honestly review each other. Natural language processing has allowed us to extend our analyses and study bias in reviews by using the written feedback guests and hosts write after a trip.


Dave Holtz

@daveholtz
Dave Holtz is a data scientist at Airbnb focusing on online reputation and pricing. Previously, he worked as a data science engineer at Yub (acquired by Coupons.com) and as a data scientist and product manager at TrialPay. He is the instructor for Udacity’s Introduction...

Friday April 24, 2015 11:20am - 11:40am


Human Curated Linguistics - technology behind Cognitive Analytics
This presentation will provide an overview of the Human Curated Linguistics (HCL) technology developed and used by DataLingvo in their Cognitive Analytics platform. HCL provides industry-first real-time comprehension of free-form language and the guaranteed answer correctness required for cognitive analytics applications.

Details of the technical implementation and development stack will be discussed.

A live demonstration will show how HCL answers questions asked in free-form language about business data from Google Analytics and salesforce.com data sources.


Nikita Ivanov

@c64hacker
Nikita is a strategic technology advisor to DataLingvo, a cognitive analytics stealth startup that developed a breakthrough Human Curated Linguistics (HCL) technology. Nikita is also a founder and CTO for GridGain Systems, a company behind Apache Ignite, an industry leading...

Friday April 24, 2015 11:50am - 12:30pm


TopicStream, an Application and Architecture for Content Integration in Electronic Reading
The most popular ebook readers inherit from paper books the limiting concept of pagination. In electronic reading, pagination is not only notoriously difficult for scientific/technical/medical (STM) content, but it also locks content into a single dimension of consumption. We radically depart from pages and propose to split content into smaller, semantically self-contained 'tiles.' In contrast to pages, tiles can be more easily related to other tiles that can come from different books, GitHub repositories, StackOverflow discussions, Wikipedia, official documentation from the WWW, etc. Collections of documents from these other sources can be packaged as pre-tiled EPUB3 ebooks. The TopicStream app enables seamless navigation between book content and complementary documents without the need to explicitly open or close document collections. This approach adds value to commercial content in today's world, where a lot of relevant information is available online.


Jacek Ambroziak

@JacekAmbroziak
During the 90's Jacek was a Research Scientist with a Conceptual Indexing project at Sun Microsystems Labs. Then as a member of Sun's XML Tech Center he wrote an XML full-text search engine (used in JavaHelp and OpenOffice) and an XSLT to JVM optimizing compiler (now...

Friday April 24, 2015 1:30pm - 2:10pm


A Web Worth of Data: Common Crawl for NLP
The Common Crawl corpus contains petabytes of web crawl data and is a treasure trove of potential experiments. To introduce you to the possibilities that web crawl data has for NLP, we will take a detailed look at how the data has been used by various experiments and how to get started with the data yourself.

Stephen Merity

Senior Research Scientist, Salesforce Research
Stephen Merity is a senior research scientist at MetaMind, part of Salesforce Research, where he works on researching and implementing deep learning models for vision and text, with a focus on memory networks and neural attention mechanisms for computer vision and natural language...

Friday April 24, 2015 2:20pm - 3:00pm


Knowledge Maps for Content Discovery
Content discoverability and composition are a moving target in education, especially when matching content complexity and media diversity to the needs of different students. In this talk I will describe our methods of performing large-scale categorization of online courses using a crowd-sourced taxonomy. Our methodology is agnostic to the medium of the content, whether text, images, or video, and uses Wikipedia as a taxonomy for semi-labeled categorization of content. I will also demo a visualization of Versal’s Knowledge Map, a "Google Maps" for content exploration.


Oren Schaedel

Oren is a data scientist at Versal where he leads internal analytics, data engineering, and data-driven research projects. He received a PhD from Caltech focusing on statistical methods for understanding gene regulation of decision making in animals. He consults on integrating statistical...

Friday April 24, 2015 3:20pm - 3:40pm


Identity Resolution in the Sharing Economy
A growing sharing economy demands new, cost-effective ways of establishing and checking identity, so that services and participants can accurately assess risks and make good choices.

For example, Airbnb verifies offline identities using a scan of your driver’s license or passport. This is checked against templates designed to examine things like the layout and other government indicators of authenticity to help confirm that it appears to be valid. Crucially it involves checking an applicant’s entered name – often in Latin script – against their name on the scanned document, which may be in another script or language, and subject to potentially egregious OCR errors.

More generally, connecting the public and private traces that people, organizations and things — like vehicles — leave in various information stores is essential to delivering valuable analytics and novel services. This is often called entity analytics or identity resolution.

In this talk, we will explore enabling technology in both structured and unstructured contexts, discuss current challenges and limitations, and explore additional examples.


David Murgatroyd

Basis Technology
@dmurga
Dave has been building NLP systems since 1998. As VP, Engineering at Basis Technology he gives leadership to a few dozen great engineers delivering NLP goodness to customers around the world tackling social media, watch-list screening, multilingual search, the sharing economy...

Friday April 24, 2015 4:00pm - 4:40pm

Saturday, April 25


Turning the Web into a Structured Database
For the web to truly progress, information must be able to seamlessly flow between your devices, services, and applications. A truly mainstream solution requires building a new kind of search, one that can see the entire web as structured information, rather than documents. This session highlights Diffbot’s novel approach to translating the web into a machine-readable format using a combination of NLP, computer vision, and machine learning.


Mike Tung

@miketung
Mike Tung is the founder and CEO of Diffbot, the leading web extraction platform, as well as an adviser at StartX, Stanford’s startup accelerator, and the leader of Stanford's DARPA Robotics Challenge entry. In a previous life, he was a patent lawyer, a grad student in...

Saturday April 25, 2015 10:20am - 11:00am


Transforming Unstructured Offer Titles

VigLink helps publishers monetize content by affiliating existing commercial links and automatically identifying product references that can be linked to commercial sites. At VigLink, we have an ever-growing catalogue of product offers (~330M) from multiple sources (e.g. Amazon, eBay, Shopzilla) and verticals (from Automotive to Consumer Electronics to Home & Garden).

Offers are usually unstructured text items. For many applications where similar offers should be found or offer titles need to be linked to websites, it's beneficial to recognize the characteristics of individual offers instead of working on unstructured offer titles directly. In this talk I will discuss what the relevant aspects of an offer are and present an approach to automatically extract these pieces of information. I will also briefly touch on possible applications built on top of such structured offers.


Katrin Tomanek

Katrin is a Sr Data Scientist at VigLink where she develops and optimizes NLP components for product entity recognition, their linking to offer databases, and the transformation of original offer titles into structured information. This includes amongst others statistical modeling...

Saturday April 25, 2015 11:10am - 11:30am


Introduction to RDF and Linked Data
For most NLP practitioners, the Web is seen as a Web of Documents where information is waiting to be extracted. But what if we had access to a Web of machine-readable Data instead? That is basically the promise behind RDF and Linked Data and it is already here.

In this talk, I will demystify those concepts and technologies, and show you the fascinating world of Linked Open Data.


Alexandre Bertails

@bertails
Scala & Linked Data Dev. Former W3C.

Saturday April 25, 2015 11:40am - 12:20pm


Practical NLP Applications of Deep Learning
Deep Learning is the hot "new" technique in the world of Machine Learning, but most of the published benefits of Deep Learning have been tied to audio and visual data. There are, however, significant benefits users can draw from Deep Learning, particularly in the area of unsupervised representation learning. This talk focuses on the practical applications of these techniques, particularly neural network word embeddings. I also explore how Mattermark uses these techniques to perform many ML and NLP tasks.

Samiur Rahman

Head of Data Engineering, Mattermark

Saturday April 25, 2015 1:20pm - 2:00pm


Deep Learning for Natural Language Processing
In this talk, I will describe deep learning algorithms that learn representations for language that are useful for solving a variety of complex language problems. I will focus on three tasks: fine-grained sentiment analysis; question answering to win trivia competitions (like Watson's Jeopardy system but with one neural network); and multimodal sentence-image embeddings (with a fun demo!) to find images that visualize sentences. I will also show some demos of how deepNLP can be made easy to use with MetaMind.io's software.


Richard Socher

@RichardSocher
Richard Socher is the CTO and founder of MetaMind, a startup that seeks to improve artificial intelligence and make it widely accessible. He obtained his PhD from Stanford working on deep learning with Chris Manning and Andrew Ng. He is interested in developing new AI...

Saturday April 25, 2015 2:10pm - 2:50pm


Extended Swadesh List
Do differences between natural languages increase as time passes? An improved version of an extended Swadesh list of basic meanings has been developed to help answer this question by means of lexicostatistical analysis. The new instrument and the development process behind it will be of interest to researchers who study stability of meaning-word pairs and develop NLP methods for identification of cognates.


Dmitry Gusev

Purdue University
@dmitri_a_gusev
Dmitri A. Gusev received his Ph.D. in Computer Science from Indiana University in 1999 and worked for Eastman Kodak Company as an image processing scientist in 1999-2007. Since 2013, he has been an Associate Professor in Computer and Information Technology (CIT) at Purdue University...

Saturday April 25, 2015 3:00pm - 3:20pm


Incentivized Question and Answer Data
Poshly incentivizes its users to participate in online, dynamically generated surveys. The questions and answers are written by our team of subject matter experts from the cosmetic industry. This talk will cover the basics of our approach: how our content is created and how it is selected. Depending on time and interest, we can also touch on aspects of our data pipeline, which was developed in Scala.


Matthew Drescher

After getting kicked off his corner by an angry mime while busking with his saxophone in Victoria B.C., Matthew decided that it was time to change gears. Escaping music school with the mime hot on his tail, Matthew discovered mathematics and theoretical computer science and graduated...

Saturday April 25, 2015 3:30pm - 4:00pm


ML Scoring: Where Machine Learning Meets Search
Search can be viewed as a combination of (a) a constraint satisfaction problem, the process of finding a solution to a set of constraints (the query) that impose conditions the variables (fields) must satisfy, with each resulting object (document) being a solution in the feasible region (the result set), and (b) a scoring/ranking problem of assigning values to different alternatives according to some convenient scale. This ultimately provides a mechanism to sort the alternatives in the result set in order of importance, value, or preference.

In particular, scoring in search has evolved from a document-centric calculation (e.g. TF-IDF), rooted in information retrieval, to a function that is more context-sensitive (e.g. including geo-distance ranking) or user-centric (e.g. taking user parameters for personalization), and that may depend on other factors specific to the domain and task at hand. However, most systems that incorporate machine learning techniques to perform classification or generate scores for these specialized tasks do so as a post-retrieval re-ranking function, outside of search!

In this talk I show ways of incorporating advanced scoring functions, based on supervised learning and bid scaling models, into popular search engines such as Elasticsearch and Solr. I'll provide practical examples of how to construct such "ML Scoring" plugins in search to generalize the application of a search engine as a model evaluator for supervised learning tasks. This will facilitate the building of systems for computational advertising, recommendations, and specialized search, applicable to many domains.


Joaquin Delgado

Verizon OnCue
@joaquind
Joaquin A. Delgado, PhD, is currently Director of Advertising and Recommendations at OnCue (acquired by Verizon). Before that he held CTO positions at AdBrite, Lending Club and TripleHop Technologies (acquired by Oracle). He was also Director of Engineering and Sr...

Diana Hu

Exploring the depths through breadth in flow, perception and data

Saturday April 25, 2015 4:00pm - 4:40pm