Text By the Bay has ended


Track A
Friday, April 24


Label Quality in the Era of Big Data
Organizations that develop and use technologies around information retrieval, machine learning, recommender systems, and natural language processing depend on labels for engineering and experimentation. These labels, usually gathered via human computation, are used in machine learned models for prediction and evaluation purposes. In such scenarios collecting high quality labels is a very important part of the overall process. We elaborate on these challenges and explore possible solutions for collecting high quality labels at large scale.


Omar Alonso

Tech lead at Microsoft. @elunca

Friday April 24, 2015 10:30am - 11:10am


Reviving the Traditional Russian Orthography for the 21st Century
Virtually all the world-famous classic works of Russian literature were written in what is now called "old" Russian orthography, which was banned in the Soviet Russia in 1918. For this reason, the traditional Russian orthography has no support in modern operating systems and text processing software. To publish a critical edition of Tolstoy or Dostoyevsky in the original is a veritable challenge today! I describe my efforts to bring the "old" Russian orthography to the desktop. I will demonstrate a keyboard layout, a spelling dictionary, and a converter between old and modern spelling.


Sergei Winitzki

Senior Software Engineer, Workday Inc.
Theoretical physicist turned software engineer, passionate for functional programming, functional type theory, and declarative domain-specific languages.

Friday April 24, 2015 11:20am - 11:40am


Teaching Machines to Read for Fun and Profit
Kang Sun from the R&D Machine Learning group will speak about Bloomberg’s current projects in machine learning and natural language processing, such as sentiment analysis of financial news, market impact prediction, and question answering. There will be a discussion of future directions as well as a Q&A session.


Kang Sun


Friday April 24, 2015 11:50am - 12:30pm


The Art of PDF Processing
The algorithms that power PDF understanding.


Roman Lasskly

@rlasskiy
Roman's software lets you see most of the PDFs in the world.

Friday April 24, 2015 1:30pm - 2:10pm


Near-Realtime Webpage Recommendations “One at a time” Using Content Features
Today, information overload is a problem pertinent to most information systems used on a daily basis, with the World Wide Web chief among them. One of the key goals of StumbleUpon, a web content recommendation platform, is to ease this overload while empowering discovery of relevant information. Our subscription to the “one recommendation at a time” concept focuses on producing an experience of serendipity as users continue to surf the web, while giving us the flexibility to make recommendations reactively in near-realtime. In this presentation we will describe the challenges that must be addressed to extract content features from a web page and make near-realtime recommendations using them. We will describe the main algorithmic approach as well as the general architecture, motivating our choices of tools, languages and platforms.


Ashok Venkatesan

@vashok
Ashok Venkatesan is Senior Research Engineer at StumbleUpon Inc. He has worked extensively in the areas of Recommendation Systems, Topic Modeling, Text Mining and Machine Learning. Prior to his experience in industry, he completed an MS in Computer Science at Arizona State...

Friday April 24, 2015 2:20pm - 3:00pm


A High Level Overview of Genomics in Personalized Medicine
Nearly twenty years ago President Clinton announced the completion of one of the largest public/private collaborative efforts in history, the first draft of the human genome. This work promised to bring forth a new era of totally personalized medicine, where the unique blueprint for your body is used to determine the most effective treatment options for you as an individual. This promise is finally starting to be realized in the field of oncology, among others. I will give a high-level overview of medical genomics with an emphasis on my area of expertise, using it to guide decision making in oncology.


John St. John

Driver Genomics
John is the Director of Bioinformatics at Driver Group, a new startup in the cancer genomics and therapeutics space. Driver Group is currently delivering cutting edge personalized drug recommendations to cancer patients, and identifying opportunities to bring new kinds of drugs to...

Friday April 24, 2015 3:20pm - 3:40pm


Scalable Genome Analysis With ADAM
Thanks to substantial improvements in the cost and throughput of DNA sequencing machines, genomic data may soon make personalized medicine a reality. However, significant processing is needed to turn raw DNA strings captured by sequencers into clinically useful data, and modern DNA processing software can take up to a week to run. In this talk, we'll look at how we reconstruct genomes from the raw sequence data, and we introduce ADAM, an Apache Spark-based API for accelerating genome processing pipelines.


Frank Nothaft

AMPLab, UC Berkeley
@fnothaft
Frank Austin Nothaft is an MS/PhD student in Computer Science at UC Berkeley. Frank's research focuses on optimizing commodity distributed systems for scientific applications, and then using these systems to explore biological phenomena. Frank works with Professor David Patterson...

Friday April 24, 2015 4:00pm - 4:40pm
Saturday, April 25


Learning Compositionality with Scala
Logical and statistical approaches to computational semantics have usually been considered orthogonal, but recent proposals consider a synthesis of these perspectives by developing statistical models that are able to learn compositional semantics. In this talk we will show how it is possible to implement some of these techniques in the Synthesis framework proposed by Liang and Potts (2015) with a statically typed, functional language such as Scala, and we will explore the extension of the implementation with algebraic constructs using a category-theoretic perspective. In particular, we will argue that it is precisely the functional paradigm with static typing that provides natural solutions of great interest to many aspects of computational semantics and pragmatics.


Ignacio Cases

PhD student, Stanford University
@ignaciocases
Ignacio Cases is a Ph.D. student in Computational Linguistics at Stanford, working with Professors Chris Potts and Dan Jurafsky, and a member of the Stanford NLP group of the Artificial Intelligence Lab. His research interests are in Computational Linguistics, Natural Language...

Saturday April 25, 2015 10:20am - 11:00am


Transforming an Algorithm for Online Recommendations into a Multi-lingual Syntax Parser
You need solid syntax parsing to really understand the nuance of language. Complicated negation patterns, relationships between entities, entity sentiment assignment (and many other things) are all examples for which having sophisticated syntax understanding is important. The question then is how to get an understanding of syntax across many languages, content types, and contexts. Most traditional model-based approaches require manually coded syntax trees, which are costly to generate, as they require relatively expensive linguist time. These trees exist for some languages, and some content types; but not for, say, German Tweets, or Swedish biotech. It turns out that the problem can be stated as a “similarity” problem, which then looks like a recommendation problem. This presentation will discuss how we leveraged a matrix factorization recommendation algorithm to create a highly efficient, easily extensible syntax parser.


Seth Redmore

@sredmore
Seth Redmore has over 20 years of experience in product marketing and over 10 years of experience in text analytics - from the perspective of a user as well as a vendor. Seth has worked in a number of executive roles at both hardware and software companies, including co-founding...

Saturday April 25, 2015 11:10am - 11:30am


Measuring Well-Being Using Social Media
Social media such as Twitter and Facebook provide a rich, if imperfect, portal onto people's lives. We analyze tens of millions of Facebook posts and billions of tweets to study variation in language use with age, gender, personality, and mental and physical well-being. Word clouds visually illustrate the big five personality traits (e.g., "What is it like to be neurotic?"), while correlations between language use and county-level health data suggest connections between health and happiness, including potential psychological causes of heart disease. Similar analyses are useful in many fields.


Lyle Ungar

University of Pennsylvania
Dr. Lyle Ungar is a Professor of Computer and Information Science at the University of Pennsylvania. He received a B.S. from Stanford University and a Ph.D. from MIT. Dr. Ungar directed Penn's Executive Masters of Technology Management (EMTM) Program for a decade, and served as Associate...

Saturday April 25, 2015 11:40am - 12:20pm


Building the World’s Largest Database of Car Features from PDFs
We will discuss a new system that supports editors creating a database of the features and options available across car models, creating structured data by semi-automated information extraction from lengthy PDF documents.

Edmunds.com is an industry-leading website for car shoppers. To effectively support the car-purchasing process, Edmunds needs to understand the features and options available on the myriad different models offered by manufacturers each year. This critical structured database supports faceted search of models, searching available inventory, and other strategic uses.

This end-to-end capability supports robust processing of unstructured data to identify properties like “air conditioning” and “climate control,” and understand that they are the same underlying feature. For Edmunds, this meant an ~85% reduction in the time it now takes them to get information about a new car model online, from 2 weeks to just 1-2 days. We will also discuss how the NLP models can be re-used across other data, mapping Edmunds’ detailed ontology to a variety of unstructured data sources.


John Akred

Silicon Valley Data Science
@BigDataAnalysis
John Akred, CTO SVDS - With over 15 years in advanced analytical applications and architecture, John is dedicated to helping organizations become more data-driven. He combines deep expertise in analytics and data science with business acumen and dynamic engineering...

Rob Munro

@WWRob
Rob Munro, CEO Idibon - Rob serves as the Chief Executive Officer of Idibon, Inc. He is a World Leader in applying big data analytics to human communications, having worked in many diverse environments, from Sierra Leone, Haiti and the Amazon to London, Sydney and San Franc...

Saturday April 25, 2015 1:20pm - 2:00pm


The Ingenuity Biomedical Knowledge-Base: Advantages of Modeling Knowledge in an Ontology

Today’s enormous corpus of biomedical knowledge presents amazing opportunities to improve human health. However, the knowledge’s fragmentation across the literature and numerous databases poses serious challenges to those opportunities. The Ingenuity Biomedical Knowledgebase (KB) addresses those challenges, providing a framework to model biomedical knowledge in a unified system – implemented as a frame-based ontology. That structure facilitates powerful inference and quality-control features.

Using the Ingenuity KB, Ingenuity Systems (now a part of QIAGEN) provides software solutions to interpret biological datasets. By aligning those datasets (e.g., raw research observations or clinical genomic-testing data) to the KB, they can be viewed, analyzed, and interpreted in the context of relevant biological and biomedical knowledge. I’ll discuss the Ingenuity ontology structure, building process, maintenance regime, and several use-cases.


Jeff Lerman

Staff Ontology Engineer, QIAGEN Silicon Valley (formerly Ingenuity Systems)
Biomedical ontology developer for eight years, focusing on knowledge models for diseases, gene products, and genetic variants. NLP projects include development of word-sense disambiguation approaches to identify genes discussed in biomedical publications, and ontology-leveraged scientific...

Saturday April 25, 2015 2:10pm - 2:50pm


Using Big Data to Identify the World's Top Experts
In this talk, we report on our implementation of a big data system that is able to automatically identify and rank experts in a large number of categories by ingesting and analyzing millions of pieces of content published across the Web every day.

We adopt a principled approach to defining who an expert is. An expert is someone who (a) writes consistently about a small set of tightly related topics (if you are an expert in everything, you are an expert in nothing), (b) has a loyal following that engages with their content consistently and finds it useful, and (c) actually expresses opinions on the topics they write about rather than merely breaking the news.

Formulating the above criteria and implementing them at scale is a daunting big data task. First, we needed to form a reasonably comprehensive picture of the body of work published by authors who often write for many different outlets and at times under different aliases. Second, we had to create a dynamic topical model that learns the relationships between tens of thousands of topics by analyzing millions of documents. Third, we had to come up with a formula that yields a stable, consistent ranking that is robust to fluctuations in publishing patterns and engagement data, yet adaptable enough to let new experts and their voices be heard.

  • Experts vs. Influencers: defining who an expert is
  • Unifying identities of authors across sites
  • A dynamic topical model that scales
  • Projection of topics onto authors
  • Opinion vs. Sentiment vs. Statement of Facts
  • Putting it all together
  • A note on architecture


Nima Sarshar

@nimilinimo
As inPowered’s CTO, Nima leads the development of the core technologies at the heart of inPowered, and oversees all aspects of engineering and product development. Once a tenured professor with 50+ peer-reviewed publications, Nima left academia to found Haileo Inc., a...

Saturday April 25, 2015 3:00pm - 3:20pm


Topic-Based Sentiment Analysis in Customer Feedback
Much of customer support at LinkedIn is done via some form of online communication, such as online feedback forms or email between members and support agents. Topic-based sentiment analysis of member feedback is critical, since a single piece of feedback may address several different topics with different sentiment expressed in each. This talk addresses the topic-based sentiment analysis of customer support feedback, focusing on the following questions: 1) how do we find the most relevant topics for the product in question; 2) how do we ensure that sentiment is attributed to these specific topics rather than to the feedback as a whole; and 3) how do we leverage natural language processing tools such as key phrase extraction and synonym identification to make the resulting topic-sentiment information best suited for human consumption. The model proposed here is extendable to mining sentiment in reviews or any other sentiment-bearing text.


Saturday April 25, 2015 3:30pm - 4:00pm


Scalable Online Learning of Topic Models with Spark
This talk deals with the problem of learning topic models from large text corpora that are constantly growing, such as those from online forums. As documents stream into your corpus, it is much more efficient to update your already learned topic model than to batch-process the entire corpus again. Furthermore, Apache Spark can be used to perform the sequential updates in a distributed fashion. The talk will also include a discussion of how to use your learned topic model to classify the documents in your corpus based on the topics they contain.


Alex Minnaar

Vertical Scope
Software engineer at VerticalScope Inc. Previously MSc in Machine Learning at University College London, BSc in Math & Engineering from Queen's University.

Saturday April 25, 2015 4:00pm - 4:40pm