Friday, April 24
 

8:45am

Opening Remarks
A grand welcome and introductory remarks by Alexy Khrabrov, Chief Scientist, Nitro and Principal, By The Bay.

Speakers

Alexy Khrabrov

Chief Scientist, Nitro/By the Bay
Chief Scientist at Nitro, founder and organizer, SF {Scala, Text, Spark, Reactive}, {Scala, Big Data Scala, Text, Data, ...} By the Bay.


Friday April 24, 2015 8:45am - 9:00am
Lobby Galvanize

9:00am

Host Sponsor Welcome
Welcome from the CTO of Nitro, the host sponsor of the conference.

Speakers

Tihomir Bajić

Nitro | @tihomirb


Friday April 24, 2015 9:00am - 9:10am
Lobby

9:10am

Keynote Address: Now is the Golden Age of Text Analysis
Like steam power in 1780, telegraphy in 1835, or telephony in 1875, computational text analysis is entering a golden age of invention and application. Methods, infrastructure, and needs have come together to create an extraordinary range of opportunities. This talk will sketch the history, predict the future, warn of dangers, and speculate about challenges.

Speakers

Mark Liberman

Mark (@mark_liberman) is a Christopher H. Browne Distinguished Professor of Linguistics and Professor of Computer Science at the University of Pennsylvania. Mark is also the Director of the Linguistic Data Consortium. His research interests include corpus-based phonetics, the phonology and phonetics of lexical tone, and its relationship to intonation; gestural, prosodic, morphological and syntactic ways of marking focus, and their use in discourse...


Friday April 24, 2015 9:10am - 10:00am
Lobby

10:30am

Label Quality in the Era of Big Data
Organizations that develop and use technologies around information retrieval, machine learning, recommender systems, and natural language processing depend on labels for engineering and experimentation. These labels, usually gathered via human computation, are used in machine learned models for prediction and evaluation purposes. In such scenarios collecting high quality labels is a very important part of the overall process. We elaborate on these challenges and explore possible solutions for collecting high quality labels at large scale.

Speakers

Omar Alonso

Tech lead at Microsoft. | @elunca


Friday April 24, 2015 10:30am - 11:10am
Boardroom

10:30am

NLP And Deep Learning: Working with Neural Word Embeddings
In this talk, we will cover how to work with neural nets and text. This will encompass a comparison of the different algorithms, analogies to more traditional methods such as bag of words, followed by a basic idea of how to use two different neural network architectures for sequential and document classification.

Speakers

Adam Gibson

Skymind
@agibsonccc | Adam is the founder and tech lead of Skymind, the commercial arm of the open source project deeplearning4j. He is also an O'Reilly author of the upcoming book Deep Learning: A Practitioner's Approach.


Friday April 24, 2015 10:30am - 11:10am
Theater

10:30am

Learning The Semantics of Millions of Entities
What do "Software Developer", "MTS", and "Code Monkey" have in common? No, it's not the start to a bad joke. It's actually a situation where many unique entities turn out to have a small number of distinct semantics. This talk will present new techniques for mapping millions of such entities into a common semantic space with just a few thousand labels. We'll discuss how these techniques have been applied to job titles, skills, majors, and degrees to build candidate recommendation systems.

Speakers

Vlad Giverts

Workday
@vladgiverts | Vlad is a Sr Director at Workday responsible for building prediction and recommendation systems for the company's cloud HR and Financial Management products. He was previously CTO at Identified, a data and analytics startup, which was acquired by Workday in early 2014. Vlad holds a Bachelor's in Computer Science from UC Berkeley. 


Friday April 24, 2015 10:30am - 11:10am
Speakeasy

11:20am

Reviving the Traditional Russian Orthography for the 21st Century
Virtually all the world-famous classic works of Russian literature were written in what is now called "old" Russian orthography, which was banned in Soviet Russia in 1918. For this reason, the traditional Russian orthography has no support in modern operating systems and text processing software. To publish a critical edition of Tolstoy or Dostoyevsky in the original is a veritable challenge today! I describe my efforts to bring the "old" Russian orthography to the desktop. I will demonstrate a keyboard layout, a spelling dictionary, and a converter between old and modern spelling.

Speakers

Sergei Winitzki

Senior Software Engineer, Workday, Inc.
Functional programming, Type theory, Domain-specific languages, Machine Learning


Friday April 24, 2015 11:20am - 11:40am
Boardroom

11:20am

Discovering Knowledge in Linked Data
By building on the foundations of the Semantic Web, we can create tools that help people explore relationships between data, connect information, and discover knowledge. In this talk, we'll look at how to search Wikidata from a graph database via a domain-specific language. We'll be able to ask simple questions such as "What happened on this day in history?", and tricky questions such as "What were some of the fields of work of physicists who worked at institutions where Richard Feynman also worked?".

Speakers

James Earl Douglas

Wikimedia
@jearldouglas | A functional programmer with advanced experience developing production software in Scala and Java, James is passionate about continuous learning and keeping just outside of his comfort zone.  When not knee-deep in type theory, James can be found running, cycling, and stargazing around the San Francisco Bay Area. 


Friday April 24, 2015 11:20am - 11:40am
Theater

11:20am

Increasing Honesty in Airbnb Reviews
Reviews and reputation scores are increasingly important for decision-making, especially in the case of online marketplaces. However, online reviews may not provide an accurate depiction of the characteristics of a product, either because many people do not leave reviews or because some reviewers omit salient information.

At Airbnb, we study the causes and magnitude of bias in online reviews by using large-scale field experiments that change the incentives of buyers and sellers to honestly review each other. Natural language processing has allowed us to extend our analyses and study bias in reviews by using the written feedback guests and hosts write after a trip.


Speakers

Dave Holtz

Airbnb
@daveholtz | Dave Holtz is a data scientist at Airbnb focusing on online reputation and pricing. Previously, he worked as a data science engineer at Yub (acquired by Coupons.com) and as a data scientist and Product Manager at TrialPay. He is the instructor for Udacity’s Introduction to Data Science course. Dave holds an MA in Physics from The Johns Hopkins University, and a Bachelor’s degree in Physics and Theater from...


Friday April 24, 2015 11:20am - 11:40am
Speakeasy

11:50am

Teaching Machines to Read for Fun and Profit
Kang Sun from the R&D Machine Learning group will speak about Bloomberg’s current projects in the area of Machine Learning and Natural Language Processing, such as sentiment analysis of financial news, market impact predictions, question answering, etc. There will be a discussion of future directions as well as a Q&A session.

Speakers

Kang Sun

Bloomberg


Friday April 24, 2015 11:50am - 12:30pm
Boardroom

11:50am

Organizing Real Estate Photo Collections with Deep Learning
Real estate websites like Trulia and Zillow host millions of property listings, with each listing consisting of a rich textual description and images of the property. While rich in information, the discoverability of this data is limited by its unstructured nature. For example, how do we learn whether "granite countertops" is an interesting real estate term? And if it is, how can we assign it to one of the many photos associated with the property?

In this talk we detail our approach to organizing Trulia's unstructured content into rich photo collections similar to Houzz.com or Zillow Digs, without the need for any explicit user tagging.

By leveraging the recent advances in deep learning for computer vision and NLP, we first automatically construct a knowledge base of relevant real estate terms and then annotate our photo collections by fusing knowledge from a deep convolutional network for image recognition and a word embedding model.

The novelty in our approach lies in our ability to scale to a large vocabulary of real estate terms without explicitly training a vision model for each one of them.


Speakers

Shourabh Rawat

Trulia
@shrawat87 | Shourabh Rawat is a senior data scientist at Trulia Inc., based in San Francisco. He is an applied researcher at the intersection of machine learning, deep learning, NLP and computer vision. He received his Masters in Language Technologies from Carnegie Mellon University, Pittsburgh in 2013, where he researched building multimodal (audio and visual) systems for detecting interesting events in YouTube videos.


Friday April 24, 2015 11:50am - 12:30pm
Theater

11:50am

Human Curated Linguistics - technology behind Cognitive Analytics
This presentation will provide an overview of the Human Curated Linguistics (HCL) technology developed and used by DataLingvo in their Cognitive Analytics platform. HCL provides the industry-first real-time free-form language comprehension and guaranteed answer correctness required for cognitive analytics applications.

Details of the technical implementation and development stack will be discussed.

A live demonstration will show how HCL answers questions asked in free-form language about business data from Google Analytics and salesforce.com data sources.


Speakers

Nikita Ivanov

DataLingvo
@c64hacker | Nikita is a strategic technology advisor to DataLingvo, a cognitive analytics stealth startup that developed a breakthrough Human Curated Linguistics (HCL) technology. Nikita is also a founder and CTO of GridGain Systems, the company behind Apache Ignite, an industry-leading in-memory data fabric. Nikita is a serial entrepreneur with over 20 years of experience in starting companies and developing software including NLP, HPC and...


Friday April 24, 2015 11:50am - 12:30pm
Speakeasy

1:30pm

The Art of PDF Processing
The algorithms that power PDF understanding.

Speakers

Roman Lasskiy

Nitro
@rlasskiy | Roman's software lets you see most of the PDFs in the world. 


Friday April 24, 2015 1:30pm - 2:10pm
Boardroom

1:30pm

Unlocking Our Health Data: Transforming Unstructured Data at Scale
Each of us has a plethora of health data that resides in unstructured, non-standard formats and silos. Bringing this data together can reveal powerful insights about our health, but proves to be a staggering technical challenge. Unstructured narratives contain key pieces of information that cannot easily be extracted without additional processing.

We are building a system to organize this unstructured data, classify it into known topics, and apply additional levels of normalizations -- all in near real-time and at scale. This talk will cover some of the technical challenges we are facing and how we are solving them with machine learning and natural language processing techniques.


Speakers

Ola Wiberg

Co-founder, Human API
@olawiberg | As Co-founder and VP of Engineering at Human API, Ola Wiberg is responsible for infrastructure development, data management, and information security.


Friday April 24, 2015 1:30pm - 2:10pm
Theater

1:30pm

TopicStream, an Application and Architecture for Content Integration in Electronic Reading
The most popular ebook readers inherit from paper books the limiting concept of pagination. In electronic reading, not only is pagination notoriously difficult for scientific/technical/medical (STM) content, but it also locks content into a single dimension of consumption. We radically depart from pages and propose to split content into smaller, semantically self-contained 'tiles.' In contrast to pages, tiles can be more easily related to other tiles that can come from different books, GitHub repositories, StackOverflow discussions, Wikipedia, official documentation from the WWW, etc. Collections of documents from these other sources can be packaged as pre-tiled EPUB3 ebooks. The TopicStream app enables seamless navigation between book content and complementary documents without the need to explicitly open or close document collections. This approach adds value to commercial content in today's world, where a lot of relevant information is available online.

Speakers

Jacek Ambroziak

Ambrosoft
@JacekAmbroziak | During the 90's Jacek was a Research Scientist with a Conceptual Indexing project at Sun Microsystems Labs. Then as a member of Sun's XML Tech Center he wrote an XML full-text search engine (used in JavaHelp and OpenOffice) and an XSLT to JVM optimizing compiler (now Apache Xalan Compiled). More recently he has been innovating in the area of eBook search, preprocessing, and reading, attempting to better utilize the electronic medium and...


Friday April 24, 2015 1:30pm - 2:10pm
Speakeasy

2:20pm

Near-Realtime Webpage Recommendations “One at a time” Using Content Features
Today, information overload is a problem pertinent to most information systems being used on a daily basis, with the World Wide Web chief among them. One of the key goals of StumbleUpon, a web content recommendation platform, is to ease this overload while empowering discovery of relevant information. Our subscription to the “one recommendation at a time” concept focuses on producing an experience of serendipity as users continue to surf the web, while giving us the flexibility to reactively make recommendations in near-realtime. In this presentation we will present the challenges that need to be addressed to extract content features from a web page and make near-realtime recommendations using them. We will describe the main algorithmic approach as well as the general architecture motivating our choices of tools, languages and platforms.

Speakers

Ashok Venkatesan

StumbleUpon
@vashok | Ashok Venkatesan is a Senior Research Engineer at StumbleUpon Inc. He has worked extensively in the areas of Recommendation Systems, Topic Modeling, Text Mining and Machine Learning. Prior to his experience in industry, he completed an MS in Computer Science at Arizona State University.


Friday April 24, 2015 2:20pm - 3:00pm
Boardroom

2:20pm

Unsupervised NLP Tutorial using Apache Spark
Paraphrasing Tim O'Reilly, the person who has the most data wins. That's a neat slogan, but the more data one has, the more likely it is to be unlabeled. Unfortunately, there aren't that many unsupervised learning algorithms out there, for machine learning in general and for NLP in particular. Recent advances in deep learning provide new tools for text mining of large unlabeled datasets. In particular, I will talk about the math, intuition and implementation of the word2vec algorithm, its variants (skip-gram and continuous bag of words), use cases, and extensions (e.g. paragraph2vec, doc2vec). I will wrap up with a simple demonstration at scale using Scala, Apache Spark, MLlib, and the Apache Zeppelin Notebook.

Speakers

Marek Kolodziej

Principal Research Engineer, Nitro
Marek Kolodziej is a Principal Research Engineer at Nitro, Inc. He's been working on a diverse set of machine learning, distributed computing and big data problems for the past 6 years, and statistics and econometrics for the past 11. His current passion is deep learning and GPU computing. Marek got his PhD in Energy and Environmental Economics from Boston University.


Friday April 24, 2015 2:20pm - 3:00pm
Theater

2:20pm

A Web Worth of Data: Common Crawl for NLP
The Common Crawl corpus contains petabytes of web crawl data and is a treasure trove of potential experiments. To introduce you to the possibilities that web crawl data has for NLP, we will take a detailed look at how the data has been used by various experiments and how to get started with the data yourself.

Speakers

Stephen Merity

CommonCrawl
@smerity | Stephen Merity is responsible for crawling billions of pages a month at Common Crawl, a non-profit that provides petabytes of web data free of charge. Prior to joining Common Crawl, Stephen worked with Freelancer.com and Grok Learning in Australia. He holds a Masters of CSE from Harvard University and a Bachelors (Honours) from the University of Sydney in NLP. 


Friday April 24, 2015 2:20pm - 3:00pm
Speakeasy

3:20pm

A High Level Overview of Genomics in Personalized Medicine
Nearly twenty years ago President Clinton announced the completion of one of the largest public/private collaborative efforts in history, the first draft of the human genome. This work promised to bring forth a new era of totally personalized medicine, where the unique blueprint for your body is used to determine the most effective treatment options for you as an individual. Finally, this promise is starting to be realized in the field of oncology, among others. I will give a high-level overview of medical genomics with an emphasis on my area of expertise: using it to guide decision making in oncology.

Speakers

John St. John

Driver Genomics
John is the Director of Bioinformatics at Driver Group, a new startup in the cancer genomics and therapeutics space. Driver Group is currently delivering cutting-edge personalized drug recommendations to cancer patients, and identifying opportunities to bring new kinds of drugs to cancer patients when we discover a need.


Friday April 24, 2015 3:20pm - 3:40pm
Boardroom

3:20pm

Statistical Machine Translation Approach for Name Matching in Record Linkage
Record linkage, or entity resolution, is an important area of data mining. Name matching is a key component of systems for record linkage. Alternative spellings of the same name are a common occurrence in many applications. We use the largest collection of genealogy person records in the world together with user search query logs to build name-matching models. The procedure for building a crowd-sourced training set is outlined together with the presentation of our method. We cast the problem of learning alternative spellings as a machine translation problem at the character level. We use information retrieval evaluation methodology to show that, on our data, this method substantially outperforms a number of standard, well-known phonetic and string similarity methods in terms of precision and recall. Our result can lead to a significant practical impact in entity resolution applications.

Speakers

Jeffrey Sukhare

Ancestry.com
BS, MS Computer Science UC Santa Cruz, PhD candidate Computer Science UC Davis.  Senior Data Scientist at Ancestry.com working on record linkage applications. 


Friday April 24, 2015 3:20pm - 3:40pm
Theater

3:20pm

Knowledge Maps for Content Discovery
Content discoverability and composition are a moving target in education, especially matching content complexity and media diversity to the needs of different students. In this talk I will describe our methods of performing large-scale categorization of online courses using a crowd-sourced taxonomy. Our methodology is agnostic to the media of content, whether it is text, images, or video, and uses Wikipedia as a taxonomy for semi-labeled categorization of content. I will also demo a visualization of Versal's Knowledge Map, a "Google Maps" for content exploration.

Speakers

Oren Schaedel

Versal
Oren is a data scientist at Versal where he leads internal analytics, data engineering, and data-driven research projects. He received a PhD from Caltech focusing on statistical methods for understanding gene regulation of decision making in animals. He consults on integrating statistical modeling and machine learning techniques into commercial products, turning hidden data into business insights.


Friday April 24, 2015 3:20pm - 3:40pm
Speakeasy

4:00pm

Scalable Genome Analysis With ADAM
Thanks to substantial improvements in the cost and throughput of DNA sequencing machines, genomic data may soon make personalized medicine a reality. However, significant processing is needed to turn raw DNA strings captured by sequencers into clinically useful data, and modern DNA processing software can take up to a week to run. In this talk, we'll look at how we reconstruct genomes from the raw sequence data, and we introduce ADAM, an Apache Spark-based API for accelerating genome processing pipelines.

Speakers

Frank Nothaft

AMPLab, UC Berkeley
@fnothaft | Frank Austin Nothaft is an MS/PhD student in Computer Science at UC Berkeley. Frank's research focuses on optimizing commodity distributed systems for scientific applications, and then using these systems to explore biological phenomena. Frank works with Professor David Patterson in the AMPLab and the ASPIRE lab, and is supported by the NSF Graduate Research Fellowship. Frank has also been an IC Design engineer at Broadcom...


Friday April 24, 2015 4:00pm - 4:40pm
Boardroom

4:00pm

Relation Extraction using Distant Supervision, SVMs, and Probabilistic First Order Logic
Why do we want information? So that we can use it? So that our computers can use it? When we have access to rich, structured information we can make advanced applications that solve real-world pain points.

In this talk, I'll present an effective approach for automatically creating knowledge bases: databases of factual, general information. This relation extraction approach centers around the idea that we can use machine learning and natural language processing to automatically recognize information as it exists in real-world, unstructured text.

I'll cover the NLP tools, special ML considerations, and novel methods for creating a successful end-to-end relation extraction system. I will also cover experimental results with this system architecture in both big-data and search-oriented environments.

Speakers

Malcolm Greaves

Nitro
Malcolm is a software engineer at Nitro who crafts solutions to large-scale machine learning and natural language processing problems. A practitioner of functional programming, he keeps up to date with the latest published algorithms and ideas, innovates by hacking together novel solutions, and is a master bug squasher. He received both his Bachelor's ('13) and Master's ('14) in Computer Science from Carnegie Mellon University. During his...


Friday April 24, 2015 4:00pm - 4:40pm
Theater

4:00pm

Identity Resolution in the Sharing Economy
A growing sharing economy demands new, cost-effective ways of establishing and checking identity, to allow services and participants to accurately assess risks and make good choices.

For example, Airbnb verifies offline identities using a scan of your driver’s license or passport. This is checked against templates designed to examine things like the layout and other government indicators of authenticity to help confirm that it appears to be valid. Crucially it involves checking an applicant’s entered name – often in Latin script – against their name on the scanned document, which may be in another script or language, and subject to potentially egregious OCR errors.

More generally, connecting the public and private traces that people, organizations and things — like vehicles — leave in various information stores is essential to delivering valuable analytics and novel services. This is often called entity analytics or identity resolution.

In this talk, we will explore enabling technology in both structured and unstructured contexts, discuss current challenges and limitations, and explore additional examples.


Speakers

David Murgatroyd

Basis Technology
@dmurga | Dave has been building NLP systems since 1998. As VP, Engineering at Basis Technology he gives leadership to a few dozen great engineers delivering NLP goodness to customers around the world tackling social media, watch-list screening, multilingual search, the sharing economy and more. 


Friday April 24, 2015 4:00pm - 4:40pm
Speakeasy

5:00pm

Science Panel
Q&A with Pete Skomoroch, Jeremy Howard, and Ben Pedrick about applied ML/NLP

Speakers

Jeremy Howard

Jeremy Howard is an Australian data scientist and entrepreneur. He is the CEO and Founder at Enlitic, an advanced machine learning company in San Francisco, California. Previously, Howard was the President and Chief Scientist at Kaggle, a community and competition platform of over 200,000 data scientists. Howard is the youngest faculty member at Singularity University, where he teaches data science. He is also a Young Global Leader with the...

Ben Pedrick

Ben Pedrick is an engineer at Judicata, where he has been working on extracting factual data from court opinions. Lately he has been building a parser to recognize specific events in the written history of a court case. Previously he hails from Everalbum, a photo storage company, and TokBox, a video chat platform.

Pete Skomoroch

Pete is a data scientist and entrepreneur focused on building intelligent systems to automate tasks and enable better decisions. Pete specializes in solving hard algorithmic problems, leading cross-functional teams, and developing engaging products powered by data and machine learning. Most recently, he applied his skills to the consumer internet space at LinkedIn, the world's largest professional network, where he was an early member of...


Friday April 24, 2015 5:00pm - 6:30pm
Lobby
 
Saturday, April 25
 

9:00am

Updates and Keynote Address: Natural Language Processing as the Core of a Consumer Application
NLP is often relegated to an after-the-fact, or off-to-the-side role: spam detection or gleaning business insight from user communication and comments that have already occurred. But a new generation of applications - Luka, Thumbtack, Fountain - put the understanding of natural language front and center, often as the first thing that consumers touch. We'll take a deep look at Fountain, both how it classifies plain English questions, and how it identifies which of 70,000+ human skills is necessary to solve the question. The talk will cover both language classification and relationship extraction, particularly focusing on how human expertise is interrelated.

Speakers

Jean Sini

Jean runs technology and engineering at Fountain, including systems architecture, security, and distributed computing. Prior to Fountain, Jean was CTO at One Kings Lane, managing a team of 60+ engineers, product managers, and designers. Earlier, he was VP of Data Aggregation for Mint.com, Founder of data-mining company Untangly, and co-founder of Activeweave, sold to TwelveFold Media in 2008. Jean is also an angel investor and advisor to...


Saturday April 25, 2015 9:00am - 10:00am
Lobby

10:20am

Learning Compositionality with Scala
Logical and statistical approaches to computational semantics have usually been considered orthogonal, but recent proposals are considering a synthesis of these perspectives by developing statistical models that are able to learn compositional semantics. In this talk we will show how it is possible to implement some of these techniques in the Synthesis framework proposed by Liang and Potts (2015) with a statically typed, functional language such as Scala, and we will explore the extension of the implementation with algebraic constructs using a category-theoretic perspective. In particular, we will argue that the functional paradigm with static typing provides natural solutions that are of great interest to many aspects of computational semantics and pragmatics.

Speakers

Ignacio Cases

PhD student, Stanford University
@ignaciocases | Ignacio Cases is a Ph.D. student in Computational Linguistics at Stanford, working with Professors Chris Potts and Dan Jurafsky, and a member of the Stanford NLP group of the Artificial Intelligence Lab. His research interests are in Computational Linguistics, Natural Language Processing and computational models of thought using Deep Learning approaches. In particular, Cases is working in their application to problems in...


Saturday April 25, 2015 10:20am - 11:00am
Boardroom

10:20am

Semantic Indexing of Four Million Documents with Apache Spark
Latent Semantic Analysis (LSA) is a technique in natural language processing and information retrieval that seeks to better understand the latent relationships and concepts in large corpuses. In this talk, we’ll walk through what it looks like to apply LSA to the full set of documents in English Wikipedia, using Apache Spark. Harnessing the Stanford CoreNLP library for lemmatization and MLlib’s scalable SVD implementation for uncovering a lower-dimensional representation of the data, we’ll undertake the modest task of enabling queries against the full extent of human knowledge, based on latent semantic relationships.

Speakers

Sandy Ryza

Cloudera
@sandysifting | Sandy is a data scientist at Cloudera focusing on Apache Spark and its ecosystem, and an author of the upcoming O'Reilly publication Advanced Analytics with Spark. He is a frequent Spark contributor as well as a member of the Apache Hadoop project management committee. He graduated Phi Beta Kappa from Brown University. 


Saturday April 25, 2015 10:20am - 11:00am
Theater

10:20am

Turning the Web into a Structured Database
For the web to truly progress, information must be able to seamlessly flow between your devices, services, and applications. A truly mainstream solution requires building a new kind of search, one that can see the entire web as structured information, rather than documents. This session highlights Diffbot’s novel approach to translating the web into a machine-readable format using a combination of NLP, computer vision, and machine learning.

Speakers

Mike Tung

Diffbot
@miketung | Mike Tung is the founder and CEO of Diffbot, the leading web extraction platform, as well as an adviser at StartX, Stanford’s startup accelerator, and the leader of Stanford's DARPA Robotics Challenge entry. In a previous life, he was a patent lawyer, a grad student in the Stanford AI lab, and a software engineer at eBay, Yahoo!, and Microsoft. Mike studied Electrical Engineering and Computer Science at UC Berkeley and...


Saturday April 25, 2015 10:20am - 11:00am
Speakeasy

11:10am

Transforming an Algorithm for Online Recommendations into a Multi-lingual Syntax Parser
You need solid syntax parsing to really understand the nuance of language. Complicated negation patterns, relationships between entities, entity sentiment assignment (and many other things) are all examples for which having sophisticated syntax understanding is important. The question then is how to get an understanding of syntax across many languages, content types, and contexts. Most traditional model-based approaches require manually coded syntax trees, which are costly to generate, as they require relatively expensive linguist time. These trees exist for some languages, and some content types; but not for, say, German Tweets, or Swedish biotech. It turns out that the problem can be stated as a “similarity” problem, which then looks like a recommendation problem. This presentation will discuss how we leveraged a matrix factorization recommendation algorithm to create a highly efficient, easily extensible syntax parser.

Speakers
SR

Seth Redmore

Lexalytics
@sredmore | Seth Redmore has over 20 years of experience in product marketing and over 10 years of experience in text analytics, from the perspective of a user as well as a vendor. Seth has worked in a number of executive roles at both hardware and software companies, including co-founding Netiverse (which built a high-speed server load balancing system), which was bought by Cisco Systems in 2000. While at Cisco, he worked in a variety of product...


Saturday April 25, 2015 11:10am - 11:30am
Boardroom

11:10am

Large Scale Topic Assignment on Multiple Social Networks
Inferring users' interests and expertise is a challenging problem with applications in various data-powered products. In this talk, we present a full production system used at Lithium Technologies (Klout), which mines topical interests and expertise from multiple social networks and assigns over 10,000 topics to hundreds of millions of users on a daily basis.

The system generates a diverse set of features derived from signals such as user-generated posts and profiles, user reactions such as comments and retweets, user attributions such as lists, tags and endorsements, as well as signals based on social graph connections. We show that using cross-network information with a diverse set of features for a user leads to a more complete and accurate understanding of the user's topics, as compared to using any single network or any single source.


Speakers
AR

Adithya Rao

Lithium
@AdithyaRao | Adithya is the Lead Research Engineer in the Data Science team at Lithium Technologies. Some of the projects he has worked on include the Klout score, topic extraction, user targeting and social content discovery. Before his current role, he graduated with a Master's degree from Stanford University, specializing in Machine Learning.
NS

Nemanja Spasojevic

Lithium
@sofronije | Nemanja is the Director of Data Science at Lithium Technologies, and leads all the data science and engineering efforts around building scalable systems for topical text mining, user scoring and content relevance. Before his current role, he spent 6 years at Google, working on projects such as Google Books.


Saturday April 25, 2015 11:10am - 11:30am
Theater

11:10am

Transforming Unstructured Offer Titles

VigLink helps publishers monetize content by affiliating existing commercial links and automatically identifying product references that can be linked to commercial sites. At VigLink, we have an ever-growing catalogue of product offers (~330M) from multiple sources (including Amazon, eBay, and Shopzilla) and verticals (from Automotive to Consumer Electronics to Home & Garden).

Offers are usually unstructured text items. For many applications where similar offers should be found or offer titles need to be linked to websites, it's beneficial to recognize the characteristics of individual offerings instead of working on unstructured offer titles directly. In this talk I will discuss what the relevant aspects of an offer are and present an approach to automatically extract these pieces of information. I will also briefly touch upon possible applications on top of such structured offers.



Speakers
KT

Katrin Tomanek

VigLink
Katrin is a Sr. Data Scientist at VigLink, where she develops and optimizes NLP components for product entity recognition, their linking to offer databases, and the transformation of original offer titles into structured information. This includes, among other things, statistical modeling, qualitative and quantitative evaluation of ML and NLP components, and A/B testing to determine the impact of improved components on overall business metrics.


Saturday April 25, 2015 11:10am - 11:30am
Speakeasy

11:40am

Measuring Well-Being Using Social Media
Social media such as Twitter and Facebook provide a rich, if imperfect, portal onto people's lives. We analyze tens of millions of Facebook posts and billions of tweets to study variation in language use with age, gender, personality, and mental and physical well-being. Word clouds visually illustrate the big five personality traits (e.g., "What is it like to be neurotic?"), while correlations between language use and county-level health data suggest connections between health and happiness, including potential psychological causes of heart disease. Similar analyses are useful in many fields.

Speakers
LU

Lyle Ungar

University of Pennsylvania
Dr. Lyle Ungar is a Professor of Computer and Information Science at the University of Pennsylvania. He received a B.S. from Stanford University and a Ph.D. from MIT. Dr. Ungar directed Penn's Executive Masters of Technology Management (EMTM) Program for a decade, and served as Associate Director of the Penn Center for BioInformatics (PCBI). He has published over 200 articles and holds eleven patents. His current research focuses on statistical...


Saturday April 25, 2015 11:40am - 12:20pm
Boardroom

11:40am

Classifying Text without (many) Labels
Supervised text classification is often hampered by the need to acquire relatively expensive labeled training sets. In some embodiments of the systems and methods disclosed herein, pre-existing Word2Vec or similar algorithms are leveraged to create vector representations of documents that enable a model to be successfully trained with a drastically reduced training set. With this technique, the implementer needs only a small volume of labeled examples to train proximity thresholds, rather than devoting significant resources to traditional text classification algorithms, which typically require training sets that are orders of magnitude larger.
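As a rough illustration of the idea in this abstract, the sketch below averages word vectors into a document vector and fits a single cosine-similarity threshold from a handful of labeled examples. It is a minimal sketch under stated assumptions, not the speaker's method: the tiny `WORD_VECTORS` table is invented and stands in for real pre-trained Word2Vec output, and the function names (`doc_vector`, `fit_threshold`, `is_member`) and the midpoint rule for choosing the threshold are hypothetical.

```python
import math

# Invented 3-dimensional vectors standing in for pre-trained Word2Vec output.
WORD_VECTORS = {
    "goal": [0.9, 0.1, 0.0],
    "match": [0.8, 0.2, 0.1],
    "team": [0.7, 0.3, 0.0],
    "stock": [0.1, 0.9, 0.1],
    "market": [0.0, 0.8, 0.2],
    "shares": [0.1, 0.9, 0.0],
}


def doc_vector(text):
    """Average the vectors of the known words; return None if no word is known."""
    vecs = [WORD_VECTORS[w] for w in text.lower().split() if w in WORD_VECTORS]
    if not vecs:
        return None
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def fit_threshold(positives, negatives, center):
    """Hypothetical rule: midpoint between the least similar positive
    example and the most similar negative example."""
    lowest_pos = min(cosine(doc_vector(t), center) for t in positives)
    highest_neg = max(cosine(doc_vector(t), center) for t in negatives)
    return (lowest_pos + highest_neg) / 2


def is_member(text, center, threshold):
    """Classify a document by its proximity to the class centroid."""
    vec = doc_vector(text)
    return vec is not None and cosine(vec, center) >= threshold


# Four labeled examples are enough to fit the threshold in this toy setting.
positives = ["team goal", "match goal"]
negatives = ["stock market", "market shares"]
center = centroid([doc_vector(t) for t in positives])
threshold = fit_threshold(positives, negatives, center)
```

Here the embeddings carry most of the information, so the only quantity learned from labels is one scalar threshold per class; that is why so few labeled examples are needed in this simplified setup.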

Speakers
MT

Mike Tamir

Galvanize
@MikeTamir | With over a decade of teaching experience at the university level, Mike serves as CSO/Head of Education and Data Science Programming for Galvanize. Mike also helped to found the GalvanizeU Masters program focused on developing the skills required of high performing Data Scientists in the industry. He has led several teams of Data Scientists in the Bay Area as Chief Data Scientist for InterTrust and as Director of Data...


Saturday April 25, 2015 11:40am - 12:20pm
Theater

11:40am

Introduction to RDF and Linked Data
For most NLP practitioners, the Web is seen as a Web of Documents where information is waiting to be extracted. But what if we had access to a Web of machine-readable Data instead? That is basically the promise behind RDF and Linked Data and it is already here.

In this talk, I will demystify those concepts and technologies, and show you the fascinating world of Linked Open Data.


Speakers
AB

Alexandre Bertails

@bertails | Scala & Linked Data Dev. Former W3C.


Saturday April 25, 2015 11:40am - 12:20pm
Speakeasy

1:20pm

Building the World’s Largest Database of Car Features from PDFs
We will discuss a new system that supports editors creating a database of the features and options available across car models, creating structured data by semi-automated information extraction from lengthy PDF documents.

Edmunds.com is an industry-leading website for car shoppers. To effectively support the car-purchasing process, Edmunds needs to understand the features and options available on the myriad different models offered by manufacturers each year. This critical structured database supports faceted search of models, searching available inventory, and other strategic uses.

This end-to-end capability supports robust processing of unstructured data to identify properties like “air conditioning” and “climate control,” and understand that they are the same underlying feature. For Edmunds, this meant an ~85% reduction in the time it now takes them to get information about a new car model online, from 2 weeks to just 1-2 days. We will also discuss how the NLP models can be re-used across other data, mapping Edmunds’ detailed ontology to a variety of unstructured data sources.


Speakers
JA

John Akred

Silicon Valley Data Science
@BigDataAnalysis | John Akred, CTO SVDS - With over 15 years in advanced analytical applications and architecture, John is dedicated to helping organizations become more data-driven. He combines deep expertise in analytics and data science with business acumen and dynamic engineering leadership. 
RM

Rob Munro

Idibon
@WWRob | Rob Munro, CEO Idibon - Rob serves as the Chief Executive Officer of Idibon, Inc. He is a world leader in applying big data analytics to human communications, having worked in many diverse environments, from Sierra Leone, Haiti and the Amazon to London, Sydney and San Francisco.


Saturday April 25, 2015 1:20pm - 2:00pm
Boardroom

1:20pm

Learning From the Diner's Experience
I will talk about how we are using data science to help transform OpenTable into a local dining expert who knows you very well, and can help you and others find the best dining experience wherever you travel! This entails a whole slew of tools from natural language processing, recommendation system engineering, and sentiment analysis that have to work in sync to make that magical experience happen. One of our main sources of insight is the reviews left by diners on our website. In this talk, I will focus on what we are learning from our rich set of diner reviews, especially using topic modeling as a core tool. I will touch upon various possible applications of this technique that we are currently exploring in both restaurateur-facing and diner-facing features.

Speakers
avatar for Sudeep Das

Sudeep Das

Data Scientist, OpenTable
@datamusing | Sudeep Das is a Data Scientist at OpenTable, where his main focus is on mining reviews and restaurant data to extract actionable insights and enable a personalized dining experience. He has broad experience with NLP methods, especially topic modeling and its applications. Before moving into the Data Science space, Sudeep was an Astrophysicist (Princeton PhD, UC Berkeley postdoc) researching the properties of the early universe...


Saturday April 25, 2015 1:20pm - 2:00pm
Theater

1:20pm

Practical NLP Applications of Deep Learning
Deep Learning is the hot "new" technique in the world of Machine Learning, but most of the published benefits of Deep Learning have been tied to audio and visual data. There are, however, significant benefits users can draw from Deep Learning, particularly in the area of unsupervised representation learning. This talk focuses on the practical applications of these techniques, particularly neural network word embeddings. I also explore how Mattermark uses these techniques to perform many ML and NLP tasks.

Speakers
avatar for Samiur Rahman

Samiur Rahman

Head of Data Engineering, Mattermark


Saturday April 25, 2015 1:20pm - 2:00pm
Speakeasy

2:10pm

The Ingenuity Biomedical Knowledge-Base: Advantages of Modeling Knowledge in an Ontology

Today’s enormous corpus of biomedical knowledge presents amazing opportunities to improve human health.  However, the knowledge’s fragmentation across the literature and numerous databases poses serious challenges to those opportunities. The Ingenuity Biomedical Knowledgebase (KB) addresses those challenges, providing a framework to model biomedical knowledge in a unified system – implemented as a frame-based ontology. That structure facilitates powerful inference and quality-control features.

Using the Ingenuity KB, Ingenuity Systems (now a part of QIAGEN) provides software solutions to interpret biological datasets. By aligning those datasets (e.g., raw research observations or clinical genomic-testing data) to the KB, they can be viewed, analyzed, and interpreted in the context of relevant biological and biomedical knowledge. I’ll discuss the Ingenuity ontology structure, building process, maintenance regime, and several use-cases.


Speakers
avatar for Jeff Lerman

Jeff Lerman

Staff Ontology Engineer, QIAGEN Silicon Valley (formerly Ingenuity Systems)
Biomedical ontology developer for eight years, focusing on knowledge models for diseases, gene products, and genetic variants. NLP projects include development of word-sense disambiguation approaches to identify genes discussed in biomedical publications, and ontology-leveraged scientific article classification by topic. Before QIAGEN, Jeff studied molecular biology at Princeton (Ph.D.), and did a postdoc in protein structure/function at U.C...


Saturday April 25, 2015 2:10pm - 2:50pm
Boardroom

2:10pm

Identifying Events with Tweets

Speakers
GS

Gabor Szabo

Twitter
@gaborjszabo | Gabor is a Staff Data Scientist at Twitter, where he works on describing and predicting user behavior and modeling large-scale content dynamics. Before that, he worked on predicting content popularity in crowdsourced ecologies and on the network analysis of online services, and did research on large-scale social and biological systems. He previously worked at HP Labs, Harvard Medical School, and the University of Notre Dame.


Saturday April 25, 2015 2:10pm - 2:50pm
Theater

2:10pm

Deep Learning for Natural Language Processing
In this talk, I will describe deep learning algorithms that learn representations for language that are useful for solving a variety of complex language problems. I will focus on 3 tasks: fine-grained sentiment analysis; question answering to win trivia competitions (like Watson's Jeopardy system but with one neural network); and multimodal sentence-image embeddings (with a fun demo!) to find images that visualize sentences. I will also show some demos of how deepNLP can be made easy to use with MetaMind.io's software.

Speakers
RS

Richard Socher

MetaMind
@RichardSocher | Richard Socher is the CTO and founder of MetaMind, a startup that seeks to improve artificial intelligence and make it widely accessible. He obtained his PhD from Stanford working on deep learning with Chris Manning and Andrew Ng. He is interested in developing new AI models that perform well across multiple different tasks in natural language processing and computer vision. He was awarded the 2011 Yahoo! Key Scientific...


Saturday April 25, 2015 2:10pm - 2:50pm
Speakeasy

3:00pm

Using Big Data to Identify the World's Top Experts
In this talk, we report on our implementation of a big data system that is able to automatically identify and rank experts in a large number of categories by ingesting and analyzing millions of pieces of content published across the Web every day.

We adopt a principled approach to defining who an expert is. An expert is someone who (a) writes consistently about a small set of tightly related topics (if you are an expert in everything, you are an expert in nothing), (b) has a loyal following that engages with her content consistently and finds it useful, and (c) actually expresses opinions on the topics she writes about rather than merely breaking the news.

Formulating the above criteria and implementing them at scale is a daunting big data task. First, we needed to form a rather comprehensive picture of the body of work published by authors who often write for many different outlets and at times under different aliases. Second, we had to create a dynamic topical model that learns the relationship between tens of thousands of topics by analyzing millions of documents. Third, we had to come up with a formula that results in a stable, consistent ranking that is robust to fluctuations in publishing patterns and engagement data, yet adaptable enough to allow new experts to emerge and their voices to be heard.

  • Experts vs. Influencers: defining who an expert is
  • Unifying identities of authors across sites
  • A dynamic topical model that scales
  • Projection of topics onto authors
  • Opinion vs. Sentiment vs. Statement of Facts
  • Putting it all together
  • A note on architecture

Speakers
NS

Nima Sarshar

InPowered
@nimilinimo | As inPowered’s CTO, Nima leads the development of the core technologies at the heart of inPowered, and oversees all aspects of engineering and product development. Once a tenured professor with 50+ peer-reviewed publications, Nima left academia to found Haileo Inc., a startup specializing in visual search. He was a Principal Data Scientist at Intuit, before joining inPowered. Nima holds an M.Sc. from UCLA and a Ph.D...


Saturday April 25, 2015 3:00pm - 3:20pm
Boardroom

3:00pm

How Terminal makes Machine Learning Fast and Fun
This talk will explain how Terminal works and how people are using it for Machine Learning applications and Big Data analysis (for example, out-of-the-box multi-tenant Spark clusters).

Speakers
VG

Varun Ganapathi

Terminal.com
@varungp | Varun Ganapathi was a PhD candidate at Stanford when he co-founded EyeApps and Numovis. EyeApps created a paid photography app, Pro HDR, for iPhone and Android, which has sold more than a million copies. After Numovis was acquired by Google, Varun spent two years as a Senior Research Scientist at Google before founding Terminal.com.


Saturday April 25, 2015 3:00pm - 3:20pm
Theater

3:00pm

Extended Swadesh List
Do differences between natural languages increase as time passes? An improved version of an extended Swadesh list of basic meanings has been developed to help answer this question by means of lexicostatistical analysis. The new instrument and the development process behind it will be of interest to researchers who study stability of meaning-word pairs and develop NLP methods for identification of cognates.

Speakers
DG

Dmitri Gusev

Purdue University
@dmitri_a_gusev | Dmitri A. Gusev received his Ph.D. in Computer Science from Indiana University in 1999 and worked for Eastman Kodak Company as an image processing scientist from 1999 to 2007. Since 2013, he has been an Associate Professor in Computer and Information Technology (CIT) at Purdue University College of Technology Columbus. His primary research interests are imaging, graphics, game development, visualization, and computational linguistics.


Saturday April 25, 2015 3:00pm - 3:20pm
Speakeasy

3:30pm

Topic-Based Sentiment Analysis in Customer Feedback
Much of customer support at LinkedIn is done via some form of online communication such as online feedback forms or email between members and support agents. Topic-based sentiment analysis of member feedback is critical since a single piece of feedback may address several different topics with different sentiment expressed in each. This talk addresses the topic-based sentiment analysis of customer support feedback, focusing on the following questions: 1) how do we find the most relevant topics of a product in question; 2) how do we ensure that sentiment is attributed to these specific topics as opposed to the feedback as a whole; and 3) how do we leverage natural language processing tools such as key phrase extraction and synonym identification to make the obtained topic-sentiment information best suited for human consumption. The model proposed here is extendable to mining sentiment in reviews or any other sentiment-bearing text.

Speakers

Saturday April 25, 2015 3:30pm - 4:00pm
Boardroom

3:30pm

Identifying CrunchBase Entities in News Articles
We will discuss performing record linkage on entities identified in news articles scraped from the web. Further, we will discuss the challenges of working with user-edited entities that are constantly changing.

Speakers
avatar for Gershon Bialer

Gershon Bialer

CrunchBase


Saturday April 25, 2015 3:30pm - 4:00pm
Theater

3:30pm

Incentivized Question and Answer Data
Poshly incentivizes its users to participate in online, dynamically generated surveys. The questions and answers are written by our team of subject matter experts from the cosmetic industry. This talk will cover the basics of our approach: how our content is created and how it is selected. Depending on time and interest, we can also touch on aspects of our data pipeline, which was developed in Scala.

Speakers
MD

Matthew Drescher

Poshly
After getting kicked off his corner by an angry mime while busking with his saxophone in Victoria B.C., Matthew decided that it was time to change gears. Escaping music school with the mime hot on his tail, Matthew discovered mathematics and theoretical computer science and graduated with an MSc from McGill University's discrete mathematics group and wrote the paper "An Approximation Algorithm for the Maximum Leaf Spanning...


Saturday April 25, 2015 3:30pm - 4:00pm
Speakeasy

4:00pm

Scalable Online Learning of Topic Models with Spark
This talk deals with the problem of how to learn topic models from large text corpora that are constantly growing, such as those from online forums. As documents stream into your corpus, it is much more efficient to update your already learned topic model than to batch process your entire corpus. Furthermore, Apache Spark can be used to perform the sequential updates in a distributed fashion. The talk will also include a discussion on how to use your learned topic model to classify the documents in your corpus based on the topics they contain.

Speakers
AM

Alex Minnaar

VerticalScope
Software engineer at VerticalScope Inc. Previously MSc in Machine Learning at University College London, BSc in Math & Engineering from Queen's University.


Saturday April 25, 2015 4:00pm - 4:40pm
Boardroom

4:00pm

A Word is Worth a Thousand Vectors
Standard natural language processing (NLP) is a messy and difficult affair. It requires teaching a computer about English-specific word ambiguities as well as the hierarchical, sparse nature of words in sentences. At Stitch Fix, word vectors help computers learn from the raw text in customer notes. Our systems need to identify a medical professional when she writes that she 'used to wear scrubs to work', and distill 'taking a trip' into a Fix for vacation clothing. Applied appropriately, word vectors are dramatically more meaningful and more flexible than current techniques and let computers peer into text in a fundamentally new way. I'll speak about word2vec, related techniques, and try to convince you that word vectors give us a simple and flexible platform for understanding text.
 

Speakers
CE

Christopher Erick Moody

Stitch Fix
@chrisemoody | Chris loves high performance computing, high dimensions & high fashion. He loves learning the beautiful symmetries between physics, data, and analytics. Hails from Spain & South Carolina. Went to Caltech, did astrostats & supercomputing, and now Data Labs at Stitch Fix. Currently enjoying coding up word2vec, Gaussian Processes, and t-SNE. 


Saturday April 25, 2015 4:00pm - 4:40pm
Theater

4:00pm

ML Scoring: Where Machine Learning Meets Search
Search can be viewed as a combination of a) a constraint satisfaction problem, which is the process of finding a solution to a set of constraints (query) that impose conditions that the variables (fields) must satisfy, with a resulting object (document) being a solution in the feasible region (result set), plus b) a scoring/ranking problem of assigning values to different alternatives according to some convenient scale. This ultimately provides a mechanism to sort the alternatives in the result set in order of importance, value or preference. In particular, scoring in search has evolved from a document-centric calculation (e.g. TF-IDF) rooted in information retrieval to a function that is more context sensitive (e.g. includes geo-distance ranking) or user centric (e.g. takes user parameters for personalization), and that depends on other factors tied to the domain and task at hand. However, most systems that incorporate machine learning techniques to perform classification or generate scores for these specialized tasks do so as a post-retrieval re-ranking function, outside of search! In this talk I show ways of incorporating advanced scoring functions, based on supervised learning and bid scaling models, into popular search engines such as Elasticsearch and Solr. I'll provide practical examples of how to construct such "ML Scoring" plugins in search to generalize the application of a search engine as a model evaluator for supervised learning tasks. This will facilitate the building of systems that can do computational advertising, recommendations and specialized search systems, applicable to many domains.
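The two-part view of search in this abstract (filter to a feasible set, then score and sort) can be sketched in a few lines of plain Python. This is only an illustrative sketch and not an Elasticsearch or Solr plugin: the `DOCS` collection, the function names, and the hand-set logistic `weights` and `bias` are all hypothetical, and a real system would learn the weights from training data.

```python
import math

# Hypothetical toy collection; each document has a filterable field and text.
DOCS = [
    {"id": 1, "city": "SF", "text": "pizza pasta wine"},
    {"id": 2, "city": "SF", "text": "pizza pizza beer"},
    {"id": 3, "city": "NYC", "text": "pizza bagel"},
]


def matches(doc, constraints):
    """Constraint satisfaction: every field constraint must hold."""
    return all(doc.get(field) == value for field, value in constraints.items())


def tf_idf(term, doc, docs):
    """Document-centric feature: term frequency times a smoothed IDF."""
    words = doc["text"].split()
    tf = words.count(term) / len(words)
    df = sum(1 for d in docs if term in d["text"].split())
    idf = math.log((1 + len(docs)) / (1 + df)) + 1
    return tf * idf


def model_score(features, weights, bias):
    """Logistic model evaluated at query time; weights are assumed given."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1 / (1 + math.exp(-z))


def search(term, constraints, docs, weights, bias):
    """Filter to the feasible region, then rank by the model's score."""
    feasible = [d for d in docs if matches(d, constraints)]
    scored = [
        (model_score([tf_idf(term, d, docs)], weights, bias), d["id"])
        for d in feasible
    ]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]


# Document 3 fails the city constraint; document 2 mentions "pizza" more often.
results = search("pizza", {"city": "SF"}, DOCS, weights=[2.0], bias=-1.0)
```

The point of the sketch is that the model is evaluated inside the ranking step rather than as a separate re-ranking pass over already retrieved results; context or user features could be appended to the feature list in the same way.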

Speakers
JD

Joaquin Delgado

Verizon OnCue
@joaquind | Joaquin A. Delgado, PhD, is currently Director of Advertising and Recommendations at OnCue (acquired by Verizon). Previously, he held CTO positions at AdBrite, Lending Club and TripleHop Technologies (acquired by Oracle). He was also Director of Engineering and Sr. Architect Principal at Yahoo! His expertise lies in distributed systems, advertising technology, machine learning, recommender systems and search. He holds...
DH

Diana Hu

Exploring the depths through breadth in flow, perception and data


Saturday April 25, 2015 4:00pm - 4:40pm
Speakeasy

5:00pm

Business of Text
CEO Panel with SriSatish Ambati, Nikita Ivanov, and Oleg Rogynskyy

Speakers
SA

SriSatish Ambati

CEO, 0xdata
NI

Nikita Ivanov

DataLingvo
@c64hacker | Nikita is a strategic technology advisor to DataLingvo, a cognitive analytics stealth startup that developed a breakthrough Human Curated Linguistics (HCL) technology. Nikita is also a founder and CTO of GridGain Systems, the company behind Apache Ignite, an industry-leading in-memory data fabric. Nikita is a serial entrepreneur with over 20 years of experience starting companies and developing software including NLP, HPC and...
KL

Katelyn Lyster

Katelyn Lyster is co-founder of ReOrient Media, which includes Infinite Canvas Suite - applications for the iPad that support non-linear transmedia. She leads the business development, evangelism, and marketing initiatives. Katelyn’s experience spans international education, human resources, mobile technology, and environmental activism.
OR

Oleg Rogynskyy

CEO, Semantria (a Lexalytics company)


Saturday April 25, 2015 5:00pm - 6:30pm
Lobby