Text By the Bay has ended
Saturday, April 25 • 10:20am - 11:00am
Semantic Indexing of Four Million Documents with Apache Spark


Latent Semantic Analysis (LSA) is a technique in natural language processing and information retrieval that seeks to uncover the latent relationships and concepts in large corpora. In this talk, we’ll walk through what it looks like to apply LSA to the full set of documents in English Wikipedia using Apache Spark. Harnessing the Stanford CoreNLP library for lemmatization and MLlib’s scalable SVD implementation to uncover a lower-dimensional representation of the data, we’ll undertake the modest task of enabling queries against the full extent of human knowledge, based on latent semantic relationships.
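
For readers unfamiliar with LSA, the pipeline described above (normalize terms, build a weighted term-document matrix, take a truncated SVD, query in the reduced space) can be sketched at toy scale. This is not the speaker's code: it uses NumPy on a single machine instead of Spark and MLlib, whitespace tokenization as a stand-in for CoreNLP lemmatization, and a hypothetical four-document corpus. The names `DOCS`, `build_tfidf`, `fit_lsa`, and `top_docs_for_term` are made up for illustration.

```python
import math
from collections import Counter

import numpy as np

# Hypothetical toy corpus; the talk itself indexes English Wikipedia.
DOCS = {
    "spark": "spark cluster computing engine data processing",
    "hadoop": "hadoop cluster storage data processing",
    "cat": "cat animal pet fur",
    "dog": "dog animal pet bark",
}


def tokenize(text):
    # Assumption: lowercasing and whitespace splitting stand in for lemmatization.
    return text.lower().split()


def build_tfidf(docs):
    """Return (titles, vocab, matrix) with one TF-IDF row per document."""
    titles = list(docs)
    tokens = [tokenize(docs[title]) for title in titles]
    vocab = sorted({word for toks in tokens for word in toks})
    column = {word: i for i, word in enumerate(vocab)}
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(word for toks in tokens for word in set(toks))
    matrix = np.zeros((len(titles), len(vocab)))
    for row, toks in enumerate(tokens):
        for word, count in Counter(toks).items():
            tf = count / len(toks)
            idf = math.log(len(titles) / doc_freq[word])
            matrix[row, column[word]] = tf * idf
    return titles, vocab, matrix


def fit_lsa(matrix, k):
    """Truncated SVD: keep the k strongest latent concepts."""
    # Thin SVD: matrix = u @ diag(s) @ vt, singular values in descending order.
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    doc_vecs = u[:, :k] * s[:k]   # documents in concept space, scaled by s
    term_vecs = vt[:k].T          # terms in concept space
    return doc_vecs, term_vecs


def top_docs_for_term(term, titles, vocab, doc_vecs, term_vecs, n=2):
    """Rank documents by the term's weight in the rank-k reconstruction."""
    scores = doc_vecs @ term_vecs[vocab.index(term)]
    best = np.argsort(-scores)[:n]
    return [(titles[i], float(scores[i])) for i in best]


titles, vocab, matrix = build_tfidf(DOCS)
doc_vecs, term_vecs = fit_lsa(matrix, k=2)
print(top_docs_for_term("cluster", titles, vocab, doc_vecs, term_vecs))
```

At Wikipedia scale the dense `np.linalg.svd` call would not be practical; a distributed, sparse SVD such as the MLlib implementation named in the abstract fills that role.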


Sandy Ryza

@sandysifting
Sandy is a data scientist at Cloudera focusing on Apache Spark and its ecosystem, and an author of the upcoming O'Reilly publication Advanced Analytics with Spark. He is a frequent Spark contributor as well as a member of the Apache Hadoop project management committee...

Saturday April 25, 2015 10:20am - 11:00am
