Saturday, April 25 • 10:20am - 11:00am
Semantic Indexing of Four Million Documents with Apache Spark


Latent Semantic Analysis (LSA) is a technique in natural language processing and information retrieval that seeks to uncover the latent relationships and concepts in large corpora. In this talk, we’ll walk through what it looks like to apply LSA to the full set of documents in English Wikipedia using Apache Spark. Harnessing the Stanford CoreNLP library for lemmatization and MLlib’s scalable SVD implementation to find a lower-dimensional representation of the data, we’ll undertake the modest task of enabling queries against the full extent of human knowledge, based on latent semantic relationships.
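For readers unfamiliar with LSA, the pipeline described above can be sketched on a toy scale. The following is a minimal illustration in plain Python, not the code from the talk: it assumes whitespace tokenization in place of CoreNLP lemmatization, a four-document toy corpus in place of Wikipedia, and a simple power iteration in place of MLlib's distributed SVD. All function names (`tfidf_matrix`, `top_concepts`, `rank`) are hypothetical.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Build a documents-by-terms TF-IDF matrix (whitespace tokens, no lemmatization)."""
    tokenized = [doc.lower().split() for doc in docs]
    terms = sorted({t for toks in tokenized for t in toks})
    doc_freq = Counter(t for toks in tokenized for t in set(toks))
    n_docs = len(docs)
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        rows.append([tf[t] / len(toks) * math.log(n_docs / doc_freq[t]) for t in terms])
    return terms, rows

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def top_concepts(rows, k, iters=200):
    """Approximate the top-k singular values and right singular vectors of `rows`
    by power iteration with deflation on the Gram matrix A^T A."""
    n_terms = len(rows[0])
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(n_terms)]
            for i in range(n_terms)]
    concepts = []
    for _ in range(k):
        v = [1.0 / math.sqrt(n_terms)] * n_terms  # deterministic unit start vector
        for _ in range(iters):
            w = [dot(g_row, v) for g_row in gram]
            norm = math.sqrt(dot(w, w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        eigenvalue = dot(v, [dot(g_row, v) for g_row in gram])
        concepts.append((math.sqrt(max(eigenvalue, 0.0)), v))
        # Deflate: remove the component just found so the next pass finds the next one.
        gram = [[gram[i][j] - eigenvalue * v[i] * v[j] for j in range(n_terms)]
                for i in range(n_terms)]
    return concepts

def project(vec, concepts):
    """Map a term-space vector into the low-dimensional concept space."""
    return [dot(vec, v) for _, v in concepts]

def cosine(u, v):
    denom = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / denom if denom else 0.0

def rank(query, terms, doc_coords, concepts):
    """Score documents against a query by cosine similarity in concept space.
    Query words that are not in the vocabulary are ignored."""
    words = set(query.lower().split())
    q_coords = project([1.0 if t in words else 0.0 for t in terms], concepts)
    scores = [(cosine(q_coords, dc), i) for i, dc in enumerate(doc_coords)]
    return sorted(scores, reverse=True)

# Toy corpus with two topics (an assumption for illustration only).
docs = [
    "spark cluster computing engine",
    "spark cluster data processing",
    "wikipedia article human knowledge",
    "wikipedia article encyclopedia knowledge",
]
terms, rows = tfidf_matrix(docs)
concepts = top_concepts(rows, k=2)
doc_coords = [project(r, concepts) for r in rows]
ranking = rank("spark engine", terms, doc_coords, concepts)
```

In this sketch each document ends up as a two-number vector, one coordinate per latent concept, and a query is folded into the same space before comparison. At Wikipedia scale the same idea relies on a distributed SVD rather than a dense Gram matrix, which would not fit in memory for a vocabulary of that size.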

Speakers

Sandy Ryza

Cloudera
@sandysifting | Sandy is a data scientist at Cloudera focusing on Apache Spark and its ecosystem, and an author of the upcoming O'Reilly publication Advanced Analytics with Spark. He is a frequent Spark contributor as well as a member of the Apache Hadoop project management committee. He graduated Phi Beta Kappa from Brown University. 


Saturday April 25, 2015 10:20am - 11:00am
Theater
