Name: Semantic Indexing of Four Million Documents with Apache Spark
Start: 2015-04-25T10:20:00-0700
End: 2015-04-25T11:00:00-0700

Semantic Indexing of Four Million Documents with Apache Spark

Latent Semantic Analysis (LSA) is a technique in natural language processing and information retrieval that seeks to better understand the latent relationships and concepts in large corpora. In this talk, we’ll walk through what it looks like to apply LSA to the full set of documents in English Wikipedia, using Apache Spark. Harnessing the Stanford CoreNLP library for lemmatization and MLlib’s scalable SVD implementation for uncovering a lower-dimensional representation of the data, we’ll undertake the modest task of enabling queries against the full extent of human knowledge, based on latent semantic relationships.
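
To make the idea concrete, here is a minimal sketch of LSA on a toy corpus. It is not the talk's implementation: it assumes raw term counts (no lemmatization or term weighting), five made-up two-word documents, and a small pure-Python truncated SVD built from power iteration with deflation in place of Spark MLlib's distributed SVD. All function names are hypothetical.

```python
import math

def leading_triplet(a, iterations=300):
    """Power iteration on A^T A: (u, sigma, v) for the largest singular value of A."""
    n_rows, n_cols = len(a), len(a[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iterations):
        u = [sum(a[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        w = [sum(a[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(a[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    return [x / sigma for x in u], sigma, v

def truncated_svd(a, k):
    """Top-k singular triplets via repeated power iteration and deflation."""
    residual = [row[:] for row in a]
    triplets = []
    for _ in range(k):
        u, sigma, v = leading_triplet(residual)
        triplets.append((u, sigma, v))
        # Subtract the rank-1 component just found so the next pass finds the next one.
        residual = [[residual[i][j] - sigma * u[i] * v[j] for j in range(len(v))]
                    for i in range(len(u))]
    return triplets

def cosine(x, y):
    dot = sum(p * q for p, q in zip(x, y))
    return dot / (math.sqrt(sum(p * p for p in x)) * math.sqrt(sum(q * q for q in y)))

# Toy corpus (an assumption for illustration): "car" and "automobile" never co-occur,
# but both appear alongside "engine".
docs = ["car engine", "automobile engine", "car road", "flower petal", "petal garden"]
terms = sorted({t for d in docs for t in d.split()})

# Term-document matrix: one row per term, one column per document, raw counts.
matrix = [[float(d.split().count(t)) for d in docs] for t in terms]

triplets = truncated_svd(matrix, k=2)

def term_vector(term):
    """Coordinates of a term in the 2-dimensional concept space."""
    i = terms.index(term)
    return [sigma * u[i] for u, sigma, _ in triplets]

def doc_vector(j):
    """Coordinates of document j in the 2-dimensional concept space."""
    return [sigma * v[j] for _, sigma, v in triplets]

raw = cosine(matrix[terms.index("car")], matrix[terms.index("automobile")])
latent = cosine(term_vector("car"), term_vector("automobile"))
print("raw cosine(car, automobile):   ", raw)
print("latent cosine(car, automobile):", round(latent, 3))
```

In the raw count matrix the two terms share no documents, so their cosine similarity is zero; in the reduced space they land on the same concept axis. At Wikipedia scale the dense nested lists here would be replaced by a distributed sparse matrix, which is the role the abstract assigns to MLlib.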

Speakers

Sandy Ryza

Cloudera

@sandysifting

Sandy is a data scientist at Cloudera focusing on Apache Spark and its ecosystem, and an author of the upcoming O'Reilly publication Advanced Analytics with Spark. He is a frequent Spark contributor as well as a member of the Apache Hadoop project management committee...

Saturday April 25, 2015 10:20am - 11:00am PDT
Theater

Track B

Text By the Bay
