‘Million Song Project’ at Columbia U. Seeks to Build Better Internet Radio

A new collaboration between Columbia University researchers and The Echo Nest, a company that tracks online music and delivers listening suggestions to users, hopes to take the human element out of Internet radio. One goal is to deliver better recommendations and more songs through improved artificial intelligence.

A giant set of Echo Nest data, which includes identifying features for one million popular songs, will make it easier for researchers to develop algorithms that can tag and recommend music to people, says Daniel P.W. Ellis, an associate professor of electrical engineering at Columbia.

At popular music-recommendation services like Pandora, that work is still done by individual people, says Mr. Ellis, who heads up the Laboratory for the Recognition and Organization of Speech and Audio at Columbia.

He says the large data set also solves a problem that has plagued music researchers for years.

Previously, researchers looking to study the underlying data patterns in music had to build their own libraries from scratch, and copyright issues limited their ability to share those libraries.

“Everybody was in their own little pocket,” says Brian Whitman, co-founder and chief technology officer for The Echo Nest. “I would have results on 1,000 songs, but no one could replicate that.”

The scale of the set is also important because it gives researchers a larger pool of data from which to detect those underlying patterns.

Mr. Ellis is studying cover songs to determine how similar they really are. The new data set gives him more material to work with—there are at least 15 versions of the song “Louie, Louie,” for example—which makes his findings more meaningful.

This research could have a very practical application. There’s been a long history of lawsuits contending that one artist copied the work of another. Those are now decided with the help of experts who listen to disputed songs to determine their similarities.

“We’d like to be able to quantify that a bit more,” Mr. Ellis says.

Each of the million songs in the collection is broken down into a series of approximately 1,000 “events,” which could be a single note, chord, or syllable, Mr. Ellis says.

There are no audio tracks, though the songs are linked to an outside library of audio files that lets professors listen to short snippets of the music.

The project was financed by a grant from the National Science Foundation’s Grant Opportunities for Academic Liaison with Industry, or GOALI, program.

Columbia doctoral student Thierry Bertin-Mahieux selected the songs for inclusion in the data set using a “hotness” ranking developed by The Echo Nest from an automated analysis of written content about a particular song.

So far, the data set has been shared with a handful of institutions, including the University of San Diego and New York University, Mr. Ellis says. He hopes the research that comes out of it will encourage other researchers to use the data as well.

Beyond commercial applications, he says the data can help answer a fundamental question about what makes music what it is: “What is the common underlying structure that separates this from random data?”
