Creating a global database of musicians on Postgres, using Elixir, Discogs’ XML data dump and sweet_xml library

Abbas Tolgay Yılmaz
3 min readSep 10, 2020

Hello, enduring citizens of Pandemia.

These days mostly working on a set of projects to introduce new efficiency and marketing tools for artists, especially for the musicians.

Hence, found myself searching for the best suiting data source, containing musicians and bands. I’ve found this Wikipedia page for the stats on number of artists contained by variety of platforms. I’ve decided to go with Discogs data.

For the purpose of brevity, we will extract only the name and discogs_id of those artists.

Picture, from sticker:

“11. Gängeviertel Geburtstag — Welcome to Pandemia”

Will be using discogs_id later to query musician images through discogs’ API.

Note: I’ve found that, sweet_xml library seems to be working best with below combination:

Elixir 1.7.0 + Erlang 20.0

I recommend a tool called asdf for version management of both languages, as well as node.js.

Sweet.. Let’s get coding.

$ mix new xml_mania --sup
$ cd xml_mania

add below line to your dependencies:

# mix.exsdefp deps do[  {:ecto_sql, "~> 3.0"},  {:postgrex, ">= 0.0.0"},  {:sweet_xml, "~> 0.6.6"}]

Fetch dependencies:

$ mix deps.get

Lets also define a “Musician” schema to use upon parsing; and before inserting to our schema.

# lib/xml_mania/musician.exdefmodule XmlMania.Musician do  use Ecto.Schema  schema "musicians" do    field :name, :string    field :discogs_id, :integer   endend

Let’s write the functionality required to parse the document, then write to Postgres, on a table called “musicians”, as defined above.

If you will populate the table within an existing application of yours, you will need to do three things.

One is to define a repository for the application to connect to an existing repository of another project.)(I am developing this data tool separate from my main project.)

$ mix ecto.gen.repo -r <ExistingAppsName>.Repo

Second is to provide the credentials for the existing database. Change <ExistingAppsName> and dotted “…” parts with yours.

As you might already know, default values of username/password pair are “postgres”.

# config/config.exsuse Mix.Config  config :xml_mania, <ExistingAppsName>.Repo,    username: " ... ",    password: " ... ",    database: "<ExistingAppsName>",    hostname: "localhost",    pool_size: 20 config :xml_mania, ecto_repos: [<ExistingAppsName>.Repo]

Third is to add this to the apps’ supervision tree:

#lib/xml_mania/application.exdefmodule XmlMania.Application do
use Application

children = [
<ExistingAppsName>.Repo
]

Let’s write the actual code to get things done here:

# lib/xml_mania.exdefmodule XmlMania do  import SweetXml  # [number of musicians or bands 7,046,615]  @musicians_path “data/xml/discogs_20200806_artists.xml”  alias XmlMania.Musician  alias <ExistingAppsName>.Repo # parse_with_streams function to trigger all functionality. 
#(read-transform-write)
defp parse_with_streams do artists_stream = File.stream!(@musicians_path) result = artists_stream # discarding other tags containing nested <name> attributes |> stream_tags(:artist, discard: [:namevariations, :aliases, :groups]) |> Stream.map(fn {:artist, doc} -> extract_name_and_id(:artist, doc) end) |> Stream.filter(fn name -> name != “” end) # some were empty |> Enum.to_list IO.puts “We have a list now. Will start generating records from the list items.” generate_records_from_list(result)end defp extract_name_and_id(:artist, doc) do name = xpath(doc, ~x”./name/text()”s) discogs_id = xpath(doc, ~x”./id/text()”i) {name, discogs_id} end
defp generate_records_from_list(names_discogs_ids) do names_discogs_ids |> Enum.map(fn {name, discogs_id} -> Repo.insert!( %Musician{name: name,
discogs_id: discogs_id}) end)
endend

Make sure your Postgres database is up and running; then get in the Interactive Elixir shell:

$ iex -S mix

Then issue that one command to persist them all>

Note: Running below command took my device (16 gb RAM) around 1.5 hours and Database size was around 750 mb (with couple of more fields that I’ve omitted in this article).

iex(1)> XmlMania.parse_with_streams

Now you should be able to have a decent table on your Postgres database with all contained musicians/bands’ names shared by the discogs platform.

On the next article, will be creating Elasticsearch index from this data within a Phoenix app.

Stay tuned!

--

--