Word clouds are a simple way of visualizing word frequency in a corpus of text. They display the most frequently used words in the corpus, with more frequent words appearing in larger text.
Here is the data file I will be using in this example if you want to follow along:
As far as libraries go, you will need pandas, matplotlib, os, and wordcloud. The os module ships with Python itself. If you are using the Anaconda Python distribution, you should already have everything except wordcloud, which you can install with pip or conda.
Let's start by loading the data.
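If you are not sure which of these libraries are already installed, a quick check like the one below can tell you. This is only a minimal sketch: the helper name missing_packages is made up for this example, and it only checks that each package can be found, not which version you have.

```python
import importlib.util

# Third-party packages used in this walkthrough; os ships with Python itself.
REQUIRED = ["pandas", "matplotlib", "wordcloud"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Install these first:", ", ".join(missing))
else:
    print("All required libraries are available.")
```

Anything the check lists still needs to be installed before the import lines will run.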
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import os

# Set working directory
os.chdir('C:\\Users\\blars\\Documents')

# Import CSV
df = pd.read_csv("movies.csv")

# First look at the data
df.head()
** Note: if you are using Jupyter notebooks to run this, add %matplotlib inline on its own line right after the matplotlib import; otherwise you may not be able to see the word cloud.
import matplotlib.pyplot as plt
%matplotlib inline
We can use df.info() to look a little closer at the data. It lists each column along with its data type and non-null count.
We have to decide what column we want to build our word cloud from. In this example I will be using the title column, but feel free to use any text column you would like.
Let's look at the title column.
As you can see, we have 20 movie titles in our data set. The next thing we have to do is merge these 20 rows into one large string.
corpus = " ".join(tl for tl in df.title)
The code above is basically a one-line for loop (a generator expression). It walks through every row in the column df.title and joins the titles into one large string, separating them with a space " ".
Now build the word cloud.
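To make the join step concrete, here is a small sketch that uses a plain Python list in place of df.title. The three titles are made up for illustration and are not taken from the data file.

```python
# A small list standing in for df.title (made-up sample, not the real data)
titles = ["Toy Story", "Jumanji", "Heat"]

# The one-liner from above: join every title, separated by a space
corpus = " ".join(tl for tl in titles)

# The same thing written as an explicit for loop
pieces = []
for tl in titles:
    pieces.append(tl)
corpus_loop = " ".join(pieces)

print(corpus)       # Toy Story Jumanji Heat
print(corpus_loop)  # Toy Story Jumanji Heat
```

One thing to watch for: join only works on strings, so a column with missing values will raise a TypeError. Dropping them first, for example with df.title.dropna(), is one way around that.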
wordcloud = WordCloud(width=640, height=480, max_words=20).generate(corpus)
You can change the width, the height, and the maximum number of words that will appear. Play around with the numbers and see how they change your output.
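If you want a rough idea of which words will come out largest, you can count them yourself. The sketch below uses a made-up corpus and a very simple split on spaces. WordCloud does its own counting (by default it also drops common stopwords such as "the" and can group word pairs), so treat this as an approximation rather than what the library does internally.

```python
from collections import Counter

# Made-up corpus standing in for the joined titles
corpus = "The Dark Knight The Dark Tower Knight Rider"

# Lowercase everything and split on whitespace, then count each word
counts = Counter(corpus.lower().split())

# The most common words are the ones a cloud would tend to draw largest
top_words = counts.most_common(3)
print(top_words)
```

With max_words set lower than the number of distinct words, only the most frequent ones make it into the picture.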
Finally, let’s chart it, so we can see the cloud
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
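interpolation="bilinear" smooths the rendered image so the words do not look pixelated. It does not control word direction; whether words run sideways or up and down is decided by WordCloud itself (see its prefer_horizontal parameter).
plt.axis("off") gets rid of the axis markers (see below)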
wordcloud = WordCloud(width=640, height=480, background_color='white', max_words=25).generate(corpus)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
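If you want to experiment with word direction as well, WordCloud has a prefer_horizontal parameter (0.9 by default) that sets how often it tries to place a word horizontally. Here is a minimal sketch of one way to play with it. The helper names cloud_options and draw_cloud are made up for this example, and draw_cloud assumes wordcloud and matplotlib are installed.

```python
def cloud_options(prefer_horizontal=0.9, **overrides):
    """Collect keyword arguments for WordCloud (hypothetical helper)."""
    options = {
        "width": 640,
        "height": 480,
        "background_color": "white",
        "max_words": 25,
        "prefer_horizontal": prefer_horizontal,
    }
    # Any extra keyword arguments replace the defaults above
    options.update(overrides)
    return options


def draw_cloud(corpus, **kwargs):
    """Build and show a word cloud; needs wordcloud and matplotlib installed."""
    import matplotlib.pyplot as plt
    from wordcloud import WordCloud

    wordcloud = WordCloud(**cloud_options(**kwargs)).generate(corpus)
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()


# Example calls (run these once corpus is defined):
# draw_cloud(corpus, prefer_horizontal=1.0)  # every word horizontal
# draw_cloud(corpus, prefer_horizontal=0.5)  # tries vertical about half the time
```

Keeping the settings in one place makes it easy to change a single number and redraw the cloud.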