Week 2 — Poefier

Ceren Korkmaz
BBM406 Spring 2021 Projects
Apr 18, 2021


Text classification, also known as text tagging or text categorization, is the process of sorting text into organized groups. Using Natural Language Processing (NLP), text classifiers can automatically analyze text and assign it a set of predefined tags or categories based on its content. Common use cases for automatic text classification include sentiment analysis, topic detection, and language detection [1].

In our previous article, we described what our project is about. This week we focused on our data and spent most of our time analyzing it. We have 573 poems, along with their names, authors, ages, and types. All of the poems are stored in a CSV file, so we started by reading that file. Then we separated the relevant columns and checked for missing or misclassified entries. As the chart below shows, 2 poem names are missing. This will not affect our project because we work only with the poems’ ages and contents.
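The loading and missing-value check described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not necessarily the code we ran. The column names and the placeholder rows in `SAMPLE_CSV` are assumptions made for the example; the real file may be laid out differently.

```python
import csv
import io

# Placeholder rows standing in for the real CSV file.
# The column names and values here are assumptions, not the actual dataset.
SAMPLE_CSV = """author,content,poem name,age,type
Poet A,"First poem text, with a comma",First Poem,Renaissance,Love
Poet B,Second poem text,,Modern,Nature
Poet C,Third poem text,Third Poem,Modern,Love
"""

def load_poems(handle):
    """Read every CSV row into a dict keyed by column name."""
    return list(csv.DictReader(handle))

def count_missing(rows):
    """Count empty or whitespace-only cells per column."""
    if not rows:
        return {}
    missing = {column: 0 for column in rows[0]}
    for row in rows:
        for column in missing:
            value = row.get(column)
            if value is None or not value.strip():
                missing[column] += 1
    return missing

poems = load_poems(io.StringIO(SAMPLE_CSV))
missing = count_missing(poems)
print(missing)
```

With the real file, `io.StringIO(SAMPLE_CSV)` would be replaced by an opened file handle, for example `open(path, newline="", encoding="utf-8")`, where `path` points to the dataset.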

After parsing the data, we prepared the table below for a clearer analysis. We found it appropriate to group the poems by type. As you can see, 326 of them are love poems, 59 are mythology & folklore, and 187 are nature.

Now, let’s see those results in a pie chart. Mythology & Folklore poems are labeled “Other”.
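For readers who want the numbers behind the pie chart, the slice sizes can be computed from the counts reported above. This is a small sketch; the function name `pie_shares` is made up for the example. Note that the three reported counts add up to 572, so the shares here are taken relative to that sum.

```python
# Counts per poem type as reported in this article.
# "Other" stands for Mythology & Folklore, as in the chart.
type_counts = {"Love": 326, "Nature": 187, "Other": 59}

def pie_shares(counts):
    """Convert raw counts into percentage shares of their total."""
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}

shares = pie_shares(type_counts)

# Print the slices from largest to smallest.
for label, pct in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{label}: {pct:.1f}%")
```

The same dictionary of counts could be passed to a plotting library to draw the chart itself.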

Next, we wanted to classify our poems by age. Each poem is either Modern or Renaissance, and the related pie chart is below.

Here we share some other results and graphs from our dataset that we find interesting. William Shakespeare has the most poems in our data, with 71. There are 67 different authors in the dataset. We have 3 different types of poems, and the majority of them (326 out of 573) are about “Love”.
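Statistics like the most frequent author and the number of distinct authors can be obtained with a simple frequency count. The sketch below assumes the author column has already been extracted into a list; the names in `authors` are placeholders, not values from our dataset.

```python
from collections import Counter

# Placeholder author column; the real one would come from the CSV file.
authors = ["Poet A", "Poet B", "Poet A", "Poet C", "Poet A"]

author_counts = Counter(authors)

# most_common(1) returns a list with the single (author, count) pair
# that has the highest count.
top_author, top_count = author_counts.most_common(1)[0]
distinct_authors = len(author_counts)

print(f"{top_author} has the most poems ({top_count})")
print(f"{distinct_authors} different authors")
```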

Finally, we created a word cloud for each category to understand the dataset better. Some of the words are difficult to understand and differ from the English we are used to. This is because many of the poems belong to the Renaissance period, and poets such as Shakespeare use grandiloquent language. The results are shown below.

First, here are the word clouds grouped by poem type.

We also want to share the word clouds grouped by poem age.

Again, because the poems’ language is grandiloquent and the English is old, words that count as stopwords are sometimes used to express emotion. So although we removed modern-day English stopwords, we have not yet removed the old ones from the dataset. We intend to use them when extracting and comparing precision results in future studies.
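To illustrate the difference between removing only modern stopwords and also removing old ones, here is a minimal sketch of the word counting that sits behind a word cloud. Both stopword sets are short, hypothetical examples chosen for the illustration, and the two sample lines are invented; they are not the lists or poems we actually used.

```python
import re
from collections import Counter

# Small illustrative stopword sets (assumptions, not our real lists).
MODERN_STOPWORDS = {"the", "and", "of", "to", "a", "in", "is", "my", "that", "with"}
ARCHAIC_STOPWORDS = {"thou", "thee", "thy", "thine", "hath", "doth"}

def word_frequencies(texts, stopwords):
    """Count lowercase words across texts, skipping the given stopwords."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z']+", text.lower()):
            word = word.strip("'")
            if word and word not in stopwords:
                counts[word] += 1
    return counts

# Invented sample lines, used only to show the effect.
sample = [
    "Thou art the rose of my heart",
    "The rose doth bloom and thou dost see",
]

modern_only = word_frequencies(sample, MODERN_STOPWORDS)
with_archaic = word_frequencies(sample, MODERN_STOPWORDS | ARCHAIC_STOPWORDS)

print(modern_only.most_common(3))
print(with_archaic.most_common(3))
```

Keeping the two sets separate makes it easy to compare results with and without the old stopwords later on.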

[1] https://monkeylearn.com/what-is-text-classification/

You can reach the dataset from here.

Group Members

Alihan Karatatar — 21904324

Atakan Yüksel — 21627892

Ceren Korkmaz — 21995445
