Photo by The Creative Exchange on Unsplash

According to Wikipedia, unstructured data is described as “information that either does not have a pre-defined data model or is not organized in a pre-defined manner.”

Unfortunately, computers aren’t like humans

Machines cannot read raw text in the same way that we humans can. When we are working with textual data, we cannot go from our raw text straight to our machine learning model. Instead, we must follow a process of first cleaning the text and then encoding it into a machine-readable format.

Let’s cover some ways we can clean text. In another post, I’ll cover ways we can encode text.

When we write, we capitalize various words in our sentences and paragraphs for different reasons. For example, we start a new sentence with a capital letter, and we capitalize the first letter of a proper noun to indicate that we are talking about a particular place, person, etc. A human can read a text and intuitively tell that the “The” at the beginning of a sentence is the same word as the “the” found later in the middle of the sentence. A computer cannot: a machine sees “The” and “the” as 2 different words.

Therefore, it’s important to normalize the case of our words so that every word is in the same case and the computer doesn’t process the same word as 2 different tokens.

# Python Example
text = "The UK lockdown restrictions will be dropped in the summer so we can go partying again!"

# lowercasing the text
text = text.lower()
print(text)

> the uk lockdown restrictions will be dropped in the summer so we can go partying again!

Removing Stopwords

In the majority of natural language tasks, we want our machine learning models to identify the words within a document that provide value to the document. For example, in a sentiment analysis task, we want to find the word (or words) that tip the sentiment of the text in one direction or the other.

In the English language (I believe the same would be true for most languages, but don’t quote me), some words are used more frequently than others without necessarily adding more value to a sentence, so it is usually safe to ignore them by removing them from our text.
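To make the stopword idea concrete, here is a minimal sketch that lowercases a sentence and then filters out stopwords. The `STOPWORDS` set and the `remove_stopwords` helper are my own illustrative assumptions: the list is tiny and hand-picked, not a standard one, and the punctuation handling only covers ASCII punctuation.

```python
import string

# Hypothetical, hand-picked stopword list for illustration only;
# real projects usually rely on a much longer, curated list.
STOPWORDS = {"the", "will", "be", "in", "so", "we", "can", "a", "an", "and", "of", "to", "is"}


def remove_stopwords(text, stopwords=STOPWORDS):
    """Lowercase the text, strip ASCII punctuation, and drop stopword tokens."""
    # Normalize case first so "The" and "the" match the same stopword entry.
    lowered = text.lower()
    # Remove ASCII punctuation so "again!" becomes "again".
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace and keep only the tokens that are not stopwords.
    return [token for token in no_punct.split() if token not in stopwords]


text = "The UK lockdown restrictions will be dropped in the summer so we can go partying again!"
tokens = remove_stopwords(text)
print(tokens)
# ['uk', 'lockdown', 'restrictions', 'dropped', 'summer', 'go', 'partying', 'again']
```

Notice that lowercasing happens before the filter; without it, the leading “The” would not match the lowercase entry in the set and would slip through. In practice you would probably swap the hand-written set for a fuller list, such as the English stopword list that ships with NLTK.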