Monday, July 31, 2017

How to calculate the number of words in a sentence and how to use that in facets/the if function

I had a set of data where I needed to process only sentences longer than a certain number of words. Fortunately, doing so is relatively easy.

First of all, the words in a sentence can be split up into an array using value.split(" ").

Secondly, the length() function in OpenRefine returns the length of an array, that is, its total number of elements. Since each element is a word, putting the two together gives the number of words: length(value.split(" "))
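
To sanity-check the logic outside OpenRefine, here is a minimal Python sketch of the same split-then-count idea. The function name word_count is hypothetical, and Python's split(" ") is only an analogue of the GREL call: in Python, repeated spaces produce empty pieces that inflate the count, and this sketch makes no claim about how GREL treats them.

```python
def word_count(sentence: str) -> int:
    """Rough Python analogue of the GREL expression length(value.split(" "))."""
    # Split on single spaces, then count the resulting pieces.
    return len(sentence.split(" "))


print(word_count("How to count words in a sentence"))  # 7
```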



How is this useful? Well, the number generated by length() is an actual number rather than a string. That means it can be used in numeric comparisons, which evaluate to true or false.

For example: length(value.split(" ")) < 7 will give you true if your sentence is 6 words or fewer, and false if it's 7 words or more.

Other comparisons are:
length(value.split(" ")) > <some number> - yields true if the sentence has more than <some number> words, false otherwise

length(value.split(" ")) <= <some number> - yields true if the sentence has <some number> words or fewer, false otherwise

length(value.split(" ")) >= <some number> - yields true if the sentence has <some number> words or more, false otherwise

length(value.split(" ")) == <some number> - yields true only if the sentence has exactly <some number> words
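
The comparisons above can be tried on a sample value with a short Python sketch. The title string is a made-up 6-word example and word_count is a hypothetical helper; the operators behave the same way as in the GREL expressions, but the code itself is Python, not GREL.

```python
def word_count(sentence: str) -> int:
    # Python stand-in for length(value.split(" "))
    return len(sentence.split(" "))


title = "one two three four five six"  # hypothetical 6-word value
n = word_count(title)

print(n < 7)   # True: 6 words or fewer
print(n > 6)   # False: not more than 6 words
print(n <= 6)  # True: 6 words or fewer
print(n >= 7)  # False: fewer than 7 words
print(n == 6)  # True: exactly 6 words
```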

Now, say I only want titles that are 4 words or more. I could make a custom text facet with the expression:
length(value.split(" ")) >= 4

All I have to do is click on "true" in the facet and I've isolated the rows whose titles are 4 words or more.

The comparison generates true/false values, as the facet shows. This means that you can use it as the test condition for an if function.


Say I only wanted to edit titles that were 4 words or longer. My if function would be:
if(length(value.split(" ")) >= 4, <whatever editing expressions I needed>, value)
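
The shape of that if function, with a test, a result when true, and the unchanged value otherwise, can be sketched in Python as follows. The function name edit_if_long is hypothetical, and upper-casing is only a placeholder standing in for whatever editing expressions are actually needed.

```python
def edit_if_long(title: str, min_words: int = 4) -> str:
    """Edit the title only when it has at least min_words words."""
    if len(title.split(" ")) >= min_words:
        # Placeholder edit (assumption): upper-case the title.
        return title.upper()
    # Shorter titles are returned unchanged, like the final `value` argument.
    return title


print(edit_if_long("a tale of two cities"))  # A TALE OF TWO CITIES
print(edit_if_long("war and peace"))         # war and peace
```

Returning the original value in the else branch matters: without it, titles that fail the test would be replaced rather than left alone.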

