Friday, August 18, 2017

Recipe and walkthrough: counting up repeats of a word in a sentence



How do you count up the occurrences of <word> in a sentence? (or, in this case, find how many times "Nintendo" is in a 538?

sum(forEach(value.split(" "), temp, if(temp.contains("Nintendo"), 1, 0)))
 
Explanation:


If you're unfamiliar with arrays, please read this post. Otherwise, the rest won't make sense.
So, I know that forEach will analyze each element in an array, do whatever command you ask it to do for that element, and then it pushes the result onto a separate array. This means that if you split up a sentence and stick that into forEach, it will analyze each word. This is the basis on which to count words up - you need GREL to check each word, then do something so that it knows to count "Nintendo".

forEach() is
forEach(<expression to generate an array>, <copy generated array to a named temporary array>, <do something with each element in the temporary array, and push the result onto a results array>)

The copying your array to a separate array is a programming convention. It keeps you from corrupting the data in the original and allows you to have a clean copy if you want to do some different manipulation. Normally, in programming you have to write out the copying action, but GREL will do it for you as long as you give the temporary array a name. I'm just going to call it "temp".
 
So, so far, we have:
forEach(<expression to generate an array>, temp, <do something with each element in the temporary array, and push the result onto a results array>)
Now we tackle <expression to generate an array>. forEach won't work if you just stick value in there - something has to be done to it to put it in array format. Either value.split, splitByCharType, the partition functions, or value.match() with capturing groups will work.

However, since we want to analyze each word, the easiest thing to do is break apart the sentence into an array using value.split(" ") and dropping that in forEach :
forEach(value.split(" "), temp, <do something with each element in the temporary array, and push the result onto a results array>)

How to write the <do something with each element> phase
So, for <do something to each element>, forEach is basically going to access each word in the sentence, and I have to figure out what I want to do with it. What helps to visualize this is to look at the array generated by value.split(" ") and note down what you want to do when you see each element.
For example, this line:
$aSystem requirements: Nintendo Entertainment System
Will be broken up this way after value.split(" ")
Index                 Element
0                       $aSystem
1                       requirements:
2                       Nintendo
3                       Entertainment
4                       System 
So, here's where I decide on my actions:
Index                 Element                           Action
0                       $aSystem                        Don't count
1                       requirements:                   Don't count
2                       Nintendo                          Count
3                       Entertainment                   Don't count
4                       System                           Don't count
Now, in the <do something with each element in the temporary array, and push the result onto a results array>, you need to write a function that will generate something to write to the results array. In this phase, do not write another array generating statement by itself. This will confuse both you and Open Refine. What you're looking for are functions that will either just generate a boolean, a string of some sort, or a number. You can even drop and(), or(), or an if() function in there.

Since I basically want to flag whenever I see "Nintendo", and count that, and if I see something that doesn't say "Nintendo", and I don't want to count it, that means that my <do something> is an if function. Anytime you
want to do one thing to a certain criteria, that's an if.

So, how do I count up "Nintendo"? I looked through GREL, and as far as I can determine, there are no functions that will add up array elements as you go along. All functions seem to work on the array after you've filled it, so my next step for the results array is to fill it.

The question is, what should I fill the results array with? Since I want to sum up the number of times I see "Nintendo", a good element would be a 1. That way, I have something to add with. For things I don't want to count, I'll just use a zero.

The whole <do something> phase sounds like an if statement, with temp.contains("Nintendo") as the test condition.
or
if (temp.contains("Nintendo"), <write 1 to array>, <write 0 to array>)

Since I wasn't sure if just writing something in quotes would insert it into the array, I just did this against my test data:
forEach(value.split(" "), temp, if(temp.contains("Nintendo"), "count", "ignore"))
(I used "ignore" instead of "don't count", because it's easier to read)

and looked at what happened:

So...now that I know I can write any value in quotes to an array, my if() becomes:

if(temp.contains("Nintendo"), "1","0")
Now my GREL is:

forEach(value.split(" "), temp, if(temp.contains("Nintendo"), "1", "0"))

Testing it again:



So, forEach has generated this results array, but I can't do anything with it. I promptly went to the GREL wiki and looked under array functions to see if there was anything that could add up the elements in an array.

Bingo! sum(<any array>) adds up the elements within the array. Since forEach generates a results array, I can wrap sum around it to add it up.

So, I try:
sum(forEach(value.split(" "), temp, if(temp.contains("Nintendo"), "1", "0")))

As you can see, it didn't work. And the reason why is that "1" is a string and not an integer. I originally thought that I'd have to layer another function to convert a string to an integer, but it turns out that simply putting 1 and 0 without quotes will write them to the results array as integers.

Just for the record, this is what an array of integers looks like:


So, then all I have to do is wrap sum around it:
sum(forEach(value.split(" "), temp, if(temp.contains("Nintendo"), 1, 0)))



This assumes that my platform names are written consistently. If they're not, either do a value.replace to fix, or use isNotNull(value.match()) with regex for your two test conditions.
This piece of GREL will only sum up repeats that you tell it to, so if you need a count for different repeat words in the same column, you should probably facet on that word and create separate columns for each count. Then blank out all zeroes and do a concatenate across all columns. Or, if you want to get really layered, drop the forEach() into another if to create column new:
if(value.contains("Nintendo"),
sum(forEach(value.split(" "), temp, if(temp.contains("Nintendo"), 1, 0))), "")

Then, on column new, do a transform. Change the if to pick out terms you didn't look for previously and sum them up, but don't erase the values you already generated. So, for example, if column contents had Nintendo and PlayStation platforms, I would first generate a new column with:

if(value.contains("Nintendo"),
sum(forEach(value.split(" "), temp, if(temp.contains("Nintendo"), 1, 0))), "")
Then, on the new column, I would do a transform and do this if:
if(cells.contents.value.contains("PlayStation"),
sum(forEach(cells.contents.value.split(" "), temp, if(temp.contains("PlayStation"), 1, 0))), value)

No comments: