Monday, July 3, 2017

Introduction to loops and forRange tutorial with variable data

Problem : I have a column of cells with variable date ranges:
1907-1913
1910-1911
1908-1911
etc.

The cells have to be converted to:
1907; 1908; 1909; 1910; 1911; 1912; 1913
1910; 1911
1908; 1909; 1910; 1911

etc.

Open Refine can be used to convert this. Using add column based on this column, you would use the following GREL command:

forRange(toNumber(substring(value, 0,4)), toNumber(substring(value, 5,9)) + 1, 1, currentYear, currentYear).join("; ")


Yes, it looks daunting. But let's break it down, starting with the forRange command:



forRange (<starting value>, <ending value>,  <integer to increment by>, <variable name>, <do something to variable>)

forRange is an example of a function that loops. In English, a loop has a defined starting condition, a test, and an exit condition. It will execute commands until the exit condition is reached. How it does so is by testing the starting condition after each iteration. Once the starting condition agrees with the exit condition, the loop stops. Most tests are mathematical, and forRange is one of them.

In short, forRange takes the starting integer value and increments it by one as long as it's less than the ending value. For each value, a function is executed.

Here's a detailed walkthrough of forRange:
1. forRange will take <starting value> and copy it into <variable name>. A variable is a way for the computer to store and manipulate data. A variable name is basically a human readable label for a variable. It can be anything, but best practice is to not name it "variable" and to make it meaningful (so other people can figure out what you're manipulating)

And why does Open Refine need to copy it in the first place? This is a programming best practice. When you want to manipulate data, you always make a copy of the thing you want to manipulate, in case you need the original data for something else.

2. <variable name> is evaluated to see if it's less than <ending value>.

    3a.  If it is, forRange exits and returns the temporary array.
    3b. If not, <do something to variable> is executed.
           "do something" can be any GREL command. You could simply output the value, you could do some text manipulation, you could do array manipulation, and you could even drop in a conditional function.

     4. Whatever's generated by that expression is then stuffed into the first empty element in a temporary array.

     5.<variable name> is then incremented by <integer to increment by>.

     6. Repeat steps 2-5 until <variable name> equals <ending value>. 

7. When <variable name> equals <ending value> forRange exits immediately and doesn't execute for <ending value> because the test is less than, not less than or equals to. Therefore, make sure that ending value = <the last value you want to process> + 1

Okay, that's theoretically great. How does this apply to actual data?
Example:
I want to generate the string:  1903; 1904; 1905

To do this, I first tried:
forRange(1903, 1906, 1, currentYear, currentYear + ";").join(" ")
1. Notice that the end date is 1906 and not 1905. You will see the reason for this in steps 10-15.

2. Open Refine copies 1903 into currentYear, then it tests to see if currentYear < 1906. currentYear = 1903, so forRange starts to execute.

3. Expression currentYear + ";" is evaluated - that creates the string 1903; 
currentYear remains 1903 because the <do something to currentYear> expression doesn't overwrite the variable.

4. 1903; is stuffed into a temporary array. Array is now ["1903;"]

5. currentYear is still 1903

6. forRange increments currentYear by 1, so currentYear becomes 1904.

7. currentYear < 1906 is tested. It's true, so forRange continues to execute.

8. Expression currentYear + ";" is evaluated - that creates the string 1904; 

9. 1904; is stuffed into the temporary array. Element 0 is full, so it's copied into element 1, instead. Array is now ["1903;", "1904;"]

10. forRange increments currentYear by 1, so currentYear becomes 1905.

11. currentYear < 1906 is tested. It's true, so forRange continues to execute.

12. Expression currentYear + ";" is evaluated - that creates the string 1905; 

13. 1905; is stuffed into the temporary array. Element 0-1 are full, so it's copied into element 2, instead. Array is now ["1903;", "1904;", "1905;"]

14. forRange increments currentYear by 1, so currentYear becomes 1906.

15. Expression currentYear < 1906 is evaluated. 1906 < 1906 is false, so the temporary array  ["1903;", "1904;", "1905;"] is returned.

16. Open Refine will just hang onto the array until you execute join(" ").

17. As you noticed, when you do forRange(1903, 1906, 1, currentYear, currentYear + ";").join(" "), 1903; 1904; 1905; is printed to the cell. Not really desirable. What would work better is to just stuff currentYear into the array and then insert "; " as a delimiter in the join. Or:
forRange(1903, 1906, 1, currentYear, currentYear).join("; ")
But I have variable data. How does forRange help?
Easy. You use a GREL function on value to generate the starting and ending values.

Since all of my strings are <4 character year>-<4 character year>, I'm going to use the substring function to extract the starting and ending years to use in forRange. 

Specifically:
<starting value> = substring(value,0,4) 
<ending value> = substring(value,5,9) +1 
(remember that the ending value has to be one year later because otherwise the loop will exit before the last year in the range is processed)

Plugging everything in, I come up with this GREL:
forRange(substring(value, 0,4)), substring(value, 5,9)  + 1, 1, currentYear, currentYear).join("; ")

It should work, right?

Wrong. It generates an error because the substring of something is counted as a string. In order for the looping to work, forRange has to be able to increment an integer.

Fortunately, GREL can convert a string to an integer using toNumber(<integer or number string or something that generates an integer or number string>)

Therefore:
<starting value> = toNumber((substring(value, 0,4))
<ending value> = toNumber(substring(value, 5,9)) + 1

Now, drop this into forRange:
forRange(toNumber(substring(value, 0,4)), toNumber(substring(value, 5,9)) + 1, 1, currentYear, currentYear).join("; ")

this one will work, too:
forRange(toNumber(value.substring(0,4)), toNumber(value.substring(5,9)) + 1, 1, currentYear, currentYear).join("; ")

No comments: