Monday, October 2, 2017

Recipe and walkthrough - Joining the first two cells in a column and moving the third up

 If I had this sample spreadsheet:











And I wanted to transform it to this spreadsheet:



I can use the column array conversion, a trick with array arithmetic, and nested ifs to do so in one GREL expression. Or I can use three transforms.

Assuming that all data in the "Subject" column is three subject and only three subjects per call#, and that there are no blank lines in the column, do the following:

1. Add a blank column.
2. Put in a placeholder value in first cell.
3. Move column to the beginning. If using an older version of Open Refine, switch to record view.
4. Transform with: if(mod(rowIndex, 3) == 0, row.record.cells["Subject"].value[rowIndex] + "; " +row.record.cells["Subject"].value[rowIndex+1], value)
5. Transform subject column again: if(mod(rowIndex, 3) == 1, row.record.cells["Subject"].value[rowIndex+1], value)
6. Last transform: if(mod(rowIndex, 3) == 2, "", value)
7. If you were in record view, switch back to rows. Delete placeholder column.

OR:

1. Add a blank column.
2. Put in a placeholder value in first cell.
3. Move column to the beginning. If using an older version of Open Refine, switch to record view.
4. Transform on the subject column and use this piece of GREL:

if(mod(rowIndex, 3) == 0, row.record.cells["Subject"].value[rowIndex] + "; " +row.record.cells["Subject"].value[rowIndex+1], if(mod(rowIndex,3)== 1, row.record.cells["Subject"].value[rowIndex+1], ""))
5. If you were in record view, switch back to rows. Delete placeholder column.


How this trick works is by a bit of arithmetic trickery. As you've seen in this post, once you put a column with a single placeholder value at the beginning, that forces the columns to be read into the row.records.cells[<column name>].value array. Then using some array arithmetic with rowIndex can be used to edit whichever cells you choose.

This recipe builds on this technique, except I'm using math to select which cells I want to edit.

The trick only works if I want to do exactly the following for each title:
1. Concatenate the first two Subject terms together.
2. Shift the third subject up a cell.
3. There are no blank cells in the column.

To begin with, rowIndex is a constant - it is always 0 for the first cell in a column and just increments all the way down.  Like so:












 
Here's what the row.record.cells["Subject"].value array looks like after you put the placeholder column at the beginning of the sheet.

Index                   Element                  
0                           Video Games             
1                            United States            
2                            History                      
3                           Aztecs                        
4                           Social Conditions       
5                           History                        
6                           United States              
7                           Foreign Relations       
8                            1945-1989                 
9                             New England         
10                            History                     
11                            Dictionaries               

Now, go back to the rowIndex picture. As you can see, the indexes in the array are the same as rowIndex. If you take advantage of this, you can do a lot of math tricks to grab and edit the cells you want.

One question you should always ask yourself is "do the cells I want to edit fit some sort of mathematical pattern?" In the case of this edit, I looked at the data and sketched out the following:

1. There are 3 Subjects per title.
2. I always want to concatenate the first two Subjects.
3. I always want to shift the third subject up a cell.

In pseudocode, what I want to do is:
1. For every first subject
            -concatenate subject in cell to the value in the cell below.
2. For every second subject
            - replace the value in this cell with the value in the cell below.
3. For every third subject
            -blank out cell.



My next step was to see if there was something that I could do, arithmetically, to convert the rowIndex into a regular pattern. That's where the next step comes in.

Do you remember modulos from math? That's the remainder after you divide something. So, 0 modulo 3 = 0 for 0, 3, 6, etc., when you divide multiples of 3, because there is no remainder.

1 modulo 3 = 1 for 1, 4, 7, etc., because after dividing 1, 4, 7, etc. by 3, you have a remainder of 1.

2 modulo 3 = 2 for 2, 5, 8 etc. , because after dividing 2, 5, 8, etc. by 3, you have a remainder of 2.


The mod(<some number>, <divider>) function in Open Refine does a modulo operation of <some number> modulo <divider>. <some number> can be a specified constant number, like 0, but then it will only execute 0 modulo 3 for the entire sheet:













However, if you use rowIndex, it will do a modulo operation on each rowIndex value:
 












As you can see, the modulo operation marks the first subject term in each trio with a 0, the second with a 1, and the third with a 2. So, taking advantage of that pattern, we can now specify what operations we want.

How can we do that? It's because arithmetic operations generate true/false values.  How you write it is this way:
<arithmetic operation to execute> == <what you want it to equal>

If the executed operation equals what's on the right, GREL generates a true value. Otherwise, it's false. It is very important that you put two equals signs between the two, otherwise your operation won't work.

Example: 
If I typed in 2 +3 == 5 , GREL would executed 2+3, then see if it equals 5. Then it generates a true/false value and outputs it:

Now, see what happens if you put in 2+3 ==6 :

 











Since arithmetic operations generate true/false values, you can use it as a test condition for an if statement.
 
So, now, let's put this all together with the mod(rowIndex, 3) operation. As we saw above, that operation generated 0 for the first Subject, 1 for the second, or 2 for the third.

Now, let's go back to our pseudocode:

In pseudocode, what I want to do is:
1. For every first subject
            -concatenate subject in cell to the value in the cell below.
2. For every second subject
            - replace the value in this cell with the value in the cell below.
3. For every third subject
            -blank out cell.

So, since we know that mod(rowIndex,3) generates 0 for the first subject, let's drop that in our pseudocode:
1. if (mod(rowIndex, 3) == 0,
          -concatenate subject in cell to the value in the cell below if true
          -do nothing if false
2. For every second subject
            - replace the value in this cell with the value in the cell below.
3. For every third subject
            -blank out cell.


Double check your test condition:
 












The first subject is getting chosen correctly, so I can now proceed with the concatenation. As mentioned in the previous post, to concatenate a cell to the cell below with a delimiter, it's row.record.cells["<column name>"].value[rowIndex] + "<delimiter> " +
row.record.cells["<column name>"].value[rowIndex+1],


Dropping it in:
1. if (mod(rowIndex, 3) == 0,
          row.record.cells["Subject"].
value[rowIndex] + "; " +row.record.cells["Subject"].value[rowIndex+1]
         -do nothing if false

2. For every second subject
            - replace the value in this cell with the value in the cell below.
3. For every third subject
            -blank out cell.
Since I don't want to alter anything else, I'm going to specify "value" for the cells I don't pick out. Or,

 1.  if(mod(rowIndex, 3) == 0, row.record.cells["Subject"].value[rowIndex] + "; " +row.record.cells["Subject"].value[rowIndex+1], value)

2. For every second subject
            - replace the value in this cell with the value in the cell below.
3. For every third subject
            -blank out cell.
 
Now, onto the second transform. Since mod(rowIndex,3) == 0 for every first subject, it makes sense that mod(rowIndex,3) == 1 would grab every second subject.
 
Dropping it in:

 1.  if(mod(rowIndex, 3) == 0, row.record.cells["Subject"].value[rowIndex] + "; " +row.record.cells["Subject"].value[rowIndex+1], value)


2. if(mod(rowIndex, 3) ==1,
      - replace the value in this cell with the value in the cell below if true
      -do nothing if false

3. For every third subject
            -blank out cell.
 
Replacing the value of the current cell with the value in the cell below can be achieved by just outputting the contents of the cell below, or  row.record.cells["Subject"].value[rowIndex+1]
 
Drop that into the if: 
  1.  if(mod(rowIndex, 3) == 0, row.record.cells["Subject"].value[rowIndex] + "; " +row.record.cells["Subject"].value[rowIndex+1], value)


2. if(mod(rowIndex, 3) ==1, row.record.cells["Subject"].value[rowIndex+1]
           -do nothing if false

3. For every third subject
            -blank out cell.
 
Now, finish out the if:
 
 1.  if(mod(rowIndex, 3) == 0, row.record.cells["Subject"].value[rowIndex] + "; " +row.record.cells["Subject"].value[rowIndex+1], value)


2. if(mod(rowIndex, 3) ==1, row.record.cells["Subject"].value[rowIndex+1], value)

3. For every third subject
            -blank out cell.
 
All that's left now is mod(rowIndex, 3) == 2. Since this is an easy operation, you can just do the if right off:
 
1. if(mod(rowIndex, 3) == 0, row.record.cells["Subject"].value[rowIndex] + "; " +row.record.cells["Subject"].value[rowIndex+1], value)2. if(mod(rowIndex, 3) == 1, row.record.cells["Subject"].value[rowIndex+1], value)
3. if(mod(rowIndex, 3) == 2, "", value)

 
First transform code walkthrough
 if(mod(rowIndex, 3) == 0, row.record.cells["Subject"].value[rowIndex] + "; " +row.record.cells["Subject"].value[rowIndex+1], value)

Let's start at row 1, or rowIndex = 0. Drop the rowIndex value into the if:

 if(mod(0, 3) == 0, row.record.cells["Subject"].value[rowIndex] + "; " +row.record.cells["Subject"].value[rowIndex+1], value)

so, first, evaluation mod(0,3).  That's 0 divided by 3, take the remainder. The remainder is 0, so drop that into the evaluation phase of the if:
 if( 0 == 0, row.record.cells["Subject"].value[rowIndex] + "; " +row.record.cells["Subject"].value[rowIndex+1], value)

Anyway, 0 == 0 is true, so we evaluate the true "do something" which is:

row.record.cells["Subject"].value[rowIndex] + "; " +row.record.cells["Subject"].value[rowIndex+1]

First, let's drop in the current rowIndex, which is 0:
row.record.cells["Subject"].value[0] + "; " +row.record.cells["Subject"].value[0+1]

Now, look up element 0 in the 
row.record.cells["Subject"].value array and drop it in:

Video Games + "; " +row.record.cells["Subject"].value[0+1]

When you have arithmetic in an array index, you do the arithmetic first:
Video Games + "; " +row.record.cells["Subject"].value[1]

now, you go and retrieve what Element is under index 1:
Video Games + "; "  + United States

Now do the concatenation:
Video Games; United States

And then output that to the cell in row 1.

Now we go to row 2, or rowIndex = 1. Dropping it into the if statement:
if (mod(1,3) ==  0,
row.record.cells["Subject"].
value[rowIndex] + "; " +row.record.cells["Subject"].value[rowIndex+1], value)

mod (1, 3) is 1,so drop that in:
if (1 == 0,
row.record.cells["Subject"].
value[rowIndex] + "; " +row.record.cells["Subject"].value[rowIndex+1], value)

That's false, so go to the false condition, which is:
value

Output value to row 2.

Now, on row 3, rowIndex  = 2, repeat the drop in and execute the false condition.

Second transform walkthrough

if(mod(rowIndex, 3) == 1, row.record.cells["Subject"].value[rowIndex+1], value)

Again, let's start at rowIndex = 0. Drop it in:
if(mod(0,3) == 1, row.record.cells["Subject"].value[0+1], value)

mod(0,3) = 0, so the false condition is executed. The value in row 1 is unchanged and is outputted.

Now, go to rowIndex = 1:
if(mod(1,3) == 1, row.record.cells["Subject"].value[0+1], value)

mod(1,3)  = 1, so the true condition is executed:
row.record.cells["Subject"].value[1+1]

First, the addition in the index field is resolved:
 row.record.cells["Subject"].value[2]

Element 2 is looked up in the array, so History is outputted to row 2.

rowIndex =2:
if(mod(2,3) == 1, row.record.cells["Subject"].value[2+1], value)

mod(2,3) equals 2, so the false condition is executed. Whatever value is in row 2 is unchanged and is outputted.

Converting the three transform statements into one nested if 
These three statements:
 1. if(mod(rowIndex, 3) == 0, row.record.cells["Subject"].value[rowIndex] + "; " +row.record.cells["Subject"].value[rowIndex+1], value)
2.if(mod(rowIndex, 3) == 1, row.record.cells["Subject"].value[rowIndex+1], value)
3. if(mod(rowIndex, 3) == 2, "", value)

Can be boiled down to:
if (modulo operation equals 0, do concat if true, do nothing if false)
if (modulo operation equals 1, copy cell below if true, do nothing if false)
if (modulo operation equal 2, blank out cell if true, do nothing if false)

Since the false condition for the first transform is module operation equalling 1 or 2, you can insert the modulo equals one copy into the false condition. However, since you only want to do the copy if the modulo operation equals 1, you'll need to drop the entire second if statement into the first:

1. if(mod(rowIndex, 3) == 0, row.record.cells["Subject"].value[rowIndex] + "; " +row.record.cells["Subject"].value[rowIndex+1], if(mod(rowIndex, 3) == 1, row.record.cells["Subject"].value[rowIndex+1], value))
2. if(mod(rowIndex, 3) == 2, "", value)

However, since the only remaining value in the modulo operation is 2, and there are no others, you can put the blank out step into the false condition: 

if(mod(rowIndex, 3) == 0, row.record.cells["Subject"].value[rowIndex] + "; " +row.record.cells["Subject"].value[rowIndex+1], if(mod(rowIndex,3)== 1, row.record.cells["Subject"].value[rowIndex+1], ""))

No comments: