Monday, July 17, 2017

Walkthrough on writing a complex if function

Say that from this sample data of a 538 field from video game records, you're only supposed to remove "System requirements" from the entries with a video game console:

System requirements: Nintendo 64
System requirements: Nintendo 64
System requirements: Nintendo 64; designed for N64 Rumble Pak
System requirements: PlayStation 2
System requirements: PlayStation 2
System requirements: Windows 95/98; 133MHz Pentium or faster processor; Windows 95/98; 64MB RAM; Microsoft DirectX 6.0 (included); quad speed CD-ROM drive; PCI or AGP graphics card; 4MB 3D accelerator for 3D graphics support (optional); 16-bit sound card; keyboard and mouse; joystick (optional); gamepad (optional)
System requirements: PlayStation 2
System requirements: PlayStation 2, memory card (for PS2) 94 KB
System requirements: PlayStation 2
System requirements: Nintendo 64. Designed for N64 Rumble Pak
System requirements: Nintendo Entertainment System
System requirements: Windows 98SW/ME/2000/XP; Pentium III 800 MHz or faster; 256 MB RAM; ; 2.5 GB free hard disk space; 32 MB hardware T&L compatible video card; Windows compatible sound card; DirectX version 9.0c (included) or higher;  8x or faster CD-ROM drive
System requirements: PlayStation 2, memory card (for PS2) 177 KB

-----
There's more data than this, with operating systems descriptions including "PC", "Mac", "DOS", and "Pentium". The operating systems are also not appearing right after the "System requirements" consistently. They're everywhere. Sometimes even at the beginning of the line.

So whatever expression you write has to be able to find it anywhere. This is possible with the if function in Open Refine. An if expression will test any boolean expression that you write against your data. Then you can write different expression to execute based on whether the condition was true or false.

However, this is complicated by the fact that you're trying to test for 4 things at once. This is not as difficult as it seems - you just need to do it in stages.

First, I would write out in pseudo-code what I want to have happen:
if (the string has "Windows", or "Mac" or "DOS" or "Pentium",
      True: Leave it alone,
      False: Remove "System requirements:")

Then I would go through the following stages to figure out what to do:

Stage 1
I need to figure out how to write GREL for "I see PC, Mac, etc." that will spit out a boolean value for the if expression to test (this is also called a conditional statement in computer speak). I could do it using a lot of if statements stacked together (called nested ifs), but that gets ugly. Since the conditional statement can be anything that yields a boolean value, I just need to find a string expression that will evaluate multiple values.

But which expression should I use? value.contains looks like a good candidate. It says it takes regex, and you can write an or in regex if you use a capturing group.

The regex would be: (.*PC.*|.*Mac.*|.*Win.*|.*DOS.*|.*Pentium.*)

However, just to make sure it works (and I recommend doing this for any conditional you're writing), I decided to do an add column based on this column and then I plugged in
value.contains(/(.*PC.*|.*Mac.*|.*Win.*|.*DOS.*|.*Pentium.*)/) to see if it produced
boolean values.

What I got was null. And no syntax errors. So I tried a bunch of simpler regexes and I found out that value.contains doesn't work at all with regex. There was a workaround listed on the Open Refine wiki, but then I noticed something on the Open Refine wiki boolean page.

The or expression looked promising, but it seemed limited. However, I played around with it, and the wiki documentation is not as complete as it should be.

In the wiki, the example of an or expression is or(1 < 3, 1 > 7) returns true.

It appears that or only works with mathematical expressions to test. This is not the case. You can stick any GREL expression that will spit out a boolean value as part of the or. This means that you can use value.contains("string"). 

Plus, the or expression also has no limit to the number of test conditions as long as they are separated by commas. So, I can do value.contains(“PC”) or value.contains(“Mac”) or value.contains(“Win”) or value.contains(“DOS”) or value.contains(“Pentium”) in one line.

The proper syntax is:
or(value.contains("PC"), value.contains("Mac"), value.contains("Win"), value.contains("DOS"), value.contains("Pentium"))

This will yield True for any operating system, and false for a console.

Dropping it into the pseudocode:
if ( or(value.contains("PC"), value.contains("Mac"), value.contains("Win"), value.contains("DOS"), value.contains("Pentium")),
          True: leave it alone,
          false: Remove “System requirements: “)



Stage 2:
The conditional is now taken care of. So now I have to write up what do for the removing system requirements part.

It's value.replace("System requirements: ", "")
So now your pseudocode is:
if ( or(value.contains("PC"), value.contains("Mac"), value.contains("Win"), value.contains("DOS"), value.contains("Pentium")),
        True: leave it alone,
        value.replace("System requirements: ", ""))




Stage 3
The rest is easy, because "leave it alone" is just tell it to put value in the column. Or:
if ( or(value.contains("PC"), value.contains("Mac"), value.contains("Win"), value.contains("DOS"), value.contains("Pentium")),
         value,
         value.replace("System requirements: ", ""))


Remove all the indents and spacing and you have: if (or(value.contains("Win"), value.contains("PC"), value.contains("DOS"), value.contains("Pentium"), value.contains("Mac")), value, value.replace("System requirements: ",""))

No comments: