Say
that from this sample data of a 538 field from video game records,
you're only supposed to remove "System requirements" from
the entries with a video game console:
System
requirements: Nintendo 64
System
requirements: Nintendo 64
System
requirements: Nintendo 64; designed for N64 Rumble Pak
System
requirements: PlayStation 2
System
requirements: PlayStation 2
System
requirements: Windows 95/98; 133MHz Pentium or faster processor;
Windows 95/98; 64MB RAM; Microsoft DirectX 6.0 (included); quad speed
CD-ROM drive; PCI or AGP graphics card; 4MB 3D accelerator for 3D
graphics support (optional); 16-bit sound card; keyboard and mouse;
joystick (optional); gamepad (optional)
System
requirements: PlayStation 2
System
requirements: PlayStation 2, memory card (for PS2) 94 KB
System
requirements: PlayStation 2
System
requirements: Nintendo 64. Designed for N64 Rumble Pak
System
requirements: Nintendo Entertainment System
System
requirements: Windows 98SW/ME/2000/XP; Pentium III 800 MHz or faster;
256 MB RAM; ; 2.5 GB free hard disk space; 32 MB hardware T&L
compatible video card; Windows compatible sound card; DirectX version
9.0c (included) or higher; 8x or faster CD-ROM drive
System requirements: PlayStation 2, memory card (for PS2) 177 KB -----
There's more
data than this, with operating systems descriptions including "PC",
"Mac", "DOS", and "Pentium". The
operating systems are also not appearing right after the "System
requirements" consistently. They're everywhere. Sometimes even
at the beginning of the line.
So whatever
expression you write has to be able to find it anywhere. This is
possible with the if function in Open Refine. An if
expression will test any boolean expression that you write against
your data. Then you can write different expression to execute based
on whether the condition was true or false.
However, this
is complicated by the fact that you're trying to test for 4 things at
once. This is not as difficult as it seems - you just need to do it
in stages.
First, I would
write out in pseudo-code what I want to have happen:
if (the string
has "Windows", or "Mac" or "DOS" or
"Pentium",
True:
Leave it alone,
False:
Remove "System requirements:")
Then I would go
through the following stages to figure out what to do:
Stage
1
I
need to figure out how to write GREL for "I see PC, Mac, etc."
that will spit out a boolean
value for the if expression to test (this is also called a
conditional statement in computer speak). I could do it using a lot
of if statements stacked together (called nested ifs), but that gets
ugly. Since the conditional statement can be anything that yields a
boolean value, I just need to find a string expression that will
evaluate multiple values.
But
which expression should I use? value.contains
looks like a good candidate. It says it takes regex, and you can
write
an
or
in
regex if you use a capturing group.
The
regex would be: (.*PC.*|.*Mac.*|.*Win.*|.*DOS.*|.*Pentium.*).
However, just to make sure it works (and I recommend doing this for
any conditional you're writing), I decided to do an add
column based on this column and
then I plugged in
value.contains(/(.*PC.*|.*Mac.*|.*Win.*|.*DOS.*|.*Pentium.*)/)
to see if it produced
boolean
values.
What
I got was null. And no syntax errors. So I tried a bunch of simpler
regexes and I found out that value.contains doesn't work at
all with regex. There was a workaround listed on the Open Refine
wiki, but then I noticed something on the Open Refine wiki boolean
page.
The
or expression looked promising, but it seemed limited.
However, I played around with it, and the wiki documentation is not
as complete as it should be.
In
the wiki, the example of an or
expression is or(1
< 3, 1 > 7) returns true.
It
appears that or
only works with mathematical expressions to test. This is not the
case. You can stick any
GREL expression
that
will spit out a boolean value as part of the or.
This means that you can use value.contains("string").
Plus, the or expression also has no limit to the number of test conditions as long as they are separated by commas. So, I can do value.contains(“PC”) or value.contains(“Mac”) or value.contains(“Win”) or value.contains(“DOS”) or value.contains(“Pentium”) in one line.
Plus, the or expression also has no limit to the number of test conditions as long as they are separated by commas. So, I can do value.contains(“PC”) or value.contains(“Mac”) or value.contains(“Win”) or value.contains(“DOS”) or value.contains(“Pentium”) in one line.
The
proper syntax is:
or(value.contains("PC"),
value.contains("Mac"), value.contains("Win"),
value.contains("DOS"), value.contains("Pentium"))
This
will yield True for any operating system, and false for a console.
Dropping
it into the pseudocode:
if
( or(value.contains("PC"), value.contains("Mac"),
value.contains("Win"), value.contains("DOS"),
value.contains("Pentium")),
True:
leave it alone,
false:
Remove “System requirements: “)
Stage 2:
The
conditional is now taken care of. So now I have to write up what do
for the removing system requirements part.
It's
value.replace("System requirements: ", "")
So
now your pseudocode is:
if
( or(value.contains("PC"), value.contains("Mac"),
value.contains("Win"), value.contains("DOS"),
value.contains("Pentium")),
True:
leave it alone,
value.replace("System
requirements: ", ""))
Stage 3
The
rest is easy, because "leave it alone" is just tell it to
put value in the column. Or:
if
( or(value.contains("PC"), value.contains("Mac"),
value.contains("Win"), value.contains("DOS"),
value.contains("Pentium")),
value,
value.replace("System
requirements: ", ""))
Remove
all the indents and spacing and you have: if
(or(value.contains("Win"), value.contains("PC"),
value.contains("DOS"), value.contains("Pentium"),
value.contains("Mac")), value, value.replace("System
requirements: ",""))
No comments:
Post a Comment