Monday, July 31, 2017

Using splitByCharType to extract years of coverage in an 856 field

GREL's splitByCharType() is actually quite a useful function. It takes a string and splits it into an array, grouping consecutive characters of the same type together. The most common types are lowercase letters, uppercase letters, numbers, and spaces. Special characters and punctuation are grouped much more narrowly, only with characters of the same kind (e.g. $$ would be grouped together, but $! would not).

For example, for the string $a300 pages ;$c20 cm, it'll fill the array this way:
Index    Element
0        $
1        a
2        300
3        <space>
4        pages
5        <space>
6        ;
7        $
8        c
9        20
10       <space>
11       cm

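If you want to check the grouping outside OpenRefine, here is a minimal Python sketch of the same idea. It is not GREL itself: split_by_char_type is a made-up helper name, and it assumes that "like characters" means characters sharing a Unicode general category, which matches the behavior shown in the table above.

```python
from itertools import groupby
from unicodedata import category


def split_by_char_type(text):
    """Group consecutive characters that share a Unicode general category.

    Hypothetical stand-in for GREL's splitByCharType(), not the real thing.
    """
    return ["".join(run) for _, run in groupby(text, key=category)]


parts = split_by_char_type("$a300 pages ;$c20 cm")

# Print each element next to its index, like the table above.
for index, element in enumerate(parts):
    print(index, repr(element))
```

Running it on the example string should give the same twelve elements, with each space as its own element.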


This function would have been handy for a project I had. I was generating a spreadsheet of our cancelled e-journals with perpetual access. Each was supposed to have Title, ISSN, and years of coverage.

Title and ISSN were easy. Years of coverage, not so much. I thought about using subfield 3 from the 856 field, but most of the journals were in this format: v.<start vol.>(<start year>)-v.<end vol.>(<end year>)

The problem is that nothing was facetable. The years and the volume numbers varied too much for me to do a simple replace. I was still new to OpenRefine, so I won't go into what I did, but it turns out that my life would have been much simpler if I had only used splitByCharType.

After faceting out the journals that weren't in that format using booleans (this post), I did Edit column -> Add column based on this column and used splitByCharType(value) to see what was inserted into the array:

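To get a feel for what lands in the array for a value in this format, the short Python sketch below lists each element with its index. The sample value is made up for illustration, and the grouping by Unicode general category is only an assumed approximation of splitByCharType(value).

```python
from itertools import groupby
from unicodedata import category

# Hypothetical coverage statement in the v.<vol>(<year>)-v.<vol>(<year>) format.
sample = "v.1(1990)-v.20(2010)"

# Rough stand-in for splitByCharType(value): runs of same-category characters.
elements = ["".join(run) for _, run in groupby(sample, key=category)]

# List every element with its index so the positions are easy to count.
for index, element in enumerate(elements):
    print(f"{index:>2}  {element}")
```

For this sample, the letters, periods, parentheses, numbers, and dash each end up in their own element, thirteen in all.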

As you can see, the years are in separate array elements, as is the dash. As I mentioned in my array post, the way to access the elements from an array-generating function is to write out your entire expression and add [<index you want>] to the end.

So, in my case, the starting year is under index 4, so I would output it to the spreadsheet by using splitByCharType(value)[4]:

I want the dash next, so that's under index 6, or splitByCharType(value)[6]. That can just be concatenated to the starting year like so:

Counting forward, the end year is under index 11, or splitByCharType(value)[11]. All I have to do is add that to the concatenation and I've easily extracted the years:

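Written out in GREL, the full concatenation would look something like splitByCharType(value)[4] + splitByCharType(value)[6] + splitByCharType(value)[11]. The Python sketch below mirrors that logic so the indexes can be checked outside OpenRefine. The coverage_years name and the sample values are made up, and the grouping by Unicode general category is an assumed approximation of splitByCharType.

```python
from itertools import groupby
from unicodedata import category


def coverage_years(value):
    """Return '<start year>-<end year>' from a v.N(YYYY)-v.N(YYYY) string."""
    # Stand-in for splitByCharType(value): runs of same-category characters.
    parts = ["".join(run) for _, run in groupby(value, key=category)]
    # Indexes 4, 6 and 11 hold the start year, the dash and the end year.
    return parts[4] + parts[6] + parts[11]


print(coverage_years("v.1(1990)-v.20(2010)"))  # 1990-2010
print(coverage_years("v.12(2001)-v.30(2015)"))
```

The fixed indexes only line up for values in exactly this format, which is why the rows that didn't match were faceted out first.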