Import Hundreds of Taxonomy Terms using AWK

Published September 16, 2008

Today's challenge: your editors just handed you almost 200 taxonomy terms to add to the site, and you don't have the time or inclination to hit the taxonomy/n/add/term page for the next 2 hours or so... AWK to the rescue!

Doing a simple CSV export of the term_data and term_hierarchy tables, you've got a pretty simple structure:

term_data
tid,vid,name,description,weight

term_hierarchy
tid,parent

What you'll ultimately generate here is a file that stores everything you need to import these terms via a CSV: the term names, the weights, valid new TIDs, and the TID of each term's parent.

2048,#the current value of the sequences for term_data
term,34,Blogs,#a helper line
x,1,Drupal,All about Drupal,-5
x,1,Modules,Ways to extend Drupal,-4
x,1,Themes,Making your install pretty,-3
term,35,News,#a second helper line
x,1,International,,0
x,1,Local,,0
x,1,Hyperlocal,,0

In this file, we've got 3 types of data:

  1. The starting value for sequences
  2. The id of the parent term for the next several rows, starting with the word "term" followed by the TID and the plain English name just to help us get organized
  3. The new terms, with an "x" where the new TIDs will be placed, and the VID, Description and Weight all filled out.

You could use AWK to help you generate this file, if you had a list with all the terms each on their own line, and you didn't care about weights initially. I created mine by hand, as the list of terms was not in alphabetical order, and needed some help on the weights.
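If you did want AWK to build the "x" rows for you, a rough sketch might look like the following. It assumes a hypothetical terms.txt with one term name per line, and it hard-codes a VID of 1, an empty description and a weight of 0 for every term, so you would still add the helper lines and adjust the weights by hand.

```shell
# terms.txt is a hypothetical input file: one term name per line.
printf '%s\n' 'International' 'Local' 'Hyperlocal' > terms.txt

# Emit one "x" row per non-blank line:
# placeholder TID, VID 1 (assumed), name, empty description, weight 0.
awk 'BEGIN {OFS=","}
NF > 0 {print "x", 1, $0, "", 0}' terms.txt > new_terms.txt

cat new_terms.txt
```

Note that a term name containing a comma would break the CSV layout, so this only suits simple names.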

Next you need to generate your AWK script. Two of our three rules print nothing: one stores the sequences value, and the other records the parent TID and skips to the next line:

awk 'BEGIN {FS=","; OFS=",";}
NR == 1 {sequences = $1} # only runs if this is the first record
/^term/ {term = $2; next} # only runs on term helper lines
/^x/ {print ++sequences, $2, $3, $4, $5}' term_data.txt > import.txt

AWK has some pre-defined variables, like FS and OFS for the input and output field separators, and NR for the current record number. NR == 1 is a condition, so its instructions are only executed on line 1, where we grab the sequences value to be used when printing.

The other 2 rules don't apply to line 1. They both check for regex matches at the beginning of the line. The ^term line contains the "next" instruction, which is similar to the "continue" command in other programming languages.
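To see which rule fires on which line, here is a small tracing sketch. The sample.txt file, the trace messages and the "reached" counter are made up for illustration; the extra pattern-less rule at the end only exists to show that "next" stops the remaining rules from running on term lines.

```shell
# A three-line sample in the same layout as the helper file.
printf '%s\n' '2048,#start' 'term,34,Blogs,#helper' 'x,1,Drupal,All about Drupal,-5' > sample.txt

awk 'BEGIN {FS=","}
NR == 1 {print "line " NR ": sequences start at " $1}
/^term/ {print "line " NR ": parent TID " $2; next}
/^x/ {print "line " NR ": new term " $3}
{reached++}
END {print "lines that reached the last rule: " reached}' sample.txt > trace.txt

cat trace.txt
```

Lines 1 and 3 fall through to the last rule, while the term line on line 2 never gets there.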

Your output should look something like this, but about 190 lines longer:

2049,1,Drupal,All about Drupal,-5
2050,1,Modules,Ways to extend Drupal,-4
2051,1,Themes,Making your install pretty,-3
2052,1,International,,0
2053,1,Local,,0
2054,1,Hyperlocal,,0
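
Before importing anything, a quick sanity check can help. This sketch rebuilds a shortened copy of the sample file, runs the script, and compares the number of "x" rows going in with the number of rows coming out. The variable names are just for illustration.

```shell
# Shortened copy of the helper file from above.
cat > term_data.txt <<'EOF'
2048,#the current value of the sequences for term_data
term,34,Blogs,#a helper line
x,1,Drupal,All about Drupal,-5
x,1,Modules,Ways to extend Drupal,-4
term,35,News,#a second helper line
x,1,International,,0
EOF

awk 'BEGIN {FS=","; OFS=","}
NR == 1 {sequences = $1}
/^term/ {next}
/^x/ {print ++sequences, $2, $3, $4, $5}' term_data.txt > import.txt

# Every "x" row should produce exactly one output row,
# and the first new TID should be one above the starting sequence value.
expected=$(grep -c '^x' term_data.txt)
actual=$(awk 'END {print NR}' import.txt)
first_tid=$(awk -F, 'NR == 1 {print $1}' import.txt)
echo "expected rows: $expected, actual rows: $actual, first new TID: $first_tid"
```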

A variation on the ^term line:

/^term/ {weight = -10; next}
/^x/ {print ++sequences, $2, $3, $4, weight++}

Now every time you get a new term parent, the weight is reset to -10, and each term under that parent gets the next weight up. If you have more than 21 terms under one parent, the weights will go above +10, but Drupal still understands weights outside of |10|.
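
Here is the variation dropped into the full script, as a sketch run against a shortened copy of the sample file. The output file name import_weighted.txt is just an example.

```shell
# Shortened copy of the helper file: two parents, two new terms each.
cat > term_data.txt <<'EOF'
2048,#the current value of the sequences for term_data
term,34,Blogs,#a helper line
x,1,Drupal,All about Drupal,-5
x,1,Modules,Ways to extend Drupal,-4
term,35,News,#a second helper line
x,1,International,,0
x,1,Local,,0
EOF

# The weight column from the file ($5) is ignored; the counter is used instead.
awk 'BEGIN {FS=","; OFS=","}
NR == 1 {sequences = $1}
/^term/ {weight = -10; next}
/^x/ {print ++sequences, $2, $3, $4, weight++}' term_data.txt > import_weighted.txt

cat import_weighted.txt
```

With this input, the Blogs children get weights -10 and -9, and the counter starts over at -10 for the News children.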

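The other data you need to generate is the term_hierarchy data. Start with the same base file, because the import.txt you wrote out no longer has the parent data. Here is the same script, changed to pass the term helper lines through, piped into a second AWK command that makes use of the parent data.

awk 'BEGIN {FS=","; OFS=",";}
NR == 1 {sequences = $1} # only runs if this is the first record
/^term/ {print; next} # pass term helper lines through to the second script
/^x/ {print ++sequences, $2, $3, $4, $5}' term_data.txt | \
awk 'BEGIN {FS=","; OFS=",";}
/^term/ {term = $2; next} # remember the parent TID
!/^term/ {print $1, term}' > import_hierarchy.csv

Notice that, unlike the term_data version, the first script here prints the term helper lines instead of silently skipping them; otherwise the second script would never see a parent TID. The second rule in the second script matches lines that don't start with term (!/^term/), because after you print everything out with the first script, the new term lines start with their TID rather than "x". Be careful not to write this as /^[^term]/: that is a bracket expression, and it matches any line whose first character is not t, e, r, or m.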
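
If you'd rather skip the pipe, a single pass can also do the job, since the parent TID is already in a variable by the time each "x" line is read. This is an alternative sketch rather than the piped approach, run against a shortened copy of the sample file.

```shell
# Shortened copy of the helper file: two parents, two new terms each.
cat > term_data.txt <<'EOF'
2048,#the current value of the sequences for term_data
term,34,Blogs,#a helper line
x,1,Drupal,All about Drupal,-5
x,1,Modules,Ways to extend Drupal,-4
term,35,News,#a second helper line
x,1,International,,0
x,1,Local,,0
EOF

# For each new term, print its new TID followed by the most recent parent TID.
awk 'BEGIN {FS=","; OFS=","}
NR == 1 {sequences = $1}
/^term/ {term = $2; next}
/^x/ {print ++sequences, term}' term_data.txt > import_hierarchy.csv

cat import_hierarchy.csv
```

Because both files are generated from the same term_data.txt with the same counter, the TIDs here line up with the ones in import.txt.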

That's all for today's AWK class. There are some more examples of using AWK on Drupal Easy for your enjoyment.
