Thursday, July 12, 2012

Convert Outlook 2010 PST to mbox to maildir to Gmail

Officially this is easy: configure Gmail IMAP in the Outlook client and drag and drop. That works great unless you have 10GB of email from over 15 years to transfer. Besides, all of your emails will be to/from "undisclosed recipients" on Gmail. Finding some ancient email is already difficult, and stripping the sender and recipient info makes it next to impossible.

To be fair, the loss of names is an Outlook-side issue. The required info simply isn't present in the PST files once the connection to the Exchange server they were migrated from has been lost.
Step 1. Repair your PST using SCANPST.EXE on Windows. THIS IS IMPORTANT!
Step 2. Convert PST to mbox on Linux with readpst
Step 3. Rename extracted folders to conform to IMAP naming rules
Step 4. Convert mbox to maildir format (one file per email)
Step 5. Hack headers to show at least something sensible for sender / receiver
Step 6. Transfer maildir content to Gmail
Hopefully my notes below can help someone in a similar situation. Ideas, examples, the actual migration script talking IMAP to Google etc. were shamelessly copied from http://www.blorand.org/index.php/Pst2md and http://scott.yang.id.au/2009/01/migrate-emails-maildir-gmail/.

# Install required packages (Ubuntu 12.04)
apt-get -y install readpst convmv mb2md libencode-imaputf7-perl
 
# Create patched copy of convmv
sed -e 's/^use utf8;$/use utf8; use Encode::IMAPUTF7;/g' \
 </usr/bin/convmv \
 >convmv
 
# Download maildir2gmail.py
wget http://svn.fucoder.com/fucoder/maildir2gmail/maildir2gmail.py
 
# Convert PST to mbox (this can take several hours)
mkdir mbox
readpst -r -o mbox email.pst # add -b to ignore RTF formatted email bodies and only keep ASCII version
 
# Fix folder names
./convmv --notest -f utf8 -t IMAP-UTF-7 -r mbox/
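# -depth renames subfolders before their parents (see the reader comment at the end);
# the lookahead limits the substitution to the last path component
find mbox/ -depth -mindepth 1 -type d -name '*.*' -exec rename 's/\.(?=[^\/]*$)/_/g' {} \+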
 
# Convert mbox to maildir
mkdir maildir
mb2md -s $(pwd)/mbox -R -d $(pwd)/maildir    # mb2md requires full path, hence $(pwd) 
 
# Generate fake email addresses to prevent undisclosed-recipients problem
./maildir-findbad.sh maildir/ # this calls maildir-tofromfix.sh for every email

# Allow use of Gmail labels instead of dumping everything in "All Mail" and "Sent Mail" folders
sed -i.bak -e 's/\[Gmail\]\///g' maildir2gmail.py
 
# Create labels in Gmail. I created top-level label called "OLD" and then "Sent" and "1999", 
# "Received" and "2003" and so on underneath it. This matches with structure used on PST file.
 
# Create script with maildir source to Gmail label mappings.
echo 'm2g="./maildir2gmail.py --username=first.last@gmail.com --password=m3g453CR37 --folder="'>go.sh
find maildir -type d -name "*.mbox" -exec echo "\${m2g}\"LABEL/SUBLABEL\" \"{}/cur/\"" \; >>go.tmp
sort -n <go.tmp >>go.sh; rm go.tmp

# Now edit go.sh with your favourite editor and replace LABEL/SUBLABEL with the Gmail label path
# each maildir should be migrated to. Try to keep it pure ASCII; you can rename labels later
# via the Gmail web interface.
# Below is a sample line from the go.sh I used. It tags Lync discussion history with the OLD/Conversations label.
# ${m2g}"OLD/Conversations" "maildir/vanhat.Keskusteluhistoria.mbox/cur/"

# After you're done editing and confident that everything went OK so far, launch go.sh.
# You might want to try with some small maildir first to avoid a big mess in case of errors.
sh ./go.sh
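If hand-editing every LABEL/SUBLABEL in go.sh gets tedious, a first guess at the label can be derived from the maildir folder name. This is only a rough sketch: `label_for` and the default "OLD" prefix are my own illustrative assumptions, and it simply turns the dots in the folder name into label separators, so you would still rename things like Keskusteluhistoria by hand.

```bash
#!/bin/bash
# Derive a Gmail label from a maildir folder path.
# "label_for" is a hypothetical helper; "OLD" mirrors the top-level label used above.
label_for() {
    local path="$1" prefix="${2:-OLD}" name
    name=${path%/}        # drop a trailing slash
    name=${name%/cur}     # drop the maildir "cur" subfolder
    name=${name##*/}      # keep only the last path component
    name=${name%.mbox}    # drop the .mbox suffix
    name=${name#.}        # drop a leading dot, if any
    # Dots separate nesting levels in the folder name; labels use slashes.
    printf '%s/%s\n' "$prefix" "$(printf '%s' "$name" | tr . /)"
}

# Usage:
# label_for "maildir/vanhat.Keskusteluhistoria.mbox/cur/"
```

The output is meant as a starting point for the go.sh lines, not a replacement for reviewing them.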
If you have problems and need to force maildir2gmail to redo everything from the beginning, just remove maildir2gmail.db. Also make sure you remove any unwanted mails from the Gmail "Bin"; otherwise new mails with an identical Message-ID will be ignored.
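Since mails with an identical Message-ID get ignored, it can help to check the maildir for duplicates before uploading. Below is a minimal sketch, assuming GNU grep and that each Message-ID header sits on a single line (folded headers are not handled). `list_duplicate_message_ids` is a made-up helper name.

```bash
#!/bin/bash
# Print every Message-ID that occurs in more than one file under a directory.
# "list_duplicate_message_ids" is a hypothetical helper name.
list_duplicate_message_ids() {
    local dir="$1"
    # -m1: stop at the first match in each file, -h: no file names,
    # -i: header names are case-insensitive.
    grep -r -h -i -m1 '^Message-ID:' "$dir" |
        tr -d '\r' |
        sed -e 's/^[^:]*:[[:space:]]*//' |
        sort |
        uniq -d
}

# Usage:
# list_duplicate_message_ids maildir/
```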
You might also want to find the largest files and potentially delete them. Gmail rejects emails over 25MB, and I had a few 75MB mails (received..) inside my PST file.
find maildir/ -size +5000k -exec du -h {} \+|sort -h
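To list only the messages that would actually hit the limit, something like the following could be used. It is a sketch assuming GNU find (for -printf) and treats 25MB as 25 MiB; `list_oversized` is a hypothetical helper name.

```bash
#!/bin/bash
# List files above a size limit (default 25 MiB), largest first,
# as "size-in-bytes<TAB>path". "list_oversized" is a hypothetical helper name.
list_oversized() {
    local dir="$1" limit_mb="${2:-25}"
    # "-size +Nc" matches files larger than N bytes.
    find "$dir" -type f -size +"$((limit_mb * 1024 * 1024))"c -printf '%s\t%p\n' |
        sort -rn
}

# Usage:
# list_oversized maildir/        # everything above 25 MiB
# list_oversized maildir/ 10     # everything above 10 MiB
```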
Finally, two horrible bash scripts I used to hack the email headers. I know some guru would have come up with a sed one-liner to do all of this, but those guys have big beards and long hair, unlike me.

maildir-findbad.sh
#!/bin/bash
# Check parameters
if [ "$1". = "". ]; then
 echo "Source parameter missing"
 exit 1
elif ! [ -d "$1" ]; then
 echo "Source directory missing"
 exit 1
fi
#grep -l -r -e "^From: .* <MAILER-DAEMON>$" "$1" |\
#seems some bad mails lack MAILER-DAEMON part, therefore just process all
grep -l -r -e "^From: .*" "$1" |\
while IFS= read -r line
do
 echo "$line"
 ./maildir-tofromfix.sh "$line"
done
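The script above patches every message. If you would rather see first which mails are affected, a check like the one below could be used as a dry run. This is a sketch under assumptions: `from_lacks_address` is a hypothetical helper, it expects Unix line endings, and it only looks at the From: header.

```bash
#!/bin/bash
# Succeed (exit 0) when a message has a From: header without any '@' in it.
# "from_lacks_address" is a hypothetical helper, not part of the scripts here.
from_lacks_address() {
    local from
    # Read only the header block: quit at the first blank line,
    # print the first From: line and stop there.
    from=$(sed -n -e '/^$/q' -e '/^From: /{p;q;}' "$1")
    case "$from" in
        ''|*@*) return 1 ;;   # no From: header, or it already has an address
        *)      return 0 ;;
    esac
}

# Usage: print only the messages that would need patching.
# find maildir/ -type f | while IFS= read -r f; do
#     from_lacks_address "$f" && printf '%s\n' "$f"
# done
```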
maildir-tofromfix.sh
#!/bin/bash
# Check parameters
if [ "$1". = "". ]; then
 echo "Source parameter missing"
 exit 1
elif ! [ -e "$1" ]; then
 echo "Source file missing"
 exit 1
fi
# functions
# Sanitize whatever was passed to us so it looks like email
function emailify()
{
 if [[ "$fnin" =~ '@' ]]; then
  # Already email, leave as-is
  fnout="$fnin"
 else
  if [[ "$fnin" =~ ^'=?utf-8?' ]]; then
   # Decode base64 payload of a =?utf-8?B?...?= encoded line
   fnout=$(echo "$fnin"|cut -d'?' -f4|base64 -d)
   # Turn it to regular ascii
   fnout=$(echo "$fnout"|iconv --from-code=UTF-8 --to-code=ISO-8859-1//TRANSLIT)
  else
   # Try to cut senders name
   fnout=$(echo "$fnin"|cut -d\< -f1)
  fi
  # If we ended up with empty string use random numbers
  if [ "$fnout". = "". ]; then
   fnout="$RANDOM.$RANDOM.$RANDOM.$RANDOM"
  fi
  # Generate something that looks like email address to keep Gmail happy
  # (strip quotes, squeeze runs of spaces, trim leading/trailing space)
  fnout1=$(echo "$fnout"|sed -e 's/"//g' -e s/\'//g -e 's/  */ /g' -e 's/ $//g' -e 's/^ //g' )
  fnout=$(echo "$fnout"|iconv --from-code=ISO-8859-1 --to-code=ASCII//TRANSLIT|sed -e 's/[^a-zA-Z0-9._ ]//g' -e 's/  */ /g' -e 's/ $//g' -e 's/^ //g' )
  fnout="\"$fnout1\" <${fnout//[^a-zA-Z0-9._]/}@email.local>"
 fi
}

# main
temp=$(tempfile)
# Merge tab indented lines
sed -e ':a;N;$!ba;s/\n[\x09\x20]/ /g' < "$1" |\
while IFS= read -r line
do
 if [[ "$line". = "". ]]; then
  # Empty line signals end of email headers
  break
 elif [[ "$line" =~ ^'From: ' ]]; then
  # From line needs to be patched
  fnin="${line#From: }"
  emailify # call function
  from="From: $fnout"
  echo "$from" >>"$temp"
 elif [[ "$line" =~ ^'To: ' ]] || [[ "$line" =~ ^'Cc: ' ]]; then
  # To/Cc line needs editing as well
  # Apart from two lines below it's identical for both
  [[ "$line" =~ ^'To: ' ]] && to="To: " && fnin="${line#To: }"
  [[ "$line" =~ ^'Cc: ' ]] && to="Cc: " && fnin="${line#Cc: }"
  # Unescape here instead of in the function, as a single base64 string may contain multiple recipient names
  if [[ "$fnin" =~ ^'=?utf-8?' ]]; then
   # Decode base64 payload of a =?utf-8?B?...?= encoded line
   fnin=$(echo "$fnin"|cut -d'?' -f4|base64 -d)
   # Turn it to regular ascii
   fnin=$(echo "$fnin"|iconv --from-code=UTF-8 --to-code=ISO-8859-1//TRANSLIT)
   #echo $fnin
  fi
  IFS=';' read -ra recipient <<< "$fnin"
  for num in "${recipient[@]}"; do
   fnin="$num"
   emailify # call function
   to="$to$fnout; "
  done
  echo "$to" >>"$temp"
 else
  # Rest of headers can be ignored
  echo "$line" >>"$temp"
 fi
done
# Append message body
grep -A999999999 -e "^$" "$1" >> "$temp"
# Replace old file with patched one
mv -f "$temp" "$1"
#head -20 "$temp"
#rm -f "$temp"
# done
exit 0
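Stripped of the decoding and the temp-file handling, the core trick in emailify is just turning a bare display name into something shaped like an address. Here is a simplified sketch of that idea; `fake_address` is a hypothetical name, and it skips the base64 and iconv steps of the real script, so it only suits plain ASCII names.

```bash
#!/bin/bash
# Turn a bare display name into '"Name" <Name@email.local>'.
# Simplified take on emailify above; "fake_address" is a hypothetical name.
fake_address() {
    local name="$1" display local_part
    # Anything that already contains an address is left untouched.
    case "$name" in
        *@*) printf '%s\n' "$name"; return ;;
    esac
    # Display name: drop quotes, squeeze runs of spaces, trim both ends.
    display=$(printf '%s' "$name" |
        sed -e "s/[\"']//g" -e 's/  */ /g' -e 's/^ //' -e 's/ $//')
    # Local part: keep only characters that are safe in an address.
    local_part=$(printf '%s' "$display" | tr -cd 'a-zA-Z0-9._')
    # Fall back to random numbers when nothing usable is left.
    [ -z "$local_part" ] && local_part="$RANDOM.$RANDOM"
    printf '"%s" <%s@email.local>\n' "$display" "$local_part"
}

# Usage:
# fake_address 'John  Smith '
```

The @email.local domain is the same placeholder the script uses; it only has to look like an address to keep Gmail happy.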

1 comment:

  1. Thanks for the nice article. Just a remark: the find command you use for removing the dots (.) from the folder names has a problem if a directory name contains dots and its subdirectories do too. In that case renaming the subdirectories fails, because the top-level directory is renamed first.

    Add the -depth option to the find command to fix the problem.
