Convert Outlook 2010 PST to mbox to maildir to Gmail
Officially this is easy: configure Gmail IMAP in the Outlook client and drag and drop. That is great unless you have 10 GB of email from over 15 years to transfer. Besides, all of your emails will be to/from "undisclosed recipients" in Gmail. Finding some ancient email is already difficult, and stripping the sender and recipient info makes it next to impossible.
To be fair, the loss of names is an Outlook-side issue. The required info simply isn't present in PST files once the connection to the Exchange server they were migrated from has been lost.
Step 1. Repair your PST using SCANPST.EXE on Windows. THIS IS IMPORTANT!
Step 2. Convert PST to mbox on Linux with readpst
Step 3. Rename extracted folders to conform to IMAP naming rules
Step 4. Convert mbox to maildir format (one file per email)
Step 5. Hack headers to show at least something sensible for sender / receiver
Step 6. Transfer maildir content to Gmail
Hopefully my notes below can help someone in a similar situation. The ideas, the examples, the actual migration script talking IMAP to Google etc. were shamelessly copied from http://www.blorand.org/index.php/Pst2md and http://scott.yang.id.au/2009/01/migrate-emails-maildir-gmail/.
# Install required packages (Ubuntu 12.04)
apt-get -y install readpst convmv mb2md libencode-imaputf7-perl

# Create patched copy of convmv
sed -e 's/^use utf8;$/use utf8; use Encode::IMAPUTF7;/g' \
  </usr/bin/convmv \
  >convmv
# The copy is created without the execute bit, so set it
chmod +x convmv

# Download maildir2gmail.py
wget http://svn.fucoder.com/fucoder/maildir2gmail/maildir2gmail.py
chmod +x maildir2gmail.py

# Convert PST to mbox (this can take several hours)
mkdir mbox
readpst -r -o mbox email.pst
# add -b to ignore RTF formatted email bodies and only keep ASCII version

# Fix folder names
./convmv --notest -f utf8 -t IMAP-UTF-7 -r mbox/
find mbox/ -mindepth 1 -type d -name '*.*' -exec rename "s/\./_/g" {} \+

# Convert mbox to maildir
mkdir maildir
mb2md -s $(pwd)/mbox -R -d $(pwd)/maildir
# mb2md requires full path, hence $(pwd)

# Generate fake email addresses to prevent undisclosed-recipients problem
./maildir-findbad.sh maildir/
# this calls maildir-tofromfix.sh for every email

# Allow use of Gmail labels instead of dumping everything in "All Mail" and "Sent Mail" folders
sed -i.bak -e 's/\[Gmail\]\///g' maildir2gmail.py

# Create labels in Gmail. I created top-level label called "OLD" and then "Sent" and "1999",
# "Received" and "2003" and so on underneath it. This matches the structure used in the PST file.

# Create script with maildir source to Gmail label mappings.
echo 'm2g="./maildir2gmail.py --username=first.last@gmail.com --password=m3g453CR37 --folder="'>go.sh
find maildir -type d -name "*.mbox" -exec echo "\${m2g}\"LABEL/SUBLABEL\" \"{}/cur/\"" \; >>go.tmp
sort -n <go.tmp >>go.sh; rm go.tmp

# Now edit go.sh with your favourite editor and replace LABEL/SUBLABEL with the Gmail path that
# particular maildir should be migrated to. Try to keep it pure ASCII, you can rename labels later
# on via the Gmail web interface.
# Below is a sample line from the go.sh I used. It tags Lync discussion history with the OLD/Conversations label.
# ${m2g}"OLD/Conversations" "maildir/vanhat.Keskusteluhistoria.mbox/cur/"

# After you're done editing and confident that everything went OK so far, launch go.sh
# You might want to try with some small maildir first to avoid a big mess in case of error
sh ./go.sh

If you have problems and need to force maildir2gmail to start over from the beginning, just remove maildir2gmail.db. Also make sure you remove any unwanted mails from the Gmail "Bin"; otherwise new mails with an identical Message-ID will be ignored.
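Before launching go.sh it can be worth checking that the mbox to maildir step did not lose anything. Below is a minimal sketch, not part of the original notes: the helper names count_mbox and count_maildir are made up for illustration, and it assumes GNU grep and find. The mbox count is only approximate, since any unescaped body line starting with "From " is counted as a message too.

```shell
#!/bin/sh
# Sketch: compare message counts before and after the mbox -> maildir step.
# count_mbox and count_maildir are hypothetical helper names.

# Count mbox messages by their "From " separator lines.
# Approximate: an unescaped body line starting with "From " is counted too.
count_mbox() {
    grep -r -a -h '^From ' "$1" | wc -l | tr -d ' '
}

# Count maildir messages: every regular file under a cur/ or new/ directory.
count_maildir() {
    find "$1" -type f \( -path '*/cur/*' -o -path '*/new/*' \) | wc -l | tr -d ' '
}

# Usage, with the directory names from the commands above:
#   echo "mbox: $(count_mbox mbox)  maildir: $(count_maildir maildir)"
```

If the two numbers differ by a lot, it is probably easier to investigate now than after thousands of mails have been pushed to Gmail.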
You might also want to find the largest files and potentially delete them. Gmail will reject emails over 25 MB, and I had at least a few 75 MB mails (received..) inside my PST file.

# sort -h orders the human-readable sizes printed by du -h correctly
find maildir/ -size +5000k -exec du -h {} \+ | sort -h

Finally, here are the two horrible bash scripts I used to hack the email headers. I know some guru would have come up with a sed one-liner to do all this, but those guys have big beards and long hair, unlike me.
maildir-findbad.sh
#!/bin/bash

# Check parameters
if [ "$1". = "". ]; then
  echo "Source parameter missing"
  exit 1
elif ! [ -d "$1" ]; then
  echo "Source directory missing"
  exit 1
fi

#grep -l -r -e "^From: .* <MAILER-DAEMON>$" "$1" |\
#seems some bad mails lack MAILER-DAEMON part, therefore just process all
grep -l -r -e "^From: .*" "$1" |\
while IFS= read -r line
do
  echo "$line"
  ./maildir-tofromfix.sh "$line"
done

maildir-tofromfix.sh
#!/bin/bash

# Check parameters
if [ "$1". = "". ]; then
  echo "Source parameter missing"
  exit 1
elif ! [ -e "$1" ]; then
  echo "Source file missing"
  exit 1
fi

# functions

# Sanitize whatever was passed to us so it looks like email
# Input is read from $fnin, result is returned in $fnout
function emailify() {
  if [[ "$fnin" =~ '@' ]]; then
    # Already email, leave as-is
    fnout="$fnin"
  else
    if [[ "$fnin" =~ ^'=?utf-8?' ]]; then
      # Decode UTF-8 encoded lines
      fnout=$(echo "$fnin"|cut -d? -f4|base64 -d)
      # Turn it to regular ascii
      fnout=$(echo "$fnout"|iconv --from-code=UTF-8 --to-code=ISO-8859-1//TRANSLIT)
    else
      # Try to cut senders name
      fnout=$(echo "$fnin"|cut -d\< -f1)
    fi
    # If we ended up with empty string use random numbers
    if [ "$fnout". = "". ]; then
      fnout="$RANDOM.$RANDOM.$RANDOM.$RANDOM"
    fi
    # Generate something that looks like email address to keep Gmail happy
    # (strip quotes, collapse repeated spaces, trim both ends)
    fnout1=$(echo "$fnout"|sed -e 's/"//g' -e s/\'//g -e 's/ \{2,\}/ /g' -e 's/ $//g' -e 's/^ //g' )
    fnout=$(echo "$fnout"|iconv --from-code=ISO-8859-1 --to-code=ASCII//TRANSLIT|sed -e 's/[^a-zA-Z0-9._ ]//g' -e 's/ \{2,\}/ /g' -e 's/ $//g' -e 's/^ //g' )
    fnout="\"$fnout1\" <${fnout//[^a-zA-Z0-9._]/}@email.local>"
  fi
}

# main
temp=$(mktemp)

# Merge tab indented lines
sed -e ':a;N;$!ba;s/\n[\x09\x20]/ /g' < "$1" |\
while read -r line
do
  if [[ "$line". = "". ]]; then
    # Empty line signals end of email headers
    break
  elif [[ "$line" =~ ^'From: ' ]]; then
    # From line needs to be patched
    fnin="${line#From: }"
    emailify # call function
    from="From: $fnout"
    echo "$from" >>"$temp"
  elif [[ "$line" =~ ^'To: ' ]] || [[ "$line" =~ ^'Cc: ' ]]; then
    # To/Cc line needs editing as well
    # Apart from two lines below it's identical for both
    [[ "$line" =~ ^'To: ' ]] && to="To: " && fnin="${line#To: }"
    [[ "$line" =~ ^'Cc: ' ]] && to="Cc: " && fnin="${line#Cc: }"
    # Unescape here instead of in the function, as a single base64 string may contain multiple recipient names
    if [[ "$fnin" =~ ^'=?utf-8?' ]]; then
      # Decode UTF-8 encoded lines
      fnin=$(echo "$fnin"|cut -d? -f4|base64 -d)
      # Turn it to regular ascii
      fnin=$(echo "$fnin"|iconv --from-code=UTF-8 --to-code=ISO-8859-1//TRANSLIT)
      #echo $fnin
    fi
    IFS=';' read -ra recipient <<< "$fnin"
    for num in "${recipient[@]}"; do
      fnin="$num"
      emailify # call function
      to="$to$fnout; "
    done
    echo "$to" >>"$temp"
  else
    # Rest of headers can be ignored
    echo "$line" >>"$temp"
  fi
done

# Append message body
grep -A999999999 -e "^$" "$1" >> "$temp"

# Replace old file with patched one
mv -f "$temp" "$1"
#head -20 "$temp"
#rm -f "$temp"

# done
exit 0
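To make the header hack easier to follow, here is a small standalone sketch of its two core moves: pulling the base64 payload out of an encoded header word, and turning a display name into a fake address. The function names decode_b64_word and fake_address are made up for this illustration and the sample name is invented. Like the script above, the sketch only handles base64 ("B") encoded words, not quoted-printable ("Q") ones.

```shell
#!/bin/sh
# Sketch of the two core steps of the header fix.
# decode_b64_word and fake_address are hypothetical names.

# An encoded word looks like =?charset?B?payload?= so the payload
# is the 4th field when splitting on "?".
decode_b64_word() {
    printf '%s\n' "$1" | cut -d'?' -f4 | base64 -d
}

# Build "Display Name" <DisplayName@email.local> from a plain name:
# everything except letters, digits, dot and underscore is dropped
# from the address part.
fake_address() {
    local_part=$(printf '%s' "$1" | tr -cd 'a-zA-Z0-9._')
    printf '"%s" <%s@email.local>\n' "$1" "$local_part"
}

name=$(decode_b64_word '=?utf-8?B?SmFuZSBEb2U=?=')
fake_address "$name"
# prints: "Jane Doe" <JaneDoe@email.local>
```

The @email.local domain is the same placeholder the script uses; the address is never meant to be deliverable, it only gives Gmail something to show and search on.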
Thanks for the nice article. Just a remark: the find command you use for removing the dots (.) from the folder names has a problem if a directory name contains dots and its subdirectories do too. In that case renaming the subdirectories fails, because the top-level directory is renamed first.
Add the -depth option to the find command to fix the problem.
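One possible way to apply this suggestion is sketched below. It is an illustration rather than a tested replacement for the original command: rename_dots is a made-up helper name, and it uses mv on the last path component only instead of rename, so that dots in a parent directory's name never end up in the target path. It assumes directory names contain no newlines.

```shell
#!/bin/sh
# Sketch: replace dots with underscores in directory names, children first.
# rename_dots is a hypothetical helper name.
rename_dots() {
    # -depth lists the contents of a directory before the directory itself,
    # so a parent is still at its old path while its children are renamed.
    find "$1" -mindepth 1 -depth -type d -name '*.*' |
    while IFS= read -r dir; do
        parent=$(dirname "$dir")
        base=$(basename "$dir")
        new=$(printf '%s' "$base" | tr '.' '_')
        mv "$dir" "$parent/$new"
    done
}

# Usage, with the directory name from the article:
#   rename_dots mbox
```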