Simple STL String Tokenizer Function

January 10, 2005

This function takes an STL string and a string of delimiter characters, and returns a vector of tokens.

#include <string>
#include <vector>
using namespace std;
 
vector<string> tokenize(const string& str, const string& delimiters)
{
    vector<string> tokens;

    // Skip delimiters at beginning.
    string::size_type lastPos = str.find_first_not_of(delimiters, 0);

    // Find first "non-delimiter".
    string::size_type pos = str.find_first_of(delimiters, lastPos);

    while (string::npos != pos || string::npos != lastPos)
    {
        // Found a token, add it to the vector.
        tokens.push_back(str.substr(lastPos, pos - lastPos));

        // Skip delimiters.  Note the "not_of".
        lastPos = str.find_first_not_of(delimiters, pos);

        // Find next "non-delimiter".
        pos = str.find_first_of(delimiters, lastPos);
    }

    return tokens;
}

This is a variation of the function listed in the C++ Programming HOW-TO.
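For example, a minimal sketch of how it might be called (the sample input and printed output are just illustrative):

#include <iostream>

int main()
{
    // Split on either ',' or ';'; consecutive delimiters are skipped,
    // so no empty tokens are produced.
    vector<string> words = tokenize("a,b,,c", ",;");

    for (size_t i = 0; i < words.size(); ++i)
        cout << words[i] << endl;   // prints "a", "b", "c" on separate lines

    return 0;
}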

Eric Hu posted the following update to retain empty fields between delimiters. Some of the comments below report that this version is buggy, so also see Eli's alternative below:

vector<string> tokenize(const string& str, const string& delimiters)
{
    vector<string> tokens;

    string::size_type lastPos = 0, pos = 0;
    int count = 0;

    if (str.length() < 1) return tokens;

    // Skip delimiters at beginning.
    lastPos = str.find_first_not_of(delimiters, 0);

    if ((str.substr(0, lastPos - pos).length()) > 0)
    {
        count = str.substr(0, lastPos - pos).length();

        for (int i = 0; i < count; i++)
            tokens.push_back("");

        if (string::npos == lastPos)
            tokens.push_back("");
    }

    // Find first "non-delimiter".
    pos = str.find_first_of(delimiters, lastPos);

    while (string::npos != pos || string::npos != lastPos)
    {
        // Found a token, add it to the vector.
        tokens.push_back(str.substr(lastPos, pos - lastPos));

        // Skip delimiters.  Note the "not_of".
        lastPos = str.find_first_not_of(delimiters, pos);

        if ((string::npos != pos) && (str.substr(pos, lastPos - pos).length() > 1))
        {
            count = str.substr(pos, lastPos - pos).length();

            for (int i = 0; i < count; i++)
                tokens.push_back("");
        }

        pos = str.find_first_of(delimiters, lastPos);
    }

    return tokens;
}
Here is Eli's alternative to Eric's implementation:

vector<string> Tokenize(const string& str, const string& delimiters)
{
    vector<string> tokens;
    string::size_type delimPos = 0, tokenPos = 0, pos = 0;

    if (str.length() < 1) return tokens;

    while (1) {
        delimPos = str.find_first_of(delimiters, pos);
        tokenPos = str.find_first_not_of(delimiters, pos);

        if (string::npos != delimPos) {
            if (string::npos != tokenPos) {
                if (tokenPos < delimPos) {
                    tokens.push_back(str.substr(pos, delimPos - pos));
                } else {
                    tokens.push_back("");
                }
            } else {
                tokens.push_back("");
            }
            pos = delimPos + 1;
        } else {
            if (string::npos != tokenPos) {
                tokens.push_back(str.substr(pos));
            } else {
                tokens.push_back("");
            }
            break;
        }
    }
    return tokens;
}
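For example, a quick sketch of the behaviour this version is meant to have (the input is just illustrative):

vector<string> fields = Tokenize("one,,two", ",");
// fields now holds "one", "", "two" -- the empty field between
// the consecutive commas is kept as an empty string.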

12 Comments

Comment February 15, 2005 by anonymous
very nice thanks :)
Comment November 28, 2005 by anonymous
thanks, really helpful =)
Comment December 22, 2005 by Mgk
thanks, it's really great!
Comment February 8, 2006 by anonymous
thanks!
Comment March 14, 2006 by j. ilski
thank you!
Comment May 17, 2006 by Ross MacGregor
Here is an alternative to listing two. I wrote it myself after examining the verbose listing above.

void tokenize(
    std::string const & input,
    std::string const & delimiters,
    std::vector<std::string> & tokens)
{
    using namespace std;
    string::size_type last_pos = 0;
    string::size_type pos = 0;
    while (true)
    {
        pos = input.find_first_of(delimiters, last_pos);
        if (pos == string::npos)
        {
            tokens.push_back(input.substr(last_pos));
            break;
        }
        else
        {
            tokens.push_back(input.substr(last_pos, pos - last_pos));
            last_pos = pos + 1;
        }
    }
}
Comment May 31, 2006 by
The top tokenizer code on this page allocates the tokens vector from the stack, then uses it as the return value. Therefore the return will be garbage. Amateur error.
Comment May 31, 2006 by digitalpeer
Anonymous, you are entirely incorrect. When the vector of strings is returned, a copy is made. What you said would be true if it were a pointer to a stack address, but it simply is not the case.
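A minimal sketch of this point (the function name is just for illustration):

vector<string> makeTokens()
{
    vector<string> local;        // local object on the stack
    local.push_back("hello");
    return local;                // returned by value: the caller gets a copy
}                                // of the vector, not a pointer into this frame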
Comment November 13, 2006 by Holger
Hi there, there is something which I don't understand. When I have a string with tabs as separators and use "\t" as the delimiter argument, the routine doesn't work as I would expect: for a line looking like a\tb\tc\td the tokens vector only contains "a" instead of "a", "b", "c", "d". The whole thing works for "," as delimiter and an input a,b,c,d. Is there something that I misunderstood about escaping here? Holger
Comment July 4, 2008 by Henry Liu
A small bug was found in Eric Hu's version. When the input is

one,,two,three,four,five

we expect to get [one] [] [two] [three] [four] [five]. In fact, the following vector is returned: [one] [] [] [two] [three] [four] [five].

Here is an updated version:

vector<string> tokenize(const string& str, const string& delimiters)
{
    string client = str;
    vector<string> result;

    while (!client.empty())
    {
        string::size_type dPos = client.find_first_of(delimiters);

        if (dPos == 0) { // head is a delimiter
            client = client.substr(delimiters.length()); // remove leading delimiter
            result.push_back("");
        } else { // head is a real token
            string::size_type dPos = client.find_first_of(delimiters);
            string element = client.substr(0, dPos);
            result.push_back(element);

            if (dPos == string::npos) { // token is the last element, no more delimiters
                return result;
            } else {
                client = client.substr(dPos + delimiters.length());
            }
        }
    }

    if (client.empty()) { // last element is a delimiter
        result.push_back("");
    }

    return result;
}
Comment September 8, 2008 by Pix
Thanx Henry for the fix because you are right, the version of Eric is buggy!
Comment January 14, 2009 by Ross MacGregor
I noticed there is a small typo in my original posting. Here is an updated version that supports string or wstring using a template.

template<typename T>
void tokenize(
    T const & input,
    T const & delimiters,
    std::vector<T> & tokens)
{
    using namespace std;
    typename T::size_type last_pos = 0;
    typename T::size_type pos = 0;
    while (true)
    {
        pos = input.find_first_of(delimiters, last_pos);
        if (pos == T::npos)
        {
            tokens.push_back(input.substr(last_pos));
            break;
        }
        else
        {
            tokens.push_back(input.substr(last_pos, pos - last_pos));
            last_pos = pos + 1;
        }
    }
}
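For example, a rough sketch of calling the template version with std::wstring (the sample strings are just illustrative):

std::wstring line = L"one two three";
std::wstring delims = L" ";
std::vector<std::wstring> parts;

tokenize(line, delims, parts);   // T is deduced as std::wstring
// parts now holds L"one", L"two", L"three"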